Opened 12 years ago

Closed 12 years ago

Last modified 6 years ago

#18676 closed Cleanup/optimization (fixed)

Django should do m2m deletes in a single query when possible

Reported by: Jeremy Dunck Owned by: nobody
Component: Database layer (models, ORM) Version: 1.4
Severity: Normal Keywords:
Cc: django@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

https://groups.google.com/forum/?fromgroups#!topic/django-developers/zoS8gAyZWjI

As Anssi notes there:

To me it seems there is no reason to do two queries in delete if the
following conditions hold:
  - there are no signals sent for the deleted objects
  - the delete does not cascade further (actually you don't need to
collect even in this case - you can do the cascade query completely in
the DB without fetching the IDs first).

I think a good first-cut is auto-created intermediate table and no signals attached for that model.

Change History (14)

comment:1 by Preston Holmes, 12 years ago

Triage Stage: UnreviewedAccepted

This seems entirely reasonable as a goal, I'm accepting and further discussion can happen around any patch submitted.

comment:2 by Anssi Kääriäinen, 12 years ago

Has patch: set

I am going to hijack this ticket for "fast-path" deletion. This is in effect for any model that:

  • does not have signals attached for pre/post delete (or m2m_changed, if the model happens to be m2m model).
  • does not lead to further cascades (all cascades except DO_NOTHING prevent the fast path).
  • do not have parents (this is a special case of reverse-cascade).

I have a patch at https://github.com/akaariai/django/tree/fast_delete for this. I think it is ready for 1.5 inclusion.

The patch is in two commits:

If nothing surprising arises, I will push this into 1.5 before feature freeze.

comment:3 by Anssi Kääriäinen, 12 years ago

Some bugs optimizations done in the above branch. I know there is still more to do in DeleteQuery.delete_qs() - at least the logic for primary key table alias selection isn't valid - there is no guarantee that the get_initial_alias() contains the pk field.

I did some little benchmarking, the models:

class Place:
    name = TextField

class AA:
    place = ForeignKey(Place, null=True)

The tests are: create 0, 1, 10, or 100 AA and delete with AA.objects.all().delete(), and similarly 0, 1, 10 or 100 AA objects, this time deleted with AA.objects.filter(place__name='p').delete(). The first test hits the single-table delete in delete_qs, the second case hits the joined delete in delete_qs.

The speed results for PostgreSQL are as follows:

testmasterfast_delete
all, 0.203827.155007
all, 1..314117.157289
all, 10.371872.153490
all, 100.878677.203811
filter, 0.476457.545436
filter, 1.468119.503444
filter, 10.569284.503499
filter, 1001.356935.693341

So, on all but joined filter with 0 or 1 deleted object the patch is faster. There is some variance in the results, as can be seen from the .545 result for the 0 object filter case. That result is usually faster than the case with one object.

The reason for the speed difference is that the delete query with an inner join is a little slow for the 0 objects case, and there is some extra overhead from query.clone() in the inner join case.

The relative speeds are similar on in-memory SQLite.

The above test-case is favouring the old behaviour of fetching all the objs into memory, as there isn't actually any data. For large deletes we have a clear win. For small deletes we have minor improvement in the non-joined case, and a minor loss for the joined case. For huge deletes it is easy to construct arbitrarily large differences in speed.

comment:4 by Anssi Kääriäinen, 12 years ago

Triage Stage: AcceptedReady for checkin

I have updated the patch in fast_delete branch. I think this one is ready for commit.

I will be doing a squash merge with just two commits, one for the has_signals part, another for the rest of the patch.

Last chance to review!

comment:5 by Collin Anderson, 12 years ago

Here's the actual diff for your branch, btw. I had a hard time viewing without a pull request.

https://github.com/akaariai/django/compare/1360bd418...fast_delete

comment:6 by Anssi Kääriäinen, 12 years ago

Pull request no. 400 created: https://github.com/django/django/pull/400

comment:7 by Anssi Kääriäinen <akaariai@…>, 12 years ago

Resolution: fixed
Status: newclosed

In 1cd6e04cd4f768bcd4385b75de433d497d938f82:

Fixed #18676 -- Allow fast-path deletion of objects

Objects can be fast-path deleted if there are no signals, and there are
no further cascades. If fast-path is taken, the objects do not need to
be loaded into memory before deletion.

Thanks to Jeremy Dunck, Simon Charette and Alex Gaynor for reviewing
the patch.

comment:8 by Daniel Swarbrick, 12 years ago

After doing a git bisect, I discovered that the commit 1cd6e04cd4f768bcd4385b75de433d497d938f82 caused a regression in an existing project of mine, that uses a custom object manager. The setup is as follows (simplified - other models also have ForeignKey back to OrgUnit model):

class OrgUnit(models.Model):
    name = models.CharField(max_length=64, unique=True)
    groups = models.ManyToManyField(Group, blank=True)
    users = models.ManyToManyField(User, blank=True)

class LoginManager(models.Manager):

    def restricted(self, user):
        q_group = Q(orgunit__groups__user=user)
        q_user = Q(orgunit__users=user)
        return self.filter(q_group | q_user).distinct()

class Login(models.Model):
    description = models.CharField(max_length=32)
    orgunit = models.ForeignKey(OrgUnit)
    private_data = models.TextField(editable=False)

    objects = LoginManager()

My view POSTs a list of ids to a view that deletes Login objects, and it throws a DB exception "subquery has too many columns". I can see quite clearly from the generated SQL that the sub-select is including unnecessary fields for the "DELETE FROM foo WHERE id in (SELECT id, description FROM login ... )". I don't understand why the sub-select is including the Login.description field :-?

The following is the code in the view that triggers the exception:

id_list = map(int, request.POST.getlist('_id_list'))
Login.objects.restricted(request.user).filter(id__in=id_list).delete()

What I've found is that if I modify the LoginManager.restricted() to either not filter() (undesirable) or not .distinct() then the bug does not occur.

This code is a few years old now, and I don't remember why exactly I put a .distinct() in there. It doesn't seem to cause any undesirable side effects if I remove it, but I'd like to try to understand why this Django commit has caused this regression in the first place, as it may affect other people.

comment:9 by Anssi Kääriäinen, 12 years ago

I will take a look. I wonder why just one field is added to the subquery, I would guess that either all of them would be added or none would be added.

comment:10 by Anssi Kääriäinen, 12 years ago

Can't reproduce your example, this is what I tried: https://github.com/akaariai/django/commit/71064d04a0b1a834abe56e73e3c04e6687281748 - no errors on any DB.

However I spotted problems if the subquery has extra_select, select_related or aggregates in it and am working towards a fix for that. Could this be the reason for the problem?

comment:11 by Anssi Kääriäinen, 12 years ago

I created #19102 for the problem with extra, select_related and aggregates.

comment:12 by Daniel Swarbrick, 12 years ago

After trawling through my commit history, I still can't figure out why I added the distinct() to my custom model manager. Therefor, I've decided to drop the distinct(). I'm not noticing any unwelcome side effects.

As for not being able to reproduce the problem with my model schema above, I suppose it could be any number of other factors in my project, which I can't think of off the top of my head... perhaps it's because I have multiple models linking back to the OrgUnit model.

I think there's a good chance that the symptoms described in #19102 are related, and perhaps when they are resolved, then so will this ticket be. Certainly the unwanted appearance of the second field in the sub-select matches exactly the symptoms I was seeing with the distinct() in my model manager.

comment:13 by Daniel Swarbrick, 12 years ago

Cc: django@… added

comment:14 by Mariusz Felisiak <felisiak.mariusz@…>, 6 years ago

In 26c4be2e:

Refs #18676 -- Enabled fast-delete for m2m_changed senders.

There's no reason to disable fast-delete when an intermediary
many-to-many model has connected m2m_changed receivers because the
signal is only sent when related manager's clear() and remove() methods
are directly called.

This must have been overlooked in 1cd6e04cd4f768bcd4385b75de433d497d938f82
given no regression tests fail when m2m_changed is not taken into
consideration to determine if fast-delete can be enabled.

Note: See TracTickets for help on using tickets.
Back to Top