Code

Opened 2 years ago

Closed 22 months ago

Last modified 21 months ago

#18676 closed Cleanup/optimization (fixed)

Django should do m2m deletes in a single query when possible

Reported by: jdunck Owned by: nobody
Component: Database layer (models, ORM) Version: 1.4
Severity: Normal Keywords:
Cc: django@… Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

https://groups.google.com/forum/?fromgroups#!topic/django-developers/zoS8gAyZWjI

As Anssi notes there:

To me it seems there is no reason to do two queries in delete if the
following conditions hold:
  - there are no signals sent for the deleted objects
  - the delete does not cascade further (actually you don't need to
collect even in this case - you can do the cascade query completely in
the DB without fetching the IDs first).

I think a good first-cut is auto-created intermediate table and no signals attached for that model.

Attachments (0)

Change History (13)

comment:1 Changed 2 years ago by ptone

  • Triage Stage changed from Unreviewed to Accepted

This seems entirely reasonable as a goal, I'm accepting and further discussion can happen around any patch submitted.

comment:2 Changed 22 months ago by akaariai

  • Has patch set

I am going to hijack this ticket for "fast-path" deletion. This is in effect for any model that:

  • does not have signals attached for pre/post delete (or m2m_changed, if the model happens to be m2m model).
  • does not lead to further cascades (all cascades except DO_NOTHING prevent the fast path).
  • do not have parents (this is a special case of reverse-cascade).

I have a patch at https://github.com/akaariai/django/tree/fast_delete for this. I think it is ready for 1.5 inclusion.

The patch is in two commits:

If nothing surprising arises, I will push this into 1.5 before feature freeze.

comment:3 Changed 22 months ago by akaariai

Some bugs optimizations done in the above branch. I know there is still more to do in DeleteQuery.delete_qs() - at least the logic for primary key table alias selection isn't valid - there is no guarantee that the get_initial_alias() contains the pk field.

I did some little benchmarking, the models:

class Place:
    name = TextField

class AA:
    place = ForeignKey(Place, null=True)

The tests are: create 0, 1, 10, or 100 AA and delete with AA.objects.all().delete(), and similarly 0, 1, 10 or 100 AA objects, this time deleted with AA.objects.filter(place__name='p').delete(). The first test hits the single-table delete in delete_qs, the second case hits the joined delete in delete_qs.

The speed results for PostgreSQL are as follows:

testmasterfast_delete
all, 0.203827.155007
all, 1..314117.157289
all, 10.371872.153490
all, 100.878677.203811
filter, 0.476457.545436
filter, 1.468119.503444
filter, 10.569284.503499
filter, 1001.356935.693341

So, on all but joined filter with 0 or 1 deleted object the patch is faster. There is some variance in the results, as can be seen from the .545 result for the 0 object filter case. That result is usually faster than the case with one object.

The reason for the speed difference is that the delete query with an inner join is a little slow for the 0 objects case, and there is some extra overhead from query.clone() in the inner join case.

The relative speeds are similar on in-memory SQLite.

The above test-case is favouring the old behaviour of fetching all the objs into memory, as there isn't actually any data. For large deletes we have a clear win. For small deletes we have minor improvement in the non-joined case, and a minor loss for the joined case. For huge deletes it is easy to construct arbitrarily large differences in speed.

comment:4 Changed 22 months ago by akaariai

  • Triage Stage changed from Accepted to Ready for checkin

I have updated the patch in fast_delete branch. I think this one is ready for commit.

I will be doing a squash merge with just two commits, one for the has_signals part, another for the rest of the patch.

Last chance to review!

comment:5 Changed 22 months ago by CollinAnderson

Here's the actual diff for your branch, btw. I had a hard time viewing without a pull request.

https://github.com/akaariai/django/compare/1360bd418...fast_delete

comment:6 Changed 22 months ago by akaariai

Pull request no. 400 created: https://github.com/django/django/pull/400

comment:7 Changed 22 months ago by Anssi Kääriäinen <akaariai@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 1cd6e04cd4f768bcd4385b75de433d497d938f82:

Fixed #18676 -- Allow fast-path deletion of objects

Objects can be fast-path deleted if there are no signals, and there are
no further cascades. If fast-path is taken, the objects do not need to
be loaded into memory before deletion.

Thanks to Jeremy Dunck, Simon Charette and Alex Gaynor for reviewing
the patch.

comment:8 Changed 21 months ago by pressureman

After doing a git bisect, I discovered that the commit 1cd6e04cd4f768bcd4385b75de433d497d938f82 caused a regression in an existing project of mine, that uses a custom object manager. The setup is as follows (simplified - other models also have ForeignKey back to OrgUnit model):

class OrgUnit(models.Model):
    name = models.CharField(max_length=64, unique=True)
    groups = models.ManyToManyField(Group, blank=True)
    users = models.ManyToManyField(User, blank=True)

class LoginManager(models.Manager):

    def restricted(self, user):
        q_group = Q(orgunit__groups__user=user)
        q_user = Q(orgunit__users=user)
        return self.filter(q_group | q_user).distinct()

class Login(models.Model):
    description = models.CharField(max_length=32)
    orgunit = models.ForeignKey(OrgUnit)
    private_data = models.TextField(editable=False)

    objects = LoginManager()

My view POSTs a list of ids to a view that deletes Login objects, and it throws a DB exception "subquery has too many columns". I can see quite clearly from the generated SQL that the sub-select is including unnecessary fields for the "DELETE FROM foo WHERE id in (SELECT id, description FROM login ... )". I don't understand why the sub-select is including the Login.description field :-?

The following is the code in the view that triggers the exception:

id_list = map(int, request.POST.getlist('_id_list'))
Login.objects.restricted(request.user).filter(id__in=id_list).delete()

What I've found is that if I modify the LoginManager.restricted() to either not filter() (undesirable) or not .distinct() then the bug does not occur.

This code is a few years old now, and I don't remember why exactly I put a .distinct() in there. It doesn't seem to cause any undesirable side effects if I remove it, but I'd like to try to understand why this Django commit has caused this regression in the first place, as it may affect other people.

comment:9 Changed 21 months ago by akaariai

I will take a look. I wonder why just one field is added to the subquery, I would guess that either all of them would be added or none would be added.

comment:10 Changed 21 months ago by akaariai

Can't reproduce your example, this is what I tried: https://github.com/akaariai/django/commit/71064d04a0b1a834abe56e73e3c04e6687281748 - no errors on any DB.

However I spotted problems if the subquery has extra_select, select_related or aggregates in it and am working towards a fix for that. Could this be the reason for the problem?

comment:11 Changed 21 months ago by akaariai

I created #19102 for the problem with extra, select_related and aggregates.

comment:12 Changed 21 months ago by pressureman

After trawling through my commit history, I still can't figure out why I added the distinct() to my custom model manager. Therefor, I've decided to drop the distinct(). I'm not noticing any unwelcome side effects.

As for not being able to reproduce the problem with my model schema above, I suppose it could be any number of other factors in my project, which I can't think of off the top of my head... perhaps it's because I have multiple models linking back to the OrgUnit model.

I think there's a good chance that the symptoms described in #19102 are related, and perhaps when they are resolved, then so will this ticket be. Certainly the unwanted appearance of the second field in the sub-select matches exactly the symptoms I was seeing with the distinct() in my model manager.

comment:13 Changed 21 months ago by pressureman

  • Cc django@… added

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.