#18676 closed Cleanup/optimization (fixed)
Django should do m2m deletes in a single query when possible
Reported by: | Jeremy Dunck | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.4 |
Severity: | Normal | Keywords: | |
Cc: | django@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
https://groups.google.com/forum/?fromgroups#!topic/django-developers/zoS8gAyZWjI
As Anssi notes there:
To me it seems there is no reason to do two queries in delete if the following conditions hold: - there are no signals sent for the deleted objects - the delete does not cascade further (actually you don't need to collect even in this case - you can do the cascade query completely in the DB without fetching the IDs first).
I think a good first-cut is auto-created intermediate table and no signals attached for that model.
Change History (14)
comment:1 by , 12 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 12 years ago
Has patch: | set |
---|
I am going to hijack this ticket for "fast-path" deletion. This is in effect for any model that:
- does not have signals attached for pre/post delete (or m2m_changed, if the model happens to be m2m model).
- does not lead to further cascades (all cascades except DO_NOTHING prevent the fast path).
- do not have parents (this is a special case of reverse-cascade).
I have a patch at https://github.com/akaariai/django/tree/fast_delete for this. I think it is ready for 1.5 inclusion.
The patch is in two commits:
If nothing surprising arises, I will push this into 1.5 before feature freeze.
comment:3 by , 12 years ago
Some bugs optimizations done in the above branch. I know there is still more to do in DeleteQuery.delete_qs() - at least the logic for primary key table alias selection isn't valid - there is no guarantee that the get_initial_alias() contains the pk field.
I did some little benchmarking, the models:
class Place: name = TextField class AA: place = ForeignKey(Place, null=True)
The tests are: create 0, 1, 10, or 100 AA and delete with AA.objects.all().delete(), and similarly 0, 1, 10 or 100 AA objects, this time deleted with AA.objects.filter(place__name='p'
).delete(). The first test hits the single-table delete in delete_qs, the second case hits the joined delete in delete_qs.
The speed results for PostgreSQL are as follows:
test | master | fast_delete |
all, 0 | .203827 | .155007 |
all, 1 | ..314117 | .157289 |
all, 10 | .371872 | .153490 |
all, 100 | .878677 | .203811 |
filter, 0 | .476457 | .545436 |
filter, 1 | .468119 | .503444 |
filter, 10 | .569284 | .503499 |
filter, 100 | 1.356935 | .693341 |
So, on all but joined filter with 0 or 1 deleted object the patch is faster. There is some variance in the results, as can be seen from the .545 result for the 0 object filter case. That result is usually faster than the case with one object.
The reason for the speed difference is that the delete query with an inner join is a little slow for the 0 objects case, and there is some extra overhead from query.clone() in the inner join case.
The relative speeds are similar on in-memory SQLite.
The above test-case is favouring the old behaviour of fetching all the objs into memory, as there isn't actually any data. For large deletes we have a clear win. For small deletes we have minor improvement in the non-joined case, and a minor loss for the joined case. For huge deletes it is easy to construct arbitrarily large differences in speed.
comment:4 by , 12 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
I have updated the patch in fast_delete branch. I think this one is ready for commit.
I will be doing a squash merge with just two commits, one for the has_signals part, another for the rest of the patch.
Last chance to review!
comment:5 by , 12 years ago
Here's the actual diff for your branch, btw. I had a hard time viewing without a pull request.
https://github.com/akaariai/django/compare/1360bd418...fast_delete
comment:6 by , 12 years ago
Pull request no. 400 created: https://github.com/django/django/pull/400
comment:7 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:8 by , 12 years ago
After doing a git bisect, I discovered that the commit 1cd6e04cd4f768bcd4385b75de433d497d938f82 caused a regression in an existing project of mine, that uses a custom object manager. The setup is as follows (simplified - other models also have ForeignKey
back to OrgUnit
model):
class OrgUnit(models.Model): name = models.CharField(max_length=64, unique=True) groups = models.ManyToManyField(Group, blank=True) users = models.ManyToManyField(User, blank=True) class LoginManager(models.Manager): def restricted(self, user): q_group = Q(orgunit__groups__user=user) q_user = Q(orgunit__users=user) return self.filter(q_group | q_user).distinct() class Login(models.Model): description = models.CharField(max_length=32) orgunit = models.ForeignKey(OrgUnit) private_data = models.TextField(editable=False) objects = LoginManager()
My view POSTs a list of ids to a view that deletes Login objects, and it throws a DB exception "subquery has too many columns". I can see quite clearly from the generated SQL that the sub-select is including unnecessary fields for the "DELETE FROM foo WHERE id in (SELECT id, description FROM login ... )". I don't understand why the sub-select is including the Login.description
field :-?
The following is the code in the view that triggers the exception:
id_list = map(int, request.POST.getlist('_id_list')) Login.objects.restricted(request.user).filter(id__in=id_list).delete()
What I've found is that if I modify the LoginManager.restricted()
to either not filter()
(undesirable) or not .distinct()
then the bug does not occur.
This code is a few years old now, and I don't remember why exactly I put a .distinct()
in there. It doesn't seem to cause any undesirable side effects if I remove it, but I'd like to try to understand why this Django commit has caused this regression in the first place, as it may affect other people.
comment:9 by , 12 years ago
I will take a look. I wonder why just one field is added to the subquery, I would guess that either all of them would be added or none would be added.
comment:10 by , 12 years ago
Can't reproduce your example, this is what I tried: https://github.com/akaariai/django/commit/71064d04a0b1a834abe56e73e3c04e6687281748 - no errors on any DB.
However I spotted problems if the subquery has extra_select, select_related or aggregates in it and am working towards a fix for that. Could this be the reason for the problem?
comment:11 by , 12 years ago
I created #19102 for the problem with extra, select_related and aggregates.
comment:12 by , 12 years ago
After trawling through my commit history, I still can't figure out why I added the distinct()
to my custom model manager. Therefor, I've decided to drop the distinct()
. I'm not noticing any unwelcome side effects.
As for not being able to reproduce the problem with my model schema above, I suppose it could be any number of other factors in my project, which I can't think of off the top of my head... perhaps it's because I have multiple models linking back to the OrgUnit model.
I think there's a good chance that the symptoms described in #19102 are related, and perhaps when they are resolved, then so will this ticket be. Certainly the unwanted appearance of the second field in the sub-select matches exactly the symptoms I was seeing with the distinct()
in my model manager.
comment:13 by , 12 years ago
Cc: | added |
---|
This seems entirely reasonable as a goal, I'm accepting and further discussion can happen around any patch submitted.