#17276 closed Cleanup/optimization (fixed)
Slow anti-join query against Postgres
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.3 |
Severity: | Normal | Keywords: | orm, postgres, join, anti-join |
Cc: | Triage Stage: | Accepted | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
For background on this ticket please see the following two discussions:
- http://stackoverflow.com/questions/8190474/postgres-slow-outer-query-when-using-a-primary-key
- https://groups.google.com/group/django-users/browse_thread/thread/7d13f2d8748b4f9f
Basically, in a many-to-many mappings between models Student and Course, if I want to find all instances of Students that aren't registered for classes, I would issue the following Django query:
Student.objects.filter(course__isnull=True)
Django translates this into the following query:
SELECT "student"."id", "student"."name" FROM "student" LEFT OUTER JOIN "course_students" ON ("student"."id" = "course_students"."student_id") LEFT OUTER JOIN "course" ON ("course_students"."course_id" = "course"."id") WHERE "course"."id" IS NULL
The problem is that the way the WHERE clause is generated is very inefficient (at least when used with Postgres). Changing WHERE to "course_students"."student_id" IS NOT NULL yields orders of magnitude improved query plan. Here's the difference I'm seeing on real data:
- Django way: (cost=1072479.36..6256437.79 rows=1 width=145)
- Hand-crafted: (cost=1518.71..1533.35 rows=1 width=145)
I'm attaching a sample project with the model already set up. To see the generated SQL query, simply run "python manage.py anti-join."
Attachments (2)
Change History (8)
by , 13 years ago
Attachment: | AntiJoin.rar added |
---|
comment:1 by , 13 years ago
Triage Stage: | Unreviewed → Accepted |
---|
Accepted on the basis of Russell's comments in the thread linked. When it comes to implementation and the need to do different things for different databases, we may have to WONTFIX this, but we can see.
comment:2 by , 13 years ago
Description: | modified (diff) |
---|
comment:3 by , 13 years ago
The query is asking for items with no m2m assigned items. Thus a join against only the m2m intermediate table should suffice.
The last join in that chain is not necessary at all. Foreign keys ensure that if there is a row in course_students table, then there will be a matching row in course table. This means that if the first join yields any rows, so must the second join, too. Naturally, if there is no rows in the first join, then there will be no rows in the second join, either. The last join is not needed, as the null/not null status of the first join tells us already if the filter condition is correct or not.
There is a slight complication because of deferred foreign keys used in Django - it is possible to insert a row into course_students table without there being a matching row in course table. I don't know if that is something Django should guard against. If the user ends up in situation where he has a row in the m2m table, but not in the target table, he has done something wrong - m2m targets should always be saved into the DB first before adding them to m2m collections.
This complication of deferred foreign keys also make this a hard optimization target for databases. Django has an advantage in that even if we are using deferred foreign keys, we know that in normal usage the defer should not have any effect.
Sidenote: why exactly are we using deferred foreign keys, not foreign keys with DEFERRABLE INITIALLY IMMEDIATE and set them to deferred only when needed in fixture loading? Or am I missing some other valid use case for deferred foreign keys in Django? If so, that would naturally also make the above join-removal optimization invalid.
Trivial test case included.
by , 13 years ago
Attachment: | 17276_tests.diff added |
---|
follow-up: 6 comment:4 by , 13 years ago
I realized over the weekend that the same issue exists for one-to-many relationships. For example, consider these models:
class Parent(models.Model): name = models.CharField(max_length=100) class Child(models.Model): name = models.CharField(max_length=200) parents = models.ForeignKey(to=Parent, db_column="parent_id")
and the following Django query:
Parent.objects.filter(child__isnull=True)
for which Django will generate this SQL:
SELECT "demo_parent"."id", "demo_parent"."name" FROM "demo_parent" LEFT OUTER JOIN "demo_child" ON ("demo_parent"."id" = "demo_child"."parent_id") WHERE "demo_child"."id" IS NULL
So the same issue as with many-to-many where this query is orders of magnitude slower than a slightly modified one:
SELECT "demo_parent"."id", "demo_parent"."name" FROM "demo_parent" LEFT OUTER JOIN "demo_child" ON ("demo_parent"."id" = "demo_child"."parent_id") WHERE "demo_child"."parent_id" IS NULL
('WHERE "demo_child"."parent_id" IS NULL' vs. 'WHERE "demo_child"."id" IS NULL')
comment:5 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in Django 1.6 by 69597e5bcc89aadafd1b76abf7efab30ee0b8b1a.
comment:6 by , 4 years ago
Replying to dberansky:
I realized over the weekend that the same issue exists for one-to-many relationships. For example, consider these models:
class Parent(models.Model): name = models.CharField(max_length=100) class Child(models.Model): name = models.CharField(max_length=200) parents = models.ForeignKey(to=Parent, db_column="parent_id")and the following Django query:
Parent.objects.filter(child__isnull=True)for which Django will generate this SQL:
SELECT "demo_parent"."id", "demo_parent"."name" FROM "demo_parent" LEFT OUTER JOIN "demo_child" ON ("demo_parent"."id" = "demo_child"."parent_id") WHERE "demo_child"."id" IS NULLSo the same issue as with many-to-many where this query is orders of magnitude slower than a slightly modified one:
SELECT "demo_parent"."id", "demo_parent"."name" FROM "demo_parent" LEFT OUTER JOIN "demo_child" ON ("demo_parent"."id" = "demo_child"."parent_id") WHERE "demo_child"."parent_id" IS NULL('WHERE "demo_child"."parent_id" IS NULL' vs. 'WHERE "demo_child"."id" IS NULL')
This One2Many inefficient query is still there at least with django 2.2.
If anyone ends up here like me: the workaround to get the efficient query is:
Parent.objects.filter(child__parent__isnull=True)
sample project