Opened 10 years ago
Last modified 10 years ago
#25464 new New feature
Allow skipping IN clause on prefetch queries — at Version 2
| Reported by: | Erik Cederstrand | Owned by: | nobody | 
|---|---|---|---|
| Component: | Database layer (models, ORM) | Version: | dev | 
| Severity: | Normal | Keywords: | |
| Cc: | Simon Charette, Shai Berger | Triage Stage: | Accepted | 
| Has patch: | yes | Needs documentation: | yes | 
| Needs tests: | yes | Patch needs improvement: | no | 
| Easy pickings: | no | UI/UX: | no | 
Description (last modified by )
When using prefetch_related() on a large queryset, the prefetch query SQL can be inefficient. Consider this:
    Category.objects.filter(type=5).prefetch_related('items')
If 100.000 categories have type=5, then an IN clause with 100.000 Category IDs is generated to get the Item objects. Even with a custom queryset using a Prefetch() object, the IN clause is generated, even though it is A) redundant, B) sends a potentially multi-megabyte SQL statement over the wire for the database to process, C) may confuse the query planner to generate an inefficient execution plan, and D) doesn't scale:
    Category.objects.filter(type=5).prefetch_related(Prefetch('items', queryset=Item.objects.filter(category__item=5)))
Pull request https://github.com/django/django/pull/5356 adds the possibility to skip the IN clause in cases where we are sure that a better queryset will get (at least) the same items as the IN clause would:
    Category.objects.filter(type=5).prefetch_related(Prefetch('items', queryset=Item.objects.filter(category__item=5), filter_on_instances=False))
In my tests, this speeds up prefetch_related() by 20x-50x on large querysets.
Change History (2)
comment:1 by , 10 years ago
| Needs documentation: | set | 
|---|---|
| Needs tests: | set | 
| Triage Stage: | Unreviewed → Accepted | 
comment:2 by , 10 years ago
| Description: | modified (diff) | 
|---|
Accepting the basic idea here. I'd be interested to see some performance profiles across a few databases here though, it seems your inner query there is potentially doing an enormous join to do it's filter, I'm slightly surprised this is that much more efficient.
Patch needs tests and documentation, and also needs to reassure me what happens if the prefetched queryset contains extraneous rows. Are they ignored? Does it throw an error? What should the result be?