Opened 7 years ago

Closed 7 years ago

Last modified 4 years ago

#25544 closed Cleanup/optimization (fixed)

prefetch_related sends duplicate ids to database

Reported by: Julien Hartmann Owned by: Ian Foote
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: prefetch duplicate
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

That may be intended, but it seems dubious, so…

Issue happens when prefetching a model relation from another one, linked by a M2M field. Sample case:

class Picture(Model):
    # some fields

class Collection(Model):
    pictures = ManyToManyField(Picture, blank=True,
                               related_name='collections')
    # some fields

class CollectionPermission(Model):
    content_object = ForeignKey(Collection,
                                related_name='permissions')
    # fk to users and fk to permissions

When prefetching collection permissions from a picture queryset, like this:

Picture.objects.prefetch_related('collections__permissions')

The third query (the second level of prefetching) looks like this:

SELECT "gallery_collectionpermission"."id", 
       "gallery_collectionpermission"."user_id", 
       "gallery_collectionpermission"."permission_id", 
       "gallery_collectionpermission"."content_object_id" 
FROM   "gallery_collectionpermission" 
WHERE  "gallery_collectionpermission"."content_object_id" IN (
             11, 11, 11, 11, 
             11, 11, 11, 11, 
             11, 11, 10, 10, 
             10, 10, 10, 10, 
             10, 10, 10, 10, 
             10, 10, 10, 10, 
             10, 10, 10, 10, 
             10, 10, 10, 10, 
             10, 10, 9, 9,
             9, 9, 9 )

The issue arises in two steps:

  • As prefetch has no identity mapping, the first prefetching step will build a different Collection instance for every picture that uses it.
  • Then the second prefetching step will gather all instances, and dump their ids to the database without looking for duplicates.

Ideally, this could be fixed at the first step: when a queryset prefetches the same row several times, it would make sense to return a single instance. In this example, all Pictures' cache should point to the same Gallery instances.

The second step could also be fixed: the related manager could eliminate duplicates when building the query. For some reason the ReverseSingleRelatedObjectDescriptor does it, but none of the other descriptors or managers do.

The cost seems huge. Typically, I have collections of hundreds of pictures, but each picture will only appear in 4 or 5 collections. The query thus sends the db server tens of thousands of ids instead of a few dozens.

Is there something I missed? Or is this use case out of the normal scope of prefetch_related?

Change History (16)

comment:1 Changed 7 years ago by Tim Graham

Triage Stage: UnreviewedAccepted
Type: UncategorizedCleanup/optimization

I'm not too familiar with this, but accepting on the basis for further investigation.

comment:2 Changed 7 years ago by Ian Foote

Owner: changed from nobody to Ian Foote
Status: newassigned

comment:3 Changed 7 years ago by Carl Meyer

Triage Stage: AcceptedReady for checkin

Patch is ready, IMO. Waiting for actual merge until the CI catches up...

comment:4 Changed 7 years ago by Tim Graham

Has patch: set
Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

postgres_tests.test_array.TestQuerying.test_in fails.

comment:5 Changed 7 years ago by Carl Meyer

@IanFoote Oops, my fault. I didn't consider the possibility that non-hashable values (e.g. lists for an ArrayField) could be passed to __in lookups. I don't really see a way to preserve this fix in the face of that possibility (going through the value and trying to convert everything to something hashable doesn't seem like a good plan), so unless you have any ideas I'm not seeing, I think we may need to revert to your previous fix at the prefetch_related layer.

Sorry for the bad direction!

comment:6 Changed 7 years ago by Simon Charette

@carljm, what do you think of simply special casing ArrayField's __in lookup which is something we already do for __exact anyway.

Something like the following comes to mind.

@ArrayField.register_lookup
class ArrayInLookup(InLookup):
    def get_prep_lookup(self):
        values = super(ArrayInLookup, self).get_prep_lookup()
        return [tuple(value) for value in values]

comment:7 Changed 7 years ago by Carl Meyer

@charettes I think that's a good idea! As long as we're confident that a) there are no other valid cases outside ArrayField where an unhashable value might be passed in an __in lookup rhs, and b) every value passed to an ArrayField lookup should always be iterable. But I can't come up with any counter-examples to either of those.

comment:8 Changed 7 years ago by Anssi Kääriäinen

A possibility is to try-except the deduping of the values. If it doesn't succeed, just use the original values.

comment:9 Changed 7 years ago by Julien Hartmann

I do not know if it is related, but ArrayField.get_db_prep_value always converts tuples to lists.

Last edited 7 years ago by Julien Hartmann (previous) (diff)

comment:10 Changed 7 years ago by Simon Charette

Might be worth pinging Mark about this.

comment:11 Changed 7 years ago by Ian Foote

Patch needs improvement: unset

comment:12 Changed 7 years ago by Carl Meyer

Triage Stage: AcceptedReady for checkin

comment:13 Changed 7 years ago by Tim Graham

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

Latest comments on the PR indicates a few more tests should be added.

comment:14 Changed 7 years ago by Simon Charette

Patch needs improvement: unset
Version: 1.8master

The comments have been addressed.

comment:15 Changed 7 years ago by Tim Graham <timograham@…>

Resolution: fixed
Status: assignedclosed

In 86eccdc8:

Fixed #25544 -- Removed duplicate ids in prefetch_related() queries.

comment:16 Changed 4 years ago by Nick Pope

The fix to this resulted in a regression where parameters passed to __in were reordered due to the use of set(). See #29503.

Note: See TracTickets for help on using tickets.
Back to Top