Opened 3 months ago

Last modified 7 days ago

#35677 assigned Bug

Unexpected behaviour of Prefetch with queryset filtering on a through model

Reported by: David Glenck Owned by: YashRaj1506
Component: Database layer (models, ORM) Version: 5.1
Severity: Normal Keywords: Prefetch, prefetch_related, many-to-many
Cc: Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Currently prefetching many-to-many objects with a queryset that filters based on fields in the through model, will return different results than applying the same filter without prefetching.

Assume the following many-to-many relationship with a through model:

class Subscriber(models.Model):
    name = models.CharField(max_length=255)
    subscriptions = models.ManyToManyField('Subscription', related_name='subscribers', through='Status')


class Subscription(models.Model):
    provider_name = models.CharField(max_length=255)


class Status(models.Model):
    subscriber = models.ForeignKey(Subscriber, related_name='status', on_delete=models.CASCADE)
    subscription = models.ForeignKey(Subscription, related_name='status', on_delete=models.CASCADE)
    is_active = models.BooleanField(default=True)

Populated with the following data:

    subscriber1 = Subscriber.objects.create(name="Sofia")
    subscriber2 = Subscriber.objects.create(name="Peter")
    subscription1 = Subscription.objects.create(provider_name="ABC")
    subscription2 = Subscription.objects.create(provider_name="123")
    Status.objects.create(subscriber=subscriber1, subscription=subscription1, is_active=True)
    Status.objects.create(subscriber=subscriber1, subscription=subscription1, is_active=True)
    Status.objects.create(subscriber=subscriber1, subscription=subscription2, is_active=False)
    Status.objects.create(subscriber=subscriber2, subscription=subscription1, is_active=True)
    Status.objects.create(subscriber=subscriber2, subscription=subscription2, is_active=True)

To get the active subscriptions of Sofia I can do this:

subscriber1.subscriptions.filter(status__is_active=True)

which gives me as expected twice the first subscription

Now, I try to prefetch the active subscriptions like this:

prefetched = Subscriber.objects.filter(pk__in=[subscriber1.pk, subscriber2.pk]).prefetch_related(
    Prefetch(
        'subscriptions',
        queryset=Subscription.objects.filter(status__is_active=True),
        to_attr='active_subscriptions'
    )
)
prefetched[0].active_subscriptions

But if I do this, I get queryset with 4 times instead of 2 times the first subscription.

Looking into the source I found, that in the first case, the internal filter of the related_descriptor is made "sticky", such that it will be combined with the next filter that is applied, as if only one filter was used. So it behaves equivalent to:

Subscription.objects.filter(subscribers=subscriber1.id, status__is_active=True)

The prefetch on the other hand doesn't do any magic to stick the filters together and it will behave like this:

Subscription.objects.filter(subscribers=subscriber1.id).filter(status__is_active=True)

which, as well documented, is behaving differently for many-to-many relationships: https://docs.djangoproject.com/en/5.0/topics/db/queries/#spanning-multi-valued-relationships

Ideally the first filter on the queryset would also stick to the internal filter. Or at least there should be an easy way to change the behaviour, when needed.

Change History (12)

comment:1 by Natalia Bidart, 3 months ago

Resolution: needsinfo
Status: newclosed

Hello David, thank you for the detailed report and the models and reproduction steps you provided. I’m having trouble understanding the specific behavior you’re reporting that seems inconsistent with the documentation you referenced.

If the issue you’re encountering is related to the documented behavior of spanning multi-valued relationships, it might be a duplicate of issues such as #29196, #29271, #16554, or others. If the issue is different, could you please provide a brief summary or rephrase what you believe the bug in Django is?

comment:2 by David Glenck, 3 months ago

Hi Natalia

Thank you for pointing me to some related issues.
I checked them and they refer to some behaviour I also referred to, but the issue I am pointing out is a different one.

What I am trying to do is to optimize a repeating query of this shape (but of course more complex in reality)

for subscriber in subscribers:
    subscriber.subscriptions.filter(status__is_active=True)

by prefetching the related objects like this:

prefetched = Subscriber.objects.filter(pk__in=subscribers).prefetch_related(
    Prefetch(
        'subscriptions',
        queryset=Subscription.objects.filter(status__is_active=True),
        to_attr='active_subscriptions'
    )
)
for subscriber in prefetched:
    prefetched[0].active_subscriptions

From the code above I would expect, that the prefetch gives the same results as my un-prefetched query. But it doesn't.

This is because the first case, django does some magic internally, to combine my .filter(status__is_active=True), with the internal filter to select subscriptions from this specific subscriber. Such that the result is, as if it would be a single filter (which is relevant in many-to-many relations).
But in the case of the prefetch django doesn't do that magic internally and thus it returns a different result.

So my proposal is to make the prefetched result behave the same as the non-prefetched one by modifying the internal behaviour of the prefetch query. Or at least give an option to do so. Because currently I have no clean way to instruct the prefetch to do what I want.
I'm not sure if this is a bug or a feature request, because this specific case is not mentioned in the documentation, but I would argue that it is the expected behaviour.

For illustration, this dirty workaround makes the prefetch return the result, that I would expect (but has undesired side-effects):

class StickyQueryset(models.QuerySet):
    def _chain(self):
        self._sticky_filter = True
        return self

class Subscription(models.Model):
    provider_name = models.CharField(max_length=255)

    objects = StickyQueryset.as_manager()

I tried looking deeper into the QuerySet code to see if I can propose some patch. But it's rather complex code in there.

comment:3 by Simon Charette, 3 months ago

I tried looking deeper into the QuerySet code to see if I can propose some patch. But it's rather complex code in there.

I would search for the _next_is_sticky() token around get_prefetch_querysets implementations for many-to-many fields. It seems like we are already setting this flag in ManyRelatedManager.get_prefetch_querysets (can't link as Github is down) though so I'm surprised that the filter calls are not considered to be combined.

comment:4 by Simon Charette, 3 months ago

Cc: Simon Charette added
Type: UncategorizedBug

I suspect that the reason why subscriber.subscriptions.filter(status__is_active=True) works is that the descriptor for Subscriber.subscription calls _next_is_sticky() before any filters is applied while when it's not the case for the queryset passed to Prefetch.

From the _next_is_sticky docstring

Indicate that the next filter call and the one following that should be treated as a single filter.

In the case of subscriber.subscription.filter(status__is_active) the resulting chain is

Subcription.objects.all()._next_is_sticky().filter(subscribers=subscriber).filter(status__is_active)

while the call chain of prefetch(Prefetch("subscriptions", Subscripton.objects.filter(status__is_active)) is

Subscripton.objects.filter(status__is_active))._next_is_sticky().filter(subscribers=subscriber)

Which doesn't have the intended effect since _next_is_sticky() is not called prior to the first filter() call.

Calling _next_is_sticky() (which is effectively what you emulated with your StickyQueryset) before calling filter(status__is_active) has the same effect

prefetched = Subscriber.objects.filter(pk__in=[subscriber1.pk, subscriber2.pk]).prefetch_related(
    Prefetch(
        'subscriptions',
        queryset=Subscription.objects.all()._next_is_sticky().filter(status__is_active=True)._next_is_sticky(),
        to_attr='active_subscriptions'
    )
)

The second _next_is_sticky call is necessary because ManyRelatedManager.get_prefetch_querysets calls to using triggers _chain and clears it.

All in all this whole sticky notion is kind of hacky and simply doesn't appear appropriate in the context of ManyRelatedManager.get_prefetch_querysets (as there is no follow up filter call). It seems that we need in there is not _next_is_sticky but a way to let the ORM know that some filter calls against multi-valued relationships should reuse existing JOINs no matter what. I know we have a ticket for that but I can't find it.

comment:5 by Simon Charette, 3 months ago

Resolution: needsinfo
Status: closednew
Triage Stage: UnreviewedAccepted

I think the following is what you are requesting. I'm re-opening Natalia as I believe this is a legitimate bug.

  • django/db/models/fields/related_descriptors.py

    diff --git a/django/db/models/fields/related_descriptors.py b/django/db/models/fields/related_descriptors.py
    index bc288c47ec..5356e28d22 100644
    a b def __set__(self, instance, value):  
    9494        instance.__dict__[self.field.attname] = value
    9595
    9696
    97 def _filter_prefetch_queryset(queryset, field_name, instances):
     97def _filter_prefetch_queryset(queryset, field_name, instances, db):
    9898    predicate = Q(**{f"{field_name}__in": instances})
    99     db = queryset._db or DEFAULT_DB_ALIAS
     99    db = db or DEFAULT_DB_ALIAS
    100100    if queryset.query.is_sliced:
    101101        if not connections[db].features.supports_over_clause:
    102102            raise NotSupportedError(
    def get_prefetch_querysets(self, instances, querysets=None):  
    785785                )
    786786            queryset = querysets[0] if querysets else super().get_queryset()
    787787            queryset._add_hints(instance=instances[0])
    788             queryset = queryset.using(queryset._db or self._db)
     788            db = queryset._db or self._db
     789            queryset = queryset.using(db)
    789790
    790791            rel_obj_attr = self.field.get_local_related_value
    791792            instance_attr = self.field.get_foreign_related_value
    792793            instances_dict = {instance_attr(inst): inst for inst in instances}
    793             queryset = _filter_prefetch_queryset(queryset, self.field.name, instances)
     794            queryset = _filter_prefetch_queryset(
     795                queryset, self.field.name, instances, db
     796            )
    794797
    795798            # Since we just bypassed this class' get_queryset(), we must manage
    796799            # the reverse relation manually.
    def get_prefetch_querysets(self, instances, querysets=None):  
    11651168                )
    11661169            queryset = querysets[0] if querysets else super().get_queryset()
    11671170            queryset._add_hints(instance=instances[0])
    1168             queryset = queryset.using(queryset._db or self._db)
     1171            db = queryset._db or self._db
     1172            # Make sure filters are applied in a "sticky" fashion to reuse
     1173            # multi-valued relationships like direct filter() calls against
     1174            # many-to-many managers do.
     1175            queryset.query.filter_is_sticky = True
    11691176            queryset = _filter_prefetch_queryset(
    1170                 queryset._next_is_sticky(), self.query_field_name, instances
     1177                queryset, self.query_field_name, instances, db
    11711178            )
     1179            queryset = queryset.using(db)
    11721180
    11731181            # M2M: need to annotate the query in order to get the primary model
    11741182            # that the secondary model was actually related to. We know that

David, if you'd up to it it seems that all this bugfix is missing are tests that can be added to the prefetch_related test app if you're interested in submitting a PR once Github is back online.

in reply to:  4 comment:6 by Natalia Bidart, 3 months ago

Replying to Simon Charette:

All in all this whole sticky notion is kind of hacky and simply doesn't appear appropriate in the context of ManyRelatedManager.get_prefetch_querysets (as there is no follow up filter call). It seems that we need in there is not _next_is_sticky but a way to let the ORM know that some filter calls against multi-valued relationships should reuse existing JOINs no matter what. I know we have a ticket for that but I can't find it.

Simon, perhaps this is the ticket you are looking for? #27303, at first it seems like an admin specific report but reading on it feels it has some similarities with the "sticky" bits.
Also thank you for your further analysis and reopening.

comment:7 by Simon Charette, 3 months ago

Thanks Natalia, #27303 is effectively a manifestation of this problem in the admin. We don't have a good way to denote that joins against multi-valued relationships should be reused between different QuerySet method calls. It's a problem we hacked around for filter with this sticky notion but the problem exists for annotate and any other ORM method that allows for multi-valued (many-to-many or reverse one-to-many) to be referenced.

I believe there are other tickets that discussed a generic way to alias such relationships so they are always reusable, it was brought up during the design of FilteredRelation for sure.

IIRC the design was something along the lines of

Subscriber.objects.alias(
    # Force the reuse of all JOINs up to Susbcription (including Status)
    subscriptions=Relation("subscriptions", reuse=True),
).filter(subscriber=subscriber).filter(status__is_active=True)
Last edited 3 months ago by Simon Charette (previous) (diff)

comment:8 by David Glenck, 3 months ago

Hello Simon

thank you for the proposed bugfix. I prepared a test and applied your patch and it passes with your fix.
However, if I do an annotation after the filter (which I do in my real world use case). The same happens, when I add another prefetch_related after the filter.

prefetched = Subscriber.objects.filter(pk__in=subscribers).prefetch_related(
    Prefetch(
        'subscriptions',
        queryset=Subscription.objects.filter(status__is_active=True).annotate(active=F('status__is_active')),
        to_attr='active_subscriptions'
    )
)

This makes sense, because the last filter/annotation etc. is made sticky, not the first, like in the case of the many-to-many manager.
Knowing this, for some cases this can be fixed, by moving the additional annotation before the filter.

But if I do an annotation like this, is seems to fail regardless:

prefetched = Subscriber.objects.filter(pk__in=subscribers).prefetch_related(
    Prefetch(
        'subscriptions',
        queryset=Subscription.objects.annotate(active=F('status__is_active')).filter(active=True),
        to_attr='active_subscriptions'
    )
)

The many-to-many manager seems to handle this case correctly.

subscriber1.subscriptions.annotate(active=F('status__is_active')).filter(active=True)

I have started a branch with the tests here: https://github.com/django/django/compare/main...pascalfree:django:35677

comment:9 by Simon Charette, 2 months ago

I think there might be a way to solve the annotate issue as well but it's not trivial.

When annotate is used sql.Query.used_aliases doesn't get populated

>>> Subscription.objects.annotate(active=F('status__is_active')).query.used_aliases
set()

This means that even if the subsequent filter() wasn't clearing it (it does since filter_is_sticky is False by then) by the time it makes its way to the prefetching logic there is nothing left in used_alias that can be reused.

The only solution I can think of is to add the filter while allowing pre-existing many-to-many aliases to be reused

  • django/db/models/fields/related_descriptors.py

    diff --git a/django/db/models/fields/related_descriptors.py b/django/db/models/fields/related_descriptors.py
    index 5356e28d22..7b003d02cd 100644
    a b def _filter_prefetch_queryset(queryset, field_name, instances, db):  
    112112        if high_mark is not None:
    113113            predicate &= LessThanOrEqual(window, high_mark)
    114114        queryset.query.clear_limits()
    115     return queryset.filter(predicate)
     115    queryset.query.add_q(predicate, reuse_all_aliases=True)
     116    return queryset
    116117
    117118
    118119class ForwardManyToOneDescriptor:
    def get_prefetch_querysets(self, instances, querysets=None):  
    11721173            # Make sure filters are applied in a "sticky" fashion to reuse
    11731174            # multi-valued relationships like direct filter() calls against
    11741175            # many-to-many managers do.
    1175             queryset.query.filter_is_sticky = True
    11761176            queryset = _filter_prefetch_queryset(
    11771177                queryset, self.query_field_name, instances, db
    11781178            )
  • django/db/models/sql/query.py

    diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
    index aef3f48f10..0945aa9198 100644
    a b def build_filter(  
    16021602    def add_filter(self, filter_lhs, filter_rhs):
    16031603        self.add_q(Q((filter_lhs, filter_rhs)))
    16041604
    1605     def add_q(self, q_object):
     1605    def add_q(self, q_object, reuse_all_aliases=False):
    16061606        """
    16071607        A preprocessor for the internal _add_q(). Responsible for doing final
    16081608        join promotion.
    def add_q(self, q_object):  
    16161616        existing_inner = {
    16171617            a for a in self.alias_map if self.alias_map[a].join_type == INNER
    16181618        }
    1619         clause, _ = self._add_q(q_object, self.used_aliases)
     1619        if reuse_all_aliases:
     1620            can_reuse = set(self.alias_map)
     1621        else:
     1622            can_reuse = self.used_aliases
     1623        clause, _ = self._add_q(q_object, can_reuse)
    16201624        if clause:
    16211625            self.where.add(clause, AND)
    16221626        self.demote_joins(existing_inner)

Ideally we wouldn't reuse all aliases though, only the ones that we know for sure we want to reuse for _filter_prefetch_queryset(field_name), which can be obtained through a combination of names_to_path and comparison to alias_map.

Last edited 2 months ago by Simon Charette (previous) (diff)

comment:10 by Ahmed Ibrahim, 2 months ago

Can I help with this one? I have time

comment:11 by Simon Charette, 2 months ago

Ahmed, thanks for offering but without a track record of contributing to the ORM I think this one might be a bit too much particularly to get the discussed names_to_path and alias_map working. We should at least ensure that the proposed solution addresses David problem.

comment:12 by YashRaj1506, 7 days ago

Owner: set to YashRaj1506
Status: newassigned
Note: See TracTickets for help on using tickets.
Back to Top