Opened 3 years ago

Last modified 3 years ago

#24462 new New feature

Add a new QuerySet operation to use current results as a subquery

Reported by: trosos Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: distinct order_by subquery
Cc: mihadra@…, trosos Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When I try to order a queryset that uses distinct(...), e.g.:

qs2 = qs.order_by('a', 'b').distinct('a')
qs3 = qs2.order_by('c')

then Django seems to forget about the first order_by and apply the second order_by before applying distinct. I would expect the second order_by to be applied after applying distinct.

This is particularly troublesome when one wants to create a custom manager on top of qs2.

I don't know if this is the intended behavior, but it is very unexpected to me (see the section "Rationale").

Steps to reproduce

Use the Postgre SQL backend.

In models.py:

from django.db import models

class YoungestAmongNamesakes(models.Manager):
    def get_queryset(self):
        return super(YoungestAmongNamesakes, self).get_queryset().\
          order_by('name', 'age').distinct('name')

class Human(models.Model):
    nick = models.CharField(max_length=20)
    name = models.CharField(max_length=20)
    age = models.IntegerField()

    objects = models.Manager()
    youngest_among_namesakes = YoungestAmongNamesakes()

In the interactive shell:

Human.objects.create(nick='foo', name='Helen', age=20)
Human.objects.create(nick='bar', name='Helen', age=23)
Human.objects.create(nick='baz', name='Jennifer', age=15)

for human in Human.youngest_among_namesakes.all():
    print human.nick

for human in Human.youngest_among_namesakes.order_by('name', 'nick'):
    print human.nick

for human in Human.youngest_among_namesakes.order_by('nick'):
    print human.nick

Actual results

The first iteration outputs:

foo
baz

The second iteration outputs:

bar
baz

The third iteration raises

ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Human.youngest_among_namesakes.all() generates:

SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC

Human.youngest_among_namesakes.order_by('name', 'nick') generates:

SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."nick" ASC

Similarly, Human.youngest_among_namesakes.order_by('nick') generates:

SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."nick" ASC

Expected results

I would expect the second iteration to output:

foo
baz

I would expect the third iteration to output:

baz
foo

I would expect Human.youngest_among_namesakes.order_by('name', 'nick') to generate the following query:

SELECT "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" WHERE "myapp_human"."id" IN (SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC) ORDER BY "myapp_human"."name" ASC, "myapp_human"."nick" ASC

Similarly, I would expect Human.youngest_among_namesakes.order_by('nick') to generate the following query:

SELECT "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" WHERE "myapp_human"."id" IN (SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC) ORDER BY "myapp_human"."nick" ASC

Rationale

I would expect any queryset qs to contain the same objects as qs.order_by(...), just in a possibly different order.

Workaround

As a (suboptimal) workaround, one might use a subquery instead:

class YoungestAmongNamesakes(models.Manager):
    def get_queryset(self):
        qs = super(YoungestAmongNamesakes, self).get_queryset().\
          order_by('name', 'age').distinct('name')
        return super(YoungestAmongNamesakes, self).get_queryset().filter(pk__in=qs)

This however generates unnecessarily complex SQL query for Human.youngest_among_namesakes.all().

Suggestion

I would suggest to rewrite django.db.models.query.QuerySet.order_by not to unconditionally clear the previous ordering, but to first check whether distinct(...) is used in the present queryset and use a subquery in such a case.

Something like this (the code is untested and written quickly just to document my thought):

    def order_by(self, *field_names):
        """
        Returns a new QuerySet instance with the ordering changed.
        """
        assert self.query.can_filter(), \
            "Cannot reorder a query once a slice has been taken."
        obj = self._clone()

        if self.query.distinct and self.query.distinct_fields:
            from django.db.models.sql.subqueries import AggregateQuery
            subquery = obj

            obj = AggregateQuery(obj.model)
            try:
                obj.add_subquery(subquery, using=self.db)
            except EmptyResultSet:
                obj = subquery
        else:
            obj.query.clear_ordering(force_empty=False)

        obj.query.add_ordering(*field_names)
        return obj

Attachments (1)

24462.diff (664 bytes) - added by Tim Graham 3 years ago.

Download all attachments as: .zip

Change History (8)

comment:1 Changed 3 years ago by trosos

Cc: trosos added
Type: UncategorizedBug

comment:2 Changed 3 years ago by Tim Graham

Not sure the proposal is feasible or a good idea, but attaching a doc patch about order_by() clearing previous ordering which we can commit in the meantime.

Changed 3 years ago by Tim Graham

Attachment: 24462.diff added

comment:3 Changed 3 years ago by Anssi Kääriäinen

I don't find changing the way .order_by().distinct().order_by() works a good idea. It *will* break a lot of existing code. For example, if you have automatic Meta ordering for model, what should Human.objects.distinct('name').order_by('name', 'age') return?

Instead I think we need a new queryset operation to force Django use the current results as a subquery. There are various cases where you'd want this. Simplest one is SomeModel.objects.order_by('posted')[0:10].subquery().order_by('-posted').

This feature is likely pretty hard to implement correctly. Generating the SQL for the subquery is easy, but generating references to the subquery's available fields correctly is likely going to be hard.

comment:4 Changed 3 years ago by Tim Graham <timograham@…>

In b9d9ab23:

Refs #24462 -- Emphasized that order_by() clears previous ordering.

comment:5 Changed 3 years ago by Tim Graham <timograham@…>

In f8ed647:

[1.7.x] Refs #24462 -- Emphasized that order_by() clears previous ordering.

Backport of b9d9ab23bdcc404708aada664e718a9d56415ca3 from master

comment:6 Changed 3 years ago by Tim Graham <timograham@…>

In 5b91802:

[1.8.x] Refs #24462 -- Emphasized that order_by() clears previous ordering.

Backport of b9d9ab23bdcc404708aada664e718a9d56415ca3 from master

comment:7 Changed 3 years ago by Tim Graham

Summary: Using .distinct().order_by() on a queryset produces unexpected resultsAdd a new QuerySet operation to use current results as a subquery
Triage Stage: UnreviewedAccepted
Type: BugNew feature
Version: 1.7master
Note: See TracTickets for help on using tickets.
Back to Top