Code

Opened 6 years ago

Closed 2 years ago

#6422 closed New feature (fixed)

Support for 'DISTINCT ON' queries with QuerySet.distinct()

Reported by: Manfred Wassmann <manolo@…> Owned by: jgelens
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: dceu2011
Cc: cornbread, ramusus@…, tim.babych@…, hr.bjarni+django@…, jeffrey@…, jpic, taylor.mitchell@…, diegobz, jonas-django@…, anssi.kaariainen@…, Kronuz Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The patch included modifies django/db/models/query.py so that the distinct method of the QuerySet object optionally takes variable number of field names and, if given, modifies the SQL statements generated so that 'DISTINCT ON (field,...)' is used. The incentive is to allow things like described in the following example:

class Example(models.Model):
  name = models.TextField()
  date = models. DateTimeField()
  other = models.XXX()
  ...

qs = Example.objects.all().distinct('name').order_by('name','-date')

Now qs will return the latest entry for each distinct name. This cannot otherwise be achieved unless resorting to plain SQL.

It should be noted that - at least in standard SQL and Postgres - if name allows NULL values, the query will return all entries for which name is NULL not only the latest one as one might expect.

There will be a couple of '# CHECKME:' comments scattered over the patched file. I tried to mark all places where the change might have a side effect but it wasn't apparent to me if and how to fix it. I.e. there should be no side effect from the code outside the commented regions.

Attachments (19)

query.py.patch (3.9 KB) - added by Manfred Wassmann <manolo@…> 6 years ago.
Patch for django/db/models/query.py
query.py.2.patch (4.5 KB) - added by Manfred Wassmann <manolo@…> 6 years ago.
New patch fixes count method and resolves quoting issues
ticket6422.diff (3.3 KB) - added by mrts 4 years ago.
Update patch to 1.1.X
#6422-distinct.diff (1.9 KB) - added by Kronuz 3 years ago.
Patch updated for 1.3
distinct_on.2.diff (11.9 KB) - added by tmitchell 3 years ago.
Add (failing) test for M2M intermediate model
distinct_on.diff (12.8 KB) - added by jgelens 3 years ago.
distinct_on.3.diff (12.1 KB) - added by tmitchell 3 years ago.
distinct_on.4.diff (18.4 KB) - added by jgelens 3 years ago.
distinct_on.5.diff (18.5 KB) - added by koniiiik 3 years ago.
Fixed patch with failing testcase (sorry, shouldn't have modified the patch by hand hastily...)
distinct_on.5.2.diff (18.5 KB) - added by charstring 3 years ago.
Added a failing test case for relationship traversals. (this one does apply)
distinct_on.6.diff (18.5 KB) - added by jgelens 3 years ago.
Fixed field traversal
distinct_on.7.diff (18.5 KB) - added by jgelens 3 years ago.
Updated distinct method in base class to match the new params
distinct_on.8.diff (12.0 KB) - added by ramiro 3 years ago.
Patch with docs and tests updates
distinct_on.9.diff (19.5 KB) - added by jgelens 2 years ago.
Added docstring
distinct_on.10.diff (21.4 KB) - added by jgelens 2 years ago.
fixed when calling distinct() multiple times on the same query set
distinct_on.11.diff (23.5 KB) - added by akaariai 2 years ago.
Possible approach for using correct aliases
distinct_on.12.diff (30.0 KB) - added by akaariai 2 years ago.
distinct_on.13.diff (42.5 KB) - added by jgelens 2 years ago.
distinct_on.14.diff (29.1 KB) - added by ramiro 2 years ago.
Added Anssi to AUTHORS, release notes blurb, removed unrelated tests change, renamed DatabaseOperations.dictinct() to distinct_sql()

Download all attachments as: .zip

Change History (87)

Changed 6 years ago by Manfred Wassmann <manolo@…>

Patch for django/db/models/query.py

comment:1 Changed 6 years ago by Manfred Wassmann <manolo@…>

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

It should further be noted that - at least with Postgres - fields specified with distinct() have to appear first with the order_by() statement.

comment:2 Changed 6 years ago by Manfred Wassmann <manolo@…>

  • Has patch set

Changed 6 years ago by Manfred Wassmann <manolo@…>

New patch fixes count method and resolves quoting issues

comment:3 Changed 6 years ago by Manfred Wassmann <manolo@…>

I have uploaded a new patch which fixes the handling of multiple fields in a DISTINCT ON query as well as the count method for DISTINCT ON queries and resolves the qouting issues.

For DISTINCT ON queries the distinct method has now to be invoked as distinct(on_fields=<FIELD_LIST>), e.g. distinct(on_fields=('field1','field2')). For convenience a single field name can be passed as a plain string like distinct(on_fields='fieldname').

comment:4 Changed 6 years ago by mtredinnick

  • Needs documentation set
  • Needs tests set
  • Patch needs improvement set
  • Triage Stage changed from Unreviewed to Accepted

Patches against trunk's query.py aren't particularly useful at the moment, since that code has been more or less entirely rewritten in the queryset-refactor branch. So, if you feel enthusiastic, you could try to update this to work with that branch, otherwise it will need to be rewritten after the branch is merged into trunk.

As to the feature itself, the "on_fields" parameter doesn't really fit with the rest of the API. Call the parameter "fields" for consistency. In fact, given that distinct() takes no parameters currently, it probably makes the most sense just to pass it field names: so it will take *args. That's consistent with values(), valuelist() and select_related().

I notice the patch doesn't contain any tests, which will be necessary, since we need to be able to verify it's correct. Make sure you test any interaction with the .count() method on querysets, since that required some special handling with normal .distinct() calls. Documentation needed as well.

comment:5 Changed 5 years ago by ikelly

  • Cc ikelly added

comment:6 Changed 5 years ago by cornbread

  • Cc cornbread added; ikelly removed

I just ran into this issue. I am trying to return unique Session keys from a table. I tried PageRequest.objects.order_by('session').distinct().count() and I get 217 records and not 1 that I should be getting. Would be nice to just be able to do PageRequest.objects.all().distinct('session').count()

Looking forward to your new patch Manfred Wassmann. Thanks for the great work!

comment:7 Changed 5 years ago by ramusus

  • Cc ramusus@… added

comment:8 Changed 5 years ago by tymofiy

  • Cc tim.babych@… added

comment:9 Changed 4 years ago by hejsan

  • Cc hr.bjarni+django@… added

comment:10 follow-up: Changed 4 years ago by mnbayazit

Any updates on this? Why isn't it being merged with trunk? It's been 2 years!

Changed 4 years ago by mrts

Update patch to 1.1.X

comment:11 in reply to: ↑ 10 Changed 4 years ago by mrts

Replying to mnbayazit:

Any updates on this? Why isn't it being merged with trunk? It's been 2 years!

This can not be merged to trunk as it does not contain tests and works only on certain backends (e.g. it certainly does not work with SQLite), so it needs a design decision.

However, for people who need this functionality right now, a working patch against 1.1.X is included.

comment:12 Changed 4 years ago by ikelly

The current patch also doesn't work with Oracle, which doesn't support the "DISTINCT ON" syntax and requires an analytic query for this purpose. I tried coding that up some time ago, but I found that the query architecture didn't really lend itself well to using analytic functions. That was before the sql compiler refactor, so I'll have to give it another look. It may be easier now.

comment:13 Changed 3 years ago by jgelens

  • Cc jeffrey@… added

comment:14 Changed 3 years ago by jpic

  • Cc jpic added

comment:15 Changed 3 years ago by julien

  • Type set to New feature

comment:16 Changed 3 years ago by julien

  • Severity set to Normal

Changed 3 years ago by Kronuz

Patch updated for 1.3

comment:17 Changed 3 years ago by jgelens

  • Easy pickings unset
  • UI/UX unset

Working on this during the DjangoCon EU sprints. Improving the patch, writing tests and documentation.

comment:18 Changed 3 years ago by jgelens

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Uploaded patch including docs and tests.

comment:19 Changed 3 years ago by russellm

  • Triage Stage changed from Accepted to Ready for checkin

Technically, this introduces a backwards incompatibility on the true_or_false argument, but the argument is undocumented, untested, and based on a search of Google's code search, unused.

comment:20 Changed 3 years ago by ikelly

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

Glad to see this getting some attention, but I think we should hold off on RFC for now. The patch could still use further improvement in a few areas:

In particular, there is no mapping of the provided field names to actual fields and then to column names, which means that the values passed in must actually be the column names, not the field names. It also means there is no check that the specified fields are actually defined, prior to executing the query.

Also, at some point along the way we seem to have lost the special handling for queryset.distinct('somefield').count(). I don't think this is strictly necessary as long as we take care to raise an exception in that case, but right now I see no reason to exclude it.

The patch hard-wires in self.query.model._meta.db_table for the columns in the distinct list, which prevents fields on related tables from being specified (e.g. queryset.distinct('othertable__somefield')). Again, probably not necessary to commit, but this should be easily fixable once the field mapping is done.

comment:21 Changed 3 years ago by jgelens

True, fixing the things you mentioned

comment:22 Changed 3 years ago by charstring

We're almost there.. fixed the count case.. Now for the join case...

What would be the usecase for the queryset.distinct('othertable__somefield')

Last edited 3 years ago by charstring (previous) (diff)

comment:23 Changed 3 years ago by jgelens

  • Patch needs improvement unset
  • Triage Stage changed from Accepted to Ready for checkin

comment:24 Changed 3 years ago by jgelens

  • Owner changed from nobody to jgelens
  • Status changed from new to assigned

comment:25 Changed 3 years ago by jgelens

TODO: use column instead of given name. Will fix 2nd sprint day.

comment:26 Changed 3 years ago by jgelens

done, ready for checkin

comment:27 Changed 3 years ago by jgelens

  • Keywords dceu2011 added

comment:28 Changed 3 years ago by jgelens

  • Needs tests set
Last edited 3 years ago by jgelens (previous) (diff)

comment:29 Changed 3 years ago by jgelens

  • Needs tests unset

comment:30 Changed 3 years ago by tmitchell

  • Cc taylor.mitchell@… added

Changed 3 years ago by tmitchell

Add (failing) test for M2M intermediate model

comment:31 Changed 3 years ago by tmitchell

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

Tried this against our app and found some issues using it with a M2M model. Updated tests. I also seem to have lost the AUTHORS hunk in the process and Trac doesn't want to replace the attachment, apologies.

I would also add a +1 to adding join syntax if it can be done, I will try to think of a use case that's not as domain-specific as ours if needed.

comment:32 Changed 3 years ago by jgelens

Fixed your new issue, will create a patch later today.

comment:33 Changed 3 years ago by jgelens

Fixed issue with M2M models. Could you check if this is working?

comment:34 Changed 3 years ago by jgelens

  • Patch needs improvement unset
  • Triage Stage changed from Accepted to Ready for checkin

comment:35 Changed 3 years ago by jgelens

The patch was improved and this feature is working OK with M2M relations now. Added a test for this special case.

comment:36 Changed 3 years ago by tmitchell

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

There's an interplay between DISTINCT ON and ORDER BY that isn't clear. I was trying to nail it down before I replied but I got stuck, sorry about that.

To demonstrate, if you use the models from tests.regressiontests.queries and try to do: Tag.objects.distinct('parent') you get an error because the default ordering for Tag is by 'name'. I am not sure how to rectify this. Doing Tag.objects.order_by('parent').distinct('parent') doesn't work either. If this isn't a bug, then the docs should be expanded.

comment:37 Changed 3 years ago by jgelens

  • Patch needs improvement unset

The 'parent' is an foreign key to self. When you print the query you'll see it doesn't add the parent_id to the ORDER BY clause:

>>> print Tag.objects.all().order_by('parent').query
SELECT "queries_tag"."id", "queries_tag"."name", "queries_tag"."parent_id", "queries_tag"."category_id" FROM "queries_tag" LEFT OUTER JOIN "queries_tag" T2 ON ("queries_tag"."parent_id" = T2."id") ORDER BY T2."name" ASC


The DISTINCT ON expression requires the leftmost ORDER BY to match the fields given in the DISTINCT ON clause. To make sure the parent_id will be in ORDER BY, use parent_ _pk or parent_ _id:

>>> print Tag.objects.all().order_by('parent__pk').query
SELECT "queries_tag"."id", "queries_tag"."name", "queries_tag"."parent_id", "queries_tag"."category_id" FROM "queries_tag" LEFT OUTER JOIN "queries_tag" T2 ON ("queries_tag"."parent_id" = T2."id") ORDER BY "queries_tag"."parent_id" ASC

Now if you update your ORM call a lil, you'll get the expected results:

>>> print Tag.objects.distinct('parent').order_by('parent__pk')
[<Tag: t3>, <Tag: t5>, <Tag: t1>]

I added a note to the documentation about the 'order_by' requirement.

It's not a bug, so I hope the documentation is more clear now.

Last edited 3 years ago by jgelens (previous) (diff)

Changed 3 years ago by jgelens

comment:38 Changed 3 years ago by tmitchell

  • Patch needs improvement set

Looks like something went awry with the last patch posted, the tests and some other things disappeared. Here's the combined patch, with a test for the self-referential Fk described above and an edit to the docs for readability.

Changed 3 years ago by tmitchell

comment:39 Changed 3 years ago by tmitchell

  • Patch needs improvement unset

comment:40 Changed 3 years ago by jgelens

  • Triage Stage changed from Accepted to Ready for checkin

I guess it's ready for checkin now.

comment:41 Changed 3 years ago by diegobz

  • Cc diegobz added

comment:42 Changed 3 years ago by jonash

  • Cc jonas-django@… added

comment:43 Changed 3 years ago by jacob

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

Sadly the patch doesn't apply any more; bumping back to accepted.

I'm really sorry we dallied so long that the patch grew stale. I'll try to get it updated myself but if someone wants to save me a bit of time and update the patch that'd make me very happy.

comment:44 Changed 3 years ago by jgelens

I'll update it to match the current trunk in a couple of hours. It will be great if you could apply the patch soon after that, as it seems this part of the code is changing rapidly.

Changed 3 years ago by jgelens

comment:45 Changed 3 years ago by jgelens

  • Patch needs improvement unset

Added a new version of the patch.

comment:46 Changed 3 years ago by jgelens

  • Triage Stage changed from Accepted to Ready for checkin

comment:47 follow-up: Changed 3 years ago by koniiiik

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

I've come up with another case where it fails; added it to the tests in the patch.

Basically, the error lies in the assumption that all fields in the DISTINCT ON set belong to the base model of the QuerySet.

comment:48 in reply to: ↑ 47 Changed 3 years ago by anonymous

Replying to koniiiik:

I've come up with another case where it fails; added it to the tests in the patch.

Basically, the error lies in the assumption that all fields in the DISTINCT ON set belong to the base model of the QuerySet.

That patch doesn't apply:

(django)ligfiets$ patch -p1 <distinct_on.5.diff 
patching file AUTHORS
patching file django/db/backends/__init__.py
patching file django/db/backends/postgresql_psycopg2/base.py
patching file django/db/backends/postgresql_psycopg2/operations.py
patching file django/db/models/query.py
patching file django/db/models/sql/compiler.py
patching file django/db/models/sql/query.py
patching file docs/ref/models/querysets.txt
patching file tests/regressiontests/queries/models.py
patching file tests/regressiontests/queries/tests.py
patch: **** malformed patch at line 497:      def setUp(self):

comment:49 Changed 3 years ago by jgelens

Figured out how to fix that kind of cases. Expect a new patch in a couple of hours.

Changed 3 years ago by koniiiik

Fixed patch with failing testcase (sorry, shouldn't have modified the patch by hand hastily...)

Changed 3 years ago by charstring

Added a failing test case for relationship traversals. (this one does apply)

Changed 3 years ago by jgelens

Fixed field traversal

comment:50 Changed 3 years ago by jgelens

  • Patch needs improvement unset

comment:51 Changed 3 years ago by koniiiik

One more nit: in the backend prototype you forgot to update BaseDatabaseOperations.distinct to the new prototype without the db_table parameter. Other than that it looks quite promising.

Changed 3 years ago by jgelens

Updated distinct method in base class to match the new params

comment:52 Changed 3 years ago by koniiiik

  • Triage Stage changed from Accepted to Ready for checkin

comment:53 Changed 3 years ago by anonymous

bombom:/storage/dists/Django-1.3.1# patch -p1 </storage/dists/distinct_on.7.diff
patching file AUTHORS
Hunk #1 succeeded at 524 (offset -20 lines).
patching file django/db/backends/__init__.py
Hunk #1 succeeded at 335 (offset -33 lines).
Hunk #2 succeeded at 481 (offset -43 lines).
patching file django/db/backends/postgresql_psycopg2/base.py
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file django/db/backends/postgresql_psycopg2/base.py.rej
can't find file to patch at input line 97
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/django/db/backends/postgresql_psycopg2/operations.py b/django/db/backends/postgresql_psycopg2/operations.py
|--- a/django/db/backends/postgresql_psycopg2/operations.py
|+++ b/django/db/backends/postgresql_psycopg2/operations.py
--------------------------
File to patch:
Skip this patch? [y] y
Skipping patch.
1 out of 1 hunk ignored
patching file django/db/models/query.py
Hunk #1 succeeded at 651 (offset -42 lines).
Hunk #2 succeeded at 1078 (offset -42 lines).
patching file django/db/models/sql/compiler.py
Hunk #1 succeeded at 67 (offset -4 lines).
patching file django/db/models/sql/query.py
Hunk #1 FAILED at 121.
Hunk #2 FAILED at 260.
Hunk #3 FAILED at 389.
Hunk #4 FAILED at 1592.
4 out of 4 hunks FAILED -- saving rejects to file django/db/models/sql/query.py.rej
patching file docs/ref/models/querysets.txt
Hunk #1 FAILED at 340.
Hunk #2 FAILED at 369.
2 out of 2 hunks FAILED -- saving rejects to file docs/ref/models/querysets.txt.rej
patching file tests/regressiontests/queries/models.py
Hunk #2 FAILED at 340.
1 out of 2 hunks FAILED -- saving rejects to file tests/regressiontests/queries/models.py.rej
patching file tests/regressiontests/queries/tests.py
Hunk #1 FAILED at 10.
Hunk #2 FAILED at 1732.
Hunk #3 FAILED at 1818.
3 out of 3 hunks FAILED -- saving rejects to file tests/regressiontests/queries/tests.py.rej

Why this happning? :)

comment:54 Changed 3 years ago by koniiiik

Are you sure you're applying it on a recent SVN trunk? There it applies almost fine, except for a small whitespace error in the patch (probably caused by git) that makes one hunk in tests/regressiontests/queries/models.py fail.

Just out of pure curiosity I just tried it on the 1.3.X branch and there it looks suspiciously similar to your output.

Changed 3 years ago by ramiro

Patch with docs and tests updates

comment:55 Changed 3 years ago by ramiro

  • Patch needs improvement set
  • Triage Stage changed from Ready for checkin to Accepted

I've update the patch with:

  • The new tests failed for me because on my environment the Tag.objects.order_by('parent__pk').distinct('parent') call choose ['<Tag: t2>', '<Tag: t4>', '<Tag: t1>'] instead of ['<Tag: t3>', '<Tag: t45', '<Tag: t1>']. Had to use order_by('parent__pk', 'pk') instead.
  • Fleshed out a bit documentation changes:
    • I'd like the note about the need to use an order_by() call to be more clear. -- I'm clearing RFC because of this.
    • Added some examples of usage based on my interpretation of the new functionality.
  • Moved the addition to the AUTHORS file to the right spot (lexicographical ordering by last name)

comment:56 Changed 2 years ago by jgelens

  • Patch needs improvement unset

I updated the documentation explaining the use of order_by. Hope it's more clear like this.

comment:57 Changed 2 years ago by akaariai

Some comments (I haven't actually tried the patch, just read it in diffview):

  • IMHO add_distinct_fields needs a docstring. Other similar methods seem to have one, also.
  • What happens if you specify .distinct(field_list) multiple times with different fields? If the first .distinct call generated joins, won't those joins be left loitering around without any users, but still having refcount > 0?
  • Haven't tested this, but I suspect you can break this by adding fields to DISTINCT ON which aren't in the normal select list and then doing a group by. If you want correct results in this case, you will probably need to subquerify the DISTINCT ON query, then do a group by. NotImplementedError seems to be the sane choice, making this work correctly might be a bit hard and I haven't ever used distinct on + group by combination when writing SQL myself.

The true_or_false argument was removed. Is there any way to get rid of the distinct after this. Does there need to be one?

comment:58 Changed 2 years ago by jgelens

@akaariai

  • Added a docstring to add_distinct_fields.
  • If you specify distinct(field_list) multiple times it will overwrite the previous one. This is the intended behavior.
  • The fields in DISTINCT ON are always in the select list. I guess your explanation can't happen. If it does could you explain further and give some examples?
  • I discussed the true_or_false argument removal with Russell and Alex during the DjangoCon in Amsterdam this year. As it's a undocumented feature and we couldn't find any usage of it on the web, we decided to remove the argument. It's a weird argument anyway ;) .

Changed 2 years ago by jgelens

Added docstring

comment:59 Changed 2 years ago by akaariai

Assume you have models:

class A1(models.Model):
    foo = models.TextField()

class B1(models.Model):
    a = models.ForeignKey(A1)

And you do:

qs1 = B1.objects.distinct('a__id', 'a__foo')
print qs1.query
qs2 = qs1.distinct('pk')
print qs2.query

Now you will have a join to A even though you don't have a reference to it. The queries generated:

SELECT DISTINCT ON ("obj_creation_speed_a1"."id", "obj_creation_speed_a1"."foo") "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id" FROM "obj_creation_speed_b1" INNER JOIN "obj_creation_speed_a1" ON ("obj_creation_speed_b1"."a_id" = "obj_creation_speed_a1"."id")

SELECT DISTINCT ON ("obj_creation_speed_b1"."id") "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id" FROM "obj_creation_speed_b1" INNER JOIN "obj_creation_speed_a1" ON ("obj_creation_speed_b1"."a_id" = "obj_creation_speed_a1"."id")

Note that in the second query, there is a join into A even if no field of A is used in the query. The join should be removed. But I don't think this is a blocker issue. If you want to fix this, there are at least these two choices:

  • Apply the distinct on in the compiler when the query is generated, order by is handle this way IIRC.
  • Keep record of which aliases are referenced by the DISTINCT ON added fields, and subtract the refcount by one for those aliases if the distinct on is changed.

As said before, in my opinion this is not a must-fix in this ticket.

Now, if you continue from the above created qs1 and issue .annotate(Max('a__id')), you will get this query:

qs3 = qs1.annotate(Max('a__id'))
print qs3.query
list(qs3)

The query generated is this:

SELECT DISTINCT ON ("obj_creation_speed_a1"."id", "obj_creation_speed_a1"."foo") "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id", MAX("obj_creation_speed_b1"."a_id") AS "a__id__max" FROM "obj_creation_speed_b1" INNER JOIN "obj_creation_speed_a1" ON ("obj_creation_speed_b1"."a_id" = "obj_creation_speed_a1"."id") GROUP BY "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id"

And the list(qs3) will generate this:

django.db.utils.DatabaseError: column "obj_creation_speed_a1.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT DISTINCT ON ("obj_creation_speed_a1"."id", "obj_creat...
                            ^

You could "fix" this by appending the DISTINCT ON columns into the group by clause. Unfortunately (without going into details) the results aren't what the user would expect. If you want the results the user expects, you would need to do a subquery. If the distinct on is in the inner query or group by is in the inner query depends on the order in which the .annotate and .distinct are applied. A good fix for this would be to raise NotImplementedError("Can't handle queries with aggregates and distinct on clause"). I have written some SQL by hand, and can't remember ever using distinct on and group by in the same query. Maybe there is some use case, but I just don't see the need to block on this issue, either.

Last edited 2 years ago by akaariai (previous) (diff)

comment:60 Changed 2 years ago by jgelens

@ akaariai Thanks for your thorough comments :)
I fixed your first point about the join removal when calling distinct() multiple times.
It now applies the distinct in the compiler. This didn't fix the problem however, so I had to unref the fields manually, which is working great.

As you've said, the combination of a annotation and an distinct on is probably a use case which is never used. This is certainly not a blocker for this feature.

Anyways, this patch took a lot of time. I hope all the bugs were fixed :-D

Changed 2 years ago by jgelens

fixed when calling distinct() multiple times on the same query set

comment:61 Changed 2 years ago by akaariai

  • Cc anssi.kaariainen@… added

Sorry, I still found one more problem, and some smaller issues:

  • The distinct() in postgresql_psycopg2/operations.py doesn't take aliases into account. It assumes that the distinct is always for the main table alias. If you first do a filter on m2m relation, then a distinct on for the m2m relation you get:
    # I added coworkers = models.ManyToManyField('self') to Staff model used in tests
    >>> print Staff.objects.distinct('coworkers__name').order_by('coworkers__name').query
    SELECT DISTINCT ON ("queries_staff"."name")
         "queries_staff"."id", "queries_staff"."name", "queries_staff"."organisation", T3."name"
    FROM "queries_staff"
    LEFT OUTER JOIN "queries_staff_coworkers"
         ON ("queries_staff"."id" = "queries_staff_coworkers"."from_staff_id")
    LEFT OUTER JOIN "queries_staff" T3
         ON ("queries_staff_coworkers"."to_staff_id" = T3."id")
    ORDER BY T3."name" ASC
    

If you execute that query, it will error, because distinct on is for "queries_staff"."name", but order by is for T3."name"

  • In the attached patch there is a different approach of join cleanup. It also fixes one other long-standing expected_failure test in queries. The failure is related to similar "lingering joins" done by ordering.
  • The obj.query.distinct_fields = [] in query.distinct() is not needed. The variable will be set anyways in sql.query.add_distinct_fields(). obj.query.distinct = True can also be set in add_distinct_fields()
  • Should the joins be generated as OUTER or INNER joins? Not sure about this. I guess either way will be fine. The attached patch generates OUTER JOINS (because it uses similar code than the order by join creation).
  • Prevent different distinct_fields lists when combining querysets.

I attached a patch which fixes these issues. It isn't the cleanest one (non-DRY, code copy from order_by). But it can be used as a basis for a cleaner patch if needed.

I did also check out how to support aggregates with distinct_fields. However I had a hard time figuring out a realistic test case, so I removed that part. So, jgelens is absolutely correct here: better to first find an use case :)

All in all, the only major issue is the correct alias usage in compiler.py. The rest is just nitpicking. In my opinion the alias usage needs to be fixed before commit, but I may be over-complicating things again... Other than that I think this feature is now very solid. I might be fine to just let the distinct() generated joins linger in there after all. That is what order_by has done for ages.

The attached patch passes all tests, with one long-standing expected failure now gone. As said the patch is just to show one possible approach to the alias issue. I hope jgelens will check it out and see what to do to the above mentioned issues (if anything).

Changed 2 years ago by akaariai

Possible approach for using correct aliases

comment:62 Changed 2 years ago by jgelens

I think your additions to the patch are great. Everything works OK. The code copy from order_by is minor and imo unnecessary to split out and generalize for both functions.
Let's mark it as RFC as 1.4alpha is nearing completion. Thanks!

comment:63 Changed 2 years ago by akaariai

I hope to do a final cleanup tomorrow. I still think it is a good idea to throw NotImplementedError for aggregate & annotate (this is actually quite simple: in each method check if self.distinct_fields: raise NotImplementedError). The reason is that annotate + DISTINCT ON actually somewhat works, sometimes. But it doesn't work in any meaningful way. So to avoid backwards compatibility issues if annotate + distinct_on combination is going to ever be supported, it is good that there is no behaviour Django must support. The exceptions are simple enough to implement, so IMHO this is a good idea still. Sorry for coming back to this again and again. :)

In addition I will check if there is some easy way to avoid the code duplication between get_ordering and get_distinct join generation (in compiler.py). It is a good idea, as actually both of them _must_ generate same joins, otherwise you will get errors on ordering not matching prefix of DISTINCT ON.

comment:64 Changed 2 years ago by Kronuz

  • Cc Kronuz added

comment:65 Changed 2 years ago by akaariai

Final cleanup done & attached. All tests should pass on sqlite, and the changed tests do pass on PostgreSQL.

Changes from the previous version:

  • DRY in complier.py join setup for ordering and distinct
  • NotImplementedError for .distinct(fields) + annotate/aggregate.
  • Moved the new tests to modeltests/distinct_on_fields

There are a couple of test-fixes combined in the patch. Create 2500 objects -> bulk create 2500 objects, one expected failure is no more, and one test was testing how executing a query alters the query instance (it should not alter it). It now tests the created SQL instead.

As far as I understand this is now ready for checkin. The only minor thing missing is a check for order_by prefix matching distinct_fields prefix. The DB will complain if that is broken, so that really isn't a big deal.

Please review & mark ready for checkin if no problems found. Please check extra-carefully added comments. My English isn't that good...

Changed 2 years ago by akaariai

comment:66 Changed 2 years ago by jgelens

Thanks! Looks great, I ran the tests, you forgot to add a init.py to the modeltests/distinct_on_fields module. Also some Tag object creation was accidentally removed. I added this again.
I added a new patch with these changes.

Now it looks we've finally come to a version which is ready for checkin! :-D

Last edited 2 years ago by jgelens (previous) (diff)

Changed 2 years ago by jgelens

comment:67 Changed 2 years ago by jgelens

  • Triage Stage changed from Accepted to Ready for checkin

Changed 2 years ago by ramiro

Added Anssi to AUTHORS, release notes blurb, removed unrelated tests change, renamed DatabaseOperations.dictinct() to distinct_sql()

comment:68 Changed 2 years ago by ramiro

  • Resolution set to fixed
  • Status changed from assigned to closed

In [17244]:

Added support for modifying the effect of DISTINCT clauses so they
only consider some fields (PostgreSQL only).

For this, the distinct() QuerySet method now accepts an optional
list of model fields names and generates DISTINCT ON clauses on
these cases. Thanks Jeffrey Gelens and Anssi Kääriäinen for their work.

Fixes #6422.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.