#20842 closed Uncategorized (fixed)
A warning for .order_by() on foreign keys in docs is warranted
Reported by: | Daniele Procida | Owned by: | Daniele Procida |
---|---|---|---|
Component: | Documentation | Version: | 1.5 |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Accepted | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
There's a note on distinct()
about order_by()
in the docs: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct.
I think there should also be a similar note for order_by()
.
I don't fully understand what's going on below, but knowing that things like .order_by('-children__date')
can confuse the ORM would be very useful.
How should such a warning be expressed?
I will try to create a failing test for this, in case it's deemed a bug worth fixing. I'm not sure it's even a bug though so much as a limitation of the ORM.
# the model class Event(Model): parent = models.ForeignKey('self', related_name='children') date = models.DateField() # in all the examples below, self.series_events is a queryset of Events # ======================================= case one - works as expected ======================================= self.items = self.series_events for item in self.items: print item print self.items.count() # correctly lists the 24 items # correctly returns a count of 24 # SQL for self.series_events SELECT DISTINCT `news_and_events_event`.`id`, `news_and_events_event`.`url`, `news_and_events_event`.`external_url_id`, `news_and_events_event`.`slug`, `news_and_events_event`.`precise_location`, `news_and_events_event`.`access_note`, `news_and_events_event`.`title`, `news_and_events_event`.`short_title`, `news_and_events_event`.`summary`, `news_and_events_event`.`published`, `news_and_events_event`.`in_lists`, `news_and_events_event`.`body_id`, `news_and_events_event`.`image_id`, `news_and_events_event`.`hosted_by_id`, `news_and_events_event`.`importance`, `news_and_events_event`.`content`, `news_and_events_event`.`type_id`, `news_and_events_event`.`parent_id`, `news_and_events_event`.`series`, `news_and_events_event`.`show_titles`, `news_and_events_event`.`display_series_summary`, `news_and_events_event`.`child_list_heading`, `news_and_events_event`.`date`, `news_and_events_event`.`start_time`, `news_and_events_event`.`end_date`, `news_and_events_event`.`end_time`, `news_and_events_event`.`single_day_event`, `news_and_events_event`.`building_id`, `news_and_events_event`.`jumps_queue_on`, `news_and_events_event`.`jumps_queue_e verywhere`, `news_and_events_event`.`lft`, `news_and_events_event`.`rght`, `news_and_events_event`.`tree_id`, `news_and_events_event`.`level` FROM `news_and_events_event` LEFT OUTER JOIN `news_and_events_event_publish_to` ON (`news_and_events_event`.`id` = `news_and_events_event_publish_to`.`event_id`) WHERE (`news_and_events_event`.`in_lists` = True AND `news_and_events_event`.`published` = True AND (`news_and_events_event`.`hosted_by_id` = 2 OR `news_and_events_event_publish_to`.`entity_id` = 2 ) AND `news_and_events_event`.`series` = True ) ORDER BY `news_and_events_event`.`date` ASC, `news_and_events_event`.`start_time` ASC # ======================================= case two - does not work as expected ======================================= # adding .order_by('-children__date') seems to affect the number of items returned! self.items = self.series_events.order_by('-children__date') for item in self.items: print item print self.items.count() # incorrectly lists 319 items # incorrectly returns a count of 319 # SQL for self.series_events.order_by('-children__date') SELECT DISTINCT `news_and_events_event`.`id`, `news_and_events_event`.`url`, `news_and_events_event`.`external_url_id`, `news_and_events_event`.`slug`, `news_and_events_event`.`precise_location`, `news_and_events_event`.`access_note`, `news_and_events_event`.`title`, `news_and_events_event`.`short_title`, `news_and_events_event`.`summary`, `news_and_events_event`.`published`, `news_and_events_event`.`in_lists`, `news_and_events_event`.`body_id`, `news_and_events_event`.`image_id`, `news_and_events_event`.`hosted_by_id`, `news_and_events_event`.`importance`, `news_and_events_event`.`content`, `news_and_events_event`.`type_id`, `news_and_events_event`.`parent_id`, `news_and_events_event`.`series`, `news_and_events_event`.`show_titles`, `news_and_events_event`.`display_series_summary`, `news_and_events_event`.`child_list_heading`, `news_and_events_event`.`date`, `news_and_events_event`.`start_time`, `news_and_events_event`.`end_date`, `news_and_events_event`.`end_time`, `news_and_events_event`.`single_day_event`, `news_and_events_event`.`building_id`, `news_and_events_event`.`jumps_queue_on`, `news_and_events_event`.`jumps_queue_e verywhere`, `news_and_events_event`.`lft`, `news_and_events_event`.`rght`, `news_and_events_event`.`tree_id`, `news_and_events_event`.`level`, T5.`date` FROM `news_and_events_event` LEFT OUTER JOIN `news_and_events_event_publish_to` ON (`news_and_events_event`.`id` = `news_and_events_event_publish_to`.`event_id`) LEFT OUTER JOIN `news_and_events_event` T5 ON (`news_and_events_event`.`id` = T5.`parent_id`) WHERE (`news_and_events_event`.`in_lists` = True AND `news_and_events_event`.`published` = True AND (`news_and_events_event`.`hosted_by_id` = 2 OR `news_and_events_event_publish_to`.`entity_id` = 2 ) AND `news_and_events_event`.`series` = True ) ORDER BY T5.`date` DESC # ======================================= case three - does not work as expected ======================================= # not looping over self.items means it does .count() correctly, but the items returned remain incorrect self.items = self.series_events.order_by('-children__date') # for item in self.items: # print item print items print self.items.count() # returns 24 incorrect items (only five different items are returned, each one multiple times, and others are missing) # correctly returns a count of 24
Change History (10)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:3 by , 11 years ago
Triage Stage: | Unreviewed → Accepted |
---|
Unfortunately you are asking the ORM to do something impossible. If you have this data:
id | parent | date |
1 | null | null |
2 | null | null |
3 | 1 | 1.1.2000 |
4 | 1 | 3.1.2000 |
5 | 2 | 2.1.2000 |
6 | 2 | 4.1.2000 |
and if you order by child's date, then should id=1 or id=2 be first? 1 has dates 1.1.2000 and 3.1.2000, 2 has 2.1.2000 and 4.1.2000.
What is happening at ORM level is that Django allows you to order by child's date even if there can be multiple child dates. This is done because in some common cases (for example some filter() on child) there will be only one child per parent in the query. But allowing this also allows the case where there are multiple items, and in that case id=1 will be found two times from the qs (places 1 and 3), and id=2 will be found also two times (places 2 and 4).
In the above scenario it is actually impossible to order by child's date. The ordering just doesn't exist.
It might be a good idea to add an "allow_m2m" flag to order_by(), so that you have to explicitly ask for this behaviour. If you ask for it, then same item might be multiple times in the result set. If you don't allow it and try to order by child's date, then you will get an error warning you about this situation.
I believe this is documented. In the order_by() docs the section: "It is permissible to specify a multi-valued..." and in the .distinct() the note is about this issue. Adding an example there might make things clearer.
I vote for adding the "allow_m2m" flag (with a better name if anybody can think of one) because the current behaviour is surprising. There are a couple other instances of similar problems (multiple m2m annotations, filtering m2m relation with multiple items from the relation matching), but there should be dealt separately.
I am accepting this, adding at least an example of this to docs seems like a good idea.
For the count() case, Django ignores the order_by() and thus the count() is correct (yes, this is again surprising). I am not sure what is happening in the print case.
comment:4 by , 11 years ago
Thanks for that very clear explanation, I will add something to the documentation to reflect it.
Is there scope in .order_by(children__date)
for something that would identify the 'best' (or 'worst') date for the children, and use that to order the parents? Or is this beyond what can be achieved in a single SQL query via the ORM?
comment:5 by , 11 years ago
I can't figure out anything the ORM can do automatically. The user can do qs.filter(chidren__isbest=True).order_by('children__date')
(or any other filter condition making the children unique), or qs.annotate(child_date_avg=Avg('children__date')).order_by('child_date_avg')
but it seems impossible to have universally correct automatic decision on the best date.
What do you think about the allow_m2m flag, would it be a good idea to have an explicit flag to order_by() so that users have to opt-in for possible duplicate rows, or would that just annoy users who already know what they are doing?
comment:6 by , 11 years ago
One problem with an allow_m2m flag is that if it is True by default, then those who weren't expecting this behaviour wouldn't benefit from the warning; if it's False by default, it's going to be backwards incompatible for some code that is already using it.
However, I think that making it False by default would be the more useful, and more useful than the current behaviour where it's unremarked.
comment:7 by , 11 years ago
My suggestion for the documentation: https://github.com/evildmp/django/compare/ticket_20845, which also addresses https://code.djangoproject.com/ticket/20845
comment:8 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
See https://code.djangoproject.com/attachment/ticket/20528/20528-test-edf93127bf2f9dc35b45cdea5d39a1b417ab1638.diff for an example of what a test for this might look like.