Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#20842 closed Uncategorized (fixed)

A warning for .order_by() on foreign keys in docs is warranted

Reported by: Daniele Procida Owned by: Daniele Procida
Component: Documentation Version: 1.5
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

There's a note on distinct() about order_by() in the docs: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct.

I think there should also be a similar note for order_by().

I don't fully understand what's going on below, but knowing that things like .order_by('-children__date') can confuse the ORM would be very useful.

How should such a warning be expressed?

I will try to create a failing test for this, in case it's deemed a bug worth fixing. I'm not sure it's even a bug though so much as a limitation of the ORM.

# the model

 class Event(Model):
    parent = models.ForeignKey('self', related_name='children')
    date = models.DateField()

# in all the examples below, self.series_events is a queryset of Events

# ======================================= case one - works as expected =======================================
self.items = self.series_events
for item in self.items:
    print item
print self.items.count()

# correctly lists the 24 items
# correctly returns a count of 24
# SQL for self.series_events
SELECT DISTINCT `news_and_events_event`.`id`, `news_and_events_event`.`url`,
`news_and_events_event`.`external_url_id`, `news_and_events_event`.`slug`,
`news_and_events_event`.`precise_location`,
`news_and_events_event`.`access_note`, `news_and_events_event`.`title`,
`news_and_events_event`.`short_title`, `news_and_events_event`.`summary`,
`news_and_events_event`.`published`, `news_and_events_event`.`in_lists`,
`news_and_events_event`.`body_id`, `news_and_events_event`.`image_id`,
`news_and_events_event`.`hosted_by_id`, `news_and_events_event`.`importance`,
`news_and_events_event`.`content`, `news_and_events_event`.`type_id`,
`news_and_events_event`.`parent_id`, `news_and_events_event`.`series`,
`news_and_events_event`.`show_titles`,
`news_and_events_event`.`display_series_summary`,
`news_and_events_event`.`child_list_heading`, `news_and_events_event`.`date`,
`news_and_events_event`.`start_time`, `news_and_events_event`.`end_date`,
`news_and_events_event`.`end_time`, `news_and_events_event`.`single_day_event`,
`news_and_events_event`.`building_id`,
`news_and_events_event`.`jumps_queue_on`, `news_and_events_event`.`jumps_queue_e
verywhere`, `news_and_events_event`.`lft`, `news_and_events_event`.`rght`,
`news_and_events_event`.`tree_id`, `news_and_events_event`.`level` FROM
`news_and_events_event` LEFT OUTER JOIN `news_and_events_event_publish_to` ON
(`news_and_events_event`.`id` = `news_and_events_event_publish_to`.`event_id`)
WHERE (`news_and_events_event`.`in_lists` = True AND
`news_and_events_event`.`published` = True AND
(`news_and_events_event`.`hosted_by_id` = 2 OR
`news_and_events_event_publish_to`.`entity_id` = 2 ) AND
`news_and_events_event`.`series` = True ) ORDER BY
`news_and_events_event`.`date` ASC, `news_and_events_event`.`start_time` ASC



# ======================================= case two - does not work as expected =======================================
# adding .order_by('-children__date') seems to affect the number of items returned!
self.items = self.series_events.order_by('-children__date')
for item in self.items:
    print item
print self.items.count()

# incorrectly lists 319 items
# incorrectly returns a count of 319
# SQL for self.series_events.order_by('-children__date')
SELECT DISTINCT `news_and_events_event`.`id`, `news_and_events_event`.`url`,
`news_and_events_event`.`external_url_id`, `news_and_events_event`.`slug`,
`news_and_events_event`.`precise_location`,
`news_and_events_event`.`access_note`, `news_and_events_event`.`title`,
`news_and_events_event`.`short_title`, `news_and_events_event`.`summary`,
`news_and_events_event`.`published`, `news_and_events_event`.`in_lists`,
`news_and_events_event`.`body_id`, `news_and_events_event`.`image_id`,
`news_and_events_event`.`hosted_by_id`, `news_and_events_event`.`importance`,
`news_and_events_event`.`content`, `news_and_events_event`.`type_id`,
`news_and_events_event`.`parent_id`, `news_and_events_event`.`series`,
`news_and_events_event`.`show_titles`,
`news_and_events_event`.`display_series_summary`,
`news_and_events_event`.`child_list_heading`, `news_and_events_event`.`date`,
`news_and_events_event`.`start_time`, `news_and_events_event`.`end_date`,
`news_and_events_event`.`end_time`, `news_and_events_event`.`single_day_event`,
`news_and_events_event`.`building_id`,
`news_and_events_event`.`jumps_queue_on`, `news_and_events_event`.`jumps_queue_e
verywhere`, `news_and_events_event`.`lft`, `news_and_events_event`.`rght`,
`news_and_events_event`.`tree_id`, `news_and_events_event`.`level`, T5.`date`
FROM `news_and_events_event` LEFT OUTER JOIN `news_and_events_event_publish_to`
ON (`news_and_events_event`.`id` =
`news_and_events_event_publish_to`.`event_id`) LEFT OUTER JOIN
`news_and_events_event` T5 ON (`news_and_events_event`.`id` = T5.`parent_id`)
WHERE (`news_and_events_event`.`in_lists` = True AND
`news_and_events_event`.`published` = True AND
(`news_and_events_event`.`hosted_by_id` = 2 OR
`news_and_events_event_publish_to`.`entity_id` = 2 ) AND
`news_and_events_event`.`series` = True ) ORDER BY T5.`date` DESC



# ======================================= case three - does not work as expected =======================================
# not looping over self.items means it does .count() correctly, but the items returned remain incorrect
self.items = self.series_events.order_by('-children__date')
# for item in self.items:
#     print item
print items
print self.items.count()

# returns 24 incorrect items (only five different items are returned, each one multiple times, and others are missing)
# correctly returns a count of 24


Change History (10)

comment:2 by Daniele Procida, 11 years ago

Owner: changed from nobody to Daniele Procida
Status: newassigned

comment:3 by Anssi Kääriäinen, 11 years ago

Triage Stage: UnreviewedAccepted

Unfortunately you are asking the ORM to do something impossible. If you have this data:

idparentdate
1nullnull
2nullnull
311.1.2000
413.1.2000
522.1.2000
624.1.2000

and if you order by child's date, then should id=1 or id=2 be first? 1 has dates 1.1.2000 and 3.1.2000, 2 has 2.1.2000 and 4.1.2000.

What is happening at ORM level is that Django allows you to order by child's date even if there can be multiple child dates. This is done because in some common cases (for example some filter() on child) there will be only one child per parent in the query. But allowing this also allows the case where there are multiple items, and in that case id=1 will be found two times from the qs (places 1 and 3), and id=2 will be found also two times (places 2 and 4).

In the above scenario it is actually impossible to order by child's date. The ordering just doesn't exist.

It might be a good idea to add an "allow_m2m" flag to order_by(), so that you have to explicitly ask for this behaviour. If you ask for it, then same item might be multiple times in the result set. If you don't allow it and try to order by child's date, then you will get an error warning you about this situation.

I believe this is documented. In the order_by() docs the section: "It is permissible to specify a multi-valued..." and in the .distinct() the note is about this issue. Adding an example there might make things clearer.

I vote for adding the "allow_m2m" flag (with a better name if anybody can think of one) because the current behaviour is surprising. There are a couple other instances of similar problems (multiple m2m annotations, filtering m2m relation with multiple items from the relation matching), but there should be dealt separately.

I am accepting this, adding at least an example of this to docs seems like a good idea.

For the count() case, Django ignores the order_by() and thus the count() is correct (yes, this is again surprising). I am not sure what is happening in the print case.

comment:4 by Daniele Procida, 11 years ago

Thanks for that very clear explanation, I will add something to the documentation to reflect it.

Is there scope in .order_by(children__date) for something that would identify the 'best' (or 'worst') date for the children, and use that to order the parents? Or is this beyond what can be achieved in a single SQL query via the ORM?

comment:5 by Anssi Kääriäinen, 11 years ago

I can't figure out anything the ORM can do automatically. The user can do qs.filter(chidren__isbest=True).order_by('children__date') (or any other filter condition making the children unique), or qs.annotate(child_date_avg=Avg('children__date')).order_by('child_date_avg') but it seems impossible to have universally correct automatic decision on the best date.

What do you think about the allow_m2m flag, would it be a good idea to have an explicit flag to order_by() so that users have to opt-in for possible duplicate rows, or would that just annoy users who already know what they are doing?

comment:6 by Daniele Procida, 11 years ago

One problem with an allow_m2m flag is that if it is True by default, then those who weren't expecting this behaviour wouldn't benefit from the warning; if it's False by default, it's going to be backwards incompatible for some code that is already using it.

However, I think that making it False by default would be the more useful, and more useful than the current behaviour where it's unremarked.

comment:7 by Daniele Procida, 11 years ago

comment:8 by Tim Graham <timograham@…>, 11 years ago

Resolution: fixed
Status: assignedclosed

In e8183a8193ebc80fdc6fef789813c2c0c3615e5c:

Fixed #20842 and #20845 - Added a note on order_by() and improved prefetch_related() docs.

comment:9 by Tim Graham <timograham@…>, 11 years ago

In 74205c4a3c587a9eedceef1b98cd4b0d529a7510:

[1.6.x] Fixed #20842 and #20845 - Added a note on order_by() and improved prefetch_related() docs.

Backport of e8183a8193 from master

comment:10 by Tim Graham <timograham@…>, 11 years ago

In bc33fef90a851da91758fa6c1126456e61aadf35:

[1.5.x] Fixed #20842 and #20845 - Added a note on order_by() and improved prefetch_related() docs.

Backport of e8183a8193 from master

Note: See TracTickets for help on using tickets.
Back to Top