Opened 21 months ago

Closed 21 months ago

Last modified 21 months ago

#20842 closed Uncategorized (fixed)

A warning for .order_by() on foreign keys in docs is warranted

Reported by: EvilDMP Owned by: EvilDMP
Component: Documentation Version: 1.5
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

There's a note on distinct() about order_by() in the docs: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct.

I think there should also be a similar note for order_by().

I don't fully understand what's going on below, but knowing that things like .order_by('-children__date') can confuse the ORM would be very useful.

How should such a warning be expressed?

I will try to create a failing test for this, in case it's deemed a bug worth fixing. I'm not sure it's even a bug though so much as a limitation of the ORM.

# the model

 class Event(Model):
    parent = models.ForeignKey('self', related_name='children')
    date = models.DateField()

# in all the examples below, self.series_events is a queryset of Events

# ======================================= case one - works as expected =======================================
self.items = self.series_events
for item in self.items:
    print item
print self.items.count()

# correctly lists the 24 items
# correctly returns a count of 24
# SQL for self.series_events
SELECT DISTINCT `news_and_events_event`.`id`, `news_and_events_event`.`url`,
`news_and_events_event`.`external_url_id`, `news_and_events_event`.`slug`,
`news_and_events_event`.`precise_location`,
`news_and_events_event`.`access_note`, `news_and_events_event`.`title`,
`news_and_events_event`.`short_title`, `news_and_events_event`.`summary`,
`news_and_events_event`.`published`, `news_and_events_event`.`in_lists`,
`news_and_events_event`.`body_id`, `news_and_events_event`.`image_id`,
`news_and_events_event`.`hosted_by_id`, `news_and_events_event`.`importance`,
`news_and_events_event`.`content`, `news_and_events_event`.`type_id`,
`news_and_events_event`.`parent_id`, `news_and_events_event`.`series`,
`news_and_events_event`.`show_titles`,
`news_and_events_event`.`display_series_summary`,
`news_and_events_event`.`child_list_heading`, `news_and_events_event`.`date`,
`news_and_events_event`.`start_time`, `news_and_events_event`.`end_date`,
`news_and_events_event`.`end_time`, `news_and_events_event`.`single_day_event`,
`news_and_events_event`.`building_id`,
`news_and_events_event`.`jumps_queue_on`, `news_and_events_event`.`jumps_queue_e
verywhere`, `news_and_events_event`.`lft`, `news_and_events_event`.`rght`,
`news_and_events_event`.`tree_id`, `news_and_events_event`.`level` FROM
`news_and_events_event` LEFT OUTER JOIN `news_and_events_event_publish_to` ON
(`news_and_events_event`.`id` = `news_and_events_event_publish_to`.`event_id`)
WHERE (`news_and_events_event`.`in_lists` = True AND
`news_and_events_event`.`published` = True AND
(`news_and_events_event`.`hosted_by_id` = 2 OR
`news_and_events_event_publish_to`.`entity_id` = 2 ) AND
`news_and_events_event`.`series` = True ) ORDER BY
`news_and_events_event`.`date` ASC, `news_and_events_event`.`start_time` ASC



# ======================================= case two - does not work as expected =======================================
# adding .order_by('-children__date') seems to affect the number of items returned!
self.items = self.series_events.order_by('-children__date')
for item in self.items:
    print item
print self.items.count()

# incorrectly lists 319 items
# incorrectly returns a count of 319
# SQL for self.series_events.order_by('-children__date')
SELECT DISTINCT `news_and_events_event`.`id`, `news_and_events_event`.`url`,
`news_and_events_event`.`external_url_id`, `news_and_events_event`.`slug`,
`news_and_events_event`.`precise_location`,
`news_and_events_event`.`access_note`, `news_and_events_event`.`title`,
`news_and_events_event`.`short_title`, `news_and_events_event`.`summary`,
`news_and_events_event`.`published`, `news_and_events_event`.`in_lists`,
`news_and_events_event`.`body_id`, `news_and_events_event`.`image_id`,
`news_and_events_event`.`hosted_by_id`, `news_and_events_event`.`importance`,
`news_and_events_event`.`content`, `news_and_events_event`.`type_id`,
`news_and_events_event`.`parent_id`, `news_and_events_event`.`series`,
`news_and_events_event`.`show_titles`,
`news_and_events_event`.`display_series_summary`,
`news_and_events_event`.`child_list_heading`, `news_and_events_event`.`date`,
`news_and_events_event`.`start_time`, `news_and_events_event`.`end_date`,
`news_and_events_event`.`end_time`, `news_and_events_event`.`single_day_event`,
`news_and_events_event`.`building_id`,
`news_and_events_event`.`jumps_queue_on`, `news_and_events_event`.`jumps_queue_e
verywhere`, `news_and_events_event`.`lft`, `news_and_events_event`.`rght`,
`news_and_events_event`.`tree_id`, `news_and_events_event`.`level`, T5.`date`
FROM `news_and_events_event` LEFT OUTER JOIN `news_and_events_event_publish_to`
ON (`news_and_events_event`.`id` =
`news_and_events_event_publish_to`.`event_id`) LEFT OUTER JOIN
`news_and_events_event` T5 ON (`news_and_events_event`.`id` = T5.`parent_id`)
WHERE (`news_and_events_event`.`in_lists` = True AND
`news_and_events_event`.`published` = True AND
(`news_and_events_event`.`hosted_by_id` = 2 OR
`news_and_events_event_publish_to`.`entity_id` = 2 ) AND
`news_and_events_event`.`series` = True ) ORDER BY T5.`date` DESC



# ======================================= case three - does not work as expected =======================================
# not looping over self.items means it does .count() correctly, but the items returned remain incorrect
self.items = self.series_events.order_by('-children__date')
# for item in self.items:
#     print item
print items
print self.items.count()

# returns 24 incorrect items (only five different items are returned, each one multiple times, and others are missing)
# correctly returns a count of 24


Change History (10)

comment:1 Changed 21 months ago by EvilDMP

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

comment:2 Changed 21 months ago by EvilDMP

  • Owner changed from nobody to EvilDMP
  • Status changed from new to assigned

comment:3 Changed 21 months ago by akaariai

  • Triage Stage changed from Unreviewed to Accepted

Unfortunately you are asking the ORM to do something impossible. If you have this data:

idparentdate
1nullnull
2nullnull
311.1.2000
413.1.2000
522.1.2000
624.1.2000

and if you order by child's date, then should id=1 or id=2 be first? 1 has dates 1.1.2000 and 3.1.2000, 2 has 2.1.2000 and 4.1.2000.

What is happening at ORM level is that Django allows you to order by child's date even if there can be multiple child dates. This is done because in some common cases (for example some filter() on child) there will be only one child per parent in the query. But allowing this also allows the case where there are multiple items, and in that case id=1 will be found two times from the qs (places 1 and 3), and id=2 will be found also two times (places 2 and 4).

In the above scenario it is actually impossible to order by child's date. The ordering just doesn't exist.

It might be a good idea to add an "allow_m2m" flag to order_by(), so that you have to explicitly ask for this behaviour. If you ask for it, then same item might be multiple times in the result set. If you don't allow it and try to order by child's date, then you will get an error warning you about this situation.

I believe this is documented. In the order_by() docs the section: "It is permissible to specify a multi-valued..." and in the .distinct() the note is about this issue. Adding an example there might make things clearer.

I vote for adding the "allow_m2m" flag (with a better name if anybody can think of one) because the current behaviour is surprising. There are a couple other instances of similar problems (multiple m2m annotations, filtering m2m relation with multiple items from the relation matching), but there should be dealt separately.

I am accepting this, adding at least an example of this to docs seems like a good idea.

For the count() case, Django ignores the order_by() and thus the count() is correct (yes, this is again surprising). I am not sure what is happening in the print case.

comment:4 Changed 21 months ago by EvilDMP

Thanks for that very clear explanation, I will add something to the documentation to reflect it.

Is there scope in .order_by(children__date) for something that would identify the 'best' (or 'worst') date for the children, and use that to order the parents? Or is this beyond what can be achieved in a single SQL query via the ORM?

comment:5 Changed 21 months ago by akaariai

I can't figure out anything the ORM can do automatically. The user can do qs.filter(chidren__isbest=True).order_by('children__date') (or any other filter condition making the children unique), or qs.annotate(child_date_avg=Avg('children__date')).order_by('child_date_avg') but it seems impossible to have universally correct automatic decision on the best date.

What do you think about the allow_m2m flag, would it be a good idea to have an explicit flag to order_by() so that users have to opt-in for possible duplicate rows, or would that just annoy users who already know what they are doing?

comment:6 Changed 21 months ago by EvilDMP

One problem with an allow_m2m flag is that if it is True by default, then those who weren't expecting this behaviour wouldn't benefit from the warning; if it's False by default, it's going to be backwards incompatible for some code that is already using it.

However, I think that making it False by default would be the more useful, and more useful than the current behaviour where it's unremarked.

comment:7 Changed 21 months ago by EvilDMP

comment:8 Changed 21 months ago by Tim Graham <timograham@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In e8183a8193ebc80fdc6fef789813c2c0c3615e5c:

Fixed #20842 and #20845 - Added a note on order_by() and improved prefetch_related() docs.

comment:9 Changed 21 months ago by Tim Graham <timograham@…>

In 74205c4a3c587a9eedceef1b98cd4b0d529a7510:

[1.6.x] Fixed #20842 and #20845 - Added a note on order_by() and improved prefetch_related() docs.

Backport of e8183a8193 from master

comment:10 Changed 21 months ago by Tim Graham <timograham@…>

In bc33fef90a851da91758fa6c1126456e61aadf35:

[1.5.x] Fixed #20842 and #20845 - Added a note on order_by() and improved prefetch_related() docs.

Backport of e8183a8193 from master

Note: See TracTickets for help on using tickets.
Back to Top