Opened 4 years ago

Closed 3 years ago

#17198 closed Bug (fixed)

In admin results can be omitted due to pagination and inadequate ordering clauses

Reported by: lukeplant Owned by: julien
Component: contrib.admin Version:
Severity: Release blocker Keywords:
Cc: anssi.kaariainen@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

In the admin, pagination is achieved using LIMIT/OFFSET. However, if the ordering specified in the SQL does not totally define an order of rows, LIMIT/OFFSET is not deterministic with respect to exactly which rows are returned.

For example, using django.contrib.auth, and suppose you have hundreds of users with is_staff == False. Then, User.objects.order_by('is_staff')[0:20] must return those with is_staff == False, (since these sort first), but it can return any of them. If you then ask it for User.objects.order_by('is_staff')[20:40], it is perfectly free to give you same 20 as the first time - since by the same token it can return any of them.

This means that if you are in the admin, looking at the User list and you have ordered by is_staff, paging through the results is not guaranteed to show you all rows - it could duplicate some and omit others.

I have observed this behaviour in the wild with Postgres. The client in question was extremely confused as to why, when sorting by a certain boolean field, some results disappeared, and I was stumped for a while. I also confirmed from a Django shell that a series of consecutive 'slices' of a QuerySet can return some rows twice and some not at all.

We can argue that the database is being completely correct in what it is returning, because the question asked is a silly question - you can't deterministically take items X to Y of a set that doesn't have a strict total order. But I don't think we can argue that this is not a bug in the admin - the admin should not be asking the database silly questions.

It applies to any situation where the ordering does not totally define the order of results returned. It's especially serious for booleans, because at least half your dataset will share a value.

One possible solution would be to add an ordering by 'pk' to whatever is explicitly chosen - so you would have User.objects.order_by('is_staff', 'pk'). I don't know what performance impact this would have, however.

Attachments (3)

17198.changelist-order.diff (10.8 KB) - added by julien 3 years ago.
17198.changelist-order.2.diff (11.5 KB) - added by julien 3 years ago.
17198.changelist-order.3.diff (14.0 KB) - added by julien 3 years ago.

Download all attachments as: .zip

Change History (26)

comment:1 Changed 4 years ago by julien

I agree that it's the admin that should be fixed. In fact, this is causing other issues with list_editable as the forms aren't always processed in a deterministic order: see in #16819.

As you've suggested, I think that appending 'pk' to the list of order fields is probably the way to go. That would solve those two tickets at once.

comment:2 Changed 4 years ago by akaariai

I tested this on PostgreSQL 9.0. The only case where appending pk to the order by clause could have serious performance implications is if you have:

  • a lot of rows
  • an index on the sort_col
  • no index on sort_col, pk

The test shows that in the above case you will get a limit 20 index scan if you order by the sort_col, but you will get a sequential scan + sort if you order by sort_col, pk. Now, if you have a lot of rows, the sort is going to be expensive. Of course, dropping the index on sort_col and creating a sort_col, pk index will solve that performance issue. If you don't have any index on the sort_col to begin with, you will get a seq scan + sort anyways.

For a 100000 rows table I measured a difference of 1ms <-> 75ms. If you have million rows, then the difference was 1ms <-> 1 seconds. The difference is likely bigger if you have lots of columns (only 2 in the test). In the worst case you will need to do disk IO, and the difference will explode to much bigger numbers.

The fix is an index on sort_col, pk in the rare cases where that is needed.

comment:3 Changed 3 years ago by anonymous

Using a deterministic order is certainly more important than potential performance issues. According to akaariai's benchmark results, those performance issues don't seem significant in Postgres anyway. It'd be worth checking with MySQL/Oracle though.

The fix should probably be done at the end of ChangeList.get_ordering(). 'pk' should be appended to the list of ordering fields, unless that list already contains 'pk', '-pk' or the explicit primary key field (both ASC and DESC).

comment:4 Changed 3 years ago by julien

Oops, that was me.

comment:5 Changed 3 years ago by julien

  • Owner changed from nobody to julien

I'm working on this.

Changed 3 years ago by julien

comment:6 Changed 3 years ago by julien

  • Has patch set

The attached patch ensures that 'pk' is systematically used as an ordering field in the changelist. Tests are also added for the related ticket #16819.

It breaks an existing test though (admin_views.AdminViewBasicTest.testChangeListSortingPreserveQuerySetOrdering), which was added in r16316 to fix #7309. Luke, do you remember the logic for that test? It seems like it expects the default order to be '-pk', so I'd just like to check if it's ok to make it 'pk' instead.

Thanks!

Last edited 3 years ago by julien (previous) (diff)

comment:7 Changed 3 years ago by lukeplant

@Julien:

The test failure does highlight a genuine problem, AFAICS:

Essentially the patch assumes that the query params and/or ModelAdmin options (and/or the model Meta, IIRC) are all the places where ordering can be defined. If there is nothing there, it adds an ordering clause, assuming that there isn't one already.

However, there are other places where ordering can be defined:

  • default managers
  • ModelAdmin.queryset()

and possibly others. Ticket #7309 is due to someone using ModelAdmin.queryset() to do ordering, and finding it was overridden by the Admin.

This was especially important when the admin did not support sorting by multiple fields, which it does now. However, there may still be cases where it is important not to break #7309 - for example, the current multiple sort UI can only manage sorting by fields that are visible in the change list, and someone might use ModelAdmin.queryset() to provide a default sort that is not on fields in the change list. In any case, we have a regression of some sort if we break #7309.

The essential problem is that we need a QuerySet.order_by() method that doesn't overwrite the existing order by clause, but just adds to the end. This isn't what you usually want order_by to do, but it is here. Alternatively, we could check the actual QuerySet for ordering. Ideally the admin would use a public QuerySet API to do this, but we don't have one yet. I guess using an internal would be acceptable, seeing that the admin probably does a lot of this anyway e.g. _meta.

On another ticket (#17668) just now, in discussion with Anssi Kääriäinen we thought about the need for some public QuerySet API that tells you the state of things like prefetch_related(), order_by().

comment:8 Changed 3 years ago by lukeplant

Actually, it looks like we do have a public API for introspecting QuerySet ordering: QuerySet.ordered

Changed 3 years ago by julien

comment:9 Changed 3 years ago by julien

Thank you, Luke, for the detailed feedback. I've amended the patch to work with pre-existing order_by clauses. See in particular: ordering.extend(queryset.query.order_by)

This removes the test failure. I need to write more tests to be sure all cases are covered. In the meantime, any further feedback is welcome. Thanks again.

Changed 3 years ago by julien

comment:10 Changed 3 years ago by julien

I've added more comprehensive tests to the latest patch. All cases previously mentioned should now be covered:

  • When the default manager has an order_by clause.
  • When ModelAdmin.queryset() has an order_by clause.
  • When no default order is defined on the model.

comment:11 Changed 3 years ago by julien

I've briefly talked about this with jacobkm on IRC. He wonders whether we should do this at all and whether this ticket shouldn't be wontfixed instead. His point is that this adds unnecessary complexity and that we shouldn't automagically order things by 'pk'. Quoting Jacob: "if you don't have a total ordering on your models then you need to understand the consequences".

From a user perspective it's definitely important to guarantee a deterministic order in the admin changelist, otherwise pagination becomes confusing and unusable (especially on Postgres). The question now is: should this be done implicitly by Django, or explicitly by the developer?

Requiring developers to explicitly set the total order can be tricky. If they care about ordering (which they should) then they would have to explicitly set the order for every model registered in the admin. On the other hand, automatically adding 'pk' to the list of ordering fields would make the admin opinionated in that there should *always* be a deterministic order in the change list, and that the primary key would be the discriminating field.

Just an idea: Maybe a middle ground would be to add an option ModelAdmin.discriminating_order_field that would default to 'pk', and that could be modified to any other field name or to None to disable the feature.

So, this seems like a case of how much flexibility, default behavior, explicitness and transparency we want the framework to provide. Jacob and I are interested in hearing more opinions on whether this deserves to be fixed or wontfixed.

comment:12 Changed 3 years ago by lukeplant

"if you don't have a total ordering on your models then you need to understand the consequences".

I agree with that, but put like that it kind of implies that the developer is choosing not to have a total ordering, which isn't quite fair when it is the Django default. It's especially unfair since before 1.4 if you specified an ordering for the admin which was a total ordering, everything but the first item was silently ignored. (I understand I'm reading something into the 'tone' of that comment which might not have been there, but you get my point).

I'm happy with this being a wontfix, but if so I think at the very least we need some doc fixes. Is there perhaps some way to do a warning as well?

Last edited 3 years ago by lukeplant (previous) (diff)

comment:13 Changed 3 years ago by aaugustin

IMO the bug originally reported by Luke is legitimate and should be fixed in the admin.

We can't expect users to understand what a partial ordering is, nor ask developers to add a boilerplate ordering = ['pk'] to all models or admins.

For these reasons I'm strongly against a wontfix.

Last edited 3 years ago by aaugustin (previous) (diff)

comment:14 Changed 3 years ago by akaariai

There is a small problem in the patch as written. Say, you have some unique column ('title' for example) and your default sorting is according to that field. The sorting is already deterministic. Now, the 'pk' is going to be appended to that sorting, and because of that, the existing unique index on title can't be used.

But the good thing is, if you have so many objects that not using an index for sorting is going to cause you problems, then doing the count(*) for pagination is going to be a problem anyways. So I guess the non-indexed sorting will not be a problem in practice.

The non-deterministic ordering is a nasty feature. Usually databases are going to be returning you the objects in a way that seems to work. But, if somebody happens to do some updates/inserts/deletes or maintenance operations to the table, the results will change. Or maybe there is a hash-join somewhere in the query, which will cause some randomness to the returned results.

In short, this is the kind of bug you will _not_ spot in testing. You will likely not spot this even if a user complains. You go check the list, and everything seems to work. My bet is that 50%+ of non-trivial admin installations suffer from this problem in one way or another. The users haven't noticed this, though. That is the reason why the correct default is important here.

I would add an opt-out if that is somewhat easy to do.

comment:15 Changed 3 years ago by julien

Having a deterministic order in the changelist is absolutely essential for a decent user experience. I don't think anybody argues against that. The question is more around whether that should be the developer's responsibility, or whether Django should provide some sane, overridable, defaults.

I personally think we should do the latter, i.e. let Django always guarantee a deterministic order by systematically appending 'pk' to the ordering fields. That would work in the majority of cases and would avoid requiring the developer to explicitly set the total ordering for every registered model. There should also be an API to let the developer either change the field name or disable the feature entirely, for the rare cases where this would, for example, cause performance issues.

In comment:11 I've suggested an idea which perhaps may work for this. What about adding a new option to ModelAdmin (e.g. called unique_order_field, or discriminating_order_field) that would default to 'pk'. The developer could change that value to another field name, or set it to None to disable the feature.

What do you think?

comment:16 Changed 3 years ago by aaugustin

Yes, I agree with that.

Adding a new option to ModelAdmin would be a new feature.

What about :

  • just providing a "sane default" (always enabled) for 1.4 — which fixes this bug,
  • making it overridable in 1.5 — which probably deserves a bit more discussion on the implementation?

comment:17 Changed 3 years ago by akaariai

  • Cc anssi.kaariainen@… added

Are there use-cases where the patch will break current functionality? Something as sane as intentional random ordering... I can't come up with a real use case. If there aren't, then add the default. If there are, you can't add the default without breaking backwards compatibility, or by adding the "opt-out" feature.

So, to me it seems that sane use cases for not having the pk appended should punt this whole problem to 1.5.

About the API: there might be some point in having a "discretize_order(True/False or 'by_field')" API directly for QuerySets. That way you could opt-out by using the queryset API directly, and you would have this feature available outside admin, too. (I know: another ticket...).

Edits: I can't write anything properly today.

Last edited 3 years ago by akaariai (previous) (diff)

comment:18 Changed 3 years ago by lukeplant

I thought again and I realised that you can't make this the developer's responsibility, because with current Django there is no way to do this. If the user clicks to sort on a column that is not unique, the admin will override and clear any ordering on the queryset and order by that column alone, and there is nothing the developer can do to fix it (unless I missed something? Perhaps there is something ModelAdmin method you could override, but by that argument you could declare ModelAdmin bug free by saying you can just override all the ModelAdmin methods).

I think we should go for the sane default, and extra features for 1.5. For bonus points, we can avoid adding the pk column to the ordering if the columns already added amount to providing a total order, which we can deduce in some cases from Field.unique and Model.Meta.unique_together, which will also fix the performance problem raise by Anssi.

This last one does assume that a set of unique values always has a total order in the database, which holds for most datatypes. I believe there are some unusual ones like points (http://www.postgresql.org/docs/8.1/static/datatype-geometric.html#AEN5194) which could potentially have a unique constraint applied but have no defined order at all. There are also some cases where a unique constraint only specifies a partial order. The only one I can think of that we should worry about is nullable fields - you can have multiple NULL values that are considered unique since NULL != NULL, with no defined order. Of course this also means that if PK is nullable too then you still won't have a total order, but there is nothing we can do about that.

comment:19 Changed 3 years ago by akaariai

It will be time-consuming and error-prone to deduce if there is already a discrete ordering (the nulls problem above, .extra(), related model ordering etc) and in addition there are fields which in practice give you good enough ordering. Last modified can have a microsecond resolution, yet is not usually unique. So, definitely 1.5 stuff, and even then it would be good to check just for the most common cases, and leave the rest for the "opt-out" API to deal with.

comment:20 Changed 3 years ago by jacob

Updatelet: I've read comments and talked it over with some other devs, and I'm clearly in the minority in wanting to wontfix this. So objection withdrawn, let's get some sort of fix in.

comment:21 Changed 3 years ago by julien

Thank you all for sharing your thoughts on this. It seems the consensus is to let Django systematically ensure a total order by appending 'pk' to the list of ordering fields, and to consider introducing an API for overriding that default behavior in 1.5.

This is the approach that is taken in the current patch. Would you like to review it so we can proceed with a fix? Thank you!

comment:22 Changed 3 years ago by aaugustin

  • Severity changed from Normal to Release blocker

#16819 was a subset (or symptom) of this problem and a release blocker. Since I closed it as a duplicate, I'm making this a release blocker.

comment:23 Changed 3 years ago by julien

  • Resolution set to fixed
  • Status changed from new to closed

In [17635]:

Fixed #17198 -- Ensured that a deterministic order is used across all database backends for displaying the admin change list's results. Many thanks to Luke Plant for the report and general approach, to everyone involved in the design discussions, and to Carl Meyer for the patch review. Refs #16819.

Note: See TracTickets for help on using tickets.
Back to Top