Opened 14 years ago

Closed 3 years ago

#15819 closed Bug (fixed)

Admin searches should use distinct, if query involves joins

Reported by: Adam Kochanowski <aip@…> Owned by: Ryan Kaskel
Component: contrib.admin Version: dev
Severity: Normal Keywords:
Cc: Ryan Kaskel, eschler@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

If search on inline relation in django-admin I've got multiple same rows. It do search unusable... On django 1.2 it works right.

Attachments (5)

non-unique-related-fields-distinct1.diff (5.0 KB ) - added by Ryan Kaskel 14 years ago.
non-unique-related-fields-distinct2.diff (6.0 KB ) - added by Ryan Kaskel 14 years ago.
non-unique-related-fields-distinct3.diff (7.8 KB ) - added by Ryan Kaskel 14 years ago.
non-unique-related-fields-distinct4.diff (696 bytes ) - added by johnfink8 13 years ago.
query_contains_multijoin.diff (6.4 KB ) - added by Anssi Kääriäinen 13 years ago.

Download all attachments as: .zip

Change History (33)

comment:1 by Karen Tracey, 14 years ago

Resolution: needsinfo
Status: newclosed

Please provide more specifics. Search is available on changelist pages, inlines are on edit pages, so I'm confused about what you are trying to report. Models, model admin configs, and instructions for reproducing what you are seeing would help someone diagnose. As it is there is not enough information here to move forward.

comment:2 by Ryan Kaskel, 14 years ago

Resolution: needsinfo
Status: closedreopened

I haven't tested the issue in 1.2 as it would require too much backporting from 1.3 but I noticed the same issue today.

Say you have a model Foo with an FK to user and well as a first_name and last_name. You also provide the User model with a ModelAdmin that has search_fields = ['foo__first_name', 'foo__last_name'].

There is one user Ryan and two Foo objects that have foreign keys to that User. One has first_name = 'Ryan' and last_name = 'Smith' and the other has first_name = 'Ryan' and last_name = 'Jones'.

Then search for 'Ryan Smith' (i.e. /admin/auth/user/?q=ryan+smith) and the same Ryan User object is listed twice. In other words, distinct() is never called. I hope this makes sense.

I think this is related to #13902 so the facilities are there. The check on http://code.djangoproject.com/browser/django/trunk/django/contrib/admin/views/main.py#L267 would need to be changed.

comment:3 by Ryan Kaskel, 14 years ago

Cc: Ryan Kaskel added

comment:4 by Luke Plant, 14 years ago

Note that using distinct is not always simple - #15559, #11707

comment:5 by Ryan Kaskel, 14 years ago

Yes, by the looks of those tickets it doesn't always seem simple.

I checked with 1.2.5 and the original ticket reporter is right in the behavior changed from 1.2.5 to 1.3 (I think with the patch for #13902).

Just to clarify the behavior: I created a little app called alias with model like:

models.py:

class Alias(models.Model):
    user = models.ForeignKey(User)
    first_name = models.CharField(max_length=36)
    last_name = models.CharField(max_length=36)

admin.py:

class MyUserAdmin(UserAdmin):
    search_fields = ('username', 'first_name', 'last_name', 'email',
                     'alias__first_name', 'alias__last_name')

with a fixture like:

[{"pk": 1, "model": "alias.alias", "fields": {"first_name": "ryan", "last_name": "jones", "user": 1}}, 
 {"pk": 2, "model": "alias.alias", "fields": {"first_name": "ryan", "last_name": "smith", "user": 1}}]

... a search of /admin/auth/user/?q=ryan+smith returns the same User twice.

Would it be possible to add an option to ModelAdmin that forced distinct() to be called on a search?

comment:6 by Carl Meyer, 14 years ago

Confirmed that this is a problem, and that it is a regression in 1.3.

It should just be fixed outright, not with a ModelAdmin option - there's no use case for duplicate search results.

As for use of distinct not being simple - this is what .distinct() is for. If there are bugs in distinct(), we need to fix them, not avoid using it. It does appear that we should fix #15559 along with this (and also re-apply the fix for #11707). However, that doesn't mean that work on a patch here needs to be stalled waiting for #15559.

I don't think we should call .distinct() unconditionally - just expand the checks from r15526 so they catch the non-M2M cases that might cause dupe results.

comment:7 by Ryan Kaskel, 14 years ago

Easy pickings: unset
Owner: changed from aip@… to Ryan Kaskel
Status: reopenednew

by Ryan Kaskel, 14 years ago

comment:8 by Ryan Kaskel, 14 years ago

Has patch: set

I've added an initial patch. I've also discovered that the problem exists for list_filters and the patch reflects thats. Since the same check to use distinct() was used in two places, I've factored that out into a function:

def field_needs_distinct(field):
    if ((hasattr(field, 'rel') and
         isinstance(field.rel, models.ManyToManyRel)) or
        (isinstance(field, models.related.RelatedObject) and
         not field.field.unique)):
        return True
    return False

Distinct isn't called unconditionally: only if it is a ManyToMany relationship or some other kind of relationship where the key has unique=False.

The only other piece of code I'm unsure about is http://code.djangoproject.com/browser/django/trunk/django/contrib/admin/views/main.py#L238. Should select_related() be called for other types of relationships? For example, this case where a ForeignKey has unique=False.

I don't know if this is an acceptable test for the condition so any feedback would be great. I'd also like feedback on the style as it's my first patch.

by Ryan Kaskel, 14 years ago

comment:9 by Ryan Kaskel, 14 years ago

I've added an alternative patch that removes some of the mock request boilerplate in the tests. I'm not sure if that's okay because it isn't important for the patch but cuts down on unnecessary code.

comment:10 by Carl Meyer, 14 years ago

Triage Stage: UnreviewedAccepted

Hi Ryan - this looks pretty good to me. The second patch doesn't appear to be complete, it only includes the changed test file, not the code changes. Can you upload a complete updated patch? (The boilerplate-removal in the second patch looks fine to me, it's related since its adding flexibility to test-support code that your added tests use).

Re style, is there a reason for the field_needs_distinct function to be a closure defined inside get_query_set, rather than a module-level function? I would just do the latter for simplicity if we don't need the closure.

Re the select_related() call, that's not, erm, related to this bug. It actually may be that we could call select_related in a couple more cases (specifically, one-to-ones and reverse one-to-ones), but we'd need a separate ticket for that optimization.

comment:11 by Aymeric Augustin, 14 years ago

#15839 is probably a duplicate.

by Ryan Kaskel, 14 years ago

comment:12 by Ryan Kaskel, 14 years ago

Thanks Carl. I've moved the function the module level and have attached the complete patch.

comment:13 by Carl Meyer, 14 years ago

Looks good, I'm preparing to commit this. Unlike #11707, I don't think there's an argument to be made for holding this until #15559 is fixed. For people in situations where .distinct() is broken for whatever reason, the workaround here is simple: don't use search_fields with relations in a way that triggers it.

Thanks for the patch!

comment:14 by Carl Meyer, 14 years ago

Resolution: fixed
Status: newclosed

In [16093]:

Fixed #15819 - Fixed 1.3 regression from r15526 causing duplicate search results in admin with search_fields traversing to non-M2M related models. Thanks to Adam Kochanowski for the report and Ryan Kaskel for the patch.

comment:15 by Carl Meyer, 14 years ago

In [16094]:

[1.3.X] Fixed #15819 - Fixed 1.3 regression from r15526 causing duplicate search results in admin with search_fields traversing to non-M2M related models. Thanks to Adam Kochanowski for the report and Ryan Kaskel for the patch.

Backport of r16093 from trunk.

comment:16 by johnfink8, 13 years ago

Resolution: fixed
Status: closedreopened
UI/UX: unset
Version: 1.3SVN

Still produces duplicate rows under the right circumstances. The lookup_needs_distinct check only goes one join deep looking for a ManyToMany or a reverse to a ForeignKey, but if the search field is more deeply nested than that than the possibility of needing distinct() rises sharply.

Version 0, edited 13 years ago by johnfink8 (next)

comment:17 by Anssi Kääriäinen, 13 years ago

Maybe the queryset should expose some way to tell if distinct should be used? The queryset should contain that logic, the admin should not try to duplicate / guess that. It would be pretty easy to set a flag "contains_multijoin" whenever such a join is seen in sql/query/setup_joins, and after that Admin could just do if qs.query.contains_multijoin: qs = qs.distinct().

It would be possible to generate queries in a way that distinct isn't needed at all. This can be done using subqueries instead of joins for reverse fk/m2m filters. Unfortunately some backends will likely choke on such queries (while others will benefit) so just changing the ORM to use subqueries isn't possible, or at least the backwards compatibility issues should be investigated thoroughly. In addition, changing the ORM code to generate such queries isn't the easiest thing to do, so that is another problem for this path.

comment:18 by Anssi Kääriäinen, 13 years ago

Here is a quick patch moving the "needs distinct" logic into the sql/query class. No new tests, but I am pretty sure it works correctly even for deeper join structures.

by Anssi Kääriäinen, 13 years ago

comment:19 by Henrique C. Alves, 12 years ago

The problem, though, is that DISTINCT is painfully slow on most databases. A query like...

SELECT DISTINCT "mytable"."id", ... ORDER BY "mytable"."somefield" DESC LIMIT 100

..will force a full table scan and query time will increase linearly with the number of rows. Django shouldn't guess when to use it internally without a way for client code to override it, specially if this logic is moved to the ORM.

This currently happens on contrib.admin if you include a M2M relation in the list_filters, and makes the changelist views unusable if your database has more than a couple hundred queries. See ticket #18729 for that.

More often than not, you want to optimize the query to use a subquery (giving hints to the query planner because it increases selectivity), or querying without DISTINCT and removing duplicates on Python side. Here's a case study on optimizing queries to avoid use of DISTINCT: http://www.databasejournal.com/features/postgresql/article.php/3437821/SELECT-DISTINCT-A-SQL-Case-Study.htm

comment:20 by Anssi Kääriäinen, 12 years ago

Yeah, I agree about distinct being slow, and as mentioned in the article it is usually a sign of badly written query.

I find it hard to believe doing the distinct operation on Python side would lead to a net win. Databases are written in C and they will be faster at doing the distinct than transferring the whole dataset to Python and then doing the distinct on that side.

Making the ORM to do transformation of m2m filter to subquery should be easy in the trivial cases, as we already do them for m2m exclude joins. The harder problem is doing them for queries like this:

    qs.filter(Q(m2m_rel__foo='Bar')|Q(m2m_rel__foo='Baz'))

The problem is that lookups inside one filter should target the same join (IIRC), but the transform-to-subquery logic doesn't know anything about that.

Still, as a long term goal it would be nice to move away from the current confusing situation where filters to m2m can create duplicate results. However, doing that cleanly in the ORM is hard, and then there are backwards compatibility issues too.

There is a closely related issue: use aggregations and m2m filters in the same query -> wrong results.

So, for the time being, I think we will just need to use the ORM to detect when to use distinct. It should be a small step forward. I hope we can have automatic subqueries, but this isn't likely to happen in the close future.

in reply to:  20 comment:21 by Henrique C. Alves, 12 years ago

Replying to akaariai:

Yeah, I agree about distinct being slow, and as mentioned in the article it is usually a sign of badly written query.

I find it hard to believe doing the distinct operation on Python side would lead to a net win.

Depends completely on the use case. If I'm querying 1 million tuples with DISTINCT ... LIMIT 100 my database dies because that's a full table scan (twice) just to retrieve 100. If I query without DISTINCT, I send 100 tuples over the wire and remove duplicates in Python - that's trivial.

That's what I mean about the ORM not guessing when to use DISTINCT, specially if it's going to happen deep inside ORM code without a clear way to override.

comment:22 by Travis Swicegood, 12 years ago

#18729 was opened as a dupe of this.

comment:23 by Aymeric Augustin, 12 years ago

Status: reopenednew

comment:24 by Erin Kelly, 12 years ago

The DISTINCT also causes an Oracle error if the model contains a TextField. This was reported on django-users:

https://groups.google.com/forum/?fromgroups=#!topic/django-users/eYTdpy4eWLQ

comment:25 by Dirk Eschler, 12 years ago

Cc: eschler@… added

comment:26 by iorlas, 11 years ago

Needs tests: set

#20991 - one more dupe of this bug..., mkay

So, I'll try to write few tests on this week.
Is it that patch that should be fine?
https://code.djangoproject.com/attachment/ticket/15819/query_contains_multijoin.diff

comment:27 by Tim Graham, 11 years ago

Patch needs improvement: set

Yes, the patch needs to be updated to apply cleanly. You can probably model the test for the reverse foreign key situation on the one that was added in [065e6b6153].

comment:28 by Mariusz Felisiak, 3 years ago

Resolution: fixed
Status: newclosed

Nested case was fixed in d084176cc1273d5faf6f88eedb4c490e961f3a68.

Note: See TracTickets for help on using tickets.
Back to Top