Code

Opened 3 years ago

Last modified 11 months ago

#15819 new Bug

Admin searches should use distinct, if query involves joins

Reported by: Adam Kochanowski <aip@…> Owned by: ryankask
Component: contrib.admin Version: master
Severity: Normal Keywords:
Cc: ryankask, eschler@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

If search on inline relation in django-admin I've got multiple same rows. It do search unusable... On django 1.2 it works right.

Attachments (5)

non-unique-related-fields-distinct1.diff (5.0 KB) - added by ryankask 3 years ago.
non-unique-related-fields-distinct2.diff (6.0 KB) - added by ryankask 3 years ago.
non-unique-related-fields-distinct3.diff (7.8 KB) - added by ryankask 3 years ago.
non-unique-related-fields-distinct4.diff (696 bytes) - added by johnfink8 2 years ago.
query_contains_multijoin.diff (6.4 KB) - added by akaariai 2 years ago.

Download all attachments as: .zip

Change History (32)

comment:1 Changed 3 years ago by kmtracey

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to needsinfo
  • Status changed from new to closed

Please provide more specifics. Search is available on changelist pages, inlines are on edit pages, so I'm confused about what you are trying to report. Models, model admin configs, and instructions for reproducing what you are seeing would help someone diagnose. As it is there is not enough information here to move forward.

comment:2 Changed 3 years ago by ryankask

  • Resolution needsinfo deleted
  • Status changed from closed to reopened

I haven't tested the issue in 1.2 as it would require too much backporting from 1.3 but I noticed the same issue today.

Say you have a model Foo with an FK to user and well as a first_name and last_name. You also provide the User model with a ModelAdmin that has search_fields = ['foo__first_name', 'foo__last_name'].

There is one user Ryan and two Foo objects that have foreign keys to that User. One has first_name = 'Ryan' and last_name = 'Smith' and the other has first_name = 'Ryan' and last_name = 'Jones'.

Then search for 'Ryan Smith' (i.e. /admin/auth/user/?q=ryan+smith) and the same Ryan User object is listed twice. In other words, distinct() is never called. I hope this makes sense.

I think this is related to #13902 so the facilities are there. The check on http://code.djangoproject.com/browser/django/trunk/django/contrib/admin/views/main.py#L267 would need to be changed.

comment:3 Changed 3 years ago by ryankask

  • Cc ryankask added

comment:4 Changed 3 years ago by lukeplant

Note that using distinct is not always simple - #15559, #11707

comment:5 Changed 3 years ago by ryankask

Yes, by the looks of those tickets it doesn't always seem simple.

I checked with 1.2.5 and the original ticket reporter is right in the behavior changed from 1.2.5 to 1.3 (I think with the patch for #13902).

Just to clarify the behavior: I created a little app called alias with model like:

models.py:

class Alias(models.Model):
    user = models.ForeignKey(User)
    first_name = models.CharField(max_length=36)
    last_name = models.CharField(max_length=36)

admin.py:

class MyUserAdmin(UserAdmin):
    search_fields = ('username', 'first_name', 'last_name', 'email',
                     'alias__first_name', 'alias__last_name')

with a fixture like:

[{"pk": 1, "model": "alias.alias", "fields": {"first_name": "ryan", "last_name": "jones", "user": 1}}, 
 {"pk": 2, "model": "alias.alias", "fields": {"first_name": "ryan", "last_name": "smith", "user": 1}}]

... a search of /admin/auth/user/?q=ryan+smith returns the same User twice.

Would it be possible to add an option to ModelAdmin that forced distinct() to be called on a search?

comment:6 Changed 3 years ago by carljm

Confirmed that this is a problem, and that it is a regression in 1.3.

It should just be fixed outright, not with a ModelAdmin option - there's no use case for duplicate search results.

As for use of distinct not being simple - this is what .distinct() is for. If there are bugs in distinct(), we need to fix them, not avoid using it. It does appear that we should fix #15559 along with this (and also re-apply the fix for #11707). However, that doesn't mean that work on a patch here needs to be stalled waiting for #15559.

I don't think we should call .distinct() unconditionally - just expand the checks from r15526 so they catch the non-M2M cases that might cause dupe results.

comment:7 Changed 3 years ago by ryankask

  • Easy pickings unset
  • Owner changed from aip@… to ryankask
  • Status changed from reopened to new

Changed 3 years ago by ryankask

comment:8 Changed 3 years ago by ryankask

  • Has patch set

I've added an initial patch. I've also discovered that the problem exists for list_filters and the patch reflects thats. Since the same check to use distinct() was used in two places, I've factored that out into a function:

def field_needs_distinct(field):
    if ((hasattr(field, 'rel') and
         isinstance(field.rel, models.ManyToManyRel)) or
        (isinstance(field, models.related.RelatedObject) and
         not field.field.unique)):
        return True
    return False

Distinct isn't called unconditionally: only if it is a ManyToMany relationship or some other kind of relationship where the key has unique=False.

The only other piece of code I'm unsure about is http://code.djangoproject.com/browser/django/trunk/django/contrib/admin/views/main.py#L238. Should select_related() be called for other types of relationships? For example, this case where a ForeignKey has unique=False.

I don't know if this is an acceptable test for the condition so any feedback would be great. I'd also like feedback on the style as it's my first patch.

Changed 3 years ago by ryankask

comment:9 Changed 3 years ago by ryankask

I've added an alternative patch that removes some of the mock request boilerplate in the tests. I'm not sure if that's okay because it isn't important for the patch but cuts down on unnecessary code.

comment:10 Changed 3 years ago by carljm

  • Triage Stage changed from Unreviewed to Accepted

Hi Ryan - this looks pretty good to me. The second patch doesn't appear to be complete, it only includes the changed test file, not the code changes. Can you upload a complete updated patch? (The boilerplate-removal in the second patch looks fine to me, it's related since its adding flexibility to test-support code that your added tests use).

Re style, is there a reason for the field_needs_distinct function to be a closure defined inside get_query_set, rather than a module-level function? I would just do the latter for simplicity if we don't need the closure.

Re the select_related() call, that's not, erm, related to this bug. It actually may be that we could call select_related in a couple more cases (specifically, one-to-ones and reverse one-to-ones), but we'd need a separate ticket for that optimization.

comment:11 Changed 3 years ago by aaugustin

#15839 is probably a duplicate.

Changed 3 years ago by ryankask

comment:12 Changed 3 years ago by ryankask

Thanks Carl. I've moved the function the module level and have attached the complete patch.

comment:13 Changed 3 years ago by carljm

Looks good, I'm preparing to commit this. Unlike #11707, I don't think there's an argument to be made for holding this until #15559 is fixed. For people in situations where .distinct() is broken for whatever reason, the workaround here is simple: don't use search_fields with relations in a way that triggers it.

Thanks for the patch!

comment:14 Changed 3 years ago by carljm

  • Resolution set to fixed
  • Status changed from new to closed

In [16093]:

Fixed #15819 - Fixed 1.3 regression from r15526 causing duplicate search results in admin with search_fields traversing to non-M2M related models. Thanks to Adam Kochanowski for the report and Ryan Kaskel for the patch.

comment:15 Changed 3 years ago by carljm

In [16094]:

[1.3.X] Fixed #15819 - Fixed 1.3 regression from r15526 causing duplicate search results in admin with search_fields traversing to non-M2M related models. Thanks to Adam Kochanowski for the report and Ryan Kaskel for the patch.

Backport of r16093 from trunk.

comment:16 Changed 2 years ago by johnfink8

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • UI/UX unset
  • Version changed from 1.3 to SVN

Still produces duplicate rows under the right circumstances. The lookup_needs_distinct check only goes one join deep looking for a ManyToMany or a reverse to a ForeignKey, but if the search field is more deeply nested than that than the possibility of needing distinct() rises sharply.

Last edited 2 years ago by johnfink8 (previous) (diff)

Changed 2 years ago by johnfink8

comment:17 Changed 2 years ago by akaariai

Maybe the queryset should expose some way to tell if distinct should be used? The queryset should contain that logic, the admin should not try to duplicate / guess that. It would be pretty easy to set a flag "contains_multijoin" whenever such a join is seen in sql/query/setup_joins, and after that Admin could just do if qs.query.contains_multijoin: qs = qs.distinct().

It would be possible to generate queries in a way that distinct isn't needed at all. This can be done using subqueries instead of joins for reverse fk/m2m filters. Unfortunately some backends will likely choke on such queries (while others will benefit) so just changing the ORM to use subqueries isn't possible, or at least the backwards compatibility issues should be investigated thoroughly. In addition, changing the ORM code to generate such queries isn't the easiest thing to do, so that is another problem for this path.

comment:18 Changed 2 years ago by akaariai

Here is a quick patch moving the "needs distinct" logic into the sql/query class. No new tests, but I am pretty sure it works correctly even for deeper join structures.

Changed 2 years ago by akaariai

comment:19 Changed 2 years ago by hcarvalhoalves

The problem, though, is that DISTINCT is painfully slow on most databases. A query like...

SELECT DISTINCT "mytable"."id", ... ORDER BY "mytable"."somefield" DESC LIMIT 100

..will force a full table scan and query time will increase linearly with the number of rows. Django shouldn't guess when to use it internally without a way for client code to override it, specially if this logic is moved to the ORM.

This currently happens on contrib.admin if you include a M2M relation in the list_filters, and makes the changelist views unusable if your database has more than a couple hundred queries. See ticket #18729 for that.

More often than not, you want to optimize the query to use a subquery (giving hints to the query planner because it increases selectivity), or querying without DISTINCT and removing duplicates on Python side. Here's a case study on optimizing queries to avoid use of DISTINCT: http://www.databasejournal.com/features/postgresql/article.php/3437821/SELECT-DISTINCT-A-SQL-Case-Study.htm

comment:20 follow-up: Changed 2 years ago by akaariai

Yeah, I agree about distinct being slow, and as mentioned in the article it is usually a sign of badly written query.

I find it hard to believe doing the distinct operation on Python side would lead to a net win. Databases are written in C and they will be faster at doing the distinct than transferring the whole dataset to Python and then doing the distinct on that side.

Making the ORM to do transformation of m2m filter to subquery should be easy in the trivial cases, as we already do them for m2m exclude joins. The harder problem is doing them for queries like this:

    qs.filter(Q(m2m_rel__foo='Bar')|Q(m2m_rel__foo='Baz'))

The problem is that lookups inside one filter should target the same join (IIRC), but the transform-to-subquery logic doesn't know anything about that.

Still, as a long term goal it would be nice to move away from the current confusing situation where filters to m2m can create duplicate results. However, doing that cleanly in the ORM is hard, and then there are backwards compatibility issues too.

There is a closely related issue: use aggregations and m2m filters in the same query -> wrong results.

So, for the time being, I think we will just need to use the ORM to detect when to use distinct. It should be a small step forward. I hope we can have automatic subqueries, but this isn't likely to happen in the close future.

comment:21 in reply to: ↑ 20 Changed 2 years ago by hcarvalhoalves

Replying to akaariai:

Yeah, I agree about distinct being slow, and as mentioned in the article it is usually a sign of badly written query.

I find it hard to believe doing the distinct operation on Python side would lead to a net win.

Depends completely on the use case. If I'm querying 1 million tuples with DISTINCT ... LIMIT 100 my database dies because that's a full table scan (twice) just to retrieve 100. If I query without DISTINCT, I send 100 tuples over the wire and remove duplicates in Python - that's trivial.

That's what I mean about the ORM not guessing when to use DISTINCT, specially if it's going to happen deep inside ORM code without a clear way to override.

comment:22 Changed 22 months ago by tswicegood

#18729 was opened as a dupe of this.

comment:23 Changed 16 months ago by aaugustin

  • Status changed from reopened to new

comment:24 Changed 15 months ago by ikelly

The DISTINCT also causes an Oracle error if the model contains a TextField. This was reported on django-users:

https://groups.google.com/forum/?fromgroups=#!topic/django-users/eYTdpy4eWLQ

comment:25 Changed 14 months ago by deschler

  • Cc eschler@… added

comment:26 Changed 11 months ago by iorlas

  • Needs tests set

#20991 - one more dupe of this bug..., mkay

So, I'll try to write few tests on this week.
Is it that patch that should be fine?
https://code.djangoproject.com/attachment/ticket/15819/query_contains_multijoin.diff

comment:27 Changed 11 months ago by timo

  • Patch needs improvement set

Yes, the patch needs to be updated to apply cleanly. You can probably model the test for the reverse foreign key situation on the one that was added in [065e6b6153].

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as new
The owner will be changed from ryankask to anonymous. Next status will be 'assigned'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.