Opened 14 years ago
Closed 3 years ago
#15819 closed Bug (fixed)
Admin searches should use distinct, if query involves joins
Reported by: | Owned by: | Ryan Kaskel | |
---|---|---|---|
Component: | contrib.admin | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Ryan Kaskel, eschler@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
If search on inline relation in django-admin I've got multiple same rows. It do search unusable... On django 1.2 it works right.
Attachments (5)
Change History (33)
comment:1 by , 14 years ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
comment:2 by , 14 years ago
Resolution: | needsinfo |
---|---|
Status: | closed → reopened |
I haven't tested the issue in 1.2 as it would require too much backporting from 1.3 but I noticed the same issue today.
Say you have a model Foo
with an FK to user
and well as a first_name
and last_name
. You also provide the User model with a ModelAdmin
that has search_fields = ['foo__first_name', 'foo__last_name']
.
There is one user Ryan
and two Foo
objects that have foreign keys to that User. One has first_name
= 'Ryan' and last_name
= 'Smith' and the other has first_name
= 'Ryan' and last_name
= 'Jones'.
Then search for 'Ryan Smith' (i.e. /admin/auth/user/?q=ryan+smith
) and the same Ryan
User object is listed twice. In other words, distinct()
is never called. I hope this makes sense.
I think this is related to #13902 so the facilities are there. The check on http://code.djangoproject.com/browser/django/trunk/django/contrib/admin/views/main.py#L267 would need to be changed.
comment:3 by , 14 years ago
Cc: | added |
---|
comment:5 by , 14 years ago
Yes, by the looks of those tickets it doesn't always seem simple.
I checked with 1.2.5 and the original ticket reporter is right in the behavior changed from 1.2.5 to 1.3 (I think with the patch for #13902).
Just to clarify the behavior: I created a little app called alias
with model like:
models.py:
class Alias(models.Model): user = models.ForeignKey(User) first_name = models.CharField(max_length=36) last_name = models.CharField(max_length=36)
admin.py:
class MyUserAdmin(UserAdmin): search_fields = ('username', 'first_name', 'last_name', 'email', 'alias__first_name', 'alias__last_name')
with a fixture like:
[{"pk": 1, "model": "alias.alias", "fields": {"first_name": "ryan", "last_name": "jones", "user": 1}}, {"pk": 2, "model": "alias.alias", "fields": {"first_name": "ryan", "last_name": "smith", "user": 1}}]
... a search of /admin/auth/user/?q=ryan+smith
returns the same User twice.
Would it be possible to add an option to ModelAdmin
that forced distinct()
to be called on a search?
comment:6 by , 14 years ago
Confirmed that this is a problem, and that it is a regression in 1.3.
It should just be fixed outright, not with a ModelAdmin option - there's no use case for duplicate search results.
As for use of distinct not being simple - this is what .distinct() is for. If there are bugs in distinct(), we need to fix them, not avoid using it. It does appear that we should fix #15559 along with this (and also re-apply the fix for #11707). However, that doesn't mean that work on a patch here needs to be stalled waiting for #15559.
I don't think we should call .distinct() unconditionally - just expand the checks from r15526 so they catch the non-M2M cases that might cause dupe results.
comment:7 by , 14 years ago
Easy pickings: | unset |
---|---|
Owner: | changed from | to
Status: | reopened → new |
by , 14 years ago
Attachment: | non-unique-related-fields-distinct1.diff added |
---|
comment:8 by , 14 years ago
Has patch: | set |
---|
I've added an initial patch. I've also discovered that the problem exists for list_filters
and the patch reflects thats. Since the same check to use distinct()
was used in two places, I've factored that out into a function:
def field_needs_distinct(field): if ((hasattr(field, 'rel') and isinstance(field.rel, models.ManyToManyRel)) or (isinstance(field, models.related.RelatedObject) and not field.field.unique)): return True return False
Distinct isn't called unconditionally: only if it is a ManyToMany relationship or some other kind of relationship where the key has unique=False
.
The only other piece of code I'm unsure about is http://code.djangoproject.com/browser/django/trunk/django/contrib/admin/views/main.py#L238. Should select_related()
be called for other types of relationships? For example, this case where a ForeignKey
has unique=False
.
I don't know if this is an acceptable test for the condition so any feedback would be great. I'd also like feedback on the style as it's my first patch.
by , 14 years ago
Attachment: | non-unique-related-fields-distinct2.diff added |
---|
comment:9 by , 14 years ago
I've added an alternative patch that removes some of the mock request boilerplate in the tests. I'm not sure if that's okay because it isn't important for the patch but cuts down on unnecessary code.
comment:10 by , 14 years ago
Triage Stage: | Unreviewed → Accepted |
---|
Hi Ryan - this looks pretty good to me. The second patch doesn't appear to be complete, it only includes the changed test file, not the code changes. Can you upload a complete updated patch? (The boilerplate-removal in the second patch looks fine to me, it's related since its adding flexibility to test-support code that your added tests use).
Re style, is there a reason for the field_needs_distinct function to be a closure defined inside get_query_set, rather than a module-level function? I would just do the latter for simplicity if we don't need the closure.
Re the select_related() call, that's not, erm, related to this bug. It actually may be that we could call select_related in a couple more cases (specifically, one-to-ones and reverse one-to-ones), but we'd need a separate ticket for that optimization.
by , 14 years ago
Attachment: | non-unique-related-fields-distinct3.diff added |
---|
comment:12 by , 14 years ago
Thanks Carl. I've moved the function the module level and have attached the complete patch.
comment:13 by , 14 years ago
Looks good, I'm preparing to commit this. Unlike #11707, I don't think there's an argument to be made for holding this until #15559 is fixed. For people in situations where .distinct() is broken for whatever reason, the workaround here is simple: don't use search_fields with relations in a way that triggers it.
Thanks for the patch!
comment:16 by , 13 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
UI/UX: | unset |
Version: | 1.3 → SVN |
Still produces duplicate rows under the right circumstances. The lookup_needs_distinct check only goes one join deep looking for a ManyToMany or a reverse to a ForeignKey, but if the search field is more deeply nested than that than the possibility of needing distinct() rises sharply.
by , 13 years ago
Attachment: | non-unique-related-fields-distinct4.diff added |
---|
comment:17 by , 13 years ago
Maybe the queryset should expose some way to tell if distinct should be used? The queryset should contain that logic, the admin should not try to duplicate / guess that. It would be pretty easy to set a flag "contains_multijoin" whenever such a join is seen in sql/query/setup_joins, and after that Admin could just do if qs.query.contains_multijoin: qs = qs.distinct().
It would be possible to generate queries in a way that distinct isn't needed at all. This can be done using subqueries instead of joins for reverse fk/m2m filters. Unfortunately some backends will likely choke on such queries (while others will benefit) so just changing the ORM to use subqueries isn't possible, or at least the backwards compatibility issues should be investigated thoroughly. In addition, changing the ORM code to generate such queries isn't the easiest thing to do, so that is another problem for this path.
comment:18 by , 13 years ago
Here is a quick patch moving the "needs distinct" logic into the sql/query class. No new tests, but I am pretty sure it works correctly even for deeper join structures.
by , 13 years ago
Attachment: | query_contains_multijoin.diff added |
---|
comment:19 by , 12 years ago
The problem, though, is that DISTINCT
is painfully slow on most databases. A query like...
SELECT DISTINCT "mytable"."id", ... ORDER BY "mytable"."somefield" DESC LIMIT 100
..will force a full table scan and query time will increase linearly with the number of rows. Django shouldn't guess when to use it internally without a way for client code to override it, specially if this logic is moved to the ORM.
This currently happens on contrib.admin
if you include a M2M relation in the list_filters
, and makes the changelist
views unusable if your database has more than a couple hundred queries. See ticket #18729 for that.
More often than not, you want to optimize the query to use a subquery (giving hints to the query planner because it increases selectivity), or querying without DISTINCT
and removing duplicates on Python side. Here's a case study on optimizing queries to avoid use of DISTINCT
: http://www.databasejournal.com/features/postgresql/article.php/3437821/SELECT-DISTINCT-A-SQL-Case-Study.htm
follow-up: 21 comment:20 by , 12 years ago
Yeah, I agree about distinct being slow, and as mentioned in the article it is usually a sign of badly written query.
I find it hard to believe doing the distinct operation on Python side would lead to a net win. Databases are written in C and they will be faster at doing the distinct than transferring the whole dataset to Python and then doing the distinct on that side.
Making the ORM to do transformation of m2m filter to subquery should be easy in the trivial cases, as we already do them for m2m exclude joins. The harder problem is doing them for queries like this:
qs.filter(Q(m2m_rel__foo='Bar')|Q(m2m_rel__foo='Baz'))
The problem is that lookups inside one filter should target the same join (IIRC), but the transform-to-subquery logic doesn't know anything about that.
Still, as a long term goal it would be nice to move away from the current confusing situation where filters to m2m can create duplicate results. However, doing that cleanly in the ORM is hard, and then there are backwards compatibility issues too.
There is a closely related issue: use aggregations and m2m filters in the same query -> wrong results.
So, for the time being, I think we will just need to use the ORM to detect when to use distinct. It should be a small step forward. I hope we can have automatic subqueries, but this isn't likely to happen in the close future.
comment:21 by , 12 years ago
Replying to akaariai:
Yeah, I agree about distinct being slow, and as mentioned in the article it is usually a sign of badly written query.
I find it hard to believe doing the distinct operation on Python side would lead to a net win.
Depends completely on the use case. If I'm querying 1 million tuples with DISTINCT ... LIMIT 100
my database dies because that's a full table scan (twice) just to retrieve 100. If I query without DISTINCT
, I send 100 tuples over the wire and remove duplicates in Python - that's trivial.
That's what I mean about the ORM not guessing when to use DISTINCT
, specially if it's going to happen deep inside ORM code without a clear way to override.
comment:23 by , 12 years ago
Status: | reopened → new |
---|
comment:24 by , 12 years ago
The DISTINCT also causes an Oracle error if the model contains a TextField. This was reported on django-users:
https://groups.google.com/forum/?fromgroups=#!topic/django-users/eYTdpy4eWLQ
comment:25 by , 11 years ago
Cc: | added |
---|
comment:26 by , 11 years ago
Needs tests: | set |
---|
#20991 - one more dupe of this bug..., mkay
So, I'll try to write few tests on this week.
Is it that patch that should be fine?
https://code.djangoproject.com/attachment/ticket/15819/query_contains_multijoin.diff
comment:27 by , 11 years ago
Patch needs improvement: | set |
---|
Yes, the patch needs to be updated to apply cleanly. You can probably model the test for the reverse foreign key situation on the one that was added in [065e6b6153].
comment:28 by , 3 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Nested case was fixed in d084176cc1273d5faf6f88eedb4c490e961f3a68.
Please provide more specifics. Search is available on changelist pages, inlines are on edit pages, so I'm confused about what you are trying to report. Models, model admin configs, and instructions for reproducing what you are seeing would help someone diagnose. As it is there is not enough information here to move forward.