Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#28292 closed Bug (invalid)

ORing on ManyToMany to same model can result in duplicate items in queryset

Reported by: Till Theato Owned by: nobody
Component: Database layer (models, ORM) Version: 1.11
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no


One of my models has two ManyToManyFields to another model:

class Bar(models.Model):
    created = models.DateTimeField(auto_now_add=True)

class Foo(models.Model):
    rel1 = models.ManyToManyField('Bar', related_name='foo_rel1')
    rel2 = models.ManyToManyField('Bar', related_name='foo_rel2')

I have to query all instances of Foo that point to the same Bar via any of the both relationships. However when an instance of Foo had multiple relationships to unrelated Bar instances, I got duplicated instances in the resulting queryset:

b1 = Bar.objects.create()
b2 = Bar.objects.create()
b3 = Bar.objects.create()

f = Foo.objects.create()

assert Foo.objects.filter(rel2=b1).count() == 1 # Good

assert Foo.objects.filter(rel1=b1).count() == 0 # Good

qs = Foo.objects.filter(Q(rel1=b1) | Q(rel2=b1))

assert qs.count() == 1 # AssertionError, qs.count() == 2

print(qs.values('rel1', 'rel2'))
# <QuerySet [{'rel1': 2, 'rel2': 1}, {'rel1': 3, 'rel2': 1}]>

As the last output shows, one entry per relationship in rel1 is returned.

Change History (5)

comment:1 Changed 5 years ago by Tim Graham

Resolution: invalid
Status: newclosed

This is documented behavior. See the warning in the QuerySet.values() doc:

Because ManyToManyField attributes and reverse relations can have multiple related rows, including these can have a multiplier effect on the size of your result set. This will be especially pronounced if you include multiple such fields in your values() query, in which case all possible combinations will be returned.

comment:2 Changed 5 years ago by Till Theato

In the original code no ManyToMany fields are included in the values call. I added this only for documentation purposes, but that appears to make no sense. Sorry about that.
The actual problem lies one line ahead in the failing assertion: The count of the queryset is 2 instead of 1 as expected.

Last edited 5 years ago by Till Theato (previous) (diff)

comment:3 Changed 5 years ago by Tim Graham

I believe you need to use QuerySet.distinct():

By default, a QuerySet will not eliminate duplicate rows. In practice, this is rarely a problem, because simple queries such as Blog.objects.all() don’t introduce the possibility of duplicate result rows. However, if your query spans multiple tables, it’s possible to get duplicate results when a QuerySet is evaluated. That’s when you’d use distinct().

comment:4 Changed 5 years ago by Till Theato

I am aware of Queryset.distinct(), however I could not believe that the observed behavior was expected. Before reporting this bug I already tried using distinct on the original query where the problem occurred, but this did not help because I calculate a cumulative sum in an annotation of the query using a custom Func:

class CumSum(Func):
    function = 'sum'
    template = '%(function)s(%(expressions)s) OVER (ORDER BY created)'

The annotation happens before distinctness is enforced and therefore the CumSum of entries after duplication is wrong.
Looks like I will have to pull that out of DB then – if the reported problem is really expected behavior.

comment:5 Changed 5 years ago by Tim Graham

I'll admit that I don't have a deep understanding of the ORM. If you can offer a patch showing why Django is at fault, feel free to reopen the ticket.

Note: See TracTickets for help on using tickets.
Back to Top