Opened 3 years ago
Closed 3 years ago
#34393 closed Uncategorized (invalid)
A filter query returns more items than the original queryset provides after applying INNER JOIN
| Reported by: | Ľuboš Mjachky | Owned by: | nobody |
|---|---|---|---|
| Component: | Database layer (models, ORM) | Version: | 3.2 |
| Severity: | Normal | Keywords: | filter query duplicate distinct |
| Cc: | Triage Stage: | Unreviewed | |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
In our project, we identified that the filter query returns more entries than the number of entries stored in the initial queryset.
The following piece of code is involved:
# qs.count() == 4
scoped_repos = repo_viewset.get_queryset().values_list("pk", flat=True)
filtered_content = qs.filter(repositories__in=scoped_repos)
# filtered_content.count() == 8
The generated query:
SELECT * FROM "rpm_package" INNER JOIN "core_content" ON ("rpm_package"."content_ptr_id" = "core_content"."pulp_id")
INNER JOIN "core_repositorycontent" ON ("core_content"."pulp_id" = "core_repositorycontent"."content_id")
WHERE "core_repositorycontent"."repository_id" IN (c35b7039-2c2c-48e3-8f4f-b0eeabad8af1, ee39a78b-9dd5-4bdf-85d9-eb6406b6ef49)
One of the things being noticed is that the query is constructed with an INNER JOIN clause instead of a LEFT JOIN clause. The core_repositorycontent table contains a lot of duplicates. We believe that this should not be a problem. Adding the distinct() query at the end of the call resolves the issue. See https://github.com/pulp/pulpcore/pull/3642.
The question is whether this is a bug in Django (i.e., a filter query can return more elements than there are in the original queryset) or on our side, and we should restructure the query in a specific way. Any advice is welcome.
This is a documented behavior. You can use
Subquery()to prevent duplicates.