#1807 closed enhancement (wontfix)
[patch] union and intersection operating on lists
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | minor | Keywords: | union performance and or ORM QuerySet |
Cc: | Triage Stage: | Design decision needed | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
This patch provides list_union and list_intersection functions that evaluate over a list of QuerySets. I'll add documentation if needed. The tests should demonstrate usage.
Please note that currently the last test fails due to the bug described in ticket #1801
Attachments (3)
Change History (11)
by , 19 years ago
Attachment: | list_union.diff added |
---|
by , 19 years ago
Attachment: | list_union.2.diff added |
---|
patch (replaces other patch, only removed 2 lines which confuse trac)
comment:1 by , 19 years ago
Summary: | union and intersection operating on lists → [patch] union and intersection operating on lists |
---|
Sorry, I was having difficulties with trac understanding my diffs. Take the last patch. Please additionally create the init.py file for the tests, I don't know how patch/trac would both understand that there's a new empty file.
comment:2 by , 18 years ago
Component: | Admin interface → Database wrapper |
---|---|
Triage Stage: | Unreviewed → Design decision needed |
comment:3 by , 17 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
You can already do and/or (in the queryset-refactor branch, at least); this would just be a redundant way of doing that.
comment:4 by , 16 years ago
Hi Jacob,
I think it's not just a redundant way of doing and/or. In some cases it's really necessary.
1) Imagine you need to sort the results of a query set in a way that depends on complex params (where clause). The results that match exact text comparison come first and the ones the match a LIKE comparison come after. How would you do that without a UNION?
2) When different aggregation criterias are needed in the queries of a union.
In some cases UNION and INTERSECTIONS are necessary and cannot be re-written using only AND and OR. Maybe it's worth to rethink this decision, isn't it?
comment:5 by , 15 years ago
Keywords: | union performance and or ORM QuerySet added |
---|---|
milestone: | → 1.2 |
Resolution: | wontfix |
Status: | closed → reopened |
So, I believe I have found a case where the standard behavior of the Django QuerySet code produces a highly suboptimal result due to lack of UNION over simple AND and OR
The default query here takes 2mins to execute on MySQL 5.0 whereas the variant using UNION takes 0.17 seconds
My model is a little complex perhaps, but not so unusual. In fact, it just so happens to be the underlying model of the JIRA issue tracking product from Atlassian which we're front-ending using Django.
Given a datamodel with four tables all revolving around the table Issue that has user-defined custom properties and tracks changes to these properties in an audit trail:
Issue
Customfield - defines "properties" of Issue that aren't columns in the schema
Customfieldvalue - provides *current* instance values of a given Customfield for a given Issue
Changegroup - tracks a group of related Changeitems against the "properties" of an Issue made at a time in the past
Changeitem - an individual change of a property from value / to value
Given a problem space where a single Customfield called "Iteration' has been defined against Issues,
Goal is to build a Django query that will find "all Issues where Iteration is currently in [X,Y] or *has ever been* in [X,Y] in the past"
(Naive?) approach:
currentIssues = models.Jiraissue.objects.distinct().filter( customfieldvalues__customfield__cfname="Iteration", customfieldvalues__stringvalue__in=[X,Y]) oldIssues = models.Jiraissue.objects.distinct().filter( Q( Q(changegroups__items__oldstring__in=[X,Y]]) | Q(changegroups__items__newstring__in=[X,Y]), changegroups__items__field='Iteration')) issues = currentIssues | oldIssues
The resulting query looks like this:
SELECT DISTINCT `jiraissue`.`PROJECT`, `jiraissue`.`ASSIGNEE`, `jiraissue`.`REPORTER`, `jiraissue`.`ID`, `jiraissue`.`pkey`, `jiraissue`.`issuetype`, `jiraissue`.`SUMMARY`, `jiraissue`.`DESCRIPTION`, `jiraissue`.`ENVIRONMENT`, `jiraissue`.`PRIORITY`, `jiraissue`.`RESOLUTION`, `jiraissue`.`issuestatus`, `jiraissue`.`CREATED`, `jiraissue`.`UPDATED`, `jiraissue`.`DUEDATE`, `jiraissue`.`RESOLUTIONDATE`, `jiraissue`.`VOTES`, `jiraissue`.`TIMEORIGINALESTIMATE`, `jiraissue`.`TIMEESTIMATE`, `jiraissue`.`TIMESPENT`, `jiraissue`.`WORKFLOW_ID`, `jiraissue`.`SECURITY`, `jiraissue`.`FIXFOR`, `jiraissue`.`COMPONENT` FROM `jiraissue` INNER JOIN `customfieldvalue` ON (`jiraissue`.`ID` = `customfieldvalue`.`ISSUE`) LEFT OUTER JOIN `customfield` ON (`customfieldvalue`.`CUSTOMFIELD` = `customfield`.`ID`) LEFT OUTER JOIN `changegroup` ON (`jiraissue`.`ID` = `changegroup`.`ISSUEID`) LEFT OUTER JOIN `changeitem` ON (`changegroup`.`ID` = `changeitem`.`GROUPID`) WHERE ((`customfield`.`cfname` = 'Iteration' AND `customfieldvalue`.`STRINGVALUE` IN (X, Y)) OR ((`changeitem`.`OLDSTRING` IN (X, Y) OR `changeitem`.`NEWSTRING` IN (X, Y)) AND `changeitem`.`FIELD` = 'Iteration' ))
The hand-crafted alternative using UNION on two subqueries which is *much* faster is this:
SELECT DISTINCT ji.PROJECT, ji.ASSIGNEE, ji.REPORTER, ji.ID, ji.pkey, ji.issuetype, ji.SUMMARY, ji.DESCRIPTION, ji.ENVIRONMENT, ji.PRIORITY, ji.RESOLUTION, ji.issuestatus, ji.CREATED, ji.UPDATED, ji.DUEDATE, ji.RESOLUTIONDATE, ji.VOTES, ji.TIMEORIGINALESTIMATE, ji.TIMEESTIMATE, ji.TIMESPENT, ji.WORKFLOW_ID, ji.SECURITY, ji.FIXFOR, ji.COMPONENT FROM jiraissue AS ji JOIN ( SELECT cg.ISSUEID AS ISSUEID FROM changegroup AS cg JOIN changeitem AS ci ON (cg.ID = ci.GROUPID) WHERE ci.FIELD = 'Iteration' AND (ci.OLDSTRING = 'March 17, 2010' OR ci.NEWSTRING IN (X,Y)) UNION SELECT cfv.ISSUE AS ISSUEID FROM customfieldvalue AS cfv JOIN customfield AS cf ON (cfv.CUSTOMFIELD = cf.ID) WHERE cf.cfname = 'Iteration' AND cfv.STRINGVALUE IN (X,Y) ) AS subq ON (ji.ID = subq.ISSUEID);
The *workaround* I've found with Django is to evaluate each query by itself and then use Python sets to union the resulting Jiraissue objects.
results = set(currentIssues.all()) results.update(oldIssues.all())
Not entirely satisfactory since I've lost the benefits of an unevaluated QuerySet as soon as I do so.
But in this case, performance matters more.
Please reconsider the introduction of UNION as a distinct operation from AND and OR
comment:6 by , 15 years ago
milestone: | 1.2 |
---|---|
Resolution: | → wontfix |
Status: | reopened → closed |
Several problems here.
One is that the ticket was closed wontfix by one of Django's lead developers; in that case re-opening the ticket is the wrong action. Instead, bring it up on the django-developers list to get feedback and see if there's support for reopening.
Another is that this is ostensibly a feature request, and Django 1.2 is feature-frozen, so no matter what this can't go on the 1.2 milestone.
comment:7 by , 10 years ago
Easy pickings: | unset |
---|---|
UI/UX: | unset |
This has been a pain point for me when moving from an application with lots to raw queries to using the Django ORM.
I agree with verveguy and routinely run into suboptimal queries that would be much much faster with support for UNION
. Due to abysmal performance, I am left implementing a UNION
type feature in Python, but I would prefer to use the ORM and DB optimizations to handle this properly. I would appreciate if this feature request would be reconsidered.
comment:8 by , 10 years ago
@judresne
Does the following works for you?
from django.db import router def union(model, querysets): queries = [] parameters = [] db = router.db_for_read(model) for queryset in querysets: assert queryset.model == model query, params = queryset.query.get_compiler(db).as_sql() queries.append(query) parameters.extend(params) return model._default_manager.raw(' UNION '.join(queries), parameters)
patch