Opened 9 years ago

Closed 5 years ago

Last modified 8 months ago

#1807 closed enhancement (wontfix)

[patch] union and intersection operating on lists

Reported by: mir@… Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: minor Keywords: union performance and or ORM QuerySet
Cc: Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This patch provides list_union and list_intersection functions that evaluate over a list of QuerySets. I'll add documentation if needed. The tests should demonstrate usage.

Please note that currently the last test fails due to the bug described in ticket #1801

Attachments (3)

list_union.diff (4.8 KB) - added by mir@… 9 years ago.
patch
list_union.2.diff (4.7 KB) - added by mir@… 9 years ago.
patch (replaces other patch, only removed 2 lines which confuse trac)
list_union.3.diff (4.4 KB) - added by mir@… 9 years ago.
another patch, replaces other patches

Download all attachments as: .zip

Change History (11)

Changed 9 years ago by mir@…

patch

Changed 9 years ago by mir@…

patch (replaces other patch, only removed 2 lines which confuse trac)

Changed 9 years ago by mir@…

another patch, replaces other patches

comment:1 Changed 9 years ago by mir@…

  • Summary changed from union and intersection operating on lists to [patch] union and intersection operating on lists

Sorry, I was having difficulties with trac understanding my diffs. Take the last patch. Please additionally create the init.py file for the tests, I don't know how patch/trac would both understand that there's a new empty file.

comment:2 Changed 8 years ago by Simon G. <dev@…>

  • Component changed from Admin interface to Database wrapper
  • Triage Stage changed from Unreviewed to Design decision needed

comment:3 Changed 7 years ago by jacob

  • Resolution set to wontfix
  • Status changed from new to closed

You can already do and/or (in the queryset-refactor branch, at least); this would just be a redundant way of doing that.

comment:4 Changed 6 years ago by conchas

Hi Jacob,

I think it's not just a redundant way of doing and/or. In some cases it's really necessary.
1) Imagine you need to sort the results of a query set in a way that depends on complex params (where clause). The results that match exact text comparison come first and the ones the match a LIKE comparison come after. How would you do that without a UNION?

2) When different aggregation criterias are needed in the queries of a union.

In some cases UNION and INTERSECTIONS are necessary and cannot be re-written using only AND and OR. Maybe it's worth to rethink this decision, isn't it?

comment:5 Changed 5 years ago by verveguy

  • Keywords union performance and or ORM QuerySet added
  • milestone set to 1.2
  • Resolution wontfix deleted
  • Status changed from closed to reopened

So, I believe I have found a case where the standard behavior of the Django QuerySet code produces a highly suboptimal result due to lack of UNION over simple AND and OR

The default query here takes 2mins to execute on MySQL 5.0 whereas the variant using UNION takes 0.17 seconds

My model is a little complex perhaps, but not so unusual. In fact, it just so happens to be the underlying model of the JIRA issue tracking product from Atlassian which we're front-ending using Django.

Given a datamodel with four tables all revolving around the table Issue that has user-defined custom properties and tracks changes to these properties in an audit trail:

Issue

Customfield - defines "properties" of Issue that aren't columns in the schema
Customfieldvalue - provides *current* instance values of a given Customfield for a given Issue

Changegroup - tracks a group of related Changeitems against the "properties" of an Issue made at a time in the past
Changeitem - an individual change of a property from value / to value

Given a problem space where a single Customfield called "Iteration' has been defined against Issues,

Goal is to build a Django query that will find "all Issues where Iteration is currently in [X,Y] or *has ever been* in [X,Y] in the past"

(Naive?) approach:

currentIssues = models.Jiraissue.objects.distinct().filter(
        customfieldvalues__customfield__cfname="Iteration", customfieldvalues__stringvalue__in=[X,Y])

oldIssues = models.Jiraissue.objects.distinct().filter(
        Q(  Q(changegroups__items__oldstring__in=[X,Y]]) 
                    | Q(changegroups__items__newstring__in=[X,Y]),
               changegroups__items__field='Iteration'))

issues = currentIssues | oldIssues

The resulting query looks like this:

SELECT DISTINCT `jiraissue`.`PROJECT`, `jiraissue`.`ASSIGNEE`, `jiraissue`.`REPORTER`, `jiraissue`.`ID`,
`jiraissue`.`pkey`, `jiraissue`.`issuetype`, `jiraissue`.`SUMMARY`, `jiraissue`.`DESCRIPTION`,
`jiraissue`.`ENVIRONMENT`, `jiraissue`.`PRIORITY`, `jiraissue`.`RESOLUTION`, `jiraissue`.`issuestatus`,
`jiraissue`.`CREATED`, `jiraissue`.`UPDATED`, `jiraissue`.`DUEDATE`, `jiraissue`.`RESOLUTIONDATE`,
`jiraissue`.`VOTES`, `jiraissue`.`TIMEORIGINALESTIMATE`, `jiraissue`.`TIMEESTIMATE`, `jiraissue`.`TIMESPENT`,
`jiraissue`.`WORKFLOW_ID`, `jiraissue`.`SECURITY`, `jiraissue`.`FIXFOR`, `jiraissue`.`COMPONENT`
FROM `jiraissue`
INNER JOIN `customfieldvalue` ON (`jiraissue`.`ID` = `customfieldvalue`.`ISSUE`)
LEFT OUTER JOIN `customfield` ON (`customfieldvalue`.`CUSTOMFIELD` = `customfield`.`ID`)
LEFT OUTER JOIN `changegroup` ON (`jiraissue`.`ID` = `changegroup`.`ISSUEID`)
LEFT OUTER JOIN `changeitem` ON (`changegroup`.`ID` = `changeitem`.`GROUPID`)
WHERE ((`customfield`.`cfname` = 'Iteration'  AND `customfieldvalue`.`STRINGVALUE` IN (X, Y))
OR ((`changeitem`.`OLDSTRING` IN (X, Y) OR `changeitem`.`NEWSTRING` IN (X, Y)) AND `changeitem`.`FIELD` = 'Iteration' ))


The hand-crafted alternative using UNION on two subqueries which is *much* faster is this:

SELECT DISTINCT ji.PROJECT, ji.ASSIGNEE, ji.REPORTER, ji.ID,
    ji.pkey, ji.issuetype, ji.SUMMARY, ji.DESCRIPTION, ji.ENVIRONMENT,
    ji.PRIORITY, ji.RESOLUTION, ji.issuestatus, ji.CREATED, ji.UPDATED,
    ji.DUEDATE, ji.RESOLUTIONDATE, ji.VOTES, ji.TIMEORIGINALESTIMATE,
    ji.TIMEESTIMATE, ji.TIMESPENT, ji.WORKFLOW_ID, ji.SECURITY, ji.FIXFOR,
    ji.COMPONENT
FROM jiraissue AS ji
JOIN (
    SELECT cg.ISSUEID AS ISSUEID
      FROM changegroup AS cg
      JOIN changeitem AS ci ON (cg.ID = ci.GROUPID)
     WHERE ci.FIELD = 'Iteration'
       AND (ci.OLDSTRING = 'March 17, 2010' OR ci.NEWSTRING IN (X,Y))
    UNION
    SELECT cfv.ISSUE AS ISSUEID
      FROM customfieldvalue AS cfv
      JOIN customfield AS cf ON (cfv.CUSTOMFIELD = cf.ID)
     WHERE cf.cfname = 'Iteration'  AND cfv.STRINGVALUE IN (X,Y)
) AS subq ON (ji.ID = subq.ISSUEID);

The *workaround* I've found with Django is to evaluate each query by itself and then use Python sets to union the resulting Jiraissue objects.

        results = set(currentIssues.all())
        results.update(oldIssues.all())

Not entirely satisfactory since I've lost the benefits of an unevaluated QuerySet as soon as I do so.

But in this case, performance matters more.

Please reconsider the introduction of UNION as a distinct operation from AND and OR

comment:6 Changed 5 years ago by ubernostrum

  • milestone 1.2 deleted
  • Resolution set to wontfix
  • Status changed from reopened to closed

Several problems here.

One is that the ticket was closed wontfix by one of Django's lead developers; in that case re-opening the ticket is the wrong action. Instead, bring it up on the django-developers list to get feedback and see if there's support for reopening.

Another is that this is ostensibly a feature request, and Django 1.2 is feature-frozen, so no matter what this can't go on the 1.2 milestone.

comment:7 Changed 8 months ago by jdufresne

  • Easy pickings unset
  • UI/UX unset

This has been a pain point for me when moving from an application with lots to raw queries to using the Django ORM.

I agree with verveguy and routinely run into suboptimal queries that would be much much faster with support for UNION. Due to abysmal performance, I am left implementing a UNION type feature in Python, but I would prefer to use the ORM and DB optimizations to handle this properly. I would appreciate if this feature request would be reconsidered.

Last edited 8 months ago by jdufresne (previous) (diff)

comment:8 Changed 8 months ago by charettes

@judresne

Does the following works for you?

from django.db import router

def union(model, querysets):
   queries = []
   parameters = []
   db = router.db_for_read(model)

   for queryset in querysets:
       assert queryset.model == model
       query, params = queryset.query.get_compiler(db).as_sql()
       queries.append(query)
       parameters.extend(params)

   return model._default_manager.raw(' UNION '.join(queries), parameters)
Note: See TracTickets for help on using tickets.
Back to Top