Opened 12 years ago

Closed 11 years ago

Last modified 17 months ago

#17144 closed Bug (fixed)

Excessive GROUP BY clauses during aggregations.

Reported by: christian_oudard Owned by: Anssi Kääriäinen
Component: Database layer (models, ORM) Version: 1.3
Severity: Normal Keywords:
Cc: Michael Curry, kmike84@…, YenTheFirst, baryshev@…, anssi.kaariainen@…, hcarvalhoalves@…, dev@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Doing a simple aggregation using the Django ORM on MySQL, it is producing a GROUP BY clause that includes an unnecessary field. In the particular case where I discovered this, the unnecessary field is very large, and was slowing down the query by 100-fold.

Here is a simplified version of the model:

class Document(models.Model): 
    data = models.TextField() # Very large field

class Attachment(models.Model): 
    document = models.ForeignKey(Document) 

And the query I am running:

Document.objects.annotate(num_attachments=Count('attachment')) 

And the SQL output:

SELECT 
  `document_document`.`id`, 
  `document_document`.`data`, 
  COUNT(`document_attachment`.`id`) AS `num_attachments` 
FROM `document_document` 
  LEFT OUTER JOIN `document_attachment` 
    ON (`document_document`.`id` = 
`document_attachment`.`document_id`) 
GROUP BY 
  `document_document`.`id`, 
  `document_document`.`id`, 
  `document_document`.`data` 
ORDER BY NULL 

Doing GROUP BY on the data field is unnecessary, and it did not do this in Django 1.2.

From the Django mailing list:

Karen Tracey On Tue, Nov 1, 2011 at 6:19 PM

The SQL generated by the ORM for this query changed between Django version
1.2 and 1.3. The 1.2 SQL did a group by only on the id field. With 1.3
we're getting id twice and then all other fields in the model. Bisection
shows the change was made with r14715:
https://code.djangoproject.com/changeset/14715
It certainly looks to me like the old SQL was correct and preferable for
this particular case. In a brief search I did not find a ticket reporting
this issue -- could you open one?

Attachments (2)

group_by_bug_regressiontest.patch (1.6 KB ) - added by christian_oudard 12 years ago.
Regression test for the issue
17144.diff (8.1 KB ) - added by Anssi Kääriäinen 12 years ago.

Download all attachments as: .zip

Change History (27)

comment:1 by Alex Gaynor, 12 years ago

Triage Stage: UnreviewedAccepted

The actual source of this issue is that id is listed twice, which defeats the optimization which removes unused fields. Find why that happened and you'll have your fix.

comment:2 by christian_oudard, 12 years ago

Made a regression test reproducing the issue, and attached a patch.

The issue with the duplicated id is actually fixed in 1.4dev, so I'm not sure whether that's still affecting it.

The problem seems to happen because self.query.select is being added to the columns in compiler.get_grouping(). I don't want to remove it though, because it might break the postgres fixes from r14715.

Does anyone have any insight into how best to fix this?

Last edited 12 years ago by christian_oudard (previous) (diff)

by christian_oudard, 12 years ago

Regression test for the issue

comment:3 by Steve Bussetti, 12 years ago

This is a larger issue that merely the double inclusion of the pk. For any query using annotation, ALL fields from query.select are listed in the groupby clause, even when it would only be appropriate for the grouping to only occur on the primary key. For mysql this basically guarantees that all annotated queries will use the worst possible "Using temporary; Using filesort" method to evaluate the sql.

comment:4 by Michael Curry, 12 years ago

Cc: Michael Curry added

comment:5 by Mikhail Korobov, 12 years ago

Cc: kmike84@… added

comment:6 by YenTheFirst, 12 years ago

Cc: YenTheFirst added

comment:7 by anonymous, 12 years ago

@sbussetti is absolutely right. I can't do half the things I need to do with the ORM once I need annotation. The issue of group by needs to be revisited, and perhaps redesigned. This is a major caveat in my day to day work.

in reply to:  7 comment:8 by anonymous, 12 years ago

Replying to anonymous:

@sbussetti is absolutely right. I can't do half the things I need to do with the ORM once I need annotation. The issue of group by needs to be revisited, and perhaps redesigned. This is a major caveat in my day to day work.

Agreed - this is also preventing me from doing most of the queries I need to do - a huge road block in my work.

comment:9 by Aymeric Augustin, 12 years ago

Owner: changed from nobody to Aymeric Augustin

I've randomly stumbled upon this issue this week.

comment:10 by Ilya Baryshev <baryshev@…>, 12 years ago

Cc: baryshev@… added

I confirm this regression, .annotate() does add unnecessary fields to GROUP BY clause, when .values() is not used. On large datasets this bug results in a bad query performance.

comment:11 by anonymous, 12 years ago

this bug is really annoying for many projects. can this be patched or does it need some deep refactoring ?

what can i do to help ?

i can try to play with https://code.djangoproject.com/changeset/14715 but i dont have (yet) any idea of underlying behaviour

comment:12 by Anssi Kääriäinen, 12 years ago

Cc: anssi.kaariainen@… added

I will try to take a look. BTW I think PostgreSQL 9.1 onwards has support for grouping by just PK fields, so this is of interest for users of that DB, too.

comment:13 by Anssi Kääriäinen, 12 years ago

Has patch: set
Owner: changed from Aymeric Augustin to Anssi Kääriäinen

I think the attached patch fixes this issue. The aggregation + aggregation_regress tests pass on sqlite, pgsql and mysql. Full test suite pass on sqlite.

I will leave my computer running the full test suite on mysql and postgresql, and see if there is any surprising breakages.

The patch does a general cleanup of compiler.py get_grouping. In addition it has a small cleanup to sql/query.py, removing an unused variable (I think the quote_cache was moved to compiler.py, but it wasn't removed from query.py).

Adding group_by PK support to PostgreSQL is harder: for PostgreSQL every table which is used in the select, order by or where clauses needs to have its PK added to the GROUP BY. For MySQL it is enough to add the main table's PK. This needs a somewhat larger restructuring of query.py.

comment:14 by revolunet <julien@…>, 12 years ago

Hi akaariai, thanks for looking at this nasty bug;

Just applied the patch on a fresh 1.4, i'm testing with sqlite and still have a strange behaviour here.

results = mymodels.Book.objects.annotate(models.Count('category'))

All the model columns are still present in the group_by so i think it still fail; did i miss something ?

thanks!

comment:15 by Anssi Kääriäinen, 12 years ago

Only MySQL supports grouping by primary key. The SQL specification actually requires all columns in the select list to appear in the group by clause. MySQL just skips this requirement, and there is something similar in PostgreSQL 9.1+.

So, to test the patch you need to be using MySQL. I hope to integrate support for PostgreSQL 9.1+ too, but that will be done in separate ticket.

Last edited 12 years ago by Anssi Kääriäinen (previous) (diff)

by Anssi Kääriäinen, 12 years ago

Attachment: 17144.diff added

comment:16 by Anssi Kääriäinen, 12 years ago

The patch now passes the full test suite on mysql, sqlite and pgsql. I don't have Oracle installation available, so can't test that one.

If MySQL users could test & report if this fixes the performance problems it would help a lot.

The patch was updated slightly, the change is still allowing directly adding strings to query.group_by (an internal variable). One test relied on that, and keeping that ability available costs practically nothing. So, the updated patch has that ability restored.

comment:17 by Aymeric Augustin, 12 years ago

The aggregation_regress tests pass under Oracle with this patch:

Running tests...
----------------------------------------------------------------------
..s....s....................s..
----------------------------------------------------------------------
Ran 31 tests in 12.253s

OK

However, if the changes to django.db.models are reverted, test_aggregate_duplicate_columns doesn't fail:

----------------------------------------------------------------------
..s....s..............E.....s..
======================================================================
ERROR [0.040s]: test_more_more (regressiontests.aggregation_regress.tests.AggregationTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/jenkins/jobs/Django trunk + Oracle/workspace/tests/regressiontests/aggregation_regress/tests.py", line 472, in test_more_more
    grouping, gb_params = qs.query.get_compiler(qs.db).get_grouping([])
TypeError: get_grouping() takes exactly 1 argument (2 given)

----------------------------------------------------------------------
Ran 31 tests in 11.787s

FAILED (errors=1)

(It's another test that fails.)

So it appears that the regression test doesn't validate the fix, at least under Oracle.

comment:18 by Anssi Kääriäinen, 12 years ago

The reason for non-failure is that the new test is ran only when the backend supports grouping by primary key, that is, only on MySQL. That another test fails because sql/compiler.py's get_grouping signature was changed.

So, I think Oracle is OK at least for aggregation regress which is good news. That is the most likely place to break.

comment:19 by Anssi Kääriäinen, 12 years ago

I think I will be pushing this ticket to trunk next week. 1.4 will not get this back-patched, as I feel the risk of breaking something is too big.

Reviews welcome! If you want to review this, but don't have time right now just ask me to postpone the commit.

I will separate the sql/query.py quote_cache removal (unused variable) to different commit.

comment:20 by Henrique C. Alves, 11 years ago

Cc: hcarvalhoalves@… added

comment:21 by Anssi Kääriäinen <akaariai@…>, 11 years ago

Resolution: fixed
Status: newclosed

In cafb266954e21dd55ddfa90597bcf02c022bcb7d:

Fixed #17144 -- MySQL again groups by PK only

Thanks to Christian Oudard for the report and tests.

comment:22 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In 908226cf1acb6c4029d73318a0aaa923f58c9157:

[1.5.x] Fixed #17144 -- MySQL again groups by PK only

Thanks to Christian Oudard for the report and tests.

Backpatch of [cafb266954e21dd55ddfa90597bcf02c022bcb7d]

Conflicts:

django/db/models/sql/compiler.py

comment:23 by Anssi Kääriäinen, 11 years ago

Component: ORM aggregationDatabase layer (models, ORM)

comment:24 by Ryan Kaskel, 11 years ago

Cc: dev@… added

comment:25 by Mariusz Felisiak <felisiak.mariusz@…>, 17 months ago

In 5f09ab8:

Refs #17144 -- Removed support for grouping by primary key.

No core backend require the feature anymore as it was only added to
support a MySQL'ism that has been deprecated since then.

Note: See TracTickets for help on using tickets.
Back to Top