Code

Opened 2 years ago

Closed 18 months ago

Last modified 8 months ago

#17144 closed Bug (fixed)

Excessive GROUP BY clauses during aggregations.

Reported by: christian_oudard Owned by: akaariai
Component: Database layer (models, ORM) Version: 1.3
Severity: Normal Keywords:
Cc: inactivist, kmike84@…, YenTheFirst, baryshev@…, anssi.kaariainen@…, hcarvalhoalves@…, dev@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Doing a simple aggregation using the Django ORM on MySQL, it is producing a GROUP BY clause that includes an unnecessary field. In the particular case where I discovered this, the unnecessary field is very large, and was slowing down the query by 100-fold.

Here is a simplified version of the model:

class Document(models.Model): 
    data = models.TextField() # Very large field

class Attachment(models.Model): 
    document = models.ForeignKey(Document) 

And the query I am running:

Document.objects.annotate(num_attachments=Count('attachment')) 

And the SQL output:

SELECT 
  `document_document`.`id`, 
  `document_document`.`data`, 
  COUNT(`document_attachment`.`id`) AS `num_attachments` 
FROM `document_document` 
  LEFT OUTER JOIN `document_attachment` 
    ON (`document_document`.`id` = 
`document_attachment`.`document_id`) 
GROUP BY 
  `document_document`.`id`, 
  `document_document`.`id`, 
  `document_document`.`data` 
ORDER BY NULL 

Doing GROUP BY on the data field is unnecessary, and it did not do this in Django 1.2.

From the Django mailing list:

Karen Tracey On Tue, Nov 1, 2011 at 6:19 PM

The SQL generated by the ORM for this query changed between Django version
1.2 and 1.3. The 1.2 SQL did a group by only on the id field. With 1.3
we're getting id twice and then all other fields in the model. Bisection
shows the change was made with r14715:
https://code.djangoproject.com/changeset/14715
It certainly looks to me like the old SQL was correct and preferable for
this particular case. In a brief search I did not find a ticket reporting
this issue -- could you open one?

Attachments (2)

group_by_bug_regressiontest.patch (1.6 KB) - added by christian_oudard 2 years ago.
Regression test for the issue
17144.diff (8.1 KB) - added by akaariai 2 years ago.

Download all attachments as: .zip

Change History (26)

comment:1 Changed 2 years ago by Alex

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

The actual source of this issue is that id is listed twice, which defeats the optimization which removes unused fields. Find why that happened and you'll have your fix.

comment:2 Changed 2 years ago by christian_oudard

Made a regression test reproducing the issue, and attached a patch.

The issue with the duplicated id is actually fixed in 1.4dev, so I'm not sure whether that's still affecting it.

The problem seems to happen because self.query.select is being added to the columns in compiler.get_grouping(). I don't want to remove it though, because it might break the postgres fixes from r14715.

Does anyone have any insight into how best to fix this?

Last edited 2 years ago by christian_oudard (previous) (diff)

Changed 2 years ago by christian_oudard

Regression test for the issue

comment:3 Changed 2 years ago by sbussetti

This is a larger issue that merely the double inclusion of the pk. For any query using annotation, ALL fields from query.select are listed in the groupby clause, even when it would only be appropriate for the grouping to only occur on the primary key. For mysql this basically guarantees that all annotated queries will use the worst possible "Using temporary; Using filesort" method to evaluate the sql.

comment:4 Changed 2 years ago by inactivist

  • Cc inactivist added

comment:5 Changed 2 years ago by kmike

  • Cc kmike84@… added

comment:6 Changed 2 years ago by YenTheFirst

  • Cc YenTheFirst added

comment:7 follow-up: Changed 2 years ago by anonymous

@sbussetti is absolutely right. I can't do half the things I need to do with the ORM once I need annotation. The issue of group by needs to be revisited, and perhaps redesigned. This is a major caveat in my day to day work.

comment:8 in reply to: ↑ 7 Changed 2 years ago by anonymous

Replying to anonymous:

@sbussetti is absolutely right. I can't do half the things I need to do with the ORM once I need annotation. The issue of group by needs to be revisited, and perhaps redesigned. This is a major caveat in my day to day work.

Agreed - this is also preventing me from doing most of the queries I need to do - a huge road block in my work.

comment:9 Changed 2 years ago by aaugustin

  • Owner changed from nobody to aaugustin

I've randomly stumbled upon this issue this week.

comment:10 Changed 2 years ago by Ilya Baryshev <baryshev@…>

  • Cc baryshev@… added

I confirm this regression, .annotate() does add unnecessary fields to GROUP BY clause, when .values() is not used. On large datasets this bug results in a bad query performance.

comment:11 Changed 2 years ago by anonymous

this bug is really annoying for many projects. can this be patched or does it need some deep refactoring ?

what can i do to help ?

i can try to play with https://code.djangoproject.com/changeset/14715 but i dont have (yet) any idea of underlying behaviour

comment:12 Changed 2 years ago by akaariai

  • Cc anssi.kaariainen@… added

I will try to take a look. BTW I think PostgreSQL 9.1 onwards has support for grouping by just PK fields, so this is of interest for users of that DB, too.

comment:13 Changed 2 years ago by akaariai

  • Has patch set
  • Owner changed from aaugustin to akaariai

I think the attached patch fixes this issue. The aggregation + aggregation_regress tests pass on sqlite, pgsql and mysql. Full test suite pass on sqlite.

I will leave my computer running the full test suite on mysql and postgresql, and see if there is any surprising breakages.

The patch does a general cleanup of compiler.py get_grouping. In addition it has a small cleanup to sql/query.py, removing an unused variable (I think the quote_cache was moved to compiler.py, but it wasn't removed from query.py).

Adding group_by PK support to PostgreSQL is harder: for PostgreSQL every table which is used in the select, order by or where clauses needs to have its PK added to the GROUP BY. For MySQL it is enough to add the main table's PK. This needs a somewhat larger restructuring of query.py.

comment:14 Changed 2 years ago by revolunet <julien@…>

Hi akaariai, thanks for looking at this nasty bug;

Just applied the patch on a fresh 1.4, i'm testing with sqlite and still have a strange behaviour here.

results = mymodels.Book.objects.annotate(models.Count('category'))

All the model columns are still present in the group_by so i think it still fail; did i miss something ?

thanks!

comment:15 Changed 2 years ago by akaariai

Only MySQL supports grouping by primary key. The SQL specification actually requires all columns in the select list to appear in the group by clause. MySQL just skips this requirement, and there is something similar in PostgreSQL 9.1+.

So, to test the patch you need to be using MySQL. I hope to integrate support for PostgreSQL 9.1+ too, but that will be done in separate ticket.

Last edited 2 years ago by akaariai (previous) (diff)

Changed 2 years ago by akaariai

comment:16 Changed 2 years ago by akaariai

The patch now passes the full test suite on mysql, sqlite and pgsql. I don't have Oracle installation available, so can't test that one.

If MySQL users could test & report if this fixes the performance problems it would help a lot.

The patch was updated slightly, the change is still allowing directly adding strings to query.group_by (an internal variable). One test relied on that, and keeping that ability available costs practically nothing. So, the updated patch has that ability restored.

comment:17 Changed 2 years ago by aaugustin

The aggregation_regress tests pass under Oracle with this patch:

Running tests...
----------------------------------------------------------------------
..s....s....................s..
----------------------------------------------------------------------
Ran 31 tests in 12.253s

OK

However, if the changes to django.db.models are reverted, test_aggregate_duplicate_columns doesn't fail:

----------------------------------------------------------------------
..s....s..............E.....s..
======================================================================
ERROR [0.040s]: test_more_more (regressiontests.aggregation_regress.tests.AggregationTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/jenkins/jobs/Django trunk + Oracle/workspace/tests/regressiontests/aggregation_regress/tests.py", line 472, in test_more_more
    grouping, gb_params = qs.query.get_compiler(qs.db).get_grouping([])
TypeError: get_grouping() takes exactly 1 argument (2 given)

----------------------------------------------------------------------
Ran 31 tests in 11.787s

FAILED (errors=1)

(It's another test that fails.)

So it appears that the regression test doesn't validate the fix, at least under Oracle.

comment:18 Changed 2 years ago by akaariai

The reason for non-failure is that the new test is ran only when the backend supports grouping by primary key, that is, only on MySQL. That another test fails because sql/compiler.py's get_grouping signature was changed.

So, I think Oracle is OK at least for aggregation regress which is good news. That is the most likely place to break.

comment:19 Changed 2 years ago by akaariai

I think I will be pushing this ticket to trunk next week. 1.4 will not get this back-patched, as I feel the risk of breaking something is too big.

Reviews welcome! If you want to review this, but don't have time right now just ask me to postpone the commit.

I will separate the sql/query.py quote_cache removal (unused variable) to different commit.

comment:20 Changed 18 months ago by hcarvalhoalves

  • Cc hcarvalhoalves@… added

comment:21 Changed 18 months ago by Anssi Kääriäinen <akaariai@…>

  • Resolution set to fixed
  • Status changed from new to closed

In cafb266954e21dd55ddfa90597bcf02c022bcb7d:

Fixed #17144 -- MySQL again groups by PK only

Thanks to Christian Oudard for the report and tests.

comment:22 Changed 18 months ago by Anssi Kääriäinen <akaariai@…>

In 908226cf1acb6c4029d73318a0aaa923f58c9157:

[1.5.x] Fixed #17144 -- MySQL again groups by PK only

Thanks to Christian Oudard for the report and tests.

Backpatch of [cafb266954e21dd55ddfa90597bcf02c022bcb7d]

Conflicts:

django/db/models/sql/compiler.py

comment:23 Changed 14 months ago by akaariai

  • Component changed from ORM aggregation to Database layer (models, ORM)

comment:24 Changed 8 months ago by ryankask

  • Cc dev@… added

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.