Opened 4 years ago

Last modified 4 years ago

#23557 new Bug

annotate gives different results on postgresql and mysql

Reported by: Brian May Owned by: nobody
Component: Database layer (models, ORM) Version: 1.7
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Hello,

With the following db model:

@python_2_unicode_compatible
class CPUJob(models.Model):
    account = models.ForeignKey(Account, blank=True, null=True)
    username = models.CharField(max_length=50, blank=True, null=True)
    project = models.ForeignKey(Project, null=True, blank=True)
    machine = models.ForeignKey(Machine, blank=True, null=True)
    date = models.DateField(db_index=True, blank=True, null=True)
    queue = models.ForeignKey(Queue, blank=True, null=True)
    cpu_usage = models.BigIntegerField(blank=True, null=True)
    mem = models.BigIntegerField(blank=True, null=True)
    vmem = models.BigIntegerField(blank=True, null=True)
    ctime = models.DateTimeField(blank=True, null=True)
    qtime = models.DateTimeField(blank=True, null=True)
    etime = models.DateTimeField(blank=True, null=True)
    start = models.DateTimeField(blank=True, null=True)
    act_wall_time = models.BigIntegerField(blank=True, null=True)
    est_wall_time = models.BigIntegerField(blank=True, null=True)
    jobid = models.CharField(max_length=50, blank=True, null=True, unique=True)
    cores = models.BigIntegerField(blank=True, null=True)
    list_mem = models.BigIntegerField(blank=True, null=True)
    list_pmem = models.BigIntegerField(blank=True, null=True)
    list_vmem = models.BigIntegerField(blank=True, null=True)
    list_pvmem = models.BigIntegerField(blank=True, null=True)
    exit_status = models.BigIntegerField(blank=True, null=True)
    jobname = models.CharField(max_length=256, blank=True, null=True)
    software = models.ManyToManyField(SoftwareVersion, blank=True, null=True)

    class Meta:                                                                  
        ordering = ['-date']                                                     
        db_table = 'cpu_job'                                                     

The following on mysql produces a good result:

q = CPUJob.objects.values('project').annotate(usage=Sum('cpu_usage'), jobs=Count('id'))
print q.query

of

SELECT `cpu_job`.`project_id`, SUM(`cpu_job`.`cpu_usage`) AS `usage`, COUNT(`cpu_job`.`id`) AS `jobs` FROM `cpu_job` GROUP BY `cpu_job`.`project_id` ORDER BY `cpu_job`.`date` DESC

However on Postgresql, with the same data, I get the following query:

SELECT "cpu_job"."project_id", SUM("cpu_job"."cpu_usage") AS "usage", COUNT("cpu_job"."id") AS "jobs" FROM "cpu_job" GROUP BY "cpu_job"."project_id", "cpu_job"."date" ORDER BY "cpu_job"."date" DESC

Note additional term "cpu_job"."date" in the GROUP BY. I did not ask for it, it got put there.

I suspect the problem is the sort order on the table. The SQL seems to be fine on mysql (really?????), but appears to be invalid on Postgresql (this actually makes more sense to me). So Django appears to be silently adjusting the request to make it valid on Postgresql.

karaage=> SELECT "cpu_job"."project_id", SUM("cpu_job"."cpu_usage") AS "usage", COUNT("cpu_job"."id") AS "jobs" FROM "cpu_job" GROUP BY "cpu_job"."project_id" ORDER BY "cpu_job"."date" DESC
;
ERROR:  column "cpu_job.date" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...cpu_job" GROUP BY "cpu_job"."project_id" ORDER BY "cpu_job"....

I would much rather Django gives an error under Postgresql (and maybe even MYSQL too) rather then silently changing the query, and giving different results to what I expected.

(for this particular query, I don't need sorting, and had not noticed it would cause problems - specifying order_by() seems to solve this problem).

Thanks

Change History (7)

comment:1 Changed 4 years ago by Josh Smeaton

The behaviour you're seeing with Postgres is correct, and documented: https://docs.djangoproject.com/en/dev/topics/db/aggregation/#interaction-with-default-ordering-or-order-by

Columns in the ORDER BY clause must be added to the GROUP BY clause on Postgres, and any RDBMS that conforms to the spec. MySQL and sqlite allow columns in the select list that aren't also in the GROUP BY. Django should be consistent though - it's weird that different results are returned based on the underlying engine.

I would propose that the MySQL backend should add the order by columns into the group by list.

http://dev.mysql.com/doc/refman/5.0/en/group-by-extensions.html describes the logic behind not requiring columns in the group by statement, but calls out that queries are non-deterministic unless the "free" column is unique for the group. This is a foot gun as far as I'm concerned, unless you *really* know what you're doing. I think it applies more to hand crafted queries rather than queries generated from an ORM.

"Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses." So ORDER BY has no effect on the results, unless the column is returned as part of the query.

Last edited 4 years ago by Josh Smeaton (previous) (diff)

comment:2 Changed 4 years ago by Josh Smeaton

Additionally, it may be worth talking about removing the non-standard behaviour from mysql in django by setting the session to "ONLY_FULL_GROUP_BY": http://dev.mysql.com/doc/refman/5.0/en/sql-mode.html#sqlmode_only_full_group_by but that may be backwards incompatible.

comment:3 Changed 4 years ago by Josh Smeaton

Triage Stage: UnreviewedAccepted

comment:4 Changed 4 years ago by Shai Berger

It should be noted that "full" group-by causes significant performance degradations on mysql; this is related to #17144.

In other respects, @jarshwah's comments are mostly correct -- except for one point, and that is that the SQL spec is retarded, and an ORM can know when it's safe to group by a subset of the retrieved fields (which are unique for the group).

I suspect the only way to resolve this without breaking stuff is to close this ticket as "wontfix".

comment:5 Changed 4 years ago by Brian May

My personal opinion is that Django should never alter silently alter queries. If there is something wrong with the query, it should generate an error rather then attempt to fix the problem by altering the query. The fact the current Postgresql behaviour is documented doesn't make it any more correct IMHO. The programmer should be forced to explicitly state the preferred solution rather then Django guess it.

I would suggest that Django has some sort of legacy mode, enabled by default, for preserving the current behaviour, and a new mode where errors such as these trigger an error condition. That way any problems can be identified and fixed, rather then silently await some change (e.g. different database, or somebody deciding to add an ordering field where there was none), and nobody noticing that the results generated are now wrong.

comment:6 Changed 4 years ago by Josh Smeaton

I would be in favour of clearing out the ordering when using a .values call, if an entry in values does not explicitly reference the ordered column. I think this would neatly solve the issue. Unfortunately, I think that'd be backwards incompatible. A flag could be used, but I'm not sure it'd be very nice. It could also be retired after a deprecation period where the behaviour was swapped.

Meta:
    ordering = ['col']
    values_respects_ordering = False # defaults to True

But then I'm sure someone else will say django is silently altering the query by not respecting the Meta.ordering once the behaviour is swapped.

I wish Meta.ordering didn't exist.

comment:7 Changed 4 years ago by Anssi Kääriäinen

I am with brian here - I think Django's way to calculate the GROUP BY is complex enough without the GROUP BY clause being altered by .order_by() calls.

In my opinion Django's group by logic should be as follows:

  1. Group by the primary key of the table in the query by default. The group by will need to be appended by functionally dependent columns. This means that we add any column from the same table to the group by, or any column from any table pointed by direct foreign key or reverse one to one field to the group by (there are some database specific optimizations we could use here).
  2. The above rule will be overridden when using .values() queries. In that case, group by the user's defined .values() call. Possibly allow extending it by functionally dependent columns. That is, if some table's primary key is in the values() list, then allow extension from that primary key along foreign keys and o2ofields similarly to 1).
  3. If any other column is needed in the group by, then error out. This means that: CPUJob.objects.values('project').annotate(usage=Sum('cpu_usage'), jobs=Count('id')).order_by('date') is an error.

The functionally dependent columns appending means that we need to do a group by author.id, author.name on some databases even though technically group by author.id gives the same result. Similarly, we need to group by author.id, author.name, book.id, book.name when .select_related('favorite_book') is applied, where favorite_book is a foreign key from author to book.

The main point here is that we really shouldn't alter the results because .order_by() was added to the query. Just error out in that case. Resist the temptation to guess, explicit is better than implicit and so on...

I think we can deprecate the current behavior of silently altering the group by. Checking if a column is functionally dependent on some subset of columns in the group by isn't exactly easy, but should be doable at least for primary key case. Multi-column unique indexes are going to be a bit harder to support...

Last edited 4 years ago by Anssi Kääriäinen (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top