Opened 6 months ago

Last modified 5 months ago

#28107 new Bug

Can't perform annotation on related table when un-managed model is backed by a DB view

Reported by: powderflask Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: QuerySet.extra
Cc: Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by powderflask)

I'm working with a legacy DB (ArcSDE database -- really ugly) -- have been able to accommodate its oddities without resorting much to raw SQL, but django upgrade (1.8 --> 1.11) caused a previously working annotation to fail:
psycopg2.ProgrammingError: :column ... must appear in the GROUP BY clause or be used in an aggregate function

DB: PostgreSQL 9.3.16 (i.e., this is not same issue as #26758 ) Python3.6, Django1.11

The annotation simply counts the number of related records from a related 'attachments' table:
qs.annotate(num_attachments=Count('attachments'))

The root cause appears to be that the relation between an unmanaged model (backed by a DB View) and attachments tables uses a unique field other than the main model's primary key (I know -- told you it was ugly - ArcSDE does not really support relations, except they implement attachments with this odd ball sigh ).
The change in behaviour seems to manifest from #19259 (I believe django1.8 added all fields to the groupby clause).
Since django now includes only the primary key in the groupby clause, postgresql won't do this aggregation across a relation that uses a non-pk field.

I suspect there is a general issue here that aggregations on a related-table won't work in postgresql unless the relation is on the primary key field (so, yeah, basically this issue applies to almost no one, right...).
UPDATE: The root cause is actually that Postgresql treats Views differently than Tables w.r.t. what is required in the group by clause.

Seems likely there is a better solution to this, but after a half-day of search / effort, I resorted to the following:

qs.extra(select={'num_attachments':
            'SELECT COUNT("attachmentid") FROM {attach_table} WHERE {attach_table}.rel_globalid = {model_table}.globalid'.format(
                    model_table  = qs.model._meta.db_table,
                    attach_table = qs.model.AttachmentModel._meta.db_table,
            )},)

This works and achieves my goal -- to annotate model with the number of related attachments.
Since the model.attachments.count() query works just fine, I'm considering eliminating this annotation and replacing with a property on the model class, what's one more query... I'm sure there must be an smoother way, but it eludes me...

Since the docs suggested to open a ticket for queries that could not be resolved without resorting to extra(), here it is, for whatever its worth. Hope this hasn't been a complete waste of time for you.

Attachments (1)

issue28107.zip (3.8 KB) - added by powderflask 6 months ago.
small django app with models, migrations, and tests that illustrate the issue

Download all attachments as: .zip

Change History (23)

comment:1 Changed 6 months ago by Simon Charette

Cc: Simon Charette added

It's possible that's an edge case with the handling of to_field.

Would it be possible for your to provide all the models involved in the queryset you're generating as it's really hard to figure out if the issue actually lies in the _GROUP BY_ logic without a set of models to reproduce the issue.

comment:2 Changed 6 months ago by powderflask

The DB and relations are just nasty to set it all up. I'm pretty sure I know what's needed to reproduce this -- I'll try to get something simpler to break in the same way, no sense in you banging your head against the ArcSDE wall.

BTW - edge case is a very polite term for what this really is... After writing this up, I decided to sacrifice the extra query to get the attachment.count() to buy a world of ponies. My code base loves me for it

comment:3 Changed 6 months ago by powderflask

Resolution: worksforme
Status: newclosed

I tried. I cannot reproduce this with a simpler test case. My original suspicion is completely wrong - it works just fine.

I can only reproduce this in the insanely complex DB -- I will continue to try to track down the cause, but for now I'd say this issue is too undefined to work on. Sorry for the trouble, I'm closing this for now.

comment:4 Changed 6 months ago by powderflask

This may have to do with one of the original models drawing data from a View - found this:
" The feature of Postgres to be able to use the primary key of a table with GROUP BY and not need to add the other columns of that table in the GROUP BY clause is relatively new and works only for base tables. The optimizer is not (yet?) clever enough to identify primary keys for views, ctes or derived tables."
https://dba.stackexchange.com/questions/88988/postgres-error-column-must-appear-in-the-group-by-clause-or-be-used-in-an-aggre

I will try to reproduce this.

comment:5 Changed 6 months ago by Simon Charette

powderflask, if this happens to be the source of the issue and you are using un-managed models (_meta.managed = False) maybe we could branch on that to prevent the optimization.

Changed 6 months ago by powderflask

Attachment: issue28107.zip added

small django app with models, migrations, and tests that illustrate the issue

comment:6 Changed 6 months ago by powderflask

Resolution: worksforme
Status: closednew

OK - that's got it. Yep -- this issues occurs when at least one of the models in the aggregation query is un-managed and backed by a DB view rather than a table.

Attached is a zip of a simple django app that runs in a default django container, demonstrates the issue:

1) Demonstrate these weird relations work with models backed by DB tables:

  • INSTALLED_APPS = [ 'issue28107.apps.Issue28107Config', ... ]
  • configure / create postgre DB,
  • manage.py migrate
  • run the unit-test -- it should pass

2) Replace table with view (created by migrations)

  • in models.py, remove 2 comments from Treatment.Meta:
            managed = False
            db_table = 'issue28107_treatment_vw'
    
    
  • re-run unit-test -- it should fail:

column "issue28107_treatment_vw.globalid" must appear in the GROUP BY clause or be used in an aggregate function

This is a backwards-compatibility issue -- this worked at least up to django1.8

I have no idea where to begin with this in terms of suggesting a patch -- any pointers?
thank you for the quick reply and suggestions -- awesome.

comment:7 Changed 6 months ago by powderflask

It occurs to me that if there were some way to force values into the group-by clause, that would serve as a reasonable workaround, given this is bound to be a fairly rare use-case.
I did read some hacks that did this in 1.8, but it looks like query.group_by is now a boolean rather than a list of fields... was pretty ugly anyhow.

But I'm not missing something more obvious here am I -- like an extra() clause or something that could force the offending aggregate fields into the group_by clause?

comment:8 Changed 6 months ago by Simon Charette

Has patch: set
Triage Stage: UnreviewedAccepted
Type: UncategorizedBug
Version: 1.11master

Thanks to your detailed report writing a patch was easy.

PR

comment:9 Changed 6 months ago by powderflask

Should I change the title of this issue to reflect the root cause was a model backed by a View?

comment:10 Changed 6 months ago by Simon Charette

Sure go ahead!

comment:11 Changed 6 months ago by powderflask

Description: modified (diff)
Summary: Can't perform annotation on related table when relation between tables not on primary keyCan't perform annotation on related table when un-managed model is backed by a DB view

comment:12 Changed 6 months ago by Simon Charette

From what I understood you are still on 1.8 but this patch will only be available in a couple of months when 2.0 is released.

To get this fixed on your side in the mean time you'll have to define your own django.db.backends.postgresql.base.DatabaseWrapper subclass but it's not as intimidating as it sounds.

Simply define a DatabaseWrapper subclass in a chosen_module_name.base module that extends the aforementioned class and set it's features_class attribute to a subclass of django.db.backends.postgresql.features.DatabaseFeatures with allows_group_by_selected_pks = False

django.db.backends.postgresql import base, features

class DatabaseWrapper(base.DatabaseWrapper):
    class features_class(features.DatabaseFeatures):
        allows_group_by_selected_pks = False

Once this is done you just have to point the BACKEND key of your DATABASES setting to chosen_module. I suggest you have a look at how django-transaction-hooks did it if you need a more concrete example.

Note that this disable the optimization all together. If you want to keep the optimization for managed models you'll have to define your own SQLCompiler subclass and override the collapse_group_by method which is a bit harder. I suggest you have a look how it's done for the MySQL backend to figure it out.

Last edited 6 months ago by Simon Charette (previous) (diff)

comment:13 Changed 6 months ago by powderflask

Thank you so much Simon - really above the call.

I am migrating to django1.11, which caused me to stumble into this -- I can certainly implement a workaround until we move to 2.0

thanks again for all your help -- much appreciated.

comment:14 Changed 6 months ago by felixxm

Triage Stage: AcceptedReady for checkin

comment:15 Changed 5 months ago by Simon Charette <charette.s@…>

Resolution: fixed
Status: newclosed

In daf2bd3:

Fixed #28107 -- Disabled grouping of selected primary keys for unmanaged models.

The grouping caused an issue with database views as PostgreSQL's query planer
isn't smart enough to introspect primary keys through views. Django doesn't
support database views but documents that unmanaged models should be used to
query them.

Thanks powderflask for the detailed report and investigation.

comment:16 Changed 5 months ago by Claude Paroz

Does this affect #27241 in any way?

comment:17 Changed 5 months ago by Simon Charette

Ahh it's the exact same issue Claude thanks for noticing. Somehow I thought both reports were the same ticket.

comment:18 Changed 5 months ago by Jaap Roes

Resolution: fixed
Status: closednew

This patch introduces a significant behavior change for all uses of unmanaged models (not only those that are backed by a database view) since Django 1.9. I believe the proper solution would be to just add the five lines of code that are posted in #comment:12 to the docs, somewhere near where it says that unmanaged models should be used for database views.

For what it's worth, I reported #27241 and implemented the same workaround, so this patch is unnecessary in my case. As Simon said in the commit message, Django doesn't support database views. Being forced to include a small workaround to do the unsupported seems like a small price to pay.

comment:19 Changed 5 months ago by powderflask

I was able to work-around this by using a Subquery to create the annotation -- willing to draft some documentation to this effect if needed.
Because this creates a backwards-compatibility issue, wondering if there should also be a note in the release notes somewhere if you go this route?

comment:20 Changed 5 months ago by Simon Charette

Jaap Roes,

At this point I'll take the issue to developers mailing list to gather more feedback and give this issue more exposure.

It's true that it could be causing a performance regression for unmanaged models in Django 2.0 relying on aggregation but since we've been documenting such models should be used to interface with views for so long I think there's still a point to be made about the fact the optimization broke the public API.

Let's not forget that pre-Django 1.9 users were using ORM aggregation on PostgreSQL just fine in most cases before the optimization landed and that it's only an issue when a model contains large columns.

comment:21 Changed 5 months ago by Tim Graham

Has patch: unset
Triage Stage: Ready for checkinAccepted

comment:22 Changed 5 months ago by Simon Charette

I just submitted a post on the mailing list to gather feedback. It includes an alternative solution to make turning the optimization on for unmanaged models backed by tables easier. I'd appreciate if you could chime in Jaap Roes.

Note: See TracTickets for help on using tickets.
Back to Top