Opened 12 years ago

Closed 11 years ago

Last modified 11 years ago

#18375 closed Bug (fixed)

F() doesn't work as expected across multijoin relations

Reported by: FunkyBob Owned by: Anssi Kääriäinen
Component: Database layer (models, ORM) Version: 1.4
Severity: Release blocker Keywords:
Cc: German M. Bravo Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Ramiro Morales)

When trying to compare a field to another on the same related record, F() will compare against a separate alias of the table, thus not guaranteeing filtering against the same row.

There doesn't appear to be anything in the docs or tests to indicate what the 'correct' behavior is. Also, there's no apparent way to control it.

Using the attached models, I get the following:

(InteractiveConsole)
>>> from sample import models
>>> from django.db.models import Q, F
>>> qset = models.Uber.objects.filter(unter__available__gt=F('unter__used'))
>>> str(qset.query)
'SELECT "sample_uber"."id", "sample_uber"."title" FROM "sample_uber" INNER JOIN "sample_unter" ON ("sample_uber"."id" = "sample_unter"."uber_id") INNER JOIN "sample_unter" T3 ON ("sample_uber"."id" = T3."uber_id") WHERE T3."available" >  "sample_unter"."used"'

The SQL nicely formatted shows:

SELECT "sample_uber"."id", "sample_uber"."title"
FROM "sample_uber"
INNER JOIN "sample_unter" ON ("sample_uber"."id" = "sample_unter"."uber_id")
INNER JOIN "sample_unter" T3 ON ("sample_uber"."id" = T3."uber_id")
WHERE T3."available" >  "sample_unter"."used"

So, the F('unter__used') is using the 'sample_unter' join, whereas the unter__available__gt is using the T3 alias.

Attachments (3)

models.py (238 bytes ) - added by FunkyBob 12 years ago.
ticket_18375_master.diff (4.7 KB ) - added by Anssi Kääriäinen 11 years ago.
ticket_18375_15.diff (4.3 KB ) - added by Anssi Kääriäinen 11 years ago.

Download all attachments as: .zip

Change History (25)

by FunkyBob, 12 years ago

Attachment: models.py added

comment:1 by Anssi Kääriäinen, 12 years ago

Summary: F() doesn't work as expected across relationsF() doesn't work as expected across multijoin relations
Triage Stage: UnreviewedAccepted
Type: UncategorizedBug

To me this seems like a bug, or at least non-wanted behavior. Usually references inside one operation target the same alias, for example filter(unter__available__gt=1, unter__exact=foo) should target the same unter reference, while filter(unter__available__gt=1).filter(unter__exact=foo) should target different references.

If this can be fixed at this point needs at least some consideration. The change could break existing queries, and there wouldn't be any way to get the original behavior back, as you can split one filter clause to two different filter operations.

I have been thinking that it would be useful to be able to annotate the relations in the query. So, you could do:

.annotate(unter_ref='unter').filter(unter_ref__available__gt=F('unter_ref__used'))
OR
.annotate(unter_ref1='unter', unter_ref2='unter').filter(unter_ref1__available__gt=F('unter_ref2__used'))

The first would generate the query you are looking for, the second would generate what you get now. However there is a long way to get to the point where you can use the annotate() like in the example.

I am marking this as accepted. The reasoning is that the reference under the same .filter() call should target the same relation, and that the current behavior doesn't seem that useful. Backwards compatibility could turn this decision however.

For the time being I can't figure out a way to generate the query you want without using qs.extra().

comment:2 by FunkyBob, 12 years ago

A thought that came to me was a new keyword argument for F() to suggest it try to use an existing join to that table...

comment:3 by Anssi Kääriäinen, 12 years ago

Has patch: set

What happens is that when a filter is added to query, the right hand value (F-obj in this case) is added first. At this point no joins exist. Then, the left hand side is added to the query (the lookup is resolved), but it can't reuse the join generated by the F-obj. At this point a new join is generated -> problems.

Usually the F-obj will reuse any existing join, regardless if the join is generated in the same filter call or not. I am calling this a bug. Actually, this _must_ be a bug. Doing

.filter(unter__available__gt=0, unter__available__gt=F('unter__used'))

the F will target the existing join if the unter__available__gt=0 is evaluated first. If not then it will generate a new join. This means that randomized dict ordering will cause random results from the above query.

See https://github.com/akaariai/django/tree/ticket_18375 for one possible solution for this.

comment:4 by Anssi Kääriäinen, 12 years ago

There might be an easier fix: just pass the reusable joins set from add_filter to the SQLEvaluator, then to prepare and finally to setup_joins. This means just adding the parameter to couple of methods (it can be by default the empty tuple), and things should work.

comment:5 by Anssi Kääriäinen, 11 years ago

I have verified that dict-randomization will randomize the result of the query mentioned in comment:3.

I think we have to go with what FunkyBob suggests in comment:2 - add a "reuse_joins" argument to F() (better name welcome). If we change how F() is added to query without adding a way to define the join-reuse then we will be leaving some users in a hard situation: upgrade to 1.5 breaks their project with no way to fix the breakage.

I suggest we default to "reuse_joins=True", this way every clause under same .filter() call will target the same joins, unless explicitly asked to work otherwise.

I will try to write a patch for this.

comment:6 by Ian Clelland, 11 years ago

akaariai, shouldn't the query in comment:3 fail with a Syntax error? You can't specify the same keyword argument twice. The only way to filter twice on the same keyword is to call .filter() twice, which is well defined (in execution semantics, at least), or include duplicate Q objects (or mix Q objects with a keyword argument). Otherwise, you at least need to use a different operator, to filter twice on the same database field. Either

.filter(Q(unter__available__gt=0), Q(unter__available__gt=F('unter__used')))

or

.filter(Q(unter__available__gte=1, unter__available__gt=F('unter__used'))

should trigger this behaviour, though.

comment:7 by Anssi Kääriäinen, 11 years ago

True, that example isn't valid Python.

The simplest example is something like this:

.filter(unter__id=1, id=F('unter__id'))

where unter is any reverse relation.

Depending on the ordering of the kwargs dict the F() either creates a new join or not. This means the same query will execute differently on different runs if dict randomization is turned on.

comment:8 by Aymeric Augustin, 11 years ago

Since we aim for Python 3.3 support in Django 1.5, this is a release blocker.

comment:9 by Anssi Kääriäinen, 11 years ago

Owner: changed from nobody to Anssi Kääriäinen
Severity: NormalRelease blocker

comment:10 by Anssi Kääriäinen, 11 years ago

Hmmh, I think this one is even more complex than I previously thought - how about

qs.filter(someval=F('revrel__somecol'), revrel__othercol=F('someval'))

If the first kwarg is seen first, then we will create a join for revrel. However, this join is _not_ reusable for the second filter's lookup. If these are seen in the other order, then we first generate revrel for the lookup, then reuse it in F().

I know we can fix the above. But I am wondering if we could just weasel out of this situation: document that use Q() lookups instead if dict ordering might be a problem. Q() lookups do have a defined order, so there is no problem in that case.

I have a work in progress patch at: https://github.com/akaariai/django/compare/ticket_18375

comment:11 by Anssi Kääriäinen, 11 years ago

I rebased the ticket_18375 branch with another attempt. Now I like what it is doing in code, but I still wonder how to document this in release notes. What I have now seems a little complex...

Docs/API design help welcomed!

comment:12 by Nicolas Lara, 11 years ago

I think the documentation for this is overly complicated and requires detailed knowledge of what the SQL translation will be (which the ORM users shouldn't need).

The fix to only use one reference would probably work for 90% of the cases and for the other cases where you need to control exactly how the SQL query would be written you can fallback to raw SQL. I don't see a need for DupeMultisF; this fine grained control belongs in the SQL layer and not the ORM layer. Documenting that this generates a single join in 1.5 should be enough.

comment:13 by Ramiro Morales, 11 years ago

Description: modified (diff)

comment:14 by Preston Holmes, 11 years ago

I'm confused with what you are doing with the dupe_multis value in the patch.

You removed its use in:

https://github.com/django/django/commit/68847135bc9acb2c51c2d36797d0a85395f0cd35

and the only reference to it currently in Django is a forgotten removal of it in the docstring for setup_joins

What would be a use case of multiple references be?

comment:15 by Anssi Kääriäinen, 11 years ago

The patch is stale. The compare view is misleading, as it now compares my current master instead of the master at the time of when I created that patch... Maybe using that compare view isn't such a good idea after all. Updated patch available from here: https://github.com/akaariai/django/commit/bf6f1def617176503f2fc1c1e81c45c4b1b6fff2

As for what is the use case? Something like People.objects.filter(friends__age__gt=F('friends__age') * 2). This of course can't be true for single join, but multijoin case returns any people who have friends with 2x difference in age. Not too realistic example, but to me it seems there could be some real use cases for this.

IMO if we can get good notes about the backwards compatibility hack it doesn't cost us much in code complexity.

comment:16 by Anssi Kääriäinen, 11 years ago

After some thought the backwards compatibility hack is now gone. If somebody complains about the change behavior then lets consider adding it.

So, the change is this: when previously a query like

qs.filter(revfk__somecol=F('revfk__somecol2'))

would have generated two separate joins for the reverse relation, and this:

qs.filter(revfk__somecol__contains='foo', revfk__somecol=F('revfk__somecol2'))

would have generated either 1 or 2 joins depending on dict ordering (if in the order above, then 1, if F() first, then 2).

Now both queries will generate just one join. This is correct when considering the principle "things under one .filter() call target the same m2m join".

by Anssi Kääriäinen, 11 years ago

Attachment: ticket_18375_master.diff added

by Anssi Kääriäinen, 11 years ago

Attachment: ticket_18375_15.diff added

comment:17 by Aymeric Augustin, 11 years ago

I'm not really capable to validate the code, but I fully support the behavior described in comment 16.

comment:18 by Anssi Kääriäinen, 11 years ago

The previous patch had a problem where two chained filter calls would misuse joins. This was the failing test case:

   Employee.objects.filter(
            company_ceo_set__num_employees=F('pk')
   ).filter(
            company_ceo_set__num_employees=F('company_ceo_set__num_employees')
   )

Now, first filter adds a join to company_ceo_set. The second filter's F() will reuse any joins, so the join to company_ceo_set is reused. And, finally the second company_ceo_set will reuse the join used by F(), and we have two separate filter() calls reusing the same join.

The updated patch fixes this, and now F() expressions will behave just like other lookups in filter clauses when it comes to join reuse.

Patch available here: https://github.com/akaariai/django/commit/109a2b5598b7fc31e2b1d403e4da3a2a470748eb

This widens the scope of users possibly hit. Any query where there is a pre-existing revfk (or m2m) join and which then gets a F() added targeting the same relation will behave differently. Seems somewhat likely scenario to me.

To me it seems the patched version is doing what https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships tells the behaviour should be.

comment:19 by Anssi Kääriäinen <akaariai@…>, 11 years ago

Resolution: fixed
Status: newclosed

In 90b86291d022a09031d1df397d7aaebc30e435f7:

Fixed #18375 -- Removed dict-ordering dependency for F-expressions

F() expressions reuse joins like any lookup in a .filter() call -
reuse multijoins generated in the same .filter() call else generate
new joins. Also, lookups can now reuse joins generated by F().

This change is backwards incompatible, but it is required to prevent
dict randomization from generating different queries depending on
.filter() kwarg ordering. The new way is also more consistent in how
joins are reused.

comment:20 by Anssi Kääriäinen <akaariai@…>, 11 years ago

In 90c7aa074095311862b71f3b4ee7220369785375:

[1.5.x] Fixed #18375 -- Removed dict-ordering dependency for F-expressions

F() expressions reuse joins like any lookup in a .filter() call -
reuse multijoins generated in the same .filter() call else generate
new joins. Also, lookups can now reuse joins generated by F().

This change is backwards incompatible, but it is required to prevent
dict randomization from generating different queries depending on
.filter() kwarg ordering. The new way is also more consistent in how
joins are reused.

Backpatch of 90b86291d022a09031d1df397d7aaebc30e435f7

comment:21 by Anssi Kääriäinen, 11 years ago

So, the final decision made was to just alter the behaviour of F() without a backwards compatibility escape.

I am not too worried about the changed behaviour itself. The queries hit are going to be rare. The bigger problem is that if somebody had a correctly working query using the old behaviour there is simply no way to get the old behaviour back without resorting to raw SQL.

It will be really easy to add a backwards compatibility F if needed (or a backwards compat flag to F()). See: https://github.com/akaariai/django/commit/6f50382b099410346f5305ea06cec93d6d006941

So, lets wait and see if somebody will complain about the change. If so, then lets add and document the backwards compatibility escape for those who need it. We can do this even in a point release in my opinion. Adding it just in case is also an option but if we add & document it, then we need to support it...

comment:22 by German M. Bravo, 11 years ago

Cc: German M. Bravo added

I would definitely go for an annotate(), for the akaariai's example:

People.objects.filter(friends__age__gt=F('friends__age') * 2)

This, after the fix, makes just one join.

If one, however, wants the friends with 2x the age (two separate joins), it'd need annotate() , like so:

People.objects.annotate(twice_age=F('friends__age') * 2).filter(friends__age__gt=F('twice_age'))

How far is django from such behavior? ...and also, it doesn't still feel entirely good (annotate always creating new joins for such cases), since one might actually want to get the reused join... I would probably suggest the order of the calling of annotate to change things (if putting annotate after the filter, it'd reuse the joins from the previous filter, otherwise it would start afresh). This way, the next query would be equivalent to the first one:

People.objects.filter(friends__age__gt=F('twice_age')).annotate(twice_age=F('friends__age') * 2)

Provided one could use F() of not yet "annotated" things.

Any thoughts regarding this?

Last edited 11 years ago by German M. Bravo (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top