Code

Opened 3 years ago

Closed 23 months ago

#16715 closed Bug (fixed)

Wrong JOIN with nested null-able foreign keys

Reported by: sebastian Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: join, values, nested, foreign key, null-able
Cc: sebastian Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Consider the following models:

class Event(models.Model):
    screening = models.ForeignKey('Screening', blank=True, null=True)

class Screening(models.Model):
    movie = models.ForeignKey('Movie')

class Movie(models.Model):
    title = models.CharField(max_length=200)

An Event can optionally include a movie screening, so we set null=True. A Screening is always associated with a Movie, so we have an implicit null=False.

We populate the database with the following instances: an event with a screening, and another event without a screening:

star_wars = Movie.objects.create(title=u'Star Wars')
first_screening = Screening.objects.create(movie=star_wars)
event_with_screening = Event.objects.create(screening=first_screening)
event_without_screening = Event.objects.create(screening=None)

Now consider the following queries and their results:

>>> Event.objects.values('screening__movie__pk')
[{'screening__movie__pk': 1}, {'screening__movie__pk': None}]

>>> Event.objects.values('screening__movie__title')
[{'screening__movie__title': u'Star Wars'}, {'screening__movie__title': None}]

>>> Event.objects.values('screening__movie__pk', 'screening__movie__title')
[{'screening__movie__title': u'Star Wars', 'screening__movie__pk': 1}]

Notice how the event without screening appears in the first two result sets but suddenly disappears in the last query? An inspection of django.db.connection.queries leads to the following surprising observation:

>>> connection.queries[-3]
{'time': '0.001', 'sql': u'SELECT "app_screening"."movie_id" FROM "app_event" LEFT OUTER JOIN "app_screening" ON ("app_event"."screening_id" = "app_screening"."id") LIMIT 21'}

>>> connection.queries[-2]
{'time': '0.000', 'sql': u'SELECT "app_movie"."title" FROM "app_event" LEFT OUTER JOIN "app_screening" ON ("app_event"."screening_id" = "app_screening"."id") LEFT OUTER JOIN "app_movie" ON ("app_screening"."movie_id" = "app_movie"."id") LIMIT 21'}

>>> connection.queries[-1]
{'time': '0.000', 'sql': u'SELECT "app_screening"."movie_id", "app_movie"."title" FROM "app_event" LEFT OUTER JOIN "app_screening" ON ("app_event"."screening_id" = "app_screening"."id") INNER JOIN "app_movie" ON ("app_screening"."movie_id" = "app_movie"."id") LIMIT 21'}

In the first query we JOIN with the screening table only (and don't have to inspect the movie table at all) because we can already tell the result of screening__movie__pk by the referencing column Screening.movie_id. Also, we LEFT JOIN event with screening because the Event.screening field is null-able (the alternative would be an incorrect INNER JOIN). So everything works out all right.

In the second query we want to know about screening__movie__title, so we have to also JOIN with the movie table. Again, everything works out as expected: even though Screening.movie is not null-able, we have to use LEFT JOIN (not INNER JOIN) with the movie table because we do not want to exclude events that don't have a screening. So everything is all right in this case too.

But in the third query, Django unexpectedly changes the JOIN with movie from a LEFT JOIN to an INNER JOIN, and thereby dropping the event without a screening from the result set.

I assume this is a bug in how Django selects whether to use a LEFT JOIN vs. an INNER JOIN in foreign key lookups. The query is constructed as expected for the individual attributes screening__movie__pk and screening__movie__title, but Django seems to get confused when we want to have both values at once.

The same unexpected behavior can be observed with implicit foreign lookups, such as reverse references in one-to-one relations.

Attachments (6)

r16698.patch (633 bytes) - added by sebastian 3 years ago.
wrong-join-test-r16910.patch (3.3 KB) - added by sebastian 3 years ago.
nested-foreign-key-test-r16914.patch (6.3 KB) - added by sebastian 3 years ago.
patch-nested-foreign-keys-with-test-r16923.patch (12.3 KB) - added by sebastian 3 years ago.
16715_alternate.diff (19.6 KB) - added by akaariai 3 years ago.
16715_alternate-redux-r16928.diff (20.2 KB) - added by sebastian 3 years ago.

Download all attachments as: .zip

Change History (31)

Changed 3 years ago by sebastian

comment:1 Changed 3 years ago by sebastian

  • Has patch set
  • Needs documentation unset
  • Needs tests set
  • Patch needs improvement set

I am attaching a patch which might fix the problem. As I am not familiar with the Django code base, I cannot assess any unintended consequences this might have. The problem seems to have its cause in the propagation of LEFT JOINs vs. INNER JOINs. This propagation, or promotion, is done in django/db/models/sql/query.py, method promote_alias of class Query.

According to the doc-string of the promote_alias method, any JOINs that might refer to NULL values are promoted to LEFT JOIN (or LOUTER). But the actual promotion of chains of JOINs seems to happen in the separate method promote_alias_chain which makes sure that after one JOIN has been promoted, all following JOINs are promoted as well.

Unfortunately, this does not work when parts of the JOINs have already been introduced: values('screening__movie__pk', 'screening__movie__title') first creates the JOINs between event and screening, and then (in a separate call to promote_alias_chain) the chain event, screening, movie. As this second call does not promote the first JOIN again, the last JOIN between screening and movie remains an INNER JOIN, with the consequences mentioned above. (As a side note, changing the argument order in the call to values to values('screening__movie__title', 'screening__movie__pk') gives the correct result, since now the first call to promote_alias_chain gets all three joins.)

The patch I'm proposing adds another check to promote_alias to see if the JOIN on the left-hand side of the current JOIN is LEFT OUTER. If so, the current JOIN will become LEFT OUTER as well. This would make the method promote_alias_chain more or less obsolete. Again, I'm not sure at all if this has any unintended consequences, so please, please somebody review this and tell me why it can't work, or tell me why it works. (For instance, might it not be better to use column NULLABLE for this check, and in turn set this column to True on any JOIN that we promote from INNER JOIN to LEFT JOIN? I have no idea.)

Also, if this whole issue is really a bug (and not by design somehow), and the proposed patch or some derivation of it can fix the problem, a test case should be added to make sure something like this is not introduced again.

comment:2 Changed 3 years ago by sebastian

On a related note, here is something that the patch above doesn't fix. Maybe it is related to the original problem, maybe it is entirely unrelated. First we observe that select_related instead of values works as expected, even without the proposed patch:

>>> Event.objects.select_related('screening', 'screening__movie')
[<Event: Event object>, <Event: Event object>]

>>> connection.queries[-1]
{'time': '0.000', 'sql': u'SELECT "app_event"."id", "app_event"."screening_id", "app_screening"."id", "app_screening"."movie_id", "app_movie"."id", "app_movie"."title" FROM "app_event" LEFT OUTER JOIN "app_screening" ON ("app_event"."screening_id" = "app_screening"."id") LEFT OUTER JOIN "app_movie" ON ("app_screening"."movie_id" = "app_movie"."id") LIMIT 21'}

Both joins are LEFT JOINs and both events (with and without screening) get selected.

However, if we change the (forward) foreign key reference Event.screening to an implicit reverse one, such as introduced by a one-to-one relation, e.g. in multi-table inheritance, the wrong JOIN is selected. Let's assume the new model looks like this:

class Event(models.Model):
    pass

class Screening(Event):
    movie = models.ForeignKey('Movie')

class Movie(models.Model):
    title = models.CharField(max_length=200)

We fill it with data:

star_wars = Movie.objects.create(title=u'Star Wars')
event_with_screening = Screening.objects.create(movie=star_wars)
event_without_screening = Event.objects.create()

Now we run the same query as above:

>>> Event.objects.select_related('screening', 'screening__movie')
[<Event: Event object>]

>>> connection.queries[-1]
{'time': '0.000', 'sql': u'SELECT "app_event"."id", "app_screening"."event_ptr_id", "app_screening"."movie_id", "app_movie"."id", "app_movie"."title" FROM "app_event" LEFT OUTER JOIN "app_screening" ON ("app_event"."id" = "app_screening"."event_ptr_id") INNER JOIN "app_movie" ON ("app_screening"."movie_id" = "app_movie"."id") LIMIT 21'}

Again, as in the original report with values, only the event with a screening (or in this case, the event that is also a screening) is selected because screening is JOINed with movie using an INNER JOIN instead of the correct LEFT JOIN.

It seems that in this case, promote_alias is not called at all. So, maybe, changing this method is not the right place to fix the original issue after all … Again, maybe somebody with more insight into the Django codebase can help?

comment:3 Changed 3 years ago by aaugustin

  • Triage Stage changed from Unreviewed to Accepted

comment:4 Changed 3 years ago by sebastian

  • Severity changed from Normal to Release blocker

No comments? Let me reiterate why I think this is a critical bug.

The documentation for select_related does not mention in any way that the resulting queryset will differ from the one without select_related attached. This is also the expected behavior. Unfortunately, this does not match reality. Either the documentation should be changed to reflect the observed behavior (if it is indeed the intended behavior) or the observed behavior should be fixed to match the documentation.

The same is true for the documentation of values. While the change in behavior with regard to many-to-many relations is only natural and accordingly documented (multiple rows are returned, one for each referenced combination of values), the behavior currently exhibited with nullable foreign keys is surprising. While this might be reflected by a subtle change in documentation (something like "rows with nulled foreign keys are discarded from the result"), this would still fail to explain the inconsistent behavior between one-hop foreign key lookups and two-hop foreign key lookups, and the dependence on the order of arguments to values.

Finally, the observed (wrong) behavior might be only one symptom of a deeper problem within the query generator in Django. I am not familiar enough with how the query generator works in detail to verify or disprove that, but it might be that with multi-table joins are wrong the current way queries are created right now, especially the way the decision between INNER JOIN and LEFT OUTER JOIN is made.

Since these arguments feel strong enough to me to justify further insight into the matter, I am changing the severity to release blocker. If the problem cannot be fixed, it should at least be documented in select_related and values.

comment:5 Changed 3 years ago by carljm

  • Severity changed from Release blocker to Normal
  • Version changed from 1.3 to SVN

Thanks for the report and the patch! Sorry the patch hasn't been reviewed yet; sometimes it takes a little while (especially with bugs like this one that affect deep and sensitive ORM internals - tends to look like a time sink to potential reviewers!). If you'd like to draw some attention to the bug, a post to the django-developers mailing list about it (perhaps demonstrating some cases where it causes real problems in your code) would be a reasonable way to do that.

Nonetheless, I'm moving the severity back to Normal. We have fairly concrete guidelines about what constitutes a release-blocking bug: if it's a security problem or causes data-loss or a crash, or if its a regression from a previous Django release. This bug should be fixed (thus its Accepted), but it doesn't fall into any of the above categories, so it won't block a release.

comment:6 Changed 3 years ago by Alex

So this patch almost certainly works, but a) it needs tests veyr badly, and b) I'm not convinced that the fix is necessarily at the right level. Will have to give it some more thought.

comment:7 Changed 3 years ago by sebastian

Thank you both for your feedback. I agree with Alex that the patch is most probably not in the right place, but again I am still not familiar enough with the Django internals to have the slightest idea what is happening here and what would be the appropriate place for fixing this bug.

For the time being, I'll try to come up with some tests similar to the example in my original report and comment:2.

@carljm:
Regarding a demonstration of where this bug causes a real problem in my code, see comment:2. That example is a simplified version of what I have for my current project (including multi-table inheritance Event→Screening). I wanted to use select_related in a schedule of all upcoming events, so that I don't have to fetch the movie information for all screenings in the template, but I ended up having all non-screening events missing in the resulting queryset. This is not the behavior I expected and this means that I cannot use select_related. Without that, an additional query is issued for each screening where I want to retrieve the movie information (instead of only a single query for everything). I tried to work around this with values and noticed that this also doesn't work, as illustrated by my original report (values even fails without using multi-table inheritance, using only simple foreign key relations!).

Changed 3 years ago by sebastian

comment:8 Changed 3 years ago by sebastian

I am attaching a patch that adds another test to regressiontests/model_inheritance_select_related. In this test, very similar to what I had in comment:2, I basically have the following models:

class Chain(models.Model):
    pass

class Place(models.Model):
    pass

class Franchise(Place):
    chain = models.ForeignKey(Chain)

(That is, one multi-table inheritance from Place to Franchise, and one non-nullable foreign key from Franchise to Chain.)

I then create a Place object explicitly (i.e. a place that is not a franchise), and a Franchise object (implicitly creating a Place object) with an associated Chain object. The following three queries now should all return the same queryset consisting of the two Place objects (one for the Place, one for the Franchise):

Place.objects.all()
Place.objects.select_related("franchise")
Place.objects.select_related("franchise__chain")

However, due to the wrong join being selected between the Franchise and Chain model, the third queryset only returns the Place that is also a Franchise. In other words, the other Place that was created explicitly without a Franchise is now missing.

When adding an attribute null=True to the chain foreign key in Franchise, all three querysets return both objects. This is because null=True forces the correct left outer join between Franchise and Chain.

Regarding the issue with values and foreign keys (instead of the multi-table inheritance/one-to-one relation used in this test), I guess both problems have the some original cause, so I didn't write a test for values (neither for foreign-key instead of one-to-one relation). Fixing the select_related issue might also fix the other problem(s). My guess is that something deep inside the query generator goes wrong when it makes the choice of inner join vs. left outer join on nested foreign-key relationships.

Changed 3 years ago by sebastian

comment:9 Changed 3 years ago by sebastian

  • Has patch unset
  • Needs tests unset
  • Patch needs improvement unset

I attached another, more elaborate test, checking both explicit foreign key relations as well as implicit reverse ones (such as used by one-to-one relations, e.g. in multi-table inheritance).

comment:10 Changed 3 years ago by sebastian

  • Has patch set

I uploaded another patch. This one fixes the problem at what I think is the root of it. The patch makes two small changes to django/db/models/sql/compiler.py and django/db/models/sql/query.py.

In compiler.py, we force the reverse sides of foreign key relations to be null-able. This makes sense, since the existence of a remote object cannot be guaranteed, so we must use LEFT OUTER JOINs instead of INNER JOINs. I think the original code resulted from simply copy-pasting the section a couple lines above my fix which deals with non-related objects (where checking for field.null makes perfect sense), and forgetting to make the appropriate change to apply it to related objects. Please correct me if I'm wrong in my assumptions here.

The other change applies to query.py. Here, I added a simple check when introducing additional joins in setup_joins(…). The reasoning behind this change is that we must mark joins referring to existing LEFT OUTER ones as null-able, so that they may eventually be promoted to LEFT OUTER JOINs themselves.

These two changes combined fix the select_related(…) and values(…) problems in this ticket. My patch also includes a bunch of tests for these two methods (also some simple filter/exclude queries for good measure).

I made sure that all existing Django tests run as before, so that no failures or errors got introduced by my patch. (In particular, regressiontests/queries still succeeds in test_tickets_5324_6704(…) where the existence of expected INNER JOINs is checked which was created in response to tickets #5324 and #6704.)

Please let me know what you think. Is this patch/fix mature enough to be included in trunk? Are there any unintended side-effects?

comment:11 Changed 3 years ago by akaariai

I think the approach in the sql/compiler.py is correct. However the approach in sql/query.py is not correct in my opinion. What the NULLABLE in alias_map tuples mean, is if the join is nullable considered separate from the whole query. It is the responsibility of join promotion to do the correct promotions in the join chain, taking in account the parent joins. See attached patch for an approach doing this. I think my patch is actually simplifying the code while at the same time correcting this problem. However I am not yet at all certain it is actually correct. At least it passes the test :)

Changed 3 years ago by akaariai

comment:12 Changed 3 years ago by sebastian

@akaariai: Your patch looks good, and I think that unifying the promotion of joins to a single method promote_joins makes sense. It might also fix some issues with nested foreign keys that my patch wouldn't catch. One remark, however: on first sight, promote_joins doesn't seem to work recursively, i.e. the order in which we pass joins to this method matters (if one join references the other, passing them in in the opposite order might not lead to the desired avalanche of promotion). Is this a problem? Should we re-run the promotion until all join types stabilize, in order to catch such inter-join dependencies?

Also, why do you exclude auth_permission and django_content_type from promotion? That if-statement should at least have a comment explaining the reason for this.

(Sorry, please ignore that last paragraph, I misread "pass" for "continue". I guess that code was only used by you for debugging.)

Last edited 3 years ago by sebastian (previous) (diff)

comment:13 Changed 3 years ago by akaariai

Yeah, debugging leftover. Should always do a git grep pdb before posting a patch :)

As for stabilizing, yeah, that is a problem. The recursive comment refers to my original idea of doing

self.promote_joins(child_join_aliases)

for every alias that got promoted. However, it seemed that this is not necessary. But now that you mention it, it would still be the correct thing to do. The fix is easy to make, except I am not totally sure how to get the child joins of an alias... I will check how to do that.

comment:14 Changed 3 years ago by akaariai

  • Patch needs improvement set

It seems my patch is based on some really old version of Django. This is due to using a wrong directory of Django. I can't fix this before friday. So if you are trying to test this patch, do not wonder if it fails loudly when patching. So, even if I managed to pass all tests, it isn't that convincing.

Changed 3 years ago by sebastian

comment:15 Changed 3 years ago by sebastian

  • Patch needs improvement unset

No problem. I took your patch and adapted it to the current SVN version of Django ("16715_alternate-redux-r16928.diff"). Some changes were necessary as there seem to be more instances of promote_alias[_chain] in the current version.

I also added the ability to examine all children of the joins passed to the new method promote_joins. This should make sure that all known references to a join get updated as well. In fact, this change might enable us to remove all calls to promote_alias[_chain]/promote_joins until just before the final construction of the query. Until now, it seems that we had to call promote_alias[_chain] repeatedly whenever we changed certain aspects of the query, to ensure the correct join type. However, this might not be necessary anymore as the new method promote_joins does the promotion recursively (when A->B gets promoted, so do all B->X).

What do you think? Moving the promotion of joins right to the end of the query construction might simplify the code, since we only had to collect the requirements of each join (ie. referenced alias, fields, and if the join itself is nullable) and would add the join type only just before putting everything together (JOIN_TYPE would remain unset until this very last step). At least, this seems like the logical thing to do, anyway, but I don't know if JOIN_TYPE is used for other things during query construction (I guess it is).

BTW, the new patch too doesn't introduces any new failed tests.

comment:16 Changed 3 years ago by akaariai

Thank you very much for the cleanup. I did that mistake on my work computer, and can't (or rather will not) access it before friday. If trac could warn that 'your patch does not apply cleanly', that would be a great feature.

About doing the join promotion only at the time when the query is about to be executed. I have been thinking about something similar to that, it is IMO the right approach. Except I think we should begin with the assumption that all joins are LEFT JOINS and promote (unpromote?) them to INNER JOINS only when we can be sure it is safe. I can't think of a situation where it would be wrong to use a LEFT JOIN instead of INNER JOIN in Django's query generator. The other direction is certainly dangerous, there have been, and still is too numerous tickets about that.

The reason for doing promotion in reversed direction is that I believe getting the joins perfectly correctly is actually NP-complete (very similar to SAT). So, I believe we will never get perfect results. If we start from INNER JOINS and try to approach from above the perfect situation, we will have a lot of bugs to fix. But if we instead do it in the other direction, approaching from below, we only have some optimization problems ahead. If there is some way to do this perfectly, at least we haven't found it yet.

I have been working with the add_q code, and the query construction is very anal about the structure of the where three passed in. And the problem is always that it thinks INNER JOIN is the right thing to do when it is not. When looking at the promotion code and fixes to it, it is evident that the current approach of fixing the problems one at a time will not carry much further.

It might be that I am wrong about the NP-completeness, and I wouldn't be surprised if there are situations where we must use INNER JOINS. Still, reversing the promotion direction should result in less bugs.

comment:17 Changed 3 years ago by sebastian

So, what should be our next step? How do we get your patch reviewed and maybe approved of by another developer, or even one of the core developers? Making more fundamental changes to how queries are generated (as hinted at by you) doesn't seem a likely approach at the moment since it suggests a lot of work, essentially rewriting the entire query generator.

(However, I think that changing the way queries are generated as you suggested is something that should be kept in mind. It seems to be the Right Way of doing things, and it also seems to simplify a lot of code (I guess) by not having to keep track of all the resulting joins at all times but rather track only the properties of joins, determining the required join type only in the end.)

In the meantime, your patch certainly fixes the problem of picking the wrong join in the scope of this ticket. So how do we proceed?

comment:18 Changed 3 years ago by akaariai

I don't know how to proceed, it seems most of the core developers are really busy, especially those who know the ORM. One way is to go to #django-dev and try to get somebody interested. Finding somebody to review this would be the first step.

I definitely do think we should keep this ticket restricted in scope. When this gets committed, then maybe move forward. I do think the ORM needs some restructuring in the future. But that is not this ticket's problem.

comment:19 Changed 3 years ago by sebastian

  • Cc sebastian added

comment:20 Changed 2 years ago by akaariai

I have rebased the patch and it can be found from: https://github.com/akaariai/django/tree/ticket16715

The work done in this ticket seems valid and should make the join promotion code just a little bit easier to understand. There might be some places which could be cleaned up more after this patch, but there are already enough changes for one patch.

comment:21 Changed 2 years ago by akaariai

There is an additional cleanup done to add_filter available from: https://github.com/akaariai/django/tree/ticket16715_more

The code generates less outer joins, but passes all tests. I have to still verify the queries where the amount of outer joins is altered and see that the generated queries are valid. So, this is currently mostly investigative work.

However I am pretty sure there is something fishy going on in the add_fitlter join promotion. The problem is the assumed pairwise symmetry between the join_list and self.tables. As far as I understand this symmetry simply doesn't exist in general.

comment:22 Changed 2 years ago by akaariai

I am marking this RFC (not the "more" version, just the base promote_joins() patch in comment:20). I am confident this one will improve the queryset generation code. I have been playing around with the patch from #10790, and it has become obvious that the current join promotion logic in add_filter() is complicated, and doesn't work well. This patch should make cleaning up that logic _much_ easier.

If you have doubts about this patch and do not have time to review this patch just now, please make a note here. I aim to commit this one during next weekend.

comment:23 Changed 2 years ago by akaariai

  • Triage Stage changed from Accepted to Ready for checkin

And the actual RFC mark...

comment:24 Changed 23 months ago by akaariai

I have made some final adjustments to the patch. Now there is logic to make sure that we never generate join chains like a LOUTER b INNER c, which will result in broken queries.

I tested this with adding this little test to sql/compiler.py get_from_clause():

        for alias in self.query.tables:
+            if alias in self.query.alias_map:
+                cur_alias = self.query.alias_map[alias]
+                if cur_alias.join_type == self.query.INNER:
+                    parent_alias = self.query.alias_map[cur_alias.lhs_alias]
+                    assert parent_alias.join_type != self.query.LOUTER

This test fails when running queries tests on master (that is, we have join chains like a LOUTER b INNER c). After patch there isn't such join chains anymore in the whole test suite.

I am still leaving a day or so for review if somebody wants to do that. The patch is here.

comment:25 Changed 23 months ago by Anssi Kääriäinen <akaariai@…>

  • Resolution set to fixed
  • Status changed from new to closed

In [01b9c3d5193fe61b82ae8b26242a13fdec22f211]:

Fixed #16715 -- Fixed join promotion logic for nested nullable FKs

The joins for nested nullable foreign keys were often created as INNER
when they should have been OUTER joins. The reason was that only the
first join in the chain was promoted correctly. There were also issues
with select_related etc.

The basic structure for this problem was:

A -[nullable]-> B -[nonnull]-> C

And the basic problem was that the A->B join was correctly LOUTER,
the B->C join not.

The major change taken in this patch is that now if we promote a join
A->B, we will automatically promote joins B->X for all X in the query.
Also, we now make sure there aren't ever join chains like:

a LOUTER b INNER c

If the a -> b needs to be LOUTER, then the INNER at the end of the
chain will cancel the LOUTER join and we have a broken query.

Sebastian reported this problem and did also major portions of the
patch.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.