Opened 12 years ago

Closed 11 years ago

Last modified 11 years ago

#17502 closed Cleanup/optimization (fixed)

Filter on field from base class 2 levels up hierarchy causes extra join

Reported by: dbenamy@… Owned by: Łukasz Rekucki
Component: Database layer (models, ORM) Version: 1.3
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Here's some simple code showing a 3 level concrete inheritance structure:

from django.db import models

class A(models.Model):
    alice = models.IntegerField()

class B(A):
    bob = models.IntegerField()

class C(B):
    charlie = models.IntegerField()

Filtering C objects on an attribute in B or C works fine. Filtering C objects on an attribute in A seems to cause an extra join to A. Is there some reason this is needed?

In [3]: str(C.objects.all().query)
Out[3]: 'SELECT "case1_a"."id", "case1_a"."alice", "case1_b"."a_ptr_id", "case1_b"."bob", "case1_c"."b_ptr_id", "case1_c"."charlie" FROM "case1_c" INNER JOIN "case1_a" ON ("case1_c"."b_ptr_id" = "case1_a"."id") INNER JOIN "case1_b" ON ("case1_c"."b_ptr_id" = "case1_b"."a_ptr_id")'

In [4]: str(C.objects.filter(charlie=0).query)
Out[4]: 'SELECT "case1_a"."id", "case1_a"."alice", "case1_b"."a_ptr_id", "case1_b"."bob", "case1_c"."b_ptr_id", "case1_c"."charlie" FROM "case1_c" INNER JOIN "case1_a" ON ("case1_c"."b_ptr_id" = "case1_a"."id") INNER JOIN "case1_b" ON ("case1_c"."b_ptr_id" = "case1_b"."a_ptr_id") WHERE "case1_c"."charlie" = 0 '

In [5]: str(C.objects.filter(bob=0).query)
Out[5]: 'SELECT "case1_a"."id", "case1_a"."alice", "case1_b"."a_ptr_id", "case1_b"."bob", "case1_c"."b_ptr_id", "case1_c"."charlie" FROM "case1_c" INNER JOIN "case1_b" ON ("case1_c"."b_ptr_id" = "case1_b"."a_ptr_id") INNER JOIN "case1_a" ON ("case1_c"."b_ptr_id" = "case1_a"."id") WHERE "case1_b"."bob" = 0 '

In [6]: str(C.objects.filter(alice=0).query)
Out[6]: 'SELECT T4."id", T4."alice", "case1_b"."a_ptr_id", "case1_b"."bob", "case1_c"."b_ptr_id", "case1_c"."charlie" FROM "case1_c" INNER JOIN "case1_b" ON ("case1_c"."b_ptr_id" = "case1_b"."a_ptr_id") INNER JOIN "case1_a" ON ("case1_b"."a_ptr_id" = "case1_a"."id") INNER JOIN "case1_a" T4 ON ("case1_c"."b_ptr_id" = T4."id") WHERE "case1_a"."alice" = 0 '

This is with 1.3.1.

Change History (5)

comment:1 by Łukasz Rekucki, 12 years ago

Owner: changed from nobody to Łukasz Rekucki

comment:2 by Łukasz Rekucki, 12 years ago

Status: newassigned
Triage Stage: UnreviewedAccepted

Going step by step:

  1. We start with table "C".
  2. Query.setup_inherited_model() adds INNER JOIN for all parent models. C is joined with B [{{{ ON (C.b_ptr_id = B.a_ptr_id) - B's parent pointer is it's own primary key). Then C is joined with A and because of B's PK is the same as B's parent pointer, Django optimizes? by doing (C.b_ptr_id [= B.a_ptr_id] = A.id).
  3. Then the Query.add_filter() lookups the alice field and setups neccesary joins with Query.setup_joins(). As the field resides on the base model, the following code is executed:
                    for int_model in opts.get_base_chain(model):
                        if int_model is proxied_model:
                            opts = int_model._meta
                        else:
                            lhs_col = opts.parents[int_model].column
                            dedupe = lhs_col in opts.duplicate_targets
                            if dedupe:
                                exclusions.update(self.dupe_avoidance.get(
                                        (id(opts), lhs_col), ()))
                                dupe_set.add((opts, lhs_col))
                            opts = int_model._meta
                            alias = self.join((alias, opts.db_table, lhs_col,
                                    opts.pk.column), exclusions=exclusions)
                            joins.append(alias)
                            exclusions.add(alias)
                            for (dupe_opts, dupe_col) in dupe_set:
                                self.update_dupe_avoidance(dupe_opts, dupe_col,
                                        alias)
    

This takes a different approach by joining C with B (alias lookup succeeds) and then B with A (alias lookup fails, because we previously used a different connection). This is where the additional join comes from.

Possible solutions:

  1. make the strategy in 2. and 3. consistent. Chaining the joins step by step instead of skipping should work, but if we select only values from C and A, we should be able to skip joining B. This would suggest changing setup_joins to try to reuse the optimized alias first.
  1. When doing the setup_inherited_models(), if we did join to B (i.e. there are selected fields to B), add a fake entry to alias_map for "B INNER JOIN A", that setup_joins() could reuse later.

comment:3 by Aymeric Augustin, 11 years ago

Component: UncategorizedDatabase layer (models, ORM)

comment:4 by Anssi Kääriäinen <akaariai@…>, 11 years ago

Resolution: fixed
Status: assignedclosed

In 2a2708e1b2a0e02fc4fdd4b050d60fb0033dfdde:

Fixed #17502 -- Made joining in inheritance cases consistent

The original problem was that when filtering two levels up in
inheritance chain, Django optimized the join generation so that the
middle model was skipped. But then Django generated joins from top
to middle to bottom for SELECT clause, and thus there was one extra
join (top->middle->bottom + top -> bottom).

This case is fixed in master as the filtering optimization is gone.
This has the side effect that in some cases there is still extra join
if the SELECT clause doesn't contain anything from middle or bottom.

comment:5 by Anssi Kääriäinen, 11 years ago

The original case is already fixed in master. However, the .values(from_top_only).filter(against_bottom) case does generate one extra join (assuming Top->Middle->Bottom inheritance chain). This is now expectedFailure, and I don't think it is too important to fix this one. Multilevel inheritance + select only from topmost models + filtering against bottom just doesn't seem like something worth using much time.

If this were to be fixed, it seems the way would be to do this (this is just another way to have "fake joining" as suggested by lrekucki):

  1. Alter the join reference counting in Query to count only those tables that are really used (that is, they are referenced by SELECT, ORDER BY, ...).
  2. When generating the query in compiler, check if there are joins that can be skipped (the "from" and "to" joins use the same column), and that have reference count of 0. If so, skip the join.

Of course, it is also possible to fix this by being clever in join generation stage. But it is hard to make all three following cases work correctly using this approach:

  1. Top.objects.values_list(from_top).filter(against_bottom)
  2. Top.objects.filter(against_bottom).values_list(from_top)
  3. Top.objects.filter(against_bottom)

If you don't skip the middle in case 2 you will have 2 joins. If you do skip joins, then case 3 will not work.

All in all, solving this properly requires too much effort compared to the gain.

Note: See TracTickets for help on using tickets.
Back to Top