Opened 7 years ago

Last modified 7 weeks ago

#29262 assigned New feature

Custom Left Outer Join in Queries

Reported by: Sassan Haradji Owned by: faizan2700
Component: Database layer (models, ORM) Version:
Severity: Normal Keywords: ORM Join
Cc: josh.smeaton@…, Alexandr Artemyev, Ivan Zaikin, frnhr, Alex Scott, elonzh, Evstifeev Roman Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I need a query that contains a left outer join, the table has ~160,000,000 rows and if I try to avoid outer join it'll reduce performance in an order that it'll be unusable.
So my only choice is using raw method of the object's manager.
I need to pass the queryset to rest framework so that it filter/sort/paginate/etc it.
Rest framework needs a normal queryset (with count, filter, order_by, etc) and I'm trying to solve it with lots of hacks (proxying objects, overriding internal methods and changing the order they call other internal methods and changing some standard tested code in the overridden methods.)

It's a terrible experience. I see requests and pull requests for supporting joins since 10 years ago and lots of related questions in stackoverflow, reddit and all around the web. So I'm here to ask you once again to do something about this issue.
At least you can provide a way to modify the sql command that's going to send to database by normal query (not rawquery) and let the developer to take the risk and see if it breaks things and handle it himself. It'll be only used in edge cases by people who really need it and if they need it they probably know what they're doing. It'll be better than the nightmare developers have to deal while in need of custom queries.

Change History (28)

comment:1 by Josh Smeaton, 7 years ago

A few things.

  1. I wasn't aware there were situations where a LEFT OUTER JOIN would have better performance than an INNER JOIN, since the LEFT JOIN looks at and includes more data. Are you able to provide the SQL for what is being generated and for what you want? The model definitions would also help.
  1. There have been lots of questions about customising joins. As far as I know, there have been no pull requests implementing such a thing in a reasonable manner. If you're aware of any pull requests it'd be good if you could share where those are so we can discuss the merits of each change.
  1. You're asking for an escape hatch that isn't the escape hatch that django is already providing. Django provides .raw() for exactly these purposes. If, for some technical reason, .raw() isn't appropriate, please discuss why so we can address those particular concerns.
  1. Is it possible to address your situation by using .union()? You can represent the FULL JOIN portion with 1 query, and the NULL join portion with a 2nd query, the .union() them together.
  1. There was some work done recently on annotating joins onto querysets, but I've been unable to find that ticket or patch.

Now I am picking up from your ticket that you're frustrated. But no open source contributor anywhere has ever responded to the equivalent of "this situation is really bad for me so you must fix it for me" by jumping to do exactly that. If you want people to work, for free, on something you care about, then it's always better to approach the conversation in a more positive and friendly tone.

We'd also be open to any contributions you or your company would be willing to make provided it made sense for the project.

comment:2 by Sassan Haradji, 7 years ago

  1. I like many need LEFT OUTER JOIN not INNER JOIN, if INNER JOIN was an efficient alternative to LEFT OUTER JOIN django itself would use it in its foreignkey joins, these joins don't do same thing and aren't an alternative for eachother.
  1. This is one from "10 years ago": https://code.djangoproject.com/ticket/7231
  1. .raw() may be good for some use cases, but I don't get it why should I lose all the features on a normal query (like count, filter, etc) just cause I wanna add a simple column by an outer join to that normal query. What I suggested as an "at least alternative" was an escape hatch that lets developer use normal queries (not raw queries) and patch them.

A framework is just a tool in hands of the developer, I think everyone on earth trying to build a tool should consider that the tool should not limit its users, but it should give him new opportunities. Django provides lots of opportunities, but not providing an easy way to patch final sql it compiles is not a good thing. There are hard ways, I can subclass connection and query and change their behavior but there should be an easy way to do it after all (even if django supports left outer join). In documentation it should try to convince normal users to not use it and talk about security problems and instabilities it may introduce to code. But if someone needs it and knows what he's doing, then he should be able to do it.

comment:3 by Sassan Haradji, 7 years ago

  1. I don't need union, I'm not trying to add rows to my main query, I'm trying to add a column to my query via outer join. So unfortunately union is not an option.
  1. Would be glad if you find it and share it here so that we can see if it solves the problem.

comment:4 by Sassan Haradji, 7 years ago

SpamBayes doesn't let me to send this part because it's referring some google searches, so I post it in a gist: https://gist.github.com/sassanh/43ef664872c322a5f88434f10c5ce4ea

Btw, there's an implicit message in every issue reported in open source community, the reporter doesn't ask any specific person to do it, as he has no right to do so as he's not paying them. What he wants by reporting an issue is trying to make consensus that the issue exists, get acknowledgement from the community that issue exists and someone (maybe himself, maybe a current contributor or maybe someone who is 2 years old now and is going to become a software developer in future) should solve it. An open issue/ticket means it is a step toward progression of the project, so while I do not ask current contributors to solve this issue, I do ask them to not abuse their privileges on this ticketing system and do not close it until it's either implemented in Django or someone provides good logic "why a developer using django framework can do whatever he wants without using custom outer joins."

Last edited 7 years ago by Sassan Haradji (previous) (diff)

comment:5 by Tim Graham, 7 years ago

Duplicate of #26426, "Add a way to customize a QuerySet's joins"?

comment:6 by Josh Smeaton, 7 years ago

It's more than likely that I have misinterpreted your request to address this issue as a demand for core to fix the issue due to language differences. I apologise.

#26426 (customise joins) and #25590 (custom join classes) are almost duplicates, but are much more general than the specific feature of supporting user-defined left joins. I'd probably argue that #26426 should be closed, as we have EXISTS expressions now that can solve that specific problem. Would you agree Tim?

I think there are really two features that we should try to support.

  1. Allow users to join data onto a query with a LEFT JOIN
  2. Allow users to add additional conditions onto a JOIN condition

With those two features we'd get pretty far down the line for supporting most common custom join requirements.

What we'd need to do is come to a consensus on what the right syntax would look like to make user defined LEFT JOINs possible. I'm not interested in providing hooks into customising already generated SQL. That would just be a hack to work around our lack of actual support. That's mostly why the linked ticket from 10 years ago languished - it was hacked into .extra() which we're no longer committed to updating.

So what would a decent syntax look like? Should we consider a new queryset method? Should we use .annotate()?

I'll throw out some ideas:

MyModel.objects.annotate(joined=Outer(OtherModel.objects.filter(mymodel_id=OuterRef('pk')))

MyModel.objects.outer('othermodel')

MyModel.objects.outer(othermodel=OtherModel.objects.filter(user=request.user))

MyModel.objects.partial_related(othermodel=OtherModel.objects.filter(user=request.user))

Django has tended to avoid using language that maps too closely to SQL in the past, though with the addition of more complex expression types that hasn't been so much of a blocker. I'd be hesitant to add a new queryset method called outer for that reason though. New classes can map to SQL because they're not as discoverable and really are for advanced usage. Increasing the scope of the queryset api with regard to SQL terminology seems off to me.

This is the kind of question that could probably be asked of the django-developers mailing list. There are lots of people with opinions there that'd be relevant to this discussion. In principle though, Django should definitely support adding left joins to a queryset.

Last edited 7 years ago by Tim Graham (previous) (diff)

comment:7 by Josh Smeaton, 7 years ago

Cc: josh.smeaton@… added

comment:8 by Josh Smeaton, 7 years ago

I've begun a discussion on the mailing list: https://groups.google.com/forum/#!topic/django-developers/2ITfPZlbsao

Please add your example to that thread if it's different to any already listed there.

comment:9 by Josh Smeaton, 7 years ago

Triage Stage: UnreviewedAccepted
Version: 2.0

comment:10 by Simon Charette, 6 years ago

I'd be curious to know whether or not FilteredRelation would solve your use case like it did in #29555.

This expression allows your to specify extra JOIN conditions so I'd assume annotating the subquery you want to JOIN against and then referencing it in the FilteredRelation(condition) should work. It'd help if you could provide your models definition and the exact query you're trying to generate through the ORM.

comment:11 by Sassan Haradji, 6 years ago

Unfortunately I don't think so. What I needed was exactly left join.
The query I needed was a rather complicated query, I try to abstract it here so that we can investigate it and find out what's needed in Django ORM to achieve it.
Suppose that I have this table:

CREATE TABLE foo(id,related_id,value,type)
AS VALUES
    ( 1 , 1,  'A1' , 1 ),
    ( 2 , 1,  'A2' , 2 ),
    ( 3 , 1,  'A3' , 3 ),
    ( 4 , 1,  'A4' , 4 ),
    ( 5 , 1,  'A5' , 5 ),
    ( 6 , 2,  'B1' , 1 ),
    ( 7 , 2,  'B2' , 2 ),
    ( 8 , 2,  'B3' , 3 ),
    ( 9 , 2,  'B4' , 4 ),
    ( 10, 2,  'B5' , 5 )
;

I want to aggregate these values and make this intermediate table:

-----------------------------------------------------------------------------
|  id   |   related_id   |  values                                          |
-----------------------------------------------------------------------------
|  1    |   1            |  (('A1',1),('A2',2),('A3',3),('A4',4),('A5',5))  |
-----------------------------------------------------------------------------
|  6    |   2            |  (('B1',1),('B2',2),('B3',3),('B4',4),('B5',5))  |
-----------------------------------------------------------------------------

To do so I need to do a simple aggregation:

foo.objects.values('related_id').annotate(
    id=Min('id'),
    values=ArrayAgg(
        Func(
            'value',
            'type',
            function='ARRAY',
            template='%(function)s[%(expressions)s]',
            arg_joiner=',',
        ), output_field=ArrayField(ArrayField(models.FloatField())),
    ),
)

This will generate this sql query (or something equivalent):

SELECT t.*
FROM (
  SELECT
    min(id),
    related_id,
    array_agg(ARRAY(value, type)) AS values,
  FROM foo
  GROUP BY id
) AS t

So far so good. Then I want to order this query based on value column but this order should order values of this column that are in a row that has type=X. I can do so by this sql:

SELECT t1.*
FROM (
  SELECT
    min(id),
    related_id,
    array_agg(ARRAY(value, type)) AS values,
  FROM foo
  GROUP BY id
) AS t1
LEFT OUTER JOIN (SELECT value FROM foo WHERE type=X) AS t2 USING (id)
ORDER BY t2.value

This is where I need this left join. It should be left join cause I don't wanna miss rows that don't have X type value.

Now do you think it's possible to do the above with current Django ORM API?
Consider that this table is really big, and the above sql query is the only one I found that executes in rational time and doesn't miss anything.
Also the real problem is much more complected, the number of columns that are involved are much more so if you think there's room for simplifying the above sql solution consider that in my real usecase it may not be applicable so I please lets concentrate on interpreting the exact above query into Django ORM API and not change the sql query so that it fits the API.

comment:12 by Simon Charette, 5 years ago

I think this ticket should either be repurposed or closed as FilteredRelation does allow custom left outer joins.

What comment:11 is missing at this point is a way to perform subquery pushdown and a way to JOIN subqueries.

comment:13 by Alexandr Artemyev, 5 years ago

Cc: Alexandr Artemyev added

comment:14 by Alexandr Artemyev, 5 years ago

I believe my case relates to this issue.

I have a user model and a second model like this

class SecondModel(models.Model):
    date = models.DateField()
    from_member = models.ForeignKey(User, models.CASCADE, '+',)
    to_member = models.ForeignKey(User, models.CASCADE, '+')
    first_value = models.IntegerField(default=0)
    second_value = models.IntegerField(default=0)


I want to left join this subquery with the User model on id=target:

subquery = (
    SecondModel.objects.filter(
        Q(from_member=self.user) | Q(to_member=self.user),
        date__gte=start_date, date__lte=end_date,
    ).annotate(
        target=Case(When(Q(from_member=self.user), then=F('to_member_id')), default=F('from_member_id')),
    ).values('target').annotate(
        sum_first_value=Sum('first_value'),
        sum_second_value=Sum('second_value'),
        total=F('sum_first_value') + F('sum_second_value'),
    )
)

I imagine it somehow like this:

qs = User.objects.annotate(subquery=subquery)

After that filter/exclude/order_by features should work.
Also I should have access to fields like qs[0].subquery_first_value/qs[0].subquery_second_value

Here is an example SQL query that I want to be generated:

SELECT "users".*, "subquery".*
FROM "users"
LEFT OUTER JOIN
  (SELECT CASE
              WHEN "second_model"."from_member_id" = 7 THEN "second_model"."to_member_id"
              ELSE "second_model"."from_member_id"
          END AS "target",
          SUM("second_model"."first_value") AS "sum_first_value",
          SUM("second_model"."second_value") AS "sum_second_value",
   FROM "second_model"
   WHERE ("second_model"."from_member_id" = 7
          OR "second_model"."to_member_id" = 7) -- also the condition for date on start and end
   GROUP BY CASE
                WHEN "second_model"."from_member_id" = 7 THEN "second_model"."to_member_id"
                ELSE "second_model"."from_member_id"
            END) subquery ON ("users"."id" = subquery."target")
WHERE "users"."id" = 502

I found a solution that resolves my problem:
https://stackoverflow.com/a/42816689/5628843
using CustomJoin helps me, but to access values i need to use extra, and this code doesn’t look very nice

join_to(User, subquery, 'id', 'target', queryset, 'subquery').extra(
            select={
                field_name: f'COALESCE({field_name}, 0)'
                for field_name in (
                    'sum_first_value',
                    'sum_second_value',
                )
            }
        )
Last edited 5 years ago by Alexandr Artemyev (previous) (diff)

comment:15 by Ivan Zaikin, 5 years ago

Cc: Ivan Zaikin added

comment:16 by frnhr, 5 years ago

Cc: frnhr added

comment:17 by Alex Scott, 5 years ago

Cc: Alex Scott added

@simon, FilteredRelation is great but only allows custom outer joins one level deep at the moment so isn't a full solution for many use cases. Added a comment here as well: https://code.djangoproject.com/ticket/29789

comment:18 by Hugo Maingonnat, 4 years ago

I also need a way to generate a query with custom a join:

SELECT "main_table"."id" FROM "main_table" LEFT JOIN (SELECT distinct(main_id) FROM through_table WHERE other_id = 14) as foo ON foo.main_id=main_table.id ORDER BY "main_table"."id" DESC LIMIT 1;

The main queryset is generated by the view, and the left join subquery is a custom django-filter. Using a id__in would work but I have to repeat the ORDER BY DESC and LIMIT 1 in the subquery (otherwise it gets really ineficient) and ideally the filter should be independant from the initial queryset (so it does not know the ordering and limit). Also django-filter needs to keep a queryset so I cannot generate a RawQueryset (I am not even sure this is doable).

Having a join keyword in the extra function would have unlocked me.

The FilteredRelation object is only limited to a condition (q object) and not a full SELECT query. If it could take a RawSQL or a queryset it would be perfect.

So it looks like I cannot generate a quite simple query using any of the django function, a bit frustrating :(

Like @Alexandr Artemyev I found out the CustomJoin trick but it looks very brittle.

Hope my feedback will help understanding the issue :)

Version 0, edited 4 years ago by Hugo Maingonnat (next)

comment:19 by Thorben Luepkes, 2 years ago

I think I am running into a similar problem here, where I would like to append multiple conditions onto my join:
I will try to be precise with this as much as possible.
Imagine these two models. whose relation was set up years ago:

class Event(models.Model):
    instance_created_date = models.DateTimeField(auto_now_add=True)
    car = models.ForeignKey(Car, on_delete=models.CASCADE, related_name="car_events")
    ...
    a lot of normal text fields here, but they dont matter for this problem.

and

class Car(models.Model):
    a lot of text fields here, but they dont matter for this problem.
    hide_from_company_search = models.BooleanField(default=False)
    images = models.ManyToManyField(Image, through=CarImage)

Lets say I want to query the amount of events for a given car:

def get_car_events_qs() -> QuerySet:
    six_days_ago = (timezone.now().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=6))
    cars = Car.objects.prefetch_related(
        'car_events',
    ).filter(
        some_conditions_on_fields=False,
    ).annotate(
        num_car_events=Count(
            'car_events',
            filter=Q(car_events__instance_created_date__gt=six_days_ago), distinct=True)
    )

    return cars

The really tricky part for this is the performance of the query: Cars has 450.000 entries, and Events has 156.850.048. All fields that I am using to query are indexed. The query takes around 4 minutes to complete, depending on the db load. It took 18 minutes before adding the indicies.

This above ORM query will result in the following sql:

SELECT
  "core_car"."id",
  		COUNT("analytics_carevent"."id") FILTER (WHERE ("analytics_carevent"."event" = 'view'
			AND "analytics_carevent"."instance_created_date" >= '2022-05-10T07:45:16.672279+00:00'::timestamptz
			AND "analytics_carevent"."instance_created_date" < '2022-05-11T07:45:16.672284+00:00'::timestamptz)) AS "num_cars_view",
	LEFT OUTER JOIN "analytics_carevent" ON ("core_car"."id" = "analytics_carevent"."car_id")
WHERE 
 ... some conditions that dont matter
GROUP BY
	"core_car"."id"

I somehow suspect this FILTER to be a problem.
I tried with

.annotate(num_car_events=Count('car_events'))

and moving the car_events__instance_created_date__gt=six_days_ago into the filter:

.filter(some_conditions_on_fields=False, car_events__instance_created_date__gt=six_days_ago)

But of course this would filter out Cars with no Events, which is not what we want - but it is super fast!
I fiddled a bit with it in raw sql and came to his nice working example, that I now would like to write into ORM, since we dont really want to use rawsql. This query takes 2.2s, which is in our acceptable boundary, but faaaaar less than the 18minutes.

SELECT
	"core_car"."id",
	COUNT(DISTINCT "analytics_carevent"."id") AS "num_cars_view",
FROM
	"core_car"
	LEFT JOIN "analytics_carevent" ON ("core_car"."id" = "analytics_carevent"."car_id" AND "analytics_carevent"."event" = 'view' AND "analytics_carevent"."instance_created_date" > '2022-05-14T00:00:00+02:00'::timestamptz
		AND "analytics_carevent"."instance_created_date" <= '2022-05-15T00:00:00+02:00'::timestamptz)
	
WHERE (some conditions that dont matter)
GROUP BY "core_car"."id";

My question now is:
How can I make the above query into the ORM?
I need to put the "filter" or conditions onto the left join. If I just use filter() it will just put it into the where clause, which is wrong.
I tried:

 two_days_ago = (timezone.now().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=2))
    cars = Car.objects.prefetch_related(
        'car_events',
    ).filter(some_filters,)

and

cars = cars.annotate(events=FilteredRelation('car_events')).filter(car_events__car_id__in=cars.values_list("id", flat=True), car_events__instance_created_date__gt=six_days_ago)

But I dont think this is quite correct. I also need the count of the annotation.
I also tried with

cars = cars.annotate(events=FilteredRelation('car_events', condition=Q(car_events__car_id__in=ads.values_list("id", flat=True)))).filter(events__instance_created_date__gt=six_days_ago)

But this results in an inner join, which i dont think is wanted here

Using Django 4 and latest python release as of this writing. :)

Thanks a lot!

TLDR: Putting a filter or conditions on LEFT JOIN in django, instead of queryset.filter()

Last edited 2 years ago by Thorben Luepkes (previous) (diff)

comment:20 by elonzh, 22 months ago

Cc: elonzh added

comment:21 by Gaurav Joshi, 20 months ago

Owner: changed from nobody to Gaurav Joshi
Status: newassigned

Could you please assign the issue to me so that I can work on it?

in reply to:  21 comment:22 by Mariusz Felisiak, 20 months ago

Replying to Gaurav Joshi:

Could you please assign the issue to me so that I can work on it?

You've already assigned it to yourself.

comment:23 by Gaurav Joshi, 20 months ago

Yes, I assigned that issue to myself 😅. As a new member here, I'm still trying to navigate the process and it has caused some confusion for me

comment:24 by Gaurav Joshi, 20 months ago

Owner: Gaurav Joshi removed
Status: assignednew

comment:25 by faizan2700, 15 months ago

Is anyone working on it. Please assign it to me.

comment:26 by faizan2700, 15 months ago

Owner: set to faizan2700
Status: newassigned

comment:27 by Mariusz Felisiak, 15 months ago

faizan2700, are you sure that you want to work on multiple tickets at once?

comment:28 by Evstifeev Roman, 7 weeks ago

Cc: Evstifeev Roman added
Note: See TracTickets for help on using tickets.
Back to Top