Opened 7 years ago
Last modified 7 weeks ago
#29262 assigned New feature
Custom Left Outer Join in Queries
Reported by: | Sassan Haradji | Owned by: | faizan2700 |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | |
Severity: | Normal | Keywords: | ORM Join |
Cc: | josh.smeaton@…, Alexandr Artemyev, Ivan Zaikin, frnhr, Alex Scott, elonzh, Evstifeev Roman | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I need a query that contains a left outer join, the table has ~160,000,000 rows and if I try to avoid outer join it'll reduce performance in an order that it'll be unusable.
So my only choice is using raw method of the object's manager.
I need to pass the queryset to rest framework so that it filter/sort/paginate/etc it.
Rest framework needs a normal queryset (with count, filter, order_by, etc) and I'm trying to solve it with lots of hacks (proxying objects, overriding internal methods and changing the order they call other internal methods and changing some standard tested code in the overridden methods.)
It's a terrible experience. I see requests and pull requests for supporting joins since 10 years ago and lots of related questions in stackoverflow, reddit and all around the web. So I'm here to ask you once again to do something about this issue.
At least you can provide a way to modify the sql command that's going to send to database by normal query (not rawquery) and let the developer to take the risk and see if it breaks things and handle it himself. It'll be only used in edge cases by people who really need it and if they need it they probably know what they're doing. It'll be better than the nightmare developers have to deal while in need of custom queries.
Change History (28)
comment:1 by , 7 years ago
comment:2 by , 7 years ago
- I like many need LEFT OUTER JOIN not INNER JOIN, if INNER JOIN was an efficient alternative to LEFT OUTER JOIN django itself would use it in its foreignkey joins, these joins don't do same thing and aren't an alternative for eachother.
- This is one from "10 years ago": https://code.djangoproject.com/ticket/7231
.raw()
may be good for some use cases, but I don't get it why should I lose all the features on a normal query (like count, filter, etc) just cause I wanna add a simple column by an outer join to that normal query. What I suggested as an "at least alternative" was an escape hatch that lets developer use normal queries (not raw queries) and patch them.
A framework is just a tool in hands of the developer, I think everyone on earth trying to build a tool should consider that the tool should not limit its users, but it should give him new opportunities. Django provides lots of opportunities, but not providing an easy way to patch final sql it compiles is not a good thing. There are hard ways, I can subclass connection and query and change their behavior but there should be an easy way to do it after all (even if django supports left outer join). In documentation it should try to convince normal users to not use it and talk about security problems and instabilities it may introduce to code. But if someone needs it and knows what he's doing, then he should be able to do it.
comment:3 by , 7 years ago
- I don't need union, I'm not trying to add rows to my main query, I'm trying to add a column to my query via outer join. So unfortunately union is not an option.
- Would be glad if you find it and share it here so that we can see if it solves the problem.
comment:4 by , 7 years ago
SpamBayes doesn't let me to send this part because it's referring some google searches, so I post it in a gist: https://gist.github.com/sassanh/43ef664872c322a5f88434f10c5ce4ea
Btw, there's an implicit message in every issue reported in open source community, the reporter doesn't ask any specific person to do it, as he has no right to do so as he's not paying them. What he wants by reporting an issue is trying to make consensus that the issue exists, get acknowledgement from the community that issue exists and someone (maybe himself, maybe a current contributor or maybe someone who is 2 years old now and is going to become a software developer in future) should solve it. An open issue/ticket means it is a step toward progression of the project, so while I do not ask current contributors to solve this issue, I do ask them to not abuse their privileges on this ticketing system and do not close it until it's either implemented in Django or someone provides good logic "why a developer using django framework can do whatever he wants without using custom outer joins."
comment:6 by , 7 years ago
It's more than likely that I have misinterpreted your request to address this issue as a demand for core to fix the issue due to language differences. I apologise.
#26426 (customise joins) and #25590 (custom join classes) are almost duplicates, but are much more general than the specific feature of supporting user-defined left joins. I'd probably argue that #26426 should be closed, as we have EXISTS expressions now that can solve that specific problem. Would you agree Tim?
I think there are really two features that we should try to support.
- Allow users to join data onto a query with a LEFT JOIN
- Allow users to add additional conditions onto a JOIN condition
With those two features we'd get pretty far down the line for supporting most common custom join requirements.
What we'd need to do is come to a consensus on what the right syntax would look like to make user defined LEFT JOINs possible. I'm not interested in providing hooks into customising already generated SQL. That would just be a hack to work around our lack of actual support. That's mostly why the linked ticket from 10 years ago languished - it was hacked into .extra() which we're no longer committed to updating.
So what would a decent syntax look like? Should we consider a new queryset method? Should we use .annotate()?
I'll throw out some ideas:
MyModel.objects.annotate(joined=Outer(OtherModel.objects.filter(mymodel_id=OuterRef('pk'))) MyModel.objects.outer('othermodel') MyModel.objects.outer(othermodel=OtherModel.objects.filter(user=request.user)) MyModel.objects.partial_related(othermodel=OtherModel.objects.filter(user=request.user))
Django has tended to avoid using language that maps too closely to SQL in the past, though with the addition of more complex expression types that hasn't been so much of a blocker. I'd be hesitant to add a new queryset method called outer for that reason though. New classes can map to SQL because they're not as discoverable and really are for advanced usage. Increasing the scope of the queryset api with regard to SQL terminology seems off to me.
This is the kind of question that could probably be asked of the django-developers mailing list. There are lots of people with opinions there that'd be relevant to this discussion. In principle though, Django should definitely support adding left joins to a queryset.
comment:7 by , 7 years ago
Cc: | added |
---|
comment:8 by , 7 years ago
I've begun a discussion on the mailing list: https://groups.google.com/forum/#!topic/django-developers/2ITfPZlbsao
Please add your example to that thread if it's different to any already listed there.
comment:9 by , 7 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Version: | 2.0 |
comment:10 by , 6 years ago
I'd be curious to know whether or not FilteredRelation
would solve your use case like it did in #29555.
This expression allows your to specify extra JOIN
conditions so I'd assume annotating the subquery you want to JOIN
against and then referencing it in the FilteredRelation(condition)
should work. It'd help if you could provide your models definition and the exact query you're trying to generate through the ORM.
comment:11 by , 6 years ago
Unfortunately I don't think so. What I needed was exactly left join.
The query I needed was a rather complicated query, I try to abstract it here so that we can investigate it and find out what's needed in Django ORM to achieve it.
Suppose that I have this table:
CREATE TABLE foo(id,related_id,value,type) AS VALUES ( 1 , 1, 'A1' , 1 ), ( 2 , 1, 'A2' , 2 ), ( 3 , 1, 'A3' , 3 ), ( 4 , 1, 'A4' , 4 ), ( 5 , 1, 'A5' , 5 ), ( 6 , 2, 'B1' , 1 ), ( 7 , 2, 'B2' , 2 ), ( 8 , 2, 'B3' , 3 ), ( 9 , 2, 'B4' , 4 ), ( 10, 2, 'B5' , 5 ) ;
I want to aggregate these values and make this intermediate table:
----------------------------------------------------------------------------- | id | related_id | values | ----------------------------------------------------------------------------- | 1 | 1 | (('A1',1),('A2',2),('A3',3),('A4',4),('A5',5)) | ----------------------------------------------------------------------------- | 6 | 2 | (('B1',1),('B2',2),('B3',3),('B4',4),('B5',5)) | -----------------------------------------------------------------------------
To do so I need to do a simple aggregation:
foo.objects.values('related_id').annotate( id=Min('id'), values=ArrayAgg( Func( 'value', 'type', function='ARRAY', template='%(function)s[%(expressions)s]', arg_joiner=',', ), output_field=ArrayField(ArrayField(models.FloatField())), ), )
This will generate this sql query (or something equivalent):
SELECT t.* FROM ( SELECT min(id), related_id, array_agg(ARRAY(value, type)) AS values, FROM foo GROUP BY id ) AS t
So far so good. Then I want to order this query based on value column but this order should order values of this column that are in a row that has type=X. I can do so by this sql:
SELECT t1.* FROM ( SELECT min(id), related_id, array_agg(ARRAY(value, type)) AS values, FROM foo GROUP BY id ) AS t1 LEFT OUTER JOIN (SELECT value FROM foo WHERE type=X) AS t2 USING (id) ORDER BY t2.value
This is where I need this left join. It should be left join cause I don't wanna miss rows that don't have X type value.
Now do you think it's possible to do the above with current Django ORM API?
Consider that this table is really big, and the above sql query is the only one I found that executes in rational time and doesn't miss anything.
Also the real problem is much more complected, the number of columns that are involved are much more so if you think there's room for simplifying the above sql solution consider that in my real usecase it may not be applicable so I please lets concentrate on interpreting the exact above query into Django ORM API and not change the sql query so that it fits the API.
comment:12 by , 5 years ago
I think this ticket should either be repurposed or closed as FilteredRelation
does allow custom left outer joins.
What comment:11 is missing at this point is a way to perform subquery pushdown and a way to JOIN subqueries.
comment:13 by , 5 years ago
Cc: | added |
---|
comment:14 by , 5 years ago
I believe my case relates to this issue.
I have a user model and a second model like this
class SecondModel(models.Model): date = models.DateField() from_member = models.ForeignKey(User, models.CASCADE, '+',) to_member = models.ForeignKey(User, models.CASCADE, '+') first_value = models.IntegerField(default=0) second_value = models.IntegerField(default=0)
I want to left join this subquery with the User model on id=target
:
subquery = ( SecondModel.objects.filter( Q(from_member=self.user) | Q(to_member=self.user), date__gte=start_date, date__lte=end_date, ).annotate( target=Case(When(Q(from_member=self.user), then=F('to_member_id')), default=F('from_member_id')), ).values('target').annotate( sum_first_value=Sum('first_value'), sum_second_value=Sum('second_value'), total=F('sum_first_value') + F('sum_second_value'), ) )
I imagine it somehow like this:
qs = User.objects.annotate(subquery=subquery)
After that filter/exclude/order_by features should work.
Also I should have access to fields like qs[0].subquery_first_value/qs[0].subquery_second_value
Here is an example SQL query that I want to be generated:
SELECT "users".*, "subquery".* FROM "users" LEFT OUTER JOIN (SELECT CASE WHEN "second_model"."from_member_id" = 7 THEN "second_model"."to_member_id" ELSE "second_model"."from_member_id" END AS "target", SUM("second_model"."first_value") AS "sum_first_value", SUM("second_model"."second_value") AS "sum_second_value", FROM "second_model" WHERE ("second_model"."from_member_id" = 7 OR "second_model"."to_member_id" = 7) -- also the condition for date on start and end GROUP BY CASE WHEN "second_model"."from_member_id" = 7 THEN "second_model"."to_member_id" ELSE "second_model"."from_member_id" END) subquery ON ("users"."id" = subquery."target") WHERE "users"."id" = 502
I found a solution that resolves my problem:
https://stackoverflow.com/a/42816689/5628843
using CustomJoin helps me, but to access values i need to use extra, and this code doesn’t look very nice
join_to(User, subquery, 'id', 'target', queryset, 'subquery').extra( select={ field_name: f'COALESCE({field_name}, 0)' for field_name in ( 'sum_first_value', 'sum_second_value', ) } )
comment:15 by , 5 years ago
Cc: | added |
---|
comment:16 by , 5 years ago
Cc: | added |
---|
comment:17 by , 5 years ago
Cc: | added |
---|
@simon, FilteredRelation
is great but only allows custom outer joins one level deep at the moment so isn't a full solution for many use cases. Added a comment here as well: https://code.djangoproject.com/ticket/29789
comment:18 by , 4 years ago
I also need a way to generate a query with custom a join:
SELECT "main_table"."id" FROM "main_table" LEFT JOIN (SELECT distinct(main_id) FROM through_table WHERE other_id = 14) as foo ON foo.main_id=main_table.id ORDER BY "main_table"."id" DESC LIMIT 1;
The main queryset is generated by the view, and the left join subquery is a custom django-filter. Using a id__in
would work but I have to repeat the ORDER BY DESC and LIMIT 1 in the subquery (otherwise it gets really ineficient) and ideally the filter should be independant from the initial queryset (so it does not know the ordering and limit). Also django-filter needs to keep a queryset so I cannot generate a RawQueryset (I am not even sure this is doable).
Having a join
keyword in the extra function would have unlocked me.
The FilteredRelation object is only limited to a condition (q object) and not a full SELECT query. If it could take a RawSQL or a queryset it would be perfect.
So it looks like I cannot generate a quite simple query using any of the django function, a bit frustrating :(
Like @Alexandr Artemyev I found out the CustomJoin trick but it looks very brittle.
Hope my feedback will help understanding the issue :)
comment:19 by , 2 years ago
I think I am running into a similar problem here, where I would like to append multiple conditions
onto my join:
I will try to be precise with this as much as possible.
Imagine these two models. whose relation was set up years ago:
class Event(models.Model): instance_created_date = models.DateTimeField(auto_now_add=True) car = models.ForeignKey(Car, on_delete=models.CASCADE, related_name="car_events") ... a lot of normal text fields here, but they dont matter for this problem.
and
class Car(models.Model): a lot of text fields here, but they dont matter for this problem. hide_from_company_search = models.BooleanField(default=False) images = models.ManyToManyField(Image, through=CarImage)
Lets say I want to query the amount of events for a given car:
def get_car_events_qs() -> QuerySet: six_days_ago = (timezone.now().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=6)) cars = Car.objects.prefetch_related( 'car_events', ).filter( some_conditions_on_fields=False, ).annotate( num_car_events=Count( 'car_events', filter=Q(car_events__instance_created_date__gt=six_days_ago), distinct=True) ) return cars
The really tricky part for this is the performance of the query: Cars
has 450.000
entries, and Events
has 156.850.048
. All fields that I am using to query are indexed. The query takes around 4 minutes to complete, depending on the db load. It took 18 minutes before adding the indicies.
This above ORM query will result in the following sql:
SELECT "core_car"."id", COUNT("analytics_carevent"."id") FILTER (WHERE ("analytics_carevent"."event" = 'view' AND "analytics_carevent"."instance_created_date" >= '2022-05-10T07:45:16.672279+00:00'::timestamptz AND "analytics_carevent"."instance_created_date" < '2022-05-11T07:45:16.672284+00:00'::timestamptz)) AS "num_cars_view", LEFT OUTER JOIN "analytics_carevent" ON ("core_car"."id" = "analytics_carevent"."car_id") WHERE ... some conditions that dont matter GROUP BY "core_car"."id"
I somehow suspect this FILTER
to be a problem.
I tried with
.annotate(num_car_events=Count('car_events'))
and moving the car_events__instance_created_date__gt=six_days_ago
into the filter
:
.filter(some_conditions_on_fields=False, car_events__instance_created_date__gt=six_days_ago)
But of course this would filter out Cars with no Events, which is not what we want - but it is super fast!
I fiddled a bit with it in raw sql and came to his nice working example, that I now would like to write into ORM, since we dont really want to use rawsql. This query takes 2.2s
, which is in our acceptable boundary, but faaaaar less than the 18minutes.
SELECT "core_car"."id", COUNT(DISTINCT "analytics_carevent"."id") AS "num_cars_view", FROM "core_car" LEFT JOIN "analytics_carevent" ON ("core_car"."id" = "analytics_carevent"."car_id" AND "analytics_carevent"."event" = 'view' AND "analytics_carevent"."instance_created_date" > '2022-05-14T00:00:00+02:00'::timestamptz AND "analytics_carevent"."instance_created_date" <= '2022-05-15T00:00:00+02:00'::timestamptz) WHERE (some conditions that dont matter) GROUP BY "core_car"."id";
My question now is:
How can I make the above query into the ORM?
I need to put the "filter" or conditions onto the left join
. If I just use filter()
it will just put it into the where
clause, which is wrong.
I tried:
two_days_ago = (timezone.now().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=2)) cars = Car.objects.prefetch_related( 'car_events', ).filter(some_filters,)
and
cars = cars.annotate(events=FilteredRelation('car_events')).filter(car_events__car_id__in=cars.values_list("id", flat=True), car_events__instance_created_date__gt=six_days_ago)
But I dont think this is quite correct. I also need the count of the annotation.
I also tried with
cars = cars.annotate(events=FilteredRelation('car_events', condition=Q(car_events__car_id__in=ads.values_list("id", flat=True)))).filter(events__instance_created_date__gt=six_days_ago)
But this results in an inner join, which i dont think is wanted here
Using Django 4 and latest python release as of this writing. :)
Thanks a lot!
TLDR: Putting a filter or conditions on LEFT JOIN
in django, instead of queryset.filter()
comment:20 by , 22 months ago
Cc: | added |
---|
follow-up: 22 comment:21 by , 20 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Could you please assign the issue to me so that I can work on it?
comment:22 by , 20 months ago
Replying to Gaurav Joshi:
Could you please assign the issue to me so that I can work on it?
You've already assigned it to yourself.
comment:23 by , 20 months ago
Yes, I assigned that issue to myself 😅. As a new member here, I'm still trying to navigate the process and it has caused some confusion for me
comment:24 by , 20 months ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:26 by , 15 months ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:27 by , 15 months ago
faizan2700, are you sure that you want to work on multiple tickets at once?
comment:28 by , 7 weeks ago
Cc: | added |
---|
A few things.
Now I am picking up from your ticket that you're frustrated. But no open source contributor anywhere has ever responded to the equivalent of "this situation is really bad for me so you must fix it for me" by jumping to do exactly that. If you want people to work, for free, on something you care about, then it's always better to approach the conversation in a more positive and friendly tone.
We'd also be open to any contributions you or your company would be willing to make provided it made sense for the project.