Opened 4 months ago

Last modified 10 days ago

#29262 new New feature

Custom Left Outer Join in Queries

Reported by: Sassan Haradji Owned by: nobody
Component: Database layer (models, ORM) Version:
Severity: Normal Keywords: ORM Join
Cc: josh.smeaton@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I need a query that contains a left outer join. The table has ~160,000,000 rows, and avoiding the outer join degrades performance so badly that the query becomes unusable.
So my only choice is the raw() method of the model's manager.
I need to pass the queryset to REST framework so that it can filter, sort, paginate, etc.
REST framework needs a normal queryset (with count, filter, order_by, etc.), and I'm trying to work around that with lots of hacks (proxying objects, overriding internal methods, changing the order in which they call other internal methods, and changing some standard, tested code in the overridden methods).

It's a terrible experience. I've seen requests and pull requests for join support going back 10 years, and lots of related questions on Stack Overflow, Reddit and all around the web. So I'm here to ask you once again to do something about this issue.
At the very least, you could provide a way to modify the SQL that a normal query (not a raw query) is about to send to the database, and let the developer take the risk, see if it breaks things and handle it himself. It would only be used in edge cases by people who really need it, and if they need it they probably know what they're doing. It would be better than the nightmare developers have to deal with when they need custom queries.

Change History (11)

comment:1 Changed 4 months ago by Josh Smeaton

A few things.

  1. I wasn't aware there were situations where a LEFT OUTER JOIN would have better performance than an INNER JOIN, since the LEFT JOIN looks at and includes more data. Are you able to provide the SQL for what is being generated and for what you want? The model definitions would also help.
  2. There have been lots of questions about customising joins. As far as I know, there have been no pull requests implementing such a thing in a reasonable manner. If you're aware of any pull requests it'd be good if you could share where those are so we can discuss the merits of each change.
  3. You're asking for an escape hatch that isn't the escape hatch that Django is already providing. Django provides .raw() for exactly these purposes. If, for some technical reason, .raw() isn't appropriate, please discuss why so we can address those particular concerns.
  4. Is it possible to address your situation by using .union()? You can represent the FULL JOIN portion with one query and the NULL join portion with a second query, then .union() them together.
  5. There was some work done recently on annotating joins onto querysets, but I've been unable to find that ticket or patch.
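
The .union() idea in point 4 can be illustrated in plain SQL. The following is only a sketch using Python's built-in sqlite3 module, not Django ORM code; the author and book tables are hypothetical and do not come from this ticket. It shows that a LEFT OUTER JOIN returns the same rows as the matched (inner join) portion combined with the unmatched rows padded with NULL.

```python
import sqlite3

# Hypothetical schema for illustration only (not from the ticket).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 1, 'First'), (2, 1, 'Second');
""")

# The direct form: every author appears, with NULL where no book matches.
left_join = conn.execute("""
    SELECT a.name, b.title
    FROM author a
    LEFT OUTER JOIN book b ON b.author_id = a.id
    ORDER BY a.name, b.title
""").fetchall()

# The emulation: matched rows via INNER JOIN, plus unmatched authors with NULL.
emulated = conn.execute("""
    SELECT name, title FROM (
        SELECT a.name AS name, b.title AS title
        FROM author a
        INNER JOIN book b ON b.author_id = a.id
        UNION ALL
        SELECT a.name, NULL
        FROM author a
        WHERE NOT EXISTS (SELECT 1 FROM book b WHERE b.author_id = a.id)
    )
    ORDER BY name, title
""").fetchall()

print(left_join)
print(emulated)
```

Both queries return the same rows here, but the emulation reads the author table twice, which may matter on a very large table.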

Now I am picking up from your ticket that you're frustrated. But no open source contributor anywhere has ever responded to the equivalent of "this situation is really bad for me so you must fix it for me" by jumping to do exactly that. If you want people to work, for free, on something you care about, then it's always better to approach the conversation in a more positive and friendly tone.

We'd also be open to any contributions you or your company would be willing to make provided it made sense for the project.

comment:2 Changed 4 months ago by Sassan Haradji

  1. Like many others, I need LEFT OUTER JOIN, not INNER JOIN. If INNER JOIN were an efficient alternative to LEFT OUTER JOIN, Django itself would use it in its foreign key joins. These joins don't do the same thing and aren't alternatives for each other.
  2. This is one from "10 years ago": https://code.djangoproject.com/ticket/7231
  3. .raw() may be good for some use cases, but I don't see why I should lose all the features of a normal query (like count, filter, etc.) just because I want to add a simple column via an outer join to that normal query. What I suggested as an "at least" alternative was an escape hatch that lets developers use normal queries (not raw queries) and patch them.

A framework is just a tool in the hands of the developer. I think anyone building a tool should consider that the tool should not limit its users but give them new opportunities. Django provides lots of opportunities, but not providing an easy way to patch the final SQL it compiles is not a good thing. There are hard ways: I can subclass the connection and query classes and change their behavior, but there should be an easy way to do it (even if Django supports left outer joins). The documentation should try to convince normal users not to use it and describe the security problems and instabilities it may introduce. But if someone needs it and knows what he's doing, then he should be able to do it.

comment:3 Changed 4 months ago by Sassan Haradji

  4. I don't need union. I'm not trying to add rows to my main query; I'm trying to add a column to my query via an outer join. So unfortunately union is not an option.
  5. I would be glad if you could find it and share it here so that we can see if it solves the problem.

comment:4 Changed 4 months ago by Sassan Haradji

SpamBayes won't let me post this part because it refers to some Google searches, so I've posted it in a gist: https://gist.github.com/sassanh/43ef664872c322a5f88434f10c5ce4ea

By the way, there's an implicit message in every issue reported in an open source community: the reporter isn't asking any specific person to fix it, since he has no right to do so as he's not paying them. What he wants by reporting an issue is to build consensus that the issue exists, get acknowledgement of it from the community, and have someone (maybe himself, maybe a current contributor, or maybe someone who is 2 years old now and will become a software developer in the future) solve it. An open issue/ticket is a step toward the progress of the project, so while I do not ask current contributors to solve this issue, I do ask them not to abuse their privileges on this ticketing system and not to close it until it's either implemented in Django or someone provides a good argument for "why a developer using the Django framework can do whatever he wants without using custom outer joins."

Last edited 4 months ago by Sassan Haradji

comment:5 Changed 4 months ago by Tim Graham

Duplicate of #26426, "Add a way to customize a QuerySet's joins"?

comment:6 Changed 4 months ago by Josh Smeaton

It's more than likely that I have misinterpreted your request to address this issue as a demand for core to fix the issue due to language differences. I apologise.

#26426 (customise joins) and #25590 (custom join classes) are almost duplicates, but are much more general than the specific feature of supporting user-defined left joins. I'd probably argue that #26426 should be closed, as we have EXISTS expressions now that can solve that specific problem. Would you agree Tim?

I think there are really two features that we should try to support.

  1. Allow users to join data onto a query with a LEFT JOIN
  2. Allow users to add additional conditions onto a JOIN condition

With those two features we'd get pretty far down the line for supporting most common custom join requirements.
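
Why the second feature matters can be seen in plain SQL. This is a sketch with Python's built-in sqlite3 module rather than Django ORM code; the mymodel and othermodel tables and the owner column are hypothetical stand-ins for the models in the syntax ideas. A condition placed in the ON clause of a LEFT JOIN keeps every row of the left table, while the same condition in WHERE discards the NULL-extended rows and so behaves like an inner join.

```python
import sqlite3

# Hypothetical tables for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mymodel (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE othermodel (id INTEGER PRIMARY KEY, mymodel_id INTEGER, owner TEXT);
    INSERT INTO mymodel VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO othermodel VALUES (1, 1, 'alice'), (2, 2, 'bob');
""")

# Extra condition inside ON: all mymodel rows survive, unmatched ones get NULL.
in_on = conn.execute("""
    SELECT m.name, o.owner
    FROM mymodel m
    LEFT OUTER JOIN othermodel o
      ON o.mymodel_id = m.id AND o.owner = 'alice'
    ORDER BY m.id
""").fetchall()

# Same condition in WHERE: rows whose o.owner is NULL or differs are filtered out.
in_where = conn.execute("""
    SELECT m.name, o.owner
    FROM mymodel m
    LEFT OUTER JOIN othermodel o ON o.mymodel_id = m.id
    WHERE o.owner = 'alice'
    ORDER BY m.id
""").fetchall()

print(in_on)     # three rows, one per mymodel row
print(in_where)  # only the row that matched
```

A queryset .filter() normally ends up in WHERE, which is why a separate way to attach conditions to the JOIN itself is needed.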

What we'd need to do is come to a consensus on what the right syntax would look like to make user defined LEFT JOINs possible. I'm not interested in providing hooks into customising already generated SQL. That would just be a hack to work around our lack of actual support. That's mostly why the linked ticket from 10 years ago languished - it was hacked into .extra() which we're no longer committed to updating.

So what would a decent syntax look like? Should we consider a new queryset method? Should we use .annotate()?

I'll throw out some ideas:

MyModel.objects.annotate(joined=Outer(OtherModel.objects.filter(mymodel_id=OuterRef('pk'))))

MyModel.objects.outer('othermodel')

MyModel.objects.outer(othermodel=OtherModel.objects.filter(user=request.user))

MyModel.objects.partial_related(othermodel=OtherModel.objects.filter(user=request.user))

Django has tended to avoid using language that maps too closely to SQL in the past, though with the addition of more complex expression types that hasn't been so much of a blocker. I'd be hesitant to add a new queryset method called outer for that reason though. New classes can map to SQL because they're not as discoverable and really are for advanced usage. Increasing the scope of the queryset api with regard to SQL terminology seems off to me.

This is the kind of question that could probably be asked of the django-developers mailing list. There are lots of people with opinions there that'd be relevant to this discussion. In principle though, Django should definitely support adding left joins to a queryset.

Last edited 4 months ago by Tim Graham

comment:7 Changed 4 months ago by Josh Smeaton

Cc: josh.smeaton@… added

comment:8 Changed 4 months ago by Josh Smeaton

I've begun a discussion on the mailing list: https://groups.google.com/forum/#!topic/django-developers/2ITfPZlbsao

Please add your example to that thread if it's different to any already listed there.

comment:9 Changed 4 months ago by Josh Smeaton

Triage Stage: Unreviewed → Accepted
Version: 2.0

comment:10 Changed 10 days ago by Simon Charette

I'd be curious to know whether or not FilteredRelation would solve your use case like it did in #29555.

This expression allows you to specify extra JOIN conditions, so I'd assume annotating the subquery you want to JOIN against and then referencing it in the FilteredRelation(condition) should work. It'd help if you could provide your model definitions and the exact query you're trying to generate through the ORM.

comment:11 Changed 10 days ago by Sassan Haradji

Unfortunately I don't think so. What I needed was exactly a left join.
The query I needed was rather complicated, so I'll try to abstract it here so that we can investigate it and find out what's needed in the Django ORM to achieve it.
Suppose that I have this table:

CREATE TABLE foo(id,related_id,value,type)
AS VALUES
    ( 1 , 1,  'A1' , 1 ),
    ( 2 , 1,  'A2' , 2 ),
    ( 3 , 1,  'A3' , 3 ),
    ( 4 , 1,  'A4' , 4 ),
    ( 5 , 1,  'A5' , 5 ),
    ( 6 , 2,  'B1' , 1 ),
    ( 7 , 2,  'B2' , 2 ),
    ( 8 , 2,  'B3' , 3 ),
    ( 9 , 2,  'B4' , 4 ),
    ( 10, 2,  'B5' , 5 )
;

I want to aggregate these values and make this intermediate table:

-----------------------------------------------------------------------------
|  id   |   related_id   |  values                                          |
-----------------------------------------------------------------------------
|  1    |   1            |  (('A1',1),('A2',2),('A3',3),('A4',4),('A5',5))  |
-----------------------------------------------------------------------------
|  6    |   2            |  (('B1',1),('B2',2),('B3',3),('B4',4),('B5',5))  |
-----------------------------------------------------------------------------

To do so I need to do a simple aggregation:

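from django.contrib.postgres.aggregates import ArrayAgg
from django.contrib.postgres.fields import ArrayField
from django.db import models
from django.db.models import Func, Min

# Group by related_id and collect [value, type] pairs into one array per group.
foo.objects.values('related_id').annotate(
    id=Min('id'),
    values=ArrayAgg(
        Func(
            'value',
            'type',
            function='ARRAY',
            template='%(function)s[%(expressions)s]',
            arg_joiner=',',
        ), output_field=ArrayField(ArrayField(models.FloatField())),
    ),
)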

This will generate this sql query (or something equivalent):

SELECT t.*
FROM (
  SELECT
    min(id) AS id,
    related_id,
    array_agg(ARRAY[value, type]) AS values
  FROM foo
  GROUP BY related_id
) AS t

So far so good. Next I want to order this query by the value column, but the ordering should use only the value from the row that has type=X. I can do so with this SQL:

SELECT t1.*
FROM (
  SELECT
    min(id) AS id,
    related_id,
    array_agg(ARRAY[value, type]) AS values
  FROM foo
  GROUP BY related_id
) AS t1
LEFT OUTER JOIN (SELECT related_id, value FROM foo WHERE type=X) AS t2 USING (related_id)
ORDER BY t2.value

This is where I need the left join. It has to be a left join because I don't want to miss rows that don't have a value of type X.
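
The shape of this query can be tried out locally. The following is a rough sketch on SQLite through Python's sqlite3 module, under several assumptions that are not in the ticket: group_concat stands in for PostgreSQL's array_agg, X is taken to be 3, the join key is assumed to be related_id, and the sample data is changed so that group 2 has no type-3 row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, related_id INTEGER,
                      value TEXT, type INTEGER);
    -- Illustrative data: group 2 deliberately lacks a type-3 row.
    INSERT INTO foo VALUES
        (1, 1, 'A1', 1), (2, 1, 'A2', 2), (3, 1, 'Z3', 3),
        (6, 2, 'B1', 1), (7, 2, 'B2', 2),
        (11, 3, 'C1', 1), (13, 3, 'C3', 3);
""")

X = 3  # the type whose value drives the ordering (assumed for this sketch)

rows = conn.execute("""
    SELECT t1.id, t1.related_id, t1.vals, t2.value
    FROM (
        SELECT min(id) AS id,
               related_id,
               group_concat(value || ':' || type) AS vals  -- array_agg stand-in
        FROM foo
        GROUP BY related_id
    ) AS t1
    LEFT OUTER JOIN (
        SELECT related_id, value FROM foo WHERE type = ?
    ) AS t2 ON t2.related_id = t1.related_id
    ORDER BY t2.value
""", (X,)).fetchall()

for row in rows:
    print(row)
```

Group 2 still appears in the result, with NULL in the ordering column; an INNER JOIN would drop it. SQLite sorts NULL first in ascending order while PostgreSQL sorts it last by default, so the position of that row differs between the two databases.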

Now, do you think it's possible to do the above with the current Django ORM API?
Consider that this table is really big, and the above SQL query is the only one I found that executes in reasonable time and doesn't miss anything.
Also, the real problem is much more complicated and involves many more columns, so if you think there's room for simplifying the above SQL, consider that the simplification may not be applicable to my real use case. So please let's concentrate on translating the exact query above into the Django ORM API, and not on changing the SQL query so that it fits the API.
