Context Navigation

← Previous Ticket
Next Ticket →

#21604 assigned New feature

Embed raw queries as subqueries when used with an __in filter

Reported by:	alex@…	Owned by:	William
Component:	Database layer (models, ORM)	Version:	dev
Severity:	Normal	Keywords:
Cc:	Aivars Kalvāns	Triage Stage:	Accepted
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	yes
Easy pickings:	no	UI/UX:	no

Description ¶

As per this discussion on django-developers: https://groups.google.com/d/topic/django-developers/cWrcEHZaMTg/discussion

In brief: At the moment, when passing a RawQuerySet to an __in filter, the raw query is evaluated and the result passed as a parameter to the filter's SQL. It would be nice to have the option of using the raw SQL in a subquery instead, for performance reasons.

I've had an initial shot at implementing this feature and would love any feedback: https://github.com/AlexHill/django/compare/master...raw_subqueries

I've taken the approach of wrapping the raw query in another SELECT when it's used as a subquery, selecting the required columns (either the primary key or others as specified in only() - see below) and adding an IS NOT NULL clause.

Responding to Anssi's feedback on django-developers about this approach: ¶

It should not be assumed that the primary key is the field needed in the inner query. We do have foreign keys to other fields, and one can use this approach without foreign keys at all.

We discussed making a ValuesRawQuerySet that could be used like this: Book.objects.filter(subject_code__in=subject_raw_query.values('code'))

But that feels a bit too convoluted - as Anssi pointed out, it's a verbose way of writing raw SQL and getting plain data back, which Django already has.

What I think makes sense, and what I've implemented, is using only() for this purpose. QuerySet.only() and QuerySet.defer() appear to silently ignore requests to defer the PK: if you do Book.objects.only('title'), both title and the primary key are fetched. What I've implement will follow that convention when the RawQuery is executed directly, but leave the PK out in subqueries when applicable, so that things like Author.objects.get(pk__in=raw_book_query.only('author_id')) work as expected.

Supporting only() requires a pretty big change to RawQuery, and I feel like it might be all a bit too tricky - it's all contained in the most recent commit so can be easily rolled back if necessary. Keen to know what others think.

Using only() also doesn't work in SQLite currently, due to #21603.

If the query is of the form NOT IN, then we need to also filter out null values, otherwise no results are returned if the raw query contains a single null (NOT IN is weird...)

This is taken into account in the patch - subqueries are always filtered for nulls. This will be a database no-op in the common case of selecting a not-null primary key.

Does the wrapping of the subquery into (select field from (raw_query) where field is not null) cause performance problems? It could be possible that this prevents the DB's optimizer from working correctly.

Postgres does the right thing and as far as I can read SQLite's query planner, that does too. Have not tried any other backends yet.

If the query doesn't select the required column (that is, it is deferred), we don't have any way to know that. This seems like a documentation issue, but worth mentioning.

The patch in its current state sidesteps this to some degree, in that including the primary key is already a stated requirement of the queries passed to RawQuerySet. However, at the moment you just get an error back from the database backend, instead of the InvalidQuery Django raises when evaluating a pk-less raw query. I think this is acceptable if documented.

According to the ticket's flags, the next step(s) to move this issue forward are:

To improve the patch as described in the pull request review comments or on this ticket, then uncheck "Patch needs improvement".
If creating a new pull request, include a link to the pull request in the ticket comment when making that update. The usual format is: [https://github.com/django/django/pull/#### PR].

Change History (10)

comment:1 by Marc Tamlyn, 11 years ago

Triage Stage:	Unreviewed → Accepted

comment:2 by Anssi Kääriäinen, 11 years ago

Patch needs improvement:	set

Sorry for not responding to this ticket sooner.

I think the patch is a bit too invasive for this ticket. We only need support for somefield__in=rawqs. It would be hopefully possible to just alter the way how the SQL is generated when used in an in lookup. In particular there shouldn't be any need

I wonder if it would be better to just add support for qs.filter(somefield__in=RawQuery(sql_str, params)) (and also do the same for other lookup types if at all possible). This would be a lot easier to support (actually, this will be likely extremely easy to do after #14030). Of course, this doesn't allow one to use an existing raw query.

If this could be implemented with just wrapping the query in subquery when used from __in lookup, then I think supporting somefield__in=rawqs directly is OK. If not, then lets focus on adding support for somefield__in=RawQuery(sql_str, params). In other words, the current patch seems too complicated for the added feature.

comment:3 by Simon Charette, 6 years ago

This should be as easy as implementing RawQuery.as_sql to return (self.sql, self.params) and setting has_select_fields=True as class attribute.

comment:4 by Andreas Galazis, 2 years ago

Any news? This is a must if you want to optimize subqueries that cannot be generated by orm

comment:5 by Simon Charette, 2 years ago

Andreas, if this is something you'd like to see in Django you could give a shot submitting a PR based on the approach described in comment:3

Something along these lines, with regression tests, should get you almost all the way there

TabularUnified django/db/models/sql/query.py

diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
index c2a71ff589..4845ae9e90 100644

                def get_children_from_q(q):
 class RawQuery:
     """A single raw SQL query."""
+    has_select_fields = True
     def __init__(self, sql, using, params=()):
         self.params = params
         self.sql = sql
-…
+               def _execute_query(self):
         self.cursor = connection.cursor()
         self.cursor.execute(self.sql, params)
+    def as_sql(self, compiler, connection):
+        return self.sql, self.params
 ExplainInfo = namedtuple("ExplainInfo", ("format", "options"))

comment:6 by William, 2 years ago

Owner:	changed from nobody to William
Status:	new → assigned

comment:7 by Aivars Kalvāns, 16 months ago

Hi!

Is this still a thing? I had a similar need recently and solved it by using RawSQL:

from django.db.models.expressions import RawSQL

NamedCategory.objects.create(id=1, name="first")
NamedCategory.objects.create(id=2, name="second")
NamedCategory.objects.create(id=3, name="third")
NamedCategory.objects.create(id=4, name="fourth")

query = DumbCategory.objects.filter(
    id__in=RawSQL("SELECT id FROM queries_dumbcategory WHERE id >= %s", params=[3])
)

self.assertEqual(set(query.values_list("id", flat=True)), {3, 4})
print(query.query)

this works as expected and prints out

SELECT "queries_dumbcategory"."id" FROM "queries_dumbcategory" WHERE "queries_dumbcategory"."id" IN (SELECT id FROM queries_dumbcategory WHERE id >= 3)

which confirms that the raw subquery was embedded.
Is there something that would make it better or more useful by using RawQuerySet?

comment:8 by Aivars Kalvāns, 16 months ago

https://docs.djangoproject.com/en/4.2/ref/models/expressions/#raw-sql-expressions

comment:9 by Simon Charette, 16 months ago

The submitted report was about making RawQuerySet work but if RawSQL works, and didn't exist at the time when this report was created 10 years ago, I'd be inclined to close this one as wontfix at this point.

comment:10 by Aivars Kalvāns, 16 months ago

Cc:	Aivars Kalvāns added

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues