id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
21604	Embed raw queries as subqueries when used with an __in filter	alex@…	William	"As per this discussion on django-developers: https://groups.google.com/d/topic/django-developers/cWrcEHZaMTg/discussion

In brief: At the moment, when passing a `RawQuerySet` to an `__in` filter, the raw query is evaluated and the result passed as a parameter to the filter's SQL. It would be nice to have the option of using the raw SQL in a subquery instead, for performance reasons.

I've had an initial shot at implementing this feature and would love any feedback: https://github.com/AlexHill/django/compare/master...raw_subqueries

I've taken the approach of wrapping the raw query in another `SELECT` when it's used as a subquery, selecting the required columns (either the primary key or others as specified in `only()` - see below) and adding an `IS NOT NULL` clause.


== Responding to Anssi's feedback on django-developers about this approach:

> 1. It should not be assumed that the primary key is the field needed in the inner query. We do have foreign keys to other fields, and one can use this approach without foreign keys at all.

We discussed making a `ValuesRawQuerySet` that could be used like this: `Book.objects.filter(subject_code__in=subject_raw_query.values('code'))`

But that feels a bit too convoluted - as Anssi pointed out, it's a verbose way of writing raw SQL and getting plain data back, which Django already has.

What I think makes sense, and what I've implemented, is using `only()` for this purpose. `QuerySet.only()` and `QuerySet.defer()` appear to silently ignore requests to defer the PK: if you do `Book.objects.only('title')`, both `title` and the primary key are fetched. What I've implement will follow that convention when the RawQuery is executed directly, but leave the PK out in subqueries when applicable, so that things like `Author.objects.get(pk__in=raw_book_query.only('author_id'))` work as expected.

Supporting `only()` requires a pretty big change to RawQuery, and I feel like it might be all a bit too tricky - it's all contained in the most recent commit so can be easily rolled back if necessary. Keen to know what others think.

Using `only()` also doesn't work in SQLite currently, due to #21603.


> 2. If the query is of the form NOT IN, then we need to also filter out null values, otherwise no results are returned if the raw query contains a single null (NOT IN is weird...)

This is taken into account in the patch - subqueries are always filtered for nulls. This will be a database no-op in the common case of selecting a not-null primary key.

> 3. Does the wrapping of the subquery into (select field from (raw_query) where field is not null) cause performance problems? It could be possible that this prevents the DB's optimizer from working correctly.

Postgres does the right thing and as far as I can read SQLite's query planner, that does too. Have not tried any other backends yet.

> 4. If the query doesn't select the required column (that is, it is deferred), we don't have any way to know that. This seems like a documentation issue, but worth mentioning.

The patch in its current state sidesteps this to some degree, in that including the primary key is already a stated requirement of the queries passed to RawQuerySet. However, at the moment you just get an error back from the database backend, instead of the InvalidQuery Django raises when evaluating a pk-less raw query. I think this is acceptable if documented. "	New feature	assigned	Database layer (models, ORM)	dev	Normal			Aivars Kalvāns	Accepted	1	0	0	1	0	0