Opened 9 months ago

Last modified 9 months ago

#27719 new New feature

Add queryset.alias() to mimic .annotate() for aggregations without loading data

Reported by: Marc Tamlyn Owned by: nobody
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

At the moment the Expressions API currently does not allow creating a value for later use in the query without returning that data from the database. In some cases this can be quite expensive over the pipe, and can also influence query execution where functional indexes are involved. I would like to introduce an API like alias(), which does the same thing as annotate() to allow reuse in later querysets but doesn't return the value.

To be explicit, we are changing a query from one of the form:

SELECT to_tsvector('english', "some_field"), "some", "other", "fields" FROM "table" WHERE to_tsvector('english', "some_field") @@ plainto_tsquery("search term") ... LIMIT 10;

to one of the form:

SELECT "some", "other", "fields" FROM "table" WHERE to_tsvector('english', "some_field") @@ plainto_tsquery("search term") ... LIMIT 10;

Change History (6)

comment:1 Changed 9 months ago by Christophe Pettus

Doing a quick example query: In a table with a relatively large text field, but just 2,400 rows, the difference between returning all rows and all fields without the ts_vector() call and returning all with it was 12.7ms vs 24,000ms.

Doing a full sequential scan (no index) with both calculating the to_tsvector() for return and in query predicate is 27,000ms without recalculating the to_tsvector for return, 46,700ms with recalculating it.

After having created a functional index on the to_tsvector() call, and using a relatively common search term, the time to return without recalculating the to_tsvector was 1.7ms, vs 21,200ms with recalculating the ts_vector.

So, in any scenario, not having to recalcuate the to_tsvector() is a win.

comment:2 Changed 9 months ago by Simon Charette

Could this be solved by allowing expressions to be passed to filter() instead? #25367.

comment:3 Changed 9 months ago by Marc Tamlyn

Whilst support in all the places which can use an annotated name would solve this kind of performance problem, there are loads of ways you can use a field name after it's defined.

.filter(name__lookup=value)
.order_by('name')
.annotate(SomeOtherExpression('name'))
.values('name') (and related APIs)`
.distinct()
.date() (and datetimes)
.defer('name') (and only)

I know that some of these places already support expressions. For very complex expressions there's utility in being able to "name" the expression as some kind of virtual field. If you're within the same function then a python reference is usually sufficient (except for the __lookup case), but if you have for example a manager method which would annotate a certain field and you use it elsewhere then things get messy quickly.

comment:4 Changed 9 months ago by Simon Charette

Triage Stage: UnreviewedAccepted

That makes sense, thanks for the clarification.

comment:5 Changed 9 months ago by Josh Smeaton

I'm kind of on the fence with this. The 90% case is expressions in WHERE clauses. We need expression support in filter/exclude, including the ability to filter without a left-hand-side (column reference) for things such as EXISTS( subq ).

Having alias() feels like a fairly big hack.

Let's look at your list of example usages of aliases:

.filter(name__lookup=value) - Yes, Yes, Yes. But solved by expressions-in-filters.
.order_by('name') - Already supports expressions
.annotate(SomeOtherExpression('name')) - Already supported.
.values('name') (and related APIs)` - If you want the values, then it's already in the select list and should be kept there. But expressions in values works now, no?
.distinct() - Already in the select list, refer by name.
.date() (and datetimes) - unsure about this, fairly uncommon?
.defer('name') (and only) - it's already in the select list

Out of all of the above, except for filter we always want the expression in the select list anyway. And when we use defer, we still want it in the annotations list so we can use it later (I think).

Am I missing something? The only time you might not want the value in the select list is if you're running a filter and possibly an order by. maybe an annotation over another expression where the other expression shouldn't be selected, but I feel that'd be a very small set of users, just to avoid creating two objects that are effectively the same.

Last edited 9 months ago by Josh Smeaton (previous) (diff)

comment:6 Changed 9 months ago by Marc Tamlyn

I take your point about the select list, most of those APIs (e.g. values, distinct) do indeed require the field to be selected making alias slightly redundant.

For me it's very explicitly not a hack, and the fact we can use annotate is the hack. But that's bikeshedding.

More practically, consider this code:

query = SearchQuery('some term')
return qs.alias(
   main_vector=SearchVector('title', 'body'),
   secondary_vector=SearchVector('parent__title', 'parent__body'),
).alias(
   vector=F('main_vector') + F('secondard_vector'),
).alias(
   rank=Case(
      When(vector__search=query, then=SearchRank('vector', query)),
      When(secondary_vector__search=query, then=SearchRank('secondary_vector', query)),
      # other cases here
   )
).order_by('-rank')

(I'm aware this example is contrived, but I have some pretty complex annotations for reporting, and they're even more fun with SubQuery... In most of these cases it doesn't matter that this field is in the select_list, although it doesn't need to be.

Here I've actually used the "alias approach" and the "variable approach" - one for vectors and one for queries.

The alias approach has allowed me to:

  • Incrementally build up expressions in the query
  • Name values and potentially reuse them at a later time in the query creation phase - perhaps I have some other function which takes the returned queryset from this method and counts the records where the rank is > 0.7 to give an indication of strong matching results, then without the naming from annotate or alias I'd have to access or recreate the component expressions. Rank is an int, so again it doesn't really matter to have it queryable.

I think #25367 is an important piece of work, and it would help solve the performance problem by allowing unselected values. I think though that there is a merit in being able to create a "virtual field" using an expression, and use it in subsequent querysets. This would be an easier solution to this particular performance issue, and it also gives the user explicit API control over the select list where needed, rather than implicitly by how you used your expression. It's not an either/or scenario.

Note: See TracTickets for help on using tickets.
Back to Top