Opened 4 years ago

Closed 9 days ago

#27719 closed New feature (fixed)

Add queryset.alias() to mimic .annotate() for aggregations without loading data

Reported by: Marc Tamlyn Owned by: Alexandr Tatarinov
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

At the moment the Expressions API currently does not allow creating a value for later use in the query without returning that data from the database. In some cases this can be quite expensive over the pipe, and can also influence query execution where functional indexes are involved. I would like to introduce an API like alias(), which does the same thing as annotate() to allow reuse in later querysets but doesn't return the value.

To be explicit, we are changing a query from one of the form:

SELECT to_tsvector('english', "some_field"), "some", "other", "fields" FROM "table" WHERE to_tsvector('english', "some_field") @@ plainto_tsquery("search term") ... LIMIT 10;

to one of the form:

SELECT "some", "other", "fields" FROM "table" WHERE to_tsvector('english', "some_field") @@ plainto_tsquery("search term") ... LIMIT 10;

Change History (18)

comment:1 Changed 4 years ago by Christophe Pettus

Doing a quick example query: In a table with a relatively large text field, but just 2,400 rows, the difference between returning all rows and all fields without the ts_vector() call and returning all with it was 12.7ms vs 24,000ms.

Doing a full sequential scan (no index) with both calculating the to_tsvector() for return and in query predicate is 27,000ms without recalculating the to_tsvector for return, 46,700ms with recalculating it.

After having created a functional index on the to_tsvector() call, and using a relatively common search term, the time to return without recalculating the to_tsvector was 1.7ms, vs 21,200ms with recalculating the ts_vector.

So, in any scenario, not having to recalcuate the to_tsvector() is a win.

comment:2 Changed 4 years ago by Simon Charette

Could this be solved by allowing expressions to be passed to filter() instead? #25367.

comment:3 Changed 4 years ago by Marc Tamlyn

Whilst support in all the places which can use an annotated name would solve this kind of performance problem, there are loads of ways you can use a field name after it's defined.

.filter(name__lookup=value)
.order_by('name')
.annotate(SomeOtherExpression('name'))
.values('name') (and related APIs)`
.distinct()
.date() (and datetimes)
.defer('name') (and only)

I know that some of these places already support expressions. For very complex expressions there's utility in being able to "name" the expression as some kind of virtual field. If you're within the same function then a python reference is usually sufficient (except for the __lookup case), but if you have for example a manager method which would annotate a certain field and you use it elsewhere then things get messy quickly.

comment:4 Changed 4 years ago by Simon Charette

Triage Stage: UnreviewedAccepted

That makes sense, thanks for the clarification.

comment:5 Changed 4 years ago by Josh Smeaton

I'm kind of on the fence with this. The 90% case is expressions in WHERE clauses. We need expression support in filter/exclude, including the ability to filter without a left-hand-side (column reference) for things such as EXISTS( subq ).

Having alias() feels like a fairly big hack.

Let's look at your list of example usages of aliases:

.filter(name__lookup=value) - Yes, Yes, Yes. But solved by expressions-in-filters.
.order_by('name') - Already supports expressions
.annotate(SomeOtherExpression('name')) - Already supported.
.values('name') (and related APIs)` - If you want the values, then it's already in the select list and should be kept there. But expressions in values works now, no?
.distinct() - Already in the select list, refer by name.
.date() (and datetimes) - unsure about this, fairly uncommon?
.defer('name') (and only) - it's already in the select list

Out of all of the above, except for filter we always want the expression in the select list anyway. And when we use defer, we still want it in the annotations list so we can use it later (I think).

Am I missing something? The only time you might not want the value in the select list is if you're running a filter and possibly an order by. maybe an annotation over another expression where the other expression shouldn't be selected, but I feel that'd be a very small set of users, just to avoid creating two objects that are effectively the same.

Last edited 4 years ago by Josh Smeaton (previous) (diff)

comment:6 Changed 4 years ago by Marc Tamlyn

I take your point about the select list, most of those APIs (e.g. values, distinct) do indeed require the field to be selected making alias slightly redundant.

For me it's very explicitly not a hack, and the fact we can use annotate is the hack. But that's bikeshedding.

More practically, consider this code:

query = SearchQuery('some term')
return qs.alias(
   main_vector=SearchVector('title', 'body'),
   secondary_vector=SearchVector('parent__title', 'parent__body'),
).alias(
   vector=F('main_vector') + F('secondard_vector'),
).alias(
   rank=Case(
      When(vector__search=query, then=SearchRank('vector', query)),
      When(secondary_vector__search=query, then=SearchRank('secondary_vector', query)),
      # other cases here
   )
).order_by('-rank')

(I'm aware this example is contrived, but I have some pretty complex annotations for reporting, and they're even more fun with SubQuery... In most of these cases it doesn't matter that this field is in the select_list, although it doesn't need to be.

Here I've actually used the "alias approach" and the "variable approach" - one for vectors and one for queries.

The alias approach has allowed me to:

  • Incrementally build up expressions in the query
  • Name values and potentially reuse them at a later time in the query creation phase - perhaps I have some other function which takes the returned queryset from this method and counts the records where the rank is > 0.7 to give an indication of strong matching results, then without the naming from annotate or alias I'd have to access or recreate the component expressions. Rank is an int, so again it doesn't really matter to have it queryable.

I think #25367 is an important piece of work, and it would help solve the performance problem by allowing unselected values. I think though that there is a merit in being able to create a "virtual field" using an expression, and use it in subsequent querysets. This would be an easier solution to this particular performance issue, and it also gives the user explicit API control over the select list where needed, rather than implicitly by how you used your expression. It's not an either/or scenario.

comment:7 Changed 3 years ago by Cristiano Coelho

In case it helps, I have got an issue using annotate to simply select some values and performance related to it and the way it interacts with count: https://groups.google.com/forum/#!topic/django-users/q6XdfyK29HA

Looks like alias might help this issue.

comment:8 Changed 23 months ago by Simon Charette

FWIW alias(...) was initially proposed in https://code.djangoproject.com/ticket/16187#comment:28 but was later removed.

comment:9 Changed 14 months ago by Simon Charette

There was a recent post to django-developers where a user was trying to use annotate(foo=expr).defer('foo') for this exact purpose. Their use case was more about avoiding the double evaluation of an EXISTS expression which is probably fixed on the master branch by 1ca825e4dc186da2b93292b5c848a3e5445968d7 and could be addressed by #25367 which was previously mentioned but it remains that adding annotation support to defer and only could be another solution for this issue that wouldn't require adding another method to the QuerySet API.

Last edited 14 months ago by Simon Charette (previous) (diff)

comment:10 Changed 3 months ago by Alexandr Tatarinov

Owner: changed from nobody to Alexandr Tatarinov
Status: newassigned

I am working on this one. I've had a lot of cases where unneeded expressions ended up in a select clause.
When someone builds complex queries spanning multiple methods and modules, this can save a lot of CPU (for ORM to build the query) and query execution time.
My suggestion is to support only the usages below, values(), distinct() and aggregate() will require to use .annotate() to add the expression into the select list.

.alias(alias=SomeExpression()).annotate(SomeOtherExpression('alias'))
.alias(alias=SomeExpression()).filter(alias__lookup=value)
.alias(alias=SomeExpression()).order_by('alias')
.alias(alias=SomeExpression()).update(field=F('alias'))

Also automatically supported: earliest/latest and dates/datetimes.

And if you want to use values() or distinct(), promote alias to annotation

.alias(alias=SomeExpression()).annotate(alias=F('alias')).distinct('alias')

The implementation relies on annotation_mask to remove an expression from the select list, all other logic remains the same: despite the expression is not selected, the joins and group by are the same as for annotation.

comment:11 Changed 3 months ago by Alexandr Tatarinov

comment:12 Changed 3 months ago by Alexandr Tatarinov

Has patch: set

comment:13 Changed 3 weeks ago by Simon Charette

Patch needs improvement: set

Left some comments for improvements on the PR but pending that the proposed changes LGTM.

comment:14 Changed 2 weeks ago by Alexandr Tatarinov

Patch needs improvement: unset

Made requested changes

comment:15 Changed 10 days ago by felixxm

Patch needs improvement: set

comment:16 Changed 10 days ago by Alexandr Tatarinov

Patch needs improvement: unset

comment:17 Changed 9 days ago by felixxm

Triage Stage: AcceptedReady for checkin

comment:18 Changed 9 days ago by Mariusz Felisiak <felisiak.mariusz@…>

Resolution: fixed
Status: assignedclosed

In f4ac1671:

Fixed #27719 -- Added QuerySet.alias() to allow creating reusable aliases.

QuerySet.alias() allows creating reusable aliases for expressions that
don't need to be selected but are used for filtering, ordering, or as
a part of complex expressions.

Thanks Simon Charette for reviews.

Note: See TracTickets for help on using tickets.
Back to Top