Opened 10 years ago

Closed 4 years ago

Last modified 4 years ago

#21181 closed New feature (fixed)

collation specific query results ordering

Reported by: alan.kesselmann@… Owned by: Tom Carrick
Component: Database layer (models, ORM) Version:
Severity: Normal Keywords: ORM
Cc: Tom Carrick, Alexandr Artemyev, Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

There has been feature request for it in 2012 in django-developers group : https://groups.google.com/forum/#!topic/django-developers/0iESVnawNAY. Before finding that request i asked about same thing in stackoverflow: http://stackoverflow.com/questions/18935712/queryset-sorting-specifying-column-collation-for-django-orm-query. Both Anssi (https://groups.google.com/d/msg/django-developers/0iESVnawNAY/JefMfAm7nQMJ) and me outline how it could be used im same way:

.order_by([fieldnames], collation = 'et_EE')

another version of it could be automatical, based on django language settings. Model often has ordering set in its Meta. So why not include automatical ordering there too.

Is it possible to include this into some of next versions of Django?

Some background

If django is used to power some international site like a newspaper then there is rarely need for sorting data based on sorting rules of language other than english. If you use it to power a site that hosts data in several languages then you have need to also display data in that language properly. Sorting based on English alphabet does not work correctly in that case cause letters missing from english alphabet (and there are MANY of those letters that exist in swedish, danish, german, not to mention estonian, russian etc. alphabets) are displayed in completely wrong places.

Again - most django sites will be probably using single language. But there are many cases that will benefit from this feature. Like e-commerce modules, that translate product names - they can benefit from it. Any site that allows language based naming or translating of their objects will benefit from this feature.

Change History (21)

comment:1 by Anssi Kääriäinen, 10 years ago

I agree that some solution to this is a good idea. There are some tickets that are close to this request, #14310 and #9682 come to mind. I didn't find a perfect duplicate so accepting this.

Custom lookups could offer a neat solution to this. You should be able to do: qs.alias(name_fi=Collate('name', 'fi')).order_by('name_fi'). Or, to use collated filtering, use qs.alias(name_fi=Collate('name', 'fi')).filter(name_fi__icontains='someval'). See #16187 for details. The patch in that ticket already offer all the machinery needed to do this, but there is still a lot of work before merge is possible.

While waiting for custom lookups maybe there should be some other way to achieve this.

comment:2 by Anssi Kääriäinen, 10 years ago

Triage Stage: UnreviewedAccepted

comment:3 by alan.kesselmann@…, 10 years ago

I think there is no duplicate to this, as this is relatively new thing. #14310 is similar cause it requests per query collation overrides but the #9682, which is marked as #14310's duplicate is another thing completely.

In my test (or investigation) case i created two postgresql databases. One with default en_US.utf8 collation and one with POSIX collation. Since one table can hold data which is in several languages you (in this case i) have to be able to add collation as a part of query. Like this:

SELECT * FROM <tablename> ORDER BY <textcolum> COLLATE "<your_collation>", <integercolumn>
or
SELECT <textcolumn> COLLATE "<your_collation>" FROM <tablename> ORDER BY <textcolumn>

What this allows to achieve is that table of strings like:
id, name
1, älan
2, alan
3, õlan

will be sorted with query "SELECT * FROM table ORDEr BY name" to
2, alan
1, älan
3, õlan
with en_US.utf8 collation and to
2, alan
3, õlan
1, älan
with estonian alphabet based collation.

Your example, qs.alias(name_fi=Collate('name', 'fi')).order_by('name_fi') is the best i think (for my use-case anyway), as it pictures the core thing that i need - pair of column name ("name") and collation ("fi"). The Collate object seems like a good idea that might allow some more complex use cases too (http://www.postgresql.org/docs/9.1/static/collation.html lists several).

comment:4 by Anssi Kääriäinen, 10 years ago

I tried how hard this would be to implement using custom lookups. Turns out it was somewhat straightforward, see: https://github.com/akaariai/django/commit/76063881d79732cbb1942e9905fe4d51c0df09c1

The example works only on PostgreSQL 9.1 plus, and the test works only if collations fi_FI, en_GB and POSIX are available (I believe the last one is always there). Of course, adding in full multidb support would require more work so that the SQL would actually work on all databases.

comment:5 by alan.kesselmann@…, 10 years ago

Hmm i can see you have done loads of work in your custom lookups branch. Any idea which version those changes might end up in?

comment:6 by Anssi Kääriäinen, 9 years ago

Now that expressions are in, a Collate expression would be a very welcome addition to Django. It should work on all databases that support collations, not just on PostgreSQL.

comment:9 by Alan, 7 years ago

Owner: set to Alan
Status: newassigned

comment:10 by Simon Charette, 4 years ago

Now that order_by has proper expression support this is pretty straightforward to implement using an Expression of Func subclass.

I guess this ticket could be repurposed to allow a collation argument to be passed to OrderBy or introduce a backend agnostic Collate expression.

order_by(F('title').desc(collation='fr_FR')) or order_by(Collate('title', 'fr_FR').desc())

I think a Collate expression would make more sense since it could be used in lookups as well filter(title=Collate(Value(title), 'fr_FR')).

comment:11 by Tom Carrick, 4 years ago

I made a quick attempt at implementing Collate() as an expression. I've hit a problem though, as it seems like SQLIte doesn't allow you to use parameters with COLLATE:

In [34]: with connection.cursor() as cursor: 
    ...:     cursor.execute("select name from testapp_foo order by %s collate %s", ["name", "nocase"]) 
    ...:     row = cursor.fetchall()

...

OperationalError: near "?": syntax error

I also have problems on postgres (haven't tried with other backends):

In [16]: with connection.cursor() as cursor: 
    ...:     cursor.execute("select name from testapp_foo order by %s collate %s", ["name", "en_US"]) 
    ...:     row = cursor.fetchone()

...

ProgrammingError: syntax error at or near "'en_US'"
LINE 1: select name from testapp_foo order by 'name' collate 'en_US'

Not exactly sure what's going on here, but it looks like single quotes don't do it here. It works if I remove the placeholders and set it with "en_US" - with double quotes.

Maybe I'm misunderstanding something.

comment:12 by Tom Carrick, 4 years ago

Cc: Tom Carrick added

comment:13 by Alexandr Artemyev, 4 years ago

Cc: Alexandr Artemyev added

comment:14 by Simon Charette, 4 years ago

Cc: Simon Charette added

Looks like collation names are identifiers on these backends and need to be escaped using connection.ops.quote_name. Was MySQL also behaving this way?

comment:15 by Tom Carrick, 4 years ago

Owner: changed from Alan to Tom Carrick

comment:16 by Tom Carrick, 4 years ago

Has patch: set

comment:17 by Simon Charette, 4 years ago

Triage Stage: AcceptedReady for checkin

Went through a few rounds of reviews and patch now LGTM

comment:18 by Mariusz Felisiak, 4 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

It is subject to SQL injection.

Last edited 4 years ago by Mariusz Felisiak (previous) (diff)

comment:19 by Simon Charette, 4 years ago

Yes, quote_name cannot protect against SQL injections. It shouldn't be an issue just like Func(function) also allows injections? As long as collation is not under user control it should not be an issue.

Since collation names are identifiers and cannot be provided as string literals (see comment:11) I don't see bullet proof way around that. Isn't this something we could explicitly document?

Last edited 4 years ago by Simon Charette (previous) (diff)

comment:20 by Mariusz Felisiak, 4 years ago

So maybe, we can add a warning in docs, e.g.

    To protect against `SQL injection attacks
    <https://en.wikipedia.org/wiki/SQL_injection>`_, ``collation`` shouldn't be
    a user-controlled parameter because collation names are database
    identifiers and cannot be properly escape.

comment:21 by Mariusz Felisiak, 4 years ago

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

I added warning to docs.

comment:22 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

Resolution: fixed
Status: assignedclosed

In 63300f7e:

Fixed #21181 -- Added Collate database function.

Thanks Simon Charette for reviews.

comment:23 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

In 278b6187:

Refs #21181 -- Corrected DatabaseFeatures.test_collations for Swedish collation.

Previously, backends used different keys "swedish-ci" or "swedish_ci".

Note: See TracTickets for help on using tickets.
Back to Top