Opened 7 years ago

Closed 7 years ago

Last modified 4 years ago

#7789 closed (fixed)

get_or_create fails due to case sensivity

Reported by: Julian Bez Owned by: mtredinnick
Component: Database layer (models, ORM) Version: master
Severity: Keywords: get_or_create iexact exact
Cc: elsdoerfer@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

>>> name, created = Name.objects.get_or_create(name="Julian")
>>> created
True
>>> name, created = Name.objects.get_or_create(name="julian")
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/django/db/models/manager.py", line 85, in get_or_create
    return self.get_query_set().get_or_create(**kwargs)
  File "/usr/lib/python2.4/site-packages/django/db/models/query.py", line 335, in get_or_create
    return self.get(**kwargs), False
  File "/usr/lib/python2.4/site-packages/django/db/models/query.py", line 302, in get
    raise self.model.DoesNotExist("%s matching query does not exist."
DoesNotExist: Name matching query does not exist.

>>> name, created = Name.objects.get_or_create(name__iexact="julian")
>>> created
False

In my opinion get_or_create should get the name if it is there, or create a new one, not fail.
May be related to [7798].

Attachments (7)

django-mysql-case-sensitive-unique.patch (590 bytes) - added by David Danier <goliath.mailinglist@…> 7 years ago.
Patch adding BINARY to VARCHAR() to fix case insensitive match in get_or_create() on MySQL
7789_tests.diff (1.1 KB) - added by arne 7 years ago.
Regressiontests
7789_fields.diff (1.5 KB) - added by arne 7 years ago.
Proposed fix for the problem
7789_docs.diff (646 bytes) - added by arne 7 years ago.
Updated docs
7789_fields_2.diff (2.2 KB) - added by arne 7 years ago.
Alternate aproach
7789_fields_3.diff (3.6 KB) - added by arne 7 years ago.
Similar to 7789_fields_2.diff, but this time returning unicode strings from the binary fields.
7789_tests_3.diff (1.5 KB) - added by arne 7 years ago.
Some more tests for 7789_fields_3.diff

Download all attachments as: .zip

Change History (34)

comment:1 Changed 7 years ago by Julian Bez

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

More info:
name is unique=True and MySQL is case insensitive for most collations.

Changed 7 years ago by David Danier <goliath.mailinglist@…>

Patch adding BINARY to VARCHAR() to fix case insensitive match in get_or_create() on MySQL

comment:2 Changed 7 years ago by nullie

Well, it's more complex problem.

Real issue is: unique index is not case-sensitive, while exact query is. So, get_or_create doesn't get matching object (because query is case-sensitive), tries to create new object and gets integrity error. It treats as concurrency problem, due to http://code.djangoproject.com/changeset/7289, and tries to get matching object again, and again it fails.

It's related to http://code.djangoproject.com/ticket/7402 (http://code.djangoproject.com/changeset/7289 opened a can of worms)

comment:3 follow-up: Changed 7 years ago by nullie

Making unique index for mysql case-sensitive will solve this.

comment:4 in reply to: ↑ 3 Changed 7 years ago by julianb

  • milestone set to 1.0 beta
  • Needs tests set
  • Triage Stage changed from Unreviewed to Accepted

Replying to nullie:

Making unique index for mysql case-sensitive will solve this.

How does one do that? If it's possible, shouldn't Django be doing it by default?

And what if I want get_or_create to be case insensitive?

comment:5 Changed 7 years ago by miracle2k

  • Cc elsdoerfer@… added

comment:6 Changed 7 years ago by julianb

From what I read on the web, there is not way to make a MySQL column utf8, case sensitive and with a real collation for sorting (not binary).

That leaves just one solution: get_or_create has to make a case insensitive search on columns, or at least on those who are marked unique.

If the item is already there we will find it for sure only by using iexact, otherwise it could be there and we get an error, because Unique index cannot be made case sensitive in MySQL.

Changed 7 years ago by arne

Regressiontests

Changed 7 years ago by arne

Proposed fix for the problem

Changed 7 years ago by arne

Updated docs

comment:7 Changed 7 years ago by arne

  • Needs tests unset

In the three attached files, we (me and Martin v. Löwis) take a slightly different approach: Changing all CharFields to BINARY might be too intrusive (as julianb points out): it will break true collation. Given that there is no real solution, we propose that BINARY is only applied to CharFields which are also unique. To achieve that, we change the MySQL DATA_TYPES template to allow an optional %(binary)s, which is set on unique CharFields.

In the process, we also found that SlugFields need to get the same treatment, only that it is reasonable to make them BINARY unconditionally (as collation is irrelevant for them).

Also included are the necessary doc and test changes.

Changed 7 years ago by arne

Alternate aproach

comment:8 Changed 7 years ago by arne

In 7789_fields_2.diff, the dependency on MySQL is taken out of CharField, and a db backend hook is provided. The backend can now optionally provide a creation.backend_parameters function, which can fill in additional parameters for substitution into DATA_TYPES. This hook receives the Field and the dict, and must return a new dict if it performs any modification, or return the dict passed in otherwise.

As a design consideration, a choice is to make this backend_parameters mandatory for backend modules. The current patch makes it optional by catching the AttributeError when calling the function.

comment:9 Changed 7 years ago by kmtracey

Pointer to related thread on django-dev: http://groups.google.com/group/django-developers/browse_thread/thread/6b29d9ff816eb761#

Any solution that involves specifying a binary collation for any character-type fields will, I believe, need to address the side-effect that mysqldb will start returning those field values as bytestings instead of unicode objects. Django apps expect their character-type fields to be returned from the DB as unicode objects. Likely effects of having them be bytestrings instead include their __unicode__() methods suddenly generating UnicodeDecodeErrors and I don't know what else. I don't think it is reasonable to expect the app writer to take into account whether unique was specified (and whether MySQL is the DB) for a field to know whether the field value will be a bytstring or unicode. Therefore if this approach is pursued I believe the mysql backend will need to also take on the chore of ensuring that values for these columns get properly converted to unicode objects for the app level code. I don't even know if that is possible, but then I don't know much about database backends.

comment:10 Changed 7 years ago by julianb

Seems like messing with binary just has too many side effects...

can't we rewrite get_or_create to be case insensitive for unique columns on MySQL? Or would that tackle the problem at the wrong end?

I mean, sure, if you want to add values that differ in case, that will not be possible then. But on the other hand, you actively chose MySQL and from my point of view it seems that it's just not supported by MySQL that way.

To illustrate: You have an unique column, you do an exact lookup, you see that an entry is not there. You choose to add it and you get an integrity error, because an entry that is the same but differs in case is there. That's basically your fault for doing the wrong lookup, you should've done an iexact lookup - because you use MySQL.

All things that try to get MySQL to be case sensitive there seem like some hacks with bad flavor...

comment:11 Changed 7 years ago by mtredinnick

  • Owner changed from nobody to mtredinnick

Okay, I think we might be getting towards a solution here (when combined with #8102) that isn't just symptom patching. I want to think about Arne and Martins' patch for a bit to see if we can also fix the unicode issue. That would be the ideal solution.

I'm going to assign this to me so that it pops up on the list I look at every day of things I need to get to a conclusion.

comment:12 Changed 7 years ago by julianb

Great to see we get forward with this one. Btw, you could combine it with #7402, too.

comment:13 Changed 7 years ago by arne

In 7789_fields_3.diff a conversion is added to the mysql specific django_conversions dict which decodes strings returned from binary fields into unicode objects.

Changed 7 years ago by arne

Similar to 7789_fields_2.diff, but this time returning unicode strings from the binary fields.

Changed 7 years ago by arne

Some more tests for 7789_fields_3.diff

comment:14 Changed 7 years ago by julianb

Just something I read about the case-sensivity topic:

http://www.venutip.com/content/case-sensitivity-mysql-do-you-really-need-it

Be sure to read the last paragraphs.

comment:15 Changed 7 years ago by Karen Tracey <kmtracey@…>

I'm a lot more comfortable with the proposal for limited use of BINARY VARCHAR columns now that the issue with MySQLdb returning bytestrings for such columns is addressed. Actually I believe that fix should go in regardless of what is decided for this ticket. Just because Django doesn't currently specify binary collations in any circumstances doesn't mean user's haven't set their databases up that way, and the current way these fields are returned as bytestrings is wrong, I think. (In fact there are a number of tickets in the tracker referring to the fact that specifying binary collations on MySQL when using Django is a Bad Idea, though most reference an earlier bug in MySQLdb which caused the Django admin app to fail spectacularly in such configurations.)

As for this ticket, I'm ambivalent. Part of me thinks callers who are specifying exact matches on a get_or_create() when the underlying DB column has a case-insensitive collation are just wrong, they ought to be using iexact lookups (though of course with 3rd-party apps the writer may have had no idea it would eventually be running on MySQL with its default case-insensitive behavior). And I think I agree with that posting referenced above that in fact in most cases what people really want is case-insensitive 'uniqueness'. (The bug of MySQL 'exact' lookups returning case-insensitive results existed in the code for years with remarkably little complaint.)

However, I can understand that there are cases where real case-sensitive uniqueness is needed. (Pity we can't specify unique=True/False/Sorta ... no that's not a serious suggestion.) To support genuine uniqueness on MySQL it seems we have to go the route of specifying binary on the column. For people who don't really want that they can always manually remove the binary attribute from their columns and be sure to use iexact lookups instead of exact.

If we do go the route of sometimes specifying a BINARY column I think we will also need to fix the iexact, etc. lookup mappings to explicitly specify a case-insensitive collation otherwise the insensitive lookups will be broken for these columns. Django currently assumes the default collation is always case-insensitive, and this change breaks that assumption. (Messing with the mappings ties in to fixing #8102.)

Finally I'll note (and this is also not a serious proposal for how to fix this now) that MySQL does supply an atomic 'get_or_create' type statement.

Here you see the problem we currently have with a get_or_create caller specifying an exact match for a column that has case-insensitive collation:

mysql> select * from Authors where Author='frances hansen' collate utf8_bin;
Empty set (0.00 sec)

mysql> insert into Authors (Author) values ('frances hansen');
ERROR 1062 (23000): Duplicate entry 'frances hansen' for key 2

We could avoid this by getting MySQL itself to detect the duplicate key:

mysql> insert into Authors (Author) Values ('frances hansen') on duplicate key update `Author ID`=last_insert_id(`Author ID`);
Query OK, 0 rows affected (0.00 sec)

mysql> select * from Authors where `Author ID`=last_insert_id();
+-----------+----------------+-----------+-----------+
| Author ID | Author         | Pseudonym | Notes     |
+-----------+----------------+-----------+-----------+
|       110 | Frances Hansen | No        | 1919-2004 |
+-----------+----------------+-----------+-----------+
1 row in set (0.00 sec)

But I realize this is highly MySQL-specific and not appropriate for the level at which get_or_create() is coded. If, however, you were ever to consider an approach where the get_or_create work would be pushed down to the database backend level, if possible, and only implemented as separate get and create steps if the backends didn't provide such support, this idea might be useful. It might also go some ways to fixing #7402. (What's above isn't universally applicable since the last_insert_id is only valid for auto_increment columns so if you've manually specified the primary key as something else you'd have to use some other technique to locate the either-inserted-or-preexisting-row...but since this isn't a serious proposal I'm not going to attempt to solve that issue, nor other niggling things like whether the entry field value(s) should be updated to match the exact values specified by the get_or_create caller or left as they currently are in the DB. I expect everyone has quit reading long before now anyway.)

The two things I really wanted to say here:

  1. The fix to return unicode from the backend even when the columns have a binary modifier is good, should go in independent of what is done with this ticket, I think.
  1. If this change goes forward and Django starts sometimes specifying binary for some columns, the inexact lookups will be broken unless some change is made to the lookup mappings.

comment:16 Changed 7 years ago by julianb

Does everyone here know that specifying BINARY will screw up order by? That will be the next ticket if we take that path.

comment:17 Changed 7 years ago by loewis

In what does it mess up order by?

comment:18 Changed 7 years ago by Karen Tracey <kmtracey@…>

Binary collation will result in a binary sort order, yes, which is not generally what people expect. Case-insensitive ordering gives a 'dictionary' sort:

mysql> select name from testfail_author order by name collate latin1_swedish_ci;
+------+
| name |
+------+
| a1   |
| A1   |
| a4   |
| z1   |
| Z1   |
+------+
5 rows in set (0.00 sec)

whereas binary collation places, for example, all capital letters before lower case letters (for latin1):

mysql> select name from testfail_author order by name collate latin1_bin;
+------+
| name |
+------+
| A1   |
| Z1   |
| a1   |
| a4   |
| z1   |
+------+
5 rows in set (0.00 sec)

This was mentioned in the dev thread I pointed to above. 'Fixing' it in Django is not so straightforward. First, how is Django to know which sort order the user is looking for? To date it just assumes the database default sort order so there is no mechanism for a Django app to influence sort order selection (SFAIK).

Second and probably more of a show stopper, for 'order by', unlike '=' and 'like' matches, you need to specify a collation that is valid for the column charset. For the lookups you must specify a collation that matches the connection charset, which means Django can always use a utf8 collation regardless of the actual encoding used by the DB to store the table data. For 'order by', if I set names to utf8 on the connection and try to specify a utf8 collation for 'order by' where the column is using latin1 encoding MySQL complains:

ERROR 1253 (42000): COLLATION 'utf8_bin' is not valid for CHARACTER SET 'latin1'

(That's not surprising if you think about what you'd have to do to implement ordering by an arbitrary collation.)

So, at least short term, and just my opinion but I don't see Django providing any way to get a case-insensitive ordering on a column with binary collation. You can't always get (all of) what you want...

comment:19 follow-up: Changed 7 years ago by loewis

So it sorts using the binary collation - why does this "mess up" sorting? It seems to be completely conforming to the Django specification, which says

"There’s no way to specify whether ordering should be case sensitive. With respect to case-sensitivity, Django will order results however your database backend normally orders them."

So I can't see why binary would "mess up" anything.

I think the notion of case-insensitivity in a database, or perhaps even the notion of lexicographical ordering is severely flawed. If sorting according to some natural language rules is important to the application, then the application should do the sorting itself, rather than having the database do it. If Django wants to fully support such programming, it should integrate a locale database.

comment:20 in reply to: ↑ 19 Changed 7 years ago by julianb

"Django will order results however your database backend normally orders them."

Okay, suppose for my database I want the ordering to be case sensitive. Then we are at the point that Django can't expect or set BINARY there. Django messes up my sorting or my database messes up Django? Come on.

comment:21 follow-up: Changed 7 years ago by loewis

julianb: I still don't understand, I guess. You want the ordering case sensitive, and Django (with this patch) sets the ordering to binary, making the ordering case sensitive - as you wanted it. What precisely is messed up (either by the database or by Django)?

comment:22 in reply to: ↑ 21 Changed 7 years ago by julianb

Oh, I got that wrong, I want case insensitive ordering of course. But it really doesn't matter what I want, cause you said users could choose on their own, and that conflicts with the usage of binary which would break things.

comment:23 Changed 7 years ago by Karen Tracey <kmtracey@…>

I agree this change does not in fact "mess up" sorting. Fact is Django has always left determination of sort order up to the DB, this change doesn't alter that. It does alter the sort order that MySQL will use for character fields defined as unique, and slug fields, created after this change. That might be surprising to MySQL users who are used to always getting the default case-insensitive sort order, so it might be perceived as "messing up" sort order for these fields. For MySQL, that's the price of getting a truly case-sensitive unique field.

The result is apparently also not that different from the situation with PostgreSQL, see for example: http://scottbarnham.com/blog/2007/11/20/case-insensitive-ordering-with-django-and-postgresql/. So where in the past MySQL users didn't have to go to any trouble to get a field with 'natural' sort order, they now might have to use some workaround/trick as described there. SFAIK the workaround listed there should work when you've to MySQL as the DB, though I haven't tried it. So it's apparently doable to get a case-insensitive sort on a case-sensitive unique field, just a bit of work. Without this change it isn't possible to have a case-sensitive unique field at all, so I don't see that having to do some work to get case-insensitive ordering on it when it's actually possible is that much of a drawback. If you really don't want it to be case-sensitive unique remove the binary attribute from the column and always use iexact matches in your get_and_create calls and you'll get your default case-insensitive ordering back.

comment:24 Changed 7 years ago by mtredinnick

(In [8318]) Convert binary-matched VARCHAR fields to unicode objects in the MySQL backend.
This conforms to Djangos' policy of returning Unicode everywhere.

Suggested by arne and Martin von Löwis. Refs #7789.

comment:25 Changed 7 years ago by mtredinnick

(In [8319]) Changed "exact" matches in MySQL to use the database's native collation.

This effectively reverses the change in [7798]. It was proving too difficult to
successfully manage all the side effects here and provide a satisfactory
solution for everybody. Many thanks to arne, Martin von Löwis and, particular,
Karen Tracey, for doing a lot of research and proto-patches here to establish what was possible and practical.

This is backwards incompatible if you were relying on the behaviour after
[7798]. The docs have been updated to indicate the solution.

Refs #2170, #7789, #8102.

comment:26 Changed 7 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from new to closed

The above commit now makes this a ticket a moot point (although we took part of one patch as a good idea).

comment:27 Changed 4 years ago by jacob

  • milestone 1.0 beta deleted

Milestone 1.0 beta deleted

Note: See TracTickets for help on using tickets.
Back to Top