Opened 7 years ago

Closed 7 years ago

Last modified 4 years ago

#8102 closed (fixed)

get_or_create is unable to handle unicode characters

Reported by: davenaff Owned by: mtredinnick
Component: Database layer (models, ORM) Version: master
Severity: Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: UI/UX:

Description (last modified by mtredinnick)

models.py:

class Article(models.Model):
    headline = models.CharField(max_length=100)

Shell session:
>>> from mysite.models import Article
>>> string = u'\xae'
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article object>, True)
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article object>, True)

This seems to have been introduced recently (I'm on r8125); I only noticed the bug after a recent update.

This is a pretty major bug. I ran out of time tonight to dig into it, but I wanted to at least raise the issue now so that others can look at it.

This is going to create serious backwards-incompatibility issues in people's databases if they are running with this bug.
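For reference, the contract the reporter expects can be sketched in plain Python. This is a toy in-memory stand-in, not Django's actual implementation: a second call with identical kwargs must return the existing object with created=False.

```python
# Toy in-memory sketch of the get_or_create contract (not Django's
# actual implementation): identical kwargs on a second call must
# return the existing object and created=False, never a duplicate.
def get_or_create(store, **kwargs):
    key = tuple(sorted(kwargs.items()))
    if key in store:
        return store[key], False  # found: reuse the existing object
    obj = dict(kwargs)
    store[key] = obj
    return obj, True  # not found: create it

store = {}
a, created1 = get_or_create(store, headline=u'\xae')
b, created2 = get_or_create(store, headline=u'\xae')
# The bug reported above: on MySQL the second call also returned True.
```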

Attachments (4)

8102.diff (1017 bytes) - added by kmtracey 7 years ago.
Use COLLATE UTF8_BIN not BINARY to force case-sensitive comparisons on MySQL
8102-2.diff (1.0 KB) - added by kmtracey 7 years ago.
Cast exact lookups to CHAR before specifying COLLATE to avoid problems with specifying collate on numeric/binary columns
nopatch_failures.txt (8.0 KB) - added by kmtracey 7 years ago.
For reference, test failures without the patch
patch_failures.txt (5.6 KB) - added by kmtracey 7 years ago.
For reference, test failures with the patch


Change History (28)

comment:1 Changed 7 years ago by kmtracey

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Resolution set to worksforme
  • Status changed from new to closed

I cannot recreate this on r8204 with either MySQL or sqlite backends. In both cases, using the model you show (except with a __unicode__ method added that prints the pk), I get:

>>> from indent.models import Article
>>> string = u'\xae'
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article pk=1>, True)
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article pk=1>, False)
>>>

So I see it created with the first call and not with the second. What DB are you using?

comment:2 Changed 7 years ago by davenaff

MySQL.

I need to invest some time to deal with the latest BackwardsIncompatibleChanges; then I'll try updating and try again to recreate the problem.

comment:3 Changed 7 years ago by davenaff

  • Resolution worksforme deleted
  • Status changed from closed to reopened

I updated to r8207. The bug still exists.

I've reproduced this on:
Windows, Python 2.5.2, MySQL
Linux, Python 2.5.2, MySQL

comment:4 Changed 7 years ago by julianb

  • Component changed from Core framework to Database wrapper
  • milestone set to 1.0

Looks like get_or_create is pretty screwed up... #7402 #7789

comment:5 Changed 7 years ago by julianb

Cannot reproduce it with Python 2.5.2, sqlite on Windows.

comment:6 follow-up: Changed 7 years ago by kmtracey

OK, I've been able to recreate this if I change my table charset to latin1. I had been using a default of utf8, and with that I could not recreate it. So the key to hitting the error seems to be (I'm guessing) having a non-utf8 charset for the column/table.

comment:7 Changed 7 years ago by davenaff

Yup, I've got a latin1 charset default on mysql.

Changed 7 years ago by kmtracey

Use COLLATE UTF8_BIN not BINARY to force case-sensitive comparisons on MySQL

comment:8 Changed 7 years ago by kmtracey

  • milestone changed from 1.0 to 1.0 beta
  • Triage Stage changed from Unreviewed to Accepted

Attached a patch that changes case-sensitive comparisons from using BINARY to COLLATE UTF8_BIN. That way MySQL does not skip the charset conversion between the table charset and utf8 (which Django always provides). I tested it with a latin1 table and the provided model. Before the patch I observed the failure described; after the patch, case-sensitive comparisons correctly retrieve the matched row:

>>> from mysqlci.models import Article
>>> string = u'\xea'
>>> a1, created1 = Article.objects.get_or_create(headline = string)
>>> a1
<Article: Article pk=4, headline=ê>
>>> created1
True
>>> a2, created2 = Article.objects.get_or_create(headline = string)
>>> a2
<Article: Article pk=4, headline=ê>
>>> created2
False
>>> Article.objects.filter(headline__contains = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__startswith = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__endswith = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__regex = string)
[<Article: Article pk=4, headline=ê>]
>>> 

Upped the milestone to beta, since I do think this is a pretty serious error that should get addressed ASAP; core devs may of course override that.

comment:9 Changed 7 years ago by kmtracey

  • Patch needs improvement set

The patch fixes problems for (VAR)CHAR fields but introduces problems for non-string types, so it is no good. You can't specify COLLATE for, say, an integer field. Still investigating how to fix this.

comment:10 follow-up: Changed 7 years ago by kmtracey

OK, I've got an alternative version of the patch that, for iexact matches, casts what's being compared to a CHAR. That avoids problems with specifying COLLATE for a column type that doesn't support it, but I'm unsure of the performance implications. I did not include the cast for the other lookup types, since it seems they only make sense when dealing with string types? But I could be missing something there.

I ran the test suite using a default latin1 charset (which means some testcases will fail, because some tests use data that is not representable in latin1) both with the patch and without. Differences observed:

First, one more failure without the patch than with. That is, the patch fixes one testcase (in regressiontests/string_lookup/models.py) that was known to fail with a non-UTF8 database for MySQL. Specifically these lookups:

# Regression tests for #3937: make sure we can use unicode characters in
# queries.
# BUG: These tests fail on MySQL, but it's a problem with the test setup. A
# properly configured UTF-8 database can handle this.

>>> fx = Foo(name='Bjorn', friend=u'François')
>>> fx.save()
>>> Foo.objects.get(friend__contains=u'\xe7')
<Foo: Foo Bjorn>

# We can also do the above query using UTF-8 strings.
>>> Foo.objects.get(friend__contains='\xc3\xa7')
<Foo: Foo Bjorn>

succeed with the patch.

The other failures (the same both with and without the patch) relate to attempting to store data that can't be represented in latin1, and to the fact that True/False on MySQL comes back as 1/0. (I ran with MyISAM as the storage engine in order to avoid other known(?) problems with InnoDB and serializers.)

Second, the tests took 3% longer (28 min. vs. 27) to run WITHOUT the patch than with it. So there does not appear to be a huge performance hit, though any difference could be swamped by the other overhead of running the tests.

I'll attach the updated patch but I'm not at all sure this is the right way to fix this problem. Someone with more than half a clue on mapping Django lookups to SQL might have a better approach.
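As a rough illustration of the shape this alternative patch aims for (the helper name here is illustrative; see 8102-2.diff for the real change): the supplied value is cast to CHAR so that COLLATE is legal even when the column is numeric, and the binary utf8 collation forces the case-sensitive match.

```python
# Illustrative sketch of the WHERE-fragment shape the alternative patch
# produces (helper name is hypothetical; 8102-2.diff is the real change).
def exact_lookup_sql(column, placeholder='%s'):
    # Cast the compared value to CHAR so COLLATE is legal even for
    # numeric columns, then force the case-sensitive binary collation.
    return column + ' = CAST(' + placeholder + ' AS CHAR) COLLATE utf8_bin'

sql = exact_lookup_sql('`headline`')
# -> "`headline` = CAST(%s AS CHAR) COLLATE utf8_bin"
```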

Changed 7 years ago by kmtracey

Cast exact lookups to CHAR before specifying COLLATE to avoid problems with specifying collate on numeric/binary columns

Changed 7 years ago by kmtracey

For reference, test failures without the patch

Changed 7 years ago by kmtracey

For reference, test failures with the patch

comment:11 in reply to: ↑ 10 Changed 7 years ago by kmtracey

Replying to kmtracey:

OK, I've got an alternative version of the patch that for iexact matches...

Typo, I meant EXACT matches there.

comment:12 in reply to: ↑ 6 Changed 7 years ago by julianb

Uhm, do we have to care about non-utf8 databases?

"Make sure your database is configured to be able to store arbitrary string data. Normally, this means giving it an encoding of UTF-8 or UTF-16."

http://www.djangoproject.com/documentation/unicode/

comment:13 Changed 7 years ago by anonymous

latin1 is the default MySQL encoding, so yes I think it is important to support it. The current Django docs come nowhere close to saying that a database encoding of utf-8 is required for proper interaction with Django. The very next sentence of what you quoted says:

"If you use a more restrictive encoding — for example, latin1 (iso8859-1) — you won’t be able to store certain characters in the database, and information will be lost."

That says if you don't use utf-8 you won't be able to store certain data, not that Django will exhibit crippling bugs. It makes perfect sense that if I choose a latin1 encoding for my db I won't be able to store, say, Japanese characters in it. These characters simply cannot be represented in latin1 encoding.

It does not make sense that I cannot successfully do an exact lookup on, say, little e with circumflex. It's a perfectly legitimate character in both latin1 and Unicode, it's just got a different binary representation in latin1 vs. utf8.
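The different binary representations can be seen directly in Python, independent of Django or MySQL:

```python
# 'e with circumflex' (U+00EA) is valid in both latin1 and utf8, but its
# byte representation differs, so a BINARY byte-level comparison between a
# latin1-stored column and a utf8-encoded query value can never match.
s = u'\xea'
latin1_bytes = s.encode('latin-1')  # one byte:  0xEA
utf8_bytes = s.encode('utf-8')      # two bytes: 0xC3 0xAA
assert latin1_bytes != utf8_bytes
```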

comment:14 Changed 7 years ago by Karen Tracey <kmtracey@…>

(That was me.)

comment:15 Changed 7 years ago by mtredinnick

  • Description modified (diff)

Unbreak the description so that my eyes stop watering.

I'm not 100% certain this is a real bug. If you're using latin1 as your database encoding, then Django cannot be expected to adjust for that, because it should not have to introspect the server/client differences. This could well be a case of "Doctor, it hurts when I do this...".

Still, there's a lot of other good collate-related material in here, so I'm leaving it open because there are things here we need to apply.

comment:16 follow-up: Changed 7 years ago by Karen Tracey <kmtracey@…>

Sigh. If you do decide not to fix this, I believe the documentation needs to be much more strongly worded, requiring rather than just recommending utf8 encoding on the MySQL side. Plus the table-creation SQL generated for MySQL should ideally specify utf8 encoding so that the user doesn't have to go fix it themselves after the fact or change their database installation default.

Up until this BINARY change, MySQL transparently handled on its side the conversions required between the utf8 spoken on the Django connection and whatever encoding the DB/table/column is actually stored in. The problem now is that Django is specifying in its WHERE clause a modifier that tells MySQL to treat the supplied data as a binary sequence, not a character string, so MySQL does not do the encoding conversion.

The 'proper' way to tell MySQL to do a case-sensitive match is to specify the binary collation for the encoding (the collation specified must be one that is valid for the connection encoding, so Django can always specify the binary utf8 collation because Django always speaks utf8 on the connection, and no introspection of the DB is required to determine what encoding the DB is actually using.) The trouble there is that COLLATE is not valid to be specified for non-character data, so a blind mapping of 'exact' to '%s COLLATE UTF8_BIN' fails spectacularly for numeric primary key lookups, for example.

If there was some way to specify different lookup mappings based on the data type involved then the mapping could specify COLLATE only for character types, but there doesn't seem to be any way to do that in the current Django database backend spec. I'm not very familiar with the db backend stuff but in looking through it I noticed there are things like field_cast_sql and lookup_cast that allow the db backend to tweak the sql a bit. This looked promising, but what is needed for MySQL is something that gets handed both the db_type (as in field_cast_sql) and the lookup_type (as in lookup_cast). A method that got handed those two pieces of information, could, I think, correctly decide whether a COLLATE was necessary. I don't see that any such fixup routine exists, though.

Lacking such a fixup routine, always doing a cast to CHAR and specifying the binary collation for 'exact' lookups seems to work, based on the test-suite results. I haven't found a case where casting to CHAR causes an error, nor a case where one of the other lookups is used on a non-character datatype. But I'll admit I'm still not entirely comfortable with it, so I was hoping there was a better way.
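A sketch of the kind of hook described above. This is purely hypothetical — no such method exists in Django's backend spec at this point — but a routine handed both the db_type and the lookup_type could emit COLLATE only for character columns:

```python
# Hypothetical backend hook combining field_cast_sql-style (db_type) and
# lookup_cast-style (lookup_type) information; nothing like this exists
# in Django at the time of this ticket.
CHAR_TYPES = ('char', 'varchar', 'text', 'longtext')

def collate_fragment(db_type, lookup_type, placeholder='%s'):
    if lookup_type == 'exact' and db_type and db_type.lower().startswith(CHAR_TYPES):
        # Character column: safe to force the binary utf8 collation.
        return placeholder + ' COLLATE utf8_bin'
    # Numeric/binary column, or a non-exact lookup: leave it alone.
    return placeholder

# A numeric pk lookup stays untouched; a varchar exact lookup gets COLLATE.
```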

BTW I'm guessing this is the problem referred to in this thread:

http://groups.google.com/group/django-developers/browse_thread/thread/4b9a37a577c87a71/58f2365019c54699?

which so far as I noticed was never followed up on.

comment:17 in reply to: ↑ 16 Changed 7 years ago by julianb

Replying to Karen Tracey <kmtracey@gmail.com>:

the table-creation SQL generated for MySQL should ideally specify utf8 encoding so that the user doesn't have to go fix it themselves after the fact or change their database installation default

That's what I thought about, too. Is there a ticket for that? Why isn't Django doing it?

comment:18 Changed 7 years ago by mtredinnick

  • Owner changed from nobody to mtredinnick
  • Status changed from reopened to new

comment:19 Changed 7 years ago by mtredinnick

(In [8319]) Changed "exact" matches in MySQL to use the database's native collation.

This effectively reverses the change in [7798]. It was proving too difficult to
successfully manage all the side effects here and provide a satisfactory
solution for everybody. Many thanks to arne, Martin von Löwis and, in particular,
Karen Tracey, for doing a lot of research and proto-patches here to establish what was possible and practical.

This is backwards incompatible if you were relying on the behaviour after
[7798]. The docs have been updated to indicate the solution.

Refs #2170, #7789, #8102.

comment:20 Changed 7 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from new to closed

This has been fixed by the reversion of the changes from [7798].

comment:21 Changed 7 years ago by davenaff

It still strikes me that specifying utf-8 encoding when syncdb creates tables is an inherently good idea. Should a new ticket be created for this?

comment:22 Changed 7 years ago by mtredinnick

No, because that's not Django's decision. It's up to the database creator. UTF-8 is not always the right encoding.

comment:23 Changed 7 years ago by kmtracey

Just a followup in case anyone ever considers trying to actually support both case-sensitive & case-insensitive lookups on Django/MySQL by specifying COLLATE in the SQL generated by Django:

I experimented with this a bit, and there seem to be nasty performance implications to specifying a collation that differs from a column's default collation. My DB is stored with the latin1 charset and the default case-insensitive collation. I have a view that issues quite a few queries (hundreds), which with current Django code takes maybe a couple of seconds to complete. Changing the Django code to specify one of the UTF collations caused that view to start taking a minute or so. The problem seems to be that if you explicitly specify a collation that does not match the column's default, MySQL cannot use any indexes (indices?) it has for that column. Here's the output from EXPLAIN on Django's current SQL for the kinds of queries used in this view:

mysql> explain  SELECT COUNT(*) FROM `Clues` INNER JOIN `Entries` ON (`Clues`.`Entry ID` = `Entries`.`Entry ID`) WHERE (NOT (`Clues`.`Derived` = 1 ) AND `Entries`.`Entry` = 'ERIC'  AND `Clues`.`Clue` = 'Idle Monty Python member?' );
+----+-------------+---------+-------+-----------------+------------+---------+-------+------+-------------+
| id | select_type | table   | type  | possible_keys   | key        | key_len | ref   | rows | Extra       |
+----+-------------+---------+-------+-----------------+------------+---------+-------+------+-------------+
|  1 | SIMPLE      | Entries | const | PRIMARY,Entry   | Entry      | 52      | const |    1 |             | 
|  1 | SIMPLE      | Clues   | ref   | EntryIndex,Clue | EntryIndex | 4       | const |  208 | Using where | 
+----+-------------+---------+-------+-----------------+------------+---------+-------+------+-------------+
2 rows in set (0.00 sec)

MySQL uses the Entry index to speed up the query. But if Django tries to specify a collation:

mysql> explain SELECT COUNT(*) FROM `Clues` INNER JOIN `Entries` ON (`Clues`.`Entry ID` = `Entries`.`Entry ID`) WHERE (NOT (`Clues`.`Derived` = 1  ) AND `Entries`.`Entry` = 'ERIC' COLLATE UTF8_BIN  AND `Clues`.`Clue` = 'Idle Monty Python member?' COLLATE UTF8_BIN );
+----+-------------+---------+------+---------------+------------+---------+------------------------------+--------+-------------+
| id | select_type | table   | type | possible_keys | key        | key_len | ref                          | rows   | Extra       |
+----+-------------+---------+------+---------------+------------+---------+------------------------------+--------+-------------+
|  1 | SIMPLE      | Entries | ALL  | PRIMARY       | NULL       | NULL    | NULL                         | 138731 | Using where | 
|  1 | SIMPLE      | Clues   | ref  | EntryIndex    | EntryIndex | 4       | crosswordDB.Entries.Entry ID |      7 | Using where | 
+----+-------------+---------+------+---------------+------------+---------+------------------------------+--------+-------------+
2 rows in set (0.00 sec)

The Entry index is not eligible: key=NULL and rows=138731 on the first line are bad, and the resulting query takes about 25 times longer than the non-COLLATE version. (It doesn't matter whether the collation specified is BIN or case-insensitive; the fact that it is for utf8 instead of latin1 seems to prevent use of the index.)

So while specifying COLLATE might be a way to get correct case-sensitive + case-insensitive results out of the Django/MySQL combo, it looks to have a pretty serious performance impact for anyone not using a utf8 encoding in their database. Not sure it would be worth it unless someone knows a way to fix that performance hit. [I don't know that anyone (besides maybe me, just out of curiosity to see if I could get it to work) was going to pursue that idea at all, but just in case anyone does, I figured I'd mention this while it was fresh in my mind.]

comment:24 Changed 4 years ago by jacob

  • milestone 1.0 beta deleted

Milestone 1.0 beta deleted
