#8102 closed (fixed)
get_or_create is unable to handle unicode characters
Reported by: | Dave Naffziger | Owned by: | Malcolm Tredinnick |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | | Keywords: | |
Cc: | | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Models.py:

class Article(models.Model):
    headline = models.CharField(max_length=100)

Shell session:

>>> from mysite.models import Article
>>> string = u'\xae'
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article object>, True)
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article object>, True)
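For reference, get_or_create returns an (object, created) tuple, and the second call should find the existing row rather than create a new one. A correctly behaving session (as later shown in comment:8) looks like:

>>> a, created = Article.objects.get_or_create(headline = string)
>>> created        # first call inserts the row
True
>>> a, created = Article.objects.get_or_create(headline = string)
>>> created        # second call should match the existing row
False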
This seems to have been introduced recently (I'm on r8125); I only noticed the bug after a recent update.
This is a pretty major bug and I ran out of time tonight to dig into it, but I wanted to at least raise the issue now so that others can look at it.
This is going to create serious backwards incompatibility issues in people's databases if they are running with this bug.
Attachments (4)
Change History (28)
comment:1 by , 16 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
I cannot recreate this on r8204 with either the MySQL or SQLite backends. In both cases, using the model you show (except with a __unicode__ method added that prints the pk), I see the object created with the first call and not with the second. What DB are you using?
comment:2 by , 16 years ago
MySQL.
I need to invest some time to deal with the latest BackwardsIncompatibleChanges, then I'll try updating and try again to recreate the problem.
comment:3 by , 16 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
I updated to r8207. The bug still exists.
I've reproduced this on:
Windows, Python 2.5.2, MySQL
Linux, Python 2.5.2, MySQL
comment:4 by , 16 years ago
Component: | Core framework → Database wrapper |
---|---|
milestone: | → 1.0 |
follow-up: 12 comment:6 by , 16 years ago
OK, I've been able to recreate this if I change my table charset to latin1. I had been using a default of utf8, and with that I could not recreate it. So the key to hitting the error seems to be (I'm guessing) having a non-utf8 charset for the column/table.
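For anyone trying to reproduce this, a minimal sketch of forcing that setup (the table name myapp_article and the connection details are hypothetical; assumes the MySQLdb driver):

import MySQLdb

# Connect speaking utf8, as Django always does on its connection.
conn = MySQLdb.connect(db='mysite', user='root', passwd='', charset='utf8')
cur = conn.cursor()

# Convert the table to latin1 so its storage charset differs from the
# connection charset, mimicking MySQL's default server configuration.
cur.execute("ALTER TABLE myapp_article CONVERT TO CHARACTER SET latin1")
conn.commit()

Then rerun the shell session from the description against this table.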
by , 16 years ago
Use COLLATE UTF8_BIN not BINARY to force case-sensitive comparisons on MySQL
comment:8 by , 16 years ago
milestone: | 1.0 → 1.0 beta |
---|---|
Triage Stage: | Unreviewed → Accepted |
Attached a patch that changes case-sensitive comparisons from using BINARY to COLLATE UTF8_BIN. That way MySQL doesn't neglect to do the charset conversion between the table charset and utf8 (which is what Django always sends). Tested it with a latin1 table and the provided model. Before the patch I observed the failure described; after the patch, case-sensitive comparisons correctly retrieve the matched row:
>>> from mysqlci.models import Article
>>> string = u'\xea'
>>> a1, created1 = Article.objects.get_or_create(headline = string)
>>> a1
<Article: Article pk=4, headline=ê>
>>> created1
True
>>> a2, created2 = Article.objects.get_or_create(headline = string)
>>> a2
<Article: Article pk=4, headline=ê>
>>> created2
False
>>> Article.objects.filter(headline__contains = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__startswith = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__endswith = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__regex = string)
[<Article: Article pk=4, headline=ê>]
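To make the change concrete, the generated WHERE clauses differ roughly as follows (a sketch; the exact SQL Django emits may vary):

# Before the patch: BINARY makes MySQL compare raw bytes, so no charset
# conversion happens between the latin1 column and the utf8 value sent
# over the connection, and the comparison silently misses.
old_where = "WHERE `headline` = BINARY %s"

# After the patch: COLLATE utf8_bin keeps both operands as character
# data (so MySQL converts between charsets) while still comparing
# case-sensitively.
new_where = "WHERE `headline` = %s COLLATE utf8_bin"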
Upped the milestone to beta since I do think this is a pretty serious error that should get addressed ASAP; core devs may of course override that.
comment:9 by , 16 years ago
Patch needs improvement: | set |
---|
Patch fixes the problem for (VAR)CHAR fields but introduces problems for non-string types, so it's no good. You can't specify COLLATE for, say, an integer field. Still investigating how to fix this.
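For example (a sketch of the failure mode; table and column names are hypothetical), MySQL rejects COLLATE applied to a non-character operand:

import MySQLdb

conn = MySQLdb.connect(db='mysite', user='root', passwd='', charset='utf8')
cur = conn.cursor()
try:
    # COLLATE only applies to character expressions; on an integer
    # comparison MySQL raises an error instead of matching.
    cur.execute("SELECT * FROM myapp_article WHERE id = %s COLLATE utf8_bin", (4,))
except MySQLdb.Error, e:
    print e  # collation not valid for the operand's character set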
follow-up: 11 comment:10 by , 16 years ago
OK, I've got an alternative version of the patch that, for iexact matches, casts what's being compared to a CHAR. That avoids the problem of specifying COLLATE for a column type that doesn't support it, but I'm unsure of the performance implications. I did not include the cast for the other lookup types, since it seems they only make sense when dealing with string types? But I could be missing something there.
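The generated comparison would then look something like the following (a sketch of the approach described, not the patch's exact SQL):

# Casting the bound parameter to CHAR gives it a character charset, so
# the COLLATE clause is legal even when the column itself is numeric.
where = "WHERE `id` = CAST(%s AS CHAR) COLLATE utf8_bin"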
I ran the test suite using a default latin1 charset (which means some testcases will fail, because some tests use data that is not representable in latin1) both with the patch and without. Differences observed:
First, one more failure without the patch than with. That is, the patch fixes one testcase (in regressiontests/string_lookup/models.py) that was known to fail with a non-UTF8 database for MySQL. Specifically these lookups:
# Regression tests for #3937: make sure we can use unicode characters in
# queries.
# BUG: These tests fail on MySQL, but it's a problem with the test setup. A
# properly configured UTF-8 database can handle this.
>>> fx = Foo(name='Bjorn', friend=u'François')
>>> fx.save()
>>> Foo.objects.get(friend__contains=u'\xe7')
<Foo: Foo Bjorn>

# We can also do the above query using UTF-8 strings.
>>> Foo.objects.get(friend__contains='\xc3\xa7')
<Foo: Foo Bjorn>
succeed with the patch.
The other failures (same both with and without the patch) relate to attempting to store data that can't be represented in latin1 and the fact that True/False on MySQL comes back as 1/0. (I ran with MyISAM as the storage engine in order to avoid other known(?) problems with InnoDB and serializers.)
Second, the tests took about 3% longer (28 min. vs. 27) to run WITHOUT the patch than with it. So there does not appear to be a huge performance hit here, though any difference could be swamped by the other overhead of running the tests.
I'll attach the updated patch but I'm not at all sure this is the right way to fix this problem. Someone with more than half a clue on mapping Django lookups to SQL might have a better approach.
by , 16 years ago
Attachment: | 8102-2.diff added |
---|
Cast exact lookups to CHAR before specifying COLLATE to avoid problems with specifying collate on numeric/binary columns
by , 16 years ago
Attachment: | nopatch_failures.txt added |
---|
For reference, test failures without the patch
comment:11 by , 16 years ago
Replying to kmtracey:
OK, I've got an alternative version of the patch that for iexact matches...
Typo, I meant EXACT matches there.
comment:12 by , 16 years ago
Uhm, do we have to care about non-utf8 databases?
"Make sure your database is configured to be able to store arbitrary string data. Normally, this means giving it an encoding of UTF-8 or UTF-16."
comment:13 by , 16 years ago
latin1 is the default MySQL encoding, so yes, I think it is important to support it. The current Django docs come nowhere close to saying that a database encoding of utf-8 is required for proper interaction with Django. The very next sentence of what you quoted says:
"If you use a more restrictive encoding — for example, latin1 (iso8859-1) — you won’t be able to store certain characters in the database, and information will be lost."
That says if you don't use utf-8 you won't be able to store certain data, not that Django will exhibit crippling bugs. It makes perfect sense that if I choose a latin1 encoding for my db I won't be able to store, say, Japanese characters in it. These characters simply cannot be represented in latin1 encoding.
It does not make sense that I cannot successfully do an exact lookup on, say, little e with circumflex. It's a perfectly legitimate character in both latin1 and Unicode; it just has a different binary representation in latin1 vs. utf8.
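The differing byte representations are easy to see from a Python shell:

>>> u'\xea'.encode('latin1')  # ê is a single byte in latin1
'\xea'
>>> u'\xea'.encode('utf8')    # ...and two bytes in utf8
'\xc3\xaa'

A byte-wise (BINARY) comparison of those two encodings can never match, which is exactly the failure described above.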
comment:15 by , 16 years ago
Description: | modified (diff) |
---|
Unbreak the description so that my eyes stop watering.
I'm not 100% certain this is a real bug. If you're using latin1 as your database coding, then Django cannot be expected to adjust for that because it should not have to introspect the server/client differences. Could well be a case of "Doctor, it hurts when I do this...".
Still there's a lot of other good collate-related stuff in here, so leaving it open because there's some stuff here we need to apply.
follow-up: 17 comment:16 by , 16 years ago
Sigh. If you do decide not to fix this, I believe the documentation needs to be much more strongly worded in terms of requiring, not just recommending, using utf8 encoding on the MySQL side. Plus the table-creation SQL generated for MySQL should ideally specify utf8 encoding so that the user doesn't have to go fix it themselves after the fact or change their database installation default.
Up until this binary change, MySQL has transparently handled on its side the conversions required between the utf8 spoken on the Django connection and whatever encoding the DB/table/column is actually stored in. The problem now is that Django is specifying in its WHERE clause a modifier that tells MySQL to treat the supplied data as a binary sequence, not a character string, so MySQL does not do the encoding conversion.
The 'proper' way to tell MySQL to do a case-sensitive match is to specify the binary collation for the encoding. (The collation specified must be valid for the connection encoding; since Django always speaks utf8 on the connection, it can always specify the binary utf8 collation, and no introspection of the DB is required to determine what encoding the DB actually uses.) The trouble there is that COLLATE is not valid for non-character data, so a blind mapping of 'exact' to '%s COLLATE UTF8_BIN' fails spectacularly for numeric primary key lookups, for example.
If there were some way to specify different lookup mappings based on the data type involved, the mapping could specify COLLATE only for character types, but there doesn't seem to be any way to do that in the current Django database backend spec. I'm not very familiar with the db backend stuff, but in looking through it I noticed there are hooks like field_cast_sql and lookup_cast that allow the db backend to tweak the SQL a bit. This looked promising, but what is needed for MySQL is something that gets handed both the db_type (as in field_cast_sql) and the lookup_type (as in lookup_cast). A method that got handed those two pieces of information could, I think, correctly decide whether a COLLATE was necessary. I don't see that any such fixup routine exists, though.
Lacking such a fixup routine, always doing a cast to CHAR & specifying the binary collation for 'exact' lookups seems to work, based on the test suite results. I haven't found a case where casting to CHAR causes an error, nor a case where one of the other lookups is used on a non-character datatype. But I'll admit I'm still not entirely comfortable with it, so I was hoping there was a better way.
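For illustration only, such a fixup routine might look like this (entirely hypothetical; lookup_cast_for_type is not an existing Django hook, and the class is reduced to the one method):

class DatabaseOperations(object):
    def lookup_cast_for_type(self, lookup_type, db_type):
        # Hypothetical hook: because it sees both the lookup and the
        # column type, it can apply COLLATE only where it is legal.
        if lookup_type == 'exact' and db_type in ('char', 'varchar', 'text'):
            return '%s COLLATE utf8_bin'
        return '%s'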
BTW I'm guessing this is the problem referred to in this thread:
which so far as I noticed was never followed up on.
comment:17 by , 16 years ago
Replying to Karen Tracey <kmtracey@gmail.com>:
the table-creation SQL generated for MySQL should ideally specify utf8 encoding so that the user doesn't have to go fix it themselves after the fact or change their database installation default
That's what I thought, too. Is there a ticket for that? Why isn't Django doing it?
comment:18 by , 16 years ago
Owner: | changed from | to
---|---|
Status: | reopened → new |
comment:19 by , 16 years ago
(In [8319]) Changed "exact" matches in MySQL to use the database's native collation.
This effectively reverses the change in [7798]. It was proving too difficult to
successfully manage all the side effects here and provide a satisfactory
solution for everybody. Many thanks to arne, Martin von Löwis and, particular,
Karen Tracey, for doing a lot of research and proto-patches here to establish what was possible and practical.
This is backwards incompatible if you were relying on the behaviour after
[7798]. The docs have been updated to indicate the solution.
comment:20 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
This has been fixed by the reversion of the changes from [7798].
comment:21 by , 16 years ago
It still strikes me that specifying utf-8 encoding when syncdb creates tables is an inherently good idea. Should a new ticket be created for this?
comment:22 by , 16 years ago
No, because that's not Django's decision. It's up to the database creator. UTF-8 is not always the right encoding.
comment:23 by , 16 years ago
Just a followup in case anyone ever considers trying to actually support both case-sensitive & case-insensitive lookups on Django/MySQL by specifying COLLATE in the SQL generated by Django:
I experimented with this a bit, and there seem to be nasty performance implications to specifying a collation that differs from a column's default collation. My DB is stored with the latin1 charset and a default case-insensitive collation. I have a view that issues quite a few queries (hundreds), which with current Django code takes maybe a couple of seconds to complete. Changing the Django code to specify one of the UTF collations caused that view to start taking a minute or so. The problem seems to be that if you explicitly specify a collation that does not match the column's default, MySQL cannot use any indexes (indices?) it has for that column. Here's the output from EXPLAIN on Django's current SQL for the kinds of queries used in this view:
mysql> explain SELECT COUNT(*) FROM `Clues` INNER JOIN `Entries`
    ON (`Clues`.`Entry ID` = `Entries`.`Entry ID`)
    WHERE (NOT (`Clues`.`Derived` = 1)
    AND `Entries`.`Entry` = 'ERIC'
    AND `Clues`.`Clue` = 'Idle Monty Python member?');
+----+-------------+---------+-------+-----------------+------------+---------+-------+------+-------------+
| id | select_type | table   | type  | possible_keys   | key        | key_len | ref   | rows | Extra       |
+----+-------------+---------+-------+-----------------+------------+---------+-------+------+-------------+
|  1 | SIMPLE      | Entries | const | PRIMARY,Entry   | Entry      | 52      | const |    1 |             |
|  1 | SIMPLE      | Clues   | ref   | EntryIndex,Clue | EntryIndex | 4       | const |  208 | Using where |
+----+-------------+---------+-------+-----------------+------------+---------+-------+------+-------------+
2 rows in set (0.00 sec)
MySQL uses the Entry index to speed up the query. But if Django tries to specify a collation:
mysql> explain SELECT COUNT(*) FROM `Clues` INNER JOIN `Entries`
    ON (`Clues`.`Entry ID` = `Entries`.`Entry ID`)
    WHERE (NOT (`Clues`.`Derived` = 1)
    AND `Entries`.`Entry` = 'ERIC' COLLATE UTF8_BIN
    AND `Clues`.`Clue` = 'Idle Monty Python member?' COLLATE UTF8_BIN);
+----+-------------+---------+------+---------------+------------+---------+------------------------------+--------+-------------+
| id | select_type | table   | type | possible_keys | key        | key_len | ref                          | rows   | Extra       |
+----+-------------+---------+------+---------------+------------+---------+------------------------------+--------+-------------+
|  1 | SIMPLE      | Entries | ALL  | PRIMARY       | NULL       | NULL    | NULL                         | 138731 | Using where |
|  1 | SIMPLE      | Clues   | ref  | EntryIndex    | EntryIndex | 4       | crosswordDB.Entries.Entry ID |      7 | Using where |
+----+-------------+---------+------+---------------+------------+---------+------------------------------+--------+-------------+
2 rows in set (0.00 sec)
The Entry index is not eligible; key=NULL and rows=138731 on the first line is bad, and the resulting query takes about 25 times longer than the non-COLLATE version. (It doesn't matter whether the collation specified is BIN or case-insensitive; the fact that it is for UTF8 instead of latin1 seems to prevent use of the index.)
So while specifying COLLATE might be a way to get correct case-sensitive + case-insensitive results out of the Django/MySQL combo, it looks to have a pretty serious performance impact for anyone not using a utf8 coding in their database. Not sure it would be worth it unless someone knows a way to fix that performance hit. [I don't know that anyone (besides maybe me, just out of curiosity to see if I could get it to work) was going to pursue that idea at all, but just in case anyone does, I figured I'd mention this while it was fresh in my mind.]