Opened 16 years ago

Last modified 13 years ago

#8102 closed

get_or_create is unable to handle unicode characters — at Version 15

Reported by: Dave Naffziger Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description (last modified by Malcolm Tredinnick)

Models.py:

class Article(models.Model):
    headline = models.CharField(max_length=100)

Shell session:
>>> from mysite.models import Article
>>> string = u'\xae'
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article object>, True)
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article object>, True)

This seems to be recently introduced (I'm on 8125) and only noticed the bug after a recent update.

This is a pretty major bug and I ran out of time tonight to dig into it, but I wanted to at raise the issue now so that others can look at it.

This is going to create serious backwards incompatibility issues in people's databases if they are running with this bug.

Change History (19)

comment:1 by Karen Tracey, 16 years ago

Resolution: worksforme
Status: newclosed

I cannot recreate this on r8204 with either MySQL or sqlite backends. In both cases, using the model you show (except with a __unicode__ method added that prints the pk), I get:

>>> from indent.models import Article
>>> string = u'\xae'
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article pk=1>, True)
>>> a = Article.objects.get_or_create(headline = string)
>>> a
(<Article: Article pk=1>, False)
>>>

So I see it created with the first call and not with the second. What DB are you using?

comment:2 by Dave Naffziger, 16 years ago

Mysql.

I need to invest some time to deal with the latest BackwardsIncompatibleChanges then I'll try updating and will try again to recreate the problem.

comment:3 by Dave Naffziger, 16 years ago

Resolution: worksforme
Status: closedreopened

I updated to r8207. The bug still exists.

I've reproduced this on:
Windows, Python 2.5.2, MySQL
Linux, Python 2.5.2, MySQL

comment:4 by Julian Bez, 16 years ago

Component: Core frameworkDatabase wrapper
milestone: 1.0

Looks like get_or_create is pretty screwed up... #7402 #7789

comment:5 by Julian Bez, 16 years ago

Cannot reproduce it with Python 2.5.2, sqlite on Windows.

comment:6 by Karen Tracey, 16 years ago

OK, I've been able to recreate this if I change my table charset to latin1. I had been using a default of utf8, and with that I could not recreate. So key to hitting the error seems to be (I'm guess) having a non-utf8 charset for the column/table.

comment:7 by Dave Naffziger, 16 years ago

Yup, I've got a latin1 charset default on mysql.

by Karen Tracey, 16 years ago

Attachment: 8102.diff added

Use COLLATE UTF8_BIN not BINARY to force case-sensitive comparisons on MySQL

comment:8 by Karen Tracey, 16 years ago

milestone: 1.01.0 beta
Triage Stage: UnreviewedAccepted

Attached a patch that changes case-sensitive comparisons from using BINARY to COLLATE UTF8_BIN. That way MySQL doesn't neglect to do the charset conversion between table charset and utf8 (what Django always provides). Tested it with a latin1 table and the provided model. Before the patch I observed the failure described, after the patch case-sensitive comparisons correctly retrieve the matched row:

>>> from mysqlci.models import Article
>>> string = u'\xea'
>>> a1, created1 = Article.objects.get_or_create(headline = string)
>>> a1
<Article: Article pk=4, headline=ê>
>>> created1
True
>>> a2, created2 = Article.objects.get_or_create(headline = string)
>>> a2
<Article: Article pk=4, headline=ê>
>>> created2
False
>>> Article.objects.filter(headline__contains = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__startswith = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__endswith = string)
[<Article: Article pk=4, headline=ê>]
>>> Article.objects.filter(headline__regex = string)
[<Article: Article pk=4, headline=ê>]
>>> 

Upped the milestone to beta since I do think this is a pretty serious error that should get addressed ASAP, core devs may of course override that.

comment:9 by Karen Tracey, 16 years ago

Patch needs improvement: set

Patch fixes problems for (VAR)CHAR fields but introduces problems for non-string types, so is no good. You can't specify COLLATE for, say, an integer field. Still investigating how to fix this.

comment:10 by Karen Tracey, 16 years ago

OK, I've got an alternative version of the patch that for iexact matches casts what's being compared to a CHAR. That avoids problems with specifying COLLATE for a column type that doesn't support such, but I'm unsure of the performance implications of this. I did not include the cast for the other lookup types, since it seems they only make sense when dealing with string-types? But I could be missing something there.

I ran the test suite using a default latin1 charset (which means some testcases will fail, because some tests use data that is not representable in latin1) both with the patch and without. Differences observed:

First, one more failure without the patch than with. That is, the patch fixes one testcase (in regressiontests/string_lookup/models.py) that was known to fail with a non-UTF8 database for MySQL. Specifically these lookups:

# Regression tests for #3937: make sure we can use unicode characters in
# queries.
# BUG: These tests fail on MySQL, but it's a problem with the test setup. A
# properly configured UTF-8 database can handle this.

>>> fx = Foo(name='Bjorn', friend=u'François')
>>> fx.save()
>>> Foo.objects.get(friend__contains=u'\xe7')
<Foo: Foo Bjorn>

# We can also do the above query using UTF-8 strings.
>>> Foo.objects.get(friend__contains='\xc3\xa7')
<Foo: Foo Bjorn>

succeed with the patch.

The other failures (same both with and without the patch) relate to attempting to store data that can't be represented in latin1 and the fact that True/False on MySQL comes back as 1/0. (I ran with MyISAM as the storage engine in order to avoid other known(?) problems with InnoDB and serializers.)

Second, the tests took 3% (28 min. vs. 27) longer to run WITHOUT the patch than with. So there does not appear to be a huge performance hit with this, though any difference could be being overwhelmed by the other overhead of running the tests.

I'll attach the updated patch but I'm not at all sure this is the right way to fix this problem. Someone with more than half a clue on mapping Django lookups to SQL might have a better approach.

by Karen Tracey, 16 years ago

Attachment: 8102-2.diff added

Cast exact lookups to CHAR before specifying COLLATE to avoid problems with specifying collate on numeric/binary columns

by Karen Tracey, 16 years ago

Attachment: nopatch_failures.txt added

For reference, test failures without the patch

by Karen Tracey, 16 years ago

Attachment: patch_failures.txt added

For reference, test failures with the patch

in reply to:  10 comment:11 by Karen Tracey, 16 years ago

Replying to kmtracey:

OK, I've got an alternative version of the patch that for iexact matches...

Typo, I meant EXACT matches there.

in reply to:  6 comment:12 by Julian Bez, 16 years ago

Uhm, do we have to care about non-utf8 databases?

"Make sure your database is configured to be able to store arbitrary string data. Normally, this means giving it an encoding of UTF-8 or UTF-16."

http://www.djangoproject.com/documentation/unicode/

comment:13 by anonymous, 16 years ago

latin1 is the default MySQL encoding, so yes I think it is important to support it. The current Django docs come nowhere close to saying that a database encoding of utf-8 is required for proper interaction with Django. The very next sentence of what you quoted says:

"If you use a more restrictive encoding — for example, latin1 (iso8859-1) — you won’t be able to store certain characters in the database, and information will be lost."

That says if you don't use utf-8 you won't be able to store certain data, not that Django will exhibit crippling bugs. It makes perfect sense that if I choose a latin1 encoding for my db I won't be able to store, say, Japanese characters in it. These characters simply cannot be represented in latin1 encoding.

It does not make sense that I cannot successfully do an exact lookup on, say, little e with circumflex. It's a perfectly legitimate character in both latin1 and Unicode, it's just got a different binary representation in latin1 vs. utf8.

comment:14 by Karen Tracey <kmtracey@…>, 16 years ago

(That was me.)

comment:15 by Malcolm Tredinnick, 16 years ago

Description: modified (diff)

Unbreak the description so that my eyes stop watering.

I'm not 100% certain this is a real bug. If you're using latin1 as your database coding, then Django cannot be expected to adjust for that because it should not have to introspect the server/client differences. Could well be a case of "Doctor, it hurts when I do this...".

Still there's a lot of other good collate-related stuff in here, so leaving it open because there's some stuff here we need to apply.

Note: See TracTickets for help on using tickets.
Back to Top