Opened 13 years ago
Closed 12 years ago
#16052 closed Bug (wontfix)
MySQL: CharField's with utf8_bin collation return as str, should return as unicode.
Reported by: | Graham King | Owned by: | Graham King |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | mysql utf_bin unicode |
Cc: | Triage Stage: | Design decision needed | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
On MySQL, a CharField with default collation comes out as unicode, but with utf8_bin it comes out as str.
Adding this line into the conversion dictionary (in django/db/backends/mysql/base.py) fixes it:
FIELD_TYPE.VAR_STRING: [(None, force_unicode)],
Patch includes two tests in regressiontests/backends/. The full test suite passes with this patch applied.
Attachments (1)
Change History (6)
comment:1 by , 13 years ago
Owner: | changed from | to
---|
comment:2 by , 13 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
by , 13 years ago
Attachment: | mysql_utf8bin_unicode.diff added |
---|
Tries to guess whether a VAR_STRING has binary or unicode data, and converts appropriately.
comment:3 by , 13 years ago
Thanks for the explanation. I thought it was a little too easy! :-)
It seems the 'standard' way of guessing if we have binary or text data is to look for a null byte. Apparently that's what 'grep' and 'git diff' do. I've changed the patch to do that, and added tests. A VAR_STRING now becomes str if it's binary data (or utf-16), unicode otherwise.
Null byte idea is from here: http://stackoverflow.com/questions/898669/how-can-i-detect-if-a-file-is-binary-non-text-in-python
Hopefully this means we can do both: not break (VAR)BINARY columns, and make utf8_bin CharField less surprising.
comment:5 by , 12 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Django gained a binary field in 8ee1eddb7e148de89aebde9e68da495633fc1ec9.
Unless I missed something, the technique of looking for a null byte is very fragile. If the data stored in a binary field happens not to contain a null byte, it will be returned as unicode. You could iterate over 1000 rows, get 993 str and 7 unicode!
We tried to do something like this before, see r8318 and r8329. You'll notice we were trying to specify
FIELD_TYPE.VARCHAR
, notFIELD_TYPE.VAR_STRING
and upon further investigation we found that the VARCHAR variant was never in fact returned by the MySQL C API, so the code added to attempt to convert to unicode wasn't ever called. The attempt was reverted in r8380.It was reverted rather than switched to specify VAR_STRING because VAR_STRING is also the type reported for a MySQL column with type VARBINARY: truly binary data. Attempting to convert truly binary data to unicode (assuming utf-8 encoding) could generate an exception or could corrupt the data; it didn't seem appropriate to be attempting the conversion for all types MySQL reported as VAR_STRING types.
Now, Django doesn't actually support binary fields (never generates a (VAR)BINARY column), but back then there was some hope it might in the future. (Though I've not heard any mention of binary field support in a while.) However, the ability to pretty easily create/use custom fields means there might be people out there with custom fields that use the MySQL VARBINARY type...I fear this proposed change would break such fields.
Also note that this deviation from the norm of returning unicode for character data is documented in the database notes: http://docs.djangoproject.com/en/1.3/ref/databases/#collation-settings. If/when we do find a way to safely change the low-level behavior here those notes will need to be revised as well.