Opened 4 years ago

Closed 2 years ago

#16052 closed Bug (wontfix)

MySQL: CharField's with utf8_bin collation return as str, should return as unicode.

Reported by: graham_king Owned by: graham_king
Component: Database layer (models, ORM) Version: master
Severity: Normal Keywords: mysql utf_bin unicode
Cc: Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

On MySQL, a CharField with default collation comes out as unicode, but with utf8_bin it comes out as str.

Adding this line into the conversion dictionary (in django/db/backends/mysql/base.py) fixes it:

 FIELD_TYPE.VAR_STRING: [(None, force_unicode)],

Patch includes two tests in regressiontests/backends/. The full test suite passes with this patch applied.

Attachments (1)

mysql_utf8bin_unicode.diff (4.5 KB) - added by graham_king 4 years ago.
Tries to guess whether a VAR_STRING has binary or unicode data, and converts appropriately.

Download all attachments as: .zip

Change History (6)

comment:1 Changed 4 years ago by graham_king

  • Needs documentation unset
  • Needs tests unset
  • Owner changed from nobody to graham_king
  • Patch needs improvement unset

comment:2 Changed 4 years ago by kmtracey

  • Triage Stage changed from Unreviewed to Design decision needed

We tried to do something like this before, see r8318 and r8329. You'll notice we were trying to specify FIELD_TYPE.VARCHAR, not FIELD_TYPE.VAR_STRING and upon further investigation we found that the VARCHAR variant was never in fact returned by the MySQL C API, so the code added to attempt to convert to unicode wasn't ever called. The attempt was reverted in r8380.

It was reverted rather than switched to specify VAR_STRING because VAR_STRING is also the type reported for a MySQL column with type VARBINARY: truly binary data. Attempting to convert truly binary data to unicode (assuming utf-8 encoding) could generate an exception or could corrupt the data; it didn't seem appropriate to be attempting the conversion for all types MySQL reported as VAR_STRING types.

Now, Django doesn't actually support binary fields (never generates a (VAR)BINARY column), but back then there was some hope it might in the future. (Though I've not heard any mention of binary field support in a while.) However, the ability to pretty easily create/use custom fields means there might be people out there with custom fields that use the MySQL VARBINARY type...I fear this proposed change would break such fields.

Also note that this deviation from the norm of returning unicode for character data is documented in the database notes: http://docs.djangoproject.com/en/1.3/ref/databases/#collation-settings. If/when we do find a way to safely change the low-level behavior here those notes will need to be revised as well.

Changed 4 years ago by graham_king

Tries to guess whether a VAR_STRING has binary or unicode data, and converts appropriately.

comment:3 Changed 4 years ago by graham_king

Thanks for the explanation. I thought it was a little too easy! :-)

It seems the 'standard' way of guessing if we have binary or text data is to look for a null byte. Apparently that's what 'grep' and 'git diff' do. I've changed the patch to do that, and added tests. A VAR_STRING now becomes str if it's binary data (or utf-16), unicode otherwise.

Null byte idea is from here: http://stackoverflow.com/questions/898669/how-can-i-detect-if-a-file-is-binary-non-text-in-python

Hopefully this means we can do both: not break (VAR)BINARY columns, and make utf8_bin CharField less surprising.

comment:4 Changed 3 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:5 Changed 2 years ago by aaugustin

  • Resolution set to wontfix
  • Status changed from new to closed

Django gained a binary field in 8ee1eddb7e148de89aebde9e68da495633fc1ec9.

Unless I missed something, the technique of looking for a null byte is very fragile. If the data stored in a binary field happens not to contain a null byte, it will be returned as unicode. You could iterate over 1000 rows, get 993 str and 7 unicode!

Note: See TracTickets for help on using tickets.
Back to Top