Opened 16 years ago

Closed 16 years ago

Last modified 12 years ago

#8340 closed (fixed)

MySQL case sensitivity and utf8_bin / 2710

Reported by: hrauch Owned by: Malcolm Tredinnick
Component: Uncategorized Version: dev
Severity: Keywords:
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

As noted in the documentation, I tried to change the collation within my Django-database.

So I dumped the database and changed "utf8_german_ci" to "utf8_bin". After restoring the data and trying to open one page, I'll get the following error (I'm using the latest svn-version 8366):

[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] mod_python (pid=2259, interpreter='dms', phase='PythonHandler?', handler='django.core.handlers.modpython'): Application error 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] ServerName?: 'www.dms.bildung.hessen.de' 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] DocumentRoot?: '/srv/www/htdocs' 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] URI: '/news/index.html' 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] Location: None 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] Directory: None [Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] Filename: '/srv/www/htdocs/news' 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] PathInfo?: '/index.html' 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] Traceback (most recent call last):

[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/lib64/python2.5/site-packages/mod_python/importer.py", line 1537, in HandlerDispatch?\n default=default_handler, arg=req, silent=hlist.silent) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/lib64/python2.5/site-packages/mod_python/importer.py", line 1229, in _process_target\n result = _execute_target(config, req, object, arg) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/lib64/python2.5/site-packages/mod_python/importer.py", line 1128, in _execute_target\n result = object(arg) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/core/handlers/modpython.py", line 210, in handler\n return ModPythonHandler?()(req) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/core/handlers/modpython.py", line 183, in call\n response = self.get_response(request) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/core/handlers/base.py", line 67, in get_response\n response = middleware_method(request) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/middleware/locale.py", line 16, in process_request\n language = translation.get_language_from_request(request) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/utils/translation/init.py", line 97, in get_language_from_request\n return real_get_language_from_request(request) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/utils/translation/trans_real.py", line 351, in get_language_from_request\n lang_code = request.session.get('django_language', None) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/contrib/sessions/backends/base.py", line 50, in get\n return self._session.get(key, default) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/contrib/sessions/backends/base.py", line 147, in _get_session\n self._session_cache = self.load() 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/contrib/sessions/backends/db.py", line 19, in load\n return self.decode(s.session_data) 
[Fri Aug 15 08:52:23 2008] [error] [client 192.168.0.222] File "/usr/local/lib64/python2.5/site-packages/django/contrib/sessions/backends/base.py", line 80, in decode\n encoded_data = base64.decodestring(session_data)

Change History (13)

comment:1 by Karen Tracey <kmtracey@…>, 16 years ago

An unfortunate side-effect of setting a binary collation is that the data starts getting returned from the DB as bytestrings instead of unicode. I'd guess the sessions backend is tripping over this though it's a little hard to be sure because the ultimate error encountered seems to have been left out of the traceback. In looking at the code to see where it would go next after the last line in the traceback I do not think you are running r8366 as you say. Line 80 of django/contrib/sessions/backends/base.py has not been encoded_data = base64.decodestring(session_data) since before r8340. I don't think updating to the latest code would fix this error, but figured I'd mention you don't seem to be running what you think you are.

I think changing the entire database to use a binary collation is not a good idea. Given the side-effect mentioned above a better approach, I believe, is to selectively change the collation to binary for columns where it is really essential for you to get case-sensitive comparisons. (I expect you could fix this error by changing the collation for django_session.session_data back to a non-binary one, but I do not know if there are other built-in apps that will trip over this side-effect as well.)

comment:2 by Daniel Pope <dan@…>, 16 years ago

The data should not be returned as bytestrings from the database with utf8_bin collation. Binary collation just changes the comparison rules. It doesn't affect character sets.

The location of the bug suggests it cannot base64-decode the session data. Is it possible your django_session table is corrupt? You could try resetting django.contrib.sessions.

in reply to:  2 ; comment:3 by hrauch, 16 years ago

Thanks for your suggestions! I now use svn-version 8381 - doble-checked!

After I truncated table django_session and change EVERY text- and longtext-field back to utf8_unicode_ci, it seems to be ok.

But I don't understand, why this solves the problem. If I ask a database to return a string with collation utf8_bin, both char- and text-field should return the same data-type.

Maybe the changes from [7798] were the better solution?

comment:4 by Karen Tracey <kmtracey@…>, 16 years ago

Replying to Daniel Pope <dan@mauveinternet.co.uk>:

The data should not be returned as bytestrings from the database with utf8_bin collation. Binary collation just changes the comparison rules. It doesn't affect character sets.

Should not, perhaps, but does, in fact. Setting a binary collation causes MySQL to set the BINARY flag in the flags field of the data structure it uses to describe the data it returns via its C-API (which is ultimately what MySQLdb uses). MySQLdb uses this flag to decide if the data is binary or character in nature. If BINARY is set, then MySQLdb does not transform the returned data to Unicode.

There is another bit of information, called charsetnr, that is apparently what MySQLdb could use to decide if a particular returned field is actually binary. But it doesn't currently do so. See: http://sourceforge.net/tracker/index.php?func=detail&aid=1387290&group_id=22307&atid=374934.

MySQLdb is not the only db connector to run into trouble here. If you search around you'll find lots of bugs reported about character data being returned as binary, because MySQL's interface here was originally lacking (charsetnr was only added in 4.1) and is now extremely non-obvious.

in reply to:  3 comment:5 by Karen Tracey <kmtracey@…>, 16 years ago

Replying to hrauch:

Maybe the changes from [7798] were the better solution?

Well that remains to be seen. For years exact matches on MySQL returned case insensitive results with little outcry. Fixing it caused other problems that were hard to fix, so things were put back the way they had been, but believe me not without a lot of thought/discussion/consideration of alternatives, just search the tracker for get_and_create problems introduced by [7798]. The current situation is not ideal but was status quo for a long time with apparently little trouble. Personally I don't think it would be impossible to get exact matches returning case-sensitive results on MySQL without the side-effects of [7789], but less than one month before 1.0 release is not the time to be making the kind of code changes required to make it work properly. Maybe it will be looked at again after 1.0, if it really seems like a problem enough people need solved.

comment:6 by hrauch, 16 years ago

I do understand your time-line arguments. But the problem with utf8_bin seems to be more complicated:

When I changed most of our var chars field to utf8_bin on my developing system (SuSE 11), everything works fine. But if I do the same with our production server (Ubuntu 8.04), I'll sometimes get UnicodeDecodeErrors, since some of the var char fields do return normal strings instead of unicode strings. Both systems use the same django version; both systems use the same database by dumping and restorung the database.

If I turn back theses fields from utf8_bin to ut8_unicode_ci, the production system works well again.

Another problem: If I use utf8_bin for a field and use the query operator iexect, it does work correctly.

So I must say, that case sensitivity with mysql and utf8_bin doesn't work well.

in reply to:  6 comment:7 by Karen Tracey <kmtracey@…>, 16 years ago

Replying to hrauch:>

When I changed most of our var chars field to utf8_bin on my developing system (SuSE 11), everything works fine. But if I do the same with our production server (Ubuntu 8.04), I'll sometimes get UnicodeDecodeErrors, since some of the var char fields do return normal strings instead of unicode strings. Both systems use the same django version; both systems use the same database by dumping and restorung the database.

Two things could cause the UnicodeDeocdeErrors only occurring on one machine. First, only MySQLdb 1.2.2 returns character fields with binary collation as bytestrings, MySQL 1.2.1p2 returns them as unicode. So if your SuSE 11 system is using the older MySQLdb then you would not see this problem there. (The problem with the behavior of the older level is that it will either throw an exception or corrupt truly binary data when it tries to convert it to unicode assuming a utf8 encoding; the fix for that bug introduced the behavior you see now.) Second, you'll only get the UnicodeDecodeErrors when you actually access data with non-ASCII chars, so if you did not happen to access problematic data on your development system, it would appear to work even if it too is running the latest MySQLdb level.


If I turn back theses fields from utf8_bin to ut8_unicode_ci, the production system works well again.

Alternatively you could keep the utf8_bin collation and use django.utils.encoding.force_unicode() on the values of your binary-collated character character fields to properly transform them to unicode. See this thread on django-users: http://groups.google.com/group/django-users/browse_thread/thread/d7dd21493ab5f1fa/eafc9959bb3302f6


Another problem: If I use utf8_bin for a field and use the query operator __iexect, it does work correctly.

I think you mean it does not work correctly? As it stands today if you use MySQL, depending on how you have your collation set, either exact or iexact will work correctly, not both. Django does not specify a collation for either comparison, so both use the column's default collation, which will only be correct for one of them. To get them both to work would require, I believe, some changes in the way Django interacts with the backend to construct the query, see some of the comments I made on #8102.


So I must say, that case sensitivity with mysql and utf8_bin doesn't work well.

I don't think anyone sees the current behavior as ideal, just the best that could be achieved for the 1.0 timeframe.

comment:8 by zbskii, 16 years ago

Owner: changed from nobody to zbskii

comment:9 by Malcolm Tredinnick, 16 years ago

After extensive investigations from Karen and zbskii (who spent a whole day on this at the Portland sprint) and some checking from me, this is Officially Hard(tm). We can't really make things work correctly in all cases without harming people using VARBINARY fields (which is simply not an option).

Thus, this is going to be documented as a "trap" for 1.0. I will fix the session backend to handle this situation (which was the original bug reported here) and anybody else discovering this problem in Django core code can open new tickets. People will just have to work around it in their own code (with judicious calls to force_unicode()) if they want to operate in this mode.

The commit to the session backend will close this ticket. I'll update the docs once Jacob lands the docs refactor in the next 24 hours or so (we have a temporary freeze on docs changes at the moment).

comment:10 by Malcolm Tredinnick, 16 years ago

milestone: 1.0
Owner: changed from zbskii to Malcolm Tredinnick
Triage Stage: UnreviewedAccepted

Moving as 1.0, since it causes Django to crash under not-unreasonable circumstances. I'll take care of fixing it.

comment:11 by Malcolm Tredinnick, 16 years ago

(In [8507]) Avoid a crash when unencoding session data for the db backend. This is required
because some configurations of MySQL (with utf8_bin collation) will return
bytestring, rather than unicode data, which was causing problems previously.

Refs #8340.

comment:12 by Malcolm Tredinnick, 16 years ago

Resolution: fixed
Status: newclosed

I've opened #8506 for the general documentation issue here, so I'll close this now.

comment:13 by Jacob, 12 years ago

milestone: 1.0

Milestone 1.0 deleted

Note: See TracTickets for help on using tickets.
Back to Top