#7921 closed (fixed)
Tests that use 8-bit bytestrings fail on Python 2.6/sqlite3
Reported by: | | Owned by: |
---|---|---|---
Component: | Uncategorized | Version: | dev
Severity: | | Keywords: | python26
Cc: | | Triage Stage: | Ready for checkin
Has patch: | yes | Needs documentation: | no
Needs tests: | no | Patch needs improvement: | no
Easy pickings: | no | UI/UX: | no
Description
Running the test suite on Python 2.6 beta2 we get several failures with this sqlite3 complaint:
    return Database.Cursor.execute(self, query, params)
ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
Checking out the first couple of these so far, they are due to tests that are indeed using bytestrings, not unicode strings, as arguments. For example:
http://code.djangoproject.com/browser/django/trunk/tests/modeltests/custom_pk/models.py#L107
has: >>> emp = Employee(employee_code='jaźń')
Shouldn't that be: >>> emp = Employee(employee_code=u'jaźń') ?
Changing it like so makes the test succeed on 2.6b2 and has no apparent ill effects on 2.3/2.5. But I'm a little surprised this hasn't caused a problem before, so I fear I'm missing something here. Should these calls be changed to use u'', or is there something else that should be done to make them work on 2.6?
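For reference, here is a minimal standalone sketch of the failure, using the sqlite3 module directly rather than the Employee model from the test (the table and column names below are made up for illustration). On Python 2.6 with the bundled pysqlite 2.4.1, the bytestring parameter is expected to raise ProgrammingError, while the unicode literal works:

```python
# -*- coding: utf-8 -*-
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employee (employee_code TEXT)')

try:
    # 8-bit bytestring parameter: rejected by pysqlite 2.4.1 on Python 2.6
    conn.execute('INSERT INTO employee VALUES (?)', ('jaźń',))
except sqlite3.ProgrammingError, e:
    print 'bytestring rejected:', e

# unicode parameter: accepted on all versions
conn.execute('INSERT INTO employee VALUES (?)', (u'jaźń',))
```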
Attachments (1)
Change History (13)
comment:1 by , 16 years ago
Summary: | Tests the use 8-bit bytestrings fail on Python 2.6/sqlite3 → Tests that use 8-bit bytestrings fail on Python 2.6/sqlite3 |
---|
comment:2 by , 16 years ago
comment:3 by , 16 years ago
I'm confused how the coding marker could ever affect how string literals are interpreted when they are passed as parameters, because bytestrings have no encoding information associated with them, right? Per http://docs.python.org/ref/encodings.html#encodings, the coding marker is used by the Python interpreter for lexical analysis, and to decode Unicode literals:
"The encoding is used for all lexical analysis, in particular to find the end of a string, and to interpret the contents of Unicode literals. String literals are converted to Unicode for syntactical analysis, then converted back to their original encoding before interpretation starts."
So by the time the code is actually being interpreted, the string literal is back to being just a sequence of bytes. When it is passed as a parameter, how is the called function supposed to know the coding marker value of the file in which it was originally declared?
I don't understand how this has been working up to now except through always using utf-8 encoding in the files and an assumption on the part of the code receiving these strings that they are utf-8. I expect (though I haven't tried it) things would break if we changed any of the files to have, say, latin-1 coding markers (plus actually using that encoding in the string literals), because I think the underlying code would still be assuming the strings are utf-8 encoded. Or am I completely missing how the coding marker information from the source file can get associated with a string literal value as it is passed around?
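As a tiny sketch of the point about bytestrings carrying no encoding information (assuming a utf-8 coding marker on the source file):

```python
# -*- coding: utf-8 -*-
# A plain string literal in this file is just the UTF-8 bytes the editor
# saved; nothing records which coding marker produced them.  A unicode
# literal, by contrast, is decoded at compile time and is independent of
# the source encoding.
s = 'jaźń'      # str: 'ja\xc5\xba\xc5\x84' -- 6 bytes, no encoding attached
u = u'jaźń'     # unicode: u'ja\u017a\u0144' -- 4 characters
print repr(s), len(s)
print repr(u), len(u)
```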
Anyway, the error message is coming from pysqlite, and is the result of this changeset:
http://oss.itsystementwicklung.de/trac/pysqlite/changeset/307%3A0ec0f228658e
I get the impression it's intended to prevent the kind of thing I just mentioned above, say passing in a latin-1 encoded string that gets stored in the database but then causes trouble on retrieval because it's not valid utf-8 byte sequence. Only they don't actually test for valid utf-8, but rather reject anything that has a high bit set in any of the bytes.
I don't fully understand the text_factory part of the message, though it sounds like it might provide an alternative to using Unicode string literals (a guess at what it means is sketched below). Do you understand that part at all? I could try figuring out what it is trying to say if you think it would be useful. I started out focused on the "It is highly recommended that you instead just switch your application to Unicode strings" part because I thought that would be easier....
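A guess, based on the sqlite3 docs: text_factory is a connection attribute controlling how TEXT values are handed back to Python (unicode by default), and per the error message, setting it to str also makes 8-bit bytestring parameters acceptable again. A small sketch of that reading:

```python
# -*- coding: utf-8 -*-
import sqlite3

conn = sqlite3.connect(':memory:')
conn.text_factory = str   # return TEXT as bytestrings; per the error message,
                          # this also lets 8-bit bytestring parameters through
conn.execute('CREATE TABLE t (v TEXT)')
conn.execute('INSERT INTO t VALUES (?)', ('jaźń',))   # no ProgrammingError now
print conn.execute('SELECT v FROM t').fetchall()
```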
by , 16 years ago
Attachment: | sqlite3-str.diff added |
---|
Patch to convert str objects into Unicode before passing them to SQLite
comment:4 by , 16 years ago
Has patch: | set |
---|
This is an issue with PySQLite 2.4.1, not with Python 2.6 per se. The attached patch fixes the issue, by coercing all str objects to Unicode before passing them into sqlite3.
The source encoding has nothing to do with this; byte strings keep the semantics that they always had.
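For the record, one way to wire up this kind of coercion (a sketch of the idea only; the attached sqlite3-str.diff is the authoritative version and may be structured differently) is to register an adapter for str so that parameters are decoded to unicode before pysqlite's check runs:

```python
import sqlite3

def utf8_to_unicode(s):
    # Assume Django's convention that bytestrings are UTF-8 encoded.
    return s.decode('utf-8')

# With an adapter registered for str, 8-bit bytestring parameters are
# converted to unicode before binding, so pysqlite 2.4.1 no longer rejects them.
sqlite3.register_adapter(str, utf8_to_unicode)
```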
comment:5 by , 16 years ago
Owner: | changed from | to
---|
comment:6 by , 16 years ago
OK, I think I understand this. Prior to 2.4.1, PySQLite simply passed any bytestrings it was handed through to sqlite3, on the assumption that they were validly encoded. But if such data was not in fact validly encoded, it could not be retrieved via PySQLite since it could not be decoded properly, leading to problems like these:
http://itsystementwicklung.de/pipermail/list-pysqlite/2008-March/000007.html
http://itsystementwicklung.de/pipermail/list-pysqlite/2008-March/000017.html
So now PySQLite has put in a check to reject any bytestrings with chars out of the 7-bit ASCII range, but this breaks any app that was relying on the old behavior and in fact providing validly encoded utf-8 bytestrings. So the provided patch coerces the bytestrings to unicode assuming a utf-8 encoding, allowing these apps to still work properly. (One could argue the new PySQLite check is overly aggressive -- it should perhaps be checking for valid utf-8 and not simply rejecting everything outside the 7-bit ASCII range?)
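To make that parenthetical concrete, a "valid UTF-8" check would just be a decode attempt, rather than rejecting every byte above 0x7f (illustration only; this is not what pysqlite 2.4.1 actually does):

```python
# -*- coding: utf-8 -*-
def is_valid_utf8(s):
    try:
        s.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print is_valid_utf8('jaźń')        # True  -- valid UTF-8, yet pysqlite rejects it
print is_valid_utf8('ja\xfa\xf1')  # False -- latin-1 bytes, broken either way
```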
Net effect of the change is that apps that try to store non-utf8-encoded bytestrings will now (likely) fail on attempts to store, rather than sometime down the road on an attempt to retrieve the badly-formed data. Hypothetically one could have had an app that relied on this old pass-through behavior to store non-utf8 data but never tried to retrieve it via PySQLite (rather using some other interface for reading that didn't do any utf8-decoding, since sqlite3 itself apparently doesn't care if the data is valid utf8 and will simply pass back the bytes it was given). This hypothetical app will no longer work, but I doubt we care about that.
One thing I'm not entirely clear on: is there any possibility that valid utf16 bytestrings that used to work will now be rejected? I gather sqlite3 supports utf16 as well, but it sounds like that support is via a different interface, so if PySQLite is always using the utf8 interface then providing utf16 bytestrings to PySQLite never would have worked...right?
comment:7 (follow-up: 9) by , 16 years ago
Triage Stage: | Unreviewed → Accepted |
---|
Are you asking about UTF-16 bytestrings out of interest, Karen, or do you see some potential problem? I'm asking because I don't see how that affects us in this particular issue. Django treats all bytestrings as UTF-8 and I'm comfortable extending that to include django.db.backend.cursor.execute() and friends (since if you don't want that, you can use the sqlite3 module directly). But I'm willing to believe you're asking the leading question for a reason, so.... spill the beans! :-)
That aside, I'm pretty happy with this solution (which is why I assigned it to me, so I'll commit it when I next review my list). I'm thinking of just making that behaviour the default, however, rather than just for 2.4.1. That would also catch the poorly encoded input data problem you mention before we save to disk. Exploding is acceptable there, since the developer shouldn't be passing in non-UTF8 bytestrings and exploding is definitely better than writing non-readable data.
comment:8 by , 16 years ago
The disadvantage of making it the default (and with the approach taken) is that it slows down sqlite somewhat (how much I haven't measured).
PySQLite has an optimization where they track whether any primitive type has an adapter. If not, primitive types don't go through the adapter layer at all, but are directly used as-is. If one primitive type has an adapter specified, they all go through that layer.
In the adapter layer (as specified in PEP 246), it first checks whether an adapter factory was registered (with this patch there is one for str, but not for, say, int or unicode). If so, it calls it. If not, it checks whether __conform__ or __adapt__ are implemented. They aren't, so adaptation raises an exception. That exception gets cleared, and the original object is used.
So for int and unicode objects, adaptation will perform a few (failing) dictionary lookups, set an exception, and clear it.
The additional cost shouldn't matter in practice (again, I didn't measure it), so it might not be so bad to make this adaptation the default; I still thought I'd mention it.
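A rough sketch of the path described above (assuming the patch registers an adapter for str): once any adapter is registered, every parameter goes through the adaptation layer, but unregistered primitive types like int and unicode simply fail adaptation quietly and are used as-is, at the cost of the extra lookups mentioned:

```python
# -*- coding: utf-8 -*-
import sqlite3

sqlite3.register_adapter(str, lambda s: s.decode('utf-8'))

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (n INTEGER, v TEXT)')
# int and unicode: adaptation fails (no adapter, no __conform__), values used directly
conn.execute('INSERT INTO t VALUES (?, ?)', (1, u'abc'))
# str: the registered adapter decodes it to unicode before binding
conn.execute('INSERT INTO t VALUES (?, ?)', (2, 'jaźń'))
print conn.execute('SELECT * FROM t').fetchall()
```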
comment:9 by , 16 years ago
Replying to mtredinnick:
... But I'm willing to believe you're asking the leading question for a reason, so.... spill the beans! :-)
No, I wasn't trying to hint at any real problem I can see. I'm just ignorant enough of the interfaces involved here to not know what may or may not be a problem. I thought post-Unicode-merge that apps were supposed to be supplying Unicode objects, not bytestrings, so it's actually news to me that non-ASCII bytestrings are supported so long as they are utf8 encoded. I didn't know if the previous PySQLite pass-through behavior plus the fact that sqlite3 also allows for the database to be stored using utf16 encoding meant it used to be possible for apps to supply utf16 bytestrings for a utf16-encoded DB and have it just work. If Django (or the PySQLite interface it uses) has always assumed utf8, though, I don't think that would have ever worked....so don't mind me.
comment:10 by , 16 years ago
Owner: | removed |
---|---|
Triage Stage: | Accepted → Ready for checkin |
Thanks, Martin and Karen, for the feedback. I won't bikeshed this too much; let's just call it ready to be checked in, and whoever does that can give it another quick read at the time. Thanks for the work on this.
comment:11 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
The coding marker at the top of the file is (at least, was ... pre-2.6) telling Python to treat all data as UTF-8. So those non-ASCII strings were correctly converted to UTF-8 and thus were treated as UTF-8 encoded bytestrings, which are valid bytestrings and something we really do want to test. Django is meant to work in that mode, since a lot of things produce UTF-8 output.
I don't have time to devote brain cycles to this right at the moment, but I'll have a think about it. We might end up needing a bunch of special-casing for 2.6, since those tests have caught a lot of problems and provided a lot of support that we're doing the Right Thing(tm) over time.
Punishing people who use older Python versions like this, just because you want them to move to a newer version, is really unreasonable. Python devs used to be more professional.