Django should not use `force_unicode(..., errors='replace')` when parsing POST data.
|Reported by:||mrmachine||Owned by:||aaugustin|
|Severity:||Normal||Keywords:||post data unicode utf8 encode decode transaction aborted|
|Cc:||anssi.kaariainen@…, streeter||Triage Stage:||Unreviewed|
|Has patch:||yes||Needs documentation:||no|
|Needs tests:||no||Patch needs improvement:||no|
I ran into a problem that is no doubt an edge case, but which results in an extremely difficult bug to track down. At least it was for me.
First, some background on how I came across this issue and the steps I took to trace it.
I have a simple ModelForm for a Photo model that has an ImageField, so users can upload photos. In my view, the form.save() method is called, then some other fields are updated on the photo object and photo.save() is called, triggering a second save.
Intermittently, I get HTTP 500 error reports emailed to me with a DatabaseError: current transaction is aborted, commands ignored until end of transaction block error, triggered by the photo.save() call.
I know that this error indicates an earlier SQL statement has caused a database error, which has aborted the transaction.
Looking in my PostgreSQL log file, I see
2012-03-28 23:32:20 UTC ERROR: invalid byte sequence for encoding "UTF8": 0xea2020 2012-03-28 23:32:20 UTC HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". 2012-03-28 23:32:20 UTC ERROR: current transaction is aborted, commands ignored until end of transaction block
The next lines in the log are the SQL statements Django has tried to run *after* the transaction has been aborted, and they did not cause the error.
RhodiumToad in #postgresql on IRC tells me "iirc the statement doesn't actually get dumped if the error occurs when reading the statement itself from the client" and "there doesn't seem to be a way to convince pg to log the offending statement, but can you tell where it is in your code based on the surrounding statements?"
So I can't see the actual SQL statement (and the actual "byte sequence") that triggers the invalid byte sequence error, unless I log every single SQL statement and wait for this error to happen again by chance, then check the logs looking for statements executed immediately before the invalid byte sequence error.
This is going to be a big log file, since the error only occurs intermittently, and having to wait for it to trigger again is not very efficient.
I should mention at this point that the encoding for my database is UTF8, my HTML templates have <meta content="text/html; charset=utf-8" http-equiv="content-type">, and when I check my response headers, I see Content-Type: text/html; charset=utf-8.
So I hit the Django docs. They tell me that "If you pass a string to Django that has been encoded in some other format, things will go wrong in interesting ways. Usually, Django will raise a UnicodeDecodeError at some point." and that request.GET and request.POST are decoded using the DEFAULT_CHARSET (utf-8).
I look at the request.POST data in the HTTP 500 error report email, but I don't see anything but ASCII in there. It doesn't actually tell me anything about the uploaded file name or file data, though.
Since the docs say file data is not decoded, and only the file name is saved to the database, I suspect that someone, somewhere, is intermittently trying to upload a file which has a file name that has an encoding other than UTF8, and that Django also doesn't decode the file name.
Even though I think the browser should see that the HTML page is UTF8, and should encode it's POST data as UTF8, I guess in this obscure case it is not.
So I take a look at the FileUploadTests.test_unicode_file_name() test in regression_tests/file_uploads and try to create a similar test for a file name encoded with BIG5.
When I submit a request POST using the test client with a file object that has a BIG5 encoded filename, I noticed in my test view that request.FILES['file_big5'].name IS actually Unicode. So Django is decoding it, contrary to my previous suspicion.
But the test, request.FILES['file_big5'].name.endswith(BIG5_FILENAME) does NOT succeed. The file name has been mangled by Django. At this point I wonder how Django ends up with a mangled file name when it has decoded a BIG5 encoded file name as UTF8, when I get a UnicodeDecodeError when I do BIG5_FILENAME.decode('utf-8') in the shell.
I dig further, and find that django.http.multipartparser.MultiPartParser.parse() uses force_unicode() with errors='replace'for field names, field data, file names and file field names. I think I've found the problem.
For whatever reason, some browser out there is including a non-UTF8 encoded file name in the POST data. Django is decoding this with UTF8, and replacing errors instead of raising an exception. Then Django is re-encoding this with UTF8 and sending it to the database. The database sees that it is an invalid byte sequence for UTF8 encoding, and raises an error without logging it because "the error occurs when reading the statement itself from the client".
As you can see, this took me a while to work out. Django didn't do much to make this error easy to track down. The user still gets an HTTP 500 error, and I still get an HTTP 500 error report emailed to me, but it doesn't include useful information so that I can debug the problem.
I think it would be technically possible to create an HttpRequest subclass that uses a different encoding, but this is not really practical, and the whole issue here is that I can't know beforehand what encoding will be used by the browser.
Looking at the history, multipartparser.py was added by Jacob in  when Django's file upload capabilities were refactored, and it has behaved this way since all the way back then. I did not go back further any than this to see if there was a legitimate reason for using errors='replace' at the very beginning.
I don't think this behaviour is explicitly specified anywhere in the docs, and I think it could be classified as a bug and fixed without being backwards incompatible. I don't see how anyone could be relying on Django to mangle POST data.
I changed those four occurrences (field name, field data, file name, file field name) to errors='strict' and the full test suite still passes (at least in SQLite), so I'm not sure if this actually needs to be this way.
If not, I would like to see Django use force_unicode() with errors='strict' instead for POST field names, field data, file names and file field names. I would rather Django raise a UnicodeDecodeError when parsing incorrectly encoded POST data, if the data cannot be decoded with the specified (or default) encoding, than send developers down the rabbit hole trying to debug this.
At the very least, I would like to see Django *try* to decode it with errors='strict' first, and add a dirty flag to request.POST and request.FILES (indicating which field names, field data, file names or file field names are "dirty") if the data cannot be cleanly decoded, so that developers can choose how to handle it in their view, e.g. return a form validation error, instead of an HTTP 500 error.
Apologies for this essay-length ticket. Thank you for taking the time to read it!
Change History (20)
Changed 4 years ago by mrmachine
comment:1 Changed 4 years ago by mrmachine
- Has patch set
- Needs documentation unset
- Needs tests unset
- Patch needs improvement unset
comment:3 Changed 4 years ago by akaariai
- Cc anssi.kaariainen@… added
- Triage Stage changed from Unreviewed to Design decision needed
Changed 3 years ago by aaugustin
comment:14 Changed 3 years ago by aaugustin
- Triage Stage changed from Design decision needed to Unreviewed
comment:16 Changed 3 years ago by kmtracey
- Resolution set to needsinfo
- Status changed from assigned to closed