﻿id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
18004	Django should not use `force_unicode(..., errors='replace')` when parsing POST data.	Tai Lee	Aymeric Augustin	"I ran into a problem that is no doubt an edge case, but which results in an extremely difficult bug to track down. At least it was for me.

First, some background on how I came across this issue and the steps I took to trace it.

I have a simple `ModelForm` for a `Photo` model that has an `ImageField`, so users can upload photos. In my view, the `form.save()` method is called, then some other fields are updated on the photo object and `photo.save()` is called, triggering a second save.

Intermittently, I get HTTP 500 error reports emailed to me with a `DatabaseError: current transaction is aborted, commands ignored until end of transaction block` error, triggered by the `photo.save()` call.

I know that this error indicates an earlier SQL statement has caused a database error, which has aborted the transaction.

Looking in my PostgreSQL log file, I see

{{{
2012-03-28 23:32:20 UTC ERROR:  invalid byte sequence for encoding ""UTF8"": 0xea2020
2012-03-28 23:32:20 UTC HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by ""client_encoding"".
2012-03-28 23:32:20 UTC ERROR:  current transaction is aborted, commands ignored until end of transaction block
}}}

The next lines in the log are the SQL statements Django has tried to run *after* the transaction has been aborted, and they did not cause the error.

`RhodiumToad` in #postgresql on IRC tells me ""iirc the statement doesn't actually get dumped if the error occurs when reading the statement itself from the client"" and ""there doesn't seem to be a way to convince pg to log the offending statement, but can you tell where it is in your code based on the surrounding statements?""

So I can't see the actual SQL statement (and the actual ""byte sequence"") that triggers the `invalid byte sequence` error, unless I log every single SQL statement and wait for this error to happen again by chance, then check the logs looking for statements executed immediately before the `invalid byte sequence` error.

This is going to be a big log file, since the error only occurs intermittently, and having to wait for it to trigger again is not very efficient.

I should mention at this point that the encoding for my database is UTF8, my HTML templates have `<meta content=""text/html; charset=utf-8"" http-equiv=""content-type"">`, and when I check my response headers, I see `Content-Type: text/html; charset=utf-8`.

So I hit the Django docs. They tell me that ""If you pass a string to Django that has been encoded in some other format, things will go wrong in interesting ways. Usually, Django will raise a `UnicodeDecodeError` at some point."" and that `request.GET` and `request.POST` are decoded using the `DEFAULT_CHARSET` (`utf-8`).

I look at the `request.POST` data in the HTTP 500 error report email, but I don't see anything but ASCII in there. It doesn't actually tell me anything about the uploaded file name or file data, though.

Since the docs say file data is not decoded, and only the file name is saved to the database, I suspect that someone, somewhere, is intermittently trying to upload a file which has a file name that has an encoding other than UTF8, and that Django also doesn't decode the file name.

Even though I think the browser should see that the HTML page is UTF8, and should encode it's POST data as UTF8, I guess in this obscure case it is not.

So I take a look at the `FileUploadTests.test_unicode_file_name()` test in `regression_tests/file_uploads` and try to create a similar test for a file name encoded with BIG5.

When I submit a request POST using the test client with a file object that has a BIG5 encoded filename, I noticed in my test view that `request.FILES['file_big5'].name` IS actually Unicode. So Django is decoding it, contrary to my previous suspicion.

But the test, `request.FILES['file_big5'].name.endswith(BIG5_FILENAME)` does NOT succeed. The file name has been mangled by Django. At this point I wonder how Django ends up with a mangled file name when it has decoded a BIG5 encoded file name as UTF8, when I get a `UnicodeDecodeError` when I do `BIG5_FILENAME.decode('utf-8')` in the shell.

I dig further, and find that `django.http.multipartparser.MultiPartParser.parse()` uses `force_unicode()` with `errors='replace'`for field names, field data, file names and file field names. I think I've found the problem.

For whatever reason, some browser out there is including a non-UTF8 encoded file name in the POST data. Django is decoding this with UTF8, and replacing errors instead of raising an exception. Then Django is re-encoding this with UTF8 and sending it to the database. The database sees that it is an invalid byte sequence for UTF8 encoding, and raises an error without logging it because ""the error occurs when reading the statement itself from the client"".

As you can see, this took me a while to work out. Django didn't do much to make this error easy to track down. The user still gets an HTTP 500 error, and I still get an HTTP 500 error report emailed to me, but it doesn't include useful information so that I can debug the problem.

I think it would be technically possible to create an `HttpRequest` subclass that uses a different encoding, but this is not really practical, and the whole issue here is that I can't know beforehand what encoding will be used by the browser.

Looking at the history, `multipartparser.py` was added by Jacob in [7814] when Django's file upload capabilities were refactored, and it has behaved this way since all the way back then. I did not go back further any than this to see if there was a legitimate reason for using `errors='replace'` at the very beginning.

I don't think this behaviour is explicitly specified anywhere in the docs, and I think it could be classified as a bug and fixed without being backwards incompatible. I don't see how anyone could be relying on Django to mangle POST data.

I changed those four occurrences (field name, field data, file name, file field name) to `errors='strict'` and the full test suite still passes (at least in SQLite), so I'm not sure if this actually needs to be this way.

If not, I would like to see Django use `force_unicode()` with `errors='strict'` instead for POST field names, field data, file names and file field names. I would rather Django raise a `UnicodeDecodeError` when parsing incorrectly encoded POST data, if the data cannot be decoded with the specified (or default) encoding, than send developers down the rabbit hole trying to debug this.

At the very least, I would like to see Django *try* to decode it with `errors='strict'` first, and add a `dirty` flag to `request.POST` and `request.FILES` (indicating which field names, field data, file names or file field names are ""dirty"") if the data cannot be cleanly decoded, so that developers can choose how to handle it in their view, e.g. return a form validation error, instead of an HTTP 500 error.

Apologies for this essay-length ticket. Thank you for taking the time to read it!
"	Bug	closed	HTTP handling	dev	Normal	needsinfo	post data unicode utf8 encode decode transaction aborted	anssi.kaariainen@… Chris Streeter	Unreviewed	1	0	0	0	0	0
