Opened 13 years ago

Closed 12 years ago

Last modified 12 years ago

#18004 closed Bug (needsinfo)

Django should not use `force_unicode(..., errors='replace')` when parsing POST data.

Reported by: Tai Lee Owned by: Aymeric Augustin
Component: HTTP handling Version: dev
Severity: Normal Keywords: post data unicode utf8 encode decode transaction aborted
Cc: anssi.kaariainen@…, Chris Streeter Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I ran into a problem that is no doubt an edge case, but which results in an extremely difficult bug to track down. At least it was for me.

First, some background on how I came across this issue and the steps I took to trace it.

I have a simple ModelForm for a Photo model that has an ImageField, so users can upload photos. In my view, the form.save() method is called, then some other fields are updated on the photo object and photo.save() is called, triggering a second save.

Intermittently, I get HTTP 500 error reports emailed to me with a DatabaseError: current transaction is aborted, commands ignored until end of transaction block error, triggered by the photo.save() call.

I know that this error indicates an earlier SQL statement has caused a database error, which has aborted the transaction.

Looking in my PostgreSQL log file, I see

2012-03-28 23:32:20 UTC ERROR:  invalid byte sequence for encoding "UTF8": 0xea2020
2012-03-28 23:32:20 UTC HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
2012-03-28 23:32:20 UTC ERROR:  current transaction is aborted, commands ignored until end of transaction block

The next lines in the log are the SQL statements Django has tried to run *after* the transaction has been aborted, and they did not cause the error.

RhodiumToad in #postgresql on IRC tells me "iirc the statement doesn't actually get dumped if the error occurs when reading the statement itself from the client" and "there doesn't seem to be a way to convince pg to log the offending statement, but can you tell where it is in your code based on the surrounding statements?"

So I can't see the actual SQL statement (and the actual "byte sequence") that triggers the invalid byte sequence error, unless I log every single SQL statement and wait for this error to happen again by chance, then check the logs looking for statements executed immediately before the invalid byte sequence error.

This is going to be a big log file, since the error only occurs intermittently, and having to wait for it to trigger again is not very efficient.

I should mention at this point that the encoding for my database is UTF8, my HTML templates have <meta content="text/html; charset=utf-8" http-equiv="content-type">, and when I check my response headers, I see Content-Type: text/html; charset=utf-8.

So I hit the Django docs. They tell me that "If you pass a string to Django that has been encoded in some other format, things will go wrong in interesting ways. Usually, Django will raise a UnicodeDecodeError at some point." and that request.GET and request.POST are decoded using the DEFAULT_CHARSET (utf-8).

I look at the request.POST data in the HTTP 500 error report email, but I don't see anything but ASCII in there. It doesn't actually tell me anything about the uploaded file name or file data, though.

Since the docs say file data is not decoded, and only the file name is saved to the database, I suspect that someone, somewhere, is intermittently trying to upload a file which has a file name that has an encoding other than UTF8, and that Django also doesn't decode the file name.

Even though I think the browser should see that the HTML page is UTF8, and should encode it's POST data as UTF8, I guess in this obscure case it is not.

So I take a look at the FileUploadTests.test_unicode_file_name() test in regression_tests/file_uploads and try to create a similar test for a file name encoded with BIG5.

When I submit a request POST using the test client with a file object that has a BIG5 encoded filename, I noticed in my test view that request.FILES['file_big5'].name IS actually Unicode. So Django is decoding it, contrary to my previous suspicion.

But the test, request.FILES['file_big5'].name.endswith(BIG5_FILENAME) does NOT succeed. The file name has been mangled by Django. At this point I wonder how Django ends up with a mangled file name when it has decoded a BIG5 encoded file name as UTF8, when I get a UnicodeDecodeError when I do BIG5_FILENAME.decode('utf-8') in the shell.

I dig further, and find that django.http.multipartparser.MultiPartParser.parse() uses force_unicode() with errors='replace'for field names, field data, file names and file field names. I think I've found the problem.

For whatever reason, some browser out there is including a non-UTF8 encoded file name in the POST data. Django is decoding this with UTF8, and replacing errors instead of raising an exception. Then Django is re-encoding this with UTF8 and sending it to the database. The database sees that it is an invalid byte sequence for UTF8 encoding, and raises an error without logging it because "the error occurs when reading the statement itself from the client".

As you can see, this took me a while to work out. Django didn't do much to make this error easy to track down. The user still gets an HTTP 500 error, and I still get an HTTP 500 error report emailed to me, but it doesn't include useful information so that I can debug the problem.

I think it would be technically possible to create an HttpRequest subclass that uses a different encoding, but this is not really practical, and the whole issue here is that I can't know beforehand what encoding will be used by the browser.

Looking at the history, multipartparser.py was added by Jacob in [7814] when Django's file upload capabilities were refactored, and it has behaved this way since all the way back then. I did not go back further any than this to see if there was a legitimate reason for using errors='replace' at the very beginning.

I don't think this behaviour is explicitly specified anywhere in the docs, and I think it could be classified as a bug and fixed without being backwards incompatible. I don't see how anyone could be relying on Django to mangle POST data.

I changed those four occurrences (field name, field data, file name, file field name) to errors='strict' and the full test suite still passes (at least in SQLite), so I'm not sure if this actually needs to be this way.

If not, I would like to see Django use force_unicode() with errors='strict' instead for POST field names, field data, file names and file field names. I would rather Django raise a UnicodeDecodeError when parsing incorrectly encoded POST data, if the data cannot be decoded with the specified (or default) encoding, than send developers down the rabbit hole trying to debug this.

At the very least, I would like to see Django *try* to decode it with errors='strict' first, and add a dirty flag to request.POST and request.FILES (indicating which field names, field data, file names or file field names are "dirty") if the data cannot be cleanly decoded, so that developers can choose how to handle it in their view, e.g. return a form validation error, instead of an HTTP 500 error.

Apologies for this essay-length ticket. Thank you for taking the time to read it!

Attachments (2)

18004-multipartparser-r17813.diff (3.9 KB ) - added by Tai Lee 13 years ago.
18004-strict-parsing-in-querydict.diff (2.0 KB ) - added by Aymeric Augustin 12 years ago.

Download all attachments as: .zip

Change History (20)

by Tai Lee, 13 years ago

comment:1 by Tai Lee, 13 years ago

Has patch: set

Attached a patch with tests for my preferred fix, to allow UnicodeDecodeError exceptions to be raised when parsing POST data if it is incorrectly encoded.

Let me know if the alternative is preferred, where request.POST and request.FILES could be marked as "dirty" if they cannot be decoded cleanly, but still allow them to be decoded with errors='replace' so that the developer can decide what to do, if they are impacted by this issue.

comment:2 by Tai Lee, 13 years ago

Jezdez recommended a discussion on django-developers would be appropriate for this issue.

https://groups.google.com/d/topic/django-developers/EAFhwLh3C5c/discussion

comment:3 by Anssi Kääriäinen, 13 years ago

Cc: anssi.kaariainen@… added
Triage Stage: UnreviewedDesign decision needed

I am not too experienced with the HTTP protocol and encoding issues there. But, I wonder how the string ends up being invalid UTF8 when it seems it is force_unicoded? Shouldn't it be valid UTF8 after that?

I am afraid that errors='strict' could result in UnicodeDecodeErrors for users who currently have none.

comment:4 by Tai Lee, 13 years ago

I'm not an encoding expert, either. But it seems that it is forced to Unicode, which mangles the actual data, and although it can still be re-encoded as UTF8, it will be different to the original string and PostgreSQL somehow manages to detect an "invalid byte sequence for encoding: UTF8", at least in some cases.

I don't know if it would happen every time non-UTF8 POST data is mangled, or only with some specific strings that PostgreSQL is able to detect as an "invalid byte sequence". I suspect the latter.

I still haven't seen the actual SQL statement that triggered these errors on my production server, because I am still waiting for this lucky user to try their upload again and trigger the error while I am logging every single SQL statement.

I think that the only time this would result in a UnicodeDecodeError exception when there previously wasn't one, is when a DatabaseError ("current transaction is aborted") exception is raised instead, OR mangled data is silently saved to the database.

comment:5 by Tai Lee, 13 years ago

My theory is that although we cannot reliably detect encoding from a bytestring, we can at least loudly complain when we *know* a bytestring doesn't match the specified encoding. This should be preferable to silently mangling data.

comment:6 by Chris Streeter, 13 years ago

Cc: Chris Streeter added

comment:7 by Tai Lee, 13 years ago

I enabled log_statement = all in my PostgreSQL config since yesterday. About an hour ago the same user (I think, by looking at the other data provided in request.POST) tripped the error again. I checked the PostgreSQL log, but I still don't see the offending statement.

After further discussion in #postgresql, and finding this Stack Overflow post, I understand what RhodiumToad was telling me yesterday. PostgreSQL can't even read the statement as it comes in from the client, so PostgreSQL can't (or won't) log it.

http://stackoverflow.com/questions/2327239/postgresql-log-query-only-on-error/2329605#2329605

Even if I had the offending raw POST data though, there's nothing I could do to stop users from submitting incorrectly encoded POST data in the future, and I would rather be able to detect this and return a meaningful message to the user than allow an HTTP 500 or mangling their data.

comment:8 by Tai Lee, 12 years ago

I've updated this patch on a branch at GitHub to apply cleanly.

https://github.com/thirstydigital/django/tree/tickets/18004-multipartparser-force_unicode

Can we get a BDFL decision here?

Would people prefer to see an alternative patch where Django *tries* to decode it with errors='strict' first, and adds a dirty flag to request.POST and request.FILES (indicating which field names, field data, file names or file field names are "dirty") if the data cannot be cleanly decoded, so that developers can choose how to handle it in their view, e.g. return a form validation error, instead of an HTTP 500 error?

comment:9 by Aymeric Augustin, 12 years ago

Owner: changed from nobody to Aymeric Augustin

comment:10 by Claude Paroz, 12 years ago

We should not forget to address the same issue in QueryDict initialization.

by Aymeric Augustin, 12 years ago

comment:11 by Aymeric Augustin, 12 years ago

All this ticket is based on the premise that browsers send "unencoded" filenames -- ie. filenames as stored on disk, as a sequence of bytes.

If that's true, it's certainly documented somewhere. Can you look for some documentation or specification of this behavior?

(This is hard for me to reproduce as I don't own any systems with a non-utf-8 filesystem.)

Version 0, edited 12 years ago by Aymeric Augustin (next)

comment:12 by Aymeric Augustin, 12 years ago

Status: newassigned

comment:13 by Tai Lee, 12 years ago

I'm not sure how relevant this is, but I found a reference online to IE being buggy with regard to encoding filenames in multipart/form data.

http://lists.evolt.org/archive/Week-of-Mon-20100524/130378.html

The MSDN link in that post seems to be dead now, but the author mentions that it said IE would guess when not specified.

The linked w3schools page indicates that the default value for the <form accept-charset="UNKNOWN"> attribute is UNKNOWN, which means the encoding equals the encoding of the document that contains the form.

http://www.w3schools.com/tags/att_form_accept_charset.asp

The linked sitepoint page says that all versions of IE (only goes up to 8) are buggy, and that IE will happily send data encoded as Windows-1252 if the attribute is specified as ISO-8859-1. Not sure if this is just one example of buggy behaviour, or the only example.

http://reference.sitepoint.com/html/form/accept-charset

aaugustin, do you have any objection to the proposal of attempting to decode strictly, and marking the data as dirty if strict encoding fails and falling back to the current behaviour, so that developers can choose how to deal with this either in their own view code or middleware?

Including the raw filename data in HTTP 500 error emails would help, too.

Last edited 12 years ago by Tai Lee (previous) (diff)

comment:14 by Aymeric Augustin, 12 years ago

Triage Stage: Design decision neededUnreviewed

Yes, I have strong objections to your proposal: I'm not going to add a workaround for a problem that we haven't identified yet.

All of the above is vague, and we still don't know how to trigger this error.


I did the research, and RFC 1867 says that file names must be encoded:

The client application should make best
effort to supply the file name; if the file name of the client's
operating system is not in US-ASCII, the file name might be
approximated or encoded using the method of RFC 1522.


This is repeated in section 5.11 - Non-ASCII field names:

Note that mime headers are generally required to consist only of 7-
bit data in the US-ASCII character set. Hence field names should be
encoded according to the prescriptions of RFC 1522 if they contain
characters outside of that set. In HTML 2.0, the default character
set is ISO-8859-1, but non-ASCII characters in field names should be
encoded.


RFC 1522 describes mime-encoding, and this encoding explicitly includes the charset.


Note that you're the only person to have ever hit this problem; for all I know this could be a bug in your code. The only way to be sure is to log a request, and figure out why Django can't parse it.

Here's what I would suggest: in the problematic view, catch the DatabaseError, and when it occurs, dump request.body in a file in binary mode. Once we have this file, we can figure out why Django ends up with invalid utf8 data.

comment:15 by Aymeric Augustin, 12 years ago

(I moved the ticket back to unreviwed because there isn't enough info yet to accept it, even though it was opened long ago.)

comment:16 by Karen Tracey, 12 years ago

Resolution: needsinfo
Status: assignedclosed

Indeed, we need some way to recreate and understand this issue before attempting any fix. I'm not at all sure, based on this description, that errors=replace is causing this. The "invalid byte sequence" error message noted in the description identifies the invalid byte sequence as 0xea2020. That is nothing at all like the expected utf-8 encoding of the unicode replacement character (which would be 0xefbfbd). The whole point of errors=replace is to take a bytestring we cannot figure out how to correctly decode and turn it into a valid unicode string, possibly containing the unicode replacement character if in fact the input bytestring is "broken". The unicode replacement character itself is a perfectly valid unicode character with a utf-8 encoding: for it to trigger a database error would indicate a bug in the database.

The question is how the byte string 0xea2020 got stuffed onto the postgres connection -- this, as logged, is not a valid utf-8 byte sequence and therefore should never have been sent on the DB connection from Django. But without some way for us to trigger that happening I don't see how we can identify how it happened. Therefore closing this needsinfo since I don't see we can do anything to address this problem without a recreation scenario.

comment:17 by Karen Tracey, 12 years ago

In fact one way to trigger this error is to pass the invalid bytestring in as a raw query, for example:

>>> badq = 'SELECT * from "auth_user" WHERE "auth_user"."username" = \'\xea\x20\x20\''
>>> from django.db import connection
>>> cursor = connection.cursor()
>>> cursor.execute(badq)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/kmtracey/.virtualenvs/abc/local/lib/python2.7/site-packages/django/db/backends/util.py", line 34, in execute
    return self.cursor.execute(sql, params)
  File "/home/kmtracey/.virtualenvs/abc/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 44, in execute
    return self.cursor.execute(query, args)
DatabaseError: invalid byte sequence for encoding "UTF8": 0xea2020

If instead you take the bytestring, decode it with errors=replace, and re-encode it as utf-8, the DB is fine with it (once you clear the error state on the transaction):

>>> cursor.execute(badq.decode('utf-8', errors='replace').encode('utf-8'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/kmtracey/.virtualenvs/abc/local/lib/python2.7/site-packages/django/db/backends/util.py", line 34, in execute
    return self.cursor.execute(sql, params)
  File "/home/kmtracey/.virtualenvs/abc/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 44, in execute
    return self.cursor.execute(query, args)
DatabaseError: current transaction is aborted, commands ignored until end of transaction block

>>> connection._rollback()
>>> cursor.execute(badq.decode('utf-8', errors='replace').encode('utf-8'))
>>> 

That's essentially what the errors=replace on decoding post data is doing: ensuring that internally we work with unicode that can be encoded in utf-8 when it needs to be rather than internally carrying around who-knows-what-encoded bytestrings that may or may not be safe to pass on to other subsystems. How the bytestring 0xea2020 is getting stuffed in the postgres connection is the question that needs to be answered first for this ticket -- I'm not at all convinced that is being caused by errors=replace on decode of post data.

comment:18 by Tai Lee, 12 years ago

Thanks for the feedback. I will attempt to log the request body as you suggested, and hopefully narrow down the cause of these errors.

However, while I understand the need to force incoming data to unicode instead of carrying around who-knows-what-encoded bytestrings, I am still not sure we should be doing so with errors='replace', even if that turns out to not be the cause of my particular problem. It seems like the current code is itself a work-around to a problem we don't know exists (and expect not to exist). A work-around that silently alters and corrupts incoming data.

Note: See TracTickets for help on using tickets.
Back to Top