Unicode and Django
This page is an impact analysis of the Django source: it lists which parts need which changes if Django switches from using utf-8 bytestrings internally to using unicode strings throughout.
This is a first list of things that spring to mind; all of them need more thorough checking:
- database backends need to handle unicode vs. DATABASE_CHARSET translations
- special casing: the psycopg backend will need type handlers for string types (just as it already has type handlers for date/time types)
- the HttpResponse sending machinery needs to do the unicode to DEFAULT_CHARSET translation
- the HttpRequest creation process needs to turn outside strings into unicode strings, using the provided charset (if given) or defaulting to DEFAULT_CHARSET (as that is what was sent to the browser when the form was transmitted)
- There should be a way to access the original "raw" (as bytes) GET and POST data. Django already provides raw POST data using the raw_post_data attribute. Perhaps raw_get_data should also be added.
- Special casing: what happens with GET parameters? Those don't provide charsets, so what should we do if DEFAULT_CHARSET is utf-8 but the GET parameters aren't valid utf-8? The clean way would be to throw an exception (as in all other places, too).
- The current URI spec (RFC 3986) clearly states that all URIs must be encoded according to UTF-8 so we can assume that this is the case. If this causes a UnicodeDecodeError it makes sense to fall back on windows-1252 or latin-1. Has anyone taken a look at Mark Pilgrim's Universal Encoding Detector? - Noah Slater
- template loaders need to do DEFAULT_CHARSET to unicode translation
- internal usage of str() needs to be checked and presumably changed over to unicode() usage
- debugging stuff needs to use repr() on strings, not str() (or use unicode() and let the HTTP response handling stuff handle the conversion - most debugging stuff is working with the response machinery anyway)
- mail sending functions need to do the right thing with the MIME type
- we should decide whether to normalize the input unicode data so that at the database or application level we can match strings regardless of their decomposition (see the standard lib's unicodedata module with its normalize() function). I would go for NFC, if there's consensus around normalizing.
- Lazily evaluated method calls do not currently work with Unicode return values, see #1664. I have provided a potential workaround. - Noah Slater
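To make the GET parameter question above more concrete, here is a minimal sketch of the two options discussed: raising on invalid utf-8, or falling back to windows-1252 and then latin-1. It is not existing Django code. The function name decode_get_value and its strict flag are hypothetical, and it is written for Python 3, where str plays the role of the unicode type and bytes the role of a bytestring.

```python
from urllib.parse import unquote_to_bytes


def decode_get_value(quoted, strict=False):
    """Turn one percent-encoded query-string value into a text string.

    Hypothetical helper, for illustration only.
    strict=True  -> raise UnicodeDecodeError on invalid utf-8
    strict=False -> fall back to windows-1252, then latin-1
    """
    # In query strings "+" stands for a space; percent-escapes are then
    # resolved to raw bytes without assuming any charset.
    raw = unquote_to_bytes(quoted.replace("+", " "))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if strict:
            raise
    try:
        # windows-1252 leaves a few byte values undefined, so this can fail.
        return raw.decode("windows-1252")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so it never fails.
        return raw.decode("latin-1")
```

With strict=True the caller sees the exception, which matches the "clean way" above; the default is the lenient fallback. Note that the fallback can silently produce wrong characters when the client actually used some other charset.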
Please either complete the above list or add headlines with more detailed discussions of the points above. Please only post results here; discussion should take place on the django-developers list.
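As a small illustration of the normalization point in the list, the sketch below shows why two visually identical strings can fail to match, and how NFC makes them compare equal. The helper name normalize_input is hypothetical; unicodedata.normalize() is the standard library function mentioned above. It is written for Python 3, where str is the unicode type.

```python
import unicodedata


def normalize_input(text, form="NFC"):
    """Return text in the given Unicode normalization form (NFC by default).

    Hypothetical helper; where Django would call something like this
    (request parsing, model fields, the database backend) is an open question.
    """
    return unicodedata.normalize(form, text)


# "e" followed by a combining acute accent (decomposed form) ...
decomposed = "e\u0301"
# ... versus the single precomposed code point for the same character.
precomposed = "\u00e9"

# The raw strings differ, so a naive comparison or database lookup misses.
raw_match = decomposed == precomposed
# After NFC normalization both sides are the same sequence of code points.
nfc_match = normalize_input(decomposed) == normalize_input(precomposed)
```

Here raw_match is False and nfc_match is True. Normalizing once at the input boundary would keep the comparison cost out of every lookup, but that is a design choice still to be agreed on.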