Changes between Version 7 and Version 8 of StringEncoding

Timestamp:: Jun 1, 2007, 8:22:30 AM (18 years ago)
Author:: Malcolm Tredinnick
Comment:: Brought this page into line with the Unicode branch. All the previous problems have been solved, so removed that list.

StringEncoding

-              v7
+              v8
 = String Encoding In Django =
 '''Status''': (as of Friday, April 6, 2007) Still being written. -- mtredinnick
+'''Status''': (as of Friday, June 1, 2007) Documents the UnicodeBranch implementation. -- mtredinnick
 == Introduction ==
 …
 An important part of internationalization is how we handle strings inside Django and at the interfaces between Django and other applications.
 This page tries to capture both what we are aiming to do internally and what we are currently doing (which might be different from the eventual goal). It is partly an attempt to get my (Malcolm Tredinnick's) thoughts down in a logical form. This stuff is very tricky and it's easy to become confused when working on the code. That happens to me regularly, so I need notes like this.
+This page tries to capture what we are aiming to do internally. It is partly an attempt to get my (Malcolm Tredinnick's) thoughts down in a logical form. This stuff is very tricky and it's easy to become confused when working on the code. That happens to me regularly, so I need notes like this.
 == String Types In Python ==
 …
 HTML form submission is an area that has traditionally had poor browser compliance with standards and not particularly encompassing standards in the first place. So there are a lot of corner cases involved here. For the most part, though, we can get by with a few simple rules and a couple of conventions ("conventions" meaning that if you don't follow them, anything could, and probably will, happen).
+''(TO BE COMPLETED)''
+For simplicity and the sanity of developers (both core and third-party), Django adopts a fairly simple policy when it comes to interpreting form input. Form submissions are assumed to be in the DEFAULT_CHARSET setting. However, decoding of the input is only done when the GET and POST attributes on an !HttpRequest object are accessed and the decoding is not cached in any way. So the "encoding" attribute on the !HttpRequest instance can be changed and this will update the encoding that is assumed when interpreting GET and POST data. File uploads are never decoded in any way, since they are assumed to already be an arbitrary and opaque sequence of bytes.
 ''Notes:
  * This is one area where conversion from UTF-8 to unicode may fail. Malicious or accidental causes.
+ * This is one area where conversion from UTF-8 to unicode may fail, due to malicious or accidental causes. Any invalid input is handled using Python's "replace" codec error handling.
  * setting "accept" types on forms makes it the responsibility of the developer to handle. Don't guess.
 ''
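The decode-on-access behaviour described above can be illustrated with a minimal sketch. This is not Django's implementation: `SketchRequest` is a hypothetical stand-in for !HttpRequest, `DEFAULT_CHARSET` is a plain module constant rather than the real setting, and the code is written for Python 3, where `str` plays the role of the unicode type. It only shows the two ideas from the text: nothing is cached, and invalid input falls back to "replace" error handling.

```python
from urllib.parse import parse_qsl

DEFAULT_CHARSET = "utf-8"  # assumption: stand-in for the Django setting


class SketchRequest:
    """Hypothetical stand-in for HttpRequest, for illustration only."""

    def __init__(self, raw_query):
        self.raw_query = raw_query       # percent-encoded query string
        self.encoding = DEFAULT_CHARSET  # may be changed before GET is read

    @property
    def GET(self):
        # Decoded on every access and never cached, so changing
        # self.encoding affects the next read. Invalid byte sequences
        # are replaced (U+FFFD) instead of raising an error.
        pairs = parse_qsl(self.raw_query, encoding=self.encoding,
                          errors="replace")
        return dict(pairs)


req = SketchRequest("name=caf%E9")  # 0xE9 is "é" in Latin-1, invalid UTF-8
as_utf8 = req.GET["name"]           # contains a replacement character
req.encoding = "latin-1"
as_latin1 = req.GET["name"]         # decoded again with the new encoding
```

Reading `GET` twice with different `encoding` values gives two different results from the same raw data, which is the reason the text stresses that the decoding is not cached.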
 …
 The only potential problem here is when bytestrings are passed between Django and the developer's code. Once again, we have no way of knowing the encoding. Most of the time, the two parties should exchange unicode strings. When bytestrings are passed, we need to have a convention about how these strings are encoded. This is a case where we cannot enforce (at the Python level) any requirement. We can only say "here is what Django expects" and if a developer does not respect this, any errors are their own to deal with.
 '''New Convention:''' All bytestrings used inside the Django core are assumed to be UTF-8 encoded.
+'''Convention:''' All bytestrings used inside the Django core are assumed to be UTF-8 encoded.
 Bytestrings passed between the core and the applications should not be dependent on the encoding of the source files they were created in (those files using [http://www.python.org/dev/peps/pep-0263/ PEP 263] encoding declarations). PEP 263 does not do anything special to bytestrings. It parses them, but leaves them with their original encoding. That encoding information is lost as soon as the string moves beyond its original source file.
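As a rough illustration of why the convention matters, the sketch below shows the same text encoded two ways. It assumes Python 3 (`bytes` standing in for bytestrings, `str` for unicode), and `core_decode` is a hypothetical helper name, not a Django function. Once the bytes leave their source file, nothing records which encoding produced them, so code that assumes UTF-8 fails on anything else.

```python
text = "caf\u00e9"

utf8_bytes = text.encode("utf-8")      # what the convention expects
latin1_bytes = text.encode("latin-1")  # e.g. a literal from a Latin-1 source file


def core_decode(bytestring):
    """Hypothetical helper: decode a bytestring under the UTF-8 convention."""
    return bytestring.decode("utf-8")


# Following the convention round-trips cleanly.
round_tripped = core_decode(utf8_bytes)

# Ignoring it fails, because the bytes carry no encoding information.
try:
    core_decode(latin1_bytes)
except UnicodeDecodeError:
    convention_violated = True
else:
    convention_violated = False
```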
 …
 The application developer may not be in control of the basic database configuration. This may require help from a database administrator. Or the application may be built on top of a legacy database. Consequently, it is unreasonable to assume that Django can enforce a particular character set encoding on the database.
+Django will need to encode all strings it sends to the database with the right encoding method. Similarly, all incoming strings need to be decoded to unicode objects (keeping them as bytestrings will lose the information about the database encoding, which may not always be UTF-8).
+There is currently no way in Django to have the database encoding be any different from the HTML output encoding. We need to fix this. (This has been fixed in [4971]. -- ''mtredinnick'')
+'''Proposal:''' For databases that support table creation with different collation or encoding schemes, add support in the existing DATABASE_OPTIONS setting for these.
+This would be analogous to the current encoding support that is provided in DATABASE_OPTIONS for MySQL.
+Django encodes all strings it sends to the database with the right encoding method. Similarly, all incoming strings are decoded to unicode objects (keeping them as bytestrings would lose the information about the database encoding, which may not always be UTF-8).
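A minimal sketch of that boundary, under stated assumptions: `to_database` and `from_database` are hypothetical names rather than Django backend functions, the database encoding is passed in explicitly, and the code is Python 3 (`str` for unicode, `bytes` for bytestrings). Bytestrings handed in are treated as UTF-8, following the convention stated earlier on this page.

```python
def to_database(value, db_encoding):
    """Encode outgoing data using the database's encoding."""
    if isinstance(value, bytes):
        # By convention, bytestrings inside the core are UTF-8.
        value = value.decode("utf-8")
    return value.encode(db_encoding)


def from_database(raw, db_encoding):
    """Decode incoming data to text as soon as it arrives."""
    if isinstance(raw, bytes):
        return raw.decode(db_encoding)
    return raw  # already text


stored = to_database("caf\u00e9", "latin-1")
loaded = from_database(stored, "latin-1")
```

Decoding immediately on the way in means the rest of the code never needs to know which encoding the database uses.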
 == Talking To External Processes ==
 …
 There are a couple of fuzzy, middle-ground areas here. Automated email sending for 404 pages and other admin items is handled by Django. This is treated similarly to template output generation and settings.DEFAULT_CHARSET is used to encode the output.
-== Current Problems and Solution Outline ==
-''(This list is incomplete at the moment)''
- * Django does not currently handle arbitrary database encodings, unattached to the concept of DEFAULT_CHARSET.
-    * Add DEFAULT_CHARSET setting
-    * teach database backends how to encode unicode and bytestrings for the database (avoid pointless round-trips for bytestrings, if the target is UTF-8).
- * The gettext() functions return bytestrings using DEFAULT_CHARSET. This causes a number of difficulties in the code, because UTF-8 encoded bytestrings cannot safely be passed to the string.join() method.
-    * Consider switching to ugettext() and friends everywhere internally. These return unicode strings that can be used in join() calls. The alternative requires being aware of when bytestrings might be involved and doing yet another decode()/encode() round-trip around the join(). Error-prone and time-consuming.
- * gettext_lazy() is not a perfect proxy for a string. In particular, ''.join([gettext_lazy('some string'), foo]) does not work, because join() wants a real string instance as the first element.
-     * This might be fixable by using metaprogramming to make the returned result of gettext_lazy() look like an instance of its result classes. This could have unintended side-effects, though. I haven't tested this out yet.
-     * The alternative is to blow away gettext_lazy and do what other languages do: use gettext_noop() to mark strings and then put gettext() calls in at the presentation locations to do the translation.