Changes between Version 3 and Version 4 of StringEncoding

04/06/2007 05:24:14 AM (11 years ago)
Malcolm Tredinnick

Fleshed out most of the sections


  • StringEncoding

    v3 v4  
    11= String Encoding In Django =
     3'''Status''': (as of Friday, April 6, 2007) Still being written. -- mtredinnick
    35== Introduction ==
    911This page tries to capture both what we are aiming to do internally and what we are currently doing (which might be different from the eventual goal). It is partly an attempt to get my (Malcolm Tredinnick's) thoughts down in a logical form. This stuff is very tricky and it's easy to become confused when working on the code. That happens to me regularly, so I need notes like this.
     13== String Types In Python ==
     15Python essentially has two different types of strings: bytestrings, which are sequences of 8-bit bytes that can sometimes be printed, and unicode strings, which hold true Unicode character points in a portable fashion. Unicode strings carry no particular byte encoding, but can be converted to any encoding we like, using the ''encode()'' method, possibly with the aid of the ''codecs'' module in the standard library.
     17Applications intending to be of use to an international (or even merely non-ASCII) audience are recommended to use unicode strings internally wherever possible. Communication with the outside world (e.g. web browsers) usually involves sending and receiving bytestrings (or streams of data that can be converted to bytestrings) and carrying around information about the encoding of those bytestrings alongside the string. This shows the logic behind using unicode strings wherever possible: without an extra piece of information indicating the encoding, a bytestring is an opaque data item. We have no way of knowing whether the bytes it contains are encoded as UTF-8 or KOI8-R or anything else. So we either have to invent a new object representing a bytestring + encoding or convert to the existing Python solution: unicode strings, which are encoding-neutral.
     19A lot of this page is concerned with identifying where we cannot avoid bytestrings and how to work out their encoding at that point and then converting them to unicode immediately for internal use.
    1121== The Big Picture ==
    1525Each arrow in this diagram shows an interface between the core Django framework (henceforth, ''Django'') and applications which interact with Django. The arrow heads show the direction of the string exchanges.
    2131== HTTP Input Handling ==
     33Django's views should be able to handle any data that can be submitted over an HTTP connection. This includes, but is not limited to
     35 * HTML form submissions (both ''application/x-ww-form-urlencoded'' and ''multipart/form-data'' MIME types).
     36 * XML data, such as Atom Publishing Protocol, other REST-based protocol submissions or even WS-*/SOAP and XML-RPC-based submissions.
     37 * Abitrary binary data encoded with a known MIME type.
     39Of these submission types, the first one is the one needing the most attention from Django. Arbitrary data submission, which includes the latter two scenarios, need to be handled by the view functions and presumably those view functions will know how to decode the incoming message.
     41The main thing we need to concerned with from a string encoding perspective is understanding the encoding of the incoming data and the converting it to a known format for subsequent processing.
    2343=== Form Input Handing ===
     45HTML form submission is an area that has traditionally had poor browser compliance with standards and not particularly encompassing standards in the first place. So there are a lot of corner cases involved here. For the most part, though, we can get by with a few simple rules and a couple of conventions ("conventions" meaning that if you don't follow them, anything could, and probably will, happen).
     47''(TO BE COMPLETED)''
     50 * This is one area where conversion from UTF-8 to unicode may fail. Malicious or accidental causes.
     51 * setting "accept" types on forms makes it the responsibility of the developer to handle. Don't guess.
    2554== Passing Strings Between Django and the Developer's Code ==
     56(By ''Developer's code'', I mean any code that is part of an application, but not in Django's core. This includes views, model definitions, URL configuration and templates. Anything the Django user is responsible for writing.)
     58The only potential problem here is when bytestrings are passed between Django and the developer's code. Once again, we have no way of knowing the encoding. Most of the time, the two parties should exchange unicode strings. When bytestrings are passed, we need to have a convention about how these strings are encoded. This is a case where we cannot enforce (at the Python level) any requirement. We can only say "here is what Django expects" and if a developer does not respect this, any errors are their own to deal with.
     60'''Proposal:''' All bytestrings used inside the Django core are assumed to be UTF-8 encoded.
     62Bytestrings passed between the core and the applications should not be dependent on the encoding of the source files they were created in (those files using [ PEP 263] encoding declarations). PEP 263 does not do anything special to bytestrings. It parses them, but leaves them with their original encoding. That encoding information is lost as soon as the string moves beyond its original source file.
     64Similarly, we should not make assumptions based on the value of settings.DEFAULT_CHARSET. That setting is for the default output character encoding used by Django. It is designed to be changeable by the application user. If the internal code depended on this setting, the developer would need to change the code whenever the setting changed and two applications written by two developers with different default assumptions would not interact correctly.
    2766== Database Interaction ==
     68The application developer may not be in control of the basic database configuration. This may require help from a database administrator. Or the application may be built on top of a legacy database. Consequently, it is unreasonable to assume that Django can enforce a particular character set encoding on the database.
     70Django will need to encode all strings it sends to the database with the right encoding method. Similarly, all incoming strings needs to be decoded to unicode objects (keeping them as bytestrings will lose the information about the database encoding, which may not always be UTF-8).
     72'''Proposal:''' Add a new setting to django called DATABASE_CHARSET which is used to database encoding and decoding. This setting contains a string that is understood by Python's ''codecs'' module, not necessarily by the database server.
     74There is currently no way in Django to have the database encoding be any different from the HTML output encoding. We need to fix this.
     77'''Proposal:''' For databases that support table creation with different collation or encoding schemes, add support in the existing DATABASE_OPTIONS setting for these.
     79This would be analagous to the current encoding support that is provided in DATABASE_OPTIONS for !MySQL.
    2981== Talking To External Processes ==
     83A Django developer's code may need to talk to external services or applications to retrieve necessary data.
     85It is entirely outside the realm of Django to handle the character set encoding and conversion issues here. Rather, the application developer is responsible for ensuring that any strings used by Django are either unicode strings or bytestrings encoded as UTF-8.
    3187== HTTP Output ==
    33 === Template Output ===
     89As with input, Django's output possibilities are essentially limitless. However, we can loosely group them into two categories:
     91 * Template generated output
     92 * Arbitrary data sent back by the view.
     94These two cases could also be called "things we care about" and "things we don't". In the first case, Django itself is responsible for encoding the output in the correct way and passing the HTTP handler a bytestring that can be returned. This conversion will be done using the encoding specified by settings.DEFAULT_CHARSET.
     96In the second scenario, the developer (of the view) is entirely responsible for encoding the bytestring correctly and setting the right MIME types and, consequently, charset-encoding value for the returned result.
     98There are a couple of fuzzy, middle-ground areas here. Automated email sending for 404 pages and other admin items is handled by Django. This is treated similarly to template output generation and settings.DEFAULT_CHARSET is used to encode the output.
    35100== Current Problems and Solution Outline ==
     102''(This list is incomplete at the moment)''
     104 * Django does not currently handle arbitrary database encodings, unattached to the concept of DEFAULT_CHARSET.
     105    + Add DEFAULT_CHARSET setting
     106    + teach database backends how to encode unicode and bytestrings for the database (avoid pointless round-trips for bytestrings, if the target is UTF-8).
     107 * The getttext() functions return bytestrings using DEFAULT_CHARSET. This causes a number of difficulties in the code, because UTF-8 encoded bytestrings cannot safely be passed to the string.join() method.
     108    + Consider switching to ugettext() and friends everywhere internally. These return unicode strings that can be used in join() calls. The alternative requires being aware of when bytestrings might be involved and doing yet another decode()/encode() round-trip around the join(). Error-prone and time-consuming.
Back to Top