Version 18 (modified by Malcolm Tredinnick, 17 years ago)

Updated status


The unicode branch

This branch aims to make Django's internals fully Unicode-aware.

How to get the branch

svn co http://code.djangoproject.com/svn/django/branches/unicode/

See our branch policy for full information on how to use a branch.

Goals

The main goals of this branch are:

  • Make it easier for developers to work with non-ASCII character data when working with Django.
  • Be more consistent in our string handling behaviour inside Django (see StringEncoding for details on this).

Upon completion, you will be able to pass around unicode strings anywhere inside Django (or between Django and developer applications).

Note that we are not trying to force everybody to use only unicode strings. You will also be able to pass around bytestrings, and Django will assume they are UTF-8 encoded (we have to make an assumption because there is no way to tell what the encoding is otherwise). This means that a large chunk of existing code that uses Django will continue to work unchanged.
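
To see why some assumption is unavoidable, consider that the same bytes decode to different text under different encodings. The following is a minimal sketch, not Django code. It is written for a current Python interpreter, where bytes plays the role of the bytestring and str the role of the unicode string; the variable names are illustrative only.

```python
# The same two bytes, interpreted under two different encodings.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character (e with acute accent)
as_latin1 = raw.decode("latin-1")  # two characters (mojibake)

# Nothing in the bytes themselves says which reading was intended,
# so a single convention (UTF-8 here) has to be assumed.
print(len(as_utf8), len(as_latin1))
```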

Status

The branch was created on April 7, 2007.

Reported bugs from trunk that have been fixed and will be closed once the branch is merged back into trunk can be viewed here.

Todo Items

The various pieces will be converted in roughly the following order:

  1. Template rendering (Done in [4971])
  2. Database I/O (Done in [4971] for postgresql, postgresql_psycopg2, mysql, mysql_old and sqlite backends)
    • Needs testing for servers/tables that are not in UTF-8 or ASCII encoding. The theory is that the client connection for each backend should be automatically converting everything to UTF-8 or Unicode objects (depends on backend), but this needs verifying. (Ivan Sagalaev and Malcolm have tested this feature a fair bit with various database servers, but more stress tests would be nice.)
  3. Model class support (Done in [5057])
  4. Form input encoding (Done in [5192])
  5. Other output methods:
    • syndication (Done in [5251])
    • serialization (Done in [5248])
    • Google sitemaps (Done in [5277])
  6. Audit other contrib modules (not all will require changes) (all Done as of [5274]).
  7. Verifying and fixing all bugs mentioned in Trac.
    • This includes the 9 tickets that were merged in #2489.

We also need to look at the i18n support functions (in django.utils.translation):

  • Decide on usage of gettext() versus ugettext() in a number of places (Done. Recommended to use ugettext() and friends everywhere. In [5230] this change has been made throughout the framework).
  • Look at rewriting gettext_lazy() so that it acts as a better string and unicode proxy.

Finally, some documentation needs to be written describing good practices for creating unicode-aware Django apps.

Porting Applications (The Quick Checklist)

One of the design goals of the Unicode branch is that very few significant changes to existing third-party code should be required. However, there are some things that developers should be aware of when writing applications designed to handle international input.

A detailed list of things you might wish to think about when writing your code is given below. However, for the programmer on a deadline, here is the cheatsheet version (if you only use ASCII strings, none of these changes are necessary):

  1. Change the __str__ methods on your models to be __unicode__ methods. Just change the name. Usually, nothing else will be needed.
  2. Look for any str() calls in your code that operate on model fields. These should almost always be changed to smart_unicode() calls (which is imported from django.utils.encoding).
  3. Use the unicode versions of the django.utils.translation.* functions. Replace gettext and ngettext with ugettext and ungettext respectively. There are also ugettext_lazy and ungettext_lazy functions if you use the lazy versions.
  4. Make sure your database can store all the data you will send to it. Usually, this means ensuring it is using UTF-8 (or similar) encoding internally.
  5. Use the FILE_CHARSET setting if your on-disk template files are not UTF-8 encoded.

That is all. Enjoy!

Things To Consider When Writing Applications

This section is no doubt incomplete. User experiences are welcome. If you discover something that is necessary to change, please add a bullet-point to the list (although we may edit the list periodically to be more coherent).

String Encoding

  • In many cases, Django will convert any bytestrings passed to functions, such as filter functions, into unicode strings. All bytestrings, with the exception of form inputs and data read from files, are assumed to be UTF-8 encoded. Internal bytestrings that are not valid UTF-8 will cause fatal exceptions (because my_string.decode('utf-8') will fail).
  • Template files read from disk may be in an encoding that is not related to the output encoding or UTF-8. To specify the on-disk file encoding, use the FILE_CHARSET setting, which is new in the Unicode branch.
  • String data read from the database will be converted directly to unicode strings. So model attributes based on text fields (TextField, CharField, etc) will be unicode strings.
  • Field sizes for text fields such as TextField and CharField are specified in terms of characters, not the number of bytes used in the encoding in the database. All databases supported by Django can handle this (i.e. their VARCHAR fields are sized in terms of characters and can store unicode characters). So you do not need to worry about how many bytes the encoded version of your data will take up when working with lengths.
  • You might find the functions django.utils.encoding.smart_str() and django.utils.encoding.smart_unicode() useful in your application code. The latter is particularly handy: it takes a bytestring or unicode string and returns a unicode string. It also knows to convert objects with a __unicode__ or __str__ method into unicode strings. So if you have a string that is either a bytestring or unicode and you wish to make it uniform -- always a unicode string -- call smart_unicode() on the object.
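
The points above can be illustrated with a rough sketch. This is not Django's implementation of smart_unicode(): to_text() is a hypothetical helper written for a current Python interpreter (bytes for bytestrings, str for unicode strings), and it only mirrors the general idea of decoding bytestrings as UTF-8 and passing unicode strings through.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text string.

    Hypothetical stand-in for the idea behind smart_unicode();
    it is not Django's implementation.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError if the bytes are not valid in `encoding`.
        return value.decode(encoding)
    # Any other object: fall back to its own string conversion.
    return str(value)


name = to_text(b"Bj\xc3\xb6rk")  # UTF-8 bytes in, text out
same = to_text("Bj\u00f6rk")     # text passes through unchanged

# Lengths are counted in characters, not in encoded bytes.
char_count = len(name)
byte_count = len(name.encode("utf-8"))

# Bytes that are not valid UTF-8 fail loudly instead of being guessed at.
try:
    to_text(b"\xff\xfe")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

Here char_count and byte_count differ because one character in the name needs two bytes in UTF-8, which is why field sizes are easier to reason about in characters.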

Databases

  • Make sure that your database tables support an encoding that can hold all the data you are going to send to it. For example, if you may possibly be sending Chinese characters to the database, using the Russian KOI8-R encoding is going to cause errors. Django does not need to know what encoding your database uses, since the Python database wrappers take care of that. However, you should ensure your database is configured to handle the data you wish to send it. Generally, using a UTF-8 encoding for your tables is the simplest solution.
    • TODO: Write up how to set and check this information for MySQL, PostgreSQL and SQLite.
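
The KOI8-R example can be reproduced with Python's own codecs. This sketch only shows the encoding limitation itself, on a current Python interpreter; it does not talk to a database, and how a particular database server reacts to unrepresentable characters is not covered here.

```python
text = "\u4e2d\u6587"  # two Chinese characters

# KOI8-R has no code points for Chinese characters, so encoding fails.
try:
    text.encode("koi8_r")
    fits_koi8r = True
except UnicodeEncodeError:
    fits_koi8r = False

# UTF-8 can represent the same text and round-trips it unchanged.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```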

Models

  • As mentioned previously, all model attributes retrieved from the database will be unicode strings.
  • If you are supporting international data, it is not safe to return the value of a field directly in your model's __str__ method (in Python, __str__ will always coerce the result to a bytestring object, even if you return a unicode string from the function). There are two possibilities here:
    • The simplest solution is to replace any __str__ methods with a __unicode__ method. This method returns a unicode string, so you can safely write
      class MyModel(models.Model):
          name = models.CharField(maxlength=50)
          ...
          def __unicode__(self):
              return self.name
      
      The default models.Model.__str__ method will call your model's __unicode__, if it exists, and then convert the result to UTF-8. So this single change should be transparent to the rest of your code.
    • Alternatively, if you want to explicitly write the __str__ method for your model, it must return a UTF-8 encoded bytestring. No other encoding is acceptable here (certainly not settings.DEFAULT_CHARSET), because the result of calling str() on a model is used in more places than just template output.
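
The delegation described above (a default bytestring conversion that calls __unicode__ and encodes the result as UTF-8) can be sketched outside Django. Everything here is hypothetical: BaseModel stands in for models.Model, to_bytestring() stands in for the Python 2 __str__ method, and the fallback text is chosen only for this sketch. It is written for a current Python interpreter, where __str__ itself cannot return bytes.

```python
class BaseModel:
    """Hypothetical stand-in for models.Model; not Django code."""

    def to_bytestring(self):
        # Plays the role of the default __str__ described above:
        # call __unicode__ if it exists, then encode the result as UTF-8.
        if hasattr(self, "__unicode__"):
            return self.__unicode__().encode("utf-8")
        # Fallback chosen for this sketch only.
        return ("%s object" % type(self).__name__).encode("utf-8")


class City(BaseModel):
    def __init__(self, name):
        self.name = name

    def __unicode__(self):
        # Return the text value directly; no manual encoding needed here.
        return self.name


city = City("Z\u00fcrich")
encoded = city.to_bytestring()
```

The subclass only deals with text; the single place that produces bytes always uses UTF-8, which matches the requirement that no other encoding is acceptable for the bytestring form.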