| 18 | | a more restrictive encoding -- for example, latin1 (iso8859-1) -- there will be |
|---|
| 19 | | some characters that you cannot store in the database and information will be |
|---|
| 20 | | lost. |
|---|
| 21 | | |
|---|
| 22 | | * For MySQL users, refer to the `MySQL manual`_ (section 10.3.2 for MySQL 5.1) |
|---|
| 23 | | for details on how to set or alter the database character set encoding. |
|---|
| 24 | | |
|---|
| 25 | | * For PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in |
|---|
| | 19 | a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be |
|---|
| | 20 | able to store certain characters in the database, and information will be lost. |
|---|
| | 21 | |
|---|
| | 22 | * MySQL users, refer to the `MySQL manual`_ (section 10.3.2 for MySQL 5.1) for |
|---|
| | 23 | details on how to set or alter the database character set encoding. |
|---|
| | 24 | |
|---|
| | 25 | * PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in |
|---|
| 43 | | Whenever you use strings with Django, you have two choices. You can use Unicode |
|---|
| 44 | | strings or you can use normal strings (sometimes called bytestrings) that are |
|---|
| 45 | | encoded using UTF-8. |
|---|
| | 45 | Whenever you use strings with Django -- e.g., in database lookups, template |
|---|
| | 46 | rendering or anywhere else -- you have two choices for encoding those strings. |
|---|
| | 47 | You can use Unicode strings, or you can use normal strings (sometimes called |
|---|
| | 48 | "bytestrings") that are encoded using UTF-8. |
|---|
| 48 | | A bytestring does not carry any information with it about its encoding. So |
|---|
| 49 | | we have to make an assumption and Django assumes that all bytestrings are |
|---|
| 50 | | in UTF-8. If you pass a string to Django that has been encoded in some |
|---|
| 51 | | other format, things will go wrong in interesting ways. Usually Django will |
|---|
| 52 | | raise a UnicodeDecodeError at some point. |
|---|
| 53 | | |
|---|
| 54 | | If your code only uses ASCII data, you are quite safe to simply use your normal |
|---|
| 55 | | strings (since ASCII is a subset of UTF-8) and pass them around at will. |
|---|
| 56 | | |
|---|
| 57 | | Do not be fooled into thinking that if your ``DEFAULT_CHARSET`` setting is set |
|---|
| 58 | | to something other than ``utf-8`` you can use that encoding in your |
|---|
| 59 | | bytestrings! The ``DEFAULT_CHARSET`` only applies to the strings generated as |
|---|
| 60 | | the result of template rendering (and email). Django will always assume UTF-8 |
|---|
| | 51 | A bytestring does not carry any information with it about its encoding. |
|---|
| | 52 | For that reason, we have to make an assumption, and Django assumes that all |
|---|
| | 53 | bytestrings are in UTF-8. |
|---|
| | 54 | |
|---|
| | 55 | If you pass a string to Django that has been encoded in some other format, |
|---|
| | 56 | things will go wrong in interesting ways. Usually, Django will raise a |
|---|
| | 57 | ``UnicodeDecodeError`` at some point. |
|---|
| | 58 | |
|---|
| | 59 | If your code only uses ASCII data, it's safe to use your normal strings, |
|---|
| | 60 | passing them around at will, because ASCII is a subset of UTF-8. |
|---|
| | 61 | |
|---|
| | 62 | Don't be fooled into thinking that if your ``DEFAULT_CHARSET`` setting is set |
|---|
| | 63 | to something other than ``'utf-8'`` you can use that other encoding in your |
|---|
| | 64 | bytestrings! ``DEFAULT_CHARSET`` only applies to the strings generated as |
|---|
| | 65 | the result of template rendering (and e-mail). Django will always assume UTF-8 |
|---|
| 63 | | application developer). It is under the control of the person installing and |
|---|
| 64 | | using your application and if they choose a different setting, your code must |
|---|
| 65 | | still continue to work. Ergo, it cannot rely on that setting. |
|---|
| | 68 | application developer). It's under the control of the person installing and |
|---|
| | 69 | using your application -- and if that person chooses a different setting, your |
|---|
| | 70 | code must still continue to work. Ergo, it cannot rely on that setting. |
|---|
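The UTF-8 assumption described in this hunk can be illustrated with plain Python, without Django. This is a minimal sketch; the sample text and variable names are illustrative assumptions, not part of the documentation being changed.

```python
# Illustration only: the same text encoded two different ways.
text = u"caf\u00e9"

utf8_bytes = text.encode("utf-8")      # the encoding Django assumes
latin1_bytes = text.encode("latin-1")  # some other encoding

# Decoding UTF-8 bytes as UTF-8 round-trips cleanly.
assert utf8_bytes.decode("utf-8") == text

# Decoding Latin-1 bytes as UTF-8 fails, which is the kind of error
# the text warns about.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Pure-ASCII data produces identical bytes in both encodings.
ascii_is_safe = u"cafe".encode("latin-1") == u"cafe".encode("utf-8")
```

The last comparison is why ASCII-only bytestrings can be passed around freely: ASCII is a subset of both encodings.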
| 76 | | There is actually a third type of string-like object you may encounter when |
|---|
| 77 | | using Django. If you are using the internationalization features of Django, |
|---|
| 78 | | there is the concept of a "lazy translation". This is a string that has been |
|---|
| 79 | | marked as translated, but the actual result is not determined until the object |
|---|
| 80 | | is used in a string. This is useful because the locale that should be used for |
|---|
| 81 | | the translation will not be known until the string is used, even though the |
|---|
| 82 | | string might have originally been created when the code was first imported. |
|---|
| | 79 | Aside from Unicode strings and bytestrings, there's a third type of string-like |
|---|
| | 80 | object you may encounter when using Django. The framework's |
|---|
| | 81 | internationalization features introduce the concept of a "lazy translation" -- |
|---|
| | 82 | a string that has been marked as translated but whose actual translation result |
|---|
| | 83 | isn't determined until the object is used in a string. This feature is useful |
|---|
| | 84 | in cases where the translation locale is unknown until the string is used, even |
|---|
| | 85 | though the string might have originally been created when the code was first |
|---|
| | 86 | imported. |
|---|
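The idea of a lazy translation can be sketched in a few lines. Everything here (``CATALOG``, ``active_language``, ``LazyText``) is a hypothetical stand-in written for this illustration; it is not Django's implementation and is written for Python 3 so that it runs without Django.

```python
# Hypothetical names throughout; this is not Django's implementation.
CATALOG = {
    "en": {"Hello": u"Hello"},
    "de": {"Hello": u"Hallo"},
}
active_language = ["en"]  # stand-in for the locale of the current request


class LazyText(object):
    """Defer the catalog lookup until the text is actually needed."""

    def __init__(self, message):
        self.message = message

    def __str__(self):
        # The lookup happens here, at use time, not at creation time.
        return CATALOG[active_language[0]][self.message]


greeting = LazyText("Hello")  # created once, e.g. when a module is imported

active_language[0] = "de"     # the locale only becomes known later
german = "%s" % greeting

active_language[0] = "en"
english = "%s" % greeting
```

The same object yields different text depending on the locale in effect when it is used, which is the behavior the paragraph describes.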
| 111 | | input to unicode string. The ``encoding`` parameter specifies the input |
|---|
| 112 | | encoding of any bytestring -- Django uses this internally when |
|---|
| 113 | | processing form input data, for example, which might not be UTF-8 |
|---|
| 114 | | encoded. The ``errors`` parameter takes any of the values that are |
|---|
| 115 | | accepted by Python's ``unicode()`` function for its error handling. |
|---|
| | 113 | input to a Unicode string. The ``encoding`` parameter specifies the input |
|---|
| | 114 | encoding. (For example, Django uses this internally when processing form |
|---|
| | 115 | input data, which might not be UTF-8 encoded.) The ``errors`` parameter |
|---|
| | 116 | takes any of the values that are accepted by Python's ``unicode()`` |
|---|
| | 117 | function for its error handling. |
|---|
| 124 | | forces those objects to a unicode string (causing the translation to |
|---|
| 125 | | occur). Normally, you will want to use ``smart_unicode()``. However, |
|---|
| 126 | | ``force_unicode()`` is useful in filters and template tags when you |
|---|
| 127 | | absolutely must have a string to work with, not just something that can |
|---|
| | 126 | forces those objects to a Unicode string (causing the translation to |
|---|
| | 127 | occur). Normally, you'll want to use ``smart_unicode()``. However, |
|---|
| | 128 | ``force_unicode()`` is useful in template tags and filters that |
|---|
| | 129 | absolutely *must* have a string to work with, not just something that can |
|---|
| 136 | | difference is needed in a few places internally. |
|---|
| 137 | | |
|---|
| 138 | | Normally, you will only need to use ``smart_unicode()``. Call it as early as |
|---|
| 139 | | possible on any input data that might be either a unicode or bytestring and |
|---|
| 140 | | from then on you can treat the result as always being unicode. |
|---|
| 141 | | |
|---|
| 142 | | .. _uri_and_iri: |
|---|
| | 138 | difference is needed in a few places within Django's internals. |
|---|
| | 139 | |
|---|
| | 140 | Normally, you'll only need to use ``smart_unicode()``. Call it as early as |
|---|
| | 141 | possible on any input data that might be either Unicode or a bytestring, and |
|---|
| | 142 | from then on, you can treat the result as always being Unicode. |
|---|
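The "convert early, then treat everything as Unicode" advice can be sketched with a small helper. ``to_text()`` is a hypothetical function written for this illustration; Django's ``smart_unicode()`` handles more cases (such as non-string objects and lazy translations) than this sketch does.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical helper: return text for either bytes or text input."""
    if isinstance(value, bytes):
        # Bytestrings carry no encoding information, so one must be supplied.
        return value.decode(encoding, errors)
    return value


# Convert at the boundary; afterwards every value is the same type.
from_bytes = to_text(b"J\xc3\xbcrgen")
from_text = to_text(u"J\u00fcrgen")
from_latin1 = to_text(b"J\xfcrgen", encoding="latin-1")
```

All three calls produce the same Unicode value, so code further down no longer needs to care where the data came from.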
| 149 | | However, in an international environment, you will often need to construct a |
|---|
| 150 | | URL from an IRI_ (very loosely speaking, a URI that can contain unicode |
|---|
| 151 | | characters). Getting the quoting and conversion from IRI to URI correct can be |
|---|
| 152 | | a little tricky, so Django provides some assistance. |
|---|
| | 149 | However, in an international environment, you might need to construct a |
|---|
| | 150 | URL from an IRI_ -- very loosely speaking, a URI that can contain Unicode |
|---|
| | 151 | characters. Quoting and converting an IRI to a URI can be a little tricky, so |
|---|
| | 152 | Django provides some assistance. |
|---|
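The basic conversion can be sketched with the Python 3 standard library alone. This is an assumption-laden illustration rather than Django's ``iri_to_uri()``, which may treat additional characters as safe; the sample path is made up.

```python
from urllib.parse import quote

# An IRI path containing non-ASCII characters and a space.
iri = u"/blog/for/J\u00fcrgen M\u00fcnster/"

# Encode to UTF-8 first, then percent-escape every byte outside the safe set.
uri = quote(iri.encode("utf-8"), safe="/")
```

The result contains only ASCII characters, with each non-ASCII character replaced by the percent-escaped bytes of its UTF-8 encoding.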
| 213 | | contain unicode values when Django retrieves the model from the database. This |
|---|
| 214 | | is always the case, even if the data could fit into an ASCII string. |
|---|
| 215 | | |
|---|
| 216 | | As always, you can pass in bytestrings when creating a model or populating a |
|---|
| 217 | | field and Django will convert it to unicode when it needs to. |
|---|
| | 212 | contain Unicode values when Django retrieves data from the database. This |
|---|
| | 213 | is *always* the case, even if the data could fit into an ASCII bytestring. |
|---|
| | 214 | |
|---|
| | 215 | You can pass in bytestrings when creating a model or populating a field, and |
|---|
| | 216 | Django will convert them to Unicode when it needs to. |
|---|
| 220 | | ----------------------------------------------------- |
|---|
| 221 | | |
|---|
| 222 | | One consequence of using unicode by default is that you have to take some care |
|---|
| 223 | | when printing data from the model. In particular, rather than writing a |
|---|
| 224 | | ``__str__()`` method, it is recommended to write a ``__unicode__()`` method for |
|---|
| 225 | | your model. In the ``__unicode__()`` method, you can quite safely return the |
|---|
| 226 | | values of all your fields without having to worry about whether they fit into a |
|---|
| 227 | | bytestring or not (the result of ``__str__()`` is *always* a bytestring, even |
|---|
| 228 | | if you accidentally try to return a unicode object). |
|---|
| 229 | | |
|---|
| 230 | | You can still create a ``__str__()`` method on your models if you wish, of |
|---|
| 231 | | course. However, Django's ``Model`` base class automatically provides you with |
|---|
| 232 | | a ``__str__()`` method that calls your ``__unicode__()`` method and then |
|---|
| 233 | | encodes the result correctly into UTF-8. So you would normally only create a |
|---|
| 234 | | ``__unicode__()`` method and let Django handle the coercion to a bytestring |
|---|
| 235 | | when required. |
|---|
| | 219 | ---------------------------------------------------- |
|---|
| | 220 | |
|---|
| | 221 | One consequence of using Unicode by default is that you have to take some care |
|---|
| | 222 | when printing data from the model. |
|---|
| | 223 | |
|---|
| | 224 | In particular, rather than giving your model a ``__str__()`` method, we |
|---|
| | 225 | recommend you implement a ``__unicode__()`` method. In the ``__unicode__()`` |
|---|
| | 226 | method, you can quite safely return the values of all your fields without |
|---|
| | 227 | having to worry about whether they fit into a bytestring or not. (The way |
|---|
| | 228 | Python works, the result of ``__str__()`` is *always* a bytestring, even if you |
|---|
| | 229 | accidentally try to return a Unicode object.) |
|---|
| | 230 | |
|---|
| | 231 | You can still create a ``__str__()`` method on your models if you want, of |
|---|
| | 232 | course, but you shouldn't need to do this unless you have a good reason. |
|---|
| | 233 | Django's ``Model`` base class automatically provides a ``__str__()`` |
|---|
| | 234 | implementation that calls ``__unicode__()`` and encodes the result into UTF-8. |
|---|
| | 235 | This means you'll normally only need to implement a ``__unicode__()`` method |
|---|
| | 236 | and let Django handle the coercion to a bytestring when required. |
|---|
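The delegation described here (a base class that derives the bytestring form from ``__unicode__()``) can be sketched as follows. ``TextBase``, ``Person`` and ``as_bytestring()`` are hypothetical names for this illustration. A named method is used instead of ``__str__()`` because on Python 3 ``__str__()`` must return text, so this is an analogy to the behavior described, not a copy of Django's ``Model``.

```python
class TextBase(object):
    """Hypothetical base class, standing in for the role the text
    assigns to Django's ``Model`` base class."""

    def __unicode__(self):
        raise NotImplementedError

    def as_bytestring(self):
        # Delegate to __unicode__() and encode the result as UTF-8.
        return self.__unicode__().encode("utf-8")


class Person(TextBase):
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def __unicode__(self):
        # Field values can be returned as-is; no encoding concerns here.
        return u"%s %s" % (self.first_name, self.last_name)


person = Person(u"J\u00fcrgen", u"M\u00fcnster")
encoded = person.as_bytestring()
```

The subclass only defines the Unicode representation; the coercion to bytes lives in one place in the base class.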
| 240 | | URLs can only contain ASCII characters. If you are constructing a URL from |
|---|
| 241 | | pieces of data that might be non-ASCII, you must be careful to encode the |
|---|
| 242 | | results in a way that is suitable for a URL. If you are using the |
|---|
| 243 | | ``django.db.models.permalink()`` decorator, this is handled automatically by |
|---|
| 244 | | the decorator. |
|---|
| 245 | | |
|---|
| 246 | | If you are constructing the URL manually, you need to take care of the |
|---|
| 247 | | encoding yourself. Normally, this would involve a combination of the |
|---|
| 248 | | ``iri_to_uri()`` and ``urlquote()`` functions that were documented above_. For |
|---|
| 249 | | example:: |
|---|
| | 241 | URLs can only contain ASCII characters. If you're constructing a URL from |
|---|
| | 242 | pieces of data that might be non-ASCII, be careful to encode the results in a |
|---|
| | 243 | way that is suitable for a URL. The ``django.db.models.permalink()`` decorator |
|---|
| | 244 | handles this for you automatically. |
|---|
| | 245 | |
|---|
| | 246 | If you're constructing a URL manually (i.e., *not* using the ``permalink()`` |
|---|
| | 247 | decorator), you'll need to take care of the encoding yourself. In this case, |
|---|
| | 248 | use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented |
|---|
| | 249 | above_. For example:: |
|---|
| 279 | | As usual, templates can be created from unicode or bytestrings. However, they |
|---|
| 280 | | can also be created by reading a file from disk and this creates a slight |
|---|
| 281 | | complication: not all filesystems store their data encoded as UTF-8. If your |
|---|
| 282 | | template files are not stored with a UTF-8 encoding, set the ``FILE_CHARSET`` |
|---|
| 283 | | setting to the encoding of the on-disk files. When Django reads in a template |
|---|
| 284 | | file it will convert the data from this encoding to unicode. |
|---|
| 285 | | |
|---|
| 286 | | When a template is rendered for sending out as an HTML document or an e-mail, |
|---|
| 287 | | it may be convenient to use an encoding other than UTF-8. You should set the |
|---|
| 288 | | ``DEFAULT_CHARSET`` parameter to control the rendered template encoding (the |
|---|
| 289 | | default setting is utf-8). |
|---|
| | 278 | You can use either Unicode or bytestrings when creating templates manually:: |
|---|
| | 279 | |
|---|
| | 280 | from django.template import Template |
|---|
| | 281 | t1 = Template('This is a bytestring template.') |
|---|
| | 282 | t2 = Template(u'This is a Unicode template.') |
|---|
| | 283 | |
|---|
| | 284 | But the common case is to read templates from the filesystem, and this creates |
|---|
| | 285 | a slight complication: not all filesystems store their data encoded as UTF-8. |
|---|
| | 286 | If your template files are not stored with a UTF-8 encoding, set the ``FILE_CHARSET`` |
|---|
| | 287 | setting to the encoding of the files on disk. When Django reads in a template |
|---|
| | 288 | file, it will convert the data from this encoding to Unicode. (``FILE_CHARSET`` |
|---|
| | 289 | is set to ``'utf-8'`` by default.) |
|---|
| | 290 | |
|---|
| | 291 | The ``DEFAULT_CHARSET`` setting controls the encoding of rendered templates. |
|---|
| | 292 | This is set to UTF-8 by default. |
|---|
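The roles of the two settings can be simulated in plain Python. The setting names below come from the text; the byte handling is an illustrative assumption about the flow (decode on read, encode on output), not Django's template loader code.

```python
# Values as they might appear in a settings file.
FILE_CHARSET = "iso-8859-1"  # encoding of template files on disk
DEFAULT_CHARSET = "utf-8"    # encoding of rendered output

# Simulated file contents: bytes as they would be read from disk.
raw_template = u"Gr\u00fc\u00dfe".encode(FILE_CHARSET)

# Reading: bytes are decoded from FILE_CHARSET into Unicode.
template_text = raw_template.decode(FILE_CHARSET)

# Output: the Unicode result is encoded using DEFAULT_CHARSET.
rendered = template_text.encode(DEFAULT_CHARSET)
```

The two settings are independent: one describes the input files, the other the output sent to the client.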
| 308 | | Django's email framework (in ``django.core.mail``) supports unicode |
|---|
| 309 | | transparently. You can use unicode data in the message bodies and any headers. |
|---|
| 310 | | However, you must still respect the requirements of the email specifications, |
|---|
| 311 | | so, for example, email addresses should use ASCII characters. The following |
|---|
| 312 | | code is certainly possible (demonstrating the everything except e-mail |
|---|
| 313 | | addresses can be non-ASCII):: |
|---|
| | 311 | Django's e-mail framework (in ``django.core.mail``) supports Unicode |
|---|
| | 312 | transparently. You can use Unicode data in the message bodies and any headers. |
|---|
| | 313 | However, you're still obligated to respect the requirements of the e-mail |
|---|
| | 314 | specifications, so, for example, e-mail addresses should use only ASCII |
|---|
| | 315 | characters. |
|---|
| | 316 | |
|---|
| | 317 | The following code example demonstrates that everything except e-mail addresses |
|---|
| | 318 | can be non-ASCII:: |
|---|
| 349 | | ``request.POST`` and all subsequent accesses will use the new encoding. |
|---|
| 350 | | |
|---|
| 351 | | It will typically be very rare that you would need to worry about changing the |
|---|
| 352 | | form encoding. However, if you are talking to a legacy system or a system |
|---|
| 353 | | beyond your control with particular ideas about encoding, you do have a way to |
|---|
| 354 | | control the decoding of the data. |
|---|
| 355 | | |
|---|
| 356 | | For request features such as file uploads, no automatic decoding takes place, |
|---|
| 357 | | because those attributes are normally treated as collections of bytes, rather |
|---|
| 358 | | than strings. Any decoding would alter the meaning of the stream of bytes. |
|---|
| 359 | | |
|---|
| | 355 | ``request.POST``, and all subsequent accesses will use the new encoding. |
|---|
| | 356 | |
|---|
| | 357 | Most developers won't need to worry about changing form encoding, but this is |
|---|
| | 358 | a useful feature for applications that talk to legacy systems whose encoding |
|---|
| | 359 | you cannot control. |
|---|
| | 360 | |
|---|
| | 361 | Django does not decode the data of file uploads, because that data is normally |
|---|
| | 362 | treated as collections of bytes, rather than strings. Any automatic decoding |
|---|
| | 363 | there would alter the meaning of the stream of bytes. |
|---|
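Why the form encoding matters can be shown with the Python 3 standard library. This sketch simulates a legacy client submitting KOI8-R data; the field name ``q`` and the sample text are made up, and the parsing uses ``urllib.parse`` rather than Django's request handling.

```python
from urllib.parse import parse_qs, quote

original = u"\u043c\u0438\u0440"

# Simulate a legacy client that submits form data encoded as KOI8-R.
raw_query = "q=" + quote(original.encode("koi8-r"))

# Decoding with the encoding the client actually used recovers the text.
as_koi8 = parse_qs(raw_query, encoding="koi8-r")["q"][0]

# Decoding the same bytes as UTF-8 does not.
as_utf8 = parse_qs(raw_query, encoding="utf-8")["q"][0]
```

The submitted bytes are the same in both cases; only the encoding chosen for decoding determines whether the original text comes back.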