======================
Unicode data in Django
======================

**New in Django development version**

Django natively supports Unicode data everywhere. Provided your database can
store the data, you can safely pass Unicode strings to templates, models and
the database.

This document tells you what you need to know if you're writing applications
that use data or templates that are encoded in something other than ASCII.

Creating the database
=====================

Make sure your database is configured to be able to store arbitrary string
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
able to store certain characters in the database, and information will be lost.

* MySQL users, refer to the `MySQL manual`_ (section 10.3.2 for MySQL 5.1) for
  details on how to set or alter the database character set encoding.

* PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in
  PostgreSQL 8) for details on creating databases with the correct encoding.

* SQLite users, there is nothing you need to do. SQLite always uses UTF-8
  for internal encoding.

.. _MySQL manual: http://www.mysql.org/doc/refman/5.1/en/charset-database.html
.. _PostgreSQL manual: http://www.postgresql.org/docs/8.2/static/multibyte.html#AEN24104

All of Django's database backends automatically convert Unicode strings into
the appropriate encoding for talking to the database. They also automatically
convert strings retrieved from the database into Python Unicode strings. You
don't even need to tell Django what encoding your database uses: that is
handled transparently.

For more, see the section "The database API" below.

General string handling
=======================

Whenever you use strings with Django -- e.g., in database lookups, template
rendering or anywhere else -- you have two choices for encoding those strings.
You can use Unicode strings, or you can use normal strings (sometimes called
"bytestrings") that are encoded using UTF-8.

.. warning::
    A bytestring does not carry any information with it about its encoding.
    For that reason, we have to make an assumption, and Django assumes that all
    bytestrings are in UTF-8.

    If you pass a string to Django that has been encoded in some other format,
    things will go wrong in interesting ways. Usually, Django will raise a
    ``UnicodeDecodeError`` at some point.

If your code only uses ASCII data, it's safe to use your normal strings,
passing them around at will, because ASCII is a subset of UTF-8.

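To make the UTF-8 assumption concrete, here is a minimal sketch that uses only
the Python 3 standard library rather than Django itself (the variable names are
illustrative). ASCII bytes decode cleanly as UTF-8, while bytes produced by
another encoding, such as Latin-1, generally do not:

```python
# Illustrative only: plain Python 3, no Django involved.
ascii_bytes = b"hello"
latin1_bytes = "Orléans".encode("latin-1")  # contains the single byte 0xE9

# ASCII is a subset of UTF-8, so this decode always succeeds.
decoded_ascii = ascii_bytes.decode("utf-8")

# A Latin-1 encoded 'é' is not a valid UTF-8 sequence, so decoding fails.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

This is the kind of failure the warning above describes: the bytes themselves
give no hint about which encoding produced them.
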
Don't be fooled into thinking that if your ``DEFAULT_CHARSET`` setting is set
to something other than ``'utf-8'`` you can use that other encoding in your
bytestrings! ``DEFAULT_CHARSET`` only applies to the strings generated as
the result of template rendering (and e-mail). Django will always assume UTF-8
encoding for internal bytestrings. The reason for this is that the
``DEFAULT_CHARSET`` setting is not actually under your control (if you are the
application developer). It's under the control of the person installing and
using your application -- and if that person chooses a different setting, your
code must still continue to work. Ergo, it cannot rely on that setting.

In most cases when Django is dealing with strings, it will convert them to
Unicode strings before doing anything else. So, as a general rule, if you pass
in a bytestring, be prepared to receive a Unicode string back in the result.

Translated strings
------------------

Aside from Unicode strings and bytestrings, there's a third type of string-like
object you may encounter when using Django. The framework's
internationalization features introduce the concept of a "lazy translation" --
a string that has been marked as translated but whose actual translation result
isn't determined until the object is used in a string. This feature is useful
in cases where the translation locale is unknown until the string is used, even
though the string might have originally been created when the code was first
imported.

Normally, you won't have to worry about lazy translations. Just be aware that
if you examine an object and it claims to be a
``django.utils.functional.__proxy__`` object, it is a lazy translation.
Calling ``unicode()`` with the lazy translation as the argument will generate a
Unicode string in the current locale.

For more details about lazy translation objects, refer to the
internationalization_ documentation.

.. _internationalization: ../i18n/#lazy-translation

Useful utility functions
------------------------

Because some string operations come up again and again, Django ships with a few
useful functions that should make working with Unicode and bytestring objects
a bit easier.

Conversion functions
~~~~~~~~~~~~~~~~~~~~

The ``django.utils.encoding`` module contains a few functions that are handy
for converting back and forth between Unicode and bytestrings.

* ``smart_unicode(s, encoding='utf-8', errors='strict')`` converts its
  input to a Unicode string. The ``encoding`` parameter specifies the input
  encoding. (For example, Django uses this internally when processing form
  input data, which might not be UTF-8 encoded.) The ``errors`` parameter
  takes any of the values that are accepted by Python's ``unicode()``
  function for its error handling.

  If you pass ``smart_unicode()`` an object that has a ``__unicode__``
  method, it will use that method to do the conversion.

* ``force_unicode(s, encoding='utf-8', errors='strict')`` is identical to
  ``smart_unicode()`` in almost all cases. The difference is when the
  first argument is a `lazy translation`_ instance. While
  ``smart_unicode()`` preserves lazy translations, ``force_unicode()``
  forces those objects to a Unicode string (causing the translation to
  occur). Normally, you'll want to use ``smart_unicode()``. However,
  ``force_unicode()`` is useful in template tags and filters that
  absolutely *must* have a string to work with, not just something that can
  be converted to a string.

* ``smart_str(s, encoding='utf-8', strings_only=False, errors='strict')``
  is essentially the opposite of ``smart_unicode()``. It forces the first
  argument to a bytestring. The ``strings_only`` parameter, if set to True,
  will result in Python integers, booleans and ``None`` not being
  converted to a string (they keep their original types). This is slightly
  different semantics from Python's builtin ``str()`` function, but the
  difference is needed in a few places within Django's internals.

Normally, you'll only need to use ``smart_unicode()``. Call it as early as
possible on any input data that might be either Unicode or a bytestring, and
from then on, you can treat the result as always being Unicode.

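The "convert early, then work in Unicode" pattern can be sketched in plain
Python 3 as follows. The ``to_text()`` and ``to_bytes()`` helpers are
hypothetical names written for this illustration; they are not Django's
functions and do not reproduce every detail of their behaviour (lazy
translations, for instance, are not handled).

```python
# Hypothetical helpers illustrating the "decode early" pattern in Python 3.
# These are not Django's functions; names and behaviour are assumptions.
def to_text(value, encoding="utf-8", errors="strict"):
    """Return `value` as a text string, decoding bytes if necessary."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    # Fall back to the object's own string conversion.
    return str(value)


def to_bytes(value, encoding="utf-8", errors="strict"):
    """Return `value` as a bytestring, encoding text if necessary."""
    if isinstance(value, bytes):
        return value
    return to_text(value).encode(encoding, errors)


# Input may arrive as bytes or as text; normalise once, at the boundary.
name = to_text(b"Fran\xc3\xa7ois")
legacy_name = to_text(b"Orl\xe9ans", encoding="latin-1")
```

After the boundary call, the rest of the code can assume it is holding text and
only convert back to bytes when writing output.
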
URI and IRI handling
~~~~~~~~~~~~~~~~~~~~

Web frameworks have to deal with URLs (which are a type of URI_). One
requirement of URLs is that they are encoded using only ASCII characters.
However, in an international environment, you might need to construct a
URL from an IRI_ -- very loosely speaking, a URI that can contain Unicode
characters. Quoting and converting an IRI to URI can be a little tricky, so
Django provides some assistance.

* The function ``django.utils.encoding.iri_to_uri()`` implements the
  conversion from IRI to URI as required by the specification (`RFC
  3987`_).

* The functions ``django.utils.http.urlquote()`` and
  ``django.utils.http.urlquote_plus()`` are versions of Python's standard
  ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII
  characters. (The data is converted to UTF-8 prior to encoding.)

These two groups of functions have slightly different purposes, and it's
important to keep them straight. Normally, you would use ``urlquote()`` on the
individual portions of the IRI or URI path so that any reserved characters
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
the full IRI and it converts any non-ASCII characters to the correct encoded
values.

.. note::
    Technically, it isn't correct to say that ``iri_to_uri()`` implements the
    full algorithm in the IRI specification. It doesn't (yet) perform the
    international domain name encoding portion of the algorithm.

The ``iri_to_uri()`` function will not change ASCII characters that are
otherwise permitted in a URL. So, for example, the character '%' is not
further encoded when passed to ``iri_to_uri()``. This means you can pass a
full URL to this function and it will not mess up the query string or anything
like that.

An example might clarify things here::

    >>> urlquote(u'Paris & Orléans')
    u'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri(u'/favorites/François/%s' % urlquote(u'Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'

If you look carefully, you can see that the portion that was generated by
``urlquote()`` in the second example was not double-quoted when passed to
``iri_to_uri()``. This is a very important and useful feature. It means that
you can construct your IRI without worrying about whether it contains
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
result.

The ``iri_to_uri()`` function is also idempotent, which means the following is
always true::

    iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)

So you can safely call it multiple times on the same IRI without risking
double-quoting problems.

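The division of labour between the two kinds of function can be reproduced with
the Python 3 standard library alone. The sketch below is an approximation
written for illustration: ``iri_to_uri_sketch()`` and the ``SAFE_CHARS`` set
are assumptions, not Django's implementation, and like the function described
above it does nothing about international domain names.

```python
from urllib.parse import quote

# Characters the sketch leaves untouched. This set is an assumption made for
# illustration; the key point is that '%' is included, so text that is
# already percent-encoded is not encoded a second time.
SAFE_CHARS = "/#%[]=:;$&()+,!?*@'~"


def iri_to_uri_sketch(iri):
    """Percent-encode non-ASCII characters, leaving '%' and reserved ones alone."""
    return quote(iri, safe=SAFE_CHARS)


# Step 1: quote an individual path segment, including reserved characters.
segment = quote("Paris & Orléans", safe="")

# Step 2: build the full IRI (it may still contain non-ASCII text) and
# convert it once at the end.
iri = "/favorites/François/%s" % segment
uri = iri_to_uri_sketch(iri)

# Applying the conversion again changes nothing.
uri_again = iri_to_uri_sketch(uri)
```

Here ``uri`` and ``uri_again`` are equal, which is the idempotence property
described above.
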
.. _URI: http://www.ietf.org/rfc/rfc2396.txt
.. _IRI: http://www.ietf.org/rfc/rfc3987.txt
.. _RFC 3987: IRI_

Models
======

Because all strings are returned from the database as Unicode strings, model
fields that are character based (CharField, TextField, URLField, etc.) will
contain Unicode values when Django retrieves data from the database. This
is *always* the case, even if the data could fit into an ASCII bytestring.

You can pass in bytestrings when creating a model or populating a field, and
Django will convert them to Unicode when it needs to.

Choosing between ``__str__()`` and ``__unicode__()``
----------------------------------------------------

One consequence of using Unicode by default is that you have to take some care
when printing data from the model.

In particular, rather than giving your model a ``__str__()`` method, we
recommend you implement a ``__unicode__()`` method. In the ``__unicode__()``
method, you can quite safely return the values of all your fields without
having to worry about whether they fit into a bytestring or not. (The way
Python works, the result of ``__str__()`` is *always* a bytestring, even if you
accidentally try to return a Unicode object.)

You can still create a ``__str__()`` method on your models if you want, of
course, but you shouldn't need to do this unless you have a good reason.
Django's ``Model`` base class automatically provides a ``__str__()``
implementation that calls ``__unicode__()`` and encodes the result into UTF-8.
This means you'll normally only need to implement a ``__unicode__()`` method
and let Django handle the coercion to a bytestring when required.

Taking care in ``get_absolute_url()``
-------------------------------------

URLs can only contain ASCII characters. If you're constructing a URL from
pieces of data that might be non-ASCII, be careful to encode the results in a
way that is suitable for a URL. The ``django.db.models.permalink()`` decorator
handles this for you automatically.

If you're constructing a URL manually (i.e., *not* using the ``permalink()``
decorator), you'll need to take care of the encoding yourself. In this case,
use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented
above_. For example::

    from django.utils.encoding import iri_to_uri
    from django.utils.http import urlquote

    def get_absolute_url(self):
        url = u'/person/%s/?x=0&y=0' % urlquote(self.location)
        return iri_to_uri(url)

This function returns a correctly encoded URL even if ``self.location`` is
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
call isn't strictly necessary in the above example, because all the
non-ASCII characters would have been removed in quoting in the first line.)

.. _above: uri_and_iri_

The database API
================

You can pass either Unicode strings or UTF-8 bytestrings as arguments to
``filter()`` methods and the like in the database API. The following two
querysets are identical::

    qs = People.objects.filter(name__contains=u'Å')
    qs = People.objects.filter(name__contains='\xc3\x85') # UTF-8 encoding of Å

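The equivalence of the two arguments can be checked without a database. This
short Python 3 snippet, independent of Django, confirms that the two bytes in
the second query are the UTF-8 encoding of the character in the first:

```python
text = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
utf8_bytes = text.encode("utf-8")

# The bytestring form used in the second queryset.
same_bytes = utf8_bytes == b"\xc3\x85"

# Decoding as UTF-8 gives back the original one-character string.
round_trip = utf8_bytes.decode("utf-8") == text
```
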
Templates
=========

You can use either Unicode or bytestrings when creating templates manually::

    from django.template import Template
    t1 = Template('This is a bytestring template.')
    t2 = Template(u'This is a Unicode template.')

But the common case is to read templates from the filesystem, and this creates
a slight complication: not all filesystems store their data encoded as UTF-8.
If your template files are not stored with a UTF-8 encoding, set the
``FILE_CHARSET`` setting to the encoding of the files on disk. When Django
reads in a template file, it will convert the data from this encoding to
Unicode. (``FILE_CHARSET`` is set to ``'utf-8'`` by default.)

The ``DEFAULT_CHARSET`` setting controls the encoding of rendered templates.
This is set to UTF-8 by default.

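The two settings act at opposite ends of the pipeline: one governs decoding on
the way in, the other encoding on the way out. The following Python 3 sketch
imitates that flow with the standard library only. The ``file_charset`` and
``output_charset`` variables merely stand in for the two settings, and the
"rendering" step is a plain string replacement, not Django's template engine.

```python
import os
import tempfile

file_charset = "iso-8859-1"   # plays the role of FILE_CHARSET
output_charset = "utf-8"      # plays the role of DEFAULT_CHARSET

# Write a template file to disk in a non-UTF-8 encoding.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as f:
    f.write("Grüße, {{ name }}".encode(file_charset))
    path = f.name

try:
    # Reading: decode from the on-disk encoding into a Unicode string.
    with open(path, encoding=file_charset) as f:
        template_source = f.read()
finally:
    os.remove(path)

# "Rendering" is faked with a replace; only the encoding step matters here.
rendered = template_source.replace("{{ name }}", "Jörg")

# Writing: encode the rendered text with the output charset.
response_body = rendered.encode(output_charset)
```
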
Template tags and filters
-------------------------

A couple of tips to remember when writing your own template tags and filters:

* Always return Unicode strings from a template tag's ``render()`` method
  and from template filters.

* Use ``force_unicode()`` in preference to ``smart_unicode()`` in these
  places. Tag rendering and filter calls occur as the template is being
  rendered, so there is no advantage to postponing the conversion of lazy
  translation objects into strings. It's easier to work solely with Unicode
  strings at that point.

E-mail
======

Django's e-mail framework (in ``django.core.mail``) supports Unicode
transparently. You can use Unicode data in the message bodies and any headers.
However, you're still obligated to respect the requirements of the e-mail
specifications, so, for example, e-mail addresses should use only ASCII
characters.

The following code example demonstrates that everything except e-mail addresses
can be non-ASCII::

    from django.core.mail import EmailMessage

    subject = u'My visit to Sør-Trøndelag'
    sender = u'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>'
    recipients = ['Fred <fred@example.com>']
    body = u'...'
    EmailMessage(subject, body, sender, recipients).send()

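Django takes care of header encoding for you, so the following is not something
you need to write. It is a standard-library sketch in Python 3, included only
to illustrate how non-ASCII header text can be carried in an ASCII-only form
while the address itself stays untouched:

```python
from email.header import Header, decode_header, make_header
from email.utils import formataddr

subject = "My visit to Sør-Trøndelag"

# Encode the subject as an RFC 2047 "encoded word" using UTF-8.
encoded_subject = Header(subject, "utf-8").encode()

# formataddr() encodes a non-ASCII display name but leaves the address alone.
sender = formataddr(("Arnbjörg Ráðormsdóttir", "arnbjorg@example.com"))

# Decoding the header recovers the original text.
decoded_subject = str(make_header(decode_header(encoded_subject)))
```
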
Form submission
===============

HTML form submission is a tricky area. There's no guarantee that the
submission will include encoding information, which means the framework might
have to guess at the encoding of submitted data.

Django adopts a "lazy" approach to decoding form data. The data in an
``HttpRequest`` object is only decoded when you access it. In fact, most of
the data is not decoded at all. Only the ``HttpRequest.GET`` and
``HttpRequest.POST`` data structures have any decoding applied to them. Those
two fields will return their members as Unicode data. All other attributes and
methods of ``HttpRequest`` return data exactly as it was submitted by the
client.

By default, the ``DEFAULT_CHARSET`` setting is used as the assumed encoding
for form data. If you need to change this for a particular form, you can set
the ``encoding`` attribute on the ``GET`` and ``POST`` data structures. For
convenience, changing the ``encoding`` property on an ``HttpRequest`` instance
does this for you. For example::

    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.encoding = 'koi8-r'
        ...

You can even change the encoding after having accessed ``request.GET`` or
``request.POST``, and all subsequent accesses will use the new encoding.

Most developers won't need to worry about changing form encoding, but this is
a useful feature for applications that talk to legacy systems whose encoding
you cannot control.

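Why the assumed encoding matters can be seen with the Python 3 standard library
alone. This sketch does not use ``HttpRequest``; it simulates a legacy client
that percent-encodes a form value as KOI8-R and then parses the query string
under two different assumptions:

```python
from urllib.parse import parse_qs, quote

# Simulate a legacy client that percent-encodes form data as KOI8-R.
raw_query = "q=" + quote("привет", encoding="koi8-r")

# Parsing with the default assumption (UTF-8) does not recover the text;
# undecodable bytes are replaced rather than raising an error.
wrong = parse_qs(raw_query)["q"][0]

# Naming the real encoding recovers the submitted value.
right = parse_qs(raw_query, encoding="koi8-r")["q"][0]
```

The raw bytes are identical in both cases; only the declared encoding decides
whether the decoded value is meaningful.
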
Django does not decode the data of file uploads, because that data is normally
treated as collections of bytes, rather than strings. Any automatic decoding
there would alter the meaning of the stream of bytes.