root/django/branches/unicode/docs/unicode.txt

Revision 5597, 16.2 kB (checked in by adrian, 1 year ago)

unicode: Made some documentation edits and inconsequential typo fixes throughout code

======================
Unicode data in Django
======================

**New in Django development version**

Django natively supports Unicode data everywhere. Provided your database can
store the data, you can safely pass Unicode strings to templates, models and
the database.

This document tells you what you need to know if you're writing applications
that use data or templates that are encoded in something other than ASCII.

Creating the database
=====================

Make sure your database is configured to be able to store arbitrary string
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
able to store certain characters in the database, and information will be lost.

 * MySQL users, refer to the `MySQL manual`_ (section 10.3.2 for MySQL 5.1) for
   details on how to set or alter the database character set encoding.

 * PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in
   PostgreSQL 8) for details on creating databases with the correct encoding.

 * SQLite users, there is nothing you need to do. SQLite always uses UTF-8
   for internal encoding.

.. _MySQL manual: http://www.mysql.org/doc/refman/5.1/en/charset-database.html
.. _PostgreSQL manual: http://www.postgresql.org/docs/8.2/static/multibyte.html#AEN24104

All of Django's database backends automatically convert Unicode strings into
the appropriate encoding for talking to the database. They also automatically
convert strings retrieved from the database into Python Unicode strings. You
don't even need to tell Django what encoding your database uses: that is
handled transparently.

For more, see the section "The database API" below.

General string handling
=======================

Whenever you use strings with Django -- e.g., in database lookups, template
rendering or anywhere else -- you have two choices for encoding those strings.
You can use Unicode strings, or you can use normal strings (sometimes called
"bytestrings") that are encoded using UTF-8.

.. warning::
    A bytestring does not carry any information with it about its encoding.
    For that reason, we have to make an assumption, and Django assumes that all
    bytestrings are in UTF-8.

    If you pass a string to Django that has been encoded in some other format,
    things will go wrong in interesting ways. Usually, Django will raise a
    ``UnicodeDecodeError`` at some point.

If your code only uses ASCII data, it's safe to use your normal strings,
passing them around at will, because ASCII is a subset of UTF-8.
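The two points above can be checked with a small sketch. It uses only the
Python 3 standard library, where ``bytes`` plays the role of the bytestring
and ``str`` the role of the Unicode string. It illustrates the encoding rules
only and is not Django code.

```python
# Python 3 sketch: bytes = "bytestring", str = Unicode string.
ascii_bytes = b"hello"

# ASCII is a subset of UTF-8, so ASCII bytes decode identically either way.
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Bytes produced by another encoding (here latin1) are generally not valid
# UTF-8, so treating them as UTF-8 raises UnicodeDecodeError.
latin1_bytes = "Orléans".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```

The bytestring itself gives no hint that it is latin1. The failure only
surfaces when something tries to decode it.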

Don't be fooled into thinking that if your ``DEFAULT_CHARSET`` setting is set
to something other than ``'utf-8'`` you can use that other encoding in your
bytestrings! ``DEFAULT_CHARSET`` only applies to the strings generated as
the result of template rendering (and e-mail). Django will always assume UTF-8
encoding for internal bytestrings. The reason for this is that the
``DEFAULT_CHARSET`` setting is not actually under your control (if you are the
application developer). It's under the control of the person installing and
using your application -- and if that person chooses a different setting, your
code must still continue to work. Ergo, it cannot rely on that setting.

In most cases when Django is dealing with strings, it will convert them to
Unicode strings before doing anything else. So, as a general rule, if you pass
in a bytestring, be prepared to receive a Unicode string back in the result.

Translated strings
------------------

Aside from Unicode strings and bytestrings, there's a third type of string-like
object you may encounter when using Django. The framework's
internationalization features introduce the concept of a "lazy translation" --
a string that has been marked as translated but whose actual translation result
isn't determined until the object is used in a string. This feature is useful
in cases where the translation locale is unknown until the string is used, even
though the string might have originally been created when the code was first
imported.

Normally, you won't have to worry about lazy translations. Just be aware that
if you examine an object and it claims to be a
``django.utils.functional.__proxy__`` object, it is a lazy translation.
Calling ``unicode()`` with the lazy translation as the argument will generate a
Unicode string in the current locale.

For more details about lazy translation objects, refer to the
internationalization_ documentation.

.. _internationalization: ../i18n/#lazy-translation

Useful utility functions
------------------------

Because some string operations come up again and again, Django ships with a few
useful functions that should make working with Unicode and bytestring objects
a bit easier.

Conversion functions
~~~~~~~~~~~~~~~~~~~~

The ``django.utils.encoding`` module contains a few functions that are handy
for converting back and forth between Unicode and bytestrings.

    * ``smart_unicode(s, encoding='utf-8', errors='strict')`` converts its
      input to a Unicode string. The ``encoding`` parameter specifies the input
      encoding. (For example, Django uses this internally when processing form
      input data, which might not be UTF-8 encoded.) The ``errors`` parameter
      takes any of the values that are accepted by Python's ``unicode()``
      function for its error handling.

      If you pass ``smart_unicode()`` an object that has a ``__unicode__``
      method, it will use that method to do the conversion.

    * ``force_unicode(s, encoding='utf-8', errors='strict')`` is identical to
      ``smart_unicode()`` in almost all cases. The difference is when the
      first argument is a `lazy translation`_ instance. While
      ``smart_unicode()`` preserves lazy translations, ``force_unicode()``
      forces those objects to a Unicode string (causing the translation to
      occur). Normally, you'll want to use ``smart_unicode()``. However,
      ``force_unicode()`` is useful in template tags and filters that
      absolutely *must* have a string to work with, not just something that can
      be converted to a string.

    * ``smart_str(s, encoding='utf-8', strings_only=False, errors='strict')``
      is essentially the opposite of ``smart_unicode()``. It forces the first
      argument to a bytestring. The ``strings_only`` parameter, if set to True,
      will result in Python integers, booleans and ``None`` not being
      converted to a string (they keep their original types). This is slightly
      different semantics from Python's builtin ``str()`` function, but the
      difference is needed in a few places within Django's internals.

Normally, you'll only need to use ``smart_unicode()``. Call it as early as
possible on any input data that might be either Unicode or a bytestring, and
from then on, you can treat the result as always being Unicode.
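The "convert early, then work in Unicode" pattern can be sketched without
Django. The helper ``to_text()`` below is a hypothetical stand-in written for
Python 3 (``str`` is the Unicode type, ``bytes`` the bytestring). It mirrors
only the ``encoding`` and ``errors`` parameters described above and does not
reproduce the behaviour of ``smart_unicode()`` in full.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical helper: return ``value`` as a Unicode string (``str``)."""
    if isinstance(value, str):
        return value                          # already Unicode
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors)
    return str(value)                         # other objects: their text form


# Normalise at the boundary; everything afterwards sees the same type.
a = to_text("Orléans")                          # Unicode in, Unicode out
b = to_text("Orléans".encode("utf-8"))          # UTF-8 bytestring
c = to_text(b"Orl\xe9ans", encoding="latin-1")  # explicit input encoding
d = to_text(42)                                 # non-string object
```

All three string inputs end up as the same Unicode value, so code further
down never needs to ask which form it was given.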

URI and IRI handling
~~~~~~~~~~~~~~~~~~~~

Web frameworks have to deal with URLs (which are a type of URI_). One
requirement of URLs is that they are encoded using only ASCII characters.
However, in an international environment, you might need to construct a
URL from an IRI_ -- very loosely speaking, a URI that can contain Unicode
characters. Quoting and converting an IRI to URI can be a little tricky, so
Django provides some assistance.

    * The function ``django.utils.encoding.iri_to_uri()`` implements the
      conversion from IRI to URI as required by the specification (`RFC
      3987`_).

    * The functions ``django.utils.http.urlquote()`` and
      ``django.utils.http.urlquote_plus()`` are versions of Python's standard
      ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII
      characters. (The data is converted to UTF-8 prior to encoding.)

These two groups of functions have slightly different purposes, and it's
important to keep them straight. Normally, you would use ``urlquote()`` on the
individual portions of the IRI or URI path so that any reserved characters
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
the full IRI and it converts any non-ASCII characters to the correct encoded
values.

.. note::
    Technically, it isn't correct to say that ``iri_to_uri()`` implements the
    full algorithm in the IRI specification. It doesn't (yet) perform the
    international domain name encoding portion of the algorithm.

The ``iri_to_uri()`` function will not change ASCII characters that are
otherwise permitted in a URL. So, for example, the character '%' is not
further encoded when passed to ``iri_to_uri()``. This means you can pass a
full URL to this function and it will not mess up the query string or anything
like that.

An example might clarify things here::

    >>> urlquote(u'Paris & Orléans')
    u'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri(u'/favorites/François/%s' % urlquote(u'Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'

If you look carefully, you can see that the portion that was generated by
``urlquote()`` in the second example was not double-quoted when passed to
``iri_to_uri()``. This is a very important and useful feature. It means that
you can construct your IRI without worrying about whether it contains
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
result.

The ``iri_to_uri()`` function is also idempotent, which means the following is
always true::

    iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)

So you can safely call it multiple times on the same IRI without risking
double-quoting problems.
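The division of labour between the two kinds of function, and the idempotence
property, can be reproduced with Python 3's ``urllib.parse.quote()``. The
function ``iri_to_uri_sketch()`` and its ``SAFE`` character set are
assumptions made for this illustration. They approximate the behaviour
described above and are not Django's code.

```python
from urllib.parse import quote

# Characters left untouched: URL delimiters plus '%', so that existing
# percent-escapes are never quoted a second time. (Assumed set for the sketch.)
SAFE = "/#%[]=:;$&()+,!?*@'~"


def iri_to_uri_sketch(iri):
    """Percent-encode non-ASCII characters as UTF-8; leave the rest alone."""
    return quote(iri, safe=SAFE)


# Quote an individual path segment: the space and '&' are escaped as well.
segment = quote("Paris & Orléans")

# Convert the assembled IRI: only 'ç' still needs encoding at this point.
uri = iri_to_uri_sketch("/favorites/François/%s" % segment)

# A second pass changes nothing, because '%' is in the safe set.
uri_again = iri_to_uri_sketch(uri)
```

Keeping ``%`` in the safe set is what makes the second call a no-op. A
converter that escaped ``%`` would turn ``%20`` into ``%2520`` on every pass.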

.. _URI: http://www.ietf.org/rfc/rfc2396.txt
.. _IRI: http://www.ietf.org/rfc/rfc3987.txt
.. _RFC 3987: IRI_

Models
======

Because all strings are returned from the database as Unicode strings, model
fields that are character based (CharField, TextField, URLField, etc.) will
contain Unicode values when Django retrieves data from the database. This
is *always* the case, even if the data could fit into an ASCII bytestring.

You can pass in bytestrings when creating a model or populating a field, and
Django will convert them to Unicode when it needs to.

Choosing between ``__str__()`` and ``__unicode__()``
----------------------------------------------------

One consequence of using Unicode by default is that you have to take some care
when printing data from the model.

In particular, rather than giving your model a ``__str__()`` method, we
recommend you implement a ``__unicode__()`` method. In the ``__unicode__()``
method, you can quite safely return the values of all your fields without
having to worry about whether they fit into a bytestring or not. (The way
Python works, the result of ``__str__()`` is *always* a bytestring, even if you
accidentally try to return a Unicode object.)

You can still create a ``__str__()`` method on your models if you want, of
course, but you shouldn't need to do this unless you have a good reason.
Django's ``Model`` base class automatically provides a ``__str__()``
implementation that calls ``__unicode__()`` and encodes the result into UTF-8.
This means you'll normally only need to implement a ``__unicode__()`` method
and let Django handle the coercion to a bytestring when required.

Taking care in ``get_absolute_url()``
-------------------------------------

URLs can only contain ASCII characters. If you're constructing a URL from
pieces of data that might be non-ASCII, be careful to encode the results in a
way that is suitable for a URL. The ``django.db.models.permalink()`` decorator
handles this for you automatically.

If you're constructing a URL manually (i.e., *not* using the ``permalink()``
decorator), you'll need to take care of the encoding yourself. In this case,
use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented
above_. For example::

    from django.utils.encoding import iri_to_uri
    from django.utils.http import urlquote

    def get_absolute_url(self):
        url = u'/person/%s/?x=0&y=0' % urlquote(self.location)
        return iri_to_uri(url)

This function returns a correctly encoded URL even if ``self.location`` is
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
call isn't strictly necessary in the above example, because all the
non-ASCII characters would have been removed in quoting in the first line.)
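The parenthetical remark can be verified outside Django. The sketch below uses
Python 3's ``urllib.parse.quote()`` in place of ``urlquote()``, and ``Person``
is a hypothetical plain class standing in for a model with a ``location``
field.

```python
from urllib.parse import quote


class Person:
    """Hypothetical stand-in for a model with a ``location`` field."""

    def __init__(self, location):
        self.location = location

    def get_absolute_url(self):
        # quote() escapes reserved and non-ASCII characters in the segment.
        # The query string is written by hand and is already plain ASCII.
        return "/person/%s/?x=0&y=0" % quote(self.location)


url = Person("Jack visited Paris & Orléans").get_absolute_url()
```

After quoting, the whole URL is pure ASCII, so a later IRI-to-URI pass would
have nothing left to encode. Calling it anyway is a cheap safeguard for the
day someone adds an unquoted piece to the format string.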

.. _above: `URI and IRI handling`_

The database API
================

You can pass either Unicode strings or UTF-8 bytestrings as arguments to
``filter()`` methods and the like in the database API. The following two
querysets are identical::

    qs = People.objects.filter(name__contains=u'Å')
    qs = People.objects.filter(name__contains='\xc3\x85') # UTF-8 encoding of Å
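The byte sequence used in the second query can be checked with plain Python 3
and no Django involved. The variable names are only for illustration.

```python
text = "Å"                                # U+00C5
utf8_bytes = text.encode("utf-8")         # two bytes: 0xC3 0x85
round_trip = utf8_bytes.decode("utf-8")   # back to the original character
```

The second byte, ``0x85``, is a control character in latin1-style encodings,
so it is easily mangled when UTF-8 source is read with the wrong encoding.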

Templates
=========

You can use either Unicode or bytestrings when creating templates manually::

    from django.template import Template
    t1 = Template('This is a bytestring template.')
    t2 = Template(u'This is a Unicode template.')

But the common case is to read templates from the filesystem, and this creates
a slight complication: not all filesystems store their data encoded as UTF-8.
If your template files are not stored with a UTF-8 encoding, set the
``FILE_CHARSET`` setting to the encoding of the files on disk. When Django
reads in a template file, it will convert the data from this encoding to
Unicode. (``FILE_CHARSET`` is set to ``'utf-8'`` by default.)
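The effect of declaring the on-disk encoding can be illustrated with ordinary
file I/O in Python 3. Here ``FILE_CHARSET`` is just a local variable playing
the role of the setting, and the temporary file stands in for a template on
disk. This is a sketch of the idea, not Django's template loader.

```python
import os
import tempfile

FILE_CHARSET = "latin-1"   # hypothetical value mirroring the setting's role

# Write a small template file using the non-UTF-8 encoding.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding=FILE_CHARSET) as f:
    f.write("Bienvenue à Orléans, {{ name }}")

# Reading with the matching encoding yields a proper Unicode string.
with open(path, encoding=FILE_CHARSET) as f:
    template_source = f.read()

# Reading the same bytes as UTF-8 fails, which is why the encoding
# has to be declared somewhere.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

os.remove(path)
```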

The ``DEFAULT_CHARSET`` setting controls the encoding of rendered templates.
This is set to UTF-8 by default.

Template tags and filters
-------------------------

A couple of tips to remember when writing your own template tags and filters:

    * Always return Unicode strings from a template tag's ``render()`` method
      and from template filters.

    * Use ``force_unicode()`` in preference to ``smart_unicode()`` in these
      places. Tag rendering and filter calls occur as the template is being
      rendered, so there is no advantage to postponing the conversion of lazy
      translation objects into strings. It's easier to work solely with Unicode
      strings at that point.

E-mail
======

Django's e-mail framework (in ``django.core.mail``) supports Unicode
transparently. You can use Unicode data in the message bodies and any headers.
However, you're still obligated to respect the requirements of the e-mail
specifications, so, for example, e-mail addresses should use only ASCII
characters.

The following code example demonstrates that everything except e-mail addresses
can be non-ASCII::

    from django.core.mail import EmailMessage

    subject = u'My visit to Sør-Trøndelag'
    sender = u'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>'
    recipients = ['Fred <fred@example.com>']
    body = u'...'
    EmailMessage(subject, body, sender, recipients).send()
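To see what "respecting the e-mail specifications" means at the header level,
the sketch below uses Python 3's standard ``email`` package rather than
``django.core.mail``. It shows a non-ASCII display name and subject being
encoded into ASCII header values, while a non-ASCII address is rejected. The
second address is made up for the demonstration.

```python
from email.header import Header
from email.utils import formataddr

# Non-ASCII display names and subjects are encoded into ASCII header values.
sender = formataddr(("Arnbjörg Ráðormsdóttir", "arnbjorg@example.com"))
subject = Header("My visit to Sør-Trøndelag", "utf-8").encode()

# The address part itself must be ASCII; formataddr() refuses anything else.
try:
    formataddr(("Arnbjörg", "arnbjörg@example.com"))
    address_ok = True
except UnicodeError:
    address_ok = False
```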

Form submission
===============

HTML form submission is a tricky area. There's no guarantee that the
submission will include encoding information, which means the framework might
have to guess at the encoding of submitted data.

Django adopts a "lazy" approach to decoding form data. The data in an
``HttpRequest`` object is only decoded when you access it. In fact, most of
the data is not decoded at all. Only the ``HttpRequest.GET`` and
``HttpRequest.POST`` data structures have any decoding applied to them. Those
two fields will return their members as Unicode data. All other attributes and
methods of ``HttpRequest`` return data exactly as it was submitted by the
client.

By default, the ``DEFAULT_CHARSET`` setting is used as the assumed encoding
for form data. If you need to change this for a particular form, you can set
the ``encoding`` attribute on the ``GET`` and ``POST`` data structures. For
convenience, changing the ``encoding`` property on an ``HttpRequest`` instance
does this for you. For example::

    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.encoding = 'koi8-r'
        ...

You can even change the encoding after having accessed ``request.GET`` or
``request.POST``, and all subsequent accesses will use the new encoding.
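Why the assumed encoding matters can be shown with Python 3's
``urllib.parse``, independent of ``HttpRequest``. The query string below is
constructed in the sketch itself to simulate a legacy client. The parameter
name ``q`` and the sample text are arbitrary.

```python
from urllib.parse import parse_qs, quote

# Simulate a legacy client that percent-encodes its form data as KOI8-R.
raw_query = "q=" + quote("Привет", encoding="koi8-r")

# Decoding with the default assumption (UTF-8) garbles the value ...
as_utf8 = parse_qs(raw_query)["q"][0]

# ... while naming the real encoding recovers the original text.
as_koi8 = parse_qs(raw_query, encoding="koi8-r")["q"][0]
```

The raw bytes are identical in both calls. Only the declared encoding
differs, which is why it can be changed after the fact and the data decoded
again.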

Most developers won't need to worry about changing form encoding, but this is
a useful feature for applications that talk to legacy systems whose encoding
you cannot control.

Django does not decode the data of file uploads, because that data is normally
treated as collections of bytes, rather than strings. Any automatic decoding
there would alter the meaning of the stream of bytes.