id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
30481	Document that force_str() allows lone surrogates.	Adam Hooper	nobody	"{{{
$ python3
Python 3.7.3 (default, Mar 27 2019, 13:36:35)
[GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
Type ""help"", ""copyright"", ""credits"" or ""license"" for more information.

>>> invalid_text = '\ud802\udf12'
>>> print(invalid_text)  # we'd expect this to fail
Traceback (most recent call last):
  File ""<stdin>"", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

>>> import django.utils.encoding
>>> django.VERSION
(2, 2, 0, 'alpha', 1)

>>> valid_text = django.utils.encoding.force_text(invalid_text)
>>> print(valid_text)  # we'd expect this to succeed?
Traceback (most recent call last):
  File ""<stdin>"", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

>>> valid_text
'\ud802\udf12'
}}}

Perhaps this is a flaw in my expectations? I'd expect `force_text()`'s output to always be valid text, even though Python allows me to create _non-text_ `str` objects. (In this case, I'd expect something like `\ufffd\ufffd`, i.e. two Unicode replacement characters, one per surrogate.)
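
To make that expectation concrete, here is a minimal sketch of the substitution, not anything Django does today. `replace_surrogates()` is a hypothetical helper name invented for this illustration; it only assumes that every code point in the surrogate range U+D800..U+DFFF should become U+FFFD.

```python
import re

# Hypothetical helper, not part of Django or the standard library.
# Matches any single code point in the surrogate range U+D800..U+DFFF.
_SURROGATE_RE = re.compile(r'[\ud800-\udfff]')


def replace_surrogates(text):
    # Swap each surrogate code point for U+FFFD (REPLACEMENT CHARACTER),
    # leaving every other character untouched.
    return _SURROGATE_RE.sub('\ufffd', text)


cleaned = replace_surrogates('\ud802\udf12')
# Unlike the original string, the cleaned one can be encoded as UTF-8.
encoded = cleaned.encode('utf-8')
```

With a helper along these lines, `print(replace_surrogates(invalid_text))` should no longer raise `UnicodeEncodeError`, at the cost of silently discarding whatever the surrogates were meant to represent.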

Unicode primer: `\ud802` is a ""lone surrogate"" in this context. A lone surrogate is a valid Unicode _code point_ but it does not represent _text_. (Lone surrogates can crop up if someone decodes valid UCS-2 as UTF-16.) I don't think any caller of `force_text()` expects it to ever return a non-textual Unicode string."	Cleanup/optimization	closed	Documentation	2.2	Normal	wontfix	force_text unicode		Accepted	0	0	0	0	0	0