Opened 16 years ago
Last modified 6 years ago
#10449 new Bug
HTML accents not escaped out when using forms
Reported by: | tipan | Owned by: | nobody |
---|---|---|---|
Component: | Internationalization | Version: | 1.0 |
Severity: | Normal | Keywords: | accents, newforms |
Cc: | Guille López | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | yes |
Needs tests: | yes | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
When passing a translated string to a select box using a Forms Select widget, the output string still contains accent HTML entity codes.
For example: "Problemas para el Inicio de Sesión o relacionados con la Cuenta
My code is set up as follows:
Models.py
from django.utils import translation from django.utils.translation import ugettext_lazy as _ SUBJECT = ( (10, _('Log-in or Account related problems')), (20, _('General enquiry')), (30, _('Other feedback')), )
This is then used in a form:
class HelpFeedbackForm(forms.Form): #form for user to submit their feedback when not logged in subject = forms.CharField(widget=forms.Select(choices=SUBJECT))
This is then presented in the template as:
{{ form.subject }}
which produces: "Problemas para el Inicio de Sesión o relacionados con la Cuenta" as mentioned above.
The translations are returned from the django.po file, but all of the HTML entity codes are returned as is, in the Select box in the template.
Also, use of the safe template filter tag does affect this and the string is still rendered with accents.
Change History (18)
comment:1 by , 16 years ago
Description: | modified (diff) |
---|
comment:2 by , 16 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
There is no bug here.
Django converts certain characters to the corresponding HTML entities (a.k.a. "automatic HTML escaping") and offers a couple of ways to disable this behaviour, including the safe
filter. You seem to think that Django will do the conversion in the other direction, but that is not the case.
Note that Django marks translation strings as being "safe" (you'd see "Sessión" instead of "Sessión" otherwise), so applying the safe
filter doesn't change things. (The safe
filter prevent auto-escaping, it doesn't "unescape".)
Using this in your .po
file, things will Just Work:
msgid "Log-in or Account related problems" msgstr "Problemas para el Inicio de Sesión o relacionados con la Cuenta"
comment:3 by , 16 years ago
Resolution: | invalid |
---|---|
Status: | closed → reopened |
Triage Stage: | Unreviewed → Accepted |
Based on testing the reported problem, Django is misbehaving here. The output produced looks like this:
In [4]: print forms.FeedbackHelpForm() <tr><th><label for="id_subject">Subject:</label></th><td><select name="subject" id="id_subject"> <option value="10">Problemas para el Inicio de Sesi&oacute;n o relacionados con la Cuenta</option> <option value="20">General enquiry</option> <option value="30">Other feedback</option> </select></td></tr>
which is incorrect. The &
part shows double-escaping going on.
comment:4 by , 16 years ago
I was having a look at this, just for fun, whilst something else was going on. It's surprisingly deep in the system and requires some care to fix.
When I wrote the auto-escaping code, I was apparently high on glue at some stage and went a little crazy throwing in escape()
and conditional_escape()
calls to every render()
and render_*()
method in django/forms/widgets.py
. Unfortunately, things aren't that simple. We should only be escaping the values that have potentially come from user-supplied input in the widgets' output methods. For widgets like Select*
and Checkbox*
, the values and display strings can be treated as "safe".
Fix requires
- Going through the widgets and working out the ones that will never take user-supplied values directly (the case we're protecting against is re-rendering a form after errors have occurred, using the previous form input data). They can have the
escape()
calls removed. - Possibly adding some documentation to explain how autoescaping values are treated in form widgets. Probably all that's needed her is identifying the widgets whose values are escaped on output. Those are the ones where supplying an initial value might require wrapping the value in a
safe()
call in developer code, so I want to explain that to people who read the documentation. - Lots of tests. Probably one new test for every widget, shoving some data into it that requires escaping (say, "<") and checking that it is escaped in only the right places.
This isn't going to be fully backwards-compatible (much as with the fix to the "join" filter in r9442) and some people including raw "&" characters in their select option values will need to replace them with HTML entities. The current behaviour is a bug, though, and prevents some valid behaviour. So we just explain that the original coder was an idiot and we fixed it now.
comment:5 by , 16 years ago
Needs documentation: | set |
---|---|
Needs tests: | set |
I'm not sure what I was smoking when I checked this and wrote comment:2, but I doubt it was legal...
I think using HTML entities (or other markup) in a msgstr
when there are none in the msgid
is begging for trouble. By doing this you are assuming the msgstr
will be used in a certain type of context when there is no assumption for the msgid
. (For example, the msgstr
in the report is only suited for HTML, while the msgid
could be use in the subject of an email.)
That said, it turns out that Django doesn't mark all translated strings as "safe". In django.utils.translation.trans_real
, do_translate
will only mark the translation as "safe" if the string to be translated was "safe", and do_ntranslate
will never mark the translation as "safe". The inconsistency between these two should be fixed.
Solving this at the widget level seems wrong to me. The widgets don't know if the data is "safe" or not, it could come from anywhere. (The choices
for a Select
widget may come from the queryset
of a ModelChoiceField
, for example.) It would also result in inconsistency in the auto escaping behaviour of different widgets. Seems to me, the original coder was onto somehting, not on somehting... :-)
That leaves the question if all translation functions should return "safe" strings all the time, or not. I'm not sure they should. For the *gettext_lazy
functions it might make sense to always return "safe" strings, but I don't think it does for the *gettext
versions.
comment:6 by , 16 years ago
It is implicit in the translation infrastructure that we know whether the strings are intended for HTML output or not. We can't differentiate them, that's true (although we are going to add the ability to include contextual comments in the PO files). And if the same string is going to be used in two different situations (from two places in teh code), we have to make sure it's a different string in each place (a limitation of gettext being that it always collapses identical occurences of the same msgid, since it doesn't assume msgid's make sense on their own). But that's life in the big city. Translation strings should be written so that they are safe for including in HTML if they are part of the forms module, etc. Yes, we don't make this clear, but it basically never comes up, so that's just an oversight.
Widgets can easily understand safe or not. The only issue is whether the data came directly from user supplied input via a form redisplay. If developer is including user supplied input via some other method it is up to the developer to be careful. Auto-escaping is not Perl's "taint" operation. It's a much more limited and well-defined operation, but many of the same principles apply. So we should still solve this as I suggested, since form widgets are definitely dealing with HTML output directly.
The question of safe return values from gettext is an interesting one. I already know we need to document the assumptions on translations (although I have a feeling they are buried somewhere in the documentation or the i18n mailing list, but it needs to be very explicit). But that's not an issue for this ticket. There are limitations of the gettext framework here that we have to work with and this is a case where picking separate msgids for separate purposes is the appropriate solution.
comment:7 by , 16 years ago
Apparently I haven't expressed myself clearly; let me try again.
IMHO it's bad practice to use markup like HTML entities in a msgstr
when there's no such markup in the corresponding msgid
.
- In many (all?) cases, there is no need to use markup when translating a string that doesn't contain markup. For example,
sesión
is just another way of writingsesión
in HTML. - Using markup restricts possible usage of the target language strings with respect to the possible usage of the source language string. (
sesión
can be used no problem in XML, CSV, email, PDF, etc. Results forsesión
are somewhat different.) - In many cases it's not trivial to determine if the message strings are intended to be used solely in, e.g., HTML output (that is, impossible from the PO file alone). A
models
module has many examples:verbose_name
andverbose_name_plural
,verbose_name
for model fields,FOO_CHOICES
. A PO file will not show any other places where these are used. - Even if you can easily determine what the current usage of a certain message string is, what happens in the future if the translation string is used in other output formats? For example, some
FOO_CHOICES
that were previously only used in HTML now need to be used in emails' subjects as well. That's rewriting translation strings right there.
Anyway, that isn't the issue. (It would have prevented this particular instance, though.) The issue is that translated strings are sometimes escaped when they shouldn't be, as demonstrated by the reported example.
The solution isn't changing some widgets so that they assume that data they're fed never needs to be escaped. The widgets don't know if it's user supplied data (they can't know) and shouldn't care. The data should be assumed to need escaping unless the data has been marked as safe by something outside the widget. (Not to mention the inconsistency in auto-escaping behaviour between, say, Select
and TextInput
that would result from the proposed solution.)
The solution is marking translated strings as "safe" the way do_translate
does if it's fed SafeData
, and probably changing do_ntranslate
so that it works the same way. When to mark translated strings as safe is exactly what this ticket is about.
Anyway, I think the reporter probably thought that Django would convert HTML entities in translation string to the characters they represent (see also his problem with HTML entities and email in the same thread) and has changed his PO file as suggested while we're discussing something that is really a non-issue.
comment:8 by , 16 years ago
You did express yourself clearly. I just don't agree with you and there's no reason for that requirement. The solution is to fix the widgets because they absolutely do know and no, assuming escaping was the mistake I made originally. It's only required in very specific situations.
Talking about "safe" and "unsafe" is confusing this conversation, since is a template auto-escaping concept, not an issue with constant strings in Python. The resulting string returned from widget's render()
method will always be safe from auto-escaping, since it contains raw HTML. The translated strings count as constants, not user-supplied data, so they shouldn't be escaped. Please stop fighting me on this one: we're being very consistent throughout Django on this issue and the current case was just an oversight on my part.
comment:9 by , 16 years ago
Malcolm, I'm not fighting you. I disagree with your solution, that is all.
If and when anybody wants to have a stab at this, it may be nice to know that changing the translation functions so that they return SafeData
fixes this particular issue, while passing all but a few tests in Django's suite. The tests that don't pass (in django/contrib/auth/tests
and tests/regressiontests/admin_views
) assume that translated strings will be escaped on output. (See django.contrib.auth.tests.PasswordResetTest.test_email_not_found
for an example, escaping single quotes.)
I think I'm done with this particular ticket. *sigh*
comment:10 by , 14 years ago
Severity: | → Normal |
---|---|
Summary: | Bug: HTML accents not escaped out when using newforms → HTML accents not escaped out when using newforms |
Triage Stage: | Accepted → Design decision needed |
Type: | → Bug |
I'm bumping this issue back to a design decision. Not sure I agree with malcolm's logic either - even though he is the master of all things escape-related.
It also seems like this late in the game it's a rather large backwards incompatibility to introduce (even if the current behaviour isn't explicitly documented).
comment:13 by , 12 years ago
Status: | reopened → new |
---|
comment:14 by , 12 years ago
Triage Stage: | Design decision needed → Accepted |
---|
Whoever writes the patch can choose to implement Malcolm's design (he seemed rather confident) or something else.
comment:15 by , 10 years ago
Summary: | HTML accents not escaped out when using newforms → HTML accents not escaped out when using forms |
---|
comment:17 by , 6 years ago
Cc: | added |
---|
Please use the preview button.