Context Navigation

← Previous Ticket
Next Ticket →

#10449 assigned Bug

HTML accents not escaped out when using forms

Reported by:	tipan	Owned by:	Abhishek Mane
Component:	Internationalization	Version:	1.0
Severity:	Normal	Keywords:	accents, newforms
Cc:	Guille López	Triage Stage:	Accepted
Has patch:	no	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description (last modified by Alex Gaynor)

When passing a translated string to a select box using a Forms Select widget, the output string still contains accent HTML entity codes.

For example: "Problemas para el Inicio de Sesión o relacionados con la Cuenta

My code is set up as follows:

Models.py

from django.utils import translation 
from django.utils.translation import ugettext_lazy as _

SUBJECT = ( 
    (10, _('Log-in or Account related problems')), 
    (20, _('General enquiry')), 
    (30, _('Other feedback')), 
)

This is then used in a form:

class HelpFeedbackForm(forms.Form): 
        #form for user to submit their feedback when not logged in 
                subject = forms.CharField(widget=forms.Select(choices=SUBJECT))

This is then presented in the template as:
{{ form.subject }}

which produces: "Problemas para el Inicio de Sesión o relacionados con la Cuenta" as mentioned above.

The translations are returned from the django.po file, but all of the HTML entity codes are returned as is, in the Select box in the template.

Also, use of the safe template filter tag does affect this and the string is still rendered with accents.

Change History (21)

comment:1 by Alex Gaynor, 17 years ago

Description:	modified (diff)

Please use the preview button.

comment:2 by arien, 17 years ago

Resolution:	→ invalid
Status:	new → closed

There is no bug here.

Django converts certain characters to the corresponding HTML entities (a.k.a. "automatic HTML escaping") and offers a couple of ways to disable this behaviour, including the safe filter. You seem to think that Django will do the conversion in the other direction, but that is not the case.

Note that Django marks translation strings as being "safe" (you'd see "Sessi&oacute;n" instead of "Sessión" otherwise), so applying the safe filter doesn't change things. (The safe filter prevent auto-escaping, it doesn't "unescape".)

Using this in your .po file, things will Just Work:

msgid "Log-in or Account related problems"
msgstr "Problemas para el Inicio de Sesión o relacionados con la Cuenta"

Last edited 10 years ago by Philip James (previous) (diff)

comment:3 by Malcolm Tredinnick, 17 years ago

Resolution:	invalid
Status:	closed → reopened
Triage Stage:	Unreviewed → Accepted

Based on testing the reported problem, Django is misbehaving here. The output produced looks like this:

In [4]: print forms.FeedbackHelpForm()
<tr><th><label for="id_subject">Subject:</label></th><td><select name="subject" id="id_subject">
<option value="10">Problemas para el Inicio de Sesi&amp;oacute;n o relacionados con la Cuenta</option>
<option value="20">General enquiry</option>
<option value="30">Other feedback</option>
</select></td></tr>

which is incorrect. The & part shows double-escaping going on.

comment:4 by Malcolm Tredinnick, 17 years ago

I was having a look at this, just for fun, whilst something else was going on. It's surprisingly deep in the system and requires some care to fix.

When I wrote the auto-escaping code, I was apparently high on glue at some stage and went a little crazy throwing in escape() and conditional_escape() calls to every render() and render_*() method in django/forms/widgets.py. Unfortunately, things aren't that simple. We should only be escaping the values that have potentially come from user-supplied input in the widgets' output methods. For widgets like Select* and Checkbox*, the values and display strings can be treated as "safe".

Fix requires

Going through the widgets and working out the ones that will never take user-supplied values directly (the case we're protecting against is re-rendering a form after errors have occurred, using the previous form input data). They can have the escape() calls removed.
Possibly adding some documentation to explain how autoescaping values are treated in form widgets. Probably all that's needed her is identifying the widgets whose values are escaped on output. Those are the ones where supplying an initial value might require wrapping the value in a safe() call in developer code, so I want to explain that to people who read the documentation.
Lots of tests. Probably one new test for every widget, shoving some data into it that requires escaping (say, "<") and checking that it is escaped in only the right places.

This isn't going to be fully backwards-compatible (much as with the fix to the "join" filter in r9442) and some people including raw "&" characters in their select option values will need to replace them with HTML entities. The current behaviour is a bug, though, and prevents some valid behaviour. So we just explain that the original coder was an idiot and we fixed it now.

comment:5 by arien, 17 years ago

Needs documentation:	set
Needs tests:	set

I'm not sure what I was smoking when I checked this and wrote comment:2, but I doubt it was legal...

I think using HTML entities (or other markup) in a msgstr when there are none in the msgid is begging for trouble. By doing this you are assuming the msgstr will be used in a certain type of context when there is no assumption for the msgid. (For example, the msgstr in the report is only suited for HTML, while the msgid could be use in the subject of an email.)

That said, it turns out that Django doesn't mark all translated strings as "safe". In django.utils.translation.trans_real, do_translate will only mark the translation as "safe" if the string to be translated was "safe", and do_ntranslate will never mark the translation as "safe". The inconsistency between these two should be fixed.

Solving this at the widget level seems wrong to me. The widgets don't know if the data is "safe" or not, it could come from anywhere. (The choices for a Select widget may come from the queryset of a ModelChoiceField, for example.) It would also result in inconsistency in the auto escaping behaviour of different widgets. Seems to me, the original coder was onto somehting, not on somehting... :-)

That leaves the question if all translation functions should return "safe" strings all the time, or not. I'm not sure they should. For the *gettext_lazy functions it might make sense to always return "safe" strings, but I don't think it does for the *gettext versions.

comment:6 by Malcolm Tredinnick, 17 years ago

It is implicit in the translation infrastructure that we know whether the strings are intended for HTML output or not. We can't differentiate them, that's true (although we are going to add the ability to include contextual comments in the PO files). And if the same string is going to be used in two different situations (from two places in teh code), we have to make sure it's a different string in each place (a limitation of gettext being that it always collapses identical occurences of the same msgid, since it doesn't assume msgid's make sense on their own). But that's life in the big city. Translation strings should be written so that they are safe for including in HTML if they are part of the forms module, etc. Yes, we don't make this clear, but it basically never comes up, so that's just an oversight.

Widgets can easily understand safe or not. The only issue is whether the data came directly from user supplied input via a form redisplay. If developer is including user supplied input via some other method it is up to the developer to be careful. Auto-escaping is not Perl's "taint" operation. It's a much more limited and well-defined operation, but many of the same principles apply. So we should still solve this as I suggested, since form widgets are definitely dealing with HTML output directly.

The question of safe return values from gettext is an interesting one. I already know we need to document the assumptions on translations (although I have a feeling they are buried somewhere in the documentation or the i18n mailing list, but it needs to be very explicit). But that's not an issue for this ticket. There are limitations of the gettext framework here that we have to work with and this is a case where picking separate msgids for separate purposes is the appropriate solution.

comment:7 by arien, 17 years ago

Apparently I haven't expressed myself clearly; let me try again.

IMHO it's bad practice to use markup like HTML entities in a msgstr when there's no such markup in the corresponding msgid.

In many (all?) cases, there is no need to use markup when translating a string that doesn't contain markup. For example, sesión is just another way of writing sesión in HTML.
Using markup restricts possible usage of the target language strings with respect to the possible usage of the source language string. (sesión can be used no problem in XML, CSV, email, PDF, etc. Results for sesión are somewhat different.)
In many cases it's not trivial to determine if the message strings are intended to be used solely in, e.g., HTML output (that is, impossible from the PO file alone). A models module has many examples: verbose_name and verbose_name_plural, verbose_name for model fields, FOO_CHOICES. A PO file will not show any other places where these are used.
Even if you can easily determine what the current usage of a certain message string is, what happens in the future if the translation string is used in other output formats? For example, some FOO_CHOICES that were previously only used in HTML now need to be used in emails' subjects as well. That's rewriting translation strings right there.

Anyway, that isn't the issue. (It would have prevented this particular instance, though.) The issue is that translated strings are sometimes escaped when they shouldn't be, as demonstrated by the reported example.

The solution isn't changing some widgets so that they assume that data they're fed never needs to be escaped. The widgets don't know if it's user supplied data (they can't know) and shouldn't care. The data should be assumed to need escaping unless the data has been marked as safe by something outside the widget. (Not to mention the inconsistency in auto-escaping behaviour between, say, Select and TextInput that would result from the proposed solution.)

The solution is marking translated strings as "safe" the way do_translate does if it's fed SafeData, and probably changing do_ntranslate so that it works the same way. When to mark translated strings as safe is exactly what this ticket is about.

Anyway, I think the reporter probably thought that Django would convert HTML entities in translation string to the characters they represent (see also his problem with HTML entities and email in the same thread) and has changed his PO file as suggested while we're discussing something that is really a non-issue.

comment:8 by Malcolm Tredinnick, 17 years ago

You did express yourself clearly. I just don't agree with you and there's no reason for that requirement. The solution is to fix the widgets because they absolutely do know and no, assuming escaping was the mistake I made originally. It's only required in very specific situations.

Talking about "safe" and "unsafe" is confusing this conversation, since is a template auto-escaping concept, not an issue with constant strings in Python. The resulting string returned from widget's render() method will always be safe from auto-escaping, since it contains raw HTML. The translated strings count as constants, not user-supplied data, so they shouldn't be escaped. Please stop fighting me on this one: we're being very consistent throughout Django on this issue and the current case was just an oversight on my part.

comment:9 by arien, 17 years ago

Malcolm, I'm not fighting you. I disagree with your solution, that is all.

If and when anybody wants to have a stab at this, it may be nice to know that changing the translation functions so that they return SafeData fixes this particular issue, while passing all but a few tests in Django's suite. The tests that don't pass (in django/contrib/auth/tests and tests/regressiontests/admin_views) assume that translated strings will be escaped on output. (See django.contrib.auth.tests.PasswordResetTest.test_email_not_found for an example, escaping single quotes.)

I think I'm done with this particular ticket. *sigh*

comment:10 by Chris Beaven, 15 years ago

Severity:	→ Normal
Summary:	Bug: HTML accents not escaped out when using newforms → HTML accents not escaped out when using newforms
Triage Stage:	Accepted → Design decision needed
Type:	→ Bug

I'm bumping this issue back to a design decision. Not sure I agree with malcolm's logic either - even though he is the master of all things escape-related.

It also seems like this late in the game it's a rather large backwards incompatibility to introduce (even if the current behaviour isn't explicitly documented).

comment:11 by Aymeric Augustin, 14 years ago

UI/UX:	unset

Change UI/UX from NULL to False.

comment:12 by Aymeric Augustin, 14 years ago

Easy pickings:	unset

Change Easy pickings from NULL to False.

comment:13 by Aymeric Augustin, 13 years ago

Status:	reopened → new

comment:14 by Aymeric Augustin, 13 years ago

Triage Stage:	Design decision needed → Accepted

Whoever writes the patch can choose to implement Malcolm's design (he seemed rather confident) or something else.

comment:15 by Collin Anderson, 12 years ago

Summary:	HTML accents not escaped out when using newforms → HTML accents not escaped out when using forms

comment:16 by Guille López, 8 years ago

Is this bug unsolved yet? I would like to claim this to finally fix it.

comment:17 by Guille López, 8 years ago

Cc:	Guille López added

comment:18 by Tim Graham, 8 years ago

Yes, the ticket is open which means the issue isn't resolved.

comment:19 by Abhishek Mane, 4 months ago

Has patch:	set
Needs tests:	unset
Owner:	changed from nobody to Abhishek Mane
Status:	new → assigned

I investigated this issue and found that modern Django's template-based widget rendering no longer suffers from this double-escaping bug, provided the translated strings are explicitly marked as safe. >
I have submitted a PR that adds a regression test to test_select.py to prove mark_safe correctly prevents double-escaping in Select choices. This fulfills the "Needs tests" requirement and proves the bug is resolved so this ticket can finally be closed: https://github.com/django/django/pull/20871

comment:20 by Jacob Walls, 4 months ago

Has patch:	unset
Needs documentation:	unset

PR from comment:19 just documented (via a test case) calling mark_safe() on hard-coded values sporting HTML entities, whereas I took the point to be that translation strings might or might not contain HTML entities after resolving them to local values.

Last edited 4 months ago by Jacob Walls (previous) (diff)

comment:21 by Abhishek Mane, 4 months ago

Hi Jacob, thanks for the review and for closing the PR. Your point makes complete sense,I was definitely just memorializing a gotcha rather than fixing the root issue. You are completely right that forcing developers to sprinkle mark_safe() everywhere is terrible DX and opens up massive XSS risks if a .po file contains malicious HTML tags.

I want to take a step back and figure out the right architectural approach before writing a new patch. As I see it, we have three potential paths forward:

Selective Entity Decoding: Modify the translation wrapper to only unescape standard safe HTML entities (like & or ©) from the localized string, while keeping actual HTML tags (like <script>) strictly escaped.

A Dedicated Template Filter: Introduce a new filter (e.g., |safe_entities) that developers can use in templates when they suspect a translator might have injected entities. This avoids turning off full auto-escaping via mark_safe().

WONTFIX / Documentation Only: Is this fundamentally a translator education issue? If a translator uses a literal & instead of & in the .po file, Django's auto-escaper handles it perfectly in the template. Should we just officially document that translators must use literal characters rather than HTML entities unless the developer specifically intended the string to contain HTML?

I'd love to get the team's thoughts on which of these directions (or another one I missed) is the right path forward so I can put together a proper PR.

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues