Code

Opened 5 years ago

Last modified 13 months ago

#10449 new Bug

HTML accents not escaped out when using newforms

Reported by: tipan Owned by: nobody
Component: Internationalization Version: 1.0
Severity: Normal Keywords: accents, newforms
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: yes
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Alex)

When passing a translated string to a select box using a Forms Select widget, the output string still contains accent HTML entity codes.

For example: "Problemas para el Inicio de Sesión o relacionados con la Cuenta

My code is set up as follows:

Models.py

from django.utils import translation 
from django.utils.translation import ugettext_lazy as _

SUBJECT = ( 
    (10, _('Log-in or Account related problems')), 
    (20, _('General enquiry')), 
    (30, _('Other feedback')), 
)

This is then used in a form:

class HelpFeedbackForm(forms.Form): 
        #form for user to submit their feedback when not logged in 
                subject = forms.CharField(widget=forms.Select(choices=SUBJECT)) 

This is then presented in the template as:
{{ form.subject }}

which produces: "Problemas para el Inicio de Sesión o relacionados con la Cuenta" as mentioned above.

The translations are returned from the django.po file, but all of the HTML entity codes are returned as is, in the Select box in the template.

Also, use of the safe template filter tag does affect this and the string is still rendered with accents.

Attachments (0)

Change History (14)

comment:1 Changed 5 years ago by Alex

  • Description modified (diff)
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Please use the preview button.

comment:2 Changed 5 years ago by arien

  • Resolution set to invalid
  • Status changed from new to closed

There is no bug here.

Django converts certain characters to the corresponding HTML entities (a.k.a. "automatic HTML escaping") and offers a couple of ways to disable this behaviour, including the safe filter. You seem to think that Django will do the conversion in the other direction, but that is not the case.

Note that Django marks translation strings as being "safe" (you'd see "Sessión" instead of "Sessión" otherwise), so applying the safe filter doesn't change things. (The safe filter prevent auto-escaping, it doesn't "unescape".)

Using this in your .po file, things will Just Work:

msgid "Log-in or Account related problems"
msgstr "Problemas para el Inicio de Sesión o relacionados con la Cuenta"

comment:3 Changed 5 years ago by mtredinnick

  • Resolution invalid deleted
  • Status changed from closed to reopened
  • Triage Stage changed from Unreviewed to Accepted

Based on testing the reported problem, Django is misbehaving here. The output produced looks like this:

In [4]: print forms.FeedbackHelpForm()
<tr><th><label for="id_subject">Subject:</label></th><td><select name="subject" id="id_subject">
<option value="10">Problemas para el Inicio de Sesi&amp;oacute;n o relacionados con la Cuenta</option>
<option value="20">General enquiry</option>
<option value="30">Other feedback</option>
</select></td></tr>

which is incorrect. The &amp; part shows double-escaping going on.

comment:4 Changed 5 years ago by mtredinnick

I was having a look at this, just for fun, whilst something else was going on. It's surprisingly deep in the system and requires some care to fix.

When I wrote the auto-escaping code, I was apparently high on glue at some stage and went a little crazy throwing in escape() and conditional_escape() calls to every render() and render_*() method in django/forms/widgets.py. Unfortunately, things aren't that simple. We should only be escaping the values that have potentially come from user-supplied input in the widgets' output methods. For widgets like Select* and Checkbox*, the values and display strings can be treated as "safe".

Fix requires

  1. Going through the widgets and working out the ones that will never take user-supplied values directly (the case we're protecting against is re-rendering a form after errors have occurred, using the previous form input data). They can have the escape() calls removed.
  2. Possibly adding some documentation to explain how autoescaping values are treated in form widgets. Probably all that's needed her is identifying the widgets whose values are escaped on output. Those are the ones where supplying an initial value might require wrapping the value in a safe() call in developer code, so I want to explain that to people who read the documentation.
  3. Lots of tests. Probably one new test for every widget, shoving some data into it that requires escaping (say, "<") and checking that it is escaped in only the right places.

This isn't going to be fully backwards-compatible (much as with the fix to the "join" filter in r9442) and some people including raw "&" characters in their select option values will need to replace them with HTML entities. The current behaviour is a bug, though, and prevents some valid behaviour. So we just explain that the original coder was an idiot and we fixed it now.

comment:5 Changed 5 years ago by arien

  • Needs documentation set
  • Needs tests set

I'm not sure what I was smoking when I checked this and wrote comment:2, but I doubt it was legal...

I think using HTML entities (or other markup) in a msgstr when there are none in the msgid is begging for trouble. By doing this you are assuming the msgstr will be used in a certain type of context when there is no assumption for the msgid. (For example, the msgstr in the report is only suited for HTML, while the msgid could be use in the subject of an email.)

That said, it turns out that Django doesn't mark all translated strings as "safe". In django.utils.translation.trans_real, do_translate will only mark the translation as "safe" if the string to be translated was "safe", and do_ntranslate will never mark the translation as "safe". The inconsistency between these two should be fixed.

Solving this at the widget level seems wrong to me. The widgets don't know if the data is "safe" or not, it could come from anywhere. (The choices for a Select widget may come from the queryset of a ModelChoiceField, for example.) It would also result in inconsistency in the auto escaping behaviour of different widgets. Seems to me, the original coder was onto somehting, not on somehting... :-)

That leaves the question if all translation functions should return "safe" strings all the time, or not. I'm not sure they should. For the *gettext_lazy functions it might make sense to always return "safe" strings, but I don't think it does for the *gettext versions.

comment:6 Changed 5 years ago by mtredinnick

It is implicit in the translation infrastructure that we know whether the strings are intended for HTML output or not. We can't differentiate them, that's true (although we are going to add the ability to include contextual comments in the PO files). And if the same string is going to be used in two different situations (from two places in teh code), we have to make sure it's a different string in each place (a limitation of gettext being that it always collapses identical occurences of the same msgid, since it doesn't assume msgid's make sense on their own). But that's life in the big city. Translation strings should be written so that they are safe for including in HTML if they are part of the forms module, etc. Yes, we don't make this clear, but it basically never comes up, so that's just an oversight.

Widgets can easily understand safe or not. The only issue is whether the data came directly from user supplied input via a form redisplay. If developer is including user supplied input via some other method it is up to the developer to be careful. Auto-escaping is not Perl's "taint" operation. It's a much more limited and well-defined operation, but many of the same principles apply. So we should still solve this as I suggested, since form widgets are definitely dealing with HTML output directly.

The question of safe return values from gettext is an interesting one. I already know we need to document the assumptions on translations (although I have a feeling they are buried somewhere in the documentation or the i18n mailing list, but it needs to be very explicit). But that's not an issue for this ticket. There are limitations of the gettext framework here that we have to work with and this is a case where picking separate msgids for separate purposes is the appropriate solution.

comment:7 Changed 5 years ago by arien

Apparently I haven't expressed myself clearly; let me try again.

IMHO it's bad practice to use markup like HTML entities in a msgstr when there's no such markup in the corresponding msgid.

  • In many (all?) cases, there is no need to use markup when translating a string that doesn't contain markup. For example, sesi&oacute;n is just another way of writing sesión in HTML.
  • Using markup restricts possible usage of the target language strings with respect to the possible usage of the source language string. (sesión can be used no problem in XML, CSV, email, PDF, etc. Results for sesi&oacute;n are somewhat different.)
  • In many cases it's not trivial to determine if the message strings are intended to be used solely in, e.g., HTML output (that is, impossible from the PO file alone). A models module has many examples: verbose_name and verbose_name_plural, verbose_name for model fields, FOO_CHOICES. A PO file will not show any other places where these are used.
  • Even if you can easily determine what the current usage of a certain message string is, what happens in the future if the translation string is used in other output formats? For example, some FOO_CHOICES that were previously only used in HTML now need to be used in emails' subjects as well. That's rewriting translation strings right there.

Anyway, that isn't the issue. (It would have prevented this particular instance, though.) The issue is that translated strings are sometimes escaped when they shouldn't be, as demonstrated by the reported example.

The solution isn't changing some widgets so that they assume that data they're fed never needs to be escaped. The widgets don't know if it's user supplied data (they can't know) and shouldn't care. The data should be assumed to need escaping unless the data has been marked as safe by something outside the widget. (Not to mention the inconsistency in auto-escaping behaviour between, say, Select and TextInput that would result from the proposed solution.)

The solution is marking translated strings as "safe" the way do_translate does if it's fed SafeData, and probably changing do_ntranslate so that it works the same way. When to mark translated strings as safe is exactly what this ticket is about.

Anyway, I think the reporter probably thought that Django would convert HTML entities in translation string to the characters they represent (see also his problem with HTML entities and email in the same thread) and has changed his PO file as suggested while we're discussing something that is really a non-issue.

comment:8 Changed 5 years ago by mtredinnick

You did express yourself clearly. I just don't agree with you and there's no reason for that requirement. The solution is to fix the widgets because they absolutely do know and no, assuming escaping was the mistake I made originally. It's only required in very specific situations.

Talking about "safe" and "unsafe" is confusing this conversation, since is a template auto-escaping concept, not an issue with constant strings in Python. The resulting string returned from widget's render() method will always be safe from auto-escaping, since it contains raw HTML. The translated strings count as constants, not user-supplied data, so they shouldn't be escaped. Please stop fighting me on this one: we're being very consistent throughout Django on this issue and the current case was just an oversight on my part.

comment:9 Changed 5 years ago by arien

Malcolm, I'm not fighting you. I disagree with your solution, that is all.

If and when anybody wants to have a stab at this, it may be nice to know that changing the translation functions so that they return SafeData fixes this particular issue, while passing all but a few tests in Django's suite. The tests that don't pass (in django/contrib/auth/tests and tests/regressiontests/admin_views) assume that translated strings will be escaped on output. (See django.contrib.auth.tests.PasswordResetTest.test_email_not_found for an example, escaping single quotes.)

I think I'm done with this particular ticket. *sigh*

comment:10 Changed 3 years ago by SmileyChris

  • Severity set to Normal
  • Summary changed from Bug: HTML accents not escaped out when using newforms to HTML accents not escaped out when using newforms
  • Triage Stage changed from Accepted to Design decision needed
  • Type set to Bug

I'm bumping this issue back to a design decision. Not sure I agree with malcolm's logic either - even though he is the master of all things escape-related.

It also seems like this late in the game it's a rather large backwards incompatibility to introduce (even if the current behaviour isn't explicitly documented).

comment:11 Changed 2 years ago by aaugustin

  • UI/UX unset

Change UI/UX from NULL to False.

comment:12 Changed 2 years ago by aaugustin

  • Easy pickings unset

Change Easy pickings from NULL to False.

comment:13 Changed 13 months ago by aaugustin

  • Status changed from reopened to new

comment:14 Changed 13 months ago by aaugustin

  • Triage Stage changed from Design decision needed to Accepted

Whoever writes the patch can choose to implement Malcolm's design (he seemed rather confident) or something else.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as new
The owner will be changed from nobody to anonymous. Next status will be 'assigned'
as The resolution will be set. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.