#6533 closed (fixed)
Syndication Framework: Item titles and descriptions double-escaped
Reported by: | miracle2k | Owned by: | Jacob |
---|---|---|---|
Component: | contrib.syndication | Version: | dev |
Severity: | Keywords: | ||
Cc: | Robert.Hopson@…, chromano@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
Using SVN rev 6759.
Item titles and descriptions are escaped double, first by the template system due to auto escaping, and then again by the XML code. Solution seems to be as simple as passing autoescape=False to the render() calls in django.contrib.syndication.feeds.
Attachments (5)
Change History (35)
comment:1 by , 17 years ago
comment:2 by , 17 years ago
Has patch: | set |
---|
I have attached a patch and confirmed that it works with build 0.97-pre-SVN-7513.
To test, generate a feed that has an item title of '>', and then generate an RSS feed. The '>' should only be encoded once (<) and not twice (&lt;).
Note, I would write a test case, but the bug only occurs when the feed is delivered via the server.
With the attached patch, the output is correct both via the shell, and via the dev server.
by , 17 years ago
Attachment: | 6533_correct_dir.patch added |
---|
This patch is same as the previous one, but is generated from the trunk
comment:3 by , 17 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 17 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
by , 17 years ago
Attachment: | 6533_fixed_tabbing.patch added |
---|
This patch fixes the ambiguous (wrong) indentation in the previous patch.
comment:5 by , 17 years ago
Triage Stage: | Unreviewed → Ready for checkin |
---|
comment:6 by , 17 years ago
Needs tests: | set |
---|---|
Triage Stage: | Ready for checkin → Unreviewed |
Please don't self-promote tickets to 'ready for checkin' unless they are completely trivial. This change involves removing escaping behavior, so it will need to be independently reviewed. It also needs an automated regression test case, not just a 'This is how you do it' test.
comment:7 by , 17 years ago
Patch needs improvement: | set |
---|---|
Triage Stage: | Unreviewed → Accepted |
comment:8 by , 16 years ago
Cc: | added |
---|
comment:9 by , 16 years ago
milestone: | → 1.0 beta |
---|
by , 16 years ago
Attachment: | syndication-escapingtest.diff added |
---|
comment:10 by , 16 years ago
Needs tests: | unset |
---|
Added automated test case.
I can confirm that the patch fixes the escaping. Somebody please review everything and mark ready for checkin.
comment:11 by , 16 years ago
Patch needs improvement: | unset |
---|
In case somebody's at it: #6618 makes changes in feeds.py too and I've included the patch there.
comment:12 by , 16 years ago
Patch needs improvement: | set |
---|
This is trickier than it looks. RSS feeds should be double-escaped in the title, Atom feeds should not. This isn't specified in the RSS spec and it's inconsistently handled by readers. Still the consensus is to double-escape (which is one of the reasons RSS is a bad format to use).
So I don't think this looks ready yet, because it tries to apply uniform handling to everything.
Also, there should be one patch for the whole thing, not multiple patches we have to apply.
comment:13 by , 16 years ago
If we want to do double escaping, it would seem that the lower-level django.utils.feedgenerator code is the right place.
However, while trying to make a patch, I realized this may indeed be tricky. If the whole point of double-escaping is that title and description may contain HTML, then what if I *want* my title to contain HTML. If Django were to force double-escaping, tags would always be interpreted as plain text.
I see a couple different options:
1) Make the developer responsible for calling htmlescape() when he doesn't want to use HTML, and only single-escape by default.
2) Do not make the change suggested by this ticket and continue to render the feed with autoescaping on. The developer can use the "|safe"-filter to indicate that he wants to use HTML.
This doesn't address the low-level generation code, though. It could a) either be made to know about safe strings, or b) remain untouched, meaning that it 1) would be true.
Personally, I like 2b.
In addition, #6618 might have to be considered, since double-escaping, if done through the template system, could be bypassed. And not to make things too confusing, but with that ticket, the following might make sense:
- If title/description returned by a method call: Double-escape.
- If rendered via a template: Do NOT double-escape, since templates are likely to be used in more complex scenarios, in which HTML usage is likely.
comment:14 by , 16 years ago
milestone: | 1.0 beta → 1.0 |
---|
Okay, I've spent a bit more time thinking about this to work out the impact and priority.
There's partly a documentation issue here, since the reason this is at all difficult is because RSS is so poorly defined and interpreted and because Atom is specifically defined (and is the opposite of how most readers interpret RSS). What the correct behaviour is depends upon whether you're rendering Atom (single escaping only) or RSS (double escaping pretty much required). We don't know that at the lower level and shouldn't worry about handling it, since it can be worked around pretty much as miracle2k describes above.
Everything should just stay the same at the template level, since using the safe filter or wrapping things in {% autoescape off %}
isn't too onerous. The lower-level code probably needs to know about safestrings and that seems to be the only change required (along with documentation). It's basically just a bug in that we don't understand safestrings in the output generation. That should be relatively easy to fix. It looks like creating a characters()
method in django.utils.xmlutils.SimpleXMLGenerator
that overrides the default and is safestring-aware will be the right thing.
Moving to 1.0 milestone, since this is just a bug.
comment:15 by , 16 years ago
Just realised, I may have been high when I was writing earlier comments here and misremembering the RSS recommended "pragmatic" approach. It's messier than I thought (so we should really leave it up to the client code writers). The comments in this Sam Ruby blog post give a reasonable summary of the side of the issue and may be relevant to whomever wishes to try a patch here.
comment:16 by , 16 years ago
Even after reading the blog post you provide, I don't really get why we need double-escaping or what should go wrong.
Example:
(1) I write a text about an HTML element, the content would be:
This tag is nice: <br />
(2) I have to escape it once, for the tag to appear correctly, and I add my HTML:
This <em>tag</em> is nice: <br />
I could write that directly in a template or if it's in a variable that the template displays I had to mark it as safe.
(3) But in a feed it gets included and needs to be escaped once more, to not mess up the XML:
<description>This <em>tag</em> is nice: &lt;br /&gt;</description>
Problem at the moment is, that Django auto-escapes at (2) when the text comes from a variable, which is likely with feeds. Auto-escaping is there to prevent content from variables to be directly at the same "level" as the HTML. With feeds, the construction of XML takes care of that. We have to switch auto-escaping off, which is what the ticket proposes.
Now, at which point do some feed readers have problems? Do they expect everything to be escaped one more time?
comment:17 by , 16 years ago
julianb, what you seem to be suggesting is that the Django developer takes care of escaping HTML in feeds, correct?
Here's the issue, as I understand it:
If my blog post title is "Math explained: 1<3>2" and auto-escaping is not used, as per this ticket (and thus, we single-escape), the feed will contain:
<title>Math explained: 1 <3>2</title>
A feed reader that expects the title to contain html (be double-escaped), would probably display this as:
Math explained: 12
since it would consider "<2>" to be a tag.
Similarly, in your example, if obj.title is "This tag is nice: <br />", and my feed title template contains {{obj.title}}, with auto-escaping off, the feed will contain:
This tag is nice: <br />
which said html-expecting feed reader might consider to be the string "This tag is nice", followed by a line break.
The question thus is: Do feed readers expect title/descriptions to contain HTML (i.e. be double escaped), and should the Django developer be responsible of doing the initial escaping.
That said, I just tested in IE7 and Firefox 2.
Both seem to expect single-escaping in the title.
Both seem to use some heuristic to handle the description. With single-escaping applied to all examples, Firefox handles both "Math solved: 1<3>2" and "Math <em>solved</em>: 1<3>2" (!!) correctly. IE 7 handles the first, but displays the second example as: "Math solved: 12", as explained above.
With correct double-escaping (e.g. true html tags only single-escaped), both examples are handled fine by both browers.
FeedDemon is as smart is Firefox, but tries to support double-escaping in titles as well (although seems to have a bug in cutting off the last character of the title in some of those cases, unimportant here).
The FeedParser library only seems to single-decode while leaving it to the user how to handle that data.
Overall most readers seem to do a good job in handling both cases (see also the comments here: http://www.therssweblog.com/?guid=20070610101812).
Feedvalidator.org says no HTML in titles and suggests to use numeric entities to encode < and &: http://feedvalidator.org/docs/warning/ContainsHTML.html
comment:18 by , 16 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:19 by , 16 years ago
I'm getting now why this is all confusing.
In the blog post Malcolm provided, it says "The initial RSS specs were clear that titles were to be singly encoded, an no subsequent specification gave license to putting escaped HTML in titles."
So the problem that I saw in my feed probably just has been that there was double-escaping going on in the title and Firefox and my feed reader didn't expect that. If it's really true that titles should not contain HTML we should switch auto-escaping off there. I think in descriptions double-escaping is expected as these can contain HTML.
by , 16 years ago
Attachment: | syndication-optionaltemplates.diff added |
---|
comment:20 by , 16 years ago
With this patch, titles are escaped once (no HTML possible), descriptions have auto-escaping so you need to mark variables in description templates as safe if you they contain HTML.
comment:21 by , 16 years ago
comment:22 by , 16 years ago
Resolution: | → wontfix |
---|---|
Status: | assigned → closed |
OK, I spent a really long time looking into this. As the test in [8632] shows, Django *does* do the right thing when presented with characters to be escaped; the problem comes when you're writing a template.
In those cases, I think you just need to use safe
/escape
explicitly -- there's almost no way to make Django do the right thing both for RSS and Atom. Perhaps at some point in the future when the syndication framework is a bit more flexible... or when we drop RSS support... but for now this has to be wontfix -- anything we can possibly do can fail in different circumstances, and at least the current behavior is predictable, though dumb.
comment:23 by , 16 years ago
Resolution: | wontfix |
---|---|
Status: | closed → reopened |
Actually Atom has feature to fight exactly this issue: title element can have attribute type="text|html|xhtml", which explicitly defines how much escaping was applied to title. So adding type="xhtml" to title and double escaping text in it should be the right and safe.
http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.title
http://www.atomenabled.org/developers/syndication/atom-format-spec.php#text.constructs
http://intertwingly.net/blog/2006/07/14/Another-Month#c1152903120
comment:24 by , 16 years ago
milestone: | 1.0 → 1.1 |
---|
comment:25 by , 16 years ago
milestone: | 1.1 → 1.2 |
---|
comment:26 by , 15 years ago
Cc: | added |
---|---|
Resolution: | → fixed |
Status: | reopened → closed |
comment:27 by , 15 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
comment:28 by , 15 years ago
So: any plans to address this? As arty pointed out, the Atom spec allows for XHTML titles and summaries; would be nice to be able to actually use this capability in Django-generated Atom feeds...
comment:29 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
As Jacob points out, the RSS-specific behavior is correct as of [8632]. Supporting additional features of Atom is a separate issue, and there's already a long-standing ticket open for general improvements to our Atom support.
I can reproduce this when calling the API from the development server, but cannot reproduce when calling from the shell:
For example:
outputs:
Which is correct.
However, retrieving a feed that is generated and server via the development server, results in the double encoding:
Notice the title in the first item is double encoded.
Im not sure what build I am on, other than one of the .97 builds.