Opened 16 years ago

Closed 14 years ago

Last modified 12 years ago

#6533 closed (fixed)

Syndication Framework: Item titles and descriptions double-escaped

Reported by: miracle2k Owned by: Jacob
Component: contrib.syndication Version: dev
Severity: Keywords:
Cc: Robert.Hopson@…, chromano@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

Using SVN rev 6759.

Item titles and descriptions are escaped double, first by the template system due to auto escaping, and then again by the XML code. Solution seems to be as simple as passing autoescape=False to the render() calls in django.contrib.syndication.feeds.

Attachments (5)

6533.patch (1.2 KB ) - added by mikechambers@… 16 years ago.
Patch with fix for #6533
6533_correct_dir.patch (1.3 KB ) - added by mikechambers@… 16 years ago.
This patch is same as the previous one, but is generated from the trunk
6533_fixed_tabbing.patch (1.3 KB ) - added by mikechambers 16 years ago.
This patch fixes the ambiguous (wrong) indentation in the previous patch.
syndication-escapingtest.diff (2.6 KB ) - added by Julian Bez 16 years ago.
syndication-optionaltemplates.diff (5.1 KB ) - added by Julian Bez 16 years ago.

Download all attachments as: .zip

Change History (35)

comment:1 by mikechambers@…, 16 years ago

I can reproduce this when calling the API from the development server, but cannot reproduce when calling from the shell:

For example:

#!/usr/bin/python

from django.utils.feedgenerator import Rss201rev2Feed

rss = Rss201rev2Feed(
		title="title test", 
		link="http://www.djangoproject.com", 
		description="description test")
		
rss.add_item(
			title="<",
			link="http://www.foo.com",
			description="item description"
			)
			
print rss.writeString('utf8')

outputs:

<?xml version="1.0" encoding="utf8"?>
<rss version="2.0">
	<channel>
		<title>title test</title>
		<link>http://www.djangoproject.com</link>
		<description>description test</description>
		<lastBuildDate>Sat, 03 May 2008 04:04:46 -0000</lastBuildDate>
		<item>
			<title>&lt;</title>
			<link>http://www.foo.com</link>
			<description>item description</description>
		</item>
	</channel>
</rss>

Which is correct.

However, retrieving a feed that is generated and server via the development server, results in the double encoding:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
	<channel>
		<title> Newest Entries Feed</title>
		<link>http://example.com/item/</link>
		<description>Newest entries added to </description>
		<language>en-us</language>
		<lastBuildDate>Sat, 03 May 2008 03:53:54 -0000</lastBuildDate>
		<item>
			<title>&amp;lt;</title>
			<link>http://example.com/item/2/</link>
			<description>&amp;amp;</description>
			<guid>http://example.com/item/2/</guid>
		</item>
</rss>

Notice the title in the first item is double encoded.

Im not sure what build I am on, other than one of the .97 builds.

by mikechambers@…, 16 years ago

Attachment: 6533.patch added

Patch with fix for #6533

comment:2 by mikechambers@…, 16 years ago

Has patch: set

I have attached a patch and confirmed that it works with build 0.97-pre-SVN-7513.

To test, generate a feed that has an item title of '>', and then generate an RSS feed. The '>' should only be encoded once (&lt;) and not twice (&amp;lt;).

Note, I would write a test case, but the bug only occurs when the feed is delivered via the server.

With the attached patch, the output is correct both via the shell, and via the dev server.

by mikechambers@…, 16 years ago

Attachment: 6533_correct_dir.patch added

This patch is same as the previous one, but is generated from the trunk

comment:3 by anonymous, 16 years ago

Owner: changed from nobody to anonymous
Status: newassigned

comment:4 by mikechambers, 16 years ago

Owner: changed from anonymous to mikechambers
Status: assignednew

by mikechambers, 16 years ago

Attachment: 6533_fixed_tabbing.patch added

This patch fixes the ambiguous (wrong) indentation in the previous patch.

comment:5 by mikechambers, 16 years ago

Triage Stage: UnreviewedReady for checkin

comment:6 by Russell Keith-Magee, 16 years ago

Needs tests: set
Triage Stage: Ready for checkinUnreviewed

Please don't self-promote tickets to 'ready for checkin' unless they are completely trivial. This change involves removing escaping behavior, so it will need to be independently reviewed. It also needs an automated regression test case, not just a 'This is how you do it' test.

comment:7 by Jacob, 16 years ago

Patch needs improvement: set
Triage Stage: UnreviewedAccepted

comment:8 by anonymous, 16 years ago

Cc: Robert.Hopson@… added

comment:9 by Julian Bez, 16 years ago

milestone: 1.0 beta

by Julian Bez, 16 years ago

comment:10 by Julian Bez, 16 years ago

Needs tests: unset

Added automated test case.

I can confirm that the patch fixes the escaping. Somebody please review everything and mark ready for checkin.

comment:11 by Julian Bez, 16 years ago

Patch needs improvement: unset

In case somebody's at it: #6618 makes changes in feeds.py too and I've included the patch there.

comment:12 by Malcolm Tredinnick, 16 years ago

Patch needs improvement: set

This is trickier than it looks. RSS feeds should be double-escaped in the title, Atom feeds should not. This isn't specified in the RSS spec and it's inconsistently handled by readers. Still the consensus is to double-escape (which is one of the reasons RSS is a bad format to use).

So I don't think this looks ready yet, because it tries to apply uniform handling to everything.

Also, there should be one patch for the whole thing, not multiple patches we have to apply.

comment:13 by miracle2k, 16 years ago

If we want to do double escaping, it would seem that the lower-level django.utils.feedgenerator code is the right place.

However, while trying to make a patch, I realized this may indeed be tricky. If the whole point of double-escaping is that title and description may contain HTML, then what if I *want* my title to contain HTML. If Django were to force double-escaping, tags would always be interpreted as plain text.

I see a couple different options:

1) Make the developer responsible for calling htmlescape() when he doesn't want to use HTML, and only single-escape by default.
2) Do not make the change suggested by this ticket and continue to render the feed with autoescaping on. The developer can use the "|safe"-filter to indicate that he wants to use HTML.

This doesn't address the low-level generation code, though. It could a) either be made to know about safe strings, or b) remain untouched, meaning that it 1) would be true.

Personally, I like 2b.

In addition, #6618 might have to be considered, since double-escaping, if done through the template system, could be bypassed. And not to make things too confusing, but with that ticket, the following might make sense:

  • If title/description returned by a method call: Double-escape.
  • If rendered via a template: Do NOT double-escape, since templates are likely to be used in more complex scenarios, in which HTML usage is likely.

comment:14 by Malcolm Tredinnick, 16 years ago

milestone: 1.0 beta1.0

Okay, I've spent a bit more time thinking about this to work out the impact and priority.

There's partly a documentation issue here, since the reason this is at all difficult is because RSS is so poorly defined and interpreted and because Atom is specifically defined (and is the opposite of how most readers interpret RSS). What the correct behaviour is depends upon whether you're rendering Atom (single escaping only) or RSS (double escaping pretty much required). We don't know that at the lower level and shouldn't worry about handling it, since it can be worked around pretty much as miracle2k describes above.

Everything should just stay the same at the template level, since using the safe filter or wrapping things in {% autoescape off %} isn't too onerous. The lower-level code probably needs to know about safestrings and that seems to be the only change required (along with documentation). It's basically just a bug in that we don't understand safestrings in the output generation. That should be relatively easy to fix. It looks like creating a characters() method in django.utils.xmlutils.SimpleXMLGenerator that overrides the default and is safestring-aware will be the right thing.

Moving to 1.0 milestone, since this is just a bug.

comment:15 by Malcolm Tredinnick, 16 years ago

Just realised, I may have been high when I was writing earlier comments here and misremembering the RSS recommended "pragmatic" approach. It's messier than I thought (so we should really leave it up to the client code writers). The comments in this Sam Ruby blog post give a reasonable summary of the side of the issue and may be relevant to whomever wishes to try a patch here.

comment:16 by Julian Bez, 16 years ago

Even after reading the blog post you provide, I don't really get why we need double-escaping or what should go wrong.

Example:

(1) I write a text about an HTML element, the content would be:

This tag is nice: <br />

(2) I have to escape it once, for the tag to appear correctly, and I add my HTML:

This <em>tag</em> is nice: &lt;br /&gt;

I could write that directly in a template or if it's in a variable that the template displays I had to mark it as safe.

(3) But in a feed it gets included and needs to be escaped once more, to not mess up the XML:

<description>This &lt;em&gt;tag&lt;/em&gt; is nice: &amp;lt;br /&amp;gt;</description>

Problem at the moment is, that Django auto-escapes at (2) when the text comes from a variable, which is likely with feeds. Auto-escaping is there to prevent content from variables to be directly at the same "level" as the HTML. With feeds, the construction of XML takes care of that. We have to switch auto-escaping off, which is what the ticket proposes.

Now, at which point do some feed readers have problems? Do they expect everything to be escaped one more time?

comment:17 by miracle2k, 16 years ago

julianb, what you seem to be suggesting is that the Django developer takes care of escaping HTML in feeds, correct?

Here's the issue, as I understand it:

If my blog post title is "Math explained: 1<3>2" and auto-escaping is not used, as per this ticket (and thus, we single-escape), the feed will contain:

<title>Math explained: 1 &lt;3&gt;2</title>

A feed reader that expects the title to contain html (be double-escaped), would probably display this as:

Math explained: 12

since it would consider "<2>" to be a tag.

Similarly, in your example, if obj.title is "This tag is nice: <br />", and my feed title template contains {{obj.title}}, with auto-escaping off, the feed will contain:

This tag is nice: &lt;br /&gt;

which said html-expecting feed reader might consider to be the string "This tag is nice", followed by a line break.

The question thus is: Do feed readers expect title/descriptions to contain HTML (i.e. be double escaped), and should the Django developer be responsible of doing the initial escaping.

That said, I just tested in IE7 and Firefox 2.

Both seem to expect single-escaping in the title.

Both seem to use some heuristic to handle the description. With single-escaping applied to all examples, Firefox handles both "Math solved: 1<3>2" and "Math <em>solved</em>: 1<3>2" (!!) correctly. IE 7 handles the first, but displays the second example as: "Math solved: 12", as explained above.

With correct double-escaping (e.g. true html tags only single-escaped), both examples are handled fine by both browers.

FeedDemon is as smart is Firefox, but tries to support double-escaping in titles as well (although seems to have a bug in cutting off the last character of the title in some of those cases, unimportant here).

The FeedParser library only seems to single-decode while leaving it to the user how to handle that data.

Overall most readers seem to do a good job in handling both cases (see also the comments here: http://www.therssweblog.com/?guid=20070610101812).

Feedvalidator.org says no HTML in titles and suggests to use numeric entities to encode < and &: http://feedvalidator.org/docs/warning/ContainsHTML.html

comment:18 by Jacob, 16 years ago

Owner: changed from mikechambers to Jacob
Status: newassigned

comment:19 by Julian Bez, 16 years ago

I'm getting now why this is all confusing.

In the blog post Malcolm provided, it says "The initial RSS specs were clear that titles were to be singly encoded, an no subsequent specification gave license to putting escaped HTML in titles."

So the problem that I saw in my feed probably just has been that there was double-escaping going on in the title and Firefox and my feed reader didn't expect that. If it's really true that titles should not contain HTML we should switch auto-escaping off there. I think in descriptions double-escaping is expected as these can contain HTML.

by Julian Bez, 16 years ago

comment:20 by Julian Bez, 16 years ago

With this patch, titles are escaped once (no HTML possible), descriptions have auto-escaping so you need to mark variables in description templates as safe if you they contain HTML.

comment:21 by Jacob, 16 years ago

(In [8632]) Added a test to ensure that strings in RSS are properly escaped. Refs #6533.

comment:22 by Jacob, 16 years ago

Resolution: wontfix
Status: assignedclosed

OK, I spent a really long time looking into this. As the test in [8632] shows, Django *does* do the right thing when presented with characters to be escaped; the problem comes when you're writing a template.

In those cases, I think you just need to use safe/escape explicitly -- there's almost no way to make Django do the right thing both for RSS and Atom. Perhaps at some point in the future when the syndication framework is a bit more flexible... or when we drop RSS support... but for now this has to be wontfix -- anything we can possibly do can fail in different circumstances, and at least the current behavior is predictable, though dumb.

comment:23 by arty, 15 years ago

Resolution: wontfix
Status: closedreopened

Actually Atom has feature to fight exactly this issue: title element can have attribute type="text|html|xhtml", which explicitly defines how much escaping was applied to title. So adding type="xhtml" to title and double escaping text in it should be the right and safe.

http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.title
http://www.atomenabled.org/developers/syndication/atom-format-spec.php#text.constructs
http://intertwingly.net/blog/2006/07/14/Another-Month#c1152903120

comment:24 by Gary Wilson, 15 years ago

milestone: 1.01.1

comment:25 by Gary Wilson, 15 years ago

milestone: 1.11.2

comment:26 by Henrique Romano, 15 years ago

Cc: chromano@… added
Resolution: fixed
Status: reopenedclosed

comment:27 by miracle2k, 15 years ago

Resolution: fixed
Status: closedreopened

comment:28 by anonymous, 14 years ago

So: any plans to address this? As arty pointed out, the Atom spec allows for XHTML titles and summaries; would be nice to be able to actually use this capability in Django-generated Atom feeds...

comment:29 by James Bennett, 14 years ago

Resolution: fixed
Status: reopenedclosed

As Jacob points out, the RSS-specific behavior is correct as of [8632]. Supporting additional features of Atom is a separate issue, and there's already a long-standing ticket open for general improvements to our Atom support.

comment:30 by Jacob, 12 years ago

milestone: 1.2

Milestone 1.2 deleted

Note: See TracTickets for help on using tickets.
Back to Top