Opened 5 years ago

Closed 5 years ago

#24985 closed Cleanup/optimization (fixed)

Warn about invalid RSS characters in syndication docs

Reported by: Michael Wood Owned by: nobody
Component: Documentation Version: 1.7
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Michael Wood)

I have some data from log files that I'd like to put into an RSS feed. Unfortunately, due to the nature of this data, it sometimes contains control characters, e.g. \0001 \0003. This causes the feed to fail RSS feed reader validation, because these characters (although valid UTF-8) are not allowed in XML (1).

I'm not sure whether this should be fixed in this module, or perhaps in sax/saxutils or somewhere like django.utils.encoding force_text.

At the moment I'm working around this issue with a regex which replaces this range of chars.
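
A minimal sketch of what such a regex workaround might look like. The reporter's actual pattern is not shown in the ticket, so the function name `strip_invalid_xml_chars` and the exact character range are assumptions. The range covers only the C0 control characters that XML 1.0 forbids (everything below U+0020 except tab, line feed and carriage return).

```python
import re

# Assumed range: C0 control characters that XML 1.0 does not allow.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are valid and kept.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_chars(text, replacement=""):
    """Return text with forbidden control characters replaced.

    Hypothetical helper for illustration; not part of Django.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    raw = "login ok\x01 user=alice\x03"
    print(repr(strip_invalid_xml_chars(raw)))
```

Such a helper would be applied to titles and descriptions before they are handed to the feed class. Passing a visible `replacement` instead of the empty string keeps a trace of where characters were dropped.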

(1) http://www.w3.org/TR/REC-xml/#charsets

Change History (7)

comment:1 Changed 5 years ago by Michael Wood

Description: modified (diff)

comment:2 Changed 5 years ago by Tim Graham

Summary: Rss201rev2Feed invalid characters in character data for RSS → Provide a way to santize invalid characters from Rss201rev2Feed
Triage Stage: Unreviewed → Accepted
Type: Bug → New feature

We could check whether other web frameworks perform sanitization or make alternative recommendations. If we don't make a change in Django, we could at least update the docs to note the requirement of sanitizing your own input and recommend how to do so.

comment:3 Changed 5 years ago by Tim Graham

Summary: Provide a way to santize invalid characters from Rss201rev2Feed → Provide a way to sanitize invalid characters from Rss201rev2Feed

comment:4 Changed 5 years ago by Claude Paroz

#20197 is similar but targets XML serialization with dumpdata. I just added a patch in that ticket to fail loudly instead of silently producing invalid XML. Automatic sanitization is tricky because, depending on the use case, you might want to remove the offending chars, replace them with some alternative encoding, or simply fix the source.

The patch for #20197 also affects RSS production, as the same django.utils.xmlutils.SimplerXMLGenerator is used. If it gets committed, we might want to add a similar admonition in syndication docs.
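
To illustrate the "fail loudly" approach, here is a rough sketch of a check that raises instead of silently emitting invalid XML. It is not the patch from #20197; the function name `check_xml_text` and the error message are assumptions made for illustration only.

```python
import re

# Assumed range: C0 control characters that XML 1.0 forbids
# (tab, line feed and carriage return are allowed).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def check_xml_text(text):
    """Return text unchanged, or raise ValueError if it cannot be
    represented in XML 1.0.

    Hypothetical helper for illustration; not Django's implementation.
    """
    match = _CONTROL_CHARS.search(text)
    if match:
        raise ValueError(
            "Control character %r at position %d is not allowed in XML 1.0"
            % (match.group(), match.start())
        )
    return text
```

Raising leaves the decision to the caller, who can then remove the characters, re-encode them, or fix the data at its source.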

comment:5 Changed 5 years ago by Claude Paroz

Proposal for a documentation addition:

  • docs/ref/contrib/syndication.txt

    diff --git a/docs/ref/contrib/syndication.txt b/docs/ref/contrib/syndication.txt
    index 6c86be0..940123c 100644
    --- a/docs/ref/contrib/syndication.txt
    +++ b/docs/ref/contrib/syndication.txt
    @@ -919,7 +919,10 @@ They share this interface:
         ``self.feed`` for use with `custom feed generators`_.
     
         All parameters should be Unicode objects, except ``categories``, which
    -    should be a sequence of Unicode objects.
    +    should be a sequence of Unicode objects. Beware that some control characters
    +    are `not allowed <http://www.w3.org/International/questions/qa-controls>`_
    +    in XML documents. If your content has some of them, you might encounter a
    +    :exp:`ValueError` when producing the feed.
     
     :meth:`.SyndicationFeed.add_item`
         Add an item to the feed with the given parameters.

comment:6 Changed 5 years ago by Tim Graham

Component: contrib.syndication → Documentation
Summary: Provide a way to sanitize invalid characters from Rss201rev2Feed → Warn about invalid RSS characters in syndication docs
Triage Stage: Accepted → Ready for checkin
Type: New feature → Cleanup/optimization

exp -> exc, otherwise looks good.

comment:7 Changed 5 years ago by Claude Paroz <claude@…>

Resolution: fixed
Status: new → closed

In 1c90a3dc:

Fixed #24985 -- Added note about possible invalid feed content

Thanks Michael Wood for the report and Tim Graham for the review.
