
Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#4430 closed (fixed)

[unicode] Syndication framework cannot handle unicode description

Reported by: bugs@…
Owned by: mtredinnick
Component: contrib.syndication
Version: other branch
Severity:
Keywords:
Cc:
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings:
UI/UX:

Description (last modified by mtredinnick)

I have an object with a content attribute that holds non-ASCII data. In both cases (either specifying {{ obj.content }} in the description template or adding the method

    def __unicode__(self):
        return smart_unicode(self.content)

), I get a UnicodeDecodeError when trying to display the feed:

UnicodeDecodeError at /feeds/wiki/
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Request Method: 	GET
Request URL: 	        http://rpgpedia.cz/feeds/wiki/
Exception Type: 	UnicodeDecodeError
Exception Value: 	'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Exception Location: 	/usr/lib/python2.5/codecs.py in write, line 303

The local variables show the object that codecs is trying to decode:

u'\xdasp\u011bch zna\u010d\xed zd\xe1rn\xe9 zavr\u0161en\xed akce, kter\xe1 je p\u0159edm\u011btem testov\xe1n\xed. Je pot\u0159ebn\xfd zejm\xe9na tehdy, kdy\u017e se n\u011bkter\xe1 ((rp postava postava)) nebo jin\xfd element v ((rp rolova_hra rolov\xe9 h\u0159e)) sna\u017e\xed n\u011bco ud\u011blat, n\u011bco zd\xe1rn\u011b zavr\u0161it, nebo n\u011bjak\xfdm zp\u016fsobem zvr\xe1tit situaci ve sv\u016fj prosp\u011bch.\r\n\r\nNakl\xe1d\xe1n\xed s \xfasp\u011bchem z\xe1vis\xed od ((rp pravidla pravidel)) hry. V n\u011bkter\xfdch hr\xe1ch je d\u016fle\u017eit\xfd tak\xe9 po\u010det \xfasp\u011bch\u016f (pokud jich m\u016f\u017ee hr\xe1\u010d v testu dos\xe1hnout v\xedce), v jin\xfdch hr\xe1ch je podstatn\xe9 jenom to, jestli hr\xe1\u010d v ((rp test testu)) usp\u011bje, nebo ne.\r\n\r\nV prvn\xedm p\u0159\xedpad\u011b m\u016f\u017ee nav\xedc p\u0159i v\xfdsledku konfliktn\xed akce mezi dv\u011bma nebo v\xedce postavami (nebo elementy) b\xfdt rozhoduj\xedc\xed i po\u010det \xfasp\u011bch\u016f jednotliv\xfdch postav a ta s nejvy\u0161\u0161\xedm po\u010dtem \xfasp\u011bch\u016f pak v dan\xe9m konfliktu zpravidla v\xedt\u011bz\xed.\r\n\r\nV n\u011bkter\xfdch hr\xe1ch existuje t\xe9\u017e """tot\xe1ln\xed \xfasp\u011bch""" (jak\xe1si zes\xedlen\xe1 varianta \xfasp\u011bchu obvykle s \u0159\xe1dov\u011b ni\u017e\u0161\xed pravd\u011bpodobnost\xed) vedouc\xed zpravida k v\xfdkon\u016fm \u010di ud\xe1lostem, kter\xe9 by za norm\xe1ln\xedch okolnost\xed byly (t\xe9m\u011b\u0159) nemo\u017en\xe9.'

(This is a normal unicode string, which can be encoded with s.encode('utf-8') without any problem.)

Attachments (1)

rss-unicode.patch (5.9 KB) - added by Almad 7 years ago.
patch fixing unicode description issues.


Change History (8)

comment:1 Changed 7 years ago by michal@…

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

I have a similar problem (not exactly the same, but it is related to the RSS framework and Czech-language strings in UTF-8).

When I try to fetch the RSS feed, I get this error:

UnicodeDecodeError at /rss/aktualni-zpravy/
'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
Request Method: 	GET
Request URL: 	http://127.0.0.1:8000/rss/aktualni-zpravy/
Exception Type: 	UnicodeDecodeError
Exception Value: 	'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
Exception Location: 	/usr/local/lib/python2.4/codecs.py in write, line 178
Traceback (innermost last)

Traceback (most recent call last):
File "/usr/local/lib/python2.4/site-packages/django/core/handlers/base.py" in get_response
  77. response = callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python2.4/site-packages/django/contrib/syndication/views.py" in feed
  24. feedgen.write(response, 'utf-8')
File "/usr/local/lib/python2.4/site-packages/django/utils/feedgenerator.py" in write
  136. self.write_items(handler)
File "/usr/local/lib/python2.4/site-packages/django/utils/feedgenerator.py" in write_items
  160. handler.addQuickElement(u"title", item['title'])
File "/usr/local/lib/python2.4/site-packages/django/utils/xmlutils.py" in addQuickElement
  13. self.characters(contents)
File "/usr/local/lib/python2.4/site-packages/_xmlplus/sax/saxutils.py" in characters
  309. writetext(self._out, content)
File "/usr/local/lib/python2.4/site-packages/_xmlplus/sax/saxutils.py" in writetext
  188. stream.write(escape(text, entities))
File "/usr/local/lib/python2.4/codecs.py" in write
  178. data, consumed = self.encode(object, self.errors)

  UnicodeDecodeError at /rss/aktualni-zpravy/
  'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

It looks like the RSS framework handles items incorrectly. I looked into the Django source, in the file django/utils/feedgenerator.py, and changed the code from line 160 onward to:

...

from django.utils.encoding import smart_unicode
handler.addQuickElement(u"title", smart_unicode(item['title']))
handler.addQuickElement(u"link", smart_unicode(item['link']))
if item['description'] is not None:
    handler.addQuickElement(u"description", smart_unicode(item['description']))

# Author information.
if item["author_name"] and item["author_email"]:
    handler.addQuickElement(u"author", u"%s (%s)" % \
        (smart_unicode(item['author_email']), smart_unicode(item['author_name'])))
elif item["author_email"]:
    handler.addQuickElement(u"author", smart_unicode(item["author_email"]))
elif item["author_name"]:
    handler.addQuickElement(u"dc:creator", smart_unicode(item["author_name"]), {"xmlns:dc": u"http://purl.org/dc/elements/1.1/"})

if item['pubdate'] is not None:
    handler.addQuickElement(u"pubDate", rfc2822_date(item['pubdate']).decode('ascii'))
if item['comments'] is not None:
    handler.addQuickElement(u"comments", smart_unicode(item['comments']))
if item['unique_id'] is not None:
    handler.addQuickElement(u"guid", smart_unicode(item['unique_id']))

# Enclosure.
if item['enclosure'] is not None:
    handler.addQuickElement(u"enclosure", '',
        {u"url": item['enclosure'].url, u"length": item['enclosure'].length,
            u"type": item['enclosure'].mime_type})

# Categories.
for cat in item['categories']:
    handler.addQuickElement(u"category", smart_unicode(cat))
...

In every call to handler.addQuickElement I used the smart_unicode function to convert the content to unicode. Now my RSS feed works.

Maybe the RSS framework needs a patch that does something similar?
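
The coercion described above can be sketched in isolation. This is a minimal illustration only, not Django's smart_unicode implementation: the helper name to_text and the UTF-8 default are assumptions, and it uses Python 3 syntax although the ticket concerns Python 2.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text (hypothetical helper, not Django's smart_unicode).

    Byte strings are decoded with the given encoding; anything else is
    converted with str().
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


# A title that arrives as UTF-8 bytes is decoded before it reaches the XML writer.
title_bytes = "\xdasp\u011bch".encode("utf-8")
title_text = to_text(title_bytes)
```

Text that is already unicode passes through unchanged, so the helper is safe to apply to every element's content.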

comment:2 Changed 7 years ago by Almad

  • Has patch set

Fixed as michal pointed out, and also fixed the other classes.

Adding patch.

Changed 7 years ago by Almad

patch fixing unicode description issues.

comment:3 Changed 7 years ago by mtredinnick

  • Description modified (diff)
  • Triage Stage changed from Unreviewed to Accepted

(fixed description formatting)

The patch goes a bit too far. We should never be applying smart_unicode() to anything that is a URL. If they aren't already in ASCII, it's a bug on the client code's side (they should be using things like iri_to_uri() at the appropriate moments).

I'm having a bit of trouble understanding the original report, because smart_unicode() does work on the string you posted and you don't include what's in the traceback leading up to the error.

If the patch fixes it for you, can you just drop in a comment saying so? I'll apply a version of this patch anyway, since it mostly fixes some places that have been overlooked (thanks for testing that, both of you), but I would like some confirmation that it is fixing the original report as well.

comment:4 Changed 7 years ago by mtredinnick

  • Owner changed from adrian to mtredinnick

Okay, the original bug report does make sense (that is, I can repeat it) if the string passed in is a UTF-8 bytestring that uses non-ASCII characters.
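
A small sketch of why a UTF-8 bytestring produces exactly the reported message. It uses Python 3 syntax and an explicit decode; under Python 2 the same ASCII decode happened implicitly when a byte string was handed to the codecs stream writer. The variable names are illustrative only.

```python
# "Úspěch", the first word of the string from the original report.
text = "\xdasp\u011bch"
utf8_bytes = text.encode("utf-8")  # first byte is 0xc3

try:
    # Treating UTF-8 bytes as ASCII fails on the first non-ASCII byte.
    utf8_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the correct encoding recovers the original text.
decoded = utf8_bytes.decode("utf-8")
```

Here error.start is 0 and the offending byte is 0xc3, which matches "can't decode byte 0xc3 in position 0" in the report.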

I'll commit a modified patch shortly that takes care of the IRI -> URI mapping as well.
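
For illustration, the IRI -> URI mapping amounts to percent-encoding the UTF-8 bytes of any non-ASCII characters while leaving reserved URI characters alone. The sketch below uses Python 3's urllib.parse.quote; the function name, the safe-character set, and the example URL are assumptions, and this is not the code that was committed.

```python
from urllib.parse import quote


def iri_to_uri_sketch(iri):
    """Percent-encode non-ASCII characters in an IRI (hypothetical helper).

    Reserved URI characters and existing % escapes are left untouched;
    non-ASCII characters are encoded as UTF-8 and then percent-escaped.
    Internationalized host names are not handled here.
    """
    return quote(iri, safe="/#%[]=:;$&()+,!?*@'~")


uri = iri_to_uri_sketch("http://example.com/feeds/\xfasp\u011bch/")
```

The result contains only ASCII characters, so it can be written into a feed's link element without any further unicode conversion.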

comment:5 Changed 7 years ago by mtredinnick

  • Resolution set to fixed
  • Status changed from new to closed

(In [5389]) unicode: Fixed #4430 -- Handle bytestrings and IRIs more robustly in feed
production. Thanks to Almad and Michal@… for some good debugging here.

comment:6 Changed 7 years ago by mtredinnick

(In [5400]) unicode: Reverted [5388] and fixed the problem in a different way. Checked
every occurrence of smart_unicode() and force_unicode() that was not previously
a str() call, so hopefully the problems will not reoccur. Fixed #4447. Refs #4435, #4430.

comment:7 Changed 7 years ago by michal@…

Works for me, thank you.
