Code

Opened 7 years ago

Closed 4 years ago

#5663 closed (wontfix)

markdown 1.6b unicodedecodeerror

Reported by: Koen Biermans <koen.biermans@…> Owned by:
Component: Contrib apps Version: master
Severity: Keywords: markdown
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

Using markdown version 1.6b, the markdown filter in contrib.markup shows a unicode error. It seems the markdown function uses unicode now, because it works when leaving out the smart_str() conversion.

Attachments (3)

markdown_1.6b.diff (867 bytes) - added by Daniel Pope <dan@…> 6 years ago.
Patch removing broken 1.6b codepath
markdown_1.6b.2.diff (996 bytes) - added by Daniel Pope <dan@…> 6 years ago.
Patch removing broken 1.6b codepath (fixed)
markdown_1.6b.3.diff (1.5 KB) - added by Daniel Pope <dan@…> 6 years ago.
Same diff with 10 lines of context

Download all attachments as: .zip

Change History (20)

comment:1 Changed 7 years ago by mboersma

  • Needs documentation unset
  • Needs tests unset
  • Owner changed from nobody to mboersma
  • Patch needs improvement unset
  • Status changed from new to assigned
  • Triage Stage changed from Unreviewed to Accepted

comment:2 Changed 7 years ago by mboersma

  • Resolution set to worksforme
  • Status changed from assigned to closed

I installed markdown 1.6b and ran its unit tests and my blog and didn't see any error. If there's a problem here, could you please reopen the bug with some test code and a specific description of how to reproduce the error?

Here's my test output, just FYI:

(((
mboersma@marmaladesky:~/projects/django.trunk/django/contrib/markup$ python
Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import markdown
markdown.version

'1.6b'

mboersma@marmaladesky:~/projects/django.trunk/django/contrib/markup$ python tests.py
...


Ran 3 tests in 0.526s

OK
)))

comment:3 Changed 7 years ago by Koen Biermans <koen.biermans@…>

  • Resolution worksforme deleted
  • Status changed from closed to reopened

There are no strange characters in those tests. But when you try this:

>>> import markdown
>>> markdown.markdown(u'dépot')
u'<p>d\x82pot\n</p>
>>> from django.contrib.markup.templatetags import markup
>>> markup.markdown(u'dépot')
Traceback(most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byt 0xc2 in position 1: ordinal not in range(128)

In the templatetag the values are wrapped into a smart_str() call, so django is passing in a bytestring, but markdown is expecting unicode it seems (at least it can handle it well now).

comment:4 Changed 6 years ago by mtredinnick

  • Resolution set to wontfix
  • Status changed from reopened to closed

Okay, this is a bug in markdown-1.6b. I've reported it to the maintainer.

@mboersma: the markup tests seem surprisingly reluctant to ever report a failure if there's any kind of internal crash. They don't fill me with confidence that they're detecting many problems.

comment:5 Changed 6 years ago by mtredinnick

Just to close this off definitively: the markdown maintainer has committed the patch I supplied. So it's not worth putting a special workaround into Django here. The solution is "don't use 1.6b if you need non-ASCII support."

comment:6 Changed 6 years ago by Daniel Pope <dan@…>

  • Resolution wontfix deleted
  • Status changed from closed to reopened

1.6b is the version shipped in Ubuntu 8.04 LTS and presumably any other distros developed between 2007-08-18 and 2008-02-17. The current situation is basically that we support Markdown 1.6a and 1.7 but not 1.6b and I think this is a bigger problem than you anticipated.

Markdown 1.7 changes the behaviour of markdown.markdown() such that it now no longer accepts bytestrings as input. This is implemented for Django in #6387. You were right at the time, that 1.6b broke UTF8-encoded bytestring input, but 1.7 removes that behaviour anyway. We can just treat 1.6b and 1.7 in the same way instead of ignoring this bug.

Changed 6 years ago by Daniel Pope <dan@…>

Patch removing broken 1.6b codepath

Changed 6 years ago by Daniel Pope <dan@…>

Patch removing broken 1.6b codepath (fixed)

comment:7 Changed 6 years ago by mtredinnick

  • Has patch set
  • Patch needs improvement set

Your patch can't be correct. It would result in trying to pass unicode strings to earlier versions of markdown that don't accept them.

If you want to relax the version test to include 1.6b (or even 1.6 if all such versions can handle unicode strings, but you'll need to test that), no problems. But we shouldn't break earlier versions.

comment:8 Changed 6 years ago by Daniel Pope <dan@…>

  • Patch needs improvement unset

The patch is correct. The earlier test for markdown.version already selects 1.6b, so that test alone is sufficient. I'll attach a diff with a few more lines of context so you can see the relevant comment.

Changed 6 years ago by Daniel Pope <dan@…>

Same diff with 10 lines of context

comment:9 Changed 6 years ago by wayla

  • Resolution set to fixed
  • Status changed from reopened to closed

That patch is no good. As one of the markdown core devs I can confirm that 1.6b does NOT have the necessary unicode support. In fact, my recollection is that 1.6b had a few nasty bugs and I don't recommend it at all (use 1.6a or 1.7). The current implementation (in django) is correct and does not need changed.

The problem seems to be that Debian (and by extension Ubuntu) had updated there python-markdown package from the markdown dev repo to somewhere between 1.6b and 1.7 and called it 1.6b1. This is not an official release of Python-Markdown and is not supported by the Python-Markdown team. I would suggest that a bug report be filed with Debian/Ubuntu asking them to update their package, or don't use their package and install Python-Markdown the way everyone else does.

comment:10 Changed 6 years ago by mtredinnick

I think the comment from @wayla is probably a sensible position to take. We've asked upstream providers to fix their packages before when they're clearly buggy, rather than work around them.

comment:11 follow-up: Changed 6 years ago by italomaia

Wayla, i'll try to mail them. By the way, the patch 3 works for me.

comment:12 in reply to: ↑ 11 Changed 6 years ago by wayla

Replying to italomaia:

By the way, the patch 3 works for me.

Yes, however, that patch would not work for the 1.6a or 1.6b official releases from the Python-Markdown project. It will only work with this "unofficial" 1.6b1 distributed by Debian/Unbuntu. So, if Django was going to address this, all the existing version checks would need to remain (the current patch removes them) and an additional check would need to be added for this "unofficial" 1.6b1. That is why it's preferred that you (and Debian/Ubuntu) upgrade to an officially supported version of markdown.

comment:13 follow-up: Changed 6 years ago by Daniel Pope <dan@…>

  • Resolution fixed deleted
  • Status changed from closed to reopened

The patch works for 1.6a and earlier because 1.6b is the first to provide markdown.version, so 1.6a is passed UTF-8 bytestrings.

There is no 1.6b1. I've just downloaded the official 1.6b .zip file:

mauve:~/dev/markdown$ diff markdown-1.6b/markdown.py /var/lib/python-support/python2.5/markdown.py
mauve:~/dev/markdown$

The -1 in the 1.6b-1 package version is the version of the package, not the version of markdown.

I can confirm by empirical testing that 1.6b does require unicode strings as input. Your memory has failed you in this case.

>>> i = '€£½'.decode('utf8')
>>> i
u'\u20ac\xa3\xbd'
>>> markdown.markdown(i)
u'<p>\u20ac\xa3\xbd\n</p>'
>>> markdown.markdown(i.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "markdown.py", line 1722, in markdown
    return md.convert(text)
  File "markdown.py", line 1614, in convert
    self.source = removeBOM(self.source, self.encoding)
  File "markdown.py", line 74, in removeBOM
    if text.startswith(bom):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

1.6a works with or without the patch.
1.6b works only with the patch (even if it does contain bugs).
1.6b1 works only with the patch.
1.7 works with or without the patch.

I don't see what we could be doing better. Even if there's an issue upstream it doesn't affect the status of this patch. Reopening.

comment:14 in reply to: ↑ 13 Changed 6 years ago by wayla

Replying to Daniel Pope <dan@mauveinternet.co.uk>:

I can confirm by empirical testing that 1.6b does require unicode strings as input. Your memory has failed you in this case.

Actually I wrote the code that changed unicode support between 1.6b & 1.7. I think I remember what I did. Yes, 1.6b did have some support for unicode, but it did not require it as you state. It also worked with (most) bytestrings. The fact that it supported unicode at all was kind of a fluke. It just so happens that the python re module runs just as well on unicode strings as it does on byte strings. Therefore, 1.6b added a __unicode__ method to the Markdown class which simply wrapped __str__ (return str(self)). In contrast, 1.7 raises a fatal error before running if it does not get a unicode string (or a bytestring only of ascii characters as they're a subset of unicode anyway) and will only return unicode.

Oh, and the reason your first example works:

>>> markdown.markdown(i)
u'<p>\u20ac\xa3\xbd\n</p>'

is because the markdown wrapper function did not call either Markdown.__str__ or Markdown.__unicode__, but did it's own thing reimplementing most of Markdown.__str__. In other words, by using the wrapper (shortcut) in 1.6b, you get different behavior than if you call the class directly. That's buggy and I don't recommend it.

Btw, using your example in 1.6b, you should have done this:

>>> markdown.markdown(i.encode('utf8'), encoding='utf8')

But even that was buggy. Shortly after Malcolm submitted a patch (which I applied), we threw away most of that convert-to-unicode stuff (we kept just enough only for use from the command line - including Malcolm's patch) and forced the requirement that Markdown only accept unicode.

That's the difference here. 1.7 is the only version where we can be absolutely sure that it is safe to pass unicode text to markdown. It may or may not work in any earlier version. As a Markdown core dev, I will not guarantee that it will work all the time in anything but 1.7. Sure it may work fine for you in testing, but then some user will submit some text that fails and breaks your site. Debian/Ubuntu need to do the right thing here and provide 1.7 which passes a version test for 1.7.

In any event, it's up to the django core devs to decide what to do in Django.

comment:15 Changed 6 years ago by Daniel Pope <dan@…>

Thanks for your comprehensive clarification.

The encoding argument is completely ignored in 1.6b so your example gives the same error:

>>> markdown.markdown(i.encode('utf8'), encoding='utf8')
Traceback (most recent call last):
 ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Could Django warn on the console if it finds Markdown 1.6b? I think that might save users some headaches.

comment:16 Changed 6 years ago by mboersma

  • Owner mboersma deleted
  • Status changed from reopened to new

comment:17 Changed 4 years ago by adamnelson

  • Resolution set to wontfix
  • Status changed from new to closed

Markdown is now in the 2 series, which ships with Ubuntu Lucid, and the unicode issue was fixed 2 years ago - I think this ticket is no longer applicable. Reopen if that's not the case.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.