Opened 17 years ago
Closed 14 years ago
#5663 closed (wontfix)
markdown 1.6b unicodedecodeerror
Reported by: | Owned by: | ||
---|---|---|---|
Component: | Contrib apps | Version: | dev |
Severity: | Keywords: | markdown | |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Using markdown version 1.6b, the markdown filter in contrib.markup shows a unicode error. It seems the markdown function uses unicode now, because it works when leaving out the smart_str() conversion.
Attachments (3)
Change History (20)
comment:1 by , 17 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Triage Stage: | Unreviewed → Accepted |
comment:2 by , 17 years ago
Resolution: | → worksforme |
---|---|
Status: | assigned → closed |
comment:3 by , 17 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
There are no strange characters in those tests. But when you try this:
>>> import markdown >>> markdown.markdown(u'dépot') u'<p>d\x82pot\n</p> >>> from django.contrib.markup.templatetags import markup >>> markup.markdown(u'dépot') Traceback(most recent call last): ... UnicodeDecodeError: 'ascii' codec can't decode byt 0xc2 in position 1: ordinal not in range(128)
In the templatetag the values are wrapped into a smart_str() call, so django is passing in a bytestring, but markdown is expecting unicode it seems (at least it can handle it well now).
comment:4 by , 17 years ago
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
Okay, this is a bug in markdown-1.6b. I've reported it to the maintainer.
@mboersma: the markup tests seem surprisingly reluctant to ever report a failure if there's any kind of internal crash. They don't fill me with confidence that they're detecting many problems.
comment:5 by , 17 years ago
Just to close this off definitively: the markdown maintainer has committed the patch I supplied. So it's not worth putting a special workaround into Django here. The solution is "don't use 1.6b if you need non-ASCII support."
comment:6 by , 17 years ago
Resolution: | wontfix |
---|---|
Status: | closed → reopened |
1.6b is the version shipped in Ubuntu 8.04 LTS and presumably any other distros developed between 2007-08-18 and 2008-02-17. The current situation is basically that we support Markdown 1.6a and 1.7 but not 1.6b and I think this is a bigger problem than you anticipated.
Markdown 1.7 changes the behaviour of markdown.markdown() such that it now no longer accepts bytestrings as input. This is implemented for Django in #6387. You were right at the time, that 1.6b broke UTF8-encoded bytestring input, but 1.7 removes that behaviour anyway. We can just treat 1.6b and 1.7 in the same way instead of ignoring this bug.
by , 17 years ago
Attachment: | markdown_1.6b.2.diff added |
---|
Patch removing broken 1.6b codepath (fixed)
comment:7 by , 17 years ago
Has patch: | set |
---|---|
Patch needs improvement: | set |
Your patch can't be correct. It would result in trying to pass unicode strings to earlier versions of markdown that don't accept them.
If you want to relax the version test to include 1.6b (or even 1.6 if all such versions can handle unicode strings, but you'll need to test that), no problems. But we shouldn't break earlier versions.
comment:8 by , 17 years ago
Patch needs improvement: | unset |
---|
The patch is correct. The earlier test for markdown.version already selects 1.6b, so that test alone is sufficient. I'll attach a diff with a few more lines of context so you can see the relevant comment.
comment:9 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
That patch is no good. As one of the markdown core devs I can confirm that 1.6b does NOT have the necessary unicode support. In fact, my recollection is that 1.6b had a few nasty bugs and I don't recommend it at all (use 1.6a or 1.7). The current implementation (in django) is correct and does not need changed.
The problem seems to be that Debian (and by extension Ubuntu) had updated there python-markdown package from the markdown dev repo to somewhere between 1.6b and 1.7 and called it 1.6b1. This is not an official release of Python-Markdown and is not supported by the Python-Markdown team. I would suggest that a bug report be filed with Debian/Ubuntu asking them to update their package, or don't use their package and install Python-Markdown the way everyone else does.
comment:10 by , 16 years ago
I think the comment from @wayla is probably a sensible position to take. We've asked upstream providers to fix their packages before when they're clearly buggy, rather than work around them.
follow-up: 12 comment:11 by , 16 years ago
Wayla, i'll try to mail them. By the way, the patch 3 works for me.
comment:12 by , 16 years ago
Replying to italomaia:
By the way, the patch 3 works for me.
Yes, however, that patch would not work for the 1.6a or 1.6b official releases from the Python-Markdown project. It will only work with this "unofficial" 1.6b1 distributed by Debian/Unbuntu. So, if Django was going to address this, all the existing version checks would need to remain (the current patch removes them) and an additional check would need to be added for this "unofficial" 1.6b1. That is why it's preferred that you (and Debian/Ubuntu) upgrade to an officially supported version of markdown.
follow-up: 14 comment:13 by , 16 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
The patch works for 1.6a and earlier because 1.6b is the first to provide markdown.version, so 1.6a is passed UTF-8 bytestrings.
There is no 1.6b1. I've just downloaded the official 1.6b .zip file:
mauve:~/dev/markdown$ diff markdown-1.6b/markdown.py /var/lib/python-support/python2.5/markdown.py mauve:~/dev/markdown$
The -1 in the 1.6b-1 package version is the version of the package, not the version of markdown.
I can confirm by empirical testing that 1.6b does require unicode strings as input. Your memory has failed you in this case.
>>> i = '€£½'.decode('utf8') >>> i u'\u20ac\xa3\xbd' >>> markdown.markdown(i) u'<p>\u20ac\xa3\xbd\n</p>' >>> markdown.markdown(i.encode('utf8')) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "markdown.py", line 1722, in markdown return md.convert(text) File "markdown.py", line 1614, in convert self.source = removeBOM(self.source, self.encoding) File "markdown.py", line 74, in removeBOM if text.startswith(bom): UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
1.6a works with or without the patch.
1.6b works only with the patch (even if it does contain bugs).
1.6b1 works only with the patch.
1.7 works with or without the patch.
I don't see what we could be doing better. Even if there's an issue upstream it doesn't affect the status of this patch. Reopening.
comment:14 by , 16 years ago
Replying to Daniel Pope <dan@mauveinternet.co.uk>:
I can confirm by empirical testing that 1.6b does require unicode strings as input. Your memory has failed you in this case.
Actually I wrote the code that changed unicode support between 1.6b & 1.7. I think I remember what I did. Yes, 1.6b did have some support for unicode, but it did not require it as you state. It also worked with (most) bytestrings. The fact that it supported unicode at all was kind of a fluke. It just so happens that the python re module runs just as well on unicode strings as it does on byte strings. Therefore, 1.6b added a __unicode__
method to the Markdown class which simply wrapped __str__
(return str(self)
). In contrast, 1.7 raises a fatal error before running if it does not get a unicode string (or a bytestring only of ascii characters as they're a subset of unicode anyway) and will only return unicode.
Oh, and the reason your first example works:
>>> markdown.markdown(i) u'<p>\u20ac\xa3\xbd\n</p>'
is because the markdown
wrapper function did not call either Markdown.__str__
or Markdown.__unicode__
, but did it's own thing reimplementing most of Markdown.__str__
. In other words, by using the wrapper (shortcut) in 1.6b, you get different behavior than if you call the class directly. That's buggy and I don't recommend it.
Btw, using your example in 1.6b, you should have done this:
>>> markdown.markdown(i.encode('utf8'), encoding='utf8')
But even that was buggy. Shortly after Malcolm submitted a patch (which I applied), we threw away most of that convert-to-unicode stuff (we kept just enough only for use from the command line - including Malcolm's patch) and forced the requirement that Markdown only accept unicode.
That's the difference here. 1.7 is the only version where we can be absolutely sure that it is safe to pass unicode text to markdown. It may or may not work in any earlier version. As a Markdown core dev, I will not guarantee that it will work all the time in anything but 1.7. Sure it may work fine for you in testing, but then some user will submit some text that fails and breaks your site. Debian/Ubuntu need to do the right thing here and provide 1.7 which passes a version test for 1.7.
In any event, it's up to the django core devs to decide what to do in Django.
comment:15 by , 16 years ago
Thanks for your comprehensive clarification.
The encoding argument is completely ignored in 1.6b so your example gives the same error:
>>> markdown.markdown(i.encode('utf8'), encoding='utf8') Traceback (most recent call last): ... UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Could Django warn on the console if it finds Markdown 1.6b? I think that might save users some headaches.
comment:16 by , 16 years ago
Owner: | removed |
---|---|
Status: | reopened → new |
comment:17 by , 14 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Markdown is now in the 2 series, which ships with Ubuntu Lucid, and the unicode issue was fixed 2 years ago - I think this ticket is no longer applicable. Reopen if that's not the case.
I installed markdown 1.6b and ran its unit tests and my blog and didn't see any error. If there's a problem here, could you please reopen the bug with some test code and a specific description of how to reproduce the error?
Here's my test output, just FYI:
(((
mboersma@marmaladesky:~/projects/django.trunk/django/contrib/markup$ python
Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
'1.6b'
mboersma@marmaladesky:~/projects/django.trunk/django/contrib/markup$ python tests.py
...
Ran 3 tests in 0.526s
OK
)))