Ticket #8391 (reopened)

Opened 2 years ago

Last modified 2 months ago

slugify template filter poorly encodes non-English strings

Reported by: bjornkri
Assigned to: nobody
Milestone:
Component: Template system
Version: SVN
Keywords:
Cc:
Triage Stage: Design decision needed
Has patch: 0
Needs documentation: 0
Needs tests: 0
Patch needs improvement: 0

Description

Going through the admin interface with a slug field, 'bøøøø' becomes 'boooo' (as expected)

But running this code:

from django.template.defaultfilters import slugify

print slugify('bøøøø')
print slugify(u'bøøøø')

results in:

'b'
'ba-a-a-a'

Results vary depending on which characters are used; I found this trying to inject a bunch of cyrillic and greek into a database, and most of the slug fields were empty. Entering them manually through the admin interface worked fine.

Change History

08/18/08 05:02:35 changed by bjornkri

  • needs_better_patch changed.
  • needs_tests changed.
  • needs_docs changed.

The code snippet again, for easy copying and pasting...

from django.template.defaultfilters import slugify

print slugify('bøøøø') 
print slugify(u'bøøøø')

results in: 'b' 'ba-a-a-a'

08/18/08 06:15:43 changed by Daniel Pope <dan@mauveinternet.co.uk>

For the first example, the expected results aren't well defined: Unicode characters can't be reliably represented in a bytestring. I think Django does smart_unicode on the input so it works for UTF-8 byte strings but that's just Django being flexible.

For the second example, I suspect where you've typed u'bøøøø', Python has interpreted your unicode string literal as the wrong character set, equivalent to

u'bøøøø'.encode('utf8').decode('iso-8859-1')

Putting the same thing in a script with a PEP-263 header gives 'b' for the second example.
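The mis-decoding described above can be reproduced outside Django. This is a minimal sketch in Python 3 (where every str is already Unicode); ascii_slugify is a hypothetical stand-in for the filter, built around the normalize/encode line quoted later in this ticket, and the two regular expressions are an assumption about the rest of the filter rather than a copy of it.

```python
import re
import unicodedata


def ascii_slugify(value):
    """Hypothetical stand-in for the filter: NFKD, drop non-ASCII, hyphenate."""
    value = unicodedata.normalize('NFKD', value)
    value = value.encode('ascii', 'ignore').decode('ascii')
    # Assumed cleanup steps: keep word characters, whitespace and hyphens,
    # then collapse runs of whitespace/hyphens into a single hyphen.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


correct = 'bøøøø'
# Simulate a terminal/editor that hands Python UTF-8 bytes read as Latin-1.
misdecoded = correct.encode('utf-8').decode('iso-8859-1')

print(ascii_slugify(correct))     # b
print(ascii_slugify(misdecoded))  # ba-a-a-a
```

Each 'ø' turns into the two characters 'Ã' and '¸'; NFKD reduces those to 'A' plus a combining tilde and a space plus a combining cedilla, which is where the 'a-' pairs come from.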

(follow-up: ↓ 4 ) 08/18/08 06:21:29 changed by julien

The reason why it doesn't give the same result in the admin and via the code above is that different algorithms are used. In the admin, it is done with some javascript, and odd characters are replaced by their latin 'equivalent', in particular:

var LATIN_MAP = {
   ...
   'ø': 'o',
   ...
}

In the template filter, odd characters that are not representable in ASCII are simply stripped out, see the 'ignore' below:

value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')

Maybe the filter should replicate the javascript's algorithm, or vice versa, to make things homogeneous?
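To make the difference between the two approaches concrete, here is a small Python 3 sketch. strip_to_ascii wraps the normalize/encode line quoted above; map_to_ascii is a hypothetical per-character lookup in the spirit of the javascript map, carrying only the single entry quoted in this comment.

```python
import unicodedata

# Only the entry quoted above; the real javascript map is much larger.
LATIN_MAP = {'ø': 'o'}


def strip_to_ascii(value):
    """Decompose, then silently drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize('NFKD', value)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def map_to_ascii(value):
    """Replace each character found in the map, leave the rest alone."""
    return ''.join(LATIN_MAP.get(ch, ch) for ch in value)


print(strip_to_ascii('café'))   # cafe  (é decomposes into e + accent)
print(strip_to_ascii('bøøøø'))  # b     (ø has no decomposition, so it vanishes)
print(map_to_ascii('bøøøø'))    # boooo
```

Characters such as 'é' survive the filter only because Unicode defines a decomposition for them; 'ø' has none, so 'ignore' removes it entirely.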

(in reply to: ↑ 3 ) 08/18/08 06:30:15 changed by bjornkri

Yep, I've been digging in and found the same. I absolutely think the two should work the same way, especially since it would make my life so much easier ;)

var LATIN_MAP = {
   ...
   'ø': 'o',
   ...
}

I'm trying to translate the javascript into Python, but Python's handling of this is giving me a headache. In the javascript function, there's a regular expression that matches either a single character from the list of special characters or a run of one or more characters not in that list.

As an example, 'bjørn' becomes 'bj', 'ø' and 'rn'. It then tries to find these in the above map: LATIN_MAP['bj'] returns nothing so it's left unchanged, LATIN_MAP['ø'] returns 'o', and LATIN_MAP['rn'] nothing again. The result is 'bjorn'.

Python, on the other hand, matches 'bj', '\xc3' and '\xb8rn'. There is no match for '\xc3' in LATIN_MAP, only in the regexp. Now I just need to find a way of splitting this correctly, any ideas?

08/18/08 06:32:12 changed by bjornkri

... I really need to start using the 'Preview' function. Hope that's legible.

08/18/08 06:55:30 changed by Jökull Sólberg Auðunsson <jokullsolberg@gmail.com>

Maybe there should be a JSON file with the character map for DRY.

08/18/08 07:12:55 changed by julien

  • summary changed from Results of slugify in the admin interface differ from the one in shell. to Admin slugify function's results defer from those of slugify template filter.

There's another difference between the two algorithms. In the admin, small words are stripped out by the javascript. For example, "This is a sentence with small words" returns "sentence-small-words". Whereas the template filter gives "this-is-a-sentence-with-small-words".

Ideally, this word replacement should be configurable per-language. Maybe it already is, I've never tried.
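A rough illustration of the small-word stripping, with the word list passed in so it could vary per language. This is a sketch only: the STOP_WORDS set, the function names and the ASCII-only tokenizing are assumptions made for the example, not the admin's actual javascript.

```python
import re

# Hypothetical English list, just large enough for the sentence above.
STOP_WORDS = frozenset(['a', 'is', 'this', 'with'])


def urlify(text, stop_words=STOP_WORDS):
    """Lowercase, split into ASCII words, drop stop words, join with hyphens."""
    words = re.findall(r'[a-z0-9]+', text.lower())
    return '-'.join(w for w in words if w not in stop_words)


def plain_slug(text):
    """Same pipeline with an empty stop-word list."""
    return urlify(text, stop_words=frozenset())


sentence = 'This is a sentence with small words'
print(urlify(sentence))      # sentence-small-words
print(plain_slug(sentence))  # this-is-a-sentence-with-small-words
```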

Another remark. In the admin, the javascript function is called 'URLify', so that's maybe for a good reason...

08/18/08 07:18:46 changed by anonymous

  • milestone set to 1.0 maybe.

08/18/08 07:21:27 changed by Daniel Pope <dan@mauveinternet.co.uk>

  • summary changed from Admin slugify function's results defer from those of slugify template filter to Admin slugify function's results differ from those of slugify template filter.

Unfortunately, slugify isn't very well-defined outside of English. In German, for example, you would want to slugify 'Grüß' as 'gruess', but the same logic doesn't apply in other languages, where generally you can just omit accents at a pinch. We could build JSON transliteration character maps, but for i18n we would need several so that they can be selected based on locale. As an alternative, we could just do something with IRIs instead of trying to coerce Unicode to a "nice" ASCII string.

08/18/08 07:28:29 changed by bjornkri

Perhaps some sort of overriding mechanism could be implemented, say a dictionary in settings.py that is appended to and overrides the defaults? So by default 'ö' becomes 'o', but this behaviour could be changed by something like:

CHARACTER_MAP = {
    'ö': 'oe',
    'ü': 'ue',
    .....
}
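The override idea might be sketched as follows (Python 3). DEFAULT_MAP, build_map and downcode are hypothetical names for this example, and the overrides dictionary stands in for whatever a project would put in settings.py; nothing like it exists in Django as far as this ticket shows.

```python
# Hypothetical built-in defaults: strip the accent, nothing language-specific.
DEFAULT_MAP = {'ö': 'o', 'ü': 'u', 'ß': 'ss'}


def build_map(overrides=None):
    """Return the defaults with project-level overrides applied on top."""
    merged = dict(DEFAULT_MAP)
    merged.update(overrides or {})
    return merged


def downcode(value, char_map):
    """Replace every character that has an entry in char_map."""
    return ''.join(char_map.get(ch, ch) for ch in value)


# What a German project might configure.
german_map = build_map({'ö': 'oe', 'ü': 'ue'})

print(downcode('Grüße', build_map()))  # Grusse
print(downcode('Grüße', german_map))   # Gruesse
```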

But julien raises a good point, the javascript function is called URLify, not slugify, so perhaps the issue is not that slugify is 'wrong', but more that we need a python equivalent of URLify?

08/18/08 07:34:03 changed by bjornkri

My efforts at translating the javascript so far fall flat at the way python matches things:

import re

LATIN_MAP = {
'ö': 'o'
}

regex = re.compile('[ö]|[^ö]+')

pieces = regex.findall('björn')

downcoded = ""
for piece in pieces:
    mapped = ""
    try:
        mapped = LATIN_MAP[piece]
    except:
        mapped = piece
    downcoded += mapped

print pieces, downcoded

# Expected: ['bj', 'ö', 'rn'] bjorn
# Result: ['bj', '\xc3', '\xb6', 'rn'] björn

I.e. LATIN_MAP['ö'] isn't looked up, but LATIN_MAP['\xc3'] and LATIN_MAP['\xb6'] are, separately. LATIN_MAP['\xc3\xb6'] would work, but how to make sure these 'stay together' is something that leaves me stumped.
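The split can be reproduced side by side in Python 3, where text and bytes are distinct types. This is only an illustration of the matching behaviour: the byte pattern spells out the two UTF-8 bytes of 'ö' by hand, which is what the bytestring version of the regex above amounts to.

```python
import re

text = 'bj\u00f6rn'        # 'björn' as Unicode text
raw = text.encode('utf-8')  # the same word as UTF-8 bytes

# On text, the character class holds one character: 'ö'.
text_pieces = re.findall('[\u00f6]|[^\u00f6]+', text)

# On bytes, 'ö' is the two bytes C3 B6, so the class holds two separate
# members and each byte is matched on its own.
byte_pieces = re.findall(b'[\xc3\xb6]|[^\xc3\xb6]+', raw)

print(text_pieces)  # ['bj', 'ö', 'rn']
print(byte_pieces)  # [b'bj', b'\xc3', b'\xb6', b'rn']
```

With Unicode input the two bytes never exist as separate units, so there is nothing to keep together.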

08/18/08 07:41:13 changed by Daniel Pope <dan@mauveinternet.co.uk>

I think the template filter should accept an argument, so slugify:"de" might add in the German-specific rules, and so on; a settings.py option would just choose whether that was done by default. But it would be difficult to ensure those tables are comprehensive enough. I note that libiconv has rudimentary support for transliteration, so perhaps we could use their data.

Slugs and URLs are the same thing, imho.

If we can adequately provide IRIs, on the other hand, the slugify operation can be well-defined, eg.

>>> slugify(u'Føø Bär Baß')
u'føø-bär-baß'
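A slugify along these lines could be sketched like this in Python 3, where \w already matches Unicode letters. iri_slugify is a hypothetical name, and the NFKC normalization step is an added assumption so that precomposed and decomposed input behave the same; it is not something proposed in this ticket.

```python
import re
import unicodedata


def iri_slugify(value):
    """Keep Unicode letters and digits; only punctuation and spacing change."""
    # Compose characters first so 'a' + combining diaeresis counts as 'ä'.
    value = unicodedata.normalize('NFKC', value)
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


print(iri_slugify('Føø Bär Baß'))  # føø-bär-baß
```

The result is not ASCII, so it would have to be percent-encoded when placed in a URL that is not handled as an IRI.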

@bjornkri: You're still using UTF-8 encoded bytestrings. You must use unicode strings.

08/18/08 08:25:37 changed by julien

@Daniel Pope

Using IRIs seems quite promising. I guess language specific rules could be stored in the local flavors.

08/18/08 08:30:06 changed by julien

Hmmmm... this ticket is looking less and less like a ticket, and more and more like an email discussion. Should it be brought to the dev-list? Volunteer?

08/18/08 08:38:21 changed by julianb

#7980 aims to improve i18n by using CLDR data. What we need for slugify should be in there somewhere.

08/18/08 08:43:20 changed by mtredinnick

  • status changed from new to closed.
  • resolution set to wontfix.

Okay, this is all a bit of a non-issue. The Javascript and Python versions are not intended to give the same results. There is much more functionality available at the Python level, for a start, in the form of codecs and unicode mapping data. Secondly, we're not interested in shipping more and more data over in the Javascript file. The Javascript version works reasonably well for a bunch of cases. It doesn't work at all in other cases (e.g. Japanese text). In still other cases it gives some result and some people may prefer something else. The point is that it doesn't matter. It's just an aid. If you don't like what the aid gives you, you can happily edit the field in the admin, or always do it on the Python side, or create your own Javascript function to use.

The Javascript function is not meant to be something that works perfectly for everybody because transliteration is a very ambiguous area. If it doesn't work for your purposes, don't use it.

08/18/08 10:22:16 changed by Daniel Pope <dan@mauveinternet.co.uk>

  • status changed from closed to reopened.
  • resolution deleted.
  • summary changed from Admin slugify function's results differ from those of slugify template filter to slugify template filter poorly encodes non-English strings.
  • component changed from Uncategorized to Template system.
  • milestone deleted.

Sorry Malcolm, I may have subverted this ticket a little by talking about generalised handling of slugs, and thrown you off the scent.

The original ticket was about a broken Python slugify filter, not a broken Javascript function. It was simply an observation on bjornkri's part that the admin javascript works better. The Python filter is not "just an aid". It should produce acceptably good results, which it has not done for the string u'bøøøø'.

Reopening.

11/28/08 18:18:35 changed by harkal

I have created a function that downcodes a string in a way similar to what urlify does, but in Python. It can be used in conjunction with slugify like this:

slug = slugify(downcode(u'Γειά σου κόσμε!'))

Or it can be called from within slugify if the developers agree to merge it in!

Have fun!

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# (c) 2008 Harry Kalogirou <harkal@gmail.com>
# 
# * Language maps taken from django's javascript urlify
#

import re

LATIN_MAP = {
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A', u'Æ': 'AE', u'Ç':'C', 
    u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I', u'Î': 'I',
    u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O', u'Õ': 'O', u'Ö':'O', 
    u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U', u'Ű': 'U',
    u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a', u'ã': 'a', u'ä':'a', 
    u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e', u'ë': 'e',
    u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n', u'ò': 'o', u'ó':'o', 
    u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u', u'ú': 'u',
    u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y'
}
LATIN_SYMBOLS_MAP = {
    u'©':'(c)'
}
GREEK_MAP = {
    u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z', u'η':'h', u'θ':'8',
    u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3', u'ο':'o', u'π':'p',
    u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x', u'ψ':'ps', u'ω':'w',
    u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h', u'ώ':'w', u'ς':'s',
    u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
    u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z', u'Η':'H', u'Θ':'8',
    u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3', u'Ο':'O', u'Π':'P',
    u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X', u'Ψ':'PS', u'Ω':'W',
    u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H', u'Ώ':'W', u'Ϊ':'I',
    u'Ϋ':'Y'
}
TURKISH_MAP = {
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
    u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
}
RUSSIAN_MAP = {
    u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e', u'ё':'yo', u'ж':'zh',
    u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m', u'н':'n', u'о':'o',
    u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f', u'х':'h', u'ц':'c',
    u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'', u'э':'e', u'ю':'yu',
    u'я':'ya',
    u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E', u'Ё':'Yo', u'Ж':'Zh',
    u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M', u'Н':'N', u'О':'O',
    u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F', u'Х':'H', u'Ц':'C',
    u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'', u'Э':'E', u'Ю':'Yu',
    u'Я':'Ya'
}
UKRAINIAN_MAP = {
    u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i', u'ї':'yi', u'ґ':'g'
}
CZECH_MAP = {
    u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s', u'ť':'t', u'ů':'u',
    u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R', u'Š':'S', u'Ť':'T',
    u'Ů':'U', u'Ž':'Z'
}

POLISH_MAP = {
    u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o', u'ś':'s', u'ź':'z',
    u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'E', u'Ł':'L', u'Ń':'N', u'Ó':'O', u'Ś':'S',
    u'Ź':'Z', u'Ż':'Z'
}

LATVIAN_MAP = {
    u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k', u'ļ':'l', u'ņ':'n',
    u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E', u'Ģ':'G', u'Ī':'I',
    u'Ķ':'K', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'U', u'Ž':'Z'
}

def _makeRegex():
    ALL_DOWNCODE_MAPS = {}
    ALL_DOWNCODE_MAPS.update(LATIN_MAP)
    ALL_DOWNCODE_MAPS.update(LATIN_SYMBOLS_MAP)
    ALL_DOWNCODE_MAPS.update(GREEK_MAP)
    ALL_DOWNCODE_MAPS.update(TURKISH_MAP)
    ALL_DOWNCODE_MAPS.update(RUSSIAN_MAP)
    ALL_DOWNCODE_MAPS.update(UKRAINIAN_MAP)
    ALL_DOWNCODE_MAPS.update(CZECH_MAP)
    ALL_DOWNCODE_MAPS.update(POLISH_MAP)
    ALL_DOWNCODE_MAPS.update(LATVIAN_MAP)
    
    s = u"".join(ALL_DOWNCODE_MAPS.keys())
    regex = re.compile(u"[%s]|[^%s]+" % (s,s))
    
    return ALL_DOWNCODE_MAPS, regex

# Built lazily on the first call to downcode().
_MAPPINGS = None
_regex = None
def downcode(s):
    """
    Downcode the unicode string passed in the parameter s, i.e. return the
    closest representation of a multilingual string in plain latin chars.
    The most probable use is before calling slugify.
    """
    global _MAPPINGS, _regex

    if not _regex:
        _MAPPINGS, _regex = _makeRegex()

    downcoded = u""
    # Each piece is either one mapped character or a run of unmapped ones.
    for piece in _regex.findall(s):
        if piece in _MAPPINGS:
            downcoded += _MAPPINGS[piece]
        else:
            downcoded += piece
    return downcoded


if __name__ == "__main__":
    string = u'Καλημέρα Joe!'
    print 'Original  :', string
    print 'Downcoded :', downcode(string)


02/25/09 14:34:03 changed by jacob

  • status changed from reopened to closed.
  • resolution set to wontfix.

Please don't reopen tickets closed by a committer. The correct way to revisit issues is to take it up on django-dev.

02/28/09 21:52:05 changed by mtredinnick

  • status changed from closed to reopened.
  • resolution deleted.

Jacob, I probably wontfixed in error, due to the confusion sown by Daniel. It's worth looking at this.

03/05/09 10:24:44 changed by anonymous

  • stage changed from Unreviewed to Design decision needed.

03/29/09 17:10:34 changed by HM

Another datapoint: In my language both 'å' and 'ø' are valid words in and of themselves... the standard slugify reduces both of these to . Oopsy. 'æææææææ' is a popular way to describe a scream, it also becomes... .

I have my own slugify-function that turns the unicode-string into NFKD then slugifys that, then checks that the string isn't empty and if it is: adds a dummy string + the datetime + random string. This is independent of locale, which I consider a bonus.
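A fallback of the kind described above could look roughly like this (Python 3). It is a sketch, not HM's actual function: ascii_slugify is a hypothetical stand-in for the standard filter, and the 'item' prefix, the timestamp format and the six-letter suffix are arbitrary choices for the example.

```python
import random
import re
import string
import unicodedata
from datetime import datetime


def ascii_slugify(value):
    """Hypothetical stand-in for the standard filter (NFKD, ASCII only)."""
    value = unicodedata.normalize('NFKD', value)
    value = value.encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


def safe_slugify(value, dummy='item'):
    """Never return an empty slug: fall back to dummy + datetime + random."""
    slug = ascii_slugify(value)
    if slug:
        return slug
    stamp = datetime.now().strftime('%Y%m%d%H%M%S')
    suffix = ''.join(random.choice(string.ascii_lowercase) for _ in range(6))
    return '%s-%s-%s' % (dummy, stamp, suffix)


print(safe_slugify('Hello World'))  # hello-world
print(safe_slugify('æææææææ'))      # something like item-<timestamp>-<random>
```

The fallback slug carries no meaning, but it is non-empty and very unlikely to collide, which is usually what a slug field needs.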

03/29/09 17:11:32 changed by HM

And it seems trac can't handle those letters either =)

01/03/10 13:05:52 changed by iElectric

Why not use a proper Unicode transliteration package like http://pypi.python.org/pypi/Unidecode/0.04.1 ? Transliteration is currently the best way to go from Unicode to ASCII.

01/04/10 04:24:37 changed by Daniel Pope <dan@mauveinternet.co.uk>

That package is too big to bundle and too trivial for Django to depend strongly upon. But it's a good starting place if you want to write your own slugify filter.

