Opened 16 years ago

Closed 10 years ago

Last modified 10 years ago

#8391 closed Bug (wontfix)

slugify template filter poorly encodes non-English strings

Reported by: Bjorn Kristinsson
Owned by: nobody
Component: Template system
Version: dev
Severity: Normal
Keywords:
Cc: hr.bjarni+django@…, kmike84@…, mmitar@…
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Going through the admin interface with a slug field, 'bøøøø' becomes 'boooo' (as expected)

But running this code:
from django.template.defaultfilters import slugify

print slugify('bøøøø')
print slugify(u'bøøøø')

results in:
'b'
'ba-a-a-a'

Results vary depending on which characters are used; I found this trying to inject a bunch of cyrillic and greek into a database, and most of the slug fields were empty. Entering them manually through the admin interface worked fine.

Change History (39)

comment:1 by Bjorn Kristinsson, 16 years ago

The code snippet again, for easy copying and pasting...

from django.template.defaultfilters import slugify

print slugify('bøøøø') 
print slugify(u'bøøøø')

results in: 'b' 'ba-a-a-a'

comment:2 by Daniel Pope <dan@…>, 16 years ago

For the first example, the expected results aren't well defined: Unicode characters can't be reliably represented in a bytestring. I think Django does smart_unicode on the input so it works for UTF-8 byte strings but that's just Django being flexible.

For the second example, I suspect where you've typed u'bøøøø', Python has interpreted your unicode string literal as the wrong character set, equivalent to

u'bøøøø'.encode('utf8').decode('iso-8859-1')

Putting the same thing in a script with a PEP-263 header gives 'b' for the second example.
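
The mis-decoding explanation can be checked with a short sketch. It is written for Python 3, where every str is Unicode, and it uses a hypothetical old_slugify helper that is only assumed to approximate the filter of that era (pieced together from the normalize/encode line and the two regular expressions quoted elsewhere in this ticket); it is not Django's actual code.

```python
import re
import unicodedata


def old_slugify(value):
    """Approximation of the ASCII-only filter discussed in this ticket (assumed)."""
    # Decompose characters, then silently drop everything that is not ASCII.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Remove anything that is not a word character, whitespace or a hyphen.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)


correct = 'bøøøø'
# Simulate UTF-8 source text being read as ISO-8859-1: each 'ø' becomes 'Ã' + '¸'.
misdecoded = correct.encode('utf-8').decode('iso-8859-1')

print(old_slugify(correct))     # b
print(old_slugify(misdecoded))  # ba-a-a-a
```

The second result matches the report because 'Ã' decomposes to 'A' plus a combining tilde, and '¸' decomposes to a space plus a combining cedilla, so each mis-decoded 'ø' leaves an 'a' and a separator behind.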

comment:3 by Julien Phalip, 16 years ago

The reason why it doesn't give the same result in the admin and via the code above is that different algorithms are used.
In the admin, it is done with some javascript, and odd characters are replaced by their latin 'equivalent', in particular:

var LATIN_MAP = {
   ...
   'ø': 'o',
   ...
}

In the template filter, odd characters that are not representable in ASCII are simply stripped out, see the 'ignore' below:

value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')

Maybe the filter should replicate the javascript's algorithm, or vice versa, to make things homogeneous?
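
Why some accented letters survive that line while 'ø' vanishes can be illustrated with a small Python 3 sketch. The ascii_fold helper is a hypothetical wrapper around the quoted line; the point is that NFKD only helps for characters that have a Unicode decomposition.

```python
import unicodedata


def ascii_fold(value):
    """Apply the normalize-then-ignore step quoted above (Python 3 spelling)."""
    decomposed = unicodedata.normalize('NFKD', value)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# 'é' decomposes into 'e' plus a combining accent, so the base letter survives.
print(repr(unicodedata.decomposition('é')))  # '0065 0301'
# 'ø' has no decomposition at all, so the whole character is dropped.
print(repr(unicodedata.decomposition('ø')))  # ''

print(ascii_fold('café'))   # cafe
print(ascii_fold('bjørn'))  # bjrn
```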

comment:4 by Bjorn Kristinsson, 16 years ago (in reply to: comment:3)

Yep, I've been digging in and found the same. I absolutely think the two should work the same way, especially since it would make my life so much easier ;)

var LATIN_MAP = {
   ...
   'ø': 'o',
   ...
}

I'm trying to translate the javascript into Python, but Python's handling of this is giving me a headache. In the javascript function, there's a regular expression that matches either a single character in the list of special characters, or a sequence of one or more characters not in it.

As an example, 'bjørn' becomes 'bj', 'ø' and 'rn'. It then tries to find these in the above map: LATIN_MAP['bj'] returns nothing so it's left unchanged, LATIN_MAP['ø'] returns 'o', and LATIN_MAP['rn'] nothing again. The result is 'bjorn'.

Python, on the other hand, matches 'bj', '\xc3' and '\xb8rn'. There is no match for '\xc3' in LATIN_MAP, only in the regexp. Now I just need to find a way of splitting this correctly, any ideas?

comment:5 by Bjorn Kristinsson, 16 years ago

... I really need to start using the 'Preview' function. Hope that's legible.

comment:6 by Jökull Sólberg Auðunsson <jokullsolberg@…>, 16 years ago

Maybe there should be a JSON file with the character map for DRY.

comment:7 by Julien Phalip, 16 years ago

Summary: Results of slugify in the admin interface differ from the one in shell. → Admin slugify function's results defer from those of slugify template filter

There's another difference between the two algorithms. In the admin, small words are stripped out by the javascript. For example, "This is a sentence with small words" returns "sentence-small-words". Whereas the template filter gives "this-is-a-sentence-with-small-words".
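
The difference can be sketched in Python 3 as follows. Both helper names and the SMALL_WORDS set are hypothetical: the set is only an illustrative subset chosen to reproduce the example above, not the admin's real word list.

```python
import re

# Illustrative subset only; the admin's real word list is not reproduced here.
SMALL_WORDS = {'a', 'is', 'this', 'with'}


def urlify_sketch(text):
    """Drop small words, then hyphenate what is left (assumed admin behaviour)."""
    words = re.findall(r'[a-z0-9]+', text.lower())
    kept = [word for word in words if word not in SMALL_WORDS]
    return '-'.join(kept)


def slugify_sketch(text):
    """Keep every word, as the template filter does."""
    return '-'.join(re.findall(r'[a-z0-9]+', text.lower()))


sentence = 'This is a sentence with small words'
print(urlify_sketch(sentence))   # sentence-small-words
print(slugify_sketch(sentence))  # this-is-a-sentence-with-small-words
```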

Ideally, this word replacement should be configurable per-language. Maybe it already is, I've never tried.

Another remark. In the admin, the javascript function is called 'URLify', so that's maybe for a good reason...

comment:8 by anonymous, 16 years ago

milestone: 1.0 maybe

comment:9 by Daniel Pope <dan@…>, 16 years ago

Summary: Admin slugify function's results defer from those of slugify template filter → Admin slugify function's results differ from those of slugify template filter

Unfortunately, slugify isn't very well-defined outside of English. In German, for example, you would want to slugify 'Grüß' as 'gruess', but the same logic doesn't apply in other languages, where generally you can just omit accents at a pinch. We could build JSON transliteration character maps, but for i18n we would need several so that they can be selected based on locale. As an alternative, we could just do something with IRIs instead of trying to coerce Unicode to a "nice" ASCII string.

comment:10 by Bjorn Kristinsson, 16 years ago

Perhaps some sort of overriding mechanism could be implemented, say a dictionary in settings.py that is appended to and overrides the defaults? So by default 'ö' becomes 'o', but this behaviour could be changed by something like:

CHARACTER_MAP = {
    'ö': 'oe',
    'ü': 'ue',
    .....
}
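
A minimal Python 3 sketch of such an override mechanism might look like this. DEFAULT_MAP, GERMAN_OVERRIDES and the downcode signature are all assumptions made for illustration; no such setting exists in Django.

```python
# Hypothetical defaults; a real table would be far larger.
DEFAULT_MAP = {'ö': 'o', 'ü': 'u', 'ß': 'ss'}


def downcode(text, overrides=None):
    """Replace mapped characters, letting project overrides win over defaults."""
    table = dict(DEFAULT_MAP)
    table.update(overrides or {})
    # Characters without an entry are passed through unchanged.
    return ''.join(table.get(char, char) for char in text)


# What a project might put in its settings (hypothetical name).
GERMAN_OVERRIDES = {'ö': 'oe', 'ü': 'ue'}

print(downcode('Grüß'))                    # Gruss
print(downcode('Grüß', GERMAN_OVERRIDES))  # Gruess
```

Copying the defaults before updating keeps the shared table untouched, so one call's overrides cannot leak into the next.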

But julien raises a good point, the javascript function is called URLify, not slugify, so perhaps the issue is not that slugify is 'wrong', but more that we need a python equivalent of URLify?

comment:11 by Bjorn Kristinsson, 16 years ago

My efforts at translating the javascript so far fall flat at the way python matches things:

import re

LATIN_MAP = {
'ö': 'o'
}

regex = re.compile('[ö]|[^ö]+')

pieces = regex.findall('björn')

downcoded = ""
for piece in pieces:
    mapped = ""
    try:
        mapped = LATIN_MAP[piece]
    except:
        mapped = piece
    downcoded += mapped

print pieces, downcoded

# Expected: ['bj', 'ö', 'rn'] bjorn
# Result: ['bj', '\xc3', '\xb6', 'rn'] björn

I.e. LATIN_MAP['ö'] isn't looked up, but LATIN_MAP['\xc3'] and LATIN_MAP['\xb6'] are, separately. LATIN_MAP['\xc3\xb6'] would work, but how to make sure these 'stay together' is something that leaves me stumped.

comment:12 by Daniel Pope <dan@…>, 16 years ago

I think the template filter should accept an argument, so slugify:"de" might add in the German-specific rules, and so on; a settings.py option would just choose whether that was done by default. But it would be difficult to ensure those tables are comprehensive enough. I note that libiconv has rudimentary support for transliteration, so perhaps we could use their data.

Slugs and URLs are the same thing, imho.

If we can adequately provide IRIs, on the other hand, the slugify operation can be well-defined, eg.

>>> slugify(u'Føø Bär Baß')
u'føø-bär-baß'
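
One way such a Unicode-preserving slugify could be sketched in Python 3 is shown below. The unicode_slugify name is hypothetical; the only real library calls are re.sub and urllib.parse.quote, the latter showing how the resulting IRI path segment would be percent-encoded on the wire.

```python
import re
from urllib.parse import quote


def unicode_slugify(value):
    """Slugify without forcing ASCII (a sketch, not an existing Django filter)."""
    # For str patterns in Python 3, \w matches Unicode letters and digits.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


slug = unicode_slugify('Føø Bär Baß')
print(slug)         # føø-bär-baß
# The URI form of the same slug, with UTF-8 bytes percent-encoded.
print(quote(slug))
```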

@bjornkri: You're still using UTF-8 encoded bytestrings. You must use unicode strings.
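
The difference between the two string types can be reproduced in Python 3, where the distinction is explicit. This sketch reuses the pattern from comment:11; the variable names are illustrative.

```python
import re

pattern = '[ö]|[^ö]+'
text = 'björn'

# Unicode strings: 'ö' is a single character, so it stays in one piece.
unicode_pieces = re.findall(pattern, text)
print(unicode_pieces)  # ['bj', 'ö', 'rn']

# UTF-8 bytes: 'ö' is two bytes, and the character class matches each byte separately.
byte_pieces = re.findall(pattern.encode('utf-8'), text.encode('utf-8'))
print(byte_pieces)     # [b'bj', b'\xc3', b'\xb6', b'rn']
```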

comment:13 by Julien Phalip, 16 years ago

@Daniel Pope

Using IRIs seems quite promising. I guess language specific rules could be stored in the local flavors.

comment:14 by Julien Phalip, 16 years ago

Hmmmm... this ticket is looking less and less like a ticket, and more and more like an email discussion. Should it be brought to the dev-list? Volunteer?

comment:15 by Julian Bez, 16 years ago

#7980 aims to improve i18n by using CLDR data. What we need for slugify should be in there somewhere.

comment:16 by Malcolm Tredinnick, 16 years ago

Resolution: wontfix
Status: new → closed

Okay, this is all a bit of a non-issue. The Javascript and Python versions are not intended to give the same results. There is much more functionality available at the Python level, for a start, in the form of codecs and unicode mapping data. Secondly, we're not interested in shipping more and more data over in the Javascript file. The Javascript version works reasonably well for a bunch of cases. It doesn't work at all in other cases (e.g. Japanese text). In still other cases it gives some result and some people may prefer something else. The point is that it doesn't matter. It's just an aid. If you don't like what the aid gives you, you can happily edit the field in the admin, or always do it on the Python side, or create your own Javascript function to use.

The Javascript function is not meant to be something that works perfectly for everybody because transliteration is a very ambiguous area. If it doesn't work for your purposes, don't use it.

comment:17 by Daniel Pope <dan@…>, 16 years ago

Component: Uncategorized → Template system
milestone: 1.0 maybe
Resolution: wontfix
Status: closed → reopened
Summary: Admin slugify function's results differ from those of slugify template filter → slugify template filter poorly encodes non-English strings

Sorry Malcolm, I may have subverted this ticket a little by talking about generalised handling of slugs, and thrown you off the scent.

The original ticket was about a broken Python slugify filter, not a broken Javascript function. It was simply an observation on bjornkri's part that the admin javascript works better. The Python filter is not "just an aid". It should produce acceptably good results, which it has not done for the string u'bøøøø'.

Reopening.

comment:18 by harkal, 16 years ago

I have created a function that downcodes a string in a way similar to what urlify does but in Python.
This can be used in conjunction with slugify like this:

slug = slugify(downcode(u'Γειά σου κόσμε!'))

Or it can be called from within slugify if the developers agree to merge it in!

Have fun!

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# (c) 2008 Harry Kalogirou <harkal@gmail.com>
# 
# * Language maps taken from django's javascript urlify
#

import re

LATIN_MAP = {
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A', u'Æ': 'AE', u'Ç':'C', 
    u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I', u'Î': 'I',
    u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O', u'Õ': 'O', u'Ö':'O', 
    u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U', u'Ű': 'U',
    u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a', u'ã': 'a', u'ä':'a', 
    u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e', u'ë': 'e',
    u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n', u'ò': 'o', u'ó':'o', 
    u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u', u'ú': 'u',
    u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y'
}
LATIN_SYMBOLS_MAP = {
    u'©':'(c)'
}
GREEK_MAP = {
    u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z', u'η':'h', u'θ':'8',
    u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3', u'ο':'o', u'π':'p',
    u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x', u'ψ':'ps', u'ω':'w',
    u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h', u'ώ':'w', u'ς':'s',
    u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
    u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z', u'Η':'H', u'Θ':'8',
    u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3', u'Ο':'O', u'Π':'P',
    u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X', u'Ψ':'PS', u'Ω':'W',
    u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H', u'Ώ':'W', u'Ϊ':'I',
    u'Ϋ':'Y'
}
TURKISH_MAP = {
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
    u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
}
RUSSIAN_MAP = {
    u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e', u'ё':'yo', u'ж':'zh',
    u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m', u'н':'n', u'о':'o',
    u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f', u'х':'h', u'ц':'c',
    u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'', u'э':'e', u'ю':'yu',
    u'я':'ya',
    u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E', u'Ё':'Yo', u'Ж':'Zh',
    u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M', u'Н':'N', u'О':'O',
    u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F', u'Х':'H', u'Ц':'C',
    u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'', u'Э':'E', u'Ю':'Yu',
    u'Я':'Ya'
}
UKRAINIAN_MAP = {
    u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i', u'ї':'yi', u'ґ':'g'
}
CZECH_MAP = {
    u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s', u'ť':'t', u'ů':'u',
    u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R', u'Š':'S', u'Ť':'T',
    u'Ů':'U', u'Ž':'Z'
}

POLISH_MAP = {
    u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o', u'ś':'s', u'ź':'z',
    u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'E', u'Ł':'L', u'Ń':'N', u'Ó':'O', u'Ś':'S',
    u'Ź':'Z', u'Ż':'Z'
}

LATVIAN_MAP = {
    u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k', u'ļ':'l', u'ņ':'n',
    u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E', u'Ģ':'G', u'Ī':'I',
    u'Ķ':'K', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'U', u'Ž':'Z'
}

def _makeRegex():
    ALL_DOWNCODE_MAPS = {}
    ALL_DOWNCODE_MAPS.update(LATIN_MAP)
    ALL_DOWNCODE_MAPS.update(LATIN_SYMBOLS_MAP)
    ALL_DOWNCODE_MAPS.update(GREEK_MAP)
    ALL_DOWNCODE_MAPS.update(TURKISH_MAP)
    ALL_DOWNCODE_MAPS.update(RUSSIAN_MAP)
    ALL_DOWNCODE_MAPS.update(UKRAINIAN_MAP)
    ALL_DOWNCODE_MAPS.update(CZECH_MAP)
    ALL_DOWNCODE_MAPS.update(POLISH_MAP)
    ALL_DOWNCODE_MAPS.update(LATVIAN_MAP)
    
    s = u"".join(ALL_DOWNCODE_MAPS.keys())
    regex = re.compile(u"[%s]|[^%s]+" % (s,s))
    
    return ALL_DOWNCODE_MAPS, regex

_MAPINGS = None
_regex = None
def downcode(s):
    """
    Downcode the string passed in the parameter s. This is useful in cases
    where we want the closest representation of a multilingual string in simple
    latin chars. The most probable use is before calling slugify.
    """
    global _MAPINGS, _regex
    
    if not _regex:
        _MAPINGS, _regex = _makeRegex()    
        
    downcoded = ""
    for piece in _regex.findall(s):
        if _MAPINGS.has_key(piece):
            downcoded += _MAPINGS[piece]
        else:
            downcoded += piece
    return downcoded


if __name__ == "__main__":
    string = u'Καλημέρα Joe!'
    print 'Original  :', string
    print 'Downcoded :', downcode(string)
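
For readers on Python 3, the same idea can be sketched more compactly. This is an assumption-laden variant rather than a port of the code above: downcode3 and GREEK_SUBSET are hypothetical names, the map is a small subset of the Greek table, and the pattern is built as an alternation of escaped keys so that a key containing a regex metacharacter would still be matched literally.

```python
import re

# Small illustrative subset of the Greek map above.
GREEK_SUBSET = {
    'Κ': 'K', 'α': 'a', 'λ': 'l', 'η': 'h', 'μ': 'm', 'έ': 'e', 'ρ': 'r',
    'ψ': 'ps',
}

# One alternation of escaped keys; escaping keeps regex metacharacters literal.
_PATTERN = re.compile('|'.join(re.escape(key) for key in GREEK_SUBSET))


def downcode3(text):
    """Replace every mapped character; everything else is left untouched."""
    return _PATTERN.sub(lambda match: GREEK_SUBSET[match.group()], text)


print(downcode3('Καλημέρα Joe!'))  # Kalhmera Joe!
```

Using a substitution callback avoids the manual loop over pieces, and unmapped text never passes through the dictionary lookup at all.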


comment:19 by Jacob, 16 years ago

Resolution: wontfix
Status: reopened → closed

Please don't reopen tickets closed by a committer. The correct way to revisit issues is to take it up on django-dev.

comment:20 by Malcolm Tredinnick, 16 years ago

Resolution: wontfix
Status: closed → reopened

Jacob, I probably wontfixed in error, due to the confusion sown by Daniel. It's worth looking at this.

comment:21 by anonymous, 16 years ago

Triage Stage: Unreviewed → Design decision needed

comment:22 by HM, 15 years ago

Another datapoint: In my language both 'å' and 'ø' are valid words in and of themselves... the standard slugify reduces both of these to . Oopsy. 'æææææææ' is a popular way to describe a scream, it also becomes... .

I have my own slugify-function that turns the unicode-string into NFKD then slugifys that, then checks that the string isn't empty and if it is: adds a dummy string + the datetime + random string. This is independent of locale, which I consider a bonus.
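
A fallback of that kind could be sketched in Python 3 roughly as below. Everything here is an assumption for illustration: the helper names, the 'untitled' dummy string, the timestamp format and the length of the random suffix are not taken from the function described above.

```python
import re
import secrets
import unicodedata
from datetime import datetime


def ascii_slug(value):
    """NFKD-normalize, drop non-ASCII, then hyphenate (assumed baseline)."""
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


def slug_with_fallback(value, dummy='untitled'):
    """Return the ASCII slug, or a generated placeholder if it would be empty."""
    slug = ascii_slug(value)
    if slug:
        return slug
    # Dummy string + timestamp + random suffix, so the result is never empty.
    stamp = datetime.now().strftime('%Y%m%d%H%M%S')
    return '-'.join([dummy, stamp, secrets.token_hex(4)])


print(slug_with_fallback('Hello World'))  # hello-world
print(slug_with_fallback('æææææææ'))      # placeholder; varies on every call
```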

comment:23 by HM, 15 years ago

And it seems trac can't handle those letters either =)

comment:24 by Domen Kožar, 15 years ago

Why not use proper Unicode transliteration package like http://pypi.python.org/pypi/Unidecode/0.04.1 ? Transliteration is currently the best way to go Unicode->ASCII

comment:25 by Daniel Pope <dan@…>, 15 years ago

That package is too big to bundle and too trivial for Django to depend strongly upon. But it's a good starting place if you want to write your own slugify filter.

comment:26 by hejsan, 14 years ago

Cc: hr.bjarni+django@… added

Hi, this ticket is way too old for a trivial feature (for non-English-speaking-country based programmers).

When slugifying characters in my language, some of them are properly downgraded to ascii lookalikes but some of them get lost. That's pretty irritating for such a trivial feature.

There seems to be a consensus in other web frameworks on slugifying international characters, and that is to have a map.

Did you (core programmers) take a look at harkal's suggestion above? This is the way that e.g. WordPress and others go about solving this problem.
I also found this ready made function: slughify

It is small, concise and it works.

If you decide you don't want/need to fix/upgrade the slugify function, or if you think it will take a very long time before you decide, then I'd like to suggest that it be made into a setting as soon as possible:
SLUGIFY_FUNCTION = myown_slugify

with django.template.defaultfilters.slugify as the default
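
How a setting like this might be resolved can be sketched with the standard library alone. SLUGIFY_FUNCTION remains a hypothetical setting, and load_callable is an illustrative helper; a stdlib function stands in for the configured slugify so that the sketch runs without Django installed.

```python
from importlib import import_module


def load_callable(dotted_path):
    """Import 'package.module.name' and return the object called name."""
    module_path, _, name = dotted_path.rpartition('.')
    return getattr(import_module(module_path), name)


# Hypothetical setting value; in a project the default would point at
# django.template.defaultfilters.slugify instead of this stand-in.
SLUGIFY_FUNCTION = 'urllib.parse.quote'

slugify_function = load_callable(SLUGIFY_FUNCTION)
print(slugify_function('a b'))  # a%20b
```

Storing a dotted path rather than the function object keeps the settings module free of imports, at the cost of errors surfacing only when the path is first resolved.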

But optimally the function provided by django should work for all languages in my opinion.

Thanks

comment:27 by Erik Allik, 14 years ago

Cc: eallik+django@… added

comment:28 by Mikhail Korobov, 13 years ago

Cc: kmike84@… added

comment:29 by Erik Allik, 13 years ago

Cc: eallik+django@… removed

comment:30 by Luke Plant, 13 years ago

Severity: Normal
Type: Bug

comment:31 by Mitar, 13 years ago

Cc: mmitar@… added
Easy pickings: unset
UI/UX: unset

I have made a slugify2 function which first downcodes and then translates to a slug. It behaves exactly the same as its JavaScript counterpart, so it is now possible to have the same behavior in both Python and JavaScript.

comment:32 by Preston Holmes, 13 years ago

Triage Stage: Design decision needed → Accepted

see #16853 for a Turkish case

Seems that there have been no objections to the downcode then slugify approach.

This seems ready for someone to take a shot at implementing that approach in a patch.

comment:33 by Mitar, 13 years ago

You can take the above slugify2 function.

comment:34 by yasar11732@…, 13 years ago

Above slugify2 function won't fix #16853.

# -*- coding: utf-8 -*-
import sys
import re

from django.utils import encoding

TURKISH_MAP = {
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
    u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
}

ALL_DOWNCODE_MAPS = [
    TURKISH_MAP,
]

class Downcoder(object):
    map = {}
    regex = None

    def __init__(self):
        self.map = {}
        chars = u''

        for lookup in ALL_DOWNCODE_MAPS:
            for c, l in lookup.items():
                self.map[c] = l
                chars += encoding.force_unicode(c)

        self.regex = re.compile(ur'[' + chars + ']|[^' + chars + ']+', re.U)
        
downcoder = Downcoder()

def downcode(value):
    downcoded = u''
    pieces = downcoder.regex.findall(value)

    if pieces:
        for p in pieces:
            mapped = downcoder.map.get(p)
            if mapped:
                downcoded += mapped
            else:
                downcoded += p
    else:
        downcoded = value

    return value
    
def slugify2(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = downcode(value)
    value = unicodedata.normalize('NFD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    return re.sub('[-\s]+', '-', value)

print(slugify2(u"Işık ılık süt iç"))
        

This prints "isk-lk-sut-ic", but the expected value is "isik-ilik-sut-ic".
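
The mechanics behind this result can be sketched in Python 3. In the code as posted, downcode returns its input unchanged, so the Turkish map is never applied; the dotless 'ı' has no Unicode decomposition, which means the NFD plus 'ignore' step simply drops it. The helper names below are hypothetical, and str.translate is swapped in for the regex-based lookup purely for brevity.

```python
import re
import unicodedata

TURKISH_MAP = {
    'ş': 's', 'Ş': 'S', 'ı': 'i', 'İ': 'I', 'ç': 'c', 'Ç': 'C',
    'ü': 'u', 'Ü': 'U', 'ö': 'o', 'Ö': 'O', 'ğ': 'g', 'Ğ': 'G',
}
_TABLE = str.maketrans(TURKISH_MAP)


def ascii_slug(value):
    """The slugify step on its own, without any downcoding."""
    value = unicodedata.normalize('NFD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


def slugify_turkish(value):
    """Apply the character map first, then slugify the result."""
    # Recompose first so the precomposed map keys match the input characters.
    value = unicodedata.normalize('NFC', value).translate(_TABLE)
    return ascii_slug(value)


text = 'Işık ılık süt iç'
print(ascii_slug(text))       # isk-lk-sut-ic
print(slugify_turkish(text))  # isik-ilik-sut-ic
```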

comment:35 by Mitar, 13 years ago

Ups. That was a bug. Fixed version of slugify2.

comment:36 by Aymeric Augustin, 11 years ago

Status: reopened → new

comment:38 by Aymeric Augustin, 10 years ago

Resolution: wontfix
Status: new → closed

There's obviously more than one way to achieve slugification, depending on your tastes and constraints.

If we try to be smart, we'll get dozens and dozens of tickets from people who want to be smarter -- see the urlize filter for an example.

Django's implementation has the advantage of being simple and relying only on the stdlib. Pretty good solutions are available externally.

The drawbacks of implementing something more complicated outweigh the advantages at this stage.
