Opened 10 years ago

Closed 3 years ago

Last modified 3 years ago

#8391 closed Bug (wontfix)

slugify template filter poorly encodes non-English strings

Reported by: Bjorn Kristinsson
Owned by: nobody
Component: Template system
Version: master
Severity: Normal
Keywords:
Cc: hr.bjarni+django@…, kmike84@…, mmitar@…
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no


Going through the admin interface with a slug field, 'bøøøø' becomes 'boooo' (as expected)

But running this code:
from django.template.defaultfilters import slugify

print slugify('bøøøø')
print slugify(u'bøøøø')

results in:

Results vary depending on which characters are used; I found this while trying to inject a bunch of Cyrillic and Greek text into a database, and most of the slug fields came out empty. Entering them manually through the admin interface worked fine.

Change History (39)

comment:1 Changed 10 years ago by Bjorn Kristinsson

The code snippet again, for easy copying and pasting...

from django.template.defaultfilters import slugify

print slugify('bøøøø') 
print slugify(u'bøøøø')

results in: 'b' 'ba-a-a-a'

comment:2 Changed 10 years ago by Daniel Pope <dan@…>

For the first example, the expected results aren't well defined: Unicode characters can't be reliably represented in a bytestring. I think Django does smart_unicode on the input so it works for UTF-8 byte strings but that's just Django being flexible.

For the second example, I suspect where you've typed u'bøøøø', Python has interpreted your unicode string literal as the wrong character set, equivalent to


Putting the same thing in a script with a PEP-263 header gives 'b' for the second example.
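
The two results reported in comment:1 can be reproduced with a minimal Python 3 sketch. It is not Django's code: `slugify_ascii` is a hypothetical helper that mirrors the NFKD-normalise, drop-non-ASCII and hyphenate steps quoted later in this ticket, and Latin-1 is only an assumed guess at the "wrong character set".

```python
import re
import unicodedata


def slugify_ascii(value):
    """Hypothetical stand-in for the NFKD + ASCII-only slug steps."""
    # Decompose characters, then silently drop whatever is still non-ASCII.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Remove anything that is not a word character, whitespace or hyphen.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)


correct = 'bøøøø'
# Assumption: the UTF-8 bytes of the literal were read as Latin-1.
misdecoded = correct.encode('utf-8').decode('latin-1')

print(slugify_ascii(correct))     # b
print(slugify_ascii(misdecoded))  # ba-a-a-a
```

Each mis-decoded 'ø' becomes 'Ã' followed by a cedilla; NFKD turns those into 'A' plus a combining tilde and a space plus a combining cedilla, which is where the extra 'a' segments come from.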

comment:3 Changed 10 years ago by Julien Phalip

The reason why it doesn't give the same result in the admin and via the code above is that different algorithms are used.
In the admin, it is done with some javascript, and odd characters are replaced by their latin 'equivalent', in particular:

var LATIN_MAP = {
   'ø': 'o',

In the template filter, odd characters that are not representable in ASCII are simply stripped out, see the 'ignore' below:

value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')

Maybe the filter should replicate the javascript's algorithm, or vice versa, to make things homogeneous?
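
Why some accented letters survive the 'ignore' step and others vanish can be checked with the standard `unicodedata` module. A small Python 3 sketch (the helper name `ascii_after_nfkd` is made up for illustration): characters with a Unicode decomposition keep their base letter, characters without one are dropped.

```python
import unicodedata


def ascii_after_nfkd(char):
    """Return what is left of a character after NFKD + encode('ascii', 'ignore')."""
    return unicodedata.normalize('NFKD', char).encode('ascii', 'ignore').decode('ascii')


# U+00E9 (e acute), U+00F8 (o with stroke), U+00E6 (ae ligature), U+00DF (sharp s)
for char in ('\u00e9', '\u00f8', '\u00e6', '\u00df'):
    print(repr(char), repr(unicodedata.decomposition(char)), repr(ascii_after_nfkd(char)))
```

Only the first character has a decomposition ('0065 0301'), so only it leaves an ASCII letter behind; the other three come out as empty strings.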

comment:4 in reply to:  3 Changed 10 years ago by Bjorn Kristinsson

Yep, I've been digging in and found the same. I absolutely think the two should work the same way, especially since it would make my life so much easier ;)

var LATIN_MAP = {
   'ø': 'o',

I'm trying to translate the javascript into Python, but Python's handling of this is giving me a headache. In the javascript function, there's a regular expression that matches either a single character in the list of special characters, or a sequence of one or more characters not in it.

As an example, 'bjørn' becomes 'bj', 'ø' and 'rn'. It then tries to find these in the above map: LATIN_MAP['bj'] returns nothing so it's left unchanged, LATIN_MAP['ø'] returns 'o', and LATIN_MAP['rn'] nothing again. The result is 'bjorn'.

Python, on the other hand, matches 'bj', '\xc3' and '\xb8rn'. There is no match for '\xc3' in LATIN_MAP, only in the regexp. Now I just need to find a way of splitting this correctly, any ideas?

comment:5 Changed 10 years ago by Bjorn Kristinsson

... I really need to start using the 'Preview' function. Hope that's legible.

comment:6 Changed 10 years ago by Jökull Sólberg Auðunsson <jokullsolberg@…>

Maybe there should be a JSON file with the character map for DRY.

comment:7 Changed 10 years ago by Julien Phalip

Summary: Results of slugify in the admin interface differ from the one in shell. → Admin slugify function's results defer from those of slugify template filter

There's another difference between the two algorithms. In the admin, small words are stripped out by the javascript. For example, "This is a sentence with small words" returns "sentence-small-words". Whereas the template filter gives "this-is-a-sentence-with-small-words".

Ideally, this word replacement should be configurable per-language. Maybe it already is, I've never tried.

Another remark. In the admin, the javascript function is called 'URLify', so that's maybe for a good reason...
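
The small-word stripping described in this comment can be sketched in Python 3 as follows. This is not the admin's JavaScript: `urlify_like` is a hypothetical name and `STOP_WORDS` is an assumed, abbreviated word list chosen only to reproduce the example sentence.

```python
import re

# Assumption: an illustrative subset, not the admin's actual list.
STOP_WORDS = {'a', 'an', 'as', 'at', 'by', 'for', 'in', 'is',
              'of', 'on', 'the', 'this', 'to', 'with'}


def urlify_like(text):
    """Lowercase, split into ASCII words, drop small words, join with hyphens."""
    words = re.findall(r'[a-z0-9]+', text.lower())
    return '-'.join(word for word in words if word not in STOP_WORDS)


print(urlify_like('This is a sentence with small words'))  # sentence-small-words
```

Making the word list a parameter would be one way to get the per-language configuration mentioned above.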

comment:8 Changed 10 years ago by anonymous

milestone: 1.0 maybe

comment:9 Changed 10 years ago by Daniel Pope <dan@…>

Summary: Admin slugify function's results defer from those of slugify template filter → Admin slugify function's results differ from those of slugify template filter

Unfortunately, slugify isn't very well-defined outside of English. In German, for example, you would want to slugify 'Grüß' as 'gruess', but the same logic doesn't apply in other languages, where generally you can just omit accents at a pinch. We could build JSON transliteration character maps, but for i18n we would need several so that they can be selected based on locale. As an alternative, we could just do something with IRIs instead of trying to coerce Unicode to a "nice" ASCII string.

comment:10 Changed 10 years ago by Bjorn Kristinsson

Perhaps some sort of overriding mechanism could be implemented, say a dictionary in that is appended to and overrides the defaults? So by default 'ö' becomes 'o', but this behaviour could be changed by something like:

    'ö': 'oe',
    'ü': 'ue',

But julien raises a good point, the javascript function is called URLify, not slugify, so perhaps the issue is not that slugify is 'wrong', but more that we need a python equivalent of URLify?
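
The overriding idea could look roughly like this in Python 3. Everything here is hypothetical: `DEFAULT_MAP`, `GERMAN_OVERRIDES` and `transliterate` are illustrative names, not existing Django settings or functions.

```python
# Assumption: a tiny default table; a real one would be much larger.
DEFAULT_MAP = {'ä': 'a', 'ö': 'o', 'ü': 'u', 'ß': 'ss'}

# Project-level overrides, e.g. for German-style transliteration.
GERMAN_OVERRIDES = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue'}


def transliterate(text, overrides=None):
    """Replace mapped characters, letting overrides win over the defaults."""
    table = dict(DEFAULT_MAP)
    table.update(overrides or {})
    return ''.join(table.get(char, char) for char in text)


print(transliterate('Grüß'))                    # Gruss
print(transliterate('Grüß', GERMAN_OVERRIDES))  # Gruess
```

The defaults stay untouched, so a project only has to list the characters it wants handled differently.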

comment:11 Changed 10 years ago by Bjorn Kristinsson

My efforts at translating the javascript so far fall flat at the way python matches things:

# -*- coding: utf-8 -*-
import re

LATIN_MAP = {
    'ö': 'o'
}

regex = re.compile('[ö]|[^ö]+')

pieces = regex.findall('björn')

downcoded = ""
for piece in pieces:
    mapped = ""
    try:
        mapped = LATIN_MAP[piece]
    except KeyError:
        mapped = piece
    downcoded += mapped

print pieces, downcoded

# Expected: ['bj', 'ö', 'rn'] bjorn
# Result: ['bj', '\xc3', '\xb6', 'rn'] björn

I.e. LATIN_MAP['ö'] isn't looked up, but LATIN_MAP['\xc3'] and LATIN_MAP['\xb6'] are, separately. LATIN_MAP['\xc3\xb6'] would work, but how to make sure these 'stay together' is something that leaves me stumped.
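
The split reported above is what a regex does when it runs over UTF-8 bytes instead of text. A Python 3 sketch of the same experiment (Python 3 strings are Unicode by default, so the byte case has to be requested explicitly; the names are illustrative):

```python
import re

LATIN_MAP = {'ö': 'o'}
PATTERN = '[ö]|[^ö]+'


def downcode(text):
    """Map each matched piece through LATIN_MAP, keeping unmapped pieces as-is."""
    pieces = re.findall(PATTERN, text)
    return ''.join(LATIN_MAP.get(piece, piece) for piece in pieces)


# Text pattern on text: 'ö' is one character and stays in one piece.
text_pieces = re.findall(PATTERN, 'björn')

# Byte pattern on bytes: 'ö' is two bytes, and the character class matches each one alone.
byte_pieces = re.findall(PATTERN.encode('utf-8'), 'björn'.encode('utf-8'))

print(text_pieces)
print(byte_pieces)
print(downcode('björn'))
```

In Python 2 the equivalent fix is to use unicode objects (u'...') for the pattern, the map keys and the input.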

comment:12 Changed 10 years ago by Daniel Pope <dan@…>

I think the template filter should accept an argument, so slugify:"de" might add in the German-specific rules, and so on; a option would just choose whether that was done by default. But it would be difficult to ensure those tables are comprehensive enough. I note that libiconv has rudimentary support for transliteration, so perhaps we could use their data.

Slugs and URLs are the same thing, imho.

If we can adequately provide IRIs, on the other hand, the slugify operation can be well-defined, eg.

>>> slugify(u'Føø Bär Baß')

@bjornkri: You're still using UTF-8 encoded bytestrings. You must use unicode strings.
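
One possible reading of the IRI idea, as a Python 3 sketch: keep the non-ASCII letters in the slug and percent-encode only when an ASCII URI is needed. `unicode_slugify` is a hypothetical helper, and the output shown is an assumption about what such a function would return, not the result omitted from the example above.

```python
import re
from urllib.parse import quote


def unicode_slugify(value):
    """Slugify without transliteration: Unicode word characters are kept."""
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


slug = unicode_slugify('Føø Bär Baß')
print(slug)         # føø-bär-baß
print(quote(slug))  # f%C3%B8%C3%B8-b%C3%A4r-ba%C3%9F
```

No language-specific table is needed with this approach, at the cost of slugs that are not plain ASCII.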

comment:13 Changed 10 years ago by Julien Phalip

@Daniel Pope

Using IRIs seems quite promising. I guess language specific rules could be stored in the local flavors.

comment:14 Changed 10 years ago by Julien Phalip

Hmmmm... this ticket is looking less and less like a ticket, and more and more like an email discussion. Should it be brought to the dev-list? Volunteer?

comment:15 Changed 10 years ago by Julian Bez

#7980 aims to improve i18n by using CLDR data. What we need for slugify should be in there somewhere.

comment:16 Changed 10 years ago by Malcolm Tredinnick

Resolution: wontfix
Status: new → closed

Okay, this is all a bit of a non-issue. The Javascript and Python versions are not intended to give the same results. There is much more functionality available at the Python level, for a start, in the form of codecs and unicode mapping data. Secondly, we're not interested in shipping more and more data over in the Javascript file. The Javascript version works reasonably well for a bunch of cases. It doesn't work at all in other cases (e.g. Japanese text). In other cases it gives some result and some people may prefer something else. The point is that it doesn't matter. It's just an aid. If you don't like what the aid gives you, you can happily edit the field in the admin, or always do it on the Python side, or create your own Javascript function to use.

The Javascript function is not meant to be something that works perfectly for everybody because transliteration is a very ambiguous area. If it doesn't work for your purposes, don't use it.

comment:17 Changed 10 years ago by Daniel Pope <dan@…>

Component: Uncategorized → Template system
milestone: 1.0 maybe
Resolution: wontfix
Status: closed → reopened
Summary: Admin slugify function's results differ from those of slugify template filter → slugify template filter poorly encodes non-English strings

Sorry Malcolm, I may have subverted this ticket a little by talking about generalised handling of slugs, and thrown you off the scent.

The original ticket was about a broken Python slugify filter, not a broken Javascript function. It was simply an observation on bjornkri's part that the admin javascript works better. The Python filter is not "just an aid". It should produce acceptably good results, which it has not done for the string u'bøøøø'.


comment:18 Changed 9 years ago by harkal

I have created a function that downcodes a string in a way similar to what urlify does, but in Python.
This can be used in conjunction with slugify like this:

slug = slugify(downcode(u'Γειά σου κόσμε!'))

Or it can be called from within slugify if the developers agree to merge it in!

Have fun!

# -*- coding: utf-8 -*-
# (c) 2008 Harry Kalogirou <>
# * Language maps taken from django's javascript urlify

import re

ALL_DOWNCODE_MAPS = {
    # Latin
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A', u'Æ': 'AE', u'Ç':'C',
    u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I', u'Î': 'I',
    u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O', u'Õ': 'O', u'Ö':'O',
    u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U', u'Ű': 'U',
    u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a', u'ã': 'a', u'ä':'a',
    u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e', u'ë': 'e',
    u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n', u'ò': 'o', u'ó':'o',
    u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u', u'ú': 'u',
    u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y',
    # Greek
    u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z', u'η':'h', u'θ':'8',
    u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3', u'ο':'o', u'π':'p',
    u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x', u'ψ':'ps', u'ω':'w',
    u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h', u'ώ':'w', u'ς':'s',
    u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
    u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z', u'Η':'H', u'Θ':'8',
    u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3', u'Ο':'O', u'Π':'P',
    u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X', u'Ψ':'PS', u'Ω':'W',
    u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H', u'Ώ':'W', u'Ϊ':'I',
    # Turkish
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
    u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G',
    # Russian
    u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e', u'ё':'yo', u'ж':'zh',
    u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m', u'н':'n', u'о':'o',
    u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f', u'х':'h', u'ц':'c',
    u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'', u'э':'e', u'ю':'yu',
    u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E', u'Ё':'Yo', u'Ж':'Zh',
    u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M', u'Н':'N', u'О':'O',
    u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F', u'Х':'H', u'Ц':'C',
    u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'', u'Э':'E', u'Ю':'Yu',
    # Ukrainian
    u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i', u'ї':'yi', u'ґ':'g',
    # Czech
    u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s', u'ť':'t', u'ů':'u',
    u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R', u'Š':'S', u'Ť':'T',
    u'Ů':'U', u'Ž':'Z',
    # Polish
    u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o', u'ś':'s', u'ź':'z',
    u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'e', u'Ł':'L', u'Ń':'N', u'Ó':'o', u'Ś':'S',
    u'Ź':'Z', u'Ż':'Z',
    # Latvian
    u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k', u'ļ':'l', u'ņ':'n',
    u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E', u'Ģ':'G', u'Ī':'i',
    u'Ķ':'k', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'u', u'Ž':'Z',
}

def _makeRegex():
    s = u"".join(ALL_DOWNCODE_MAPS.keys())
    regex = re.compile(u"[%s]|[^%s]+" % (s,s))
    return ALL_DOWNCODE_MAPS, regex

_regex = None
def downcode(s):
    """
    This function 'downcodes' the string passed in the parameter s. This is useful
    in cases where we want the closest representation of a multilingual string in
    simple Latin characters. The most probable use is before calling slugify.
    """
    global _MAPINGS, _regex
    if not _regex:
        _MAPINGS, _regex = _makeRegex()
    downcoded = ""
    for piece in _regex.findall(s):
        if _MAPINGS.has_key(piece):
            downcoded += _MAPINGS[piece]
        else:
            downcoded += piece
    return downcoded

if __name__ == "__main__":
    string = u'Καλημέρα Joe!'
    print 'Original  :', string
    print 'Downcoded :', downcode(string)
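
On Python 3 the same character-by-character downcoding can be sketched without a regex, using `str.translate`. This is an alternative technique, not harkal's implementation; `GREEK_EXCERPT` is an assumed excerpt of the Greek entries in the table above, just enough for the sample string.

```python
import unicodedata

# Assumption: a handful of entries copied from the Greek map, for illustration only.
GREEK_EXCERPT = {'Κ': 'K', 'α': 'a', 'λ': 'l', 'η': 'h', 'μ': 'm', 'έ': 'e', 'ρ': 'r'}

# Normalise the keys to NFC so every key is a single precomposed character.
TABLE = str.maketrans({unicodedata.normalize('NFC', key): value
                       for key, value in GREEK_EXCERPT.items()})


def downcode(text):
    """Replace mapped characters; anything not in the table passes through."""
    return unicodedata.normalize('NFC', text).translate(TABLE)


print(downcode('Καλημέρα Joe!'))  # Kalhmera Joe!
```

`str.maketrans` accepts a dict from single characters to replacement strings of any length, so multi-letter targets such as 'ps' or 'zh' work as well.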

comment:19 Changed 9 years ago by Jacob

Resolution: wontfix
Status: reopened → closed

Please don't reopen tickets closed by a committer. The correct way to revisit issues is to take it up on django-dev.

comment:20 Changed 9 years ago by Malcolm Tredinnick

Resolution: wontfix
Status: closed → reopened

Jacob, I probably wontfixed in error, due to the confusion sown by Daniel. It's worth looking at this.

comment:21 Changed 9 years ago by anonymous

Triage Stage: Unreviewed → Design decision needed

comment:22 Changed 9 years ago by HM

Another datapoint: In my language both 'å' and 'ø' are valid words in and of themselves... the standard slugify reduces both of these to . Oopsy. 'æææææææ' is a popular way to describe a scream, it also becomes... .

I have my own slugify-function that turns the unicode-string into NFKD then slugifys that, then checks that the string isn't empty and if it is: adds a dummy string + the datetime + random string. This is independent of locale, which I consider a bonus.
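
A rough Python 3 sketch of that kind of fallback, assuming nothing about HM's actual code: `slugify_with_fallback` and the `prefix` parameter are hypothetical, and the timestamp format is an arbitrary choice.

```python
import datetime
import re
import secrets
import unicodedata


def slugify_with_fallback(value, prefix='item'):
    """NFKD-based ASCII slug that never comes back empty."""
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    slug = re.sub(r'[^\w\s-]', '', value).strip().lower()
    slug = re.sub(r'[-\s]+', '-', slug)
    if not slug:
        # Nothing survived: build a unique placeholder instead of an empty slug.
        stamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
        slug = '%s-%s-%s' % (prefix, stamp, secrets.token_hex(4))
    return slug


print(slugify_with_fallback('Hello World'))  # hello-world
print(slugify_with_fallback('ø'))            # placeholder: prefix, timestamp, random hex
```

The placeholder carries no meaning, but it keeps URLs valid and unique without any locale-specific data.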

comment:23 Changed 9 years ago by HM

And it seems trac can't handle those letters either =)

comment:24 Changed 8 years ago by Domen Kožar

Why not use proper Unicode transliteration package like ? Transliteration is currently the best way to go Unicode->ASCII

comment:25 Changed 8 years ago by Daniel Pope <dan@…>

That package is too big to bundle and too trivial for Django to depend strongly upon. But it's a good starting place if you want to write your own slugify filter.

comment:26 Changed 7 years ago by hejsan

Cc: hr.bjarni+django@… added

Hi, this ticket is way too old for a trivial feature (for non-English-speaking-country based programmers).

When slugifying characters in my language, some of them are properly downgraded to ascii lookalikes but some of them get lost. That's pretty irritating for such a trivial feature.

There seems to be a consensus in other web frameworks on slugifying international characters, and that is to have a map.

Did you (core programmers) take a look at harkal's suggestion above? This is the way e.g. WordPress and others go about solving this problem.
I also found this ready made function: slughify

It is small, concise and it works.

If you decide you don't want/need to fix/upgrade the slugify function, or if you think it will take a very long time before you decide, then I'd like to suggest that it be made into a setting as soon as possible:
SLUGIFY_FUNCTION = myown_slugify

with django.template.defaultfilters.slugify as the default

But optimally the function provided by django should work for all languages in my opinion.
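
A setting like this could be resolved with a dotted-path lookup. The sketch below is Python 3 and purely illustrative: SLUGIFY_FUNCTION is the setting proposed in this comment, not one Django provides, and `load_slugify` and `default_slugify` are made-up names.

```python
import importlib
import re


def default_slugify(value):
    """Stand-in for the built-in filter, used when no override is configured."""
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


def load_slugify(dotted_path=None):
    """Return the callable named by a 'package.module.function' path, or the default."""
    if not dotted_path:
        return default_slugify
    module_path, _, attr = dotted_path.rpartition('.')
    return getattr(importlib.import_module(module_path), attr)


slugify = load_slugify()  # no override configured
print(slugify('Hello World'))  # hello-world
```

A project would then pass its own path, for example a hypothetical 'myproject.utils.myown_slugify', and every caller would pick up the replacement.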


comment:27 Changed 7 years ago by Erik Allik

Cc: eallik+django@… added

comment:28 Changed 7 years ago by Mikhail Korobov

Cc: kmike84@… added

comment:29 Changed 7 years ago by Erik Allik

Cc: eallik+django@… removed

comment:30 Changed 7 years ago by Luke Plant

Severity: Normal
Type: Bug

comment:31 Changed 7 years ago by Mitar

Cc: mmitar@… added
Easy pickings: unset
UI/UX: unset

I have made a slugify2 function which first downcodes and then translates to a slug. It behaves exactly the same as its JavaScript counterpart, so it is now possible to have the same behavior in both Python and JavaScript.

comment:32 Changed 6 years ago by Preston Holmes

Triage Stage: Design decision needed → Accepted

see #16853 for a Turkish case

Seems that there have been no objections to the downcode then slugify approach.

This seems ready for someone to take a shot at implementing that approach in a patch.

comment:33 Changed 6 years ago by Mitar

You can take the above slugify2 function.

comment:34 Changed 6 years ago by yasar11732@…

Above slugify2 function won't fix #16853.

# -*- coding: utf-8 -*-
import sys
import re

from django.utils import encoding

ALL_DOWNCODE_MAPS = [
    {
        u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
        u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
    },
]


class Downcoder(object):
    map = {}
    regex = None

    def __init__(self):
        self.map = {}
        chars = u''

        for lookup in ALL_DOWNCODE_MAPS:
            for c, l in lookup.items():
                self.map[c] = l
                chars += encoding.force_unicode(c)

        self.regex = re.compile(ur'[' + chars + ']|[^' + chars + ']+', re.U)

downcoder = Downcoder()

def downcode(value):
    downcoded = u''
    pieces = downcoder.regex.findall(value)

    if pieces:
        for p in pieces:
            mapped = downcoder.map.get(p)
            if mapped:
                downcoded += mapped
            else:
                downcoded += p
    else:
        downcoded = value

    # Note: this returns the original value, not the downcoded string.
    return value

def slugify2(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = downcode(value)
    value = unicodedata.normalize('NFD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    return re.sub('[-\s]+', '-', value)

print(slugify2(u"Işık ılık süt iç"))

This prints "isk-lk-sut-ic", but the expected value is "isik-ilik-sut-ic".
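
The difference between the two outputs comes down to the dotless 'ı', which has no Unicode decomposition and is therefore dropped unless it is mapped first. A Python 3 sketch under that reading (`TURKISH_MAP`, `downcode` and `slugify_ascii` are illustrative names, not the slugify2 code):

```python
import re
import unicodedata

TURKISH_MAP = {'ş': 's', 'Ş': 'S', 'ı': 'i', 'İ': 'I', 'ç': 'c', 'Ç': 'C',
               'ü': 'u', 'Ü': 'U', 'ö': 'o', 'Ö': 'O', 'ğ': 'g', 'Ğ': 'G'}


def downcode(value):
    """Replace mapped characters one by one and return the new string."""
    return ''.join(TURKISH_MAP.get(char, char) for char in value)


def slugify_ascii(value):
    """Normalise, drop remaining non-ASCII characters, lowercase, hyphenate."""
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


text = 'Işık ılık süt iç'
print(slugify_ascii(text))            # isk-lk-sut-ic
print(slugify_ascii(downcode(text)))  # isik-ilik-sut-ic
```

Letters such as 'ş', 'ü' and 'ç' survive either way because normalisation splits them into a base letter plus a combining mark; only 'ı' needs the explicit map.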

comment:35 Changed 6 years ago by Mitar

Ups. That was a bug. Fixed version of slugify2.

comment:36 Changed 5 years ago by Aymeric Augustin

Status: reopened → new

comment:38 Changed 3 years ago by Aymeric Augustin

Resolution: wontfix
Status: new → closed

There's obviously more than one way to achieve slugification, depending on your tastes and constraints.

If we try to be smart, we'll get dozens and dozens of tickets from people who want to be smarter -- see the urlize filter for an example.

Django's implementation has the advantage of being simple and relying only on the stdlib. Pretty good solutions are available externally.

The drawbacks of implementing something more complicated outweigh the advantages at this stage.
