Opened 7 years ago

Closed 11 months ago

Last modified 10 months ago

#8391 closed Bug (wontfix)

slugify template filter poorly encodes non-English strings

Reported by: bjornkri Owned by: nobody
Component: Template system Version: master
Severity: Normal Keywords:
Cc: hr.bjarni+django@…, kmike84@…, mmitar@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Going through the admin interface with a slug field, 'bøøøø' becomes 'boooo' (as expected)

But running this code:
from django.template.defaultfilters import slugify

print slugify('bøøøø')
print slugify(u'bøøøø')

results in:
'b'
'ba-a-a-a'

Results vary depending on which characters are used; I found this trying to inject a bunch of cyrillic and greek into a database, and most of the slug fields were empty. Entering them manually through the admin interface worked fine.

Change History (39)

comment:1 Changed 7 years ago by bjornkri

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

The code snippet again, for easy copying and pasting...

from django.template.defaultfilters import slugify

print slugify('bøøøø') 
print slugify(u'bøøøø')

results in: 'b' 'ba-a-a-a'

comment:2 Changed 7 years ago by Daniel Pope <dan@…>

For the first example, the expected results aren't well defined: Unicode characters can't be reliably represented in a bytestring. I think Django does smart_unicode on the input so it works for UTF-8 byte strings but that's just Django being flexible.

For the second example, I suspect where you've typed u'bøøøø', Python has interpreted your unicode string literal as the wrong character set, equivalent to

u'bøøøø'.encode('utf8').decode('iso-8859-1')

Putting the same thing in a script with a PEP-263 header gives 'b' for the second example.
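
The mis-decoding described above can be reproduced with a minimal sketch, written in Python 3 syntax rather than the Python 2 used elsewhere in this ticket. `old_slugify` is a hypothetical stand-in that assumes the filter does an NFKD normalization, an ASCII encode with 'ignore', and two regex substitutions; it is not copied from Django.

```python
import re
import unicodedata

def old_slugify(value):
    # Assumed approximation of the template filter discussed in this ticket.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

correct = 'bøøøø'
# UTF-8 bytes wrongly interpreted as ISO-8859-1: each 'ø' becomes two characters.
misdecoded = correct.encode('utf8').decode('iso-8859-1')

print(old_slugify(correct))     # b
print(old_slugify(misdecoded))  # ba-a-a-a
```

Under that assumption, each mis-decoded 'ø' turns into 'Ã' followed by a spacing cedilla, which NFKD folds to 'A' plus a space, and that would account for the 'ba-a-a-a' in the report.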

comment:3 follow-up: Changed 7 years ago by julien

The reason why it doesn't give the same result in the admin and via the code above is that different algorithms are used.
In the admin, it is done with some javascript, and odd characters are replaced by their latin 'equivalent', in particular:

var LATIN_MAP = {
   ...
   'ø': 'o',
   ...
}

In the template filter, odd characters that are not representable in ASCII are simply stripped out, see the 'ignore' below:

value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')

Maybe the filter should replicate the javascript's algorithm, or vice versa, to make things homogeneous?
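
As an illustration of the stripping step quoted above, here is a small sketch in Python 3 syntax. The helper name `ascii_fold` is made up for this example; only `unicodedata.normalize` and the 'ignore' error handler come from the filter line shown.

```python
import unicodedata

def ascii_fold(value):
    """Decompose with NFKD, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize('NFKD', value)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# 'ö' decomposes into 'o' plus a combining diaeresis, so the 'o' survives.
print(ascii_fold('björn'))  # bjorn
# 'ø' has no decomposition, so the whole character is dropped.
print(ascii_fold('bjørn'))  # bjrn
```

This is why accented letters that decompose come out looking reasonable, while letters such as 'ø' or 'æ' simply vanish.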

comment:4 in reply to: ↑ 3 Changed 7 years ago by bjornkri

Yep, I've been digging in and found the same. I absolutely think the two should work the same way, especially since it would make my life so much easier ;)

var LATIN_MAP = {
   ...
   'ø': 'o',
   ...
}

I'm trying to translate the javascript into Python, but Python's handling of this is giving me a headache. In the javascript function, there's a regular expression that matches either a single character in the list of special characters, or a run of one or more characters not in it.

As an example, 'bjørn' becomes 'bj', 'ø' and 'rn'. It then tries to find these in the above map: LATIN_MAP['bj'] returns nothing so it's left unchanged, LATIN_MAP['ø'] returns 'o', and LATIN_MAP['rn'] nothing again. The result is 'bjorn'.

Python, on the other hand, matches 'bj', '\xc3' and '\xb8rn'. There is no match for '\xc3' in LATIN_MAP, only in the regexp. Now I just need to find a way of splitting this correctly, any ideas?

comment:5 Changed 7 years ago by bjornkri

... I really need to start using the 'Preview' function. Hope that's legible.

comment:6 Changed 7 years ago by Jökull Sólberg Auðunsson <jokullsolberg@…>

Maybe there should be a JSON file with the character map for DRY.

comment:7 Changed 7 years ago by julien

  • Summary changed from Results of slugify in the admin interface differ from the one in shell. to Admin slugify function's results defer from those of slugify template filter

There's another difference between the two algorithms. In the admin, small words are stripped out by the javascript. For example, "This is a sentence with small words" returns "sentence-small-words". Whereas the template filter gives "this-is-a-sentence-with-small-words".

Ideally, this word replacement should be configurable per-language. Maybe it already is, I've never tried.

Another remark. In the admin, the javascript function is called 'URLify', so that's maybe for a good reason...
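
The small-word difference described above could be sketched roughly as follows, in Python 3 syntax. The word list is an assumed, illustrative subset and the function name `urlify_sketch` is hypothetical; the `remove_words` parameter is only there to show what per-language configuration might look like.

```python
import re

# Assumed, illustrative subset of the small words the admin script removes.
ENGLISH_SMALL_WORDS = frozenset(['a', 'an', 'is', 'of', 'the', 'this', 'to', 'with'])

def urlify_sketch(text, remove_words=ENGLISH_SMALL_WORDS):
    # Drop small words first, then build the slug from what is left.
    words = [w for w in text.lower().split() if w not in remove_words]
    cleaned = re.sub(r'[^\w\s-]', '', ' '.join(words))
    return re.sub(r'[-\s]+', '-', cleaned).strip('-')

sentence = "This is a sentence with small words"
print(urlify_sketch(sentence))                  # sentence-small-words
print(urlify_sketch(sentence, remove_words=()))  # this-is-a-sentence-with-small-words
```

Passing an empty collection reproduces the template filter's behaviour for this sentence, and a different collection could be supplied per language.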

comment:8 Changed 7 years ago by anonymous

  • milestone set to 1.0 maybe

comment:9 Changed 7 years ago by Daniel Pope <dan@…>

  • Summary changed from Admin slugify function's results defer from those of slugify template filter to Admin slugify function's results differ from those of slugify template filter

Unfortunately, slugify isn't very well-defined outside of English. In German, for example, you would want to slugify 'Grüß' as 'gruess', but the same logic doesn't apply in other languages, where generally you can just omit accents at a pinch. We could build JSON transliteration character maps, but for i18n we would need several so that they can be selected based on locale. As an alternative, we could just do something with IRIs instead of trying to coerce Unicode to a "nice" ASCII string.

comment:10 Changed 7 years ago by bjornkri

Perhaps some sort of overriding mechanism could be implemented, say a dictionary in settings.py that is appended to and overrides the defaults? So by default 'ö' becomes 'o', but this behaviour could be changed by something like:

CHARACTER_MAP = {
    'ö': 'oe',
    'ü': 'ue',
    .....
}

But julien raises a good point, the javascript function is called URLify, not slugify, so perhaps the issue is not that slugify is 'wrong', but more that we need a python equivalent of URLify?
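
A rough sketch of the overriding idea, in Python 3 syntax: `DEFAULT_CHARACTER_MAP` and `transliterate` are hypothetical names, and `CHARACTER_MAP` stands for the proposed setting, which does not exist in Django.

```python
# Hypothetical built-in defaults (a tiny subset, for illustration only).
DEFAULT_CHARACTER_MAP = {'ö': 'o', 'ü': 'u', 'ä': 'a', 'ß': 'ss'}

# What a project might put in settings.py under the proposal above.
CHARACTER_MAP = {'ö': 'oe', 'ü': 'ue', 'ä': 'ae'}

def transliterate(text, overrides=None):
    # Start from the defaults and let project-level entries win.
    table = dict(DEFAULT_CHARACTER_MAP)
    table.update(overrides or {})
    return ''.join(table.get(char, char) for char in text)

print(transliterate('Grüß'))                 # Gruss
print(transliterate('Grüß', CHARACTER_MAP))  # Gruess
```

Characters without an entry pass through unchanged, so the result would still need to go through slugify afterwards.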

comment:11 Changed 7 years ago by bjornkri

My efforts at translating the javascript so far fall flat at the way python matches things:

import re

LATIN_MAP = {
'ö': 'o'
}

regex = re.compile('[ö]|[^ö]+')

pieces = regex.findall('björn')

downcoded = ""
for piece in pieces:
    mapped = ""
    try:
        mapped = LATIN_MAP[piece]
    except:
        mapped = piece
    downcoded += mapped

print pieces, downcoded

# Expected: ['bj', 'ö', 'rn'] bjorn
# Result: ['bj', '\xc3', '\xb6', 'rn'] björn

I.e. LATIN_MAP['ö'] isn't looked up, but LATIN_MAP['\xc3'] and LATIN_MAP['\xb6'] are, separately. LATIN_MAP['\xc3\xb6'] would work, but how to make sure these 'stay together' is something that leaves me stumped.

comment:12 Changed 7 years ago by Daniel Pope <dan@…>

I think the template filter should accept an argument, so slugify:"de" might add in the German-specific rules, and so on; a settings.py option would just choose whether that was done by default. But it would be difficult to ensure those tables are comprehensive enough. I note that libiconv has rudimentary support for transliteration, so perhaps we could use their data.

Slugs and URLs are the same thing, imho.

If we can adequately provide IRIs, on the other hand, the slugify operation can be well-defined, eg.

>>> slugify(u'Føø Bär Baß')
u'føø-bär-baß'

@bjornkri: You're still using UTF-8 encoded bytestrings. You must use unicode strings.
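
To illustrate the bytestring point, here is a small sketch in Python 3 syntax, where text strings are Unicode by default and the UTF-8 bytes have to be requested explicitly. It reuses the one-entry LATIN_MAP from comment:11; the variable names are otherwise made up.

```python
import re

LATIN_MAP = {'ö': 'o'}
pattern = '[ö]|[^ö]+'

# Unicode string: 'ö' is a single character, so it is matched as one piece.
text_pieces = re.findall(pattern, 'björn')
print(text_pieces)  # ['bj', 'ö', 'rn']
print(''.join(LATIN_MAP.get(p, p) for p in text_pieces))  # bjorn

# UTF-8 bytes: 'ö' is two bytes, and the character class matches them one by one.
byte_pieces = re.findall(pattern.encode('utf-8'), 'björn'.encode('utf-8'))
print(byte_pieces)  # [b'bj', b'\xc3', b'\xb6', b'rn']
```

The second result has the same shape as the output reported in comment:11, which suggests the pieces were being matched byte by byte.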

comment:13 Changed 7 years ago by julien

@Daniel Pope

Using IRIs seems quite promising. I guess language specific rules could be stored in the local flavors.
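
A minimal sketch of what an IRI-friendly slugify might look like, in Python 3 syntax. `unicode_slugify` is a hypothetical name, and the approach (keep Unicode word characters, skip the ASCII step) is an assumption based on the example in comment:12, not an existing Django function.

```python
import re
from urllib.parse import quote

def unicode_slugify(value):
    # \w matches Unicode letters for str patterns, so accents are kept.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

slug = unicode_slugify('Føø Bär Baß')
print(slug)         # føø-bär-baß
# The IRI can still be converted to a plain ASCII URI when needed.
print(quote(slug))  # f%C3%B8%C3%B8-b%C3%A4r-ba%C3%9F
```

No transliteration table is involved, so the operation stays well-defined regardless of language.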

comment:14 Changed 7 years ago by julien

Hmmmm... this ticket is looking less and less like a ticket, and more and more like an email discussion. Should it be brought to the dev-list? Volunteer?

comment:15 Changed 7 years ago by julianb

#7980 aims to improve i18n by using CLDR data. What we need for slugify should be in there somewhere.

comment:16 Changed 7 years ago by mtredinnick

  • Resolution set to wontfix
  • Status changed from new to closed

Okay, this is all a bit of a non-issue. The Javascript and Python versions are not intended to give the same results. There is much more functionality available at the Python level, for a start, in the form of codecs and unicode mapping data. Secondly, we're not interested in shipping more and more data over in the Javascript file. The Javascript version works reasonably well for a bunch of cases. It doesn't work at all in other cases (e.g. Japanese text). In other cases it gives some result and some people may prefer something else. The point is that it doesn't matter. It's just an aid. If you don't like what the aid gives you, you can happily edit the field in the admin, or always do it on the Python side, or create your own Javascript function to use.

The Javascript function is not meant to be something that works perfectly for everybody because transliteration is a very ambiguous area. If it doesn't work for your purposes, don't use it.

comment:17 Changed 7 years ago by Daniel Pope <dan@…>

  • Component changed from Uncategorized to Template system
  • milestone 1.0 maybe deleted
  • Resolution wontfix deleted
  • Status changed from closed to reopened
  • Summary changed from Admin slugify function's results differ from those of slugify template filter to slugify template filter poorly encodes non-English strings

Sorry Malcolm, I may have subverted this ticket a little by talking about generalised handling of slugs, and thrown you off the scent.

The original ticket was about a broken Python slugify filter, not a broken Javascript function. It was simply an observation on bjornkri's part that the admin javascript works better. The Python filter is not "just an aid". It should produce acceptably good results, which it has not done for the string u'bøøøø'.

Reopening.

comment:18 Changed 7 years ago by harkal

I have created a function that downcodes a string in a way similar to what urlify does, but in Python.
This can be used in conjunction with slugify like this:

slug = slugify(downcode(u'Γειά σου κόσμε!'))

Or it can be called from within slugify if the developers agree to merge it in!

Have fun!

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# (c) 2008 Harry Kalogirou <harkal@gmail.com>
# 
# * Language maps taken from django's javascript urlify
#

import re

LATIN_MAP = {
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A', u'Æ': 'AE', u'Ç':'C', 
    u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I', u'Î': 'I',
    u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O', u'Õ': 'O', u'Ö':'O', 
    u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U', u'Ű': 'U',
    u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a', u'ã': 'a', u'ä':'a', 
    u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e', u'ë': 'e',
    u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n', u'ò': 'o', u'ó':'o', 
    u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u', u'ú': 'u',
    u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y'
}
LATIN_SYMBOLS_MAP = {
    u'©':'(c)'
}
GREEK_MAP = {
    u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z', u'η':'h', u'θ':'8',
    u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3', u'ο':'o', u'π':'p',
    u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x', u'ψ':'ps', u'ω':'w',
    u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h', u'ώ':'w', u'ς':'s',
    u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
    u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z', u'Η':'H', u'Θ':'8',
    u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3', u'Ο':'O', u'Π':'P',
    u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X', u'Ψ':'PS', u'Ω':'W',
    u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H', u'Ώ':'W', u'Ϊ':'I',
    u'Ϋ':'Y'
}
TURKISH_MAP = {
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
    u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
}
RUSSIAN_MAP = {
    u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e', u'ё':'yo', u'ж':'zh',
    u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m', u'н':'n', u'о':'o',
    u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f', u'х':'h', u'ц':'c',
    u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'', u'э':'e', u'ю':'yu',
    u'я':'ya',
    u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E', u'Ё':'Yo', u'Ж':'Zh',
    u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M', u'Н':'N', u'О':'O',
    u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F', u'Х':'H', u'Ц':'C',
    u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'', u'Э':'E', u'Ю':'Yu',
    u'Я':'Ya'
}
UKRAINIAN_MAP = {
    u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i', u'ї':'yi', u'ґ':'g'
}
CZECH_MAP = {
    u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s', u'ť':'t', u'ů':'u',
    u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R', u'Š':'S', u'Ť':'T',
    u'Ů':'U', u'Ž':'Z'
}

POLISH_MAP = {
    u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o', u'ś':'s', u'ź':'z',
    u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'E', u'Ł':'L', u'Ń':'N', u'Ó':'O', u'Ś':'S',
    u'Ź':'Z', u'Ż':'Z'
}

LATVIAN_MAP = {
    u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k', u'ļ':'l', u'ņ':'n',
    u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E', u'Ģ':'G', u'Ī':'I',
    u'Ķ':'K', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'U', u'Ž':'Z'
}

def _makeRegex():
    ALL_DOWNCODE_MAPS = {}
    ALL_DOWNCODE_MAPS.update(LATIN_MAP)
    ALL_DOWNCODE_MAPS.update(LATIN_SYMBOLS_MAP)
    ALL_DOWNCODE_MAPS.update(GREEK_MAP)
    ALL_DOWNCODE_MAPS.update(TURKISH_MAP)
    ALL_DOWNCODE_MAPS.update(RUSSIAN_MAP)
    ALL_DOWNCODE_MAPS.update(UKRAINIAN_MAP)
    ALL_DOWNCODE_MAPS.update(CZECH_MAP)
    ALL_DOWNCODE_MAPS.update(POLISH_MAP)
    ALL_DOWNCODE_MAPS.update(LATVIAN_MAP)
    
    s = u"".join(ALL_DOWNCODE_MAPS.keys())
    regex = re.compile(u"[%s]|[^%s]+" % (s,s))
    
    return ALL_DOWNCODE_MAPS, regex

_MAPPINGS = None
_regex = None
def downcode(s):
    """
    'Downcode' the string passed in the parameter s. This is useful in cases
    where we want the closest representation of a multilingual string in
    simple latin chars. The most probable use is before calling slugify.
    """
    global _MAPPINGS, _regex

    # Build the combined map and the regex lazily, on first use.
    if not _regex:
        _MAPPINGS, _regex = _makeRegex()

    downcoded = u""
    for piece in _regex.findall(s):
        # Mapped characters are replaced; everything else is kept as-is.
        if piece in _MAPPINGS:
            downcoded += _MAPPINGS[piece]
        else:
            downcoded += piece
    return downcoded


if __name__ == "__main__":
    string = u'Καλημέρα Joe!'
    print 'Original  :', string
    print 'Downcoded :', downcode(string)


comment:19 Changed 6 years ago by jacob

  • Resolution set to wontfix
  • Status changed from reopened to closed

Please don't reopen tickets closed by a committer. The correct way to revisit issues is to take it up on django-dev.

comment:20 Changed 6 years ago by mtredinnick

  • Resolution wontfix deleted
  • Status changed from closed to reopened

Jacob, I probably wontfixed in error, due to the confusion sown by Daniel. It's worth looking at this.

comment:21 Changed 6 years ago by anonymous

  • Triage Stage changed from Unreviewed to Design decision needed

comment:22 Changed 6 years ago by HM

Another datapoint: In my language both 'å' and 'ø' are valid words in and of themselves... the standard slugify reduces both of these to . Oopsy. 'æææææææ' is a popular way to describe a scream, it also becomes... .

I have my own slugify-function that turns the unicode-string into NFKD then slugifys that, then checks that the string isn't empty and if it is: adds a dummy string + the datetime + random string. This is independent of locale, which I consider a bonus.
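
The fallback described above might look roughly like this, in Python 3 syntax. This is a guess at the shape of such a function, not HM's actual code; the name `slugify_with_fallback`, the 'item' prefix, and the timestamp format are all assumptions.

```python
import re
import secrets
import unicodedata
from datetime import datetime

def slugify_with_fallback(value, prefix='item'):
    # NFKD-fold to ASCII and slugify, as the standard filter is assumed to do.
    folded = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    slug = re.sub(r'[^\w\s-]', '', folded).strip().lower()
    slug = re.sub(r'[-\s]+', '-', slug)
    if slug:
        return slug
    # Nothing survived: build a placeholder from a dummy string,
    # the current datetime and a random suffix.
    stamp = datetime.now().strftime('%Y%m%d%H%M%S')
    return '%s-%s-%s' % (prefix, stamp, secrets.token_hex(4))

print(slugify_with_fallback('Hello World'))  # hello-world
print(slugify_with_fallback('æææææææ'))      # a generated placeholder starting with "item-"
```

The placeholder is not meaningful to readers, but it does avoid empty slugs without needing any per-language data.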

comment:23 Changed 6 years ago by HM

And it seems trac can't handle those letters either =)

comment:24 Changed 6 years ago by iElectric

Why not use a proper Unicode transliteration package like http://pypi.python.org/pypi/Unidecode/0.04.1 ? Transliteration is currently the best way to go from Unicode to ASCII.

comment:25 Changed 6 years ago by Daniel Pope <dan@…>

That package is too big to bundle and too trivial for Django to depend strongly upon. But it's a good starting place if you want to write your own slugify filter.

comment:26 Changed 5 years ago by hejsan

  • Cc hr.bjarni+django@… added

Hi, this ticket is way too old for a trivial feature (for non-English-speaking-country based programmers).

When slugifying characters in my language, some of them are properly downgraded to ascii lookalikes but some of them get lost. That's pretty irritating for such a trivial feature.

There seems to be a consensus in other web frameworks on slugifying international characters, and that is to have a map.

Did you (core programmers) take a look at harkal's suggestion above? This is the way e.g. WordPress and others go about solving this problem.
I also found this ready made function: slughify

It is small, concise and it works.

If you decide you don't want/need to fix/upgrade the slugify function, or if you think it will take a very long time before you decide, then I'd like to suggest that it be made into a setting as soon as possible:
SLUGIFY_FUNCTION = myown_slugify

with django.template.defaultfilters.slugify as the default
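
One way such a setting could be resolved is sketched below in Python 3 syntax. Everything here is hypothetical: `SLUGIFY_FUNCTION` is only the name proposed in this comment, a plain dict stands in for the settings object, and `default_slugify` is a placeholder rather than Django's filter.

```python
from importlib import import_module

def default_slugify(value):
    # Placeholder default; a real project would point at Django's filter.
    return '-'.join(value.lower().split())

def get_slugify(settings):
    """Return the callable named by the hypothetical SLUGIFY_FUNCTION setting."""
    target = settings.get('SLUGIFY_FUNCTION')
    if target is None:
        return default_slugify
    if callable(target):
        return target
    # Otherwise treat the value as a dotted path: "package.module.function".
    module_path, _, name = target.rpartition('.')
    return getattr(import_module(module_path), name)

print(get_slugify({})('Hello World'))  # hello-world
print(get_slugify({'SLUGIFY_FUNCTION': 'os.path.basename'})('/a/b'))  # b
```

Accepting either a callable or a dotted path keeps the setting usable from a plain settings file.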

But optimally the function provided by django should work for all languages in my opinion.

Thanks

comment:27 Changed 5 years ago by RaceCondition

  • Cc eallik+django@… added

comment:28 Changed 4 years ago by kmike

  • Cc kmike84@… added

comment:29 Changed 4 years ago by RaceCondition

  • Cc eallik+django@… removed

comment:30 Changed 4 years ago by lukeplant

  • Severity set to Normal
  • Type set to Bug

comment:31 Changed 4 years ago by mitar

  • Cc mmitar@… added
  • Easy pickings unset
  • UI/UX unset

I have made a slugify2 function which first downcodes and then translates to a slug. It behaves exactly the same as its JavaScript counterpart, so it is now possible to have the same behavior in both Python and JavaScript.

comment:32 Changed 4 years ago by ptone

  • Triage Stage changed from Design decision needed to Accepted

see #16853 for a Turkish case

Seems that there have been no objections to the downcode then slugify approach.

This seems ready for someone to take a shot at implementing that approach in a patch.
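
For anyone picking this up, the downcode-then-slugify approach could be sketched as follows in Python 3 syntax. This is not the slugify2 function mentioned above; `DOWNCODE_MAP` is a deliberately tiny, assumed subset of the maps posted in comment:18, and `slugify2_sketch` is a hypothetical name.

```python
import re
import unicodedata

# Assumed subset of the character maps from comment:18, for illustration.
DOWNCODE_MAP = {
    'ş': 's', 'Ş': 'S', 'ı': 'i', 'İ': 'I', 'ç': 'c', 'Ç': 'C',
    'ü': 'u', 'Ü': 'U', 'ö': 'o', 'Ö': 'O', 'ğ': 'g', 'Ğ': 'G',
    'ø': 'o', 'Ø': 'O',
}
_TABLE = str.maketrans(DOWNCODE_MAP)

def downcode(value):
    # Replace every mapped character; unmapped characters pass through.
    return value.translate(_TABLE)

def slugify2_sketch(value):
    value = downcode(value)
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

print(slugify2_sketch('Işık ılık süt iç'))  # isik-ilik-sut-ic
print(slugify2_sketch('bøøøø'))             # boooo
```

Using `str.translate` avoids the regex splitting step entirely, at the cost of only supporting single-character keys.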

comment:33 Changed 4 years ago by mitar

You can take the above slugify2 function.

comment:34 Changed 4 years ago by yasar11732@…

Above slugify2 function won't fix #16853.

# -*- coding: utf-8 -*-
import sys
import re

from django.utils import encoding

TURKISH_MAP = {
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
    u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
}

ALL_DOWNCODE_MAPS = [
    TURKISH_MAP,
]

class Downcoder(object):
    map = {}
    regex = None

    def __init__(self):
        self.map = {}
        chars = u''

        for lookup in ALL_DOWNCODE_MAPS:
            for c, l in lookup.items():
                self.map[c] = l
                chars += encoding.force_unicode(c)

        self.regex = re.compile(ur'[' + chars + ']|[^' + chars + ']+', re.U)
        
downcoder = Downcoder()

def downcode(value):
    downcoded = u''
    pieces = downcoder.regex.findall(value)

    if pieces:
        for p in pieces:
            mapped = downcoder.map.get(p)
            if mapped:
                downcoded += mapped
            else:
                downcoded += p
    else:
        downcoded = value

    return value
    
def slugify2(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = downcode(value)
    value = unicodedata.normalize('NFD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    return re.sub('[-\s]+', '-', value)

print(slugify2(u"Işık ılık süt iç"))
        

This prints "isk-lk-sut-ic", but expected value is, "isik-ilik-sut-ic".

comment:35 Changed 4 years ago by mitar

Ups. That was a bug. Fixed version of slugify2.

comment:36 Changed 2 years ago by aaugustin

  • Status changed from reopened to new

comment:38 Changed 11 months ago by aaugustin

  • Resolution set to wontfix
  • Status changed from new to closed

There's obviously more than one way to achieve slugification, depending on your tastes and constraints.

If we try to be smart, we'll get dozens and dozens of tickets from people who want to be smarter -- see the urlize filter for an example.

Django's implementation has the advantage of being simple and relying only on the stdlib. Pretty good solutions are available externally.

The drawbacks of implementing something more complicated outweigh the advantages at this stage.
