#33218 closed Bug (invalid)
slugify() can't handle Turkish İ while allow_unicode = True
| Reported by: | sowinski | Owned by: | nobody |
|---|---|---|---|
| Component: | Utilities | Version: | dev |
| Severity: | Normal | Keywords: | slugify |
| Cc: | | Triage Stage: | Unreviewed |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
Please see the following example.
The first character of test_str = "i̇zmit" is not a normal i. It is the İ from the Turkish alphabet.
Using allow_unicode=True should keep the Turkish İ instead of replacing it with a normal i.
    import unicodedata
    import re

    def slugify(value, allow_unicode=False):
        """
        Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
        dashes to single dashes. Remove characters that aren't alphanumerics,
        underscores, or hyphens. Convert to lowercase. Also strip leading and
        trailing whitespace, dashes, and underscores.
        """
        value = str(value)
        if allow_unicode:
            value = unicodedata.normalize('NFKC', value)
        else:
            value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
        value = re.sub(r'[^\w\s-]', '', value.lower())
        return re.sub(r'[-\s]+', '-', value).strip('-_')

    test_str = "i̇zmit"
    output = slugify(test_str, allow_unicode=True)
    print(test_str)
    print(output)
    print(test_str == output)
Change History (2)
comment:1, 4 years ago (follow-up: comment:2)
| Component: | CSRF → Utilities |
|---|---|
| Resolution: | → invalid |
| Status: | new → closed |
comment:2, 4 years ago
Thank you for the fast response.
I do not agree. Because of this behavior, it would be impossible to create an article for İstanbul while allow_unicode=True.
https://tr.wikipedia.org/wiki/%C4%B0stanbul
Maybe someone else has an international website and will hit this problem.
I solved the problem by adding the i̇ to the character class in the regular expression:
value = re.sub(r'[^\w\si̇-]', '', value.lower())
I tested the implementation with all cities in the world, including all the different language variants of each city name, and it worked for me.
http://www.geonames.org/
It is interesting to see that this is the only edge case. I am not sure this will work in all situations, so I run my modification only if the strange i is in the string. Otherwise I fall back to the Django implementation.
See: https://github.com/wagtail/wagtail/issues/7637#issuecomment-949366560
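The workaround described in this comment could look roughly like the sketch below. It is not Django's code: `slugify_stock` is a local copy of the function from the ticket description with `allow_unicode=True` hard-wired, and `slugify_keep_dot`, `COMBINING_DOT_ABOVE`, and `_KEEP_DOT` are hypothetical names introduced here. It assumes the "strange i" is a plain i followed by U+0307, so only that combining mark is added to the allowed characters.

```python
import re
import unicodedata

# Assumption: the "strange i" is "i" followed by this combining mark.
COMBINING_DOT_ABOVE = "\u0307"

# Same character class as in the ticket, plus U+0307 as an allowed character.
_KEEP_DOT = re.compile(r"[^\w\s" + COMBINING_DOT_ABOVE + r"-]")


def slugify_stock(value):
    """Local copy of the ticket's slugify() with allow_unicode=True."""
    value = unicodedata.normalize("NFKC", str(value))
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")


def slugify_keep_dot(value):
    """Hypothetical variant: keep U+0307 only when the input produces it."""
    lowered = unicodedata.normalize("NFKC", str(value)).lower()
    if COMBINING_DOT_ABOVE not in lowered:
        # No dotted i involved: defer to the unmodified behavior.
        return slugify_stock(value)
    cleaned = _KEEP_DOT.sub("", lowered)
    return re.sub(r"[-\s]+", "-", cleaned).strip("-_")


print(slugify_stock("i\u0307zmit") == "izmit")             # True
print(slugify_keep_dot("i\u0307zmit") == "i\u0307zmit")    # True
print(slugify_keep_dot("Hello  World"))                    # hello-world
```

Note that this variant keeps every U+0307 in the input, not only one that follows an i, so it is broader than the name suggests.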
It's not about 'İ' but about '̇' (U+0307, COMBINING DOT ABOVE), which is the second character. IMO, slugify() properly removes '̇'. See also related ticket #30892 about "İ".
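The point about the second character can be checked by listing the code points involved. The sketch below uses only the standard library; `describe` is a hypothetical helper name, and the string is written with explicit escapes so that the combining mark is visible in the source.

```python
import re
import unicodedata


def describe(text):
    """Return 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


test_str = "i\u0307zmit"  # the string from the ticket, with escapes

# The visible "first character" is really two code points.
print(describe(test_str[:2]))
# ['U+0069 LATIN SMALL LETTER I', 'U+0307 COMBINING DOT ABOVE']

# Lowercasing the capital İ (U+0130) yields the same two code points.
print(describe("\u0130".lower()))
# ['U+0069 LATIN SMALL LETTER I', 'U+0307 COMBINING DOT ABOVE']

# U+0307 is a nonspacing mark, so \w does not match it and the
# [^\w\s-] substitution in slugify() strips it.
print(unicodedata.category("\u0307"))    # Mn
print(re.fullmatch(r"\w", "\u0307"))     # None
```

This is why the output of the reproduction compares unequal to the input: the i survives, and only the combining dot after it is dropped.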