﻿id	summary	reporter	owner	description	type	status	component	version	severity	resolution	keywords	cc	stage	has_patch	needs_docs	needs_tests	needs_better_patch	easy	ui_ux
30892	"slugify() doesn't return a valid slug for ""İ""."	Luis Nell	Christoffer Sjöbergsson	"While working on an international project, we discovered that the turkish/azerbaijani letter `İ` can not be properly processed when `SlugField` and `slugify` are run with `allow_unicode=True`.

The project itself runs with Django 2.2.6 and Wagtail 2.6.2. I first talked about this in the Wagtail Support Channel and while researching further, discovered that this is a Django/Python related issue. This was tested on Python 3.6 and on Python 3.7.

(quick shoutout to Matt Wescott @gasmanic of Wagtail Fame for being a sparing partner in this)

There is a rather detailed analysis (README) in a git repo I created https://github.com/originell/django-wagtail-turkish-i - it was also the basis for my initial call for help in wagtail's support channel. Meanwhile I have extended it with a Django-only project, as to be a 100% sure this has nothing to do with Wagtail.

I was not able to find anything similar in trac. While I encourage whoever is reading this to actually read the README in the git repo, I want to honor your time and will try to provide a more concise version of the bug here.

===  Explanation

`models.py`

{{{#!python
from django.db import models
from django.utils.text import slugify


class Page(models.Model):
    title = models.CharField(max_length=255)
    slug = models.SlugField(allow_unicode=True)

    def __str__(self):
        return self.title
}}}

Using this in a shell/test like a (Model)Form might:

{{{#!python
from django.utils.text import slugify

page = Page(title=""Hello İstanbul"")
page.slug = slugify(page.title, allow_unicode=True)
page.full_clean()
}}}

`full_clean()` then raises

    django.core.exceptions.ValidationError: {'slug': [""Enter a valid 'slug' consisting of Unicode letters, numbers, underscores, or hyphens.""]}

Why is that?

`slugify` does the following internally:

{{{#!python
re.sub(r'[^\w\s-]', '', value).strip().lower()
}}}

Thing is, Python's `.lower()` of the `İ` in `İstanbul` looks like this:

{{{#!python
>>> [unicodedata.name(character) for character in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
}}}

So, while `slugify()` finishes, the result is then passed into `SlugField` which uses the `slug_unicode_re` (`^[-\w]+\Z`). However, the ''Combining Dot Above'' is not a valid `\w`:

{{{#!python
>>>  [(character, unicodedata.name(character), slug_unicode_re.match(character)) for character in 'İ'.lower()]
[
 ('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),
 # EUREKA!!
 ('̇', 'COMBINING DOT ABOVE', None)
]
}}}

So that's why the `ValidationError` is raised.

=== Proposed Solution

The culprit seems to be the order in which `lower()` is called in slugify. The assumption that the lowercase version of a `re.sub`-ed string is still a valid `slug_unicode_re`, does not seem to hold true.

Hence, instead of doing this in `slugify()`

{{{#!python
re.sub(r'[^\w\s-]', '', value).strip().lower()
}}}

It might be better to do it like this

{{{#!python
re.sub(r'[^\w\s-]', '', value.lower()).strip()
}}}

=== Is Python the actual culprit?

Yeah it might be. Matt (@gasmanic) urged me to also take a look if Python might be doing this wrong. 

-   The `İ` is the ''Latin Capital Letter I with Dot Above''. It's codepoint is `U+0130` According to the chart for the Latin Extended-A set (https://www.unicode.org/charts/PDF/U0100.pdf), it's lowercase version is `U+0069`.
-   `U+0069` lives in the ''C0 Controls and Basic Latin set'' (https://www.unicode.org/charts/PDF/U0000.pdf). Lo and behold: it is the ''Latin small letter I''. So a latin lowercase `i`.

Does this really mean that Python is doing something weird here by adding the ''Combining dot above''? Honestly, I can't imagine that and I am probably missing an important thing here because my view is too naive.

---

I hope this shorter explanation makes sense. If it does not, please try to read through the detailed analysis in the repo (https://github.com/originell/django-wagtail-turkish-i/blob/master/README.md). If that also does not make a ton of sense, let me know.

In any case, thank you for taking the time to read this bug report. Looking forward to feedback and thoughts. I am happy to oblige in any capacity necessary. "	Bug	closed	Utilities	dev	Normal	fixed			Accepted	1	0	0	0	0	0
