Opened 4 years ago

Last modified 3 months ago

#16501 new New feature

Add option to accept unicode characters in SlugField

Reported by: norn Owned by: pbnan
Component: Core (Other) Version: master
Severity: Normal Keywords: unicode, slug
Cc: kwadrat Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

I patch every new version of my django installation and wondering why nobody fixed this in branch tree.

The problem is extremely simple - I like to use slug in my native language (and google likes the same). To fix the problem we just need to do this in validators.py:

slug_re = re.compile(r'^[-\w]+$',flags=re.U)

instead of this

slug_re = re.compile(r'^[-\w]+$')

The solution is simple. And the same we should do with URLValidator.

Attachments (1)

slug_unicode_16501.diff (2.3 KB) - added by pbnan 3 years ago.

Download all attachments as: .zip

Change History (25)

comment:1 Changed 4 years ago by aaugustin

  • Easy pickings unset
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

Congrats, you have found the very best way to piss off the maintainers of an open-source project in your first sentence :)

You may want to read the FAQ: https://docs.djangoproject.com/en/1.3/faq/contributing/#faq-contributing-code

A bug won't be fixed if you don't submit a patch: https://docs.djangoproject.com/en/dev/internals/contributing/writing-code/submitting-patches/


Now, regarding your report, out of the 4 files called validators.py in Django, you're referring to the one in django/core. You're saying that the built-in validator django.core.validators.validate_slug should accept unicode characters.

This function is currently documented as:

A RegexValidator instance that ensures a value consists of only letters, numbers, underscores or hyphens.

We could debate if "letters" means "ASCII letters" or "Unicode characters that have some property that says they're a letter". Currently, it means "ASCII letters".

validate_slug is used in only one place in Django, as the default validator for SlugField.

So, the real problem here is that SlugField will only accept ASCII letters.


In my opinion, the whole point of a slug is to contain only ASCII characters, to make sure URLs built with slugs have no charset issues, no matter where they're copy-pasted (URL bars, email, IRC, IM, etc.)

If the SlugField provided by Django doesn't meet your expectations, you can easily build your own:

class MyCustomSlugField(CharField):
    default_validators = [my_custom_validate_slug]

I'm leaning towards WONTFIX, but I'd like someone else to confirm my analysis before closing the ticket.

Last edited 4 years ago by aaugustin (previous) (diff)

comment:2 Changed 4 years ago by anonymous

The world is not english speaking only.

Why do we have to do some magic instead of using out-of-the-box functionality?

comment:3 follow-up: Changed 4 years ago by justinlilly

Non-ASCII letters are completely valid in URL schemes. The whole point of a slug isn't to contain ASCII letters. It is merely to provide a human-readable URL entry. There are plenty of humans who read outside of ASCII. Assuming the fix can be made backwards compatible, I see no reason not to have it.

Also, aaugustin, your tone came off a little combative/hostile. I'm sure you didn't mean it that way, but that's how it came out. Just a heads up.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 4 years ago by aaugustin

  • Triage Stage changed from Unreviewed to Design decision needed

Replying to anonymous:

The world is not english speaking only.

Sure — as a native speaker of a language that uses accents heavily, I agree completely :)


Why do we have to do some magic instead of using out-of-the-box functionality?

Well, if you don't like the out-of-the-box functionality that everyone else uses, it sounds reasonable to write you own variant. There's hardly any magic in my suggestion.


Replying to justinlilly:

Non-ASCII letters are completely valid in URL schemes.

Yes, since RFC 3986, it's possible to use non-ASCII characters in URLs without ambiguities: the charset must be UTF-8.

That RFC was published is 2005. Browser vendors may not have changed it immediately (historically, most browsers defaulted to latin1 in western languages), and some people still use browsers from a few years ago. I don't know exactly what's the current status. We need to check how mainstream browsers react to an URL like "http://localhost/how-to-brew-café/" before proceeding: do they properly utf-8-encode and percent-encode it?


The whole point of a slug isn't to contain ASCII letters. It is merely to provide a human-readable URL entry. There are plenty of humans who read outside of ASCII.

I'm a bit confused by this. Django provides a SlugField that is designed to contain an URL-friendly version of a CharField. For instance, the title of a blog post could be "How to brew café" (in the CharField) and the slug would be "how-to-brew-cafe" (in the SlugField). If you assume that an URL can contain any text, you don't need SlugField at all; you can just use the original title.

A drawback is that the actual, correct URL is: http://localhost/how-to-brew-caf%C3%A9/. Some browsers may percent-decode and utf-8-decode this to display it as: http://localhost/how-to-brew-café/ — I think at least Firefox does. I don't know about IE. If it turns out we have to choose between http://localhost/how-to-brew-caf%C3%A9/ and http://localhost/how-to-brew-cafe/, I think the latter is more readable.


Also, aaugustin, your tone came off a little combative/hostile. I'm sure you didn't mean it that way, but that's how it came out. Just a heads up.

Thanks for the hint and sorry -- the tone of the original report got a bit on my nerves. Comment 2 is great too... :)

comment:5 in reply to: ↑ 4 Changed 4 years ago by norn

Replying to aaugustin:

Replying to justinlilly:

Non-ASCII letters are completely valid in URL schemes.

Yes, since RFC 3986, it's possible to use non-ASCII characters in URLs without ambiguities: the charset must be UTF-8.

That RFC was published is 2005. Browser vendors may not have changed it immediately (historically, most browsers defaulted to latin1 in western languages), and some people still use browsers from a few years ago. I don't know exactly what's the current status. We need to check how mainstream browsers react to an URL like "http://localhost/how-to-brew-café/" before proceeding: do they properly utf-8-encode and percent-encode it?


Well, why should we fear of browser incompatibility to pre-2005 standards in Django to be released in 2011/2012?
Django 1.0 to 1.3 do not allow utf8 in slug, but world is changing. If RFC3986 was approved in 2005, then django have to comply it.

I am sorry if my tone seems not friendly. English is not my native language, but I am definitely do not want to offense anybody. I am just trying to solve the problem and communicate in the most effecient way without formal shell. Peace!

comment:6 Changed 4 years ago by aaugustin

Regarding browser support, the current consensus (per a recent discussion on the mailing list) is to drop IE6 and start at IE7 — released in 2006. We have to live with the fact that many developers use Django to create websites for end users that aren't on the cutting edge of technology :)

In my opinion, the real issue here isn't the date of the RFC. It's the support in mainstream browsers (IE, Firefox, Chrome, and to a lesser extent Safari, Opera). I don't know if the RFC was actually implemented in these browsers, and when — many RFCs die unimplemented... That why I suggested that, if someone is interested in this, the first step is to prove that it works correctly in IE >= 7, Firefox >= 3, and the latest Chrome, Safari and Opera.

Then, there's still the question of whether the percent-encoded version is more readable than the ASCII version, but that's a matter of taste, and I'm not the one who makes these decisions :)

comment:7 Changed 4 years ago by jacob

  • Triage Stage changed from Design decision needed to Accepted

Marking accepted: SlugField should be useful in most of the world, not just the English-speaking parts.

(SlugField probably should take a regex argument so people could easily change the concept of what a "slug" is. But that's another ticket :)

comment:8 Changed 3 years ago by pbnan

  • Owner changed from nobody to pbnan
  • Status changed from new to assigned

Changed 3 years ago by pbnan

comment:9 Changed 3 years ago by pbnan

  • Has patch set
  • Resolution set to fixed
  • Status changed from assigned to closed
  • Type changed from Bug to New feature

comment:10 Changed 3 years ago by pbnan

  • Resolution fixed deleted
  • Status changed from closed to reopened

Closed accidentally, sorry.

comment:11 Changed 3 years ago by kwadrat

  • Cc kwadrat added
  • Triage Stage changed from Accepted to Ready for checkin

Reviewed - ready for checkin.

comment:12 Changed 3 years ago by PaulM

  • Triage Stage changed from Ready for checkin to Accepted

I'm bumping this back out of RFC. While I agree wholeheartedly that we need to allow users to choose unicode slugs if they so desire, we can't just arbitrarily change the way this works. Many developers depend on the promise that the slug is ASCII. Arbitrarily switching to allow unicode characters will produce bugs in previously stable code, and may introduce security vulnerabilities.

Django's backwards compatibility policy requires that features like this continue to work as documented. Before this patch is ready to commit, the patch must be re-worked so that the default slug behavior of reducing to ASCII remains. Whether or not the default behavior should change at some point in the future (1.5 or 1.6) is a matter that should be discussed on the django-dev list.

comment:13 Changed 2 years ago by aaugustin

  • Status changed from reopened to new

comment:14 Changed 2 years ago by FrankBie

  • Needs documentation set
  • Patch needs improvement set

I would propose to add a UnicodeSlugField instead of introducing an incompatible change, thats why I would propose a patch needs update.

comment:15 Changed 20 months ago by slikts

One of the requirements in my app is that I need to have non-English slugs for use in URLs. This should pose no problem, since modern browsers support Unicode in URLs, and Django is a framework that supports modern practices as well, right? So I try to do it, but find out that SlugField only works with ASCII by default. This is mildly annoying, but not really a problem, since you can just extend SlugField, yes? So I look up the source for SlugField and modify the validator regexp like this:

slug_re = re.compile(r'^[-\w_]+$', re.UNICODE)
validate_slug = RegexValidator(slug_re, "Enter a valid name consisting of letters, numbers, underscores or hyphens.", 'invalid')


class SlugField(models.SlugField):
    default_validators = [validate_slug]

As it turns out, it just doesn't work, Unicode alphanumerical characters are still rejected as invalid, and I have not the slightest idea why. A workaround I found was extending directly from CharField, but having to completely reimplement Django functionality just because I want to use more than ASCII is not very nice at all.

Last edited 20 months ago by slikts (previous) (diff)

comment:16 Changed 20 months ago by fcurella

I've wrote a patch adding a unicode option for SlugField. Pull Request is at https://github.com/django/django/pull/1979

Last edited 20 months ago by fcurella (previous) (diff)

comment:17 Changed 20 months ago by fcurella

I've created an alternative pul request that adds db.models.UnicodeSlugField and forms.UnicodeSlugField https://github.com/django/django/pull/1987

On one hand, I like the new field approach because it gives us an easier upgrade path. On the other hand, it feels like I'm polluting django.db.models and django.forms with slightly different versions of something that's already there.

Unfortunately, just passing a regex argument (as suggested by Jacob) is not going to be enough, because we'll also have to pass an error message string for the validator.

comment:18 Changed 8 months ago by timgraham

An alternate approach suggested by Claude on the mailing list was to "create a new validate_uslug validator (better name anyone?) and allow a custom validator to be passed to the SlugField constructor." We need someone to compare these three approaches (at least) and find a consensus on the DevelopersMailingList in order to move this issue forward.

comment:19 Changed 8 months ago by timgraham

Well, I realized that 3rd approach is more or less the same as the PR linked in comment 16.

I created a django-developers thread to get feedback on the two approaches.

comment:20 Changed 8 months ago by timgraham

  • Summary changed from validators.py don't like unicode slug to Add option to accept unicode characters in SlugField

comment:21 Changed 7 months ago by berkerpeksag

I've opened PR #3729 to revise PR #1979.

Changes:

  • Fixed merge conflicts
  • Added release notes
  • Added more documentation
  • Fixed commit message format
  • Adressed review comments

comment:22 Changed 7 months ago by berkerpeksag

  • Needs documentation unset
  • Patch needs improvement unset
  • Version changed from 1.3 to master

comment:23 Changed 7 months ago by timgraham

  • Patch needs improvement set

comment:24 Changed 3 months ago by kutenai

Submitted a new PR #4518 to replace PR #3729.

The new pull request is rebased onto the latest master, and fixes some items discussed in the previous pull request.

There is one item that needs to be reviwed.

django/contrib/admin/options.py:1744

        @property
    def media(self):
        extra = '' if settings.DEBUG else '.min'
        js = ['jquery%s.js' % extra, 'jquery.init.js', 'inlines%s.js' % extra]
        # TODO: Was removed in master, modified in patch.
        if self.prepopulated_fields:
            js.extend(['xregexp.min.js', 'urlify.js', 'prepopulate%s.js' % extra])
        if self.filter_vertical or self.filter_horizontal:
            js.extend(['SelectBox.js', 'SelectFilter2.js'])
        return forms.Media(js=[static('admin/js/%s' % url) for url in js])

I am not sure what to do with the updated code.

Note: See TracTickets for help on using tickets.
Back to Top