Opened 13 years ago

Closed 9 years ago

Last modified 8 years ago

#16501 closed New feature (fixed)

Add option to accept unicode characters in SlugField

Reported by: norn Owned by: Piotr Banaszkiewicz
Component: Core (Other) Version: dev
Severity: Normal Keywords: unicode, slug
Cc: kwadrat Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

I patch every new version of my django installation and wondering why nobody fixed this in branch tree.

The problem is extremely simple - I like to use slug in my native language (and google likes the same). To fix the problem we just need to do this in validators.py:

slug_re = re.compile(r'^[-\w]+$',flags=re.U)

instead of this

slug_re = re.compile(r'^[-\w]+$')

The solution is simple. And the same we should do with URLValidator.

Attachments (1)

slug_unicode_16501.diff (2.3 KB ) - added by Piotr Banaszkiewicz 13 years ago.

Download all attachments as: .zip

Change History (29)

comment:1 by Aymeric Augustin, 13 years ago

Easy pickings: unset

Congrats, you have found the very best way to piss off the maintainers of an open-source project in your first sentence :)

You may want to read the FAQ: https://docs.djangoproject.com/en/1.3/faq/contributing/#faq-contributing-code

A bug won't be fixed if you don't submit a patch: https://docs.djangoproject.com/en/dev/internals/contributing/writing-code/submitting-patches/


Now, regarding your report, out of the 4 files called validators.py in Django, I suppose you're referring to the one in django/core. You're saying that the built-in validator django.core.validators.validate_slug should accept unicode characters.

This function is currently documented as:

A RegexValidator instance that ensures a value consists of only letters, numbers, underscores or hyphens.

We could debate if "letters" means "ASCII letters" or "Unicode characters that have some property that says they're a letter". Currently, it means "ASCII letters".

validate_slug is used in only one place in Django, as the default validator for SlugField.

So, the real problem here is that SlugField will only accept ASCII letters.


In my opinion, the whole point of a slug is to contain only ASCII characters, to make sure URLs built with slugs have no charset issues, no matter where they're copy-pasted (URL bars, email, IRC, IM, etc.)

If the SlugField provided by Django doesn't meet your expectations, you can easily build your own:

class MyCustomSlugField(CharField):
    default_validators = [my_custom_validate_slug]

I'm leaning towards WONTFIX, but I'd like someone else to confirm my analysis before closing the ticket.

Version 0, edited 13 years ago by Aymeric Augustin (next)

comment:2 by anonymous, 13 years ago

The world is not english speaking only.

Why do we have to do some magic instead of using out-of-the-box functionality?

comment:3 by Justin Lilly, 13 years ago

Non-ASCII letters are completely valid in URL schemes. The whole point of a slug isn't to contain ASCII letters. It is merely to provide a human-readable URL entry. There are plenty of humans who read outside of ASCII. Assuming the fix can be made backwards compatible, I see no reason not to have it.

Also, aaugustin, your tone came off a little combative/hostile. I'm sure you didn't mean it that way, but that's how it came out. Just a heads up.

in reply to:  3 ; comment:4 by Aymeric Augustin, 13 years ago

Triage Stage: UnreviewedDesign decision needed

Replying to anonymous:

The world is not english speaking only.

Sure — as a native speaker of a language that uses accents heavily, I agree completely :)


Why do we have to do some magic instead of using out-of-the-box functionality?

Well, if you don't like the out-of-the-box functionality that everyone else uses, it sounds reasonable to write you own variant. There's hardly any magic in my suggestion.


Replying to justinlilly:

Non-ASCII letters are completely valid in URL schemes.

Yes, since RFC 3986, it's possible to use non-ASCII characters in URLs without ambiguities: the charset must be UTF-8.

That RFC was published is 2005. Browser vendors may not have changed it immediately (historically, most browsers defaulted to latin1 in western languages), and some people still use browsers from a few years ago. I don't know exactly what's the current status. We need to check how mainstream browsers react to an URL like "http://localhost/how-to-brew-café/" before proceeding: do they properly utf-8-encode and percent-encode it?


The whole point of a slug isn't to contain ASCII letters. It is merely to provide a human-readable URL entry. There are plenty of humans who read outside of ASCII.

I'm a bit confused by this. Django provides a SlugField that is designed to contain an URL-friendly version of a CharField. For instance, the title of a blog post could be "How to brew café" (in the CharField) and the slug would be "how-to-brew-cafe" (in the SlugField). If you assume that an URL can contain any text, you don't need SlugField at all; you can just use the original title.

A drawback is that the actual, correct URL is: http://localhost/how-to-brew-caf%C3%A9/. Some browsers may percent-decode and utf-8-decode this to display it as: http://localhost/how-to-brew-café/ — I think at least Firefox does. I don't know about IE. If it turns out we have to choose between http://localhost/how-to-brew-caf%C3%A9/ and http://localhost/how-to-brew-cafe/, I think the latter is more readable.


Also, aaugustin, your tone came off a little combative/hostile. I'm sure you didn't mean it that way, but that's how it came out. Just a heads up.

Thanks for the hint and sorry -- the tone of the original report got a bit on my nerves. Comment 2 is great too... :)

in reply to:  4 comment:5 by norn, 13 years ago

Replying to aaugustin:

Replying to justinlilly:

Non-ASCII letters are completely valid in URL schemes.

Yes, since RFC 3986, it's possible to use non-ASCII characters in URLs without ambiguities: the charset must be UTF-8.

That RFC was published is 2005. Browser vendors may not have changed it immediately (historically, most browsers defaulted to latin1 in western languages), and some people still use browsers from a few years ago. I don't know exactly what's the current status. We need to check how mainstream browsers react to an URL like "http://localhost/how-to-brew-café/" before proceeding: do they properly utf-8-encode and percent-encode it?


Well, why should we fear of browser incompatibility to pre-2005 standards in Django to be released in 2011/2012?
Django 1.0 to 1.3 do not allow utf8 in slug, but world is changing. If RFC3986 was approved in 2005, then django have to comply it.

I am sorry if my tone seems not friendly. English is not my native language, but I am definitely do not want to offense anybody. I am just trying to solve the problem and communicate in the most effecient way without formal shell. Peace!

comment:6 by Aymeric Augustin, 13 years ago

Regarding browser support, the current consensus (per a recent discussion on the mailing list) is to drop IE6 and start at IE7 — released in 2006. We have to live with the fact that many developers use Django to create websites for end users that aren't on the cutting edge of technology :)

In my opinion, the real issue here isn't the date of the RFC. It's the support in mainstream browsers (IE, Firefox, Chrome, and to a lesser extent Safari, Opera). I don't know if the RFC was actually implemented in these browsers, and when — many RFCs die unimplemented... That why I suggested that, if someone is interested in this, the first step is to prove that it works correctly in IE >= 7, Firefox >= 3, and the latest Chrome, Safari and Opera.

Then, there's still the question of whether the percent-encoded version is more readable than the ASCII version, but that's a matter of taste, and I'm not the one who makes these decisions :)

comment:7 by Jacob, 13 years ago

Triage Stage: Design decision neededAccepted

Marking accepted: SlugField should be useful in most of the world, not just the English-speaking parts.

(SlugField probably should take a regex argument so people could easily change the concept of what a "slug" is. But that's another ticket :)

comment:8 by Piotr Banaszkiewicz, 13 years ago

Owner: changed from nobody to Piotr Banaszkiewicz
Status: newassigned

by Piotr Banaszkiewicz, 13 years ago

Attachment: slug_unicode_16501.diff added

comment:9 by Piotr Banaszkiewicz, 13 years ago

Has patch: set
Resolution: fixed
Status: assignedclosed
Type: BugNew feature

comment:10 by Piotr Banaszkiewicz, 13 years ago

Resolution: fixed
Status: closedreopened

Closed accidentally, sorry.

comment:11 by kwadrat, 13 years ago

Cc: kwadrat added
Triage Stage: AcceptedReady for checkin

Reviewed - ready for checkin.

comment:12 by Paul McMillan, 13 years ago

Triage Stage: Ready for checkinAccepted

I'm bumping this back out of RFC. While I agree wholeheartedly that we need to allow users to choose unicode slugs if they so desire, we can't just arbitrarily change the way this works. Many developers depend on the promise that the slug is ASCII. Arbitrarily switching to allow unicode characters will produce bugs in previously stable code, and may introduce security vulnerabilities.

Django's backwards compatibility policy requires that features like this continue to work as documented. Before this patch is ready to commit, the patch must be re-worked so that the default slug behavior of reducing to ASCII remains. Whether or not the default behavior should change at some point in the future (1.5 or 1.6) is a matter that should be discussed on the django-dev list.

comment:13 by Aymeric Augustin, 12 years ago

Status: reopenednew

comment:14 by FrankBie, 12 years ago

Needs documentation: set
Patch needs improvement: set

I would propose to add a UnicodeSlugField instead of introducing an incompatible change, thats why I would propose a patch needs update.

comment:15 by slikts, 11 years ago

One of the requirements in my app is that I need to have non-English slugs for use in URLs. This should pose no problem, since modern browsers support Unicode in URLs, and Django is a framework that supports modern practices as well, right? So I try to do it, but find out that SlugField only works with ASCII by default. This is mildly annoying, but not really a problem, since you can just extend SlugField, yes? So I look up the source for SlugField and modify the validator regexp like this:

slug_re = re.compile(r'^[-\w_]+$', re.UNICODE)
validate_slug = RegexValidator(slug_re, "Enter a valid name consisting of letters, numbers, underscores or hyphens.", 'invalid')


class SlugField(models.SlugField):
    default_validators = [validate_slug]

As it turns out, it just doesn't work, Unicode alphanumerical characters are still rejected as invalid, and I have not the slightest idea why. A workaround I found was extending directly from CharField, but having to completely reimplement Django functionality just because I want to use more than ASCII is not very nice at all.

Last edited 11 years ago by slikts (previous) (diff)

comment:16 by Flavio Curella, 11 years ago

I've wrote a patch adding a unicode option for SlugField. Pull Request is at https://github.com/django/django/pull/1979

Last edited 11 years ago by Flavio Curella (previous) (diff)

comment:17 by Flavio Curella, 11 years ago

I've created an alternative pul request that adds db.models.UnicodeSlugField and forms.UnicodeSlugField https://github.com/django/django/pull/1987

On one hand, I like the new field approach because it gives us an easier upgrade path. On the other hand, it feels like I'm polluting django.db.models and django.forms with slightly different versions of something that's already there.

Unfortunately, just passing a regex argument (as suggested by Jacob) is not going to be enough, because we'll also have to pass an error message string for the validator.

comment:18 by Tim Graham, 10 years ago

An alternate approach suggested by Claude on the mailing list was to "create a new validate_uslug validator (better name anyone?) and allow a custom validator to be passed to the SlugField constructor." We need someone to compare these three approaches (at least) and find a consensus on the DevelopersMailingList in order to move this issue forward.

comment:19 by Tim Graham, 10 years ago

Well, I realized that 3rd approach is more or less the same as the PR linked in comment 16.

I created a django-developers thread to get feedback on the two approaches.

comment:20 by Tim Graham, 10 years ago

Summary: validators.py don't like unicode slugAdd option to accept unicode characters in SlugField

comment:21 by Berker Peksag, 10 years ago

I've opened PR #3729 to revise PR #1979.

Changes:

  • Fixed merge conflicts
  • Added release notes
  • Added more documentation
  • Fixed commit message format
  • Adressed review comments

comment:22 by Berker Peksag, 10 years ago

Needs documentation: unset
Patch needs improvement: unset
Version: 1.3master

comment:23 by Tim Graham, 10 years ago

Patch needs improvement: set

comment:24 by Ed Henderson, 10 years ago

Submitted a new PR #4518 to replace PR #3729.

The new pull request is rebased onto the latest master, and fixes some items discussed in the previous pull request.

There is one item that needs to be reviwed.

django/contrib/admin/options.py:1744

        @property
    def media(self):
        extra = '' if settings.DEBUG else '.min'
        js = ['jquery%s.js' % extra, 'jquery.init.js', 'inlines%s.js' % extra]
        # TODO: Was removed in master, modified in patch.
        if self.prepopulated_fields:
            js.extend(['xregexp.min.js', 'urlify.js', 'prepopulate%s.js' % extra])
        if self.filter_vertical or self.filter_horizontal:
            js.extend(['SelectBox.js', 'SelectFilter2.js'])
        return forms.Media(js=[static('admin/js/%s' % url) for url in js])

I am not sure what to do with the updated code.

comment:25 by Tim Graham <timograham@…>, 9 years ago

Resolution: fixed
Status: newclosed

In f8cc4644:

Fixed #16501 -- Added an allow_unicode parameter to SlugField.

Thanks Flavio Curella and Berker Peksag for the initial patch.

comment:26 by Tim Graham <timograham@…>, 8 years ago

In 054e7442:

Refs #16501, #26474 -- Added xregexp.js source file.

comment:27 by Tim Graham <timograham@…>, 8 years ago

In ef93af91:

[1.10.x] Refs #16501, #26474 -- Added xregexp.js source file.

Backport of 054e74420b7a31bac67d4993b462eea7b9b7a5ba from master

comment:28 by Tim Graham <timograham@…>, 8 years ago

In 307de7d9:

[1.9.x] Refs #16501, #26474 -- Added xregexp.js source file.

Backport of 054e74420b7a31bac67d4993b462eea7b9b7a5ba from master

Note: See TracTickets for help on using tickets.
Back to Top