Opened 14 years ago

Closed 13 years ago

#16656 closed New feature (wontfix)

Make urlize TLDs configurable

Reported by: Ralph Broenink
Owned by: anonymous
Component: Template system
Version: 1.3
Severity: Normal
Keywords:
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

The urlize filter currently only makes domain-only links ending with .com, .net or .org clickable. I believe this is done to avoid cases where a non-domain would be made clickable. (See also #9189.)

Although I understand this philosophy, those TLDs are not that common in many non-English-speaking countries. For example, within the Netherlands all 'words' ending with '.nl' are considered domain names, and most companies will not even have a .com, .net or .org domain name. In Germany, .de is very common, and in the United Kingdom .co.uk is hardly ever seen outside a domain name context.

To allow websites in those countries to fully utilize the urlize filter, I suggest making the list of TLDs used by urlize configurable, with the default being ('com','net','org').
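As a rough illustration of the request, here is a minimal standalone sketch of a linkifier whose TLD list is a parameter. It is not Django's urlize implementation: the names `DEFAULT_TLDS`, `build_domain_pattern` and `linkify_bare_domains` are hypothetical, and HTML escaping of the surrounding text is deliberately left out.

```python
import re

# Default suggested in the ticket description; callers may pass their own tuple.
DEFAULT_TLDS = ("com", "net", "org")


def build_domain_pattern(tlds):
    """Compile a regex for bare domains ending in one of the given TLDs."""
    alternatives = "|".join(re.escape(tld) for tld in tlds)
    return re.compile(
        # Not preceded by a word character, '.', '@', '/' or '-', so that
        # e-mail addresses and http:// URLs are left alone.
        r"(?<![\w.@/-])"
        # One or more dot-terminated labels, then an allowed TLD.
        r"(?:[a-z0-9-]+\.)+(?:" + alternatives + r")"
        # The TLD must end the domain (a trailing sentence period is fine).
        r"(?![\w-]|\.\w)",
        re.IGNORECASE,
    )


def linkify_bare_domains(text, tlds=DEFAULT_TLDS):
    """Wrap bare domains in <a> tags; the input is assumed to be safe HTML."""
    pattern = build_domain_pattern(tlds)

    def replace(match):
        domain = match.group(0)
        return '<a href="http://%s">%s</a>' % (domain, domain)

    return pattern.sub(replace, text)
```

With the default tuple, `example.nl` is left as plain text; passing `tlds=("nl", "de", "co.uk")` would make a Dutch, German or British site link its local domains instead.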

Attachments (1)

urlize_tlds.diff (3.5 KB ) - added by Matt Stevenson <mattoc@…> 14 years ago.
Patch to add new urlize functionality


Change History (15)

comment:1 by Paul McMillan, 14 years ago

Triage Stage: Unreviewed → Accepted

I agree this should be addressed. It might even be worth adding appropriate domains to the localization.

comment:2 by Matt Stevenson <mattoc@…>, 14 years ago

Owner: changed from nobody to anonymous
Status: new → assigned

by Matt Stevenson <mattoc@…>, 14 years ago

Attachment: urlize_tlds.diff added

Patch to add new urlize functionality

comment:3 by Matt Stevenson <mattoc@…>, 14 years ago

Has patch: set

comment:4 by Simon Meers, 14 years ago

Needs documentation: set
Needs tests: set
Patch needs improvement: set

Won't work in template mode yet due to the single argument limitation? Template rendering tests would be a good idea there.

It's probably worth providing a more exhaustive list of TLDs by default, but it probably makes sense to make this overridable.

As reluctant as I am to suggest it, adding a URLIZE_TLDS to global_settings might not be a bad idea, though it would also need docs. I don't think this is necessarily a good fit for localization; I don't see why we in Australia would want {{ www.example.nl|urlize }} not to work just because it's a site from a different country.

comment:5 by Simon Meers, 14 years ago

Actually, maintaining a list of TLDs will probably be too painful; maybe just a smarter regex...

comment:6 by anonymous, 14 years ago

I agree that a complete list of TLDs would not be maintainable. Localizing the list of TLDs would still be problematic. For example, particular systems might commonly use .eu domains in a US locale (e.g. a registry).

A regex would have to accept TLDs ranging from 2 characters (.nl) to 6 characters (.museum). The problem with this is that it might match too many strings. Imagine a children's website that automatically matches .xxx domains. Moreover, ICANN has approved customized TLDs, allowing a wide range of new TLDs to be added. This makes it difficult, if not impossible, to provide a regex that matches all TLDs (imagine a .hewlettpackard), and providing a complete list of TLDs is entirely out of the question.
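To make the trade-off concrete, here is a small sketch of a purely length-based rule. The pattern and the helper name `matches_length_rule` are assumptions for illustration only; nothing like this is proposed as a patch.

```python
import re

# Hypothetical rule: any hostname-like word whose last label is 2 to 6 letters.
LENGTH_BASED_TLD = re.compile(
    r"[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,6}", re.IGNORECASE
)


def matches_length_rule(word):
    """Return True if the whole word satisfies the 2-to-6-letter suffix rule."""
    return LENGTH_BASED_TLD.fullmatch(word) is not None


# Intended matches.
print(matches_length_rule("example.nl"))
print(matches_length_rule("example.museum"))
# Matched, although a site might not want it linked.
print(matches_length_rule("example.xxx"))
# Not matched, because the suffix is longer than six letters.
print(matches_length_rule("example.hewlettpackard"))
```

The rule is simultaneously too broad (it cannot exclude a specific TLD) and too narrow (longer custom TLDs fall outside the length range), which is the argument for letting site owners supply their own list.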

Although Django may make a start by providing some list of TLDs (e.g. some generic TLDs), site owners have to be able to specify a list of TLDs they'd like to be matched.

comment:7 by Paul McMillan, 14 years ago

I guess the question is how we want to balance the problem of urlizing things incorrectly with the problem of not linking something that should be linked. Django has erred on the side of not urlizing things, but I think we could be a bit more aggressive.

I'm not aware of a language that contains many words that end in a dot followed by 2 letters. This pattern would cover all the country codes (including the places that use .co.uk style third-levels). These are the most actively in flux and would be the hardest to properly maintain. If we hardcode the list of the other TLDs (I count 21 on Wikipedia), that makes our job much more reasonable. Alternatively, we could limit ourselves to the common 3-letter ones + the country code regex, and make everyone else use the http:// syntax.

I believe that ICANN's approval of custom TLDs will not have a great deal of bearing on this issue for some years yet (so we should fix the common issue now, and worry about those if they become a problem). First, custom TLDs will be EXPENSIVE. Second, the process for doing it hasn't had its details worked out, and once people are able to apply, it will still take a while to go through.

I think we should add a global setting so that people can add or remove things if they want to (some people are likely to want to urlize .onion addresses, for instance). We should assume that word-like-things ending in "dot letter letter" are ccTLDs. We should have a list of common 3-letter TLDs built in.

I'm in favor of making that list com, net, org, edu, gov, mil, info, biz. If we conservatively expand to TLDs that have been around for a while, and aggressively include 2-letter ccTLDs, I think we strike a reasonable balance. We're still not actively excluding anybody (since they can still use the http:// or www. syntax), but we're minimizing the potential future issues as new more exotic TLDs become available.
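A minimal sketch of the rule proposed in this comment, assuming a fixed set of generic TLDs plus a "dot letter letter" test for country codes. The names `GENERIC_TLDS`, `HOSTNAME` and `looks_like_domain` are hypothetical and this is not the code that was later committed.

```python
import re

# Built-in generic TLDs suggested in this comment.
GENERIC_TLDS = frozenset("com net org edu gov mil info biz".split())

# At least two dot-separated labels made of letters, digits and hyphens.
HOSTNAME = re.compile(r"[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE)


def looks_like_domain(word, generic_tlds=GENERIC_TLDS):
    """Apply the proposed rule to a single whitespace-delimited word."""
    if HOSTNAME.fullmatch(word) is None:
        return False
    suffix = word.rsplit(".", 1)[1].lower()
    # Either a known generic TLD, or any two-letter suffix (assumed ccTLD).
    return suffix in generic_tlds or re.fullmatch(r"[a-z]{2}", suffix) is not None
```

Under this rule `example.info`, `example.nl` and `bbc.co.uk` all count as domains, while `example.museum` would still need the http:// or www. prefix.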

comment:8 by Aymeric Augustin, 14 years ago

I'm going to implement PaulM's proposal, which is a reasonable compromise IMO.

comment:9 by Aymeric Augustin, 14 years ago

Resolution: fixed
Status: assigned → closed

In [17359]:

Fixed #16656 -- Changed the urlize filter to accept more top-level domains.

comment:10 by Aymeric Augustin, 14 years ago

Resolution: fixed
Status: closed → reopened

Quoting comment:7: "I'm not aware of a language that contains many words that end in a dot followed by 2 letters."

In fact, this isn't true: foo.py and bar.rb match this pattern, but they certainly aren't URLs.

comment:11 by Aymeric Augustin, 14 years ago

Resolution: wontfix
Status: reopened → closed

It isn't possible to disambiguate some country codes from common file extensions, including .ai (Anguilla), .mo (Macao), .pl (Poland), .ps (Palestine), .py (Paraguay), .sh (Saint Helena), .so (Somalia), and possibly others. Suffix-based detection is insufficient for ccTLDs. I have rolled back this part of the commit.
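The collision can be shown with a short sketch. The country codes are the ones listed in this comment; the file-type descriptions, the example filenames, and the helper name `is_ambiguous` are added here for illustration.

```python
import re

# Suffixes that are both a ccTLD and a common file extension.
AMBIGUOUS_SUFFIXES = {
    "ai": ("Anguilla", "Adobe Illustrator file"),
    "mo": ("Macao", "compiled gettext catalog"),
    "pl": ("Poland", "Perl script"),
    "ps": ("Palestine", "PostScript file"),
    "py": ("Paraguay", "Python module"),
    "sh": ("Saint Helena", "shell script"),
    "so": ("Somalia", "shared library"),
}

# The "dot letter letter" shape that a ccTLD rule would accept.
TWO_LETTER_SUFFIX = re.compile(
    r"[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2}", re.IGNORECASE
)


def is_ambiguous(word):
    """True if the word fits the ccTLD shape but could equally be a filename."""
    if TWO_LETTER_SUFFIX.fullmatch(word) is None:
        return False
    return word.rsplit(".", 1)[1].lower() in AMBIGUOUS_SUFFIXES


for name in ["manage.py", "deploy.sh", "messages.mo", "example.nl"]:
    print(name, is_ambiguous(name))
```

Nothing in the text of `manage.py` distinguishes a Python file from a Paraguayan domain, so a suffix-only rule has to guess.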

The proper solution to highlight a URL is to use the http(s):// or www prefixes.

I'm not in favor of further complicating this tag, and I'm against adding yet another setting.


comment:12 by Ralph Broenink, 14 years ago

Resolution: wontfix
Status: closed → reopened

I don't really see why we wouldn't add a setting for the default TLDs, since there's absolutely no reason why .com, .net and .org are included, while .info isn't, for example. I know that .info is not one of the original gTLDs, but for any 'mortal' user, this is just random. He would see that his .com's are automatically urlized, but his .info's aren't.

Allowing this to be customized would furthermore help internationalization of this filter and would 'explain' why some TLDs were included and some weren't (or at least leave the explaining to the developer). You (more specifically, a developer using this setting) could also choose to not include any TLD, since that would be even more consistent.

comment:13 by Aymeric Augustin, 14 years ago

Has patch: unset
Needs documentation: unset
Needs tests: unset
Patch needs improvement: unset

I expanded the list of TLDs from the arbitrary choice of com|net|org to the slightly less arbitrary choice of the seven original gTLDs com|edu|gov|int|mil|net|org. I had to draw the line somewhere and I don't want the debate on this list to turn into bikeshedding.

That said, I just read through the comments above again, and I agree that it could make sense to add a setting...

I'm going to leave this ticket open. Could you provide a patch with tests and docs?

comment:14 by Aymeric Augustin, 13 years ago

Resolution: wontfix
Status: reopened → closed

Closing as wontfix again: I'm still ambivalent about adding a setting, and this is a rather minor problem. It's also trivial to monkey-patch urlize or to write your own version if you need to.

Please open a pull request with tests and docs and write to django-developers if you want to change something again.
