Opened 3 years ago

Closed 13 months ago

#16656 closed New feature (wontfix)

Make urlize TLDs configurable

Reported by: ralphje
Owned by: anonymous
Component: Template system
Version: 1.3
Severity: Normal
Keywords:
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

The urlize filter currently only makes domain-only links ending with .com, .net or .org clickable. I believe this is done to avoid cases where a non-domain would be made clickable. (See also #9189.)

Although I understand this philosophy, those TLDs are not that common in many non-English-speaking countries. For example, all 'words' ending with '.nl' are considered domain names within The Netherlands, and most companies will not even have a .com, .net or .org domain name. In Germany, .de is very common, and .co.uk can hardly be seen outside a domain name context in the United Kingdom.

To allow websites in those countries to fully utilize the urlize filter, I suggest making the list of TLDs used by urlize configurable, with the default being ('com', 'net', 'org').

Attachments (1)

urlize_tlds.diff (3.5 KB) - added by Matt Stevenson <mattoc@…> 3 years ago.
Patch to add new urlize functionality

Change History (15)

comment:1 Changed 3 years ago by PaulM

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Accepted

I agree this should be addressed. It might even be worth adding appropriate domains to the localization.

comment:2 Changed 3 years ago by Matt Stevenson <mattoc@…>

  • Owner changed from nobody to anonymous
  • Status changed from new to assigned

Changed 3 years ago by Matt Stevenson <mattoc@…>

Patch to add new urlize functionality

comment:3 Changed 3 years ago by Matt Stevenson <mattoc@…>

  • Has patch set

comment:4 Changed 3 years ago by DrMeers

  • Needs documentation set
  • Needs tests set
  • Patch needs improvement set

Won't work in template mode yet due to the single argument limitation? Template rendering tests would be a good idea there.

It's probably worth providing a more exhaustive list of TLDs by default, but it probably makes sense to make this overridable.

As reluctant as I am to suggest it, adding a URLIZE_TLDS to global_settings might not be a bad idea, though it would also need docs. I don't think this is necessarily a good fit for localization; I don't see why we in Australia would want {{ www.example.nl|urlize }} not to work just because it's a site from a different country.
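
A minimal sketch of how such a setting might be consumed. The URLIZE_TLDS name comes from the comment above, but the helper build_domain_re, the default tuple name, and the use of a plain namespace object in place of Django's settings are all assumptions for illustration; none of this exists in Django.

```python
import re
from types import SimpleNamespace

# Default proposed in the ticket description.
DEFAULT_URLIZE_TLDS = ("com", "net", "org")


def build_domain_re(settings):
    """Compile a pattern for bare domains ending in a configured TLD (hypothetical helper)."""
    tlds = getattr(settings, "URLIZE_TLDS", DEFAULT_URLIZE_TLDS)
    # Escape each entry so multi-label suffixes such as "co.uk" match literally.
    alternatives = "|".join(re.escape(tld) for tld in tlds)
    return re.compile(r"^\w[\w.-]*\.(?:%s)$" % alternatives, re.IGNORECASE)


# Stand-ins for a project's settings module (hypothetical).
default_settings = SimpleNamespace()
dutch_settings = SimpleNamespace(URLIZE_TLDS=("com", "net", "org", "nl", "co.uk"))
```

With the default, example.nl would stay unlinked; a project that lists "nl" would opt in to linking it.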

comment:5 Changed 3 years ago by DrMeers

Actually maintaining a list of TLDs will probably be too painful; maybe just a smarter regex...

comment:6 Changed 3 years ago by anonymous

I agree that a complete list of TLDs would not be maintainable. Localizing the list of TLDs would still be problematic. For example, particular systems might commonly use .eu domains in a US locale (e.g. a registry).

A regex would have to accept TLDs ranging from 2 characters (.nl) to 6 characters (.museum). The problem with this is that it might match too many strings. Imagine a children's website that automatically matches .xxx domains. Moreover, ICANN has approved customized TLDs, allowing a wide range of new TLDs to be added. This makes it difficult, if not impossible, to provide a regex that matches all TLDs (imagine a .hewlettpackard), and providing a complete list of TLDs is entirely out of the question.

Although Django may make a start by providing some list of TLDs (e.g. some generic TLDs), site owners have to be able to specify a list of TLDs they'd like to be matched.

comment:7 Changed 3 years ago by PaulM

I guess the question is how we want to balance the problem of urlizing things incorrectly with the problem of not linking something that should be linked. Django has erred on the side of not urlizing things, but I think we could be a bit more aggressive.

I'm not aware of a language that contains many words that end in a dot followed by 2 letters. This pattern would cover all the country codes (including the places that use .co.uk style third-levels). These are the most actively in flux and would be the hardest to properly maintain. If we hardcode the list of the other TLDs (I count 21 on Wikipedia), that makes our job much more reasonable. Alternatively, we could limit ourselves to the common 3-letter ones + the country code regex, and make everyone else use the http:// syntax.

I believe that ICANN's approval of custom TLDs will not have a great deal of bearing on this issue for some years yet (so we should fix the common issue now, and worry about those if they become a problem). First, custom TLDs will be EXPENSIVE. Second, the process for doing it hasn't had the details worked out, and once people are able to apply, it will still take a while to go through.

I think we should add a global setting so that people can add or remove things if they want to (some people are likely to want to urlize .onion addresses, for instance). We should assume that word-like-things ending in "dot letter letter" are ccTLDs. We should have a list of common 3-letter TLDs built in.

I'm in favor of making that list com, net, org, edu, gov, mil, info, biz. If we conservatively expand to TLDs that have been around for a while, and aggressively include 2-letter ccTLDs, I think we strike a reasonable balance. We're still not actively excluding anybody (since they can still use the http:// or www. syntax), but we're minimizing the potential future issues as new, more exotic TLDs become available.
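
The proposal above could be sketched roughly as follows. This is an illustration only, not the code that was later committed; the names GENERIC_TLDS, DOMAIN_RE and looks_like_domain are hypothetical.

```python
import re

# Built-in list of established generic TLDs, as proposed in the comment.
GENERIC_TLDS = ("com", "net", "org", "edu", "gov", "mil", "info", "biz")

# A word ending in a listed gTLD, or in a dot followed by exactly two letters,
# is treated as a domain. Any two-letter suffix is assumed to be a ccTLD.
DOMAIN_RE = re.compile(
    r"^\w[\w.-]*\.(?:%s|[a-z]{2})$" % "|".join(GENERIC_TLDS),
    re.IGNORECASE,
)


def looks_like_domain(word):
    """Return True if the word ends in a listed gTLD or any two-letter suffix."""
    return DOMAIN_RE.match(word) is not None
```

Under this sketch example.info, example.nl and example.co.uk would all be linked, while example.museum would still need the http:// or www. syntax.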

comment:8 Changed 2 years ago by aaugustin

I'm going to implement PaulM's proposal, which is a reasonable compromise IMO.

comment:9 Changed 2 years ago by aaugustin

  • Resolution set to fixed
  • Status changed from assigned to closed

In [17359]:

Fixed #16656 -- Changed the urlize filter to accept more top-level domains.

comment:10 Changed 2 years ago by aaugustin

  • Resolution fixed deleted
  • Status changed from closed to reopened

> I'm not aware of a language that contains many words that end in a dot followed by 2 letters.

In fact, this isn't true: foo.py and bar.rb match this pattern, but they certainly aren't urls.

comment:11 Changed 2 years ago by aaugustin

  • Resolution set to wontfix
  • Status changed from reopened to closed

It isn't possible to disambiguate some country codes from common file extensions, including .ai (Anguilla), .mo (Macao), .pl (Poland), .ps (Palestine), .py (Paraguay), .sh (Saint Helena), .so (Somalia), and possibly others. Suffix-based detection is insufficient for ccTLDs. I have rolled back this part of the commit.

The proper solution to highlight a URL is to use the http(s):// or www prefixes.

I'm not in favor of further complicating this tag, and I'm against adding yet another setting.
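
The ambiguity can be illustrated with a small sketch. Both patterns are simplified stand-ins written for this example, not the expressions used in Django, and the sample words are made up.

```python
import re

# Suffix check: any word ending in a dot plus two letters (simplified stand-in).
CCTLD_SUFFIX_RE = re.compile(r"^\w[\w.-]*\.[a-z]{2}$", re.IGNORECASE)
# Prefix check: only an explicit scheme or a leading "www." counts as a URL.
PREFIX_RE = re.compile(r"^(?:https?://|www\.)\w", re.IGNORECASE)

words = ["manage.py", "deploy.sh", "libfoo.so", "example.pl", "www.example.pl"]

# The suffix check cannot tell file names from Polish or Paraguayan domains.
suffix_hits = [w for w in words if CCTLD_SUFFIX_RE.match(w)]
# The prefix check links only the word that was explicitly marked as a URL.
prefix_hits = [w for w in words if PREFIX_RE.match(w)]
```

Here the suffix check accepts every word in the list, file names included, while the prefix check accepts only www.example.pl.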

comment:12 Changed 2 years ago by ralphje

  • Resolution wontfix deleted
  • Status changed from closed to reopened

I don't really see why we wouldn't add a setting for the default TLDs, since there's absolutely no reason why .com, .net and .org are included, while .info isn't, for example. I know that .info is not one of the original gTLDs, but for any 'mortal' user, this is just random. He would see that his .com's are automatically urlized, but his .info's aren't.

Allowing this to be customized would furthermore help internationalization of this filter and would 'explain' why some TLDs were included and some weren't (or at least leave the explaining to the developer). You (more specifically, a developer using this setting) could also choose to not include any TLD, since that would be even more consistent.

comment:13 Changed 2 years ago by aaugustin

  • Has patch unset
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

I expanded the list of TLDs from the arbitrary choice of com|net|org to the slightly less arbitrary choice of the seven original gTLDs com|edu|gov|int|mil|net|org. I had to draw the line somewhere and I don't want the debate on this list to turn into bikeshedding.

That said, I just reread the comments above, and I agree that it could make sense to add a setting...

I'm going to leave this ticket open. Could you provide a patch with tests and docs?

comment:14 Changed 13 months ago by aaugustin

  • Resolution set to wontfix
  • Status changed from reopened to closed

Closing as wontfix again. I'm still ambivalent about adding a setting, and this is a rather minor problem. It's also trivial to monkey-patch urlize or to write your own version if you need to.
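
As a rough idea of what "your own version" might look like, here is a deliberately small sketch. The function name urlize_with_tlds and its tlds parameter are hypothetical, and it is not a drop-in replacement for Django's urlize: it handles neither trailing punctuation nor e-mail addresses nor the nofollow and trimming options.

```python
import re
from html import escape


def urlize_with_tlds(text, tlds=("com", "net", "org")):
    """Wrap bare domains ending in one of `tlds` in anchor tags (illustrative only)."""
    alternatives = "|".join(re.escape(tld) for tld in tlds)
    domain_re = re.compile(r"^\w[\w.-]*\.(?:%s)$" % alternatives, re.IGNORECASE)
    # Split on whitespace but keep the separators so spacing is preserved.
    pieces = re.split(r"(\s+)", text)
    for i, piece in enumerate(pieces):
        if domain_re.match(piece):
            safe = escape(piece)
            pieces[i] = '<a href="http://%s">%s</a>' % (safe, safe)
        else:
            # Everything else is HTML-escaped and left unlinked.
            pieces[i] = escape(piece)
    return "".join(pieces)
```

In a project this could be registered as a custom template filter, which leaves the built-in urlize untouched.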

Please open a pull request with tests and docs and write to django-developers if you want to change something again.
