Opened 3 months ago

Closed 3 months ago

Last modified 3 months ago

#36452 closed Bug (invalid)

DomainNameValidator forbids digits in TLDs

Reported by: Shai Berger Owned by:
Component: Core (Other) Version: dev
Severity: Normal Keywords: validation domain
Cc: Shai Berger, Claude Paroz, Mike Edmunds Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: yes UI/UX: no

Description

I think there's a small bug in the domain validator that has been lurking quietly for years and is now biting me a little. The issue is digits in top-level domains, e.g. email.com1. As far as I can read the definition in RFC 1035 (page 8), this is a perfectly valid domain name, but our regex, as I write this, allows only letters. This is the regex for i18n-supporting domains; there's an "ascii_only_tld" regex right next to it which does allow digits, and that makes me quite certain that it's a bug.

Of note: The class DomainNameValidator is relatively new (it was added only about a year ago), but it inherits the regex from the older URLValidator, which, it seems, has forbidden digits in TLDs at least since Django 2.x. Since EmailValidator now also uses the regexes from DomainNameValidator, it is affected as well.

Change History (3)

comment:1 by David Sanders, 3 months ago

I suppose technically it's incorrect; however, I didn't see any registered TLDs with digits, and the folks who were involved with the recent update were hesitant to touch the existing regex for fear of breaking something.

I'm just wondering "what would Carlton decide here" lmao

comment:2 by Sarah Boyce, 3 months ago

Cc: Claude Paroz Mike Edmunds added
Resolution: needsinfo
Status: new → closed

I was trying to check if email.com1 should be valid.

Looking at this list of top-level domains (https://www.icann.org/en/contracted-parties/registry-operators/resources/list-of-top-level-domains), the only top-level domains containing digits are the ones prefixed with XN--. These are allowed by our validator.

I think I agree that looking at the RFC, the definition isn't this strict and digits are allowed without hyphens:

<domain> ::= <subdomain> | " "

<subdomain> ::= <label> | <subdomain> "." <label>

<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]

<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>

<let-dig-hyp> ::= <let-dig> | "-"

<let-dig> ::= <letter> | <digit>

<letter> ::= any one of the 52 alphabetic characters A through Z in
upper case and a through z in lower case

<digit> ::= any one of the ten digits 0 through 9
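
To make the grammar above concrete, here is a minimal sketch that reads it literally as a regular expression. The names RFC1035_LABEL, is_rfc1035_label and is_rfc1035_domain are hypothetical helpers for illustration only; they are not Django's regexes. The sketch ignores the `" "` (root) alternative and any later RFCs that modify RFC 1035.

```python
import re

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
# A label starts with a letter, may continue with letters, digits or hyphens,
# and must not end with a hyphen. The 63-character cap comes from the RFC's
# size limits rather than from the grammar itself.
RFC1035_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_rfc1035_label(label: str) -> bool:
    """Return True if `label` matches the <label> production read literally."""
    return RFC1035_LABEL.fullmatch(label) is not None


def is_rfc1035_domain(name: str) -> bool:
    """<subdomain> ::= <label> | <subdomain> "." <label>"""
    return all(is_rfc1035_label(part) for part in name.split("."))


if __name__ == "__main__":
    for candidate in ("email.com1", "email.1com", "email.com-"):
        print(candidate, is_rfc1035_domain(candidate))
```

Under this literal reading, a last label such as com1 is syntactically fine, while 1com (leading digit) and com- (trailing hyphen) are not. This says nothing about whether such a TLD can actually be registered.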

Before we continue, I think we should get confirmation that tlds like com1 are valid

comment:3 by Mike Edmunds, 3 months ago (in reply to: 2)

Resolution: needsinfo → invalid

Replying to Sarah Boyce:

Before we continue, I think we should get confirmation that tlds like com1 are valid

I believe com1 is not a valid TLD under current ICANN rules. Since ICANN decides what's a valid gTLD, their policies override whatever RFC 1035 may seem to allow.

There's a pretty thorough review here: https://stackoverflow.com/questions/9071279/number-in-the-top-level-domain/53875771.

That said, I haven't personally reviewed RFC 1035 and all 29(!) RFCs that modify it. The ICANN gTLD policies are from 2012; there's a new gTLD policy in draft form now, and I haven't reviewed that either. So if someone finds a newer policy that would allow digits in TLDs—or better yet, real-world evidence of a (non-IDNA) TLD containing digits—then we should revisit this.

The exception, as Sarah noted, is an IDNA-encoded TLD starting with xn--. ICANN allows those, and so does Django's DomainNameValidator.
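
As a small illustration of why IDNA-encoded TLDs can contain digits: the ASCII form is a Punycode encoding of the Unicode label, and that encoding uses digits as well as letters. The sketch below relies on Python's built-in "idna" codec (which implements the older IDNA 2003 rules); to_a_label is a hypothetical helper name and is not part of Django.

```python
def to_a_label(label: str) -> str:
    """Encode a single Unicode label to its ASCII (xn--) form.

    Uses Python's built-in "idna" codec; plain ASCII labels pass through
    unchanged.
    """
    return label.encode("idna").decode("ascii")


if __name__ == "__main__":
    # The Cyrillic TLD for Russia encodes to an xn-- label containing a digit.
    print(to_a_label("рф"))   # xn--p1ai
    print(to_a_label("com"))  # com
```

The digit in xn--p1ai is part of the encoding, not something a registrant chose, which is why such labels are treated separately from a plain TLD like com1.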

(Also, I suppose there could be internal-use-only TLDs containing digits, which might be valid under the RFCs but wouldn't be usable on the public Internet. That seems pretty niche, and anyone having that use case could subclass Django's DomainNameValidator to cover it.)
