Opened 22 months ago
Closed 22 months ago
#34169 closed Bug (duplicate)
Regex bug in EmailValidator class allows top domain label of an email address's domain_part to start with a hyphen
Reported by: | Niko | Owned by: | nobody |
---|---|---|---|
Component: | Core (Mail) | Version: | 4.1 |
Severity: | Normal | Keywords: | Email, EmailValidator, core, regex |
Cc: | rohandeshpande832@…, norton@… | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
We found a possible bug in the email validation regex for the
domain part of an email address. This regex lives in the EmailValidator class in
the django/django/core/validators.py file. We are referring to the domain_regex variable
on line 187.
Short description of the bug:
In short, the current domain_regex considers an email address valid if the top-domain part of the address starts with a hyphen. However, according to RFC 5321, a hyphen at the start or end of any sub-domain in the domain part makes the email address invalid. So if a hyphen appears at the start of the top-domain part of an email address, Django's EmailValidator should reject that address.
Long description of the bug:
To be on the same page, we will define some nomenclature to describe the bug in more detail.
Let's use this as an example email address: xyz@a.b.com.
- The "xyz" part is the local part of the email address and is not of interest in this bug report.
- The "a.b.com" is the domain part of the email address.
- The "a", "b" and "com" are the sub-domains and "com" is specifically a top-domain.
We will reference section 4.1.2 of the RFC 5321 Simple Mail Transfer Protocol specification, as
that document is the ground truth for how an email address should be structured.
The ABNF for the domain part of the email address is given below.
Domain = sub-domain *("." sub-domain)
sub-domain = Let-dig [Ldh-str]
Let-dig = ALPHA / DIGIT
Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig
We believe the implementation of the domain checker is incorrect based on the above ABNF representation.
The Domain component of an email address, as defined above, consists of a sub-domain followed by zero or more tokens, each made up of a period "." and another sub-domain. A sub-domain is a Let-dig optionally followed by an Ldh-str (the square brackets in [Ldh-str] mark it as optional). Note that Let-dig is restricted to alphanumeric characters, as given by 'ALPHA / DIGIT'. Also note that Ldh-str is defined as zero or more alphanumeric characters or hyphens, followed by a Let-dig. So an Ldh-str must end with a strictly alphanumeric character.
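As a minimal sketch of the grammar just described, the ABNF rules can be transcribed one by one into a Python regular expression. The names LET_DIG, LDH_STR, SUB_DOMAIN, DOMAIN and is_rfc5321_domain are hypothetical and chosen only for this illustration; this is not Django's domain_regex. The sketch checks only the syntax quoted above and ignores length limits and address literals.

```python
import re

# Each constant mirrors one ABNF rule from RFC 5321 section 4.1.2.
# These names are illustrative assumptions, not Django identifiers.
LET_DIG = r"[A-Za-z0-9]"                              # Let-dig = ALPHA / DIGIT
LDH_STR = r"[A-Za-z0-9-]*" + LET_DIG                  # Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig
SUB_DOMAIN = LET_DIG + r"(?:" + LDH_STR + r")?"       # sub-domain = Let-dig [Ldh-str]
DOMAIN = SUB_DOMAIN + r"(?:\." + SUB_DOMAIN + r")*"   # Domain = sub-domain *("." sub-domain)

DOMAIN_RE = re.compile(DOMAIN)


def is_rfc5321_domain(domain: str) -> bool:
    """Return True if the whole string matches the Domain rule above."""
    return DOMAIN_RE.fullmatch(domain) is not None


if __name__ == "__main__":
    for candidate in ("a.b.com", "a-b.com", "a.b.-com", "a.b.com-"):
        print(candidate, is_rfc5321_domain(candidate))
```

Under this transcription, a label such as "-com" cannot match because every sub-domain must begin with a Let-dig.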
Since a sub-domain has a strict ordering in which a Let-dig comes first and the optional Ldh-str comes next, we can infer that the first and last characters of each sub-domain can only be alphanumeric (a-zA-Z0-9). Thus, the fact that Django's email checker allows the sub-domain "-com", for example, violates the RFC's specification because the hyphen is the first character.
Reference: https://www.rfc-editor.org/rfc/rfc5321.html#section-4.1.2
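The first-and-last-character rule inferred above can also be checked without a regex. The following is a rough sketch, independent of Django, that splits a domain on "." and lists the labels breaking that rule. The function name offending_labels is hypothetical, and the check deliberately covers only the first and last characters of each label, not the characters in between.

```python
import string

# ASCII letters and digits, i.e. the Let-dig alphabet from the RFC.
ALNUM = set(string.ascii_letters + string.digits)


def offending_labels(domain: str) -> list[str]:
    """Return labels that are empty or do not start and end with a letter or digit."""
    bad = []
    for label in domain.split("."):
        if not label or label[0] not in ALNUM or label[-1] not in ALNUM:
            bad.append(label)
    return bad


if __name__ == "__main__":
    print(offending_labels("a.b.-com"))   # the label "-com" is reported
    print(offending_labels("a.b.com"))    # no labels are reported
```

A helper like this makes it easy to see which sub-domain causes a rejection, which a single pass/fail regex match does not show.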
Python script which uses the bugged function to demonstrate our point