Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#25452 closed Bug (wontfix)

Email validation for domain `gmail.-com` is considered valid

Reported by: Paul Hallett Owned by: Anton Baklanov
Component: Forms Version: 1.8
Severity: Normal Keywords:
Cc: cmawebsite@…, dheeru.rathor14@… Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

When entering an email like "test@gmail.-com" the email validator returns True.

Particularly, the validate_domain_part allows domains with a hyphen character in the TLD:

from django.core.validators import validate_email
validate_email.validate_domain_part('gmail.-com')
True

Nearly all other special characters return correctly:

from django.core.validators import validate_email
validate_email.validate_domain_part('gmail._com')
False

Unless my knowledge of valid TLDs is wrong, I don't think this is correct :(

Change History (16)

comment:1 Changed 5 years ago by Collin Anderson

Cc: cmawebsite@… added
Easy pickings: unset
Triage Stage: UnreviewedAccepted

Confirmed on 1.8 and 1.9. Chrome's email validation rejects this, so I assume this is unintentional.

comment:2 Changed 5 years ago by Paul Hallett

We've been investigating this more and it appears that hyphens can be in TLDs, just not at the start of the beginning:

Domain names may be formed from the set of alphanumeric ASCII characters (a-z, A-Z, 0-9), but characters are case-insensitive. In addition the hyphen is permitted if it is surrounded by characters, digits or hyphens, although it is not to start or end a label.

https://en.wikipedia.org/wiki/Domain_name#Technical_requirements_and_process

Last edited 5 years ago by Paul Hallett (previous) (diff)

comment:3 Changed 5 years ago by Tim Graham

Please check URLValidator to see if it handles this (if so, maybe you could borrow from it) or if it requires a similar fix.

comment:4 Changed 5 years ago by Anton Baklanov

Owner: changed from nobody to Anton Baklanov
Status: newassigned

According to the https://tools.ietf.org/html/rfc1035 domain labels can contain hyphens but not as their first character.

EmailValidator.domain_regex checks this for all labels but the last one (the TLD) and it looks like a bug to me (not something that was done on intention).

URLValidator uses more complex domain validation regex set (including unicode, etc).

I will double check if borrowing those checks into EmailValidator won't violate any standards and come back with a patch in case it's ok.

comment:5 Changed 5 years ago by Dheerendra Rathor

Cc: dheeru.rathor14@… added

comment:6 Changed 5 years ago by Anton Baklanov

Has patch: set

The pull request

Unfortunately I have not found a good way to merge URLValidator and EmailValidator since there are tons of small differences.

So I decided to fix EmailValidator instead. I tried to be as accurate as possible (actual regex changes are just few characters long).

During reading various RFCs and articles I've found some other easy-fixable issues (like allowing quoted '@', space or backslash for dot atom local part, allowing spaces inside quoted local part, etc) those are included in above PR as well.

Proper RFC-based email validation is a surprisingly hard task and definitely no one wants to have few pages long regex in Django.

So updated validator is still not fully RFC compliant but few issues are fixed now (or covered with test cases).

Commit messages includes detailed description of all changes that were made.

comment:7 Changed 4 years ago by Ned Batchelder

Can I respectfully suggest that continuing to tweak this complex regex to get asymptotically closer to perfection is not worth it? Especially to fix false positives. What real-world problem is happening because "gmail.-com" is accepted? "gmail.ccomm" is also accepted, but is just as useless as an email address.

comment:8 Changed 4 years ago by Tim Graham

I am open to that if you can get consensus on the DevelopersMailingList on a set of limitations that we can document so that we have something to point to when we get requests for enhancements. I imagine this policy would also include URLValidator.

comment:9 Changed 4 years ago by Collin Anderson

I think we should try to just match the standard html <input type="email"> validation. I'd imagine that most uses cases would want to match that. We might be able to use the regex verbatim from the standard itself:

https://html.spec.whatwg.org/multipage/forms.html#e-mail-state-(type=email)

If people want to allow things outside of that they could use a custom regex.

Last edited 4 years ago by Collin Anderson (previous) (diff)

comment:10 Changed 4 years ago by Collin Anderson

Though it gets more complicated when considering Unicode. Unicode needs to get normalized to ascii before running through the official regex.

Here's how chrome does it: https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/html/forms/EmailInputType.cpp

comment:11 Changed 4 years ago by Anton Baklanov

How about resolving this issue with the smallest possible change and moving future validation regex improvement discussion into separate ticket? Would this be legit?

comment:12 Changed 4 years ago by Ned Batchelder

@bak1an: Can you explain why it's important to reject "gmail.-com"? Why add *any* more complexity just to reject more bogus email addresses?

comment:13 Changed 4 years ago by Tim Graham

Patch needs improvement: set

I started a thread on django-developers to find a way forward.

comment:14 Changed 4 years ago by Tim Graham

Resolution: wontfix
Status: assignedclosed

The consensus on the mailing list seems to be to simplify the validation, not make it more comprehensive.

comment:15 Changed 4 years ago by Claude Paroz

Tim, is there a ticket about the simplification, or should we create one?

comment:16 Changed 4 years ago by Tim Graham

Ticket for simplifying the validation is #26423.

Note: See TracTickets for help on using tickets.
Back to Top