#2049 closed defect (fixed)
[patch] isValidEmail is too narrow
Reported by: | Owned by: | Adrian Holovaty | |
---|---|---|---|
Component: | Validators | Version: | dev |
Severity: | normal | Keywords: | |
Cc: | Triage Stage: | Unreviewed | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The validator for email addresses is defined using a regular expression:
email_re = re.compile(r'^[A-Z0-9._%-][+A-Z0-9._%-]*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)
This is too narrow. RFC 2822 defines
(3.2.4) atext = ALPHA / DIGIT / ; Any character except controls, "!" / "#" / ; SP, and specials. "$" / "%" / ; Used for atoms "&" / "'" / "*" / "+" / "-" / "/" / "=" / "?" / "^" / "_" / "`" / "{" / "|" / "}" / "~" atom = [CFWS] 1*atext [CFWS] dot-atom = [CFWS] dot-atom-text [CFWS] dot-atom-text = 1*atext *("." 1*atext) (3.4.1) addr-spec = local-part "@" domain local-part = dot-atom / quoted-string / obs-local-part
This boild down to [-.!#$%&'*+/=?^_`{}|~0-9A-Z]+
for the localpart, if you ignore quoted-string and obs-local-part.
Change History (10)
comment:1 by , 18 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
comment:2 by , 18 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
Hmm, the solution posted above was already a simplification. dots are only allowed between other symbols. The real solution would be
[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*
why should this be exponential? As the regex state machine digests input, there's always only one choice for the next step, depending on whether the input character is a dot '.' or not. As long as a regexp only requires limited lookahead, it's not exponential. This one is only linear. Probably the discussion you are referring to is about a different regexp.
Taking all into account, I'd propose this one:
email_re = re.compile(r"^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)
in django.core.validators.
comment:3 by , 18 years ago
Yeah, I was referring to the previous regular expression we had, which we replaced with the current one.
comment:4 by , 18 years ago
Actually that's too narrow as well ;)
This is the shortest regex I know of which implements the full RFC822 grammar: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
comment:5 by , 18 years ago
I don't get your point. You asked me to for a regexp that improves the current one and still does not require exponential cost. Isn't it an improvement over the comletely random current restrictions (e.g., why does it support "%" but not "=" etc.)?
Your link points to a regexp that also tries to cope with what RFC 2822 consideres obsolete. Do we agree that we don't need to support this? This is the stuff that makes the regexp so bloated and exponential.
Otherwise, I can extend my solution for quoted-strings, that still shouldn't require exponential cost, if that's what you think that should be done.
comment:6 by , 18 years ago
Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)
comment:7 by , 18 years ago
Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)
Oops, sorry ... usually no-one looks into a ticket for days, and now two different people within minutes ;-) - I'm confused!
Just collecting more material from rfc for the quoting part ... this should be feasible without too much effort if we ignore the possibility of comments.
(3.2.1) NO-WS-CTL = %d1-8 / ; US-ASCII control characters %d11 / ; that do not include the %d12 / ; carriage return, line feed, %d14-31 / ; and white space characters %d127 (3.2.2) quoted-pair = ("\" text) / obs-qp (3.2.5) qtext = NO-WS-CTL / ; Non white space controls %d33 / ; The rest of the US-ASCII %d35-91 / ; characters not including "\" %d93-126 ; or the quote character qcontent = qtext / quoted-pair quoted-string = [CFWS] DQUOTE *([FWS] qcontent) [FWS] DQUOTE [CFWS]
comment:8 by , 18 years ago
Summary: | isValidEmail is too narrow → [patch] isValidEmail is too narrow |
---|
The following regexp matches email addresses with a localpart of either dot-atom or quoted string, except:
- obsolete forms
- comments
- CRLF inside the quoted-string
Since comments and CRLF should be ignored semantically, this allows users to enter all email addresses except obsolete forms.
Still, it's not exponential. Handling comments would push you into the exponential area.
email_re = re.compile( r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string r')@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE) # domain
comment:9 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
We previously had a more intense regex to valid e-mail addresses, but it was quite slow for large e-mail addresses, due to being an exponential regex. If you can come up with a regex that is more accurate *and* is still fast, please post the entire line here and reopen the ticket.