Opened 18 years ago

Closed 18 years ago

Last modified 17 years ago

#2049 closed defect (fixed)

[patch] isValidEmail is too narrow

Reported by: mir@… Owned by: Adrian Holovaty
Component: Validators Version: dev
Severity: normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The validator for email addresses is defined using a regular expression:

email_re = re.compile(r'^[A-Z0-9._%-][+A-Z0-9._%-]*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)

This is too narrow. RFC 2822 defines

(3.2.4)

atext           =       ALPHA / DIGIT / ; Any character except controls,
                        "!" / "#" /     ;  SP, and specials.
                        "$" / "%" /     ;  Used for atoms
                        "&" / "'" /
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
                        "~"

atom            =       [CFWS] 1*atext [CFWS]

dot-atom        =       [CFWS] dot-atom-text [CFWS]

dot-atom-text   =       1*atext *("." 1*atext)

(3.4.1)

addr-spec       =       local-part "@" domain

local-part      =       dot-atom / quoted-string / obs-local-part

This boild down to [-.!#$%&'*+/=?^_`{}|~0-9A-Z]+ for the localpart, if you ignore quoted-string and obs-local-part.

Change History (10)

comment:1 by Adrian Holovaty, 18 years ago

Resolution: worksforme
Status: newclosed

We previously had a more intense regex to valid e-mail addresses, but it was quite slow for large e-mail addresses, due to being an exponential regex. If you can come up with a regex that is more accurate *and* is still fast, please post the entire line here and reopen the ticket.

comment:2 by mir@…, 18 years ago

Resolution: worksforme
Status: closedreopened

Hmm, the solution posted above was already a simplification. dots are only allowed between other symbols. The real solution would be

[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*

why should this be exponential? As the regex state machine digests input, there's always only one choice for the next step, depending on whether the input character is a dot '.' or not. As long as a regexp only requires limited lookahead, it's not exponential. This one is only linear. Probably the discussion you are referring to is about a different regexp.

Taking all into account, I'd propose this one:

email_re = re.compile(r"^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)

in django.core.validators.

comment:3 by Adrian Holovaty, 18 years ago

Yeah, I was referring to the previous regular expression we had, which we replaced with the current one.

comment:4 by James Bennett, 18 years ago

Actually that's too narrow as well ;)

This is the shortest regex I know of which implements the full RFC822 grammar: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

comment:5 by anonymous, 18 years ago

I don't get your point. You asked me to for a regexp that improves the current one and still does not require exponential cost. Isn't it an improvement over the comletely random current restrictions (e.g., why does it support "%" but not "=" etc.)?

Your link points to a regexp that also tries to cope with what RFC 2822 consideres obsolete. Do we agree that we don't need to support this? This is the stuff that makes the regexp so bloated and exponential.

Otherwise, I can extend my solution for quoted-strings, that still shouldn't require exponential cost, if that's what you think that should be done.

comment:6 by Adrian Holovaty, 18 years ago

Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

comment:7 by mir@…, 18 years ago

Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

Oops, sorry ... usually no-one looks into a ticket for days, and now two different people within minutes ;-) - I'm confused!

Just collecting more material from rfc for the quoting part ... this should be feasible without too much effort if we ignore the possibility of comments.

(3.2.1)

NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters
                        %d127

(3.2.2)

quoted-pair     =       ("\" text) / obs-qp

(3.2.5)
qtext           =       NO-WS-CTL /     ; Non white space controls

                        %d33 /          ; The rest of the US-ASCII
                        %d35-91 /       ;  characters not including "\"
                        %d93-126        ;  or the quote character

qcontent        =       qtext / quoted-pair

quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]

comment:8 by mir@…, 18 years ago

Summary: isValidEmail is too narrow[patch] isValidEmail is too narrow

The following regexp matches email addresses with a localpart of either dot-atom or quoted string, except:

  • obsolete forms
  • comments
  • CRLF inside the quoted-string

Since comments and CRLF should be ignored semantically, this allows users to enter all email addresses except obsolete forms.

Still, it's not exponential. Handling comments would push you into the exponential area.

email_re = re.compile(
        r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
        r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string
        r')@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)  # domain

comment:9 by Adrian Holovaty, 18 years ago

Resolution: fixed
Status: reopenedclosed

(In [3026]) Fixed #2049 -- Made isValidEmail validator wider in scope. Thanks, mir@…

comment:10 by anonymous, 17 years ago

WHere is defined RegExp ? Is a java internal function ?

Note: See TracTickets for help on using tickets.
Back to Top