Opened 12 years ago

Closed 12 years ago

Last modified 11 years ago

#2049 closed defect (fixed)

[patch] isValidEmail is too narrow

Reported by: mir@… Owned by: Adrian Holovaty
Component: Validators Version: master
Severity: normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no


The validator for email addresses is defined using a regular expression:

email_re = re.compile(r'^[A-Z0-9._%-][+A-Z0-9._%-]*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)

This is too narrow. RFC 2822 defines


atext           =       ALPHA / DIGIT / ; Any character except controls,
                        "!" / "#" /     ;  SP, and specials.
                        "$" / "%" /     ;  Used for atoms
                        "&" / "'" /
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /

atom            =       [CFWS] 1*atext [CFWS]

dot-atom        =       [CFWS] dot-atom-text [CFWS]

dot-atom-text   =       1*atext *("." 1*atext)


addr-spec       =       local-part "@" domain

local-part      =       dot-atom / quoted-string / obs-local-part

This boild down to [-.!#$%&'*+/=?^_`{}|~0-9A-Z]+ for the localpart, if you ignore quoted-string and obs-local-part.

Change History (10)

comment:1 Changed 12 years ago by Adrian Holovaty

Resolution: worksforme
Status: newclosed

We previously had a more intense regex to valid e-mail addresses, but it was quite slow for large e-mail addresses, due to being an exponential regex. If you can come up with a regex that is more accurate *and* is still fast, please post the entire line here and reopen the ticket.

comment:2 Changed 12 years ago by mir@…

Resolution: worksforme
Status: closedreopened

Hmm, the solution posted above was already a simplification. dots are only allowed between other symbols. The real solution would be


why should this be exponential? As the regex state machine digests input, there's always only one choice for the next step, depending on whether the input character is a dot '.' or not. As long as a regexp only requires limited lookahead, it's not exponential. This one is only linear. Probably the discussion you are referring to is about a different regexp.

Taking all into account, I'd propose this one:

email_re = re.compile(r"^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)

in django.core.validators.

comment:3 Changed 12 years ago by Adrian Holovaty

Yeah, I was referring to the previous regular expression we had, which we replaced with the current one.

comment:4 Changed 12 years ago by James Bennett

Actually that's too narrow as well ;)

This is the shortest regex I know of which implements the full RFC822 grammar:

comment:5 Changed 12 years ago by anonymous

I don't get your point. You asked me to for a regexp that improves the current one and still does not require exponential cost. Isn't it an improvement over the comletely random current restrictions (e.g., why does it support "%" but not "=" etc.)?

Your link points to a regexp that also tries to cope with what RFC 2822 consideres obsolete. Do we agree that we don't need to support this? This is the stuff that makes the regexp so bloated and exponential.

Otherwise, I can extend my solution for quoted-strings, that still shouldn't require exponential cost, if that's what you think that should be done.

comment:6 Changed 12 years ago by Adrian Holovaty

Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

comment:7 Changed 12 years ago by mir@…

Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

Oops, sorry ... usually no-one looks into a ticket for days, and now two different people within minutes ;-) - I'm confused!

Just collecting more material from rfc for the quoting part ... this should be feasible without too much effort if we ignore the possibility of comments.


NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters


quoted-pair     =       ("\" text) / obs-qp

qtext           =       NO-WS-CTL /     ; Non white space controls

                        %d33 /          ; The rest of the US-ASCII
                        %d35-91 /       ;  characters not including "\"
                        %d93-126        ;  or the quote character

qcontent        =       qtext / quoted-pair

quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE

comment:8 Changed 12 years ago by mir@…

Summary: isValidEmail is too narrow[patch] isValidEmail is too narrow

The following regexp matches email addresses with a localpart of either dot-atom or quoted string, except:

  • obsolete forms
  • comments
  • CRLF inside the quoted-string

Since comments and CRLF should be ignored semantically, this allows users to enter all email addresses except obsolete forms.

Still, it's not exponential. Handling comments would push you into the exponential area.

email_re = re.compile(
        r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
        r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string
        r')@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)  # domain

comment:9 Changed 12 years ago by Adrian Holovaty

Resolution: fixed
Status: reopenedclosed

(In [3026]) Fixed #2049 -- Made isValidEmail validator wider in scope. Thanks, mir@…

comment:10 Changed 11 years ago by anonymous

WHere is defined RegExp ? Is a java internal function ?

Note: See TracTickets for help on using tickets.
Back to Top