Code

Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#2049 closed defect (fixed)

[patch] isValidEmail is too narrow

Reported by: mir@… Owned by: adrian
Component: Validators Version: master
Severity: normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: UI/UX:

Description

The validator for email addresses is defined using a regular expression:

email_re = re.compile(r'^[A-Z0-9._%-][+A-Z0-9._%-]*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)

This is too narrow. RFC 2822 defines

(3.2.4)

atext           =       ALPHA / DIGIT / ; Any character except controls,
                        "!" / "#" /     ;  SP, and specials.
                        "$" / "%" /     ;  Used for atoms
                        "&" / "'" /
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
                        "~"

atom            =       [CFWS] 1*atext [CFWS]

dot-atom        =       [CFWS] dot-atom-text [CFWS]

dot-atom-text   =       1*atext *("." 1*atext)

(3.4.1)

addr-spec       =       local-part "@" domain

local-part      =       dot-atom / quoted-string / obs-local-part

This boild down to [-.!#$%&'*+/=?^_`{}|~0-9A-Z]+ for the localpart, if you ignore quoted-string and obs-local-part.

Attachments (0)

Change History (10)

comment:1 Changed 8 years ago by adrian

  • Resolution set to worksforme
  • Status changed from new to closed

We previously had a more intense regex to valid e-mail addresses, but it was quite slow for large e-mail addresses, due to being an exponential regex. If you can come up with a regex that is more accurate *and* is still fast, please post the entire line here and reopen the ticket.

comment:2 Changed 8 years ago by mir@…

  • Resolution worksforme deleted
  • Status changed from closed to reopened

Hmm, the solution posted above was already a simplification. dots are only allowed between other symbols. The real solution would be

[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*

why should this be exponential? As the regex state machine digests input, there's always only one choice for the next step, depending on whether the input character is a dot '.' or not. As long as a regexp only requires limited lookahead, it's not exponential. This one is only linear. Probably the discussion you are referring to is about a different regexp.

Taking all into account, I'd propose this one:

email_re = re.compile(r"^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)

in django.core.validators.

comment:3 Changed 8 years ago by adrian

Yeah, I was referring to the previous regular expression we had, which we replaced with the current one.

comment:4 Changed 8 years ago by ubernostrum

Actually that's too narrow as well ;)

This is the shortest regex I know of which implements the full RFC822 grammar: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

comment:5 Changed 8 years ago by anonymous

I don't get your point. You asked me to for a regexp that improves the current one and still does not require exponential cost. Isn't it an improvement over the comletely random current restrictions (e.g., why does it support "%" but not "=" etc.)?

Your link points to a regexp that also tries to cope with what RFC 2822 consideres obsolete. Do we agree that we don't need to support this? This is the stuff that makes the regexp so bloated and exponential.

Otherwise, I can extend my solution for quoted-strings, that still shouldn't require exponential cost, if that's what you think that should be done.

comment:6 Changed 8 years ago by adrian

Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

comment:7 Changed 8 years ago by mir@…

Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

Oops, sorry ... usually no-one looks into a ticket for days, and now two different people within minutes ;-) - I'm confused!

Just collecting more material from rfc for the quoting part ... this should be feasible without too much effort if we ignore the possibility of comments.

(3.2.1)

NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters
                        %d127

(3.2.2)

quoted-pair     =       ("\" text) / obs-qp

(3.2.5)
qtext           =       NO-WS-CTL /     ; Non white space controls

                        %d33 /          ; The rest of the US-ASCII
                        %d35-91 /       ;  characters not including "\"
                        %d93-126        ;  or the quote character

qcontent        =       qtext / quoted-pair

quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]

comment:8 Changed 8 years ago by mir@…

  • Summary changed from isValidEmail is too narrow to [patch] isValidEmail is too narrow

The following regexp matches email addresses with a localpart of either dot-atom or quoted string, except:

  • obsolete forms
  • comments
  • CRLF inside the quoted-string

Since comments and CRLF should be ignored semantically, this allows users to enter all email addresses except obsolete forms.

Still, it's not exponential. Handling comments would push you into the exponential area.

email_re = re.compile(
        r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
        r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string
        r')@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)  # domain

comment:9 Changed 8 years ago by adrian

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [3026]) Fixed #2049 -- Made isValidEmail validator wider in scope. Thanks, mir@…

comment:10 Changed 7 years ago by anonymous

WHere is defined RegExp ? Is a java internal function ?

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.