Ticket #2049 (closed: fixed)

Opened 2 years ago

Last modified 1 year ago

[patch] isValidEmail is too narrow

Reported by: mir@noris.de
Assigned to: adrian
Milestone:
Component: Validators
Version: SVN
Keywords:
Cc:
Triage Stage: Unreviewed
Has patch: 1
Needs documentation: 0
Needs tests: 0
Patch needs improvement: 0

Description

The validator for email addresses is defined using a regular expression:

email_re = re.compile(r'^[A-Z0-9._%-][+A-Z0-9._%-]*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)

This is too narrow. RFC 2822 defines

(3.2.4)

atext           =       ALPHA / DIGIT / ; Any character except controls,
                        "!" / "#" /     ;  SP, and specials.
                        "$" / "%" /     ;  Used for atoms
                        "&" / "'" /
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
                        "~"

atom            =       [CFWS] 1*atext [CFWS]

dot-atom        =       [CFWS] dot-atom-text [CFWS]

dot-atom-text   =       1*atext *("." 1*atext)

(3.4.1)

addr-spec       =       local-part "@" domain

local-part      =       dot-atom / quoted-string / obs-local-part

This boils down to [-.!#$%&'*+/=?^_`{}|~0-9A-Z]+ for the local part, if you ignore quoted-string and obs-local-part.
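A minimal sketch of the difference, assuming the simplified local-part class is simply combined with the existing domain pattern (that combination and the names `current_re`, `simplified_re`, `samples` and `results` are illustrative, not taken from django.core.validators):

```python
import re

# Validator regex quoted in the description above.
current_re = re.compile(
    r'^[A-Z0-9._%-][+A-Z0-9._%-]*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$',
    re.IGNORECASE)

# Simplified local part (quoted-string and obs-local-part ignored),
# glued to the unchanged domain pattern. Assumption for illustration only.
simplified_re = re.compile(
    r"^[-.!#$%&'*+/=?^_`{}|~0-9A-Z]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$",
    re.IGNORECASE)

samples = [
    "user@example.com",
    "first=last@example.com",   # "=" is atext in RFC 2822
    "o'brien@example.com",      # "'" is atext in RFC 2822
]

# Map each sample to (accepted by current regex, accepted by simplified regex).
results = {
    s: (bool(current_re.match(s)), bool(simplified_re.match(s)))
    for s in samples
}

for address, (old, new) in results.items():
    print(f"{address:28} current={old!s:5} simplified={new}")
```

The two addresses containing "=" and "'" are rejected by the current pattern and accepted by the simplified one.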

Change History

05/31/06 09:31:53 changed by adrian

  • status changed from new to closed.
  • resolution set to worksforme.

We previously had a more thorough regex to validate e-mail addresses, but it was quite slow for long e-mail addresses because it took exponential time. If you can come up with a regex that is more accurate *and* is still fast, please post the entire line here and reopen the ticket.

05/31/06 10:31:18 changed by mir@noris.de

  • status changed from closed to reopened.
  • resolution deleted.

Hmm, the solution posted above was already a simplification. Dots are only allowed between other symbols. The real solution would be

[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*

Why should this be exponential? As the regex state machine digests input, there's always only one choice for the next step, depending on whether the input character is a dot '.' or not. As long as a regexp only requires limited lookahead, it's not exponential. This one is linear. The discussion you are referring to is probably about a different regexp.

Taking all into account, I'd propose this one:

email_re = re.compile(r"^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)

in django.core.validators.
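As a rough check of the dot handling, here is a small sketch that exercises the proposed pattern. The helper name `is_valid` is hypothetical and only wraps the regex for the examples:

```python
import re

# Proposed pattern from this comment: dot-atom local part, unchanged domain.
email_re = re.compile(
    r"^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"
    r"@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$", re.IGNORECASE)


def is_valid(address):
    """Hypothetical helper: True if the address matches email_re."""
    return email_re.match(address) is not None


for address in [
    "john.doe@example.com",    # dot between atoms: accepted
    "a+b=c@example.org",       # "+" and "=" are atext: accepted
    "john..doe@example.com",   # two consecutive dots: rejected
    ".john@example.com",       # leading dot: rejected
    "john.@example.com",       # trailing dot: rejected
]:
    print(f"{address:26} {is_valid(address)}")
```

Each character either extends the current atom or, if it is a dot, must start a new one, so the matcher never has two ways to consume the same local part.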

05/31/06 10:54:24 changed by adrian

Yeah, I was referring to the previous regular expression we had, which we replaced with the current one.

05/31/06 10:56:14 changed by ubernostrum

Actually that's too narrow as well ;)

This is the shortest regex I know of which implements the full RFC822 grammar: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

05/31/06 11:14:06 changed by anonymous

I don't get your point. You asked me for a regexp that improves on the current one and still does not require exponential cost. Isn't it an improvement over the completely random current restrictions (e.g., why does it support "%" but not "=")?

Your link points to a regexp that also tries to cope with what RFC 2822 considers obsolete. Do we agree that we don't need to support this? This is the stuff that makes the regexp so bloated and exponential.

Otherwise, I can extend my solution to cover quoted-strings, which still shouldn't require exponential cost, if that's what you think should be done.

05/31/06 11:24:34 changed by adrian

Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

05/31/06 11:38:04 changed by mir@noris.de

> Note that two separate people have been commenting to this ticket -- ubernostrum and I are two separate people. :)

Oops, sorry ... usually no-one looks into a ticket for days, and now two different people within minutes ;-) - I'm confused!

Just collecting more material from rfc for the quoting part ... this should be feasible without too much effort if we ignore the possibility of comments.

(3.2.1)

NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters
                        %d127

(3.2.2)

quoted-pair     =       ("\" text) / obs-qp

(3.2.5)

qtext           =       NO-WS-CTL /     ; Non white space controls
                        %d33 /          ; The rest of the US-ASCII
                        %d35-91 /       ;  characters not including "\"
                        %d93-126        ;  or the quote character

qcontent        =       qtext / quoted-pair

quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]

05/31/06 12:26:56 changed by mir@noris.de

  • summary changed from isValidEmail is too narrow to [patch] isValidEmail is too narrow.

The following regexp matches email addresses with a localpart of either dot-atom or quoted string, except:

- obsolete forms
- comments
- CRLF inside the quoted-string

Since comments and CRLF should be ignored semantically, this allows users to enter all email addresses except obsolete forms.

Still, it's not exponential. Handling comments would push you into the exponential area.

email_re = re.compile(
        r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
        r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-\011\013\014\016-\177])*"' # quoted-string
        r')@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)  # domain
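A short sketch of how this pattern behaves on a few inputs. Two things here are assumptions: the helper name `is_valid` is hypothetical, and the quoted-pair class is written with `\001-\011`, the range RFC 2822 gives for `text` (%d1-9).

```python
import re

email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]'   # quoted-string: qtext
    r'|\\[\001-\011\013\014\016-\177])*"'             # quoted-string: quoted-pair
    r')@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)  # domain


def is_valid(address):
    """Hypothetical helper: True if the address matches email_re."""
    return email_re.match(address) is not None


for address in [
    'first=last@example.com',     # dot-atom local part
    '"john.doe"@example.com',     # quoted-string local part
    '"a@b"@example.com',          # "@" is allowed as qtext inside quotes
    r'"a\"b"@example.com',        # escaped quote via quoted-pair
    '"a"b"@example.com',          # unescaped quote inside quotes: rejected
    '"john doe"@example.com',     # bare space is not qtext, and FWS is not handled
    'john..doe@example.com',      # consecutive dots in a dot-atom: rejected
]:
    print(f"{address:26} {is_valid(address)}")
```

Both alternatives consume the local part one character (or one backslash pair) at a time, so there is still only one way to match any given input.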

05/31/06 13:08:29 changed by adrian

  • status changed from reopened to closed.
  • resolution set to fixed.

(In [3026]) Fixed #2049 -- Made isValidEmail validator wider in scope. Thanks, mir@noris.de

08/12/07 11:12:35 changed by anonymous

Where is RegExp defined? Is it a Java internal function?

