Code

Opened 7 years ago

Closed 6 years ago

#3989 closed (duplicate)

Django seems to parse only the addr-spec production of RFC 2822

Reported by: Pierre Thierry <nowhere.man@…> Owned by: nobody
Component: Validators Version: master
Severity: Keywords:
Cc: Christophe Scherrer <chris@…> Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: UI/UX:

Description

The RFC 2822 defines some productions for the grammar of electronic mail adresses. The mailbox production enables the use of addresses like:

John Doe <john.doe@example.com>

The address within brackets is defined by the addr-spec production, which by itself constitutes a simpler but valid address format. (see the ABNF rules and their comment in 3.4)

When I used a contact form in a web site powered by Django, an address satisfying the mailbox production was not accepted, but one satisfying addr-spec was. I was told on the #django channel that it should be an issue not specific to this site, but to Django itself.

Attachments (2)

email_re.diff (1.3 KB) - added by Vinay Sajip <vinay_sajip@…> 7 years ago.
Patch to email RE to allow matching "mailbox" production
email_re.2.diff (4.4 KB) - added by Vinay Sajip <vinay_sajip@…> 7 years ago.
Updated patch with additions to newforms/fields.py and docs/model_api.txt

Download all attachments as: .zip

Change History (21)

comment:1 Changed 7 years ago by Simon G. <dev@…>

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset
  • Triage Stage changed from Unreviewed to Design decision needed

comment:2 Changed 7 years ago by mtredinnick

  • Triage Stage changed from Design decision needed to Accepted

Let's see how simple the patch looks. Should be a reasonable change.

comment:3 Changed 7 years ago by Michael Axiak <axiak@…>

I was under the impression that Django's EmailField was a simple emailfield that works for most purposes, but wasn't RFC2822. It uses a simple regular expression, which is not powerful enough to easily handle the RFC's full grammar. As a quick example, I cannot email axiak@[127.0.0.1] but I should be able to mail axiak@[18.7.21.224] in most cases.

My point is, we can make the Email validation really, really complicated. Do we want to?

comment:4 follow-up: Changed 7 years ago by mtredinnick

Michael: I don't think we want to go completely overboard. Introducing a regular expression similar to the famous one in Friedl's reg-exp book would be an example of going overboard, for example. It would be impossible to debug. The full RFC grammar is a bit of overkill for the practical public Internet.

Covering the common cases is reasonable, though. The example cited in the bug summary might be good to have because it allows cutting and pasting of email addresses and using appropriate titles which may not match the username (or other names in the form). Allowing email to IP addresses seems probably too much for a primarily web-based framework, since the Internet has had DNS for quite a while now. Still, if it can be done easily, it might not hurt to include.

I readily admit that I don't really know in advance where to draw the line on this one, since it could go in many places. It wouldn't be unreasonable to say we stick with what we've got now. That's why I kind of want to see how complex a patch is. If you end up writing a couple of dozen lines of code, it's probably worth thinking about whether that's becoming too complex.

Sorry to be a bit non-specific. Aim for something small and easy to understand.

comment:5 in reply to: ↑ 4 Changed 7 years ago by Pierre Thierry <nowhere.man@…>

Replying to mtredinnick:

[a regular expression similar to the famous one in Friedl's reg-exp book] would be impossible to debug.

What would you have to? Isn't there an existing Python module to do it?

Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

Patch to email RE to allow matching "mailbox" production

comment:6 Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

  • Has patch set
  • Version set to SVN

The attached patch matches the mailbox production, without Friedl-like thoroughness (addressed for oldforms only - however, it should just be a copy-paste job for newforms). However, I'm not sure it's the whole story - there might be code elsewhere which assumes that the value of the email field is in addr_spec format, which assumption will no longer hold once the patch is applied. I've tested in the admin interface and it seems to be OK, but does e.g. send_email() need to be changed?

comment:7 Changed 7 years ago by mtredinnick

  • Needs documentation set
  • Patch needs improvement set

A couple of comments on the patch:

  1. Whatever is needed to support newforms also needs to be included. Until we have model-aware validation, we can't completely avoid patching validators.py, so the current patch is partly in the right place.
  2. Rather than defining a whole bunch of constants that are only used once and then just hang around polluting the namespace, how about using the re.VERBOSE flag and putting the reg-exp back together into a single expression, like it is in the current code? No problems with spreading it out a little bit, but the current patch makes it look like there is more going on than there really is: it takes a bit of study to realise that there are six things going into the one reg-exp and that's the only place they're useful. You can probably keep ADDR_SPEC out as a separate string, because it's used twice. Or maybe there's another way to format things, but it takes more than a couple of seconds to understand what's going on at the moment.
  3. A documentation update in model-api.txt to give a couple of examples of "valid email field" is now required, since people aren't going to necessarily guess that display names are also allowed.

Patch seems simple enough. Tweak it a little bit and I won't stand in the way.

comment:8 Changed 7 years ago by mtredinnick

  • Component changed from django.core.mail to Validators

Changed component to reflect reality.

comment:9 Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

Malcolm, re. your point 2 above, I hear you. I feel that regexes are hard to scan even when one knows what's going on - all those punctuation characters squashed up against each other. So I structured the change in the way that I did because I found it more readable, particularly when one is cross-referencing against the RFC. How about if I just add a

del DOT_ATOM, QUOTED_STRING, DOMAIN, ADDR_SPEC, DISPLAY_NAME

statement after the

email_re = ...

line, to clean up the namespace? I'll add a comment above the section, too, to explain what follows and so that it doesn't seem too complicated.

comment:10 Changed 7 years ago by SmileyChris

Vinay, what's the problem with just using re.VERBOSE?

comment:11 Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

Nothing especially - though I find it easier and quicker to test regexes with the approach I used, since I can try different combinations of grouping and alternatives without accidentally stepping on the components of the regex.

We don't even need re.VERBOSE, since we can use the feature that a sequence of string literals is concatenated by the compiler into a single string. Assuming we can keep ADDR_SPEC because of DRY, then the re becomes:

ADDR_SPEC = (
    "((?:"
        r"[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(?:\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom
    ")|(?:"
        r'"(?:[\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string
    "))@("
    r'(?:[A-Z0-9-]+\.)+[A-Z]{2,6}'   #domain
    ")"
)

email_re = re.compile(
    "^(?:" + ADDR_SPEC + ")|(?:\w[\w ]*)<" + ADDR_SPEC + ">$", re.IGNORECASE)

If it's generally felt that this is equally readable, then I see no problem going with it.

comment:12 Changed 7 years ago by mtredinnick

Pleaes use re.VERBOSE rather than the string concatenation, because (a) it is much easier to edit triple-quoted multi-line strings rather than line-length string fragments that cannot contain newlines and (b) automatic string concatenation like that will eventually be going away from Python, so it saves us a few seconds of porting trouble at a later time if we have an equally useful alternative now.

comment:13 Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

OK, here it is:

ADDR_SPEC = """
((?:
    [-!#$%&'*+/=?^_`{}|~0-9A-Z]+(?:\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)* # dot-atom
)|(?:
    "(?:[\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*" # quoted-string
))@(
    (?:[A-Z0-9-]+\.)+[A-Z]{2,6} #domain
)
"""

email_re = re.compile(
    "^(?:" + ADDR_SPEC + ")|(?:\w[\w ]*)<" + ADDR_SPEC + ">$", re.VERBOSE | re.IGNORECASE)

I'll look at the newforms stuff and model-api.txt and work up a patch.

Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

Updated patch with additions to newforms/fields.py and docs/model_api.txt

comment:14 Changed 7 years ago by Simon G. <dev@…>

  • Needs documentation unset
  • Patch needs improvement unset
  • Triage Stage changed from Accepted to Ready for checkin

comment:15 Changed 7 years ago by gwilson

  • Needs tests set
  • Triage Stage changed from Ready for checkin to Design decision needed

I think this should become a new email field and not built in to the existing field. Usually, in the models where I have an email address, I also have the person's first name and last name too and want the email address to be simply the addr-spec. If the name were included in the email address field, then I would be storing duplicate information and would have to strip out the address part when displaying. But I can certainly see uses for the full mailbox spec too. Different fields for different needs is the way I see this.

comment:16 follow-up: Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

Although of course it's conventional to use firstname/lastname as the display name, it's not mandated - for example, a nickname could be used as the display name, or a suffix such as " (Home)" or " (Work)" might be appended to the user's name to form the display name. So the display name is logically a different value from the user's name, even if it has the same value for most users.

For this reason, I don't think having two fields makes sense, though it may be sensible to increase the size of the email field (there's a separate discussion about this on the mailing lists, not related to display name support).

comment:17 in reply to: ↑ 16 Changed 7 years ago by gwilson

Yes, I understand that you can use anything for the display name. My point was simply that one might not always want to allow display names, as is the case with every use of the EmailField in code I've written. I do think that allowing for display names would be a welcome feature, I just think that the developer should be able to specify which format is allowable since each has its valid use cases. I also think that separate fields (i.e. EmailField and EmailWithDisplayNameField) would be better than adding an option to the existing EmailField (i.e. email = EmailField(allow_display_name=True)) or changing the current behavior of EmailField without allowing for the address-only form of entry.

comment:18 Changed 7 years ago by Vinay Sajip <vinay_sajip@…>

The DRY principle, if I'm not misapplying it, seems to lead to the conclusion that

email = EmailField(allow_display_name=True)

would be the best solution - having the default as False allows the behaviour to remain as it is currently, but which can be modified easily by setting allow_display_name=True where the use case warrants it. With the two-field solution, any future change to the EmailField needs to potentially be duplicated in EmailWithoutDisplayNameField. And there's not enough difference in behaviour between the two, IMO, to warrant two classes.

comment:19 Changed 6 years ago by jacob

  • Resolution set to duplicate
  • Status changed from new to closed

I've created #6092 to track the bigger issue here.

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
as The resolution will be set. Next status will be 'closed'
The resolution will be deleted. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.