Opened 17 years ago

Closed 16 years ago

#3989 closed (duplicate)

Django seems to parse only the addr-spec production of RFC 2822

Reported by: Pierre Thierry <nowhere.man@…> Owned by: nobody
Component: Validators Version: dev
Severity: Keywords:
Cc: Christophe Scherrer <chris@…> Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The RFC 2822 defines some productions for the grammar of electronic mail adresses. The mailbox production enables the use of addresses like:

John Doe <john.doe@example.com>

The address within brackets is defined by the addr-spec production, which by itself constitutes a simpler but valid address format. (see the ABNF rules and their comment in 3.4)

When I used a contact form in a web site powered by Django, an address satisfying the mailbox production was not accepted, but one satisfying addr-spec was. I was told on the #django channel that it should be an issue not specific to this site, but to Django itself.

Attachments (2)

email_re.diff (1.3 KB ) - added by Vinay Sajip <vinay_sajip@…> 17 years ago.
Patch to email RE to allow matching "mailbox" production
email_re.2.diff (4.4 KB ) - added by Vinay Sajip <vinay_sajip@…> 17 years ago.
Updated patch with additions to newforms/fields.py and docs/model_api.txt

Download all attachments as: .zip

Change History (21)

comment:1 by Simon G. <dev@…>, 17 years ago

Triage Stage: UnreviewedDesign decision needed

comment:2 by Malcolm Tredinnick, 17 years ago

Triage Stage: Design decision neededAccepted

Let's see how simple the patch looks. Should be a reasonable change.

comment:3 by Michael Axiak <axiak@…>, 17 years ago

I was under the impression that Django's EmailField was a simple emailfield that works for most purposes, but wasn't RFC2822. It uses a simple regular expression, which is not powerful enough to easily handle the RFC's full grammar. As a quick example, I cannot email axiak@[127.0.0.1] but I should be able to mail axiak@[18.7.21.224] in most cases.

My point is, we can make the Email validation really, really complicated. Do we want to?

comment:4 by Malcolm Tredinnick, 17 years ago

Michael: I don't think we want to go completely overboard. Introducing a regular expression similar to the famous one in Friedl's reg-exp book would be an example of going overboard, for example. It would be impossible to debug. The full RFC grammar is a bit of overkill for the practical public Internet.

Covering the common cases is reasonable, though. The example cited in the bug summary might be good to have because it allows cutting and pasting of email addresses and using appropriate titles which may not match the username (or other names in the form). Allowing email to IP addresses seems probably too much for a primarily web-based framework, since the Internet has had DNS for quite a while now. Still, if it can be done easily, it might not hurt to include.

I readily admit that I don't really know in advance where to draw the line on this one, since it could go in many places. It wouldn't be unreasonable to say we stick with what we've got now. That's why I kind of want to see how complex a patch is. If you end up writing a couple of dozen lines of code, it's probably worth thinking about whether that's becoming too complex.

Sorry to be a bit non-specific. Aim for something small and easy to understand.

in reply to:  4 comment:5 by Pierre Thierry <nowhere.man@…>, 17 years ago

Replying to mtredinnick:

[a regular expression similar to the famous one in Friedl's reg-exp book] would be impossible to debug.

What would you have to? Isn't there an existing Python module to do it?

by Vinay Sajip <vinay_sajip@…>, 17 years ago

Attachment: email_re.diff added

Patch to email RE to allow matching "mailbox" production

comment:6 by Vinay Sajip <vinay_sajip@…>, 17 years ago

Has patch: set
Version: SVN

The attached patch matches the mailbox production, without Friedl-like thoroughness (addressed for oldforms only - however, it should just be a copy-paste job for newforms). However, I'm not sure it's the whole story - there might be code elsewhere which assumes that the value of the email field is in addr_spec format, which assumption will no longer hold once the patch is applied. I've tested in the admin interface and it seems to be OK, but does e.g. send_email() need to be changed?

comment:7 by Malcolm Tredinnick, 17 years ago

Needs documentation: set
Patch needs improvement: set

A couple of comments on the patch:

  1. Whatever is needed to support newforms also needs to be included. Until we have model-aware validation, we can't completely avoid patching validators.py, so the current patch is partly in the right place.
  2. Rather than defining a whole bunch of constants that are only used once and then just hang around polluting the namespace, how about using the re.VERBOSE flag and putting the reg-exp back together into a single expression, like it is in the current code? No problems with spreading it out a little bit, but the current patch makes it look like there is more going on than there really is: it takes a bit of study to realise that there are six things going into the one reg-exp and that's the only place they're useful. You can probably keep ADDR_SPEC out as a separate string, because it's used twice. Or maybe there's another way to format things, but it takes more than a couple of seconds to understand what's going on at the moment.
  3. A documentation update in model-api.txt to give a couple of examples of "valid email field" is now required, since people aren't going to necessarily guess that display names are also allowed.

Patch seems simple enough. Tweak it a little bit and I won't stand in the way.

comment:8 by Malcolm Tredinnick, 17 years ago

Component: django.core.mailValidators

Changed component to reflect reality.

comment:9 by Vinay Sajip <vinay_sajip@…>, 17 years ago

Malcolm, re. your point 2 above, I hear you. I feel that regexes are hard to scan even when one knows what's going on - all those punctuation characters squashed up against each other. So I structured the change in the way that I did because I found it more readable, particularly when one is cross-referencing against the RFC. How about if I just add a

del DOT_ATOM, QUOTED_STRING, DOMAIN, ADDR_SPEC, DISPLAY_NAME

statement after the

email_re = ...

line, to clean up the namespace? I'll add a comment above the section, too, to explain what follows and so that it doesn't seem too complicated.

comment:10 by Chris Beaven, 17 years ago

Vinay, what's the problem with just using re.VERBOSE?

comment:11 by Vinay Sajip <vinay_sajip@…>, 17 years ago

Nothing especially - though I find it easier and quicker to test regexes with the approach I used, since I can try different combinations of grouping and alternatives without accidentally stepping on the components of the regex.

We don't even need re.VERBOSE, since we can use the feature that a sequence of string literals is concatenated by the compiler into a single string. Assuming we can keep ADDR_SPEC because of DRY, then the re becomes:

ADDR_SPEC = (
    "((?:"
        r"[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(?:\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom
    ")|(?:"
        r'"(?:[\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string
    "))@("
    r'(?:[A-Z0-9-]+\.)+[A-Z]{2,6}'   #domain
    ")"
)

email_re = re.compile(
    "^(?:" + ADDR_SPEC + ")|(?:\w[\w ]*)<" + ADDR_SPEC + ">$", re.IGNORECASE)

If it's generally felt that this is equally readable, then I see no problem going with it.

comment:12 by Malcolm Tredinnick, 17 years ago

Pleaes use re.VERBOSE rather than the string concatenation, because (a) it is much easier to edit triple-quoted multi-line strings rather than line-length string fragments that cannot contain newlines and (b) automatic string concatenation like that will eventually be going away from Python, so it saves us a few seconds of porting trouble at a later time if we have an equally useful alternative now.

comment:13 by Vinay Sajip <vinay_sajip@…>, 17 years ago

OK, here it is:

ADDR_SPEC = """
((?:
    [-!#$%&'*+/=?^_`{}|~0-9A-Z]+(?:\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)* # dot-atom
)|(?:
    "(?:[\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*" # quoted-string
))@(
    (?:[A-Z0-9-]+\.)+[A-Z]{2,6} #domain
)
"""

email_re = re.compile(
    "^(?:" + ADDR_SPEC + ")|(?:\w[\w ]*)<" + ADDR_SPEC + ">$", re.VERBOSE | re.IGNORECASE)

I'll look at the newforms stuff and model-api.txt and work up a patch.

by Vinay Sajip <vinay_sajip@…>, 17 years ago

Attachment: email_re.2.diff added

Updated patch with additions to newforms/fields.py and docs/model_api.txt

comment:14 by Simon G. <dev@…>, 17 years ago

Needs documentation: unset
Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:15 by Gary Wilson, 17 years ago

Needs tests: set
Triage Stage: Ready for checkinDesign decision needed

I think this should become a new email field and not built in to the existing field. Usually, in the models where I have an email address, I also have the person's first name and last name too and want the email address to be simply the addr-spec. If the name were included in the email address field, then I would be storing duplicate information and would have to strip out the address part when displaying. But I can certainly see uses for the full mailbox spec too. Different fields for different needs is the way I see this.

comment:16 by Vinay Sajip <vinay_sajip@…>, 17 years ago

Although of course it's conventional to use firstname/lastname as the display name, it's not mandated - for example, a nickname could be used as the display name, or a suffix such as " (Home)" or " (Work)" might be appended to the user's name to form the display name. So the display name is logically a different value from the user's name, even if it has the same value for most users.

For this reason, I don't think having two fields makes sense, though it may be sensible to increase the size of the email field (there's a separate discussion about this on the mailing lists, not related to display name support).

in reply to:  16 comment:17 by Gary Wilson, 17 years ago

Yes, I understand that you can use anything for the display name. My point was simply that one might not always want to allow display names, as is the case with every use of the EmailField in code I've written. I do think that allowing for display names would be a welcome feature, I just think that the developer should be able to specify which format is allowable since each has its valid use cases. I also think that separate fields (i.e. EmailField and EmailWithDisplayNameField) would be better than adding an option to the existing EmailField (i.e. email = EmailField(allow_display_name=True)) or changing the current behavior of EmailField without allowing for the address-only form of entry.

comment:18 by Vinay Sajip <vinay_sajip@…>, 17 years ago

The DRY principle, if I'm not misapplying it, seems to lead to the conclusion that

email = EmailField(allow_display_name=True)

would be the best solution - having the default as False allows the behaviour to remain as it is currently, but which can be modified easily by setting allow_display_name=True where the use case warrants it. With the two-field solution, any future change to the EmailField needs to potentially be duplicated in EmailWithoutDisplayNameField. And there's not enough difference in behaviour between the two, IMO, to warrant two classes.

comment:19 by Jacob, 16 years ago

Resolution: duplicate
Status: newclosed

I've created #6092 to track the bigger issue here.

Note: See TracTickets for help on using tickets.
Back to Top