Opened 18 years ago
Closed 17 years ago
#3989 closed (duplicate)
Django seems to parse only the addr-spec production of RFC 2822
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Validators | Version: | dev |
Severity: | Keywords: | ||
Cc: | Christophe Scherrer <chris@…> | Triage Stage: | Design decision needed |
Has patch: | yes | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The RFC 2822 defines some productions for the grammar of electronic mail adresses. The mailbox
production enables the use of addresses like:
John Doe <john.doe@example.com>
The address within brackets is defined by the addr-spec
production, which by itself constitutes a simpler but valid address format. (see the ABNF rules and their comment in 3.4)
When I used a contact form in a web site powered by Django, an address satisfying the mailbox
production was not accepted, but one satisfying addr-spec
was. I was told on the #django
channel that it should be an issue not specific to this site, but to Django itself.
Attachments (2)
Change History (21)
comment:1 by , 18 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:2 by , 18 years ago
Triage Stage: | Design decision needed → Accepted |
---|
comment:3 by , 18 years ago
I was under the impression that Django's EmailField was a simple emailfield that works for most purposes, but wasn't RFC2822. It uses a simple regular expression, which is not powerful enough to easily handle the RFC's full grammar. As a quick example, I cannot email axiak@[127.0.0.1] but I should be able to mail axiak@[18.7.21.224] in most cases.
My point is, we can make the Email validation really, really complicated. Do we want to?
follow-up: 5 comment:4 by , 18 years ago
Michael: I don't think we want to go completely overboard. Introducing a regular expression similar to the famous one in Friedl's reg-exp book would be an example of going overboard, for example. It would be impossible to debug. The full RFC grammar is a bit of overkill for the practical public Internet.
Covering the common cases is reasonable, though. The example cited in the bug summary might be good to have because it allows cutting and pasting of email addresses and using appropriate titles which may not match the username (or other names in the form). Allowing email to IP addresses seems probably too much for a primarily web-based framework, since the Internet has had DNS for quite a while now. Still, if it can be done easily, it might not hurt to include.
I readily admit that I don't really know in advance where to draw the line on this one, since it could go in many places. It wouldn't be unreasonable to say we stick with what we've got now. That's why I kind of want to see how complex a patch is. If you end up writing a couple of dozen lines of code, it's probably worth thinking about whether that's becoming too complex.
Sorry to be a bit non-specific. Aim for something small and easy to understand.
comment:5 by , 18 years ago
Replying to mtredinnick:
[a regular expression similar to the famous one in Friedl's reg-exp book] would be impossible to debug.
What would you have to? Isn't there an existing Python module to do it?
by , 18 years ago
Attachment: | email_re.diff added |
---|
Patch to email RE to allow matching "mailbox" production
comment:6 by , 18 years ago
Has patch: | set |
---|---|
Version: | → SVN |
The attached patch matches the mailbox production, without Friedl-like thoroughness (addressed for oldforms only - however, it should just be a copy-paste job for newforms). However, I'm not sure it's the whole story - there might be code elsewhere which assumes that the value of the email
field is in addr_spec format, which assumption will no longer hold once the patch is applied. I've tested in the admin interface and it seems to be OK, but does e.g. send_email()
need to be changed?
comment:7 by , 18 years ago
Needs documentation: | set |
---|---|
Patch needs improvement: | set |
A couple of comments on the patch:
- Whatever is needed to support newforms also needs to be included. Until we have model-aware validation, we can't completely avoid patching validators.py, so the current patch is partly in the right place.
- Rather than defining a whole bunch of constants that are only used once and then just hang around polluting the namespace, how about using the re.VERBOSE flag and putting the reg-exp back together into a single expression, like it is in the current code? No problems with spreading it out a little bit, but the current patch makes it look like there is more going on than there really is: it takes a bit of study to realise that there are six things going into the one reg-exp and that's the only place they're useful. You can probably keep ADDR_SPEC out as a separate string, because it's used twice. Or maybe there's another way to format things, but it takes more than a couple of seconds to understand what's going on at the moment.
- A documentation update in model-api.txt to give a couple of examples of "valid email field" is now required, since people aren't going to necessarily guess that display names are also allowed.
Patch seems simple enough. Tweak it a little bit and I won't stand in the way.
comment:8 by , 18 years ago
Component: | django.core.mail → Validators |
---|
Changed component to reflect reality.
comment:9 by , 18 years ago
Malcolm, re. your point 2 above, I hear you. I feel that regexes are hard to scan even when one knows what's going on - all those punctuation characters squashed up against each other. So I structured the change in the way that I did because I found it more readable, particularly when one is cross-referencing against the RFC. How about if I just add a
del DOT_ATOM, QUOTED_STRING, DOMAIN, ADDR_SPEC, DISPLAY_NAME
statement after the
email_re = ...
line, to clean up the namespace? I'll add a comment above the section, too, to explain what follows and so that it doesn't seem too complicated.
comment:11 by , 18 years ago
Nothing especially - though I find it easier and quicker to test regexes with the approach I used, since I can try different combinations of grouping and alternatives without accidentally stepping on the components of the regex.
We don't even need re.VERBOSE, since we can use the feature that a sequence of string literals is concatenated by the compiler into a single string. Assuming we can keep ADDR_SPEC because of DRY, then the re becomes:
ADDR_SPEC = ( "((?:" r"[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(?:\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom ")|(?:" r'"(?:[\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' # quoted-string "))@(" r'(?:[A-Z0-9-]+\.)+[A-Z]{2,6}' #domain ")" ) email_re = re.compile( "^(?:" + ADDR_SPEC + ")|(?:\w[\w ]*)<" + ADDR_SPEC + ">$", re.IGNORECASE)
If it's generally felt that this is equally readable, then I see no problem going with it.
comment:12 by , 18 years ago
Pleaes use re.VERBOSE rather than the string concatenation, because (a) it is much easier to edit triple-quoted multi-line strings rather than line-length string fragments that cannot contain newlines and (b) automatic string concatenation like that will eventually be going away from Python, so it saves us a few seconds of porting trouble at a later time if we have an equally useful alternative now.
comment:13 by , 18 years ago
OK, here it is:
ADDR_SPEC = """ ((?: [-!#$%&'*+/=?^_`{}|~0-9A-Z]+(?:\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)* # dot-atom )|(?: "(?:[\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*" # quoted-string ))@( (?:[A-Z0-9-]+\.)+[A-Z]{2,6} #domain ) """ email_re = re.compile( "^(?:" + ADDR_SPEC + ")|(?:\w[\w ]*)<" + ADDR_SPEC + ">$", re.VERBOSE | re.IGNORECASE)
I'll look at the newforms stuff and model-api.txt and work up a patch.
by , 18 years ago
Attachment: | email_re.2.diff added |
---|
Updated patch with additions to newforms/fields.py and docs/model_api.txt
comment:14 by , 17 years ago
Needs documentation: | unset |
---|---|
Patch needs improvement: | unset |
Triage Stage: | Accepted → Ready for checkin |
comment:15 by , 17 years ago
Needs tests: | set |
---|---|
Triage Stage: | Ready for checkin → Design decision needed |
I think this should become a new email field and not built in to the existing field. Usually, in the models where I have an email address, I also have the person's first name and last name too and want the email address to be simply the addr-spec. If the name were included in the email address field, then I would be storing duplicate information and would have to strip out the address part when displaying. But I can certainly see uses for the full mailbox spec too. Different fields for different needs is the way I see this.
follow-up: 17 comment:16 by , 17 years ago
Although of course it's conventional to use firstname/lastname as the display name, it's not mandated - for example, a nickname could be used as the display name, or a suffix such as " (Home)" or " (Work)" might be appended to the user's name to form the display name. So the display name is logically a different value from the user's name, even if it has the same value for most users.
For this reason, I don't think having two fields makes sense, though it may be sensible to increase the size of the email field (there's a separate discussion about this on the mailing lists, not related to display name support).
comment:17 by , 17 years ago
Yes, I understand that you can use anything for the display name. My point was simply that one might not always want to allow display names, as is the case with every use of the EmailField
in code I've written. I do think that allowing for display names would be a welcome feature, I just think that the developer should be able to specify which format is allowable since each has its valid use cases. I also think that separate fields (i.e. EmailField
and EmailWithDisplayNameField
) would be better than adding an option to the existing EmailField
(i.e. email = EmailField(allow_display_name=True)
) or changing the current behavior of EmailField
without allowing for the address-only form of entry.
comment:18 by , 17 years ago
The DRY principle, if I'm not misapplying it, seems to lead to the conclusion that
email = EmailField(allow_display_name=True)
would be the best solution - having the default as False
allows the behaviour to remain as it is currently, but which can be modified easily by setting allow_display_name=True
where the use case warrants it. With the two-field solution, any future change to the EmailField
needs to potentially be duplicated in EmailWithoutDisplayNameField
. And there's not enough difference in behaviour between the two, IMO, to warrant two classes.
comment:19 by , 17 years ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
I've created #6092 to track the bigger issue here.
Let's see how simple the patch looks. Should be a reasonable change.