
Opened 2 years ago

Closed 2 years ago

#17386 closed Uncategorized (wontfix)

Validation & Unicode Character 'ZERO WIDTH SPACE' (U+200B)

Reported by: pennersr Owned by: nobody
Component: Forms Version: 1.3
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Once in a while users somehow manage to inject e-mail addresses containing Unicode zero width space characters into the system. I am not sure how they do it -- it probably happens when copy/pasting from a document of some sort. Nevertheless, form validation does not reject such e-mail addresses:

>>> from django.core.validators import validate_email
>>> email=u'test@hotmail.co\u200bm'
>>> validate_email(email)
>>> # No ValidationError ?

These e-mail addresses get accepted and cause trouble later on (try sending mail to them, or hashing them for gravatar uses). Either:
a) Raise a ValidationError for such e-mail addresses, or
b) Automatically strip this character

Downside of a) is that the user is most likely unaware of this invisible character. He wouldn't know what character to remove where, even if instructed by an error message.
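To make option a) more usable, an error message could at least point at the position of the invisible character. The following is a minimal sketch in Python 3 syntax (the ticket itself uses Python 2); `find_invisible_chars` is a hypothetical helper, not part of Django, and it assumes that flagging every character in Unicode category Cf (format characters, which includes U+200B) is acceptable for e-mail input.

```python
import unicodedata


def find_invisible_chars(value):
    """Return (index, character name) pairs for every format character (category Cf)."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(value)
        if unicodedata.category(char) == "Cf"
    ]


# The address from the report above, with U+200B hidden before the final "m".
email = "test@hotmail.co\u200bm"
hits = find_invisible_chars(email)
print(hits)
```

A validator built on this could report something like "invisible character at position 15", which gives the user a place to retype even though the character itself cannot be seen.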

Attachments (0)

Change History (5)

comment:1 Changed 2 years ago by pennersr

  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

For what it is worth, I've only encountered hotmail e-mail addresses suffering from this problem:

confidential@hot\u200bmail.com
confidential@hotmail.c\u200bom

comment:2 Changed 2 years ago by aaugustin

I suppose this character is inserted as an anti-spam mechanism, precisely to defeat copy-paste.

Django won't alter user input silently — it's a bad practice that can backfire in interesting ways. And I'm not in favor of defeating a purposeful (although debatable) anti-spam mechanism.

Are non-ASCII characters acceptable in email addresses? If not Django should raise a ValidationError when an email address contains one, which would resolve this problem.

comment:3 Changed 2 years ago by aaugustin

Per RFC 3696, email addresses can use non-ASCII characters:

Any characters, or combination of bits (as octets), are permitted in DNS names.

Names will be encoded with IDNA when an ASCII representation is required.

The EmailValidator takes this into account:

class EmailValidator(RegexValidator):

    def __call__(self, value):
        try:
            super(EmailValidator, self).__call__(value)
        except ValidationError, e:
            # Trivial case failed. Try for possible IDN domain-part
            if value and u'@' in value:
                parts = value.split(u'@')
                try:
                    parts[-1] = parts[-1].encode('idna')
                except UnicodeError:
                    raise e
                super(EmailValidator, self).__call__(u'@'.join(parts))
            else:
                raise

However, \u200b encodes to nothing with IDNA:

>>> u'-\u200b-'.encode('idna') == '--'
True
>>> len(u'-\u200b-'.encode('idna'))
2

I spent some time fighting with various online encoders and couldn't confirm or refute whether this is a valid result.

Anyway, that's the reason why the email address is valid, after IDNA encoding of the domain part.
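The behaviour above can be cross-checked against the standard library. Python's built-in "idna" codec implements IDNA 2003, which runs each non-ASCII label through nameprep, and nameprep maps the characters in RFC 3454 table B.1 ("commonly mapped to nothing") to the empty string. The sketch below uses Python 3 syntax, so the encoded result is a bytes object rather than a Python 2 str; the variable names are only illustrative.

```python
import stringprep

ZWSP = "\u200b"

# RFC 3454 table B.1 lists characters that nameprep maps to nothing.
is_mapped_to_nothing = stringprep.in_table_b1(ZWSP)

# Encoding the domain part drops the zero width space, leaving a plain
# ASCII domain that the e-mail regex then accepts.
domain = "hotmail.co" + ZWSP + "m"
encoded = domain.encode("idna")

print(is_mapped_to_nothing, encoded)
```

This suggests the empty mapping is what the codec is designed to produce for U+200B, rather than an accident of the implementation.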

comment:4 Changed 2 years ago by pennersr

Django won't alter user input silently — it's a bad practice that can backfire in interesting ways.
And I'm not in favor of defeating a purposeful (although debatable) anti-spam mechanism.

In this case it is debatable whether that character is in fact user input, as the user entering the e-mail address is totally unaware of that character being sent to the server.

That character is apparently meant to trick robots in such a way that they won't recognize the e-mail address. However, a user cannot tell the difference between two e-mail addresses, one with and one without the character.

Therefore:

  • It would indeed be wrong to raise a ValidationError, as the user wouldn't know what to do -- he literally does not see the problem.
  • It would be wrong to accept the value as is, as two "equal" e-mail addresses do not pass the equality test (==, iexact), causing all sorts of trouble in any Django app comparing e-mail addresses.

As for altering input silently: multiple representations of the same date are all mapped to a single representation under the hood, so why don't we do the same for multiple representations of the same e-mail address?
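The comparison problem is easy to reproduce outside Django. The sketch below (Python 3 syntax) uses one of the addresses from comment:1; `gravatar_hash` is a hypothetical helper that assumes the classic Gravatar scheme of an MD5 digest over the trimmed, lowercased address.

```python
import hashlib


def gravatar_hash(email):
    """Hypothetical helper: MD5 hex digest of the trimmed, lowercased address."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


clean = "confidential@hotmail.com"
dirty = "confidential@hot\u200bmail.com"  # renders identically to `clean`

same_before = clean == dirty                                  # plain equality
hash_differs = gravatar_hash(clean) != gravatar_hash(dirty)   # derived values diverge too
same_after = clean == dirty.replace("\u200b", "")             # equal once U+200B is removed

print(same_before, hash_differs, same_after)
```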

comment:5 Changed 2 years ago by aaugustin

  • Resolution set to wontfix
  • Status changed from new to closed

Upon further thought, I don't believe this qualifies as a bug in Django. I don't see enough reasons to justify special casing \u200b, and I don't think Django can do something that will fit everyone.

In order to resolve this problem in your project, you can:

  • add a clean_email method to your form that returns self.cleaned_data['email'].replace(u'\u200b', u'') (the u prefix matters on Python 2, where '\u200b' in a plain string is not an escape sequence)
  • run a batch cleanup of your data: for obj in MyModel.objects.all(): obj.email = obj.email.replace(u'\u200b', u''); obj.save()
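The two suggestions above can be sketched without a Django project. In the Python 3 example below, `Record` is a hypothetical stand-in for a model instance with an `email` field and a `save()` method, and `strip_zwsp` and `cleanup` are illustrative names, not Django APIs. Unlike the one-liner, the loop saves only the rows that actually change, which is an assumption about what a real cleanup would want.

```python
ZWSP = "\u200b"


def strip_zwsp(value):
    """Remove every zero width space; a form's clean_email could return this."""
    return value.replace(ZWSP, "")


class Record:
    """Hypothetical stand-in for a Django model instance."""

    def __init__(self, email):
        self.email = email
        self.saved = False

    def save(self):
        self.saved = True


def cleanup(records):
    """Strip U+200B from each record and save only those that changed."""
    changed = 0
    for record in records:
        cleaned = strip_zwsp(record.email)
        if cleaned != record.email:
            record.email = cleaned
            record.save()
            changed += 1
    return changed


records = [Record("a@hotmail.com"), Record("b@hot\u200bmail.com")]
changed = cleanup(records)
print(changed, [r.email for r in records])
```

In a real project the loop would iterate over something like MyModel.objects.all() as in the comment above, and the form method would call the same helper so that new input and stored data are cleaned identically.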
