Opened 15 months ago

Last modified 3 days ago

#22561 assigned Bug

EmailMessage should respect RFC2822 on max line length

Reported by: notsqrt Owned by: levkowetz
Component: Core (Mail) Version: 1.6
Severity: Normal Keywords:
Cc: petr.hroudny@…, bugs@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Follow-up of thread Email encoding (DKIM, long lines, etc..) on django-users

RFC

The RFC2822 states that:
"Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."

This requirement was carried over unchanged into the 2008 revision, RFC5322.
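As an illustration of the two limits, here is a minimal sketch of a checker. The function name `check_line_lengths` and the constants are hypothetical, not part of Django or the stdlib. It counts characters of a `str`; for 8-bit content one would presumably count encoded bytes instead.

```python
RFC_HARD_LIMIT = 998  # MUST not be exceeded, excluding CRLF
RFC_SOFT_LIMIT = 78   # SHOULD not be exceeded, excluding CRLF


def check_line_lengths(raw_message):
    """Return (over_hard, over_soft): 1-based numbers of lines exceeding each limit."""
    over_hard, over_soft = [], []
    for number, line in enumerate(raw_message.split("\r\n"), start=1):
        if len(line) > RFC_HARD_LIMIT:
            over_hard.append(number)
        elif len(line) > RFC_SOFT_LIMIT:
            over_soft.append(number)
    return over_hard, over_soft


# Line 3 has 80 characters (over the SHOULD limit only),
# line 4 has 1200 characters (over the MUST limit).
sample = "Subject: hi\r\n\r\n" + "a" * 80 + "\r\n" + "b" * 1200 + "\r\n"
print(check_line_lengths(sample))
```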

History

For utf-8 encoded emails, Python uses:

  • shortest of "quoted-printable" and "base64" for the email subject
  • "base64" for the body
# stdlib, identical in python2.7 and python3.3 : email/charset.py
CHARSETS = {
   'utf-8':       (SHORTEST,  BASE64, 'utf-8'),
}
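
These defaults can be inspected from a Python 3 interpreter. A small sketch, using only the public `email.charset` API:

```python
from email.charset import BASE64, SHORTEST, Charset

utf8 = Charset("utf-8")

# Headers use whichever of quoted-printable and base64 is shorter.
print(utf8.header_encoding == SHORTEST)
# Bodies default to base64.
print(utf8.body_encoding == BASE64)
print(utf8.get_body_encoding())
```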

The historical reason seems to be that support for 8-bit characters in emails was not widely adopted, hence the need to encode them into ASCII.

Back in 2007, in ticket 3472 (changeset 5143), it was decided to always use "quoted-printable", because using base64 seems to negatively affect spam scores.

# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
Charset.add_charset('utf-8', Charset.SHORTEST, Charset.QP, 'utf-8')

In 2011, in ticket 11212 (changeset 16178, Django 1.4), it was decided to remove "quoted-printable" and let Python switch automatically between 7-bit and 8-bit encodings, on the grounds that 8-bit emails were widely supported and that MTAs were in charge of downgrading to 7-bit if necessary.

Charset.add_charset('utf-8', Charset.SHORTEST, None, 'utf-8')

The (unintended?) side effect of using base64 or "quoted-printable" was in fact a guarantee of short lines in emails (for instance, RFC2045, which defines quoted-printable, limits encoded lines to 76 characters).
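
To illustrate this side effect, the sketch below encodes a single 3000-byte line with the stdlib `base64` and `quopri` modules and measures the longest resulting line. The helper `longest_line` is a hypothetical name introduced for the example; this is not how Django or `email` invoke the encoders internally.

```python
import base64
import quopri

# One line of 1500 two-byte characters: 3000 bytes, no line break.
long_line = ("é" * 1500).encode("utf-8")


def longest_line(data):
    """Length in bytes of the longest line in `data`."""
    return max(len(line) for line in data.splitlines())


b64 = base64.encodebytes(long_line)
qp = quopri.encodestring(long_line)

print(longest_line(long_line))  # the raw line is far over the 998 limit
print(longest_line(b64))        # base64 output is wrapped
print(longest_line(qp))         # quoted-printable output uses soft line breaks

# Both encodings are reversible, so the original long line is preserved.
print(base64.decodebytes(b64) == long_line)
print(quopri.decodestring(qp) == long_line)
```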

Summary of the reasons cited for these choices

  • base64 is too big (bandwidth)
  • base64 is not supported by all clients
  • base64 has a negative effect on spam scores (cf SpamAssassin's rule on unnecessarily using base64 encoding to disguise text, but this rule also states that "This does not apply to text in the UTF-8 or big5 character sets.")
  • quoted-printable is no longer necessary, since MTAs and email clients have adopted 8bit support

Current state

Django

There was an additional ticket, 12422, but it is not relevant to this one.

The current code in django/core/mail/message.py looks like:

# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
utf8_charset = Charset.Charset('utf-8')
utf8_charset.body_encoding = None  # Python defaults to BASE64
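
The effect of that setting on a long line can be reproduced with the stdlib alone. This is a sketch that uses `email.mime.text.MIMEText` directly rather than Django's own message classes, so it only approximates what Django sends; `longest_line` is a hypothetical helper.

```python
from email import charset as Charset
from email.mime.text import MIMEText

utf8_charset = Charset.Charset("utf-8")
utf8_charset.body_encoding = None  # same override as in the snippet above

body = "x" * 2000  # a single line, well over the 998-character limit


def longest_line(message):
    """Length of the longest line in the serialized message."""
    return max(len(line) for line in message.as_string().splitlines())


unencoded = MIMEText(body, "plain", utf8_charset)  # body_encoding = None
default = MIMEText(body, "plain", "utf-8")         # Python's default, base64

# With no body encoding the long line is emitted as-is.
print(unencoded["Content-Transfer-Encoding"], longest_line(unencoded))
# With the default body encoding the payload is wrapped.
print(default["Content-Transfer-Encoding"], longest_line(default))
```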

Clients

Email clients like Gmail seem to wrap lines at 80 characters for text/plain, and switch to "Content-Transfer-Encoding: quoted-printable" for text/html and text/plain if there are non-ascii characters.

Importance

Mail Transfer Agents like Postfix often split lines that do not respect the RFC by inserting "\r\n " at the 998th position of the line.

The sender computes the DKIM signature over the unmodified body, but receivers validate it against the modified body, so the check fails.
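
A minimal sketch of why the check fails: any digest of the body changes once a line break is inserted. The function `split_long_lines` is a hypothetical stand-in for the MTA behaviour described above, and `body_hash` applies no DKIM canonicalization, so this is an illustration and not a DKIM implementation.

```python
import base64
import hashlib


def body_hash(body):
    """Base64-encoded SHA-256 digest of the raw body bytes."""
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")


def split_long_lines(body, limit=998):
    # Hypothetical stand-in for an MTA: break each over-long line into
    # chunks of `limit` bytes, joined by CRLF followed by a space.
    out = []
    for line in body.split(b"\r\n"):
        chunks = [line[i:i + limit] for i in range(0, len(line), limit)] or [b""]
        out.append(b"\r\n ".join(chunks))
    return b"\r\n".join(out)


signed = b"<p>" + b"a" * 1500 + b"</p>\r\n"  # what the sender hashed
delivered = split_long_lines(signed)         # what the receiver sees

print(len(delivered) - len(signed))          # three extra bytes: CR, LF, space
print(body_hash(signed) == body_hash(delivered))
```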

Apart from my own django projects, I have seen long lines in html emails sent by Sentry, for instance.

Choices

For reference, Perl library MIME-Lite recommends:

   Use encoding:     | If your message contains:
   ------------------------------------------------------------
   7bit              | Only 7-bit text, all lines <1000 characters
   8bit              | 8-bit text, all lines <1000 characters
   quoted-printable  | 8-bit text or long lines (more reliable than "8bit")
   base64            | Largely non-textual data: a GIF, a tar file, etc.
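
Read as a decision rule for textual bodies, the table could be sketched as follows. The function `suggest_encoding` is hypothetical, the 998-byte threshold is taken from the RFC quoted earlier rather than from MIME-Lite, and base64 is left out because the sketch only handles text. It is one reading of the table, not a proposed patch.

```python
def suggest_encoding(text, max_line=998):
    """Pick a Content-Transfer-Encoding for a text body, following the table."""
    data = text.encode("utf-8")
    has_long_line = any(len(line) > max_line for line in data.splitlines())
    has_8bit = any(byte > 127 for byte in data)
    if has_long_line:
        # Long lines cannot be sent as 7bit or 8bit.
        return "quoted-printable"
    return "8bit" if has_8bit else "7bit"


print(suggest_encoding("hello"))
print(suggest_encoding("héllo"))
print(suggest_encoding("a" * 2000))
```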

One way or another, we have to guarantee that email lines are <1000 characters.
base64 and quoted-printable do that for us.
Not using them means that we have to find a reliable way to split long lines into shorter ones, but the risk is to break html code in the case of text/html emails.

I am not aware of other encodings that can be used for this, nor of reliable ways to split long lines.
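
To make the risk concrete, here is a sketch of a position-based split landing inside an attribute value. `HrefCollector` and `collect_hrefs` are hypothetical helpers built on the stdlib `html.parser`, and the URL is made up. How a given mail client treats the altered value is not covered here.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.hrefs.append(value)


def collect_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


url = "http://example.com/?q=" + "x" * 1200  # made-up long URL
html = '<a href="%s">link</a>' % url

# Split purely by position, the way the MTA behaviour is described above.
cut = 998
broken = html[:cut] + "\r\n " + html[cut:]

print(collect_hrefs(html) == [url])    # the original link is intact
print(collect_hrefs(broken) == [url])  # the split link no longer matches
```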

On django-users, Russ Magee warned about possible downstream consequences.

Other references

http://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html
relevant discussion on trac's trac
SpamAssassin's rule on quoted-printable messages not respecting the 76-max line length rule.

Change History (8)

comment:1 Changed 15 months ago by petr.hroudny@…

  • Cc petr.hroudny@… added
  • Needs documentation unset
  • Needs tests unset
  • Patch needs improvement unset

comment:2 Changed 15 months ago by phr

Quoted-printable should only be used to downconvert emails to 7bit-only, not to work around their RFC non-compliance regarding line lengths.

Please note that QP works decently only for languages based on ASCII with a few accented characters, and performs miserably for all others.

Thus reintroducing any form of 7bit downconversion is not the proper solution to this problem.

comment:3 Changed 13 months ago by timo

  • Triage Stage changed from Unreviewed to Accepted
  • Type changed from Uncategorized to Bug

Russ seemed to accept the problem on the mailing list.

comment:4 Changed 12 months ago by ralphje

Apart from breaking DKIM, this behaviour also affects appearance of emails containing long lines (spaces appear to be added, at least in Gmail) and breaks inline HTML or CSS.

Note that long lines are commonplace when CSS rules are automatically being inlined.

comment:5 follow-up: Changed 11 months ago by claudep

Not using them means that we have to find a reliable way to split long lines into shorter ones, but the risk is to break html code in the case of text/html emails.

I don't get your point here, how would we break html code by splitting lines? Content inside <pre>?

comment:6 Changed 8 weeks ago by levkowetz

  • Owner changed from nobody to levkowetz
  • Status changed from new to assigned

comment:7 in reply to: ↑ 5 Changed 3 days ago by ris

Replying to claudep:

I don't get your point here, how would we break html code by splitting lines? Content inside <pre>?

Putting a newline in the middle of a tag (href with a loooooong url?) would do it.

This issue is causing me some pain too.

comment:8 Changed 3 days ago by ris

  • Cc bugs@… added