Follow-up to the thread "Email encoding (DKIM, long lines, etc..)" on django-users


RFC 2822 states that:
"Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."

This statement was left unchanged in the 2008 update, RFC 5322.
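
To make the two limits concrete, here is a minimal sketch of a checker. The function name `check_line_lengths` and the constants are my own (hypothetical, not part of Django or the stdlib); it counts bytes per line, excluding the line ending, which is how I read the RFC wording.

```python
# Limits quoted from RFC 2822 / RFC 5322; both exclude the trailing CRLF.
HARD_LIMIT = 998  # "MUST be no more than"
SOFT_LIMIT = 78   # "SHOULD be no more than"


def check_line_lengths(body: bytes):
    """Return (lines over the hard limit, lines over the soft limit only).

    Both lists hold 1-based line numbers. Hypothetical helper for illustration.
    """
    over_hard, over_soft = [], []
    for number, line in enumerate(body.splitlines(), start=1):
        if len(line) > HARD_LIMIT:
            over_hard.append(number)
        elif len(line) > SOFT_LIMIT:
            over_soft.append(number)
    return over_hard, over_soft


body = b"short line\r\n" + b"x" * 100 + b"\r\n" + b"y" * 1200 + b"\r\n"
print(check_line_lengths(body))  # ([3], [2])
```

Line 2 only breaks the SHOULD, line 3 breaks the MUST; the latter is the case that matters for the rest of this ticket.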


For utf-8 encoded emails, Python uses:

  • shortest of "quoted-printable" and "base64" for the email subject
  • "base64" for the body

# stdlib, identical in python2.7 and python3.3 : email/
   'utf-8':       (SHORTEST,  BASE64, 'utf-8'),
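
These defaults can be inspected directly. A small sketch, assuming a current Python 3 where the registry is exposed as `email.charset.CHARSETS`:

```python
from email import charset

# Registry entry: (header encoding, body encoding, output charset).
print(charset.CHARSETS["utf-8"] == (charset.SHORTEST, charset.BASE64, "utf-8"))

# A fresh Charset object picks up the same defaults.
utf8 = charset.Charset("utf-8")
print(utf8.header_encoding == charset.SHORTEST)
print(utf8.body_encoding == charset.BASE64)
print(utf8.get_body_encoding())  # the Content-Transfer-Encoding that would be used
```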

The historical reason seems to be that support for 8-bit characters in emails was not widely adopted, hence the need to encode them into ASCII.

Back in 2007, in ticket 3472 (changeset 5143), it was decided to always use "quoted-printable", because base64 seemed to negatively affect spam scores.

# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
Charset.add_charset('utf-8', Charset.SHORTEST, Charset.QP, 'utf-8')

In 2011, in ticket 11212 (changeset 16178, Django 1.4), it was decided to remove "quoted-printable" and let Python choose automatically between 7-bit and 8-bit encoding, on the grounds that 8-bit emails were widely supported and that MTAs were in charge of downgrading to 7-bit if necessary.

Charset.add_charset('utf-8', Charset.SHORTEST, None, 'utf-8')

The (unintended?) side effect of using base64 or "quoted-printable" was in fact a guarantee of short lines in emails (for instance, RFC 2045, which defines quoted-printable, limits encoded lines to 76 characters).
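
A quick way to see this guarantee, using only the stdlib `quopri` and `base64` modules rather than Django's mail code (the helper `longest_line` is my own name):

```python
import base64
import quopri

# One very long line of non-ASCII text: 4000 bytes, no newline at all.
long_line = ("é" * 2000).encode("utf-8")


def longest_line(data: bytes) -> int:
    """Length in bytes of the longest line, line endings excluded."""
    return max(len(line) for line in data.splitlines())


qp = quopri.encodestring(long_line)
b64 = base64.encodebytes(long_line)

print(longest_line(long_line))  # 4000
print(longest_line(qp) <= 76)   # True
print(longest_line(b64) <= 76)  # True

# Both encodings are reversible, so the long line is restored on decoding.
print(quopri.decodestring(qp) == long_line)  # True
print(base64.decodebytes(b64) == long_line)  # True
```

So with either encoding the line length on the wire is bounded no matter what the template produced.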

Summary of the reasons invoked for these choices

  • base64 is too big (bandwidth)
  • base64 is not supported by all clients
  • base64 has a negative effect on spam scores (cf SpamAssassin's rule on unnecessarily using base64 encoding to disguise text, but this rule also states that "This does not apply to text in the UTF-8 or big5 character sets.")
  • quoted-printable is no longer necessary, since MTAs and email clients have adopted 8bit support

Current state


There was an additional ticket, 12422, but it is not relevant to this one.

The current code in django/core/mail/ looks like:

# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
utf8_charset = Charset.Charset('utf-8')
utf8_charset.body_encoding = None  # Python defaults to BASE64


Email clients like Gmail seem to wrap lines at 80 characters for text/plain, and switch to "Content-Transfer-Encoding: quoted-printable" for text/html and text/plain if there are non-ascii characters.


Mail Transfer Agents like Postfix often split lines that do not respect the RFC by inserting "\r\n " at the 998th position of the line.
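
A rough model of that rewriting, to show what the receiver ends up with. `fold_like_mta` is a hypothetical function of mine, not Postfix's actual algorithm; I am assuming continuation pieces are shortened by one character to leave room for the inserted space.

```python
MAX_LINE = 998  # hard limit from RFC 5322, excluding CRLF


def fold_like_mta(line: str, limit: int = MAX_LINE) -> str:
    """Hypothetical model of an MTA breaking an over-long line with CRLF + space."""
    if len(line) <= limit:
        return line
    pieces = [line[:limit]]
    rest = line[limit:]
    step = limit - 1  # leave room for the inserted leading space
    pieces.extend(rest[i:i + step] for i in range(0, len(rest), step))
    return "\r\n ".join(pieces)


original = "a" * 2500
folded = fold_like_mta(original)

print(folded.count("\r\n "))                                # 2
print(max(len(piece) for piece in folded.split("\r\n")))    # 998
print(folded == original)                                   # False
```

The body now contains line breaks and spaces the sender never wrote.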

The DKIM signature is computed over the unmodified body, but receivers validate it against the body as modified in transit, so the check fails.
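
A simplified illustration of the mismatch. This is not a DKIM implementation: `body_hash` is my own helper that only mimics the idea of the body hash (base64 of a SHA-256 digest) and skips canonicalization entirely; the in-transit rewrite is the assumed CRLF + space insertion described above.

```python
import base64
import hashlib


def body_hash(body: bytes) -> str:
    """Base64 SHA-256 of the body, in the spirit of a DKIM body hash (no canonicalization)."""
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")


# Body as signed by the sender: a single line well over 998 bytes.
signed_body = b"<p>" + b"x" * 1500 + b"</p>\r\n"

# Assumed in-transit rewrite: CRLF + space inserted after the 998th byte.
received_body = signed_body[:998] + b"\r\n " + signed_body[998:]

print(body_hash(signed_body) == body_hash(received_body))  # False
```

As far as I understand, neither of DKIM's body canonicalizations removes a line break added in the middle of a line, so the real check fails in the same way.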

Apart from my own Django projects, I have seen long lines in HTML emails sent by Sentry, for instance.


For reference, the Perl library MIME-Lite recommends:

   Use encoding:     | If your message contains:
   7bit              | Only 7-bit text, all lines <1000 characters
   8bit              | 8-bit text, all lines <1000 characters
   quoted-printable  | 8-bit text or long lines (more reliable than "8bit")
   base64            | Largely non-textual data: a GIF, a tar file, etc.
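
One possible reading of this table as code. `choose_encoding` is a hypothetical helper, not MIME-Lite's or Django's API; where the table allows both "8bit" and "quoted-printable" for 8-bit text, I arbitrarily pick "8bit" unless a line is too long.

```python
def choose_encoding(payload: bytes, is_text: bool = True) -> str:
    """Pick a Content-Transfer-Encoding following the table above (one interpretation)."""
    if not is_text:
        return "base64"  # largely non-textual data
    has_long_line = any(len(line) >= 1000 for line in payload.splitlines())
    if has_long_line:
        return "quoted-printable"  # the only textual option that tolerates long lines
    is_7bit = all(byte < 128 for byte in payload)
    return "7bit" if is_7bit else "8bit"


print(choose_encoding(b"hello\r\n"))                    # 7bit
print(choose_encoding("héllo\r\n".encode("utf-8")))     # 8bit
print(choose_encoding(b"x" * 1200))                     # quoted-printable
print(choose_encoding(b"\x89PNG", is_text=False))       # base64
```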

One way or another, we have to guarantee that email lines are <1000 characters.
base64 and quoted-printable do that for us.
Not using them means that we have to find a reliable way to split long lines into shorter ones, at the risk of breaking the HTML code in text/html emails.

I am not aware of other encodings that can be used for this, nor of reliable ways to split long lines.

On django-users, Russ Magee warned about possible downstream consequences.

Other references
Relevant discussion on Trac's own Trac.
SpamAssassin's rule on quoted-printable messages that do not respect the 76-character maximum line length.

Quoted-printable should only be used to downconvert emails to 7-bit, not to work around their RFC non-compliance regarding line lengths.

Please note that QP works decently only for ASCII-based languages with a few accented characters, and performs miserably for all others.

Thus reintroducing any form of 7bit downconversion is not the proper solution to this problem.

Russ seemed to accept the problem on the mailing list.

Apart from breaking DKIM, this behaviour also affects the appearance of emails containing long lines (spaces appear to be added, at least in Gmail) and breaks inline HTML or CSS.

Note that long lines are commonplace when CSS rules are automatically being inlined.

Not using them means that we have to find a reliable way to split long lines into shorter ones, but the risk is to break html code in the case of text/html emails.

I don't get your point here, how would we break html code by splitting lines? Content inside <pre>?

