2018-10-31
On Halloween this year I learned two scary things. The first is that a young toddler can go trick-or-treating in your apartment building and acquire a huge amount of candy. When they are this young they have no interest in the candy itself, so you are left having to eat it all yourself.
The second scary thing is that in the heart of the ubiquitous IMAP protocol lingers a ghost of the time before UTF-8. Its name is Modified UTF-7.
UTF-7 is described in RFC 2152. It lets you encode all of Unicode, much like the other UTF encoding schemes, though it adds a neat property: it only uses printable ASCII characters to do it. Unfortunately you pay a price: it is complicated and inefficient.
First, most ASCII characters are represented by themselves.
The important exception is the shift character '+'.
Instead of '+' we now write '+-'.
Any sequence of non-ASCII characters (or disallowed ASCII characters
such as '~') is first converted to UTF-16BE,
then encoded as base64, and placed between a '+' and a '-'.
(Even though this is 2018, occasionally someone will try to claim in conversation with me that UTF-16 is better than UTF-8. The obvious response is to point to the surrogate pairs mess, but many people defending UTF-16 don't realize those are necessary. I have found I can skip over the long explanation of surrogates by simply asking: "do you mean UTF-16LE or UTF-16BE?")
There is something immediately appealing about this definition of UTF-7. You can describe it in three sentences, it is built on the popular encoding scheme base64, and it is ASCII printable.
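Those three sentences translate fairly directly into code. What follows is a minimal sketch, not a reference implementation: the name utf7_encode and the DIRECT set are my own, the set of characters passed through unchanged is a simplification of what RFC 2152 allows, and base64 '=' padding is simply dropped (the padding question is taken up further on).

```python
import base64

# Assumption: every printable ASCII character except '+', '~' and '\' is
# written as itself. RFC 2152 draws this line more finely.
DIRECT = {chr(c) for c in range(0x20, 0x7F)} - {"+", "~", "\\"}


def utf7_encode(text: str) -> str:
    out = []  # finished pieces of output
    run = []  # pending characters that must go through base64

    def flush() -> None:
        # Convert the pending run to UTF-16BE, base64 it, wrap it in + and -.
        if run:
            raw = "".join(run).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("+" + b64 + "-")
            run.clear()

    for ch in text:
        if ch == "+":
            flush()
            out.append("+-")  # the shift character escapes itself
        elif ch in DIRECT:
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)


print(utf7_encode("Hello, 世界"))  # Hello, +ThZ1TA-
print(utf7_encode("1+1=2"))        # 1+-1=2
```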
An example:
"Hello, 世界" (UTF-8)
"Hello, \u4E16\u754C" (ASCII with unicode hex literals)
"Hello, +ThZ1TA-" (UTF-7)
UTF-7 is not a particularly appealing wire format.
In the example above UTF-7 uses 8 bytes to represent what UTF-8 does
in 6 bytes.
It becomes even less efficient if ASCII is regularly mixed in
with non-ASCII code points as we need to constantly add escape
characters.
And while it is ASCII printable, the printing is inscrutable.
Relating ThZ1TA back to anything is beyond my mind, so I may
as well use something non-printable.
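The size penalty is easy to check by hand. The sketch below uses a hypothetical helper, sizes, and hand-written UTF-7 byte strings; Python's built-in "utf-7" codec is used only to confirm that each one decodes back to the intended text.

```python
def sizes(text: str, utf7: bytes) -> tuple[int, int]:
    # Confirm the hand-written UTF-7 really decodes to the text, then
    # return (UTF-8 byte count, UTF-7 byte count).
    assert utf7.decode("utf-7") == text
    return len(text.encode("utf-8")), len(utf7)


print(sizes("世界", b"+ThZ1TA-"))          # (6, 8)
print(sizes("a世b界c", b"a+ThY-b+dUw-c"))  # (9, 13)
```

In the second string every non-ASCII character sits alone between ASCII letters, so each one pays for its own '+' and '-'.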
To make matters worse, this is not base64. It is modified base64.
The base64 padding character '=' cannot appear in UTF-7.
To avoid it the RFC tells us to pad the UTF-16BE with zero bits
until you reach a length that can be base64 encoded without padding:
Next, the octet stream is encoded by applying the Base64 content
transfer encoding algorithm as defined in RFC 2045, modified to
omit the "=" pad character. Instead, when encoding, zero bits are
added to pad to a Base64 character boundary. When decoding, any
bits at the end of the Modified Base64 sequence that do not
constitute a complete 16-bit Unicode character are discarded.
That sounds fishy.
Base64 encodes every block of 3 bytes to 4 bytes.
If the length of what you are encoding is not divisible by 3, then what
you have is encoded and the base64 string is padded with '=' so that its
length is divisible by four.
This means you may get up to two '=' characters at the end of
a base64 string.
If we are going to pad the input as the RFC suggests so that we
never use =, we may have to pad up to two bytes of input with
zeros.
That would form a valid UTF-16 NULL!
So how do we handle this padding?
I looked inside three UTF-7 encoders and found they don't follow
the RFC at all on this.
Instead, they encode the UTF-16 as base64 without any
zero bit padding, and then remove any base64 '=' padding from
the result.
This works, and it produces shorter results than the RFC process, with no ambiguous NULL. But it sure would be nice if someone had documented it.
To explain with an example, the initial base64 output for the string
世界 is ThZ1TA==.
We removed the trailing == to produce UTF-7.
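The strip-the-padding approach, and its inverse, can be sketched in a few lines. The function names here are my own, and the decoder assumes its input came from an encoder like this one: it restores the '=' characters before handing the string to the standard library.

```python
import base64


def modified_b64encode(raw: bytes) -> str:
    # Standard base64, then drop any trailing '=' padding.
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def modified_b64decode(text: str) -> bytes:
    # Put back as many '=' as needed to reach a multiple of four characters.
    padded = text + "=" * (-len(text) % 4)
    return base64.b64decode(padded)


raw = "世界".encode("utf-16-be")
print(base64.b64encode(raw).decode("ascii"))  # ThZ1TA==
print(modified_b64encode(raw))                # ThZ1TA
print(modified_b64decode("ThZ1TA") == raw)    # True
```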
UTF-7 is no more. It has long since been replaced in SMTP, and in MIME headers, where many encodings can be used, people choose other things. However, a modified version is still used in IMAP. RFC 3501 describes it:
Modified base64 is modified further: in the encoded alphabet, '/' is replaced by ','. This is neither the standard nor URL base64 encoding scheme you have seen before.
The ASCII characters '\' and '~' no longer need to be encoded. In fact, they MUST NOT be encoded.
The escape character is now '&' instead of '+'.
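Putting those changes together gives something like the sketch below. The names imap_utf7_encode and imap_utf7_decode are hypothetical, and I assume every printable ASCII character other than '&' passes through as itself; the decoder does no validation beyond what the standard library raises on its own.

```python
import base64


def imap_utf7_encode(name: str) -> str:
    out, run = [], []

    def flush() -> None:
        # UTF-16BE, base64 without '=', ',' in place of '/', wrapped in & and -.
        if run:
            raw = "".join(run).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("&" + b64.replace("/", ",") + "-")
            run.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.append("&-")  # the escape character escapes itself
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ch)    # includes '\', '~' and '+'
        else:
            run.append(ch)
    flush()
    return "".join(out)


def imap_utf7_decode(encoded: str) -> str:
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] != "&":
            out.append(encoded[i])
            i += 1
            continue
        end = encoded.index("-", i)  # ValueError if the run is unterminated
        chunk = encoded[i + 1:end].replace(",", "/")
        if chunk == "":
            out.append("&")          # "&-" is a literal ampersand
        else:
            chunk += "=" * (-len(chunk) % 4)
            out.append(base64.b64decode(chunk).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


print(imap_utf7_encode("Hello, 世界"))  # Hello, &ThZ1TA-
print(imap_utf7_encode("~a\\b+c&d"))    # ~a\b+c&-d
print(imap_utf7_decode("~peter/mail/&U,BTFw-/&ZeVnLIqe-"))  # ~peter/mail/台北/日本語
```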
So now we have modified-modified-base64 and our example above reads:
"Hello, &ThZ1TA-" (Modified UTF-7)
IMAP is a living protocol with many RFCs adding extensions.
One of those is RFC 6855, which lets a server and client negotiate
the UTF8=ACCEPT capability and drop all the UTF-7.
It even includes a negotiation mode for the future where servers
can announce UTF8=ONLY and refuse to talk any UTF-7 with clients.
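From the client side, the decision comes down to looking for those two tokens in the server's capability list. The sketch below is only that: mailbox_name_mode and its return values are made up for illustration, and it looks at a single CAPABILITY line rather than speaking the protocol.

```python
def mailbox_name_mode(capability_line: str) -> str:
    # Split the capability response into tokens and compare case-insensitively.
    tokens = {t.upper() for t in capability_line.split()}
    if "UTF8=ONLY" in tokens:
        return "utf8-only"      # server refuses to talk UTF-7
    if "UTF8=ACCEPT" in tokens:
        return "utf8-accept"    # client may switch to UTF-8
    return "modified-utf7"      # fall back to RFC 3501 mailbox names


print(mailbox_name_mode("* CAPABILITY IMAP4rev1 ENABLE UTF8=ACCEPT"))  # utf8-accept
print(mailbox_name_mode("* CAPABILITY IMAP4rev1 IDLE"))                # modified-utf7
```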
Hopefully we can get there.