UTF-7: a ghost from the time before UTF-8

2018-10-31

On Halloween this year I learned two scary things. The first is that a young toddler can go trick-or-treating in your apartment building and acquire a huge amount of candy. When they are this young they have no interest in the candy itself, so you are left having to eat it all yourself.

The second scary thing is that in the heart of the ubiquitous IMAP protocol lingers a ghost of the time before UTF-8. Its name is Modified UTF-7.

UTF-7

UTF-7 is described in RFC 2152. It lets you encode all of Unicode, much like the other UTF encoding schemes, though it adds a neat property: it only uses printable ASCII characters to do it. Unfortunately you pay a price: it is complicated and inefficient.

First, most ASCII characters are represented by themselves. The important exception is the shift character +. Instead of + we now write +-.

Any sequence of non-ASCII characters (or disallowed ASCII characters such as ~) is first converted to UTF-16BE, then encoded as base64, and placed between a + and a -.

(Even though this is 2018, occasionally someone will try to claim in conversation with me that UTF-16 is better than UTF-8. The obvious response is to point to the surrogate pairs mess, but many people defending UTF-16 don't realize those are necessary. I have found I can skip over the long explanation of surrogates by simply asking: "do you mean UTF-16LE or UTF-16BE?")

There is something immediately appealing about this definition of UTF-7. You can describe it in three sentences, it is built on the popular encoding scheme base64, and it is ASCII printable.

An example:

"Hello, 世界"           (UTF-8)
"Hello, \u4E16\u754C"  (ASCII with Unicode hex literals)
"Hello, +ThZ1TA-"      (UTF-7)
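
The three rules can be sketched in a few lines of Python. This is a rough illustration, not a conforming RFC 2152 encoder: the function name utf7_encode is made up for this post, the set of characters passed through directly is simplified to printable ASCII other than +, \ and ~, and any trailing = from base64 is simply dropped (padding is discussed further down).

```python
import base64


def utf7_encode(text: str) -> str:
    """Encode text with the three rules above (simplified direct set)."""
    out = []      # finished pieces of output
    pending = []  # characters waiting to be written as one base64 run

    def flush() -> None:
        # Write the pending run as +<base64 of UTF-16BE>- and reset it.
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("+" + b64 + "-")
            pending.clear()

    for ch in text:
        if ch == "+":
            # The shift character itself is written as +-
            flush()
            out.append("+-")
        elif " " <= ch <= "}" and ch != "\\":
            # Assumption: printable ASCII other than +, \ and ~ is direct.
            flush()
            out.append(ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)


print(utf7_encode("Hello, 世界"))  # Hello, +ThZ1TA-
```

The closing - is always written here. That is a choice made for simplicity in this sketch.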

UTF-7 is not a particularly appealing wire format. In the example above UTF-7 uses 8 bytes to represent what UTF-8 does in 6 bytes. It becomes even less efficient if ASCII is regularly mixed in with non-ASCII code points as we need to constantly add escape characters. And while it is ASCII printable, the printing is inscrutable. Relating ThZ1TA back to anything is beyond my mind, so I may as well use something non-printable.
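
The byte counts are easy to check with Python's built-in utf-7 codec. The helper name sizes and the mixed sample string are made up for illustration, and the exact output for mixed text depends on the encoder, so no numbers are claimed for that case.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-7 length) in bytes for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-7"))


# The example from above: 6 bytes of UTF-8 against 8 bytes of UTF-7.
print(sizes("世界"))

# ASCII mixed in with non-ASCII forces a new shift sequence for every
# non-ASCII run, so the gap widens.
print(sizes("a世b界c"))
```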

To make matters worse, this is not base64. It is modified base64. The base64 padding character = cannot appear in UTF-7. To avoid it the RFC tells us to pad the UTF-16BE with zero bits until you reach a length that can be base64 encoded without padding:

      Next, the octet stream is encoded by applying the Base64 content
      transfer encoding algorithm as defined in RFC 2045, modified to
      omit the "=" pad character. Instead, when encoding, zero bits are
      added to pad to a Base64 character boundary. When decoding, any
      bits at the end of the Modified Base64 sequence that do not
      constitute a complete 16-bit Unicode character are discarded.

That sounds fishy.

Base64 encodes every block of 3 input bytes as 4 output characters. If the input length is not divisible by 3, the leftover bytes are encoded and the output is padded with = until its length is divisible by 4. This means you may get up to two = characters at the end of a base64 string. If we are going to pad the input as the RFC suggests so that we never use =, we may have to add up to two zero bytes to the input. That would form a valid UTF-16 NULL!
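
To make the worry concrete, here is a small sketch using Python's base64 module. It follows the reading of the RFC text used in the paragraph above, appending whole zero bytes to the input until base64 needs no =. The variable names are made up for illustration.

```python
import base64

# How many = characters base64 adds for 1, 2, 3 and 4 input bytes.
for n in (1, 2, 3, 4):
    encoded = base64.b64encode(bytes(n)).decode("ascii")
    print(n, encoded, encoded.count("="))

raw = "世界".encode("utf-16-be")            # 4 bytes: 4E 16 75 4C
padded = raw + b"\x00" * (-len(raw) % 3)    # append zero bytes up to a multiple of 3
b64 = base64.b64encode(padded).decode("ascii")
decoded = base64.b64decode(b64).decode("utf-16-be")

print(b64)            # ThZ1TAAA
print(repr(decoded))  # '世界\x00'
```

The two appended zero bytes decode as U+0000, so the round trip does not return the original string.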

So how do we handle this padding?

I looked inside three UTF-7 encoders and found they don't follow the RFC at all on this. Instead, they encode the UTF-16 to modified base64 without any zero bit padding, and then remove any base64 = padding from the result.

This works, and it produces shorter output than the RFC process, with no ambiguous NULL. But it sure would be nice if someone had documented it.

To explain with an example, the initial base64 output for the string 世界 is ThZ1TA==. Removing the trailing == gives ThZ1TA, the payload that goes between + and - in UTF-7.
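
A minimal sketch of that approach, with decoding included to show that the stripped = can always be recovered from the length. The names encode_payload and decode_payload are made up for this post, and the code is not taken from any of the encoders mentioned above.

```python
import base64


def encode_payload(text: str) -> str:
    """UTF-16BE, then standard base64, then drop any trailing =."""
    raw = text.encode("utf-16-be")
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def decode_payload(payload: str) -> str:
    """Put the = padding back, decode, and drop any incomplete 16-bit unit."""
    padded = payload + "=" * (-len(payload) % 4)
    raw = base64.b64decode(padded)
    raw = raw[: len(raw) - len(raw) % 2]
    return raw.decode("utf-16-be")


print(encode_payload("世界"))    # ThZ1TA
print(decode_payload("ThZ1TA"))  # 世界
```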

Modified UTF-7

UTF-7 is no more. It has long since been replaced in SMTP, and in MIME headers, where many encodings can be used, people choose other things. However, a modified version is still used in IMAP. RFC 3501 describes the changes:

Modified base64 is modified further: in the encoded alphabet, / is replaced by ,. This is neither the standard nor the URL-safe base64 alphabet you have seen before.
The ASCII characters \ and ~ no longer need to be encoded. In fact, they MUST NOT be encoded.
The escape character is now & instead of +, so a literal & is written &- and + stands for itself.

So now we have modified-modified-base64 and our example above reads:

"Hello, &ThZ1TA-"      (Modified UTF-7)
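
A rough Python sketch of the modified rules, assuming the RFC 3501 convention that printable ASCII other than & represents itself. The function name imap_utf7_encode is made up for this post, and the sketch has not been tested against a real IMAP server.

```python
import base64


def imap_utf7_encode(name: str) -> str:
    """Encode a mailbox name in the modified UTF-7 described above."""
    out = []      # finished pieces of output
    pending = []  # characters waiting to be written as one base64 run

    def flush() -> None:
        # Write the pending run as &<modified base64 of UTF-16BE>-
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("&" + b64.replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if ch == "&":
            # The shift character itself is written as &-
            flush()
            out.append("&-")
        elif " " <= ch <= "~":
            # Printable ASCII, including \ and ~, is never base64 encoded.
            flush()
            out.append(ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)


print(imap_utf7_encode("Hello, 世界"))  # Hello, &ThZ1TA-
```

The mailbox name example in RFC 3501, ~peter/mail/台北/日本語, exercises the , substitution: its first encoded run is U,BTFw where standard base64 would give U/BTFw.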

A simpler future

IMAP is a living protocol with many RFCs adding extensions. One of those is RFC 6855, which lets a server and client negotiate the UTF8=ACCEPT capability and drop all the UTF-7.

It even includes a negotiation mode for the future where servers can announce UTF8=ONLY and refuse to talk any UTF-7 with clients. Hopefully we can get there.
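
As a small illustration of what a client might do with those announcements, here is a sketch that inspects a server's capability list. The function name and the three labels it returns are made up for this post. It assumes the capability names have already been parsed into a list of strings.

```python
def mailbox_name_mode(capabilities: list[str]) -> str:
    """Pick a mailbox name encoding from a server's capability list."""
    caps = {c.upper() for c in capabilities}
    if "UTF8=ONLY" in caps:
        # The server refuses modified UTF-7; UTF-8 must be enabled.
        return "utf8-required"
    if "UTF8=ACCEPT" in caps:
        # The server can do either; the client may enable UTF-8.
        return "utf8-optional"
    # No RFC 6855 support announced: fall back to modified UTF-7.
    return "modified-utf7"


print(mailbox_name_mode(["IMAP4rev1", "ENABLE", "UTF8=ACCEPT"]))
```

In both UTF-8 cases the client still has to send ENABLE UTF8=ACCEPT before relying on it. Python's imaplib has an enable method for this, though the details are worth checking against its documentation.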