UTF-7 - AbsoluteAstronomy.com

UTF-7 is a variable-length character encoding that was proposed for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet e-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

Motivation

MIME, the modern standard of e-mail format, forbids encoding of headers using byte values above the ASCII range. Although MIME allows encoding the message body in various character sets (broader than ASCII), the underlying transmission infrastructure (SMTP, the main e-mail transfer standard) is still not guaranteed to be 8-bit clean. Therefore, a non-trivial content transfer encoding has to be applied in case of doubt. Unfortunately, base64 has the disadvantage of making even US-ASCII characters unreadable in non-MIME clients. On the other hand, UTF-8 combined with quoted-printable produces a very size-inefficient format requiring 6–9 bytes for non-ASCII characters from the BMP and 12 bytes for characters outside the BMP.
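
The size figures above can be checked with a short Python sketch. It assumes the standard-library `quopri` module as the quoted-printable encoder; the helper name `qp_utf8_size` is hypothetical and only serves this illustration.

```python
import quopri


def qp_utf8_size(ch: str) -> int:
    """Bytes needed for one character as UTF-8 wrapped in quoted-printable."""
    # Every non-ASCII UTF-8 byte becomes a three-character "=XX" escape.
    return len(quopri.encodestring(ch.encode("utf-8")))


if __name__ == "__main__":
    # Two BMP characters (2 and 3 UTF-8 bytes) and one outside the BMP (4 bytes).
    for ch in ("\u00a3", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}: {qp_utf8_size(ch)} bytes")
```

A two-byte UTF-8 sequence costs 6 bytes, a three-byte sequence 9, and a four-byte sequence 12, which matches the ranges quoted above.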

Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without using an underlying MIME transfer encoding, but still must be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words identifying the character set. Since encoded words force use of either quoted-printable or base64, UTF-7 was designed to avoid using the = sign as an escape character to avoid double escaping when it is combined with quoted-printable (or its variant, the RFC 2047/1522 ?Q?-encoding of headers).

UTF-7 is generally not used as a native representation within applications as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or base64, the Internet Mail Consortium recommends against its use.

8BITMIME has also been introduced, which reduces the need to encode message bodies in a 7-bit format.

A modified form of UTF-7 is currently used in the IMAP e-mail retrieval protocol for mailbox names.

Description

UTF-7 was first proposed as an experimental protocol in RFC 1642, A Mail-Safe Transformation Format of Unicode. This RFC has been made obsolete by RFC 2152, an informational RFC which never became a standard. As RFC 2152 clearly states, the RFC "does not specify an Internet standard of any kind". Despite this, RFC 2152 is quoted as the definition of UTF-7 in the IANA's list of charsets. Nor is UTF-7 a Unicode Standard: the Unicode Standard 5.0 only lists UTF-8, UTF-16 and UTF-32.
There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7.

Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are considered very safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space. Using the optional direct characters reduces size and enhances human readability, but it also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.
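
The two groups described above can be written out as Python sets. This is a sketch built only from that description; the names `DIRECT`, `OPTIONAL_DIRECT` and `classify` are hypothetical.

```python
import string

# 62 alphanumerics plus the 9 symbols ' ( ) , - . / : ?
DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")

# Every other printable character in U+0020..U+007E except ~ \ + and space.
PRINTABLE = {chr(cp) for cp in range(0x20, 0x7F)}
OPTIONAL_DIRECT = PRINTABLE - DIRECT - set("~\\+ ")


def classify(ch: str) -> str:
    """Return the group a single character falls into."""
    if ch in DIRECT:
        return "direct"
    if ch in OPTIONAL_DIRECT:
        return "optional direct"
    return "other"


if __name__ == "__main__":
    print(len(DIRECT), "direct characters")
    print("optional direct:", "".join(sorted(OPTIONAL_DIRECT)))
```

Under these definitions the optional set works out to 20 characters, including = and the angle brackets, which is why using them can force extra escaping inside encoded words.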

Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail. The plus sign (+) may be encoded as +-.

Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates) and then in modified Base64. The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.
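
The delimiter rules can be observed with Python's built-in "utf-7" codec. The sketch below only decodes, because an encoder is free to choose which optional direct characters it escapes; the wrapper name `decode_utf7` is hypothetical.

```python
def decode_utf7(raw: bytes) -> str:
    """Decode UTF-7 bytes using Python's built-in codec."""
    return raw.decode("utf-7")


if __name__ == "__main__":
    samples = [
        b"1 +- 1 = 2",  # "+-" stands for a literal plus sign
        b"+AKM-1",      # the "-" ends the block and is consumed
        b"+AKM.",       # "." is not a Base64 character: it ends the block and is kept
    ]
    for raw in samples:
        print(raw, "->", decode_utf7(raw))
```

In the second sample the hyphen is required, because the following "1" is itself a Base64 character and would otherwise be read as part of the block.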

Confusingly, Microsoft in its .NET documentation calls its LEB128 string length encoding UTF-7: "A length-prefixed string represents the string length by prefixing to the string a single byte or word that contains the length of that string. This method first writes the length of the string as a UTF-7 encoded unsigned integer, and then writes that many characters to the stream by using the BinaryWriter instance's current encoding." The accompanying example code, however, shows that instead of UTF-7, a little-endian variable-length quantity identical to LEB128 is used, and that the count is in fact a byte count and not a character count.
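
For contrast with real UTF-7, here is a minimal Python sketch of an unsigned LEB128 length prefix of the kind described above. It is not the .NET implementation; the names `encode_7bit_int` and `length_prefixed` are hypothetical, and UTF-8 is assumed as the string encoding for the example.

```python
def encode_7bit_int(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 bits per byte, low bits first)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low)         # high bit clear: last byte
            return bytes(out)


def length_prefixed(text: str, encoding: str = "utf-8") -> bytes:
    """Prefix the encoded string with its length in bytes, not in characters."""
    payload = text.encode(encoding)
    return encode_7bit_int(len(payload)) + payload


if __name__ == "__main__":
    print(encode_7bit_int(300).hex())      # two bytes are needed above 127
    print(length_prefixed("\u00a31").hex())  # 2 characters, but a prefix of 3
```

Nothing in this scheme involves Base64 or + and - delimiters, which is why calling it UTF-7 is misleading.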

Examples

"Hello, World!" is encoded as "Hello, World!"
"1 + 1 = 2" is encoded as "1 +- 1 = 2"
"£1" is encoded as "+AKM-1". The Unicode code point for the pound sign
is U+00A3 (which is 00A3₁₆ in UTF-16), which converts into modified Base64 as in the table below. There are two bits left over, which are padded to 0.

Hex digit          0    0    A    3    (padding)
Bit pattern        0000 0000 1010 0011 00
Six-bit groups     000000 001010 001100
Index              0      10     12
Base64 character   A      K      M

Encoding
First, an encoder must decide which characters to represent directly in ASCII form, which + signs have to be escaped as +-, and which characters to place in blocks of Unicode characters. A simple encoder may encode all characters it considers safe for direct encoding directly. However, the cost of ending a Unicode sequence, outputting a single character directly in ASCII and then starting another Unicode sequence is 3 to 3⅔ bytes. This is more than the 2⅔ bytes needed to represent the character as part of a Unicode sequence. Each Unicode sequence must be encoded using the following procedure, then surrounded by the appropriate delimiters.

Using the £† (U+00A3 U+2020) character sequence as an example:
1. Express the character’s Unicode numbers (UTF-16) in Binary:
  
  0x00A3 → 0000 0000 1010 0011
  
  0x2020 → 0010 0000 0010 0000
2. Concatenate the binary sequences
  
  0000 0000 1010 0011 and 0010 0000 0010 0000 → 0000 0000 1010 0011 0010 0000 0010 0000
3. Regroup the binary into groups of six bits, starting from the left:
  
  0000 0000 1010 0011 0010 0000 0010 0000 → 000000 001010 001100 100000 001000 00
4. If the last group has less than six bits, add trailing zeros:
  
  000000 001010 001100 100000 001000 00 → 000000 001010 001100 100000 001000 000000
5. Replace each group of six bits with a respective Base64 code:
  
  000000 001010 001100 100000 001000 000000 → AKMgIA
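
The five steps above can be sketched in Python as follows. This is a minimal illustration rather than a complete UTF-7 encoder: the names `B64` and `encode_block` are hypothetical, and choosing which characters go into a block and adding the + and - delimiters is left out.

```python
import string

# Base64 alphabet; modified Base64 uses the same symbols but no "=" padding.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_block(text: str) -> str:
    """Encode one run of characters as the body of a UTF-7 Unicode block."""
    # Steps 1-2: UTF-16 (big-endian) code units as one concatenated bit string.
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-16-be"))
    # Step 4: pad the final group with trailing zeros up to a multiple of six.
    bits += "0" * (-len(bits) % 6)
    # Steps 3 and 5: cut into six-bit groups and map each to a Base64 symbol.
    return "".join(B64[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))


if __name__ == "__main__":
    print(encode_block("\u00a3\u2020"))  # the example sequence U+00A3 U+2020
```

Using the "utf-16-be" codec means characters above U+FFFF are split into surrogate pairs automatically before the bits are regrouped.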
Decoding
First, the encoded data must be separated into plain ASCII text chunks (including + signs followed by a dash) and nonempty Unicode blocks, as described in the Description section. Once this is done, each Unicode block must be decoded with the following procedure (using the result of the encoding example above as the example):
1. Express each Base64 code as the bit sequence it represents:
  AKMgIA → 000000 001010 001100 100000 001000 000000
2. Regroup the binary into groups of sixteen bits, starting from the left:
  000000 001010 001100 100000 001000 000000 → 0000000010100011 0010000000100000 0000
3. If there is an incomplete group at the end, discard it (If the incomplete group contains more than four bits or contains any ones, the code is invalid):
  0000000010100011 0010000000100000
4. Each group of 16 bits is a character's Unicode (UTF-16) number and can be expressed in other forms:
  0000 0000 1010 0011 ≡ 0x00A3 ≡ 163₁₀
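
The decoding steps above can be sketched the same way. Again this is an illustration under stated assumptions: `B64` and `decode_block` are hypothetical names, the input is the body of one block with the + and - delimiters already removed, and the validity rule is the one given in step 3.

```python
import string

# Base64 alphabet; modified Base64 uses the same symbols but no "=" padding.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def decode_block(block: str) -> str:
    """Decode the body of one UTF-7 Unicode block (delimiters already removed)."""
    # Step 1: each Base64 symbol becomes six bits (str.index raises ValueError
    # for a symbol outside the alphabet).
    bits = "".join(f"{B64.index(symbol):06b}" for symbol in block)
    # Step 2: only whole 16-bit groups are usable.
    usable = len(bits) - len(bits) % 16
    leftover = bits[usable:]
    # Step 3: the discarded tail must be at most four bits, all zero.
    if len(leftover) > 4 or "1" in leftover:
        raise ValueError("invalid padding in UTF-7 block")
    # Step 4: reinterpret the 16-bit groups as UTF-16 (big-endian) code units.
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-16-be")


if __name__ == "__main__":
    print(decode_block("AKMgIA"))
```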
Security
UTF-7 allows multiple representations of the same source string. In particular, ASCII characters can be represented as part of Unicode blocks. As such, if standard ASCII-based escaping or validation processes are used on strings that may later be interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.

Older versions of Internet Explorer can be tricked into interpreting the page as UTF-7. This can be used for a cross-site scripting attack as the < and > marks can be encoded as +ADw- and +AD4- in UTF-7, which most validators let through as simple text.
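
The validation problem can be reproduced in a few lines of Python. The filter below is a deliberately naive, hypothetical check (`naive_is_safe`), not any real validator; decoding uses Python's built-in "utf-7" codec.

```python
def naive_is_safe(raw: bytes) -> bool:
    """A deliberately naive filter: reject input containing raw angle brackets."""
    return b"<" not in raw and b">" not in raw


if __name__ == "__main__":
    payload = b"+ADw-script+AD4-"
    # Validation before decoding sees only harmless-looking ASCII.
    print("passes filter:", naive_is_safe(payload))
    # A consumer that later treats the bytes as UTF-7 sees markup.
    decoded = payload.decode("utf-7")
    print("decodes to:", decoded)
    # Decoding first and validating afterwards catches it.
    print("passes after decoding:", naive_is_safe(decoded.encode("ascii")))
```

This is why the order matters: the same check gives opposite answers depending on whether it runs before or after decoding.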

See also
- Comparison of Unicode encodings
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
