UTF-9 and UTF-18
UTF-9 and UTF-18 were two April Fools' Day RFC joke specifications for encoding Unicode on systems where the nonet (nine-bit group) is a better fit for the native word size than the octet, such as the 36-bit PDP-10. Both encodings were specified in RFC 4042, written by Mark Crispin (inventor of IMAP) and released on April 1, 2005. The encodings suffer from a number of flaws, and their author has confirmed that they were intended as a joke.

However, unlike some of the "specifications" given in other April 1 RFCs, they are technically possible to implement, and have in fact been implemented in PDP-10 assembly language. They are not endorsed by the Unicode Consortium.

Technical details

Like the 8-bit code commonly called variable-length quantity, UTF-9 puts an octet in the low 8 bits of each nonet and uses the high bit to indicate continuation. This means that ASCII and Latin 1 characters take one nonet each, the rest of the BMP characters take two nonets each, and non-BMP code points take three. Code points that require multiple nonets are stored starting with the most significant non-zero octet.
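The scheme described above can be sketched in a few lines of Python. This is only an illustration, not the PDP-10 implementation: the function names `utf9_encode` and `utf9_decode` are hypothetical, nonets are modelled as plain integers in a list, and the sketch assumes that the continuation bit (0x100) is set on every nonet of a sequence except the last.

```python
def utf9_encode(text):
    """Encode a string as a list of 9-bit integers (nonets)."""
    nonets = []
    for ch in text:
        cp = ord(ch)
        # Split the code point into octets, most significant non-zero
        # octet first (U+0000 still needs one octet).
        octets = cp.to_bytes(max(1, (cp.bit_length() + 7) // 8), "big")
        # Assumption: every nonet except the last has the high bit set.
        for octet in octets[:-1]:
            nonets.append(0x100 | octet)
        nonets.append(octets[-1])
    return nonets


def utf9_decode(nonets):
    """Decode a list of nonets produced by utf9_encode back to a string."""
    chars = []
    cp = 0
    pending = False
    for n in nonets:
        cp = (cp << 8) | (n & 0xFF)
        if n & 0x100:
            pending = True          # more nonets follow for this code point
        else:
            chars.append(chr(cp))   # last nonet of the sequence
            cp = 0
            pending = False
    if pending:
        raise ValueError("truncated UTF-9 sequence")
    return "".join(chars)
```

Under these assumptions, "A" (U+0041) becomes the single nonet 101 (octal), U+0391 becomes 403 221, and U+10330 becomes 401 403 060.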

UTF-18 is a fixed-length encoding using an 18-bit integer per code point. This allows representation of 4 planes, which are mapped to the 4 planes that Unicode used when the RFC was published (planes 0-2 and 14). This means that the two private use planes (15 and 16) and the planes that were unused at that time (3-13) are not supported. The UTF-18 specification does not say why surrogates were not allowed for these code points, though when discussing UTF-16 earlier, the RFC says "This transformation format requires complex surrogates to represent code points outside the BMP". After complaining about their complexity, it would have looked a bit hypocritical to use surrogates in the new standard. When the RFC was written, it seemed unlikely that planes 3-13 would be assigned in the foreseeable future. Thus, UTF-18, like UCS-2 and UCS-4, guarantees a fixed width for all characters it supports (although not for all glyphs).
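The plane mapping can be sketched as follows. The text above only says that four planes are representable; the sketch assumes that planes 0-2 map to themselves and that plane 14 (U+E0000 to U+EFFFF) is moved into the fourth slot (0x30000 to 0x3FFFF), which follows a reading of RFC 4042. The function names are hypothetical, and each 18-bit value is modelled as a plain integer.

```python
PLANE14_OFFSET = 0xE0000 - 0x30000  # assumed shift of plane 14 into slot 3


def utf18_encode(text):
    """Encode a string as a list of 18-bit integers, one per code point."""
    values = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x2FFFF:
            values.append(cp)                    # planes 0-2 map directly
        elif 0xE0000 <= cp <= 0xEFFFF:
            values.append(cp - PLANE14_OFFSET)   # plane 14 -> 0x30000..0x3FFFF
        else:
            raise ValueError(f"U+{cp:04X} is not representable in UTF-18")
    return values


def utf18_decode(values):
    """Decode a list of 18-bit integers back to a string."""
    chars = []
    for v in values:
        if not 0 <= v <= 0x3FFFF:
            raise ValueError("value does not fit in 18 bits")
        chars.append(chr(v if v < 0x30000 else v + PLANE14_OFFSET))
    return "".join(chars)
```

A code point in planes 3-13, 15 or 16 raises `ValueError` here, which mirrors the limitation described above.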

Problems

Both specifications suffer from the problem that standard communication protocols are built around octets rather than nonets, and so it would not be possible to exchange text in these formats without further encoding or specially designed protocols. This alone would probably be sufficient reason to consider their use impractical in most cases. However, this would be less of a problem with pure bit-stream communication protocols.
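As an illustration of the kind of "further encoding" that octet-based transport would need, the sketch below packs a sequence of nonets into bytes as a continuous bit stream and unpacks it again. This layout is purely hypothetical (big-endian bit order, zero padding in the final octet) and is not defined by RFC 4042; the function names are likewise made up for the example.

```python
def pack_nonets(nonets):
    """Pack 9-bit values into bytes, most significant bit first."""
    out = bytearray()
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently in the accumulator
    for n in nonets:
        bits = (bits << 9) | (n & 0x1FF)
        nbits += 9
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1   # keep only the bits not yet written
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the last octet
    return bytes(out)


def unpack_nonets(data):
    """Recover the nonets from bytes produced by pack_nonets."""
    total = len(data) * 8
    count = total // 9   # padding is always shorter than one nonet
    bits = int.from_bytes(data, "big")
    return [(bits >> (total - 9 * (i + 1))) & 0x1FF for i in range(count)]
```

Eight nonets fill exactly nine octets, so a transport using this layout would spend 12.5% more octets than nonets, plus any padding.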

Furthermore, both UTF-9 and UTF-18 have specific problems of their own. UTF-9 requires special care when searching, as a shorter sequence can be found at the end of a longer sequence. This means that it is necessary to search backwards from a candidate match in order to find the start of the sequence. UTF-18 cannot represent all Unicode code points (although unlike UCS-2 it can represent all the planes that had non-private use code point assignments when the RFC was published), making it a bad choice for a system that may need to support new languages (or rare CJK ideographs that are added after the SIP fills up) in the future.
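The UTF-9 search problem can be made concrete with a small sketch. It assumes, as a reading of the scheme rather than a quotation from the RFC, that the continuation bit (0x100) is set on every nonet except the last of a sequence, and that the search pattern is itself a complete encoded sequence. The function name `find_utf9` is hypothetical, and nonets are modelled as integers in a list.

```python
def find_utf9(haystack, needle):
    """Return the indices where needle starts on a character boundary."""
    hits = []
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] != needle:
            continue
        # Look backwards: if the preceding nonet has its continuation bit
        # set, this match is only the tail of a longer sequence.
        if i > 0 and haystack[i - 1] & 0x100:
            continue
        hits.append(i)
    return hits


# U+E0041 is 416 400 101 (octal); "A" (U+0041) is the single nonet 101.
text = [0o416, 0o400, 0o101, 0o101]
naive = [i for i, n in enumerate(text) if n == 0o101]
checked = find_utf9(text, [0o101])
```

Here the naive scan reports positions 2 and 3, while the boundary check keeps only position 3, because the nonet at position 2 is the final nonet of U+E0041.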

External links

  • RFC 4042: UTF-9 and UTF-18 Efficient Transformation Formats of Unicode
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 