UTF-8 - AbsoluteAstronomy.com

UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all e‑mail programs be able to display and create mail using UTF-8. UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.

UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed “octets” in the Unicode Standard). Code points with lower numerical values (i.e., earlier code positions in the Unicode character set, which tend to occur more frequently in practice) are encoded using fewer bytes, making the encoding scheme reasonably efficient. In particular, the first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as the corresponding ASCII character, making valid ASCII text valid UTF-8-encoded Unicode text as well.
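The two properties described above can be checked with Python's built-in codecs. This is only an illustrative sketch; the sample string and the variable names are arbitrary choices.

```python
# ASCII text produces exactly the same bytes under ASCII and UTF-8.
text = "Hello, world"
assert text.encode("ascii") == text.encode("utf-8")

# Characters with higher code points need more bytes (one to four).
byte_counts = {ch: len(ch.encode("utf-8")) for ch in ["$", "¢", "€", "𤭢"]}
print(byte_counts)
```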

The official IANA code for the UTF-8 character encoding is UTF-8.

History

By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte-stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.

In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only bytes where the high bit was set.

In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Labs then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string to find code point boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.

UTF-8 was first officially presented at the USENIX conference in San Diego, held January 25–29, 1993.

The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003 UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.

Design

The design of UTF‑8 as originally proposed by Dave Prosser and subsequently modified by Ken Thompson was intended to satisfy two objectives:

To be backward-compatible with ASCII; and
To enable encoding of up to at least 2³¹ characters (the theoretical limit of the first draft proposal for the Universal Character Set).

Being backward-compatible with ASCII implied that every valid ASCII character (a 7-bit character set) also be a valid UTF‑8 character sequence, specifically, a one-byte UTF‑8 character sequence whose binary value equals that of the corresponding ASCII character:

Bits	Last code point	Byte 1
7	U+007F	0xxxxxxx

Prosser’s and Thompson’s challenge was to extend this scheme to handle code points with up to 31 bits. The solution proposed by Prosser as subsequently modified by Thompson was as follows:

Bits	Last code point	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
7	U+007F	0xxxxxxx
11	U+07FF	110xxxxx	10xxxxxx
16	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
21	U+1FFFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
26	U+3FFFFFF	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
31	U+7FFFFFFF	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx

The salient features of the above scheme are as follows:

Every valid ASCII character is also a valid UTF‑8 encoded Unicode character with the same binary value. (Thus, valid ASCII text is also valid UTF‑8-encoded Unicode text.)
For every UTF‑8 byte sequence corresponding to a single Unicode character, the first byte unambiguously indicates the length of the sequence in bytes.
All continuation bytes (byte nos. 2–6 in the table above) have 10 as their two most-significant bits (bits 7–6); in contrast, the first byte never has 10 as its two most-significant bits. As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF‑8 stream represents the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a byte sequence.
As a consequence of no. 3 above, starting with any arbitrary byte anywhere in a (valid) UTF‑8 stream, it is necessary to back up by at most five bytes in order to get to the beginning of the byte sequence corresponding to a single character (three bytes in actual UTF‑8 as explained in the next section). If it is not possible to back up, or a byte is missing because of e.g. a communication failure, a single character can be discarded and the next character read correctly.
Starting with the second row in the table above (two bytes), every additional byte extends the maximum number of bits by five (six additional bits from the additional continuation byte, minus one bit lost in the first byte).
Prosser’s and Thompson’s scheme was sufficiently general to be extended beyond 6-byte sequences (however, this would have allowed FE or FF bytes to occur in valid UTF-8 text — see under Advantages in section "Compared to single byte encodings" below — and indefinite extension would lose the desirable feature that the length of a sequence can be determined from the start byte only).
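
Two of the features listed above, that the first byte gives the sequence length and that a reader can resynchronize from any byte, can be sketched in a few lines of Python. The function names `sequence_length` and `char_start` are hypothetical and chosen only for this illustration; the sketch follows the original scheme of up to six bytes and assumes valid input.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte (1 to 6)."""
    if first_byte < 0x80:
        return 1  # 0xxxxxxx: single-byte (ASCII) sequence
    if (first_byte & 0xC0) == 0x80:
        raise ValueError("continuation byte, not a lead byte")
    # Count the leading 1 bits: 110xxxxx -> 2, 1110xxxx -> 3, and so on.
    length, mask = 0, 0x80
    while first_byte & mask:
        length += 1
        mask >>= 1
    if length > 6:
        raise ValueError("FE and FF never appear in this scheme")
    return length


def char_start(data: bytes, i: int) -> int:
    """Back up from index i to the lead byte of the sequence containing it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "a€b".encode("utf-8")  # bytes 61 E2 82 AC 62
start = char_start(data, 3)   # index 3 is the last byte of '€'
print(start, sequence_length(data[start]))
```

Because continuation bytes are recognizable on their own, `char_start` never needs to look at anything before the current character.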

Description

UTF-8 is a variable-width encoding, with each character represented by one to four bytes. If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern. The remaining bits in the byte sequence are concatenated to form the Unicode code point value (in the range 0x80 to 0x10FFFF). Thus a byte with lead bit '0' is a single-byte code, a byte with multiple leading '1' bits is the first of a multi-byte sequence, and a byte with a leading "10" bit pattern is a continuation byte of a multi-byte sequence. The format of the bytes thus allows the beginning of each sequence to be detected without decoding from the beginning of the string. UTF-16 limits Unicode to 0x10FFFF; therefore UTF-8 is not defined beyond that value, even if it could easily be defined to reach 0x7FFFFFFF.

Code point range	Binary code point	UTF-8 bytes	Example
`U+0000` to `U+007F`	`0xxxxxxx`	`0xxxxxxx`	character '$' = code point `U+0024` = `00100100` → `00100100` → hexadecimal `24`
`U+0080` to `U+07FF`	`00000yyy xxxxxxxx`	`110yyyxx 10xxxxxx`	character '¢' = code point `U+00A2` = `00000000 10100010` → `11000010 10100010` → hexadecimal `C2 A2`
`U+0800` to `U+FFFF`	`yyyyyyyy xxxxxxxx`	`1110yyyy 10yyyyxx 10xxxxxx`	character '€' = code point `U+20AC` = `00100000 10101100` → `11100010 10000010 10101100` → hexadecimal `E2 82 AC`
`U+010000` to `U+10FFFF`	`000zzzzz yyyyyyyy xxxxxxxx`	`11110zzz 10zzyyyy 10yyyyxx 10xxxxxx`	character '𤭢' = code point `U+024B62` = `00000010 01001011 01100010` → `11110000 10100100 10101101 10100010` → hexadecimal `F0 A4 AD A2`
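
The bit layout in the table above can be turned into a small encoder. The following is a minimal sketch for illustration; the function name `encode_code_point` is hypothetical, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (one to four bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The four example characters from the table.
for cp in (0x24, 0xA2, 0x20AC, 0x24B62):
    print(f"U+{cp:04X}", encode_code_point(cp).hex(" ").upper())
```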

So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.

Codepage layout

(The color-coded table of all 256 byte values is not reproduced here.)

Legend:
Separate cell colors mark control characters, punctuation, digits and ASCII letters.

Cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add.

Cells containing a large single-digit number are the start bytes for a sequence of that many bytes. The unbolded hexadecimal code point number shown in the cell is the lowest character value encoded using that start byte. When a start byte could form both overlong and valid encodings, the lowest non-overlong-encoded codepoint is shown, marked by an asterisk "*".

Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for overlong encoding of basic ASCII characters. The remaining red cells indicate start bytes of sequences that could only encode numbers larger than the 0x10FFFF limit of Unicode. The byte 244 (hex 0xF4) could also encode some values greater than 0x10FFFF; such a sequence is also invalid.

Invalid byte sequences

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:

the red invalid bytes in the above table
an unexpected continuation byte
a start byte not followed by enough continuation bytes
a sequence that decodes to a value that should use a shorter sequence (an "overlong form").
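
The cases listed above can be told apart by a small hand-written decoder. What follows is a minimal sketch rather than a production decoder: the function name `decode_one` and the error messages are illustrative choices, and it also rejects values above U+10FFFF and surrogate halves.

```python
def decode_one(data: bytes, i: int = 0) -> tuple[int, int]:
    """Decode the sequence starting at data[i]; return (code point, length).

    Raises ValueError naming the kind of malformation found.
    """
    b0 = data[i]
    if b0 < 0x80:
        return b0, 1
    if b0 < 0xC0:
        raise ValueError("unexpected continuation byte")
    if b0 in (0xC0, 0xC1) or b0 > 0xF4:
        raise ValueError("invalid start byte")
    if b0 < 0xE0:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif b0 < 0xF0:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    else:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    tail = data[i + 1:i + length]
    if len(tail) < length - 1 or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("missing continuation byte")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)  # append six payload bits per byte
    if cp < minimum:
        raise ValueError("overlong form")
    if cp > 0x10FFFF:
        raise ValueError("code point above U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate half")
    return cp, length


print(decode_one(b"\xe2\x82\xac"))  # the euro sign, three bytes
```

The `minimum` check is what rejects overlong forms such as E0 80 80, which would otherwise decode to U+0000.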

Many earlier decoders would happily try to decode these. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes. Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container.

RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Many UTF-8 decoders throw exceptions on encountering errors, since such errors suggest the input is not a UTF-8 string at all. This can turn what would otherwise be harmless errors (producing a message such as "no such file") into a denial of service bug. For instance Python 3.0 would exit immediately if the command line contained invalid UTF-8, so it was impossible to write a Python program that could handle such input.

An increasingly popular option is to detect errors with a separate API, and for converters to translate the first byte to a replacement and continue parsing with the next byte. Popular replacements are:

The replacement character '�' (U+FFFD)
The symbol for substitute '␦' (U+2426) (ISO 2047)
The '?' or '¿' character (U+003F or U+00BF)
The invalid Unicode code points U+DC80..U+DCFF where the low 8 bits are the byte's value.
Interpret the bytes according to another encoding (often ISO-8859-1 or CP1252).

Replacing errors is "lossy": more than one UTF-8 string converts to the same Unicode result. Therefore the original UTF-8 should be stored, and translation should only be used when displaying the text to the user.
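
Several of the replacement strategies above correspond to error handlers in Python's standard codecs, so they can be tried directly. This is a sketch for illustration only; the sample bytes and the helper name `decode_with_fallback` are assumptions made for this example, and Python's exact choice of how many bytes each replacement covers may differ from other decoders.

```python
bad = b"caf\xe9"  # "café" in ISO-8859-1; the final byte is not valid UTF-8

# Substitute the replacement character U+FFFD for the undecodable byte.
replaced = bad.decode("utf-8", errors="replace")

# Map the undecodable byte 0xE9 to the lone surrogate U+DCE9 instead.
escaped = bad.decode("utf-8", errors="surrogateescape")
# This mapping can be reversed to recover the original bytes.
restored = escaped.encode("utf-8", errors="surrogateescape")


def decode_with_fallback(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Try UTF-8 first; on failure interpret the bytes in another encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


print(repr(replaced), repr(escaped), decode_with_fallback(bad))
```

Of the three, only the U+DC80..U+DCFF mapping is reversible, which is why `restored` equals the original bytes.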

Invalid code points

UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.

Whether an actual application should do this with surrogate halves is debatable. Allowing them allows lossless storage of invalid UTF-16, and allows CESU encoding (described below) to be decoded. There are other code points that are far more important to detect and reject, such as the reversed-BOM U+FFFE, or the C1 controls, caused by improper conversion of CP1252 text or double-encoding of UTF-8. These are invalid in HTML.
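
Both positions on surrogate halves can be observed in Python, whose strict UTF-8 codec rejects them while the `surrogatepass` error handler lets them through. A short sketch, with arbitrary variable names:

```python
lone_surrogate = "\ud800"  # a high surrogate half on its own

# Strict behaviour: encoding a surrogate half is an error.
try:
    lone_surrogate.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Strict behaviour: the three-byte form of U+D800 is rejected on decoding.
try:
    b"\xed\xa0\x80".decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# Lenient behaviour: allow surrogate halves so invalid UTF-16 survives.
lenient = lone_surrogate.encode("utf-8", errors="surrogatepass")
print(strict_encode_ok, strict_decode_ok, lenient.hex(" "))
```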

Official name and variants

The official name is "UTF-8". All letters are upper-case, and the name is hyphenated. This spelling is used in all the documents relating to the encoding.

Alternatively, the name "utf-8" may be used by all standards conforming to the Internet Assigned Numbers Authority (IANA) list (which include CSS, HTML, XML, and HTTP headers), as the declaration is case insensitive.

Other descriptions that omit the hyphen or replace it with a space, such as "utf8" or "UTF 8", are not accepted as correct by the governing standards. Despite this, most agents such as browsers can understand them, and so standards intended to describe existing practice (such as HTML5) may effectively require their recognition.

MySQL omits the hyphen in the following query:
SET NAMES 'utf8'

Derivatives

The following implementations differ slightly from the UTF-8 specification and are therefore incompatible with it.

CESU-8

Many pieces of software added UTF-8 conversions for UCS-2 data and did not alter their UTF-8 conversion when UCS-2 was replaced with the surrogate-pair-supporting UTF-16. The result is that each half of a UTF-16 surrogate pair is encoded as its own 3-byte UTF-8 encoding, resulting in 6-byte sequences rather than 4 for characters outside the Basic Multilingual Plane. Oracle databases use this, as well as Java and Tcl as described below, and probably a great deal of other Windows software where the programmers were unaware of the complexities of UTF-16. Although most usage is by accident, a supposed benefit is that this preserves UTF-16 binary sorting order when CESU-8 is binary sorted.
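
The effect described above can be reproduced by encoding each UTF-16 code unit separately. The sketch below is illustrative only; `cesu8_encode` is a hypothetical helper, not a function of any library mentioned in this article.

```python
import struct


def cesu8_encode(text: str) -> bytes:
    """Encode text the CESU-8 way: one UTF-8 sequence per UTF-16 code unit."""
    utf16 = text.encode("utf-16-be", errors="surrogatepass")
    out = bytearray()
    for (unit,) in struct.iter_unpack(">H", utf16):
        # Surrogate halves are encoded as if they were ordinary code points.
        out += chr(unit).encode("utf-8", errors="surrogatepass")
    return bytes(out)


ch = "𤭢"  # U+24B62, outside the Basic Multilingual Plane
print(ch.encode("utf-8").hex(" "))  # standard UTF-8: four bytes
print(cesu8_encode(ch).hex(" "))    # CESU-8: six bytes
```

Text that stays within the Basic Multilingual Plane comes out identical under both encodings, which is why the difference often goes unnoticed.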

Modified UTF-8

In Modified UTF-8, the null character (U+0000) is encoded as 0xC0,0x80; this is not valid UTF-8 because it is not the shortest possible representation. Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000, which allows such strings (with a null byte appended) to be processed by traditional null-terminated string functions.

All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.
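
The null-character rule can be illustrated on its own. This sketch covers only that rule and assumes text within the Basic Multilingual Plane, so the surrogate-pair handling mentioned above is left out; the variable names are arbitrary.

```python
text = "a\x00b"

standard = text.encode("utf-8")  # contains a real 0x00 byte
# In valid UTF-8 the byte 0x00 only ever encodes U+0000, so a plain
# byte replacement is enough to apply the Modified UTF-8 rule.
modified = standard.replace(b"\x00", b"\xc0\x80")

# A strict UTF-8 decoder rejects the two-byte form as overlong.
try:
    modified.decode("utf-8")
    accepted_by_strict_decoder = True
except UnicodeDecodeError:
    accepted_by_strict_decoder = False

print(standard.hex(" "), "|", modified.hex(" "))
```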

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings. However it uses Modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constant strings in class files. Tcl also uses the same modified UTF-8 as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.

Byte order mark

Many Windows programs (including Windows Notepad) add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte order mark (BOM), and is commonly referred to as a UTF-8 BOM, even though it is not relevant to byte order. The BOM can also appear if another encoding with a BOM is translated to UTF-8 without stripping it.

The presence of the UTF-8 BOM may cause interoperability problems with existing software that could otherwise handle UTF-8; for example:

Older text editors may display the BOM as "ï»¿" at the start of the document, even if the UTF-8 file contains only ASCII and would otherwise display correctly.
Programming language parsers not explicitly designed for UTF-8 can often handle UTF-8 in string constants and comments, but cannot parse the BOM at the start of the file.
Programs that identify file types by leading characters may fail to identify the file if a BOM is present even if the user of the file could skip the BOM. Or conversely they will identify the file when the user cannot handle the BOM. An example is the Unix shebang syntax.
Programs that insert information at the start of a file will result in a file with the BOM somewhere in the middle of it (this is also a problem with the UTF-16 BOM). One example is offline browsers that add the originating URL to the start of the file.

If compatibility with existing programs is not important, the BOM could be used to identify whether a file is in UTF-8 rather than a legacy encoding, but this is still problematic: there are many instances where the BOM is added or removed without actually changing the encoding, or where text in various encodings is concatenated. Checking whether the text is valid UTF-8 is more reliable than relying on a BOM. The Unicode Standard neither requires nor recommends the use of a BOM for UTF-8.
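
The two checks discussed above can be sketched in Python. This is a minimal illustration rather than a definitive implementation; the function names strip_utf8_bom and is_valid_utf8 are hypothetical, and only the standard codecs module is assumed.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def is_valid_utf8(data: bytes) -> bool:
    """Test validity with a strict decode instead of trusting a BOM."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A script saved with a BOM no longer starts with the shebang bytes.
script = codecs.BOM_UTF8 + b"#!/bin/sh\n"
print(script.startswith(b"#!"))                  # False
print(strip_utf8_bom(script).startswith(b"#!"))  # True

print(is_valid_utf8(b"caf\xc3\xa9"))  # True: valid UTF-8
print(is_valid_utf8(b"caf\xe9"))      # False: a lone Latin-1 byte
```

A reader that strips the BOM on input and never writes one on output avoids most of the interoperability problems listed above.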

Advantages

The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take byte strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than to any other Unicode encoding.
UTF-8 is the only encoding for XML entities that does not require a BOM or an indication of the encoding.
UTF-8 and UTF-16 are the standard encodings for Unicode text in HTML documents, with UTF-8 as the preferred and most used encoding.
UTF-8 strings can be fairly reliably recognized as such by a simple heuristic algorithm. The chance of a random string of bytes being valid UTF-8 and not pure ASCII is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence. ISO/IEC 8859-1 is even less likely to be mis-recognized as UTF-8: the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol. This is an advantage that most other encodings do not have, causing errors (mojibake) if the receiving application isn't told and can't guess the correct encoding. Even word-based UTF-16 can be mistaken for byte encodings (as in the "Bush hid the facts" bug).
Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.
Other byte-based encodings can pass through the same API. This means, however, that the encoding must be identified. Because the other encodings are unlikely to be valid UTF-8, a reliable way to implement this is to assume UTF-8 and switch to a legacy encoding only if several invalid UTF-8 byte sequences are encountered.
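
Two of the points above, the "assume UTF-8, fall back to a legacy encoding" strategy and the byte-order sorting property, can be sketched in Python. The helper names, the threshold of three errors and the choice of Latin-1 as the legacy encoding are assumptions made for illustration only.

```python
def count_invalid_utf8(data: bytes) -> int:
    """Count the places where a strict UTF-8 decoder reports an error."""
    errors, pos = 0, 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as exc:
            errors += 1
            pos += exc.end  # exc.end is relative to the slice; skip the bad bytes
    return errors


def decode_guess(data: bytes, legacy: str = "latin-1", threshold: int = 3) -> str:
    """Assume UTF-8; switch to the legacy encoding only after several errors."""
    if count_invalid_utf8(data) >= threshold:
        return data.decode(legacy, errors="replace")
    return data.decode("utf-8", errors="replace")


print(decode_guess("na\u00efve".encode("utf-8")))           # decoded as UTF-8
print(decode_guess("\u00e9 \u00e8 \u00e0".encode("latin-1")))  # falls back to Latin-1

# Sorting by raw bytes matches sorting by code point (Python compares str by code point).
words = ["z", "\u00e9", "\u20ac", "a", "\U0001F600"]
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_code_points = sorted(words)
print(by_bytes == by_code_points)  # True
```

The repeated slicing keeps the sketch short but is quadratic in the worst case; a production detector would scan the buffer once.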

Disadvantages

A UTF-8 parser that is not compliant with current versions of the standard might accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.
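
A well-known instance of this problem is the "overlong" form, where a character is encoded in more bytes than necessary. The sketch below uses a deliberately unsafe, hypothetical helper (naive_decode_2byte) to show how the bytes 0xC0 0xAF slip past a byte-level check for "/" and still decode to "/"; it is an illustration, not a description of any particular parser.

```python
def naive_decode_2byte(b0: int, b1: int) -> str:
    """Decode a two-byte sequence WITHOUT checking for the shortest form (unsafe)."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def strict_decoder_rejects(data: bytes) -> bool:
    """Return True if Python's standards-compliant decoder refuses the input."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


overlong_slash = bytes([0xC0, 0xAF])  # invalid two-byte form of "/" (U+002F)

print(b"/" in overlong_slash)                  # False: a byte-level filter sees no slash
print(naive_decode_2byte(*overlong_slash))     # "/": the lenient decoder produces one
print(strict_decoder_rejects(overlong_slash))  # True: a compliant decoder rejects it
```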

Advantages compared to single-byte encodings

UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time. For many languages there has been more than one single-byte encoding in usage, so even knowing the language was insufficient information to display it correctly.
The bytes 0xFE and 0xFF do not appear, so a valid UTF-8 stream never matches the UTF-16 byte order mark and thus cannot be confused with it. The absence of 0xFF (\377) also eliminates the need to escape this byte in Telnet (and FTP control connection).

Disadvantages compared to single-byte encodings

UTF-8 encoded text is larger than the appropriate single-byte encoding except for plain ASCII characters. In the case of languages which used 8-bit character sets with non-Latin alphabets encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), letters in UTF-8 will be double the size. For some languages such as Thai and Hindi's Devanagari, letters will be triple the size (this has caused objections in India and other countries).
It is possible in UTF-8 (or any other multi-byte encoding) to split a string in the middle of a character, which may result in an invalid string if the pieces are not concatenated later.
If the code points are all the same size, measuring a fixed number of them is easy. Due to ASCII-era documentation where "character" is used as a synonym for "byte", this is often considered important. However, by measuring string positions using bytes instead of "characters", most algorithms can be easily and efficiently adapted for UTF-8.
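
The size difference and the splitting hazard can be illustrated with a short Python sketch. The sample strings, the choice of KOI8-R and TIS-620 as the single-byte encodings, and the helper truncate_utf8 are assumptions for illustration; the helper also assumes its input is valid UTF-8.

```python
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet" in Cyrillic, 6 letters
thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"     # a 6-code-point Thai greeting

print(len(russian.encode("koi8_r")), len(russian.encode("utf-8")))  # 6 12
print(len(thai.encode("tis_620")), len(thai.encode("utf-8")))       # 6 18


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the form 10xxxxxx; back up until the byte just
    # after the cut is the start of a character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = russian.encode("utf-8")
print(truncate_utf8(encoded, 5).decode("utf-8"))  # first two letters only
```

Because positions are measured in bytes and the cut is moved back to a character boundary, the truncated result stays valid UTF-8, whereas a plain encoded[:5] would end with half a character.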

Advantages compared to other multi-byte encodings

UTF-8 uses the codes 0-127 only for the ASCII characters. This means that UTF-8 is an ASCII extension and can with limited change be supported by software that supports an ASCII extension and handles non-ASCII characters as free text.
UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance, Chinese and Arabic can be supported (in the same text) without special codes inserted or manual settings to switch the encoding.
UTF-8 is "self-synchronizing": character boundaries are easily found when searching either forwards or backwards. If bytes are lost due to error or corruption, one can always locate the beginning of the next character and thus limit the damage. Many multi-byte encodings are much harder to resynchronize.
Any byte-oriented string searching algorithm can be used with UTF-8 data, since the sequence of bytes for a character cannot occur anywhere else. Some older variable-length encodings (such as Shift JIS) did not have this property and thus made string-matching algorithms rather complicated. In Shift JIS the end byte of a character and the first byte of the next character could look like another legal character, something that can't happen in UTF-8.
UTF-8 is efficient to encode using simple bit operations. It does not require slower mathematical operations such as multiplication or division (unlike the obsolete UTF-1 encoding).
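
The bit-level encoding and the self-synchronizing property can be sketched as follows. The function names encode_code_point and next_boundary are hypothetical; the encoder follows the standard UTF-8 bit layout and uses only shifts and masks.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using only shifts and masks."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that starts a character."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


print(encode_code_point(0x20AC).hex())  # e282ac, the euro sign

# Lose the lead byte of the euro sign, then resynchronize on the next character.
damaged = b"a" + encode_code_point(0x20AC)[1:] + b"b"
print(damaged[next_boundary(damaged, 1):])  # b'b'
```

Resynchronizing only needs to skip bytes of the form 10xxxxxx, which is why damage stays limited to the character that lost bytes.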

Disadvantages compared to other multi-byte encodings

For certain languages UTF-8 will take more space than an older multi-byte encoding. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.

Advantages compared to UTF-16

A text byte stream cannot be losslessly converted to UTF-16, due to the possible presence of errors in the byte stream encoding. This causes unexpected and often severe problems when attempting to use existing data in a system that uses UTF-16 as an internal encoding. The results are security bugs, denial of service if a bad encoding throws an exception, and data loss when different byte streams convert to the same UTF-16. Due to the ASCII compatibility and high degree of pattern recognition in UTF-8, random byte streams can be passed losslessly through a system using it, as interpretation can be deferred until display.
Converting to UTF-16 while maintaining compatibility with existing programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated. Invalid encodings make the duplicated APIs not exactly map to each other, often making it impossible to do some action with one of them.
Characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken for the obsolete constant-length UCS-2 encoding, leading to code that works for most text but suddenly fails for non-BMP characters.
Text encoded in UTF-8 is often smaller than (or the same size as) the same text encoded in UTF-16.
- This is always true for text using only code points below U+0800 (which includes all modern European languages), as each code point's UTF-8 encoding is one or two bytes then.
- Even if text contains code points not below U+0800, it might contain so many code points below U+0080 (which UTF-8 encodes in one byte) that the UTF-8 encoding is still smaller. As HTML markup and line terminators are code points below U+0080, most HTML source is smaller if encoded in UTF-8 even for Asian scripts.
Most communication and storage was designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code unit:
- The order of those two bytes becomes an issue and must be specified in the UTF-16 protocol, such as with a byte order mark.
- If an odd number of bytes is missing from UTF-16, the whole rest of the string will be meaningless text. Any bytes missing from UTF-8 will still allow the text to be recovered accurately starting with the next character after the missing bytes. If any partial character is removed the corruption is always recognizable.
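
The size comparison and the effect of a lost byte can be tried out with the following Python sketch. The sample strings and the helper drop_byte are assumptions chosen for illustration.

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate corruption by removing a single byte."""
    return data[:index] + data[index + 1:]


# Size: three CJK characters alone versus the same characters inside markup.
plain = "\u65e5\u672c\u8a9e"
markup = "<p>" + plain + "</p>"
print(len(plain.encode("utf-8")), len(plain.encode("utf-16-le")))    # 9 6
print(len(markup.encode("utf-8")), len(markup.encode("utf-16-le")))  # 16 20

# Corruption: lose one byte from each encoding of the same text.
text = "h\u00e9llo"
utf8_result = drop_byte(text.encode("utf-8"), 1).decode("utf-8", errors="replace")
utf16_result = drop_byte(text.encode("utf-16-le"), 0).decode("utf-16-le", errors="replace")

print("llo" in utf8_result)   # True: only the damaged character is lost
print("llo" in utf16_result)  # False: every following code unit is misaligned
```

Even this small amount of ASCII markup is enough to make the UTF-8 form the smaller one, which matches the observation above about HTML source.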

Disadvantages compared to UTF-16

A simplistic parser for UTF-16 is unlikely to convert invalid sequences to ASCII. Since the dangerous characters in most situations are ASCII, a simplistic UTF-16 parser is much less dangerous than a simplistic UTF-8 parser.
Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters. This happens for pure text, but rarely for HTML documents. For example, both the Japanese UTF-8 and the Hindi Unicode articles on Wikipedia take more space in UTF-16 than in UTF-8.
In UCS-2 (but not UTF-16) Unicode code points are all the same size, making measurements of a fixed number of them easy. Due to ASCII-era documentation where "character" is used as a synonym for "byte", this is often considered important. Most UTF-16 implementations, including Windows, measure non-BMP characters as 2 units in UTF-16, as this is the only practical way to handle the strings. A similar variability in character size applies to UTF-8.
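
The variability in character size mentioned in the last point can be made concrete with a short Python sketch; the helper names and the sample string are illustrative assumptions.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s: str) -> int:
    """Number of bytes needed to store s in UTF-8."""
    return len(s.encode("utf-8"))


sample = "a\U0001F600"  # one ASCII letter plus one character outside the BMP

print(len(sample))               # 2 code points
print(utf16_code_units(sample))  # 3: the non-BMP character needs a surrogate pair
print(utf8_bytes(sample))        # 5: one byte plus four bytes
```

Code that assumes one unit per character is therefore wrong for both encodings; the error simply shows up far less often with UTF-16, because non-BMP characters are rare in most text.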

External links

There are several current definitions of UTF-8 in various standards documents:

RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
The Unicode Standard, Version 6.0, §3.9 D92, §3.10 D95 (2011)
ISO/IEC 10646:2003 Annex D (2003)

They supersede the definitions given in the following obsolete works:

ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
The Unicode Standard, Version 5.0, §3.9 D92, §3.10 D95 (2007)
The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
The Unicode Standard, Version 2.0, Appendix A (1996)
RFC 2044 (1996)
RFC 2279 (1998)
The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1: UTF-8 Shortest Form (2000)
Unicode Standard Annex #27: Unicode 3.1 (2001)

They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.

Original UTF-8 paper (or pdf) for Plan 9 from Bell Labs
RFC 5198 defines UTF-8 NFC for Network Interchange
UTF-8 test pages by Andreas Prilop, Jost Gippert and the World Wide Web Consortium
How to configure e-mail clients to send UTF-8 text
Unix/Linux: UTF-8/Unicode FAQ, Linux Unicode HOWTO, UTF-8 and Gentoo
The Unicode/UTF-8-character table displays UTF-8 in a variety of formats (with Unicode and HTML encoding information)
Unicode and Multilingual Web Browsers from Alan Wood's Unicode Resources describes support and additional configuration of Unicode/UTF-8 in modern browsers
JSP Wiki Browser Compatibility page details specific problems with UTF-8 in older browsers
Mathematical Symbols in Unicode
Graphical View of UTF-8 in ICU's Converter Explorer

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
