Unicode Specials - AbsoluteAstronomy.com

Specials is the name of a short Unicode block allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 codepoints, 5 are assigned as of Unicode 6.0:

U+FFF9 INTERLINEAR ANNOTATION ANCHOR, marks start of annotated text
U+FFFA INTERLINEAR ANNOTATION SEPARATOR, marks start of annotating text
U+FFFB INTERLINEAR ANNOTATION TERMINATOR, marks end of annotating text
U+FFFC OBJECT REPLACEMENT CHARACTER, placeholder in the text for another unspecified object, for example in a compound document
U+FFFD REPLACEMENT CHARACTER, used to replace an unknown or unprintable character
U+FFFE, not a character
U+FFFF, not a character
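As a rough illustration, Python's standard unicodedata module can list which codepoints in this block carry a character name. The helper name specials_names is an assumption made for this sketch, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def specials_names():
    """Map each named codepoint in U+FFF0..U+FFFF to its Unicode name.

    Hypothetical helper for illustration only.
    """
    names = {}
    for cp in range(0xFFF0, 0x10000):
        # unicodedata.name returns the given default (None) when a
        # codepoint has no name, e.g. unassigned ones and noncharacters.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            names[cp] = name
    return names

for cp, name in specials_names().items():
    print(f"U+{cp:04X} {name}")
```

U+FFFE and U+FFFF do not appear in the output because they have no character name.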

U+FFFE and U+FFFF are not unassigned in the usual sense; they are guaranteed never to be Unicode characters at all. They can be used to guess a text's encoding scheme, since any text containing them is by definition not correctly encoded Unicode text. U+FEFF is Unicode's byte-order mark, named "zero-width no-break space" (so that its inclusion in text goes unnoticed). If this character is read in the wrong byte order (for example, due to an endianness bug), it is read as U+FFFE, which is not a legal Unicode character.
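The byte-order check can be sketched in a few lines of Python. The function name guess_utf16_byte_order is hypothetical, and the sketch only looks at the first two bytes of a UTF-16 stream; it is not a complete encoding detector.

```python
from typing import Optional

def guess_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess UTF-16 byte order from a leading byte-order mark.

    Hypothetical helper: returns "big", "little", or None if no BOM is found.
    """
    first_two = data[:2]
    if first_two == b"\xfe\xff":   # U+FEFF stored big-endian
        return "big"
    if first_two == b"\xff\xfe":   # U+FEFF stored little-endian
        return "little"
    return None

# Reading a big-endian BOM with the wrong byte order yields 0xFFFE.
bom_bytes = (0xFEFF).to_bytes(2, "big")
misread = int.from_bytes(bom_bytes, "little")
print(hex(misread))  # 0xfffe
```

Because U+FFFE can never be a character, a reader that sees it at the start of a stream can conclude that the byte order is swapped. The bytes FF FE also begin the little-endian UTF-32 byte-order mark, so a fuller detector would need to look further ahead.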

Replacement character

The replacement character (often a black diamond with a white question mark, �) is a symbol found in the Unicode standard at codepoint U+FFFD in the Specials table. It is used to indicate problems when a system is not able to decode a stream of data to a correct symbol. It may be seen when a font does not contain a character, and it is also seen when the data is invalid and does not match any character:

Consider a text file containing the German word für in the ISO-8859-1 encoding, stored as the three bytes 0x66 0xFC 0x72. This file is now opened with a text editor that assumes the input is UTF-8. As the first byte (0x66) is within the range 0x00–0x7F, the UTF-8 decoder correctly interprets it as an f. The second byte (0xFC) is not a legal value for the start of any UTF-8 encoded character. A text editor could therefore replace the byte with the replacement character symbol to warn the user that something went wrong. The last byte (0x72) is also within the range 0x00–0x7F and can be decoded correctly. The whole string now displays like this: f�r.
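The same scenario can be reproduced with Python's built-in codecs. This is a minimal sketch, with Python's errors="replace" decoding mode standing in for the text editor described above.

```python
# The word "für" as ISO-8859-1 bytes: 0x66 0xFC 0x72.
raw = "f\u00fcr".encode("iso-8859-1")

# Decode as UTF-8, substituting U+FFFD for each undecodable byte.
shown = raw.decode("utf-8", errors="replace")

print(raw.hex(" "))                    # 66 fc 72
print([f"U+{ord(c):04X}" for c in shown])
```

With the default errors="strict", the same decode call raises UnicodeDecodeError instead of substituting a character.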

A poorly implemented text editor might save the replacement in UTF-8 form; the text file data will then look like this: 0x66 0xEF 0xBF 0xBD 0x72, which, if reopened as ISO-8859-1, will be displayed as fï¿½r (see mojibake). The replacement also destroys the original byte, making it impossible to recover what character was intended.
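A short Python sketch of this failure mode, under the assumption that the editor simply re-encodes the displayed string as UTF-8:

```python
# The string as displayed after the faulty decode: f, U+FFFD, r.
shown = "f\ufffdr"

# Saving it as UTF-8 writes the three-byte encoding of U+FFFD.
saved = shown.encode("utf-8")

# Reopening those bytes as ISO-8859-1 turns each byte into one character.
reopened = saved.decode("iso-8859-1")

print(saved.hex(" "))   # 66 ef bf bd 72
print(reopened)
```

Nothing in the saved bytes records that the original byte was 0xFC, which is why the intended character cannot be recovered.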

A better (but harder to implement) design is to preserve the original bytes, including the error, and only convert to the replacement character when displaying the text. Another alternative is to use a different replacement for each different error byte; one popular choice is the lone surrogate codepoints U+DC80 through U+DCFF, which are otherwise invalid in Unicode text.

Both of these approaches are seeing more common use in modern software. If a web page uses ISO-8859-1 (or Windows-1252) but specifies the encoding as UTF-8, most web browsers display all umlauts, ß's and some other characters in the higher range as � instead (since these bytes almost certainly form invalid UTF-8 sequences). Newer software (including new versions of Internet Explorer) now translates the erroneous bytes individually to characters in Windows-1252. This gives a more readable presentation of incorrectly sent pages.
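Both alternatives can be sketched in Python. The built-in surrogateescape error handler maps each undecodable byte to a codepoint in U+DC80 through U+DCFF. The handler name cp1252_fallback and its implementation are assumptions made for this illustration; they are not a description of how any particular browser works.

```python
import codecs

def cp1252_fallback(err):
    """Hypothetical error handler: decode each bad byte as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # errors="replace" covers the few byte values Windows-1252 leaves undefined.
    return bad.decode("cp1252", errors="replace"), err.end

codecs.register_error("cp1252_fallback", cp1252_fallback)

raw = "f\u00fcr".encode("iso-8859-1")   # bytes 0x66 0xFC 0x72

# Alternative 1: keep the bad byte as a lone surrogate (0xFC -> U+DCFC).
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

# Alternative 2: translate the bad byte individually via Windows-1252.
readable = raw.decode("utf-8", errors="cp1252_fallback")

print(restored == raw)   # True: the original bytes survive the round trip
print(readable)
```

The first approach keeps the data lossless at the cost of producing strings that are not valid Unicode text; the second produces readable text but guesses at the intended encoding.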


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.