Charset detection
Encyclopedia
Character encoding detection, charset detection, or code page detection is the process of heuristic
ally guessing the character encoding
of a series of bytes that represent text. This algorithm usually involves statistical analysis of byte patterns. This type of analysis can require frequency distribution of trigraphs of various languages encoded in each code page that will be detected. This process is not foolproof because it depends on statistical data; for example, some versions of the Windows operating system
would mis-detect the phrase "Bush hid the facts
" in ASCII as Chinese UTF-16LE.
One of the few cases where charset detection works reliably is detecting UTF-8
. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding.
Due to the unreliability of charset detection, it is usually better to properly label datasets with the correct encoding. For example, HTML documents can declare their encoding in a
Alternatively, when documents are conveyed through HTTP, the same metadata can be conveyed out-of-band
using the Content-type header.
Heuristic
Heuristic refers to experience-based techniques for problem solving, learning, and discovery. Heuristic methods are used to speed up the process of finding a satisfactory solution, where an exhaustive search is impractical...
ally guessing the character encoding
Character encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...
of a series of bytes that represent text. This algorithm usually involves statistical analysis of byte patterns. This type of analysis can require frequency distribution of trigraphs of various languages encoded in each code page that will be detected. This process is not foolproof because it depends on statistical data; for example, some versions of the Windows operating system
Microsoft Windows
Microsoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...
would mis-detect the phrase "Bush hid the facts
Bush hid the facts
Bush hid the facts is a common name for a bug present in the function IsTextUnicode of Microsoft Windows, which causes a file of text encoded in Windows-1252 or similar encoding to be interpreted as if it were UTF-16LE, resulting in mojibake...
" in ASCII as Chinese UTF-16LE.
One of the few cases where charset detection works reliably is detecting UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...
. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding.
Due to the unreliability of charset detection, it is usually better to properly label datasets with the correct encoding. For example, HTML documents can declare their encoding in a
meta
element, thus:Alternatively, when documents are conveyed through HTTP, the same metadata can be conveyed out-of-band
Out-of-band
The term out-of-band has different uses in communications and telecommunication. In case of out-of-band control signaling, signaling bits are sent in special order in a dedicated signaling frame...
using the Content-type header.
See also
- International Components for UnicodeInternational Components for UnicodeInternational Components for Unicode is an open source project of mature C/C++ and Java libraries for Unicode support, software internationalization and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all...
- A library that can perform charset detection.