Character encodings in HTML
HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.
Specifying the document's character encoding
There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:
Content-Type: text/html; charset=ISO-8859-1
For HTML it is possible to include this information inside the head element near the top of the document:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
HTML5 also allows the following syntax to mean exactly the same:
<meta charset="ISO-8859-1">
XHTML documents have a third option: to express the character encoding via the XML declaration, as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<meta http-equiv="Content-Type"> may be interpreted directly by a browser, like an ordinary HTML tag, or it may be used by the HTTP server to generate corresponding headers when it serves the document. Under the HTTP/1.1 specification, an HTML document should be labeled with an appropriate encoding in the Content-Type header; a missing charset= parameter is treated as ISO-8859-1 (so HTTP/1.1 formally offers no such option as an unspecified character encoding), and the header supersedes any HTML (or XHTML) meta element declaration. This can pose a problem if the server generates an incorrect header and the author does not have the access or the knowledge to change it.
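The precedence described above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the regular expression is a rough approximation of what a real HTML parser does, and the 1024-byte scan window is an assumed limit rather than something stated in this article.

```python
import re
from email.message import Message
from typing import Optional

# Rough pattern for a charset declared in a meta element (illustrative only;
# real browsers use a proper prescan algorithm, not a regular expression).
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if present."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased name, or None when absent


def charset_from_meta(body: bytes) -> Optional[str]:
    """Look for a meta charset declaration near the top of the document."""
    match = META_CHARSET.search(body[:1024])  # assumed scan window
    return match.group(1).decode("ascii").lower() if match else None


def declared_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """The HTTP header wins; the meta element is consulted only without it."""
    return charset_from_header(content_type) or charset_from_meta(body)
```

With this sketch, a response whose header says charset=ISO-8859-1 is reported as iso-8859-1 even if the document itself contains a meta element naming UTF-8, which is exactly the situation that causes trouble when the server's header is wrong.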
As each of these methods explains to the receiver how the file being sent should be interpreted, it would be inappropriate for these declarations not to match the actual character encoding used. Because a server usually cannot know how a document is encoded, especially if documents are created on different platforms or in different regions, many servers simply do not include a reference to the "charset" in the Content-Type header, thus avoiding making false promises. However, if the document does not specify the encoding either, this may result in the equally bad situation where the user agent displays mojibake because it cannot find out which character encoding was used. Because servers across the Internet have widely and persistently neglected the HTTP charset= parameter, the World Wide Web Consortium became disappointed with HTTP/1.1's strict approach and encouraged browser developers to use some fixes in violation of RFC 2616.
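As a small illustration of how mojibake arises, the following Python sketch encodes text in one encoding and decodes it under a different assumption. The helper name misread is hypothetical; only the standard str.encode and bytes.decode methods are used.

```python
def misread(text: str, actual: str, assumed: str) -> str:
    """Encode text as `actual`, then decode the bytes as if they were `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")


# UTF-8 bytes read as Windows-1252: each byte of a multi-byte sequence
# becomes its own (wrong) character.
utf8_as_cp1252 = misread("café", "utf-8", "cp1252")      # "cafÃ©"

# ISO-8859-1 bytes read as UTF-8: the lone byte 0xE9 is not valid UTF-8,
# so the decoder substitutes U+FFFD (the replacement character).
latin1_as_utf8 = misread("café", "iso-8859-1", "utf-8")  # "caf\ufffd"
```

The ASCII letters survive in both directions, which is why such mislabeling often goes unnoticed on pages that are mostly English text.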
If a user agent reads a document with no character encoding information, it can fall back to using some other information. For example, it can rely on the user's settings, either browser-wide or specific to a given document, or it can pick a default encoding based on the user's language. For Western European languages, it is typical and fairly safe to assume Windows-1252, which is similar to ISO-8859-1 but has printable characters in place of some control codes. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In CJK environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an incorrect charset label manually.
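The fallback order described above might be sketched as follows. The function names and the choice of windows-1252 as the locale default are assumptions made for illustration; actual browsers apply more elaborate rules, including auto-detection, which is omitted here.

```python
from typing import Optional


def choose_encoding(
    declared: Optional[str],
    user_override: Optional[str] = None,
    locale_default: str = "windows-1252",  # assumed Western European default
) -> str:
    """Pick an encoding: manual override, then declaration, then a default."""
    return user_override or declared or locale_default


def decode_page(
    raw: bytes,
    declared: Optional[str] = None,
    user_override: Optional[str] = None,
) -> str:
    """Decode page bytes with the chosen encoding, replacing invalid bytes."""
    return raw.decode(choose_encoding(declared, user_override), errors="replace")


# Bytes 0x93 and 0x94 are curly quotes in Windows-1252 but control codes
# in ISO-8859-1, which is the difference the text refers to.
page = b"\x93quoted\x94"
as_default = decode_page(page)                       # uses windows-1252
as_latin1 = decode_page(page, declared="iso-8859-1")
```

Here as_default contains typographic quotation marks, while as_latin1 contains the invisible control characters U+0093 and U+0094 in their place.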
It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
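The efficiency point can be checked with a short Python sketch that measures the encoded size of a string in each encoding. The function name is hypothetical, and the little-endian variants are used only so that no byte order mark is counted.

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Return the number of bytes needed to store text in each encoding."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Markup is almost entirely ASCII, so UTF-8 needs one byte per character,
# UTF-16 two, and UTF-32 four.
markup_sizes = encoded_sizes("<p>Hi</p>")
# A non-ASCII letter such as λ takes two bytes in UTF-8 and in UTF-16.
lambda_sizes = encoded_sizes("λ")
```

For the nine-character markup sample the sizes are 9, 18 and 36 bytes respectively, which shows why UTF-8 is the more compact choice for typical HTML.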
Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
Character references
In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.
HTML character references
Numeric character references can be in decimal format, &#DD;, where DD is a variable number of decimal digits. Similarly there is a hexadecimal format, &#xHHHH;, where HHHH is a variable number of hexadecimal digits. Hexadecimal character references are case-insensitive in HTML. For example, the character 'λ' can be represented as &#955;, &#x03BB; or &#X03bb;. Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
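The following Python sketch shows both directions: building numeric references from a character, and resolving them with the standard library's html.unescape. The helper numeric_refs is a hypothetical name; html.unescape follows HTML5 parsing rules, which include the Windows-1252 compatibility mapping for the 80–9F range mentioned above.

```python
import html
from typing import Tuple


def numeric_refs(ch: str) -> Tuple[str, str]:
    """Return the decimal and hexadecimal character references for ch."""
    code_point = ord(ch)  # always the Unicode code point, not a byte value
    return f"&#{code_point};", f"&#x{code_point:04X};"


decimal_ref, hex_ref = numeric_refs("λ")  # "&#955;" and "&#x03BB;"

# All three spellings from the text resolve to the same character.
resolved = [html.unescape(ref) for ref in ("&#955;", "&#x03BB;", "&#X03bb;")]

# Code point 153 (hex 99) is a forbidden control character, but the
# compatibility mapping turns it into the Windows-1252 trademark sign.
legacy = html.unescape("&#153;")
```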
Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, 'λ' can also be encoded as &lambda; in an HTML document. The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably does not include XML's &apos; (') entity. For a list of all named HTML character entity references, see List of XML and HTML character entity references (approximately 250 entries).
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for the markup-delimiting characters mentioned above, and for a few special characters (or not at all if a native Unicode encoding like UTF-8 is used). However, to prevent HTML injection attacks such as cross-site scripting (XSS), HTML entity escaping must be applied carefully. If HTML attributes are not fully quoted, whitespace such as space and tab must also be entity-encoded. Other contexts within HTML, such as JavaScript, CSS styles, and URLs, require different escaping formats. See http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet for details on each of the different contexts.
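A minimal sketch of entity escaping for two HTML contexts, element content and a fully quoted attribute value, using Python's standard html module. The function render_span and its parameters are hypothetical, and the sketch deliberately does not cover the JavaScript, CSS, or URL contexts, which need different escaping.

```python
import html
from html.entities import name2codepoint


def render_span(title: str, text: str) -> str:
    """Build a span with an escaped, double-quoted attribute and escaped text."""
    safe_title = html.escape(title, quote=True)  # also escapes " and '
    safe_text = html.escape(text, quote=False)   # only &, < and >
    return f'<span title="{safe_title}">{safe_text}</span>'


markup = render_span('a "quoted" <tip>', "Tom & Jerry")

# Named references map to code points; "lambda" is the name used in the text.
lambda_code_point = name2codepoint["lambda"]  # 955
lambda_char = html.unescape("&lambda;")
```

Because the attribute value is always wrapped in double quotes and every quotation mark inside it is escaped, untrusted input cannot terminate the attribute early in this particular context.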
XML character references
Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:
&amp; → & (ampersand, U+0026)
&lt; → < (less-than sign, U+003C)
&gt; → > (greater-than sign, U+003E)
&quot; → " (quotation mark, U+0022)
&apos; → ' (apostrophe, U+0027)
All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
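These XML rules can be observed with Python's standard xml.etree.ElementTree parser, used here simply as one example of a conforming XML parser. The helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if xml_text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five predefined entities need no declaration.
predefined_ok = parses("<p>&amp;&lt;&gt;&quot;&apos;</p>")

# An HTML-style named entity is undefined in plain XML...
undefined_rejected = not parses("<p>&eacute;</p>")

# ...unless the document declares it first, here in an internal DTD subset.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#xE9;">]><p>&eacute;</p>'
)

# Hexadecimal references must use a lowercase x.
lowercase_ok = parses("<p>&#xE9;</p>")
uppercase_rejected = not parses("<p>&#XE9;</p>")
```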
However, use of &apos; in XHTML should generally be avoided for compatibility reasons. &#39; or &#x0027; may be used instead. &amp; has the special problem that it starts with the character to be escaped. A simple Internet search finds thousands of sequences &amp;amp;amp; ... in HTML pages for which the algorithm to replace an ampersand by the corresponding character entity reference was probably applied repeatedly.
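The repeated-escaping problem is easy to reproduce. The sketch below applies Python's html.escape several times to the same string; the helper escape_times is a hypothetical name introduced for the demonstration.

```python
import html


def escape_times(text: str, times: int) -> str:
    """Apply HTML escaping to text the given number of times."""
    for _ in range(times):
        text = html.escape(text)
    return text


once = escape_times("Fish & Chips", 1)    # "Fish &amp; Chips"
twice = escape_times("Fish & Chips", 2)   # "Fish &amp;amp; Chips"
thrice = escape_times("Fish & Chips", 3)  # "Fish &amp;amp;amp; Chips"
```

Each pass escapes the ampersand that the previous pass introduced, so text should be escaped exactly once, at the point where it is written into the page.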