Big5
Big-5 or Big5 is a character encoding method used in Taiwan, Hong Kong, and Macau for Traditional Chinese characters. Mainland China, which uses Simplified Chinese characters, uses the GB character set instead.
Organization
The original Big5 character set is sorted first by usage frequency, second by stroke count, lastly by Kangxi radical. The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The ETen extension became part of the current Big5 standard through popularity.
The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the Shift JIS encoding. It is a double-byte character set (DBCS) with the following structure:
First byte ("lead byte") | 0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters) |
Second byte | 0x40 to 0x7e, 0xa1 to 0xfe |
(the prefix 0x signifying hexadecimal numbers).
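The byte ranges in the table above can be checked mechanically. The following is a minimal sketch, not a reference implementation; the function name `is_big5_pair` and the `user_defined` flag are illustrative assumptions, and only the numeric ranges come from the table.

```python
def is_big5_pair(lead: int, trail: int, user_defined: bool = True) -> bool:
    """Return True if (lead, trail) fits the two-byte structure in the table.

    user_defined=True accepts the full lead-byte range 0x81-0xfe;
    user_defined=False restricts it to 0xa1-0xf9 (non-user-defined characters).
    """
    lead_lo, lead_hi = (0x81, 0xFE) if user_defined else (0xA1, 0xF9)
    lead_ok = lead_lo <= lead <= lead_hi
    # The second byte has two disjoint valid ranges.
    trail_ok = 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    return lead_ok and trail_ok


print(is_big5_pair(0xA1, 0x40))                      # True: both bytes in range
print(is_big5_pair(0xA1, 0x7F))                      # False: 0x7f is not a valid second byte
print(is_big5_pair(0x81, 0x40, user_defined=False))  # False: lead byte below 0xa1
```

A check like this only says whether a byte pair has a permitted shape; it does not say whether a character is actually assigned to that code.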
Certain variants of the Big5 character set, for example the HKSCS, use an expanded range for the lead byte including values in the 0x81 to 0xA0 range (similar to Shift JIS).
If the second byte is not in the correct range, behaviour is undefined (i.e., varies from system to system).
The numerical value of an individual Big5 code is frequently given as a 4-digit hexadecimal number, which describes the two bytes that make up the Big5 code as if they were a big-endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which consists of the bytes 0xa1 0x40, is usually written as 0xa140 or just A140.
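As a small illustration of this notation, the two bytes can be combined into one 16-bit number with big-endian byte order. The helper names `big5_code` and `big5_bytes` below are made up for this sketch.

```python
def big5_code(pair: bytes) -> int:
    """Combine the two bytes of a Big5 code into one big-endian 16-bit number."""
    if len(pair) != 2:
        raise ValueError("a Big5 code consists of exactly two bytes")
    return int.from_bytes(pair, "big")


def big5_bytes(code: int) -> bytes:
    """Split a 16-bit Big5 code back into its lead byte and second byte."""
    return code.to_bytes(2, "big")


full_width_space = bytes([0xA1, 0x40])
print(f"{big5_code(full_width_space):04X}")  # A140
print(big5_bytes(0xA140).hex())              # a140
```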
Strictly speaking, the Big5 encoding contains only DBCS characters. However, in practice, the Big5 codes are always used together with an unspecified, system-dependent single-byte character set (ASCII, or an 8-bit character set such as code page 437), so that you will find a mix of DBCS characters and single-byte characters in Big5-encoded text.
Bytes in the range 0x00 to 0x7f that are not part of a double-byte character are assumed to be single-byte characters.
(For a more detailed description of this problem, please see the discussion on "The Matching SBCS" below.)
The meaning of non-ASCII single bytes outside the permitted values that are not part of a double-byte character varies from system to system.
In old MSDOS-based systems, they are likely to be displayed as 8-bit characters;
in modern systems, they are likely to either give unpredictable results or generate an error.
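The scanning rule described above can be sketched as a small splitter that walks a byte string and separates double-byte units from single bytes. This is an illustration under stated assumptions: the name `split_units` is hypothetical, and the choice to emit a stray lead byte as a single byte is just one of the system-dependent behaviours mentioned above, picked here for simplicity.

```python
from typing import Iterator


def split_units(data: bytes) -> Iterator[bytes]:
    """Yield each double-byte character or single byte found in data."""
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE and i + 1 < len(data):
            trail = data[i + 1]
            if 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE:
                # Valid lead byte followed by a valid second byte.
                yield data[i:i + 2]
                i += 2
                continue
        # Anything else is passed through as a single byte (assumed policy).
        yield data[i:i + 1]
        i += 1


# 0x40 is "@" in ASCII, but here it is consumed as the second byte of 0xa140.
print(list(split_units(b"A\xa1\x40B")))  # [b'A', b'\xa1@', b'B']
```

The example shows why a byte in the 0x00 to 0x7f range only counts as a single-byte character when it is not the second byte of a double-byte character.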
A more detailed look at the organization
In the original Big5, the encoding is compartmentalized into different zones:
0x8140 to 0xa0fe | Reserved for user-defined characters 造字 |
0xa140 to 0xa3bf | "Graphical characters" 圖形碼 |
0xa3c0 to 0xa3fe | Reserved, not for user-defined characters |
0xa440 to 0xc67e | Frequently used characters 常用字 |
0xc6a1 to 0xc8fe | Reserved for user-defined characters |
0xc940 to 0xf9d5 | Less frequently used characters 次常用字 |
0xf9d6 to 0xfefe | Reserved for user-defined characters |
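The zone table lends itself to a simple lookup. The sketch below transcribes the ranges from the table; the names `ZONES` and `zone_of` are illustrative, and the second-byte check is an added assumption so that codes with an invalid second byte are not assigned to a zone.

```python
from typing import Optional

# (first code, last code, description), transcribed from the zone table.
ZONES = [
    (0x8140, 0xA0FE, "reserved for user-defined characters"),
    (0xA140, 0xA3BF, "graphical characters"),
    (0xA3C0, 0xA3FE, "reserved, not for user-defined characters"),
    (0xA440, 0xC67E, "frequently used characters"),
    (0xC6A1, 0xC8FE, "reserved for user-defined characters"),
    (0xC940, 0xF9D5, "less frequently used characters"),
    (0xF9D6, 0xFEFE, "reserved for user-defined characters"),
]


def zone_of(code: int) -> Optional[str]:
    """Return the zone description for a 4-digit Big5 code, or None."""
    trail = code & 0xFF
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None  # second byte outside the permitted ranges
    for first, last, description in ZONES:
        if first <= code <= last:
            return description
    return None


print(zone_of(0xA140))  # graphical characters
print(zone_of(0xF9D6))  # reserved for user-defined characters
```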
The "graphical characters" actually comprise punctuation marks, partial punctuation marks (e.g., half of a dash, half of an ellipsis; see below), dingbats, foreign characters, and other special characters (e.g., presentational "full width" forms, digits for Suzhou numerals, zhuyin fuhao, etc.)
In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters, each of which is normally regarded as associated with the preceding zone.
For example, additional "graphical characters" (e.g., punctuation marks) would be expected to be placed in the 0xa3c0–0xa3fe range, and additional logograms would be placed in either the 0xc6a1–0xc8fe or the 0xf9d6–0xfefe range.
Sometimes, this is not possible due to the large number of extended characters to be added;
for example, Cyrillic letters and Japanese kana have been placed in the zone associated with "frequently-used characters".
What a Big5 code actually encodes
An individual Big5 code does not always represent a complete semantic unit. The Big5 codes of logograms are always logograms, but codes in the "graphical characters" section are not always complete "graphical characters". What Big5 encodes are particular graphical representations of characters or part of characters that happen to fit in the space taken by two monospaced ASCII characters. This is a property of double-byte character sets as normally used in CJK (Chinese, Japanese, and Korean) computing, and is not a unique problem of Big5.
(The above might need some explanation by putting it in historical perspective, as it is theoretically incorrect: Back when text mode personal computing was still the norm, characters were normally represented as single bytes and each character took one position on the screen. There was therefore a practical reason to insist that double-byte characters must take up two positions on the screen, namely that off-the-shelf, American-made software would then be usable without modification in a DBCS-based system. If a character can take an arbitrary number of screen positions, software that assumes that one byte of text takes one screen position would produce incorrect output. Of course, if a computer never had to deal with the text screen, the manufacturer would not enforce this artificial restriction; the Apple Macintosh is an example. Nevertheless, the encoding itself must be designed so that it works correctly on text-screen-based systems.)
To illustrate this point, consider the Big5 code 0xa14b (…). To English speakers this looks like an ellipsis and the Unicode standard identifies it as such; however, in Chinese, the ellipsis consists of six dots that fit in the space of two Chinese characters (……), so in fact there is no Big5 code for the Chinese ellipsis, and the Big5 code 0xa14b just represents half of a Chinese ellipsis. It represents only half of an ellipsis because the whole ellipsis should take the space of two Chinese characters, and in many DBCS systems one DBCS character must take exactly the space of one Chinese character.
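The arithmetic behind the ellipsis example can be sketched under the text-mode convention described above, where one byte of text occupies one screen position. The names `HALF_ELLIPSIS` and `text_columns` are made up for this illustration, and the decoding step assumes Python's built-in `big5` codec.

```python
HALF_ELLIPSIS = bytes([0xA1, 0x4B])  # Big5 code 0xa14b


def text_columns(big5_text: bytes) -> int:
    """Screen positions used on a text-mode DBCS system: one per byte."""
    return len(big5_text)


# A complete Chinese ellipsis has to be written as two consecutive codes.
full_ellipsis = HALF_ELLIPSIS * 2

print(text_columns(HALF_ELLIPSIS))   # 2 columns, i.e. one Chinese character cell
print(text_columns(full_ellipsis))   # 4 columns, i.e. two Chinese character cells
print(full_ellipsis.decode("big5"))  # the two decoded half-ellipsis characters
```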
Characters encoded in Big5 do not always represent things that can be readily used in plain text files; an example is "citation mark" (0xa1ca, ﹋), which is, when used, required to be typeset under the title of literary works. Another example is the Suzhou numerals, which are a form of scientific notation that requires the number to be laid out in a 2-D form consisting of at least two rows.
The Matching SBCS
In practice, Big5 cannot be used without a matching single-byte character set (SBCS); this is mostly for compatibility reasons. However, as in the case of other CJK DBCS character sets, the SBCS to use has never been specified. Big5 has always been defined as a DBCS, though when used it must be paired with a suitable, unspecified SBCS and is therefore used as what some people call an MBCS; nevertheless, Big5 by itself, as defined, is strictly a DBCS.
The SBCS to use being unspecified implies that the SBCS used can theoretically vary from system to system. Nowadays, ASCII is the only possible SBCS one would use. However, in old DOS-based systems, Code Page 437, with its extra special symbols in the control code area including position 127, was much more common. Yet, on a Macintosh system with the Chinese Language Kit, or on a Unix system running the cxterm terminal emulator, the SBCS paired with Big5 would not be Code Page 437.
Outside the valid range of Big5, the old DOS-based systems would routinely interpret things according to the SBCS that is paired with Big5 on that system. In such systems, characters 127 to 160, for example, were very likely not avoided because they would produce invalid Big5, but used because they would be valid characters in Code Page 437.
The modern characterization of Big5 as an MBCS consisting of the DBCS of Big5 plus the SBCS of ASCII is therefore historically incorrect and potentially flawed, as the choice of the matching SBCS was, and theoretically still is, quite independent of the flavour of Big5 being used.
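The independence of the SBCS from the DBCS can be illustrated with a decoder whose single-byte codec is a parameter. This is a rough sketch, not how any particular system worked: the name `decode_mixed` is hypothetical, lead bytes are limited to the 0xa1 to 0xf9 range as a simplifying assumption, and Python's built-in `big5`, `ascii`, and `cp437` codecs stand in for the historical character sets.

```python
def decode_mixed(data: bytes, sbcs: str = "ascii") -> str:
    """Decode Big5 double-byte characters, using `sbcs` for everything else."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if 0xA1 <= lead <= 0xF9 and i + 1 < len(data):
            trail = data[i + 1]
            if 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE:
                # May raise UnicodeDecodeError for unassigned codes.
                out.append(data[i:i + 2].decode("big5"))
                i += 2
                continue
        # Not part of a double-byte character: interpret with the paired SBCS.
        out.append(data[i:i + 1].decode(sbcs))
        i += 1
    return "".join(out)


sample = b"\x80 \xa4\xa4"  # byte 0x80, a space, then Big5 code 0xa4a4

print(decode_mixed(sample, sbcs="cp437"))  # 0x80 is a valid Code Page 437 character
try:
    decode_mixed(sample, sbcs="ascii")
except UnicodeDecodeError as exc:
    print("ASCII pairing rejects the same bytes:", exc.reason)
```

The same byte string is valid text under one pairing and an error under the other, which is the point made above about bytes 127 to 160 on DOS-based systems.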
History
The inability of ASCII to support large character sets such as those used for Chinese, Japanese and Korean led governments and industry to find creative solutions to enable their languages to be rendered on computers. A variety of ad-hoc and usually proprietary input methods led to efforts to develop a standard system. As a result, Big5 encoding was defined by the Institute for Information Industry of Taiwan in 1984. The name "Big5" recognizes that the standard emerged from a collaboration of five of Taiwan's largest IT firms: Acer (宏碁); MiTAC (神通); JiaJia (佳佳); ZERO ONE Technology (零壹 or 01tech); and First International Computer (FIC) (大眾).
Big5 was rapidly popularized in Taiwan and worldwide among Chinese who used the traditional Chinese character set through its adoption in several commercial software packages, notably the E-TEN Chinese DOS input system (ETen Chinese System).
The Republic of China government declared Big5 as its standard in the mid-1980s since it was, by then, the de facto standard for using traditional Chinese on computers.
Extensions
The original Big-5 includes only CJK logograms from the 常用國字標準字體表 (4,808 characters) and the 次常用國字標準字體表 (6,343 characters), but not characters used in people's names, place names, dialects, chemistry, biology, or Japanese kana. As a result, many software packages that support Big-5 include extensions to address these gaps.
The plethora of variations makes UTF-8 or UTF-16 a more consistent choice of encoding for modern use.
ETEN extensions
In the ETEN (倚天) Chinese operating system, the following code points are added to make it compliant with the IBM5550 code page:
- A3C0-A3E0: 33 control characters.
- C6A1-C875: circled numbers 1-10, bracketed numbers 1-10, Roman numerals 1-9 (i-ix), CJK radical glyphs, Japanese hiragana, Japanese katakana, Cyrillic characters.
- F9D6-F9FE: '碁', '銹', '恒', '裏', '墻', '粧', '嫺', and 34 extra symbols.
In some versions of Eten, there are extra graphical symbols and Simplified Chinese characters.
Microsoft code pages
Microsoft (微軟) created its own version of Big5 extension as Code page 950 for use with Microsoft Windows, which supports ETEN's extensions, but only the F9D6-F9FE code points. In Windows ME, the euro currency symbol was mapped to Big-5 code point A3E1, but not in later versions of the operating system.
After installing Microsoft's HKSCS patch on top of traditional Chinese Windows (or any version of Windows 2000 and above with proper language pack), applications using code page 950 automatically use a hidden code page 951 table. The table supports all code points in HKSCS-2001, except for the compatibility code points specified by the standard.
Code page 950 as used by Windows 2000 and Windows XP maps hiragana and katakana characters to the Unicode Private Use Area when exporting to Unicode; in Windows Vista they are mapped to the proper hiragana and katakana Unicode blocks.
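The text does not say which Private Use Area code points the older mapping used, so the following Python sketch only detects whether converted text contains any Private Use Area characters at all. The block boundaries are those of the Unicode Hiragana, Katakana, and Basic Multilingual Plane Private Use Area blocks; the function names are hypothetical.

```python
# Unicode block boundaries; Python ranges exclude the upper bound.
HIRAGANA = range(0x3040, 0x30A0)     # U+3040-U+309F
KATAKANA = range(0x30A0, 0x3100)     # U+30A0-U+30FF
PRIVATE_USE = range(0xE000, 0xF900)  # U+E000-U+F8FF (BMP Private Use Area)


def classify(ch: str) -> str:
    """Label one character as 'kana', 'private use', or 'other'."""
    cp = ord(ch)
    if cp in HIRAGANA or cp in KATAKANA:
        return "kana"
    if cp in PRIVATE_USE:
        return "private use"
    return "other"


def has_private_use(text: str) -> bool:
    """True if any character of the text lies in the Private Use Area."""
    return any(classify(ch) == "private use" for ch in text)
```

A check like `has_private_use(converted_text)` could hint that text was exported under the older mapping, although Private Use Area characters can also come from other sources.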
ChinaSea font
ChinaSea fonts (中國海字集) are Traditional Chinese fonts made by ChinaSea. The fonts are rarely sold separately, but are bundled with other products, such as the Chinese version of Microsoft Office 97. The fonts support Japanese kana, kokuji, and other characters missing in Big-5. As a result, the ChinaSea extensions have become more popular than the government-supported extensions. Some Hong Kong BBSes had used encodings in ChinaSea fonts before the introduction of HKSCS.
'Sakura' font
The 'Sakura' font (日和字集 Sakura Version) was developed in Hong Kong and is designed to be compatible with HKSCS. It adds support for kokuji and proprietary dingbats (including Doraemon) not found in HKSCS.
Unicode-at-on
Unicode-at-on (Unicode補完計畫), formerly BIG5 Extension, extends BIG-5 by altering code page tables, but uses the ChinaSea extensions starting with version 2. However, with the bankruptcy of ChinaSea, late development, and the increasing popularity of HKSCS and Unicode (the project is not compatible with HKSCS), the success of this extension is limited at best.
Despite these problems, characters previously mapped to the Unicode Private Use Area are remapped to their standardized equivalents when exporting characters to Unicode format.
OPG
The web sites of the Oriental Daily News and Sun Daily, both of which belong to the Oriental Press Group Limited (東方報業集團有限公司) in Hong Kong, use a downloadable font whose Big-5 extension coding differs from that of HKSCS.
Taiwan Ministry of Education font
The Taiwan Ministry of Education supplied its own font, the Taiwan Ministry of Education font (臺灣教育部造字檔), for internal use.
Taiwan Council of Agriculture font
The Council of Agriculture of Taiwan's Executive Yuan introduced a 133-character custom font, the Taiwan Council of Agriculture font (臺灣農委會常用中文外字集), which includes 84 characters from the 'fish' radical and 7 from the 'bird' radical.
Big5+
The Chinese Foundation for Digitization Technology (中文數位化技術推廣委員會) introduced Big5+ in 1997, which used over 20000 code points to incorporate all CJK logograms in Unicode 1.1. However, the extra code points exceeded the original Big-5 definition (Big5+ uses high byte values 81-FE and low byte values 40-7E and 80-FE), preventing it from being installed on Microsoft Windows.
Big-5E
To allow Windows users to use custom fonts, the Chinese Foundation for Digitization Technology introduced Big-5E, which added 3954 characters (in three blocks of code points: 8E40-A0FE, 8140-86DF, 86E0-875C) and removed the Japanese kana from the ETEN extension. Unlike Big5+, Big-5E extends Big-5 within its original definition. Mac OS X 10.3 and later supports Big-5E in the fonts LiHei Pro (儷黑 Pro.ttf) and LiSong Pro (儷宋 Pro.ttf).
Big5-2003
The Chinese Foundation for Digitization Technology made a Big5 definition and put it into CNS 11643 in note form, making it part of the official standard in Taiwan. Big5-2003 incorporates all Big-5 characters introduced in the 1984 ETEN extensions (code points A3C0-A3E0, C6A1-C7F2, and F9D6-F9FE) and the Euro symbol. Cyrillic characters were not included because the authority claimed CNS 11643 does not include such characters.
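The byte and code point ranges quoted above for Big5+, Big-5E, and Big5-2003 can be written down as data and tested against individual codes. The Python sketch below does only that; the names are hypothetical, and the block test compares 16-bit values numerically without validating the trail byte, so it is a rough filter rather than a full validator.

```python
# Big5+ byte ranges as given in the text (Python ranges exclude the upper bound).
BIG5_PLUS_LEAD = range(0x81, 0xFF)                          # 81-FE
BIG5_PLUS_TRAIL = (range(0x40, 0x7F), range(0x80, 0xFF))    # 40-7E and 80-FE

# Blocks added by Big-5E, and ETEN blocks incorporated by Big5-2003.
BIG5E_ADDED_BLOCKS = [(0x8E40, 0xA0FE), (0x8140, 0x86DF), (0x86E0, 0x875C)]
BIG5_2003_ETEN_BLOCKS = [(0xA3C0, 0xA3E0), (0xC6A1, 0xC7F2), (0xF9D6, 0xF9FE)]


def is_big5_plus_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) fits the Big5+ byte ranges quoted in the text."""
    return lead in BIG5_PLUS_LEAD and any(trail in r for r in BIG5_PLUS_TRAIL)


def in_blocks(code: int, blocks: list[tuple[int, int]]) -> bool:
    """True if the 16-bit code lies numerically inside any listed block.

    Note: this does not check that the trail byte itself is valid.
    """
    return any(low <= code <= high for low, high in blocks)
```

For instance, `is_big5_plus_pair(0x81, 0x80)` is true, while `is_big5_plus_pair(0x81, 0x7F)` is false because 7F falls in the gap between the two low-byte ranges.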
CDP
The Academia Sinica made a CDP font (漢字構形資料庫) in the late 1990s. Its latest release, version 2.5, included 112,533 characters, somewhat fewer than the Mojikyo fonts.
HKSCS
Hong Kong also adopted Big5 for character encoding. However, Cantonese uses many archaic and some colloquial Chinese characters that were not available in the normal Big5 character set. To solve this problem, the Hong Kong Government created two Big5 extensions: the Government Chinese Character Set in 1995 and the Hong Kong Supplementary Character Set in 1999. The Hong Kong extensions were commonly distributed as a patch. They are still distributed as a patch by Microsoft, but a full Unicode font is also available from the Hong Kong Government's web site.
There are two encoding schemes of HKSCS: one encoding scheme is for the Big-5 coding standard and the other is for the ISO 10646 standard. Subsequent to the initial release, there are also HKSCS-2001 and HKSCS-2004. The HKSCS-2004 is aligned technically with the ISO/IEC 10646:2003 and its Amendment 1 published in April 2004 by the International Organization for Standardization (ISO).
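To make the idea of two encoding schemes concrete, the Python sketch below encodes the same text once with Python's standard-library `big5hkscs` codec and once as UTF-8. This rests on two assumptions: UTF-8 stands in here for the ISO 10646 scheme, and Python's codec is its own implementation that may not match any particular HKSCS release exactly. The function name is hypothetical.

```python
def encode_both(text: str) -> tuple[bytes, bytes]:
    """Return (Big5-HKSCS bytes, UTF-8 bytes) for the same text.

    Raises UnicodeEncodeError if a character is not covered by
    Python's big5hkscs codec.
    """
    return text.encode("big5hkscs"), text.encode("utf-8")


def round_trips(text: str) -> bool:
    """True if the text survives an encode/decode cycle through big5hkscs."""
    try:
        return text.encode("big5hkscs").decode("big5hkscs") == text
    except UnicodeEncodeError:
        return False
```

For a character that is already in plain Big5, such as 中 (Big5 code A4A4), the first element of the result is the familiar two-byte Big5 sequence.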
HKSCS includes all the characters from the common ETEN extension, plus some characters from Simplified Chinese, place names, people's names, and Cantonese phrases (including profanity).
External links
- Big5 character code table
- Chinese character codes: an update by Christian Wittern
- CNS 11643 official web site has information about the Big5e character set (an extended version of Big5) in the "Chinese Information Code" section
- Big5 introduction Contains differences between extensions.
- Graphical View of Big5 in ICU's Converter Explorer
- 教育部標準字體 Download page of the Taiwan Ministry of Education fonts
- 文獻處理實驗室 Download pages of the CDP font
- Hong Kong Supplementary Character Set Info Downloadable HKSCS documents & font
- 香港參考宋體 Download page of Dynalab(華康科技有限公司)'s HKSCS font.
- Microsoft's Windows Codepage 950 (Traditional Chinese Big5)
- on.cc Download page of the OPG font
- 中國海字集視窗版(v3.0)下載網頁 Download page of the ChinaSea font