Japanese language and computers
Adapting computers to the Japanese language raises a number of issues, some unique to Japanese and others common to languages that have a very large number of characters. Writing English requires so few characters that each one can be encoded in a single byte. Japanese, however, has far more than 256 characters, so it cannot be encoded with one byte per character; instead it uses two or more bytes per character, in a so-called "double-byte" or "multi-byte" encoding. Some of the problems concern transliteration and romanization, some concern character encoding, and some concern the input of Japanese text.
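The byte counts behind this distinction can be checked directly. The following is a minimal Python sketch, assuming Python's built-in `ascii`, `shift_jis`, `euc_jp` and `utf_8` codecs; the helper name `encoded_length` is illustrative only.

```python
def encoded_length(text: str, codec: str) -> int:
    """Return the number of bytes that `text` occupies in the given codec."""
    return len(text.encode(codec))


# One English letter fits in a single byte.
print("ascii", encoded_length("A", "ascii"))

# One kanji needs more than one byte in every Japanese-capable encoding.
for codec in ("shift_jis", "euc_jp", "utf_8"):
    print(codec, encoded_length("漢", codec))
```

In this sketch the kanji takes two bytes in Shift JIS and EUC-JP and three bytes in UTF-8.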
Character encodings
There are several standard methods for encoding Japanese characters for use on a computer, including JIS, Shift JIS, EUC, and Unicode. While mapping the set of kana is a simple matter, kanji have proven more difficult. Despite efforts, none of the encoding schemes has become the de facto standard, and multiple encoding standards are still in use today.
For example, most Japanese e-mail is in JIS encoding (commonly ISO-2022-JP) and most web pages are in Shift JIS, yet mobile phones in Japan usually use some form of Extended Unix Code (EUC). If a program fails to determine which encoding scheme a text uses, the result can be mojibake: garbled, unreadable text. The word literally means "transformed characters", from moji ("character") and the stem of bakeru ("to change form").
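Mojibake is easy to reproduce. The sketch below assumes Python's built-in `shift_jis` and `euc_jp` codecs and simulates a receiver that guesses the wrong encoding; the variable names are illustrative.

```python
text = "文字化け"

# Bytes written by a sender that uses Shift JIS ...
raw = text.encode("shift_jis")

# ... and read back by a receiver that wrongly assumes EUC-JP.
# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
garbled = raw.decode("euc_jp", errors="replace")

# Decoding with the encoding that was actually used restores the text.
restored = raw.decode("shift_jis")

print(garbled)
print(restored)
```

The bytes themselves are never damaged; only the interpretation is wrong, which is why the text can be recovered once the correct encoding is known.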
To understand how this state of affairs arose, it helps to know a little of the history of the encodings. The first encoding to become widely used was JIS X 0201, a single-byte encoding that covers only the standard 7-bit ASCII range (with minor substitutions such as the yen sign in place of the backslash) plus half-width katakana extensions. It was widely used in systems that had neither the processing power nor the storage to handle kanji, including older embedded equipment such as cash registers. With this encoding only katakana, not kanji, is supported. Many embedded devices with displays still support only katakana.
The development of kanji encodings was the beginning of the split. Shift JIS supports kanji and was designed to be completely backward compatible with JIS X 0201, which is why it is found in much embedded electronic equipment.
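This backward compatibility can be observed with Python's built-in `shift_jis` codec, used here as a stand-in for the Shift JIS described above (an assumption, since codec implementations differ in small details). The helper name `sjis` is illustrative.

```python
def sjis(text: str) -> bytes:
    """Encode text with Python's shift_jis codec."""
    return text.encode("shift_jis")


halfwidth_a = "\uff71"  # half-width katakana "ｱ", a JIS X 0201 character
kanji = "\u6f22"        # 漢, a JIS X 0208 kanji

print(sjis("A"))          # ASCII letter: one byte, unchanged
print(sjis(halfwidth_a))  # JIS X 0201 katakana: still one byte
print(sjis(kanji))        # kanji: two bytes
```

Single-byte JIS X 0201 text therefore remains valid Shift JIS, while kanji are added as two-byte sequences.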
However, Shift JIS has the unfortunate property that it often breaks any parser (software that reads the encoded text) that is not specifically designed to handle it, because the second byte of a two-byte character can have the same value as an ASCII character. For example, a text search that is not designed for Shift JIS can return false hits. EUC, on the other hand, is handled much better by parsers written for 7-bit ASCII, which is why EUC encodings are used on UNIX, where much of the file-handling code was historically written only for English encodings. But EUC is not backward compatible with JIS X 0201, the first major Japanese encoding. Further complications arise because the original Internet e-mail standards supported only 7-bit transfer. JIS encoding was therefore developed for sending and receiving e-mail.
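The difference between the three encodings shows up at the byte level. The sketch below uses Python's `shift_jis`, `euc_jp` and `iso2022_jp` codecs, taking ISO-2022-JP as the e-mail form of JIS encoding (an assumption about which variant is meant). The function `naive_contains_backslash` is a hypothetical stand-in for a parser written only for ASCII.

```python
ch = "\u8868"  # 表, a common kanji

sjis = ch.encode("shift_jis")
euc = ch.encode("euc_jp")
jis = ch.encode("iso2022_jp")


def naive_contains_backslash(data: bytes) -> bool:
    """Byte-level search, as an ASCII-only parser would perform it."""
    return b"\\" in data


# Shift JIS: the second byte of this kanji is 0x5C, the ASCII backslash.
print("shift_jis", sjis.hex(), naive_contains_backslash(sjis))

# EUC-JP: every byte of a kanji has the high bit set, so no ASCII collision.
print("euc_jp", euc.hex(), naive_contains_backslash(euc))

# ISO-2022-JP: every byte is below 0x80, so it survives 7-bit transfer.
print("iso2022_jp", jis.hex(), all(b < 0x80 for b in jis))
```

The false backslash in the Shift JIS bytes is the kind of hit that misleads search routines, path handling and string-escaping code that scan byte by byte.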
Character set standards such as JIS X 0208 do not include every character that is needed, so gaiji (外字, "external characters") are sometimes used to supplement the character set. Gaiji may come in the form of external font packs in which normal characters have been replaced with new characters, or in which new characters have been added to unused character positions. However, gaiji are impractical in Internet environments, because the font set must be transferred along with the text. As a result, such characters are either written with similar or simpler characters in their place, or the text is written using a larger character set (such as Unicode) that includes the required character.
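Whether a given character falls outside a legacy character set can be tested by attempting to encode it. This is a minimal sketch, assuming Python's built-in codecs; `is_encodable` is a hypothetical helper, and the sample character is one example of a name character missing from JIS X 0208.

```python
def is_encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


rare = "\U00020bb7"  # 𠮷, a variant of 吉 that appears in personal names

for codec in ("shift_jis", "euc_jp", "utf_8"):
    print(codec, is_encodable(rare, codec))
```

A character that fails this check in Shift JIS or EUC-JP is the kind that would historically have been handled as gaiji or replaced with a similar character.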
Unicode is intended to solve the encoding problems of all the world's languages. The UTF-8 encoding used for Unicode text in web pages does not have the disadvantages of Shift JIS. Unicode is supported by international software, and it removes the need for gaiji. There are still controversies, however. For Japanese, the kanji have been unified with Chinese characters: a character considered to be the same in Japanese and Chinese is given a single code point in Unicode, even if it looks slightly different in each language. This process, called Han unification, has caused controversy. The earlier encodings used in Japan, Taiwan, mainland China and Korea each handled only one language, whereas Unicode is meant to handle all of them. The handling of kanji and Chinese characters was, however, designed by a committee composed of representatives from all four countries and regions. Use of Unicode is growing slowly because it is better supported by software from outside Japan, but most Japanese web pages still use Shift JIS. Wikipedia uses Unicode.
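The effect of unification can be illustrated with one character that exists in Japanese and Chinese legacy encodings. The sketch assumes Python's `shift_jis`, `gb2312` and `big5` codecs and the standard `unicodedata` module; the dictionary name `legacy` is illustrative.

```python
import unicodedata

ch = "\u76f4"  # 直, drawn differently in Japanese and Chinese typefaces

# The same character has different byte sequences in each national encoding.
legacy = {codec: ch.encode(codec) for codec in ("shift_jis", "gb2312", "big5")}

for codec, raw in legacy.items():
    # Every legacy byte sequence decodes to the one unified code point.
    print(codec, raw.hex(), raw.decode(codec) == ch)

print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Because all three encodings map to one code point, the choice between the Japanese and Chinese shape is left to the font or to language tagging rather than to the character code.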
Text input
Written Japanese uses several different scripts: kanji (Chinese characters), two sets of kana (phonetic syllabaries) and roman letters. While kana and roman letters can be typed directly into a computer, entering kanji is more complicated, as there are far more kanji than there are keys on most keyboards. To input kanji on a modern computer, the user usually first enters the reading of the kanji; an input method editor (IME), sometimes also known as a front-end processor, then shows a list of candidate kanji that match the reading phonetically and lets the user choose the correct one. More advanced IMEs work by phrase rather than by word, which increases the likelihood that the desired characters are the first option presented. Readings can be entered either via romanization (rōmaji nyūryoku, ローマ字入力) or via direct kana input (kana nyūryoku, かな入力). Romaji input is more common on PCs and other devices with full-size keyboards (although direct kana input is also widely supported), whereas direct kana input is typical on mobile phones and similar devices: each of the ten digits (1–9, 0) corresponds to one of the ten columns of the gojūon table of kana, and repeated presses select the row.
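The keypad scheme described above can be sketched as a lookup table. This is a simplified illustration: the `KEYPAD` assignments (in particular for the 8 and 0 keys) are an assumption, voiced and small kana are ignored, and `multitap` is a hypothetical function name.

```python
# One gojūon column per digit; repeated presses step through the rows.
KEYPAD = {
    "1": "あいうえお",
    "2": "かきくけこ",
    "3": "さしすせそ",
    "4": "たちつてと",
    "5": "なにぬねの",
    "6": "はひふへほ",
    "7": "まみむめも",
    "8": "やゆよ",
    "9": "らりるれろ",
    "0": "わをん",
}


def multitap(presses):
    """Convert (digit, press_count) pairs into a kana string."""
    out = []
    for digit, count in presses:
        column = KEYPAD[digit]
        # Pressing past the end of a column wraps around to its start.
        out.append(column[(count - 1) % len(column)])
    return "".join(out)


# 5 twice, 6 five times, 0 three times
print(multitap([("5", 2), ("6", 5), ("0", 3)]))  # にほん
```

The resulting kana string is the reading that an IME would then convert into candidate kanji.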
There are two main systems for the romanization of Japanese, Kunrei-shiki and Hepburn; in practice, "keyboard romaji" (also known as wāpuro rōmaji, or "word processor romaji") generally accepts a loose combination of both. IME implementations may even handle keys for letters that are unused in any romanization scheme, such as L, converting them to the most appropriate equivalent. With kana input, each key on the keyboard corresponds directly to one kana. The JIS keyboard layout is the national standard, but some people use alternatives such as the Oyayubi shift (thumb-shift) layout.
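The loose mixing of the two romanization systems can be sketched as a table that maps both spellings to the same kana. The table `ROMAJI_TO_KANA` is a deliberately tiny, hypothetical excerpt, and `romaji_to_kana` uses a simple longest-match rule; real IMEs are considerably more elaborate.

```python
ROMAJI_TO_KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "su": "す", "na": "な", "mi": "み",
    # Hepburn and Kunrei-shiki spellings of the same kana
    "shi": "し", "si": "し",
    "tsu": "つ", "tu": "つ",
    "chi": "ち", "ti": "ち",
    "fu": "ふ", "hu": "ふ",
    "ji": "じ", "zi": "じ",
}


def romaji_to_kana(text: str) -> str:
    """Convert romaji to hiragana, trying the longest spelling first."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in ROMAJI_TO_KANA:
                out.append(ROMAJI_TO_KANA[chunk])
                i += len(chunk)
                break
        else:
            # Unknown letter: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(romaji_to_kana("sushi"), romaji_to_kana("susi"))
print(romaji_to_kana("fuji"), romaji_to_kana("huzi"))
```

Both spellings in each pair produce the same kana, which is what lets a user mix the two systems freely while typing.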
Direction of text
Japanese can be written in two directions. Yokogaki style is written left to right, with lines running top to bottom, as in English. Tategaki style is written top to bottom, with columns running right to left.
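The relationship between the two directions can be shown by rearranging horizontal lines into vertical columns. This is a plain-text sketch only, assuming a monospaced full-width display; `to_tategaki` is a hypothetical helper and does not rotate punctuation or handle mixed-width text.

```python
FULLWIDTH_SPACE = "\u3000"


def to_tategaki(lines):
    """Lay out yokogaki lines as tategaki columns, first line rightmost."""
    height = max(len(line) for line in lines)
    # Pad short lines so that every column has the same height.
    padded = [line.ljust(height, FULLWIDTH_SPACE) for line in lines]
    # The first line becomes the rightmost column, so reverse the order.
    columns = padded[::-1]
    return ["".join(col[row] for col in columns) for row in range(height)]


for row in to_tategaki(["あいう", "かき"]):
    print(row)
```

Reading the printed output down the rightmost column and then down the column to its left reproduces the original two lines.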
At present, support for vertical text is incomplete. For example, HTML itself has no support for tategaki, and Japanese users must use HTML tables to simulate it. However, CSS level 3 includes a "writing-mode" property that can render tategaki when given the value "tb-rl" (i.e. top to bottom, right to left), a value that the CSS Writing Modes specification now spells "vertical-rl". Word processors and DTP software have more complete support for it.
See also
- Japanese writing system
- Japanese language
- CJK characters
- Korean language and computers
- List of FEP software for Symbian S60