JIS X 0208
JIS X 0208 is a 2-byte character set specified as a Japanese Industrial Standard, containing 6879 graphic characters suitable for writing text, place names, personal names, and so forth in the Japanese language. The official title of the current standard is 7-bit and 8-bit double byte coded KANJI sets for information interchange. It was originally established as JIS C 6226 in 1978 and was revised in 1983, 1990, and 1997.
Partial implementations of the character set are not considered compatible. The drafting committee of the first standard took care to separate level 1 from level 2, and the second standard then moved some itaiji between the levels; from this it is conjectured that, at least at the time of the first and second standards, implementations covering only the non-kanji and level 1 were envisaged. However, such implementations have never been specified as compatible.
Although the JIS X 0208:1997 standard contains provisions concerning compatibility, it is at present generally regarded neither as a standard that certifies compatibility nor as an official manufacturing standard that amounts to a self-declaration of compatibility. Consequently, JIS X 0208-“compatible” products are, de facto, not considered to exist. Terminology such as and is included in JIS X 0208, but the semantics of these terms vary from person to person.
To represent code points, column/line numbers and kuten (row-cell) numbers are used. To identify a character without depending on a code, character names are used.
For example, the bit combination corresponding to the graphic character “space” is 010 0000 as a 7-bit number, and 0010 0000 as an 8-bit number. According to those column/line numbers, this is represented as 2/0.
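The column/line notation can be sketched in a few lines of Python. This is only an illustration: the function name `column_line` is hypothetical, and it assumes the usual reading in which the column is the upper part of the byte and the line is its low four bits.

```python
def column_line(byte: int) -> str:
    """Return the column/line notation (such as '2/0') for one byte value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    column = byte >> 4      # upper bits select the column
    line = byte & 0x0F      # low four bits select the line
    return f"{column}/{line}"


print(column_line(0b0100000))  # the 7-bit "space" combination
print(column_line(0x21), column_line(0x7E))
```

The same function works for 7-bit and 8-bit values, since a 7-bit combination simply has a column number below 8.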
The first and second of two bytes are each permitted to indicate the 94 column/line numbers from 2/1 to 7/14. Consequently, there are 94 rows, and 94 cells in each row. Thus, there are 8836 (94 × 94) code points.
A code point is referenced by a code number. Each row is given a number from 1 to 94, and within each row, each cell is given a number from 1 to 94. A code number is expressed in the form “row-cell”, the row and cell numbers being separated by a hyphen. For example, the character “亜” has a code point at row 16, cell 1, so its code number is represented as “16-01”.
The correspondence between code numbers and graphic characters is shown in a graphic character code table of 94 lines by 94 columns, in which row numbers serve as line numbers and cell numbers serve as column numbers.
This structure is also used in the Chinese GB 2312 and the Korean KS C 5601 (currently KS X 1001).
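As a rough sketch of the row-cell arithmetic described above, the following Python converts a code number into its two bytes. The helper names are hypothetical. The offsets follow from the text: in the 7-bit code, number 1 corresponds to 2/1 (0x21), and in the 8-bit code, row 3 cell 33 corresponds to 10/3 12/1.

```python
def parse_code_number(code: str) -> tuple[int, int]:
    """Split a 'row-cell' code number such as '16-01' into (row, cell)."""
    row, cell = code.split("-")
    return int(row), int(cell)


def kuten_to_bytes(row: int, cell: int, eight_bit: bool = False) -> bytes:
    """Map a row/cell pair to the two bytes of the 7-bit or 8-bit code."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    offset = 0xA0 if eight_bit else 0x20  # 1 -> 2/1 (7-bit) or 10/1 (8-bit)
    return bytes([row + offset, cell + offset])


row, cell = parse_code_number("16-01")
print(kuten_to_bytes(row, cell).hex())                   # 7-bit form
print(kuten_to_bytes(row, cell, eight_bit=True).hex())   # 8-bit form
print(94 * 94)                                           # number of code points
```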
These empty areas contain code points that should basically not be used. Except when there is prior agreement among the relevant parties, characters (gaiji) for information interchange should not be assigned to the unassigned code points.
Even when assigning characters to unassigned code points, graphic characters defined in the standard should not be assigned to them, and the same character should not be assigned to multiple unassigned code points; characters should not be duplicated in the set.
Furthermore, when assigning characters to unassigned code points, it is necessary to be cautious of unification with regard to kanji glyphs. For example, row 25 cell 66 corresponds to the kanji meaning “high” or “expensive”; both the form with the character meaning “mouth” in the middle and the less common form with a ladder-like construction are subsumed into the same code point. Consequently, limiting point 25-66 to the “mouth” form and assigning the “ladder” form to an unassigned code point would technically be in violation of the standard.
For example, both the character at ISO/IEC 646 column 4 line 1 and the one at JIS X 0208 row 3 cell 33 have the name “LATIN CAPITAL LETTER A”. Therefore, the character at 4/1 in ISO/IEC 646 and the character at 3-33 in this standard can be concluded to be the same character. Also, for the ISO/IEC 646 International Reference Version, 2/2 (quotation mark), 2/7 (apostrophe), 2/13 (hyphen-minus), and 7/14 (tilde) are characters that do not exist in this standard.
Character names for non-kanji use uppercase Roman letters, spaces, and hyphens. Non-kanji characters are also given a Japanese common name, but some provisions for these names do not exist.
The names of kanji are set mechanically according to the hexadecimal representation of their code in the Universal Character Set (UCS). The name of a kanji is obtained by prefixing the UCS code with “CJK UNIFIED IDEOGRAPH-”. For example, row 16 cell 1 corresponds to 4E9C in UCS, so its name is “CJK UNIFIED IDEOGRAPH-4E9C”. Kanji are not given Japanese common names.
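The mechanical naming rule is easy to sketch. The function `kanji_name` below is a hypothetical helper, not something defined by the standard; the cross-check uses Python's `unicodedata` module, whose names for unified ideographs happen to follow the same pattern.

```python
import unicodedata


def kanji_name(ucs_code_point: int) -> str:
    """Build a kanji character name from its UCS code point."""
    return f"CJK UNIFIED IDEOGRAPH-{ucs_code_point:04X}"


name = kanji_name(0x4E9C)  # UCS code for row 16 cell 1, per the text
print(name)
# Cross-check against the Unicode character database shipped with Python.
print(unicodedata.name("\u4e9c") == name)
```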
, kana
, and so forth.
Special characters
Numerals
Latin letters
Hiragana
Katakana
Greek letters
Cyrillic letters
Box drawing characters
Kanji
are absent from JIS X 0208. There are the aforementioned four characters “QUOTATION MARK”, “APOSTROPHE”, “HYPHEN-MINUS”, and “TILDE”. The former three are split into different code points in the kanji set (Nishimura, 1978; JIS X 0221-1:2001 standard, Section 3.8.7). The “TILDE” of IRV has no corresponding character in the kanji set.
In the following table, the ISO/IEC 646 IRV characters in question are compared with their multiple equivalents in JIS X 0208, except for the ISO/IEC 646 IRV character “TILDE”, which is compared with the “WAVE DASH” of JIS X 0208. The entries under the “Symbol” columns utilize UCS/Unicode code points, so the specifics of display may differ.
This means that the kanji set is the most widespread non-upward-compatible character set in the world; it is counted as one of the weak points of this standard.
Even with the 90 special characters, numerals, and Latin letters the kanji set and the IRV set have in common, this standard does not follow the arrangement of ISO/IEC 646. These 90 characters are split into rows 1 through 4.
As to the cause of how these numerals, Latin letters, and so forth in the kanji set are the and how the original implementation came forth with a differing interpretation compared to the IRV, it is thought that it is due to these incompatibilities.
Ever since the first standard, it has been possible to represent characters such as encircled numbers, ligatures for measurement unit names, and Roman numerals by composing other characters; they were not given independent kuten code points. Although individual companies that manufacture information systems can make an effort to represent these characters by composition as customers may require, none has requested that they be added to the standard, choosing instead to offer them proprietarily as gaiji.
In the fourth standard (1997), all these characters were explicitly defined as characters that accompany an advancement of the current position; that is to say, they are spacing characters. Furthermore, it was ruled that they should not be made by the composition of characters. For this reason, it became disallowed to represent Latin characters with diacritics at all, with possibly the sole exception of the ångström symbol (Å) at row 2 cell 82.
Hiragana and katakana in JIS X 0208, unlike the katakana in JIS X 0201, include dakuten and handakuten markings as part of a character. The katakana wi (ヰ) and we (ヱ) (both obsolete in modern Japanese) as well as the small wa (ヮ), not in JIS X 0201, are also included.
The arrangement of kana in JIS X 0208 is different from the arrangement of katakana in JIS X 0201. In JIS X 0201, the syllabary starts with wo (ヲ), followed by the small kana sorted in gojūon order, followed by the full-size kana, also in gojūon order. In JIS X 0208, on the other hand, the kana are sorted first in gojūon order, then in the order “small kana, full-size kana, kana with dakuten, kana with handakuten”, so that each fundamental kana is grouped with its derivatives. This ordering was chosen to simplify sorting for kana-based dictionary look-ups (Yasuoka, 2006).
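The grouping of a base kana with its derivatives can be observed with a short Python sketch. It rests on two assumptions that are not stated in the text above: that row 4 of the kanji set holds the hiragana, and that Python's `euc_jp` codec (which stores a row/cell pair as two bytes offset by 0xA0) is an acceptable stand-in for looking up code points. The helper name `hiragana_at` is hypothetical.

```python
import unicodedata


def hiragana_at(cell: int) -> str:
    """Decode the character at row 4 (assumed to be the hiragana row)."""
    return bytes([0xA0 + 4, 0xA0 + cell]).decode("euc_jp")


# Cells 1-2: a small kana sits directly before its full-size form.
# Cells 47-49: plain form, dakuten form, handakuten form, side by side.
for cell in (1, 2, 47, 48, 49):
    ch = hiragana_at(cell)
    print(f"04-{cell:02d}", ch, unicodedata.name(ch))
```

Because derivatives are adjacent to their base kana, a plain comparison of code numbers already yields an order close to dictionary order.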
As mentioned above, JIS X 0208 did not follow the katakana order previously defined in JIS X 0201. It is thought that the JIS X 0201 katakana came to be treated as “half-width kana” because of this incompatibility with the katakana of this standard. This point is also one of the weaknesses of this standard.
The second and third standards added four and two characters to level 2, respectively, bringing the total number of kanji to 6355. In the second standard, character forms were also changed and some characters were transposed between the levels; in the third standard, character forms were changed as well. These changes are described further below.
, the tōyō kanji correction draft, and the jinmeiyō kanji as a basis. Also, JIS C 6260 (“To-Do-Fu-Ken (Prefecture) Identification Code”; currently JIS X 0401) and JIS C 6261 (“Identification code for cities, towns and villages”; currently JIS X 0402) were consulted; kanji for nearly all Japanese prefectures, cities, districts, wards, towns, villages, and so forth were intentionally placed in level 1. Furthermore, amendments by experts were added.
Level 2 was dedicated to kanji that appeared in the aforementioned four major listings but were not selected for level 1. As noted below, the kanji of level 1 were ordered by their pronunciation, so some kanji whose pronunciations were difficult to determine were transferred from level 1 to level 2 on that basis (Nishimura, 1978).
Due to these decisions, level 1 for the most part contains the more frequently used kanji and level 2 the less frequently used ones, although frequency was judged by the standards of the day. Over time, some level 2 kanji have become more frequently used, such as one meaning “to soar” and one meaning “to glitter”; conversely, some level 1 kanji have become infrequent, notably the ones meaning “centimeter” and “millimeter”. Also, a few jinmeiyō kanji, having been added after the kanji set was defined, fall into level 2.
order. As a general rule, the on (Chinese-sound) reading is considered the representative reading; where a kanji has multiple on readings, the reading judged to be predominant in use frequency is used for the representative reading (JIS C 6226-1978 standard, Section 3.4). For those atypical kanji that do not have on readings, the kun reading was employed as the representative reading. Where a verb kun reading must be used as the representative reading, the ren'yōkei (rather than the shūshikei) form is used.
For example, cells 1 to 41 on row 16 are 41 characters sorted as having a reading that starts with a. Of these, 22 characters, including 16-10 (葵: on reading “ki”; kun reading “aoi”) and 16-32 (粟: on readings “zoku” and “shoku”; kun reading “awa”), are there on the basis of their kun readings. 16-09 (逢: on reading “hō”, kun reading “a(i)”) and 16-23 (扱: on readings “sō” and “kyū”, kun reading “atsuka(i)”) are two examples of ren'yōkei-form verbs used for the representative reading.
Where the representative reading is the same between different kanji, a kanji that uses an on reading is placed ahead of one that uses a kun reading. Where the on or kun readings are the same between more than one kanji, they are then ordered by their primary radical and stroke count.
Whether in level 1 or level 2, itaiji are arranged to directly follow their exemplar form. For example, in level 2, the characters immediately following row 49 cell 88 deviate from the general rule (stroke count in this case) in order to include three variants of 49-88.
The kanji in level 2 are arranged in order of primary radical and stroke count. Where these two properties are the same for different kanji, they are then sorted by reading.
It has been pointed out that the kanji set contains kanji that are not found in comprehensive, unabridged kanji dictionaries, and whose sources are unknown. For example, only one year after the first standard was established, Tajima (1979) reported that he had confirmed 63 kanji that were to be found neither in Shinjigen (a large kanji dictionary published by Kadokawa Shoten) nor in Dai Kan-Wa jiten, and that did not make sense as ryakuji of any sort; he noted that it would be preferable for kanji not available in kanji dictionaries to be selected from definite sources. These kanji came to be known as “ghost characters” (yūrei moji) or “ghost kanji” (yūrei kanji), among other names.
The drafting committee for the fourth version of the standard also saw the existence of kanji with sources unknown as a problem, and so made an inquiry into just what kind of sources the drafting committee of the first version referenced. As a result, it was discovered that the original drafting committee had heavily relied on the “Correspondence Analysis Results” to collect kanji. When the drafting committee investigated the “Correspondence Analysis Results”, it became clear that many of the kanji included in the kanji set but not found in exhaustive kanji dictionaries supposedly came from the “Japanese Personality Registration Name Kanji” and “Kanji for National Administrative District Listing” lists mentioned in the “Correspondence Analysis Results”.
It was confirmed that no original text for the “Japanese Personality Registration Name Kanji” referenced in the “Correspondence Analysis Results” exists. For the “National Administrative District Listing”, Sasahara Hiroyuki of the fourth version's drafting committee examined the kanji that appeared on the in-progress development pages for the first standard. The committee also consulted many ancient writings, as well as many examples of personal names in a database of NTT phone books.
Due to this thorough investigation, the committee was able to pare down the number of kanji for which the source cannot be confidently explained to twelve, shown on the table to the right. Of these, it is conjectured that several glyphs came about due to copying errors.
s allowed are limited; the extent to which particular allographic glyphs are unified into a graphemic code point is clearly defined.
Furthermore, according to the specifications in the standard, a glyph is an abstract notion of the graphical representation of a graphic character, whereas a character form is the concrete graphical shape that a glyph takes in actuality (e.g. when the glyph is handwritten, printed, or displayed on a screen). For a single glyph, there exists an endless range of concretely or visibly different character forms. A variation between character forms of one glyph is termed a design difference.
The extent to which glyphs are unified to one code point is determined by that code point's example glyph and the unification criteria that can be applied to that example glyph; that is, the example glyph for a code point applies to that code point, and so does any glyph obtained by replacing parts of the example glyph in accordance with the unification criteria.
For example, the example glyph at 33-46 is composed of radical 9 and the kanji that eventually gave rise to both so kana. Unification criterion 101 displays three kanji: the first takes the form most often seen in Japanese; the second contains a more traditional form in which the first two strokes form radical 12 (the kanji numeral for the number 8: 八); and the third is like the second, except that radical 12 is inverted. Consequently, all three variants apply to the code point at row 33 cell 46.
In the fourth standard, including one of the errata for the first printing, there are 186 unification criteria.
When a code point’s example glyph is composed of more than one part glyph, unification criteria can be applied to each part. After a unification criterion is applied to one part glyph, that part cannot have any more unification criteria applied to it. Also, a unification criterion is not allowed to apply if the resulting glyph would coincide with that of another code point entirely.
An example glyph is no more than an example for that code point; it is not a glyph “endorsed” by the standard. Also, the unification criteria need only be used for generally used kanji and for the purpose of assigning things to the code points of this standard. The standard requests that generally unused kanji not be created based on the example glyphs and unification criteria.
The kanji of the kanji set were not chosen completely consistently with the unification criteria. For example, according to unification criterion 72, 41-07 corresponds both to the form in which the third and fourth strokes cross and to the form in which they do not; yet 20-73 corresponds only to the form in which they do not cross, and 80-90 only to the form in which they do.
The terms “unification”, “unification criteria”, and “example glyph” were adopted in the fourth standard. From the first to the third version, kanji and relations between kanji were grouped into three types: , , and ; it was explained that the characters recognized as equivalent “consolidate to just one point”. “Equivalence” included, other than kanji with exactly the same shape, kanji with differences due to style, and kanji where the difference in character form is small.
The first standard stipulated that “this standard ... does not establish the particulars of character forms” (Section 3.1); it also stated that “the aim of this standard is to establish the general idea of characters and their codes; the design of their character forms and such lie outside its scope.” The second and third standards likewise contain notes to the effect that specific designs of character forms lie outside their scope (the note on item 1). The fourth standard also stipulates that “This standard regulates graphic characters as well as their bit patterns, and the use, specific designs of individual characters, and so forth are not within the scope of this standard” (JIS X 0208:1997, item 1).
7-bit code for kanji
8-bit code for kanji
International Reference Version + 7-bit code for kanji
International Reference Version + 8-bit code for kanji
Latin characters + 7-bit code for kanji
Latin characters + 8-bit code for kanji
Shift coded character set
RFC 1468 coded character set
Among these coded character sets stipulated in the fourth standard, only the “Shift” coded character set is registered by the IANA.
The escape sequences to indicate the kanji set are listed below. Here, “ESC” refers to the control character “ESCAPE”.
Where a character is common to both sets, JIS X 0208:1997 basically forbids the use of its code point in the kanji set (which is one of its two code points), thereby eliminating duplicate encodings. Characters that have the same name are judged to be the same character.
For example, both the name of the character corresponding to the bit pattern 4/1 in the ISO/IEC 646 IRV character set and the name of the character corresponding to row 3 cell 33 of the kanji set are “LATIN CAPITAL LETTER A”. In International Reference Version + 8-bit code for kanji, the letter “A” (i.e. “LATIN CAPITAL LETTER A”) is represented whether by the bit pattern 4/1 or by the bit pattern corresponding to the kanji set’s row 3 cell 33 (10/3 12/1). The standard forbids the use of the “10/3 12/1” bit pattern, in an attempt to eliminate the duplicate encoding.
In consideration to implementations that treat the characters of the code points in the kanji set as “full-width characters” and those of the IRV character set or the graphic character set for Latin characters as different characters, the use of the kanji set code points is permitted only for the sake of backwards compatibility. For example, for the purpose of backwards compatibility, it is permitted to consider 10/3 12/1 in International Reference Version + 8-bit code for kanji to correspond to a full-width “A”.
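The two encodings of “A” can be illustrated with Python's `euc_jp` codec, used here only as a convenient approximation of “International Reference Version + 8-bit code for kanji”; that equivalence is an assumption of this sketch, not a claim made by the standard. The codec takes the backwards-compatible view and decodes the kanji set code point as a separate full-width character.

```python
import unicodedata

IRV_A = bytes([0x41])              # bit pattern 4/1
KANJI_SET_A = bytes([0xA3, 0xC1])  # row 3 cell 33, bit pattern 10/3 12/1

for raw in (IRV_A, KANJI_SET_A):
    ch = raw.decode("euc_jp")
    print(raw.hex(), repr(ch), unicodedata.name(ch))

# Unicode compatibility normalization folds the full-width form back onto
# the plain letter, mirroring the "same character" reading of the standard.
folded = unicodedata.normalize("NFKC", KANJI_SET_A.decode("euc_jp"))
print(folded == IRV_A.decode("euc_jp"))
```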
If the kanji set is used together with the IRV graphic character set or the graphic character set for Latin characters, a unique encoding for each character is not guaranteed even when the standard is strictly followed. For example, in International Reference Version + 8-bit code for kanji, a hyphen may validly be represented either by the bit pattern 2/13, the character “HYPHEN-MINUS”, or by the kanji set’s row 1 cell 30 (bit pattern 10/1 11/14), the character “HYPHEN”. The standard does not define which of the two to use for what purpose, so the hyphen has no single unique encoding. The same problem affects the minus sign, the quotation marks, and so forth.
Moreover, even if the kanji set is used as a separate code, there is no guarantee that the unique encoding of characters is implemented. In many cases, however, the full-width “IDEOGRAPHIC SPACE” at row 1 cell 1 and the half-width space (2/0) coexist. How the two should be different is not self-explanatory, and is not specified in the standard.
kanji code standardization research and study committee produced the draft. The committee chairman was Moriguchi Shigeichi.
In this version there are 108 special characters and 3384 level 2 kanji, and box-drawing characters are not included. Accordingly, the kanji set was composed of 453 non-kanji and 6349 kanji for a total of 6802 characters. The standard itself was set in Shaken Co., Ltd’s Ishii Mincho typeface.
The draft of the second standard took into account factors such as the promulgation of the jōyō kanji, the enforcement of the jinmeiyō kanji, and the standardization of Japanese-language Teletex by the Ministry of Posts and Telecommunications; the following modifications were also made to keep pace with JIS C 6234-1983 (24-pixel matrix printer character forms; presently JIS X 9052).
Addition of special characters
Newly added box-drawing characters
Swapping of itaiji code points
Additions to the level 2 kanji
Modification of character forms
Among the changes to those 300 or so kanji character forms, many level 1 glyphs that were in the style of the Kangxi Dictionary were changed into variants, especially more simplified forms (e.g. ryakuji and extended shinjitai). For example, two code points that are often the subject of criticism for having been greatly changed are row 18 cell 10 (78JIS: 鷗, 83JIS: 鴎) and row 38 cell 34 (78JIS: 瀆, 83JIS: 涜).
There were many smaller changes away from the Kangxi-style variants; for example, row 25 cell 84 lost part of a stroke. Also, where some glyphs for level 1 kanji were not Kangxi-style forms, there were some changed into their Kangxi-style forms; for example, row 80 cell 49 gained part of a stroke (i.e., the same part of the stroke that 25-84 lost).
In order to elucidate the original intent of the first standard, these ended up falling into parameters for unification criteria in the fourth standard. The difference in form for the examples noted above (“” and “”) falls under the parameters for unification criterion 42 (concerning the component “”).
The bulk of the changes to character forms are differences between level 1 and level 2 kanji. Specifically, simplification was done more often for level 1 kanji than for level 2 kanji; simplifications applied to level 1 kanji (e.g. “” to “” and “” to “”) were not generally applied to kanji in level 2 (“” stayed as-is). The aforementioned 25-84 and 80-49 were given different treatment likewise, as the former is in level 1 and the latter is in level 2. Even so, there were some changes regardless of the level; for instance characters containing the “door” and “winter” components were changed with no different treatment between level 1 and level 2 kanji.
However, for 29 code points (such as the problematic 18-10 and 38-34 mentioned above), the forms inherited by the fourth standard contradict the original intent of the first. For these, there are special unification criteria to maintain compatibility with the previous standards at these code points.
When the new “X” category for Japanese Industrial Standards (for information-related fields) was introduced, the second standard was re-termed JIS X 0208-1983 on 1 March 1987.
225 kanji glyphs were changed, and two characters were added to level 2 (“” and “”). Some of the changes and the two additions corresponded to the 118 jinmeiyō kanji added in March 1990. The standard itself was set in Heisei Mincho.
The basic policies of this revision were to make no changes to the character set, to clarify ambiguous provisions, and to make the standard easier to use. No characters were added or removed and no code points were rearranged, and the example glyphs were left unchanged without exception. However, the stipulations of the standard were completely rewritten or supplemented. Whereas the third standard was 65 pages long without the explanations, the fourth standard was 374 pages without the explanations.
The main points of the revision are:
Definition of encoding methods
Definition of the general prohibition of the use of unassigned code points and methods of usage for unassigned code points
General elimination of duplicate encodings
Investigation into sources of kanji
Definition of kanji unification criteria
Inclusion of de facto standards
(extended kanji) was designed “with the goal being to offer a sufficient character set for the purposes of encoding the modern Japanese language that JIS X 0208 intended to be from the start”; it defines a character set that expands upon the kanji set of JIS X 0208. The drafters of JIS X 0213 recommend migration from JIS X 0208 to JIS X 0213, among the advantages being JIS X 0213’s compatibility with the Hyōgai Kanji Glyph List and with newer jinmeiyō kanji.
Contrary to the expectations of the drafters, adoption of JIS X 0213 has been anything but fast since its enactment in the year 2000. The drafting committee of JIS X 0213:2004 wrote (in the year 2004), “The status where ‘what the majority of information systems can use in common is JIS X 0208 only’ still continues.” (JIS X 0213:2000, Appendix 1:2004, section 2.9.7)
For Microsoft Windows, the predominant operating system (and hence the supplier of the predominant desktop environment) in the personal computing sector, the JIS X 0213 repertoire has been included since Windows Vista, released in November 2006. Mac OS X has been compatible with JIS X 0213 since version 10.1 (released in 2001). Many Unix-like operating systems such as Linux can optionally support JIS X 0213. Therefore, it is thought that, with time, JIS X 0213 support on personal computers will not be an impediment to its eventual adoption.
Among the drafters of JIS X 0213, there are those who expect to see a mix of JIS X 0208 and JIS X 0213 before any adoption of JIS X 0213 (Satō, 2004). However, JIS X 0208 continues to be used for the present, and many predict it to endure as a standard. There are barriers that need to be overcome if JIS X 0213 is to supplant JIS X 0208 in common usage:
, several companies have implemented their own encodings of the character set.
’s graphic character set for Latin characters: 2/2 (QUOTATION MARK), 2/7 (APOSTROPHE), and 2/13 (HYPHEN-MINUS). The kanji set contains all characters included in JIS X 0201’s graphic character set for katakana.
The kanji set and the graphic character set for Latin characters can be used together as specified in JIS X 0208 (Latin characters + 7-bit code for kanji and the Latin characters + 8-bit code for kanji). The kanji set, graphic character set for Latin characters, and JIS X 0201’s graphic character set for katakana can be used together as specified in JIS X 0208 (the shift-coded character set; i.e. Shift JIS). The kanji set and graphic character set for katakana can be used together in EUC-JP.
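How a row/cell pair is packed into the shift-coded character set can be sketched as follows. The arithmetic is the commonly described Shift JIS transformation rather than text quoted from the standard, and the function name is hypothetical. Two rows share one lead byte, and the lead bytes skip the range that single-byte JIS X 0201 katakana occupy.

```python
def kuten_to_shift_jis(row: int, cell: int) -> bytes:
    """Convert a JIS X 0208 row/cell pair to its two Shift JIS bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # Rows 1-62 use lead bytes 0x81-0x9F; rows 63-94 use 0xE0-0xEF.
    lead = (row - 1) // 2 + (0x81 if row <= 62 else 0xC1)
    if row % 2 == 1:
        trail = cell + 0x3F       # odd rows start at 0x40
        if trail >= 0x7F:
            trail += 1            # 0x7F is never used as a trail byte
    else:
        trail = cell + 0x9E       # even rows occupy 0x9F-0xFC
    return bytes([lead, trail])


print(kuten_to_shift_jis(16, 1).hex())  # row 16 cell 1
print(kuten_to_shift_jis(1, 1).hex())   # row 1 cell 1
```

The result for row 16 cell 1 can be compared with what Python's own `shift_jis` codec produces for the same character.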
(supplementary kanji) defines additional characters for JIS X 0208 and their code points for the purposes of information processing that requires characters not found in JIS X 0208.
Among the code points that the second version of JIS X 0208 changed, 28 code points in JIS X 0212 reflect the character forms from before the changes. Also, JIS X 0212 reassigns the “closure mark” that JIS X 0208 had assigned as a non-kanji (at row 1 cell 26) as a kanji (at row 16 cell 17). JIS X 0212 has no characters in common with JIS X 0208 other than these.
JIS X 0208 and JIS X 0212 can be used together in EUC-JP. Also, JIS X 0208 and JIS X 0212 are both source standards for UCS/Unicode’s Han unification, so using UCS/Unicode, the kanji from both standards can be made to coexist.
However, in the fourth version of JIS X 0208, the connection to JIS X 0212 was not defined at all. It is believed that this is because the drafting committee of the fourth JIS X 0208 standard had a critical opinion of the selection and identification methods of JIS X 0212. The text of the fourth standard, as well as pointing out the problematic points of the character selection of JIS X 0212, states that “it is thought that not only is character selection impossible, it is also impossible to use together; the connection to JIS X 0212 is not defined at all.” (section 3.3.1)
(extension kanji) defines a kanji set that expands upon the kanji set of JIS X 0208. According to this standard, it is “designed with the goal being to offer a sufficient character set for the purposes of encoding the modern Japanese language that JIS X 0208 intended to be from the start.”
JIS X 0208 and JIS X 0213 cannot be used at the same time. The kanji set of JIS X 0213 includes all characters that can be represented in the kanji set of JIS X 0208; they are part and parcel of the 1183 non-kanji and 10,050 kanji (for a total of 11,233 characters) codes defined by JIS X 0213.
At first, JIS X 0213 may appear to define or offer as a reference an upwardly compatible coded character set for JIS X 0208. However, in a strict sense, JIS X 0213 is not upwardly compatible with JIS X 0208. The JIS X 0213 drafting committee admitted as much (JIS X 0213:2000 section 5.3.2, JIS X 0213:2000 Appendix 1:2004 section 3.2.2).
This is because of the unification criteria that apply to some JIS X 0208 code points; namely, some different kanji glyphs that were represented by one JIS X 0208 code point due to their being unified are given separate code points in JIS X 0213. For this reason, data encoded according to JIS X 0208 codes cannot necessarily be directly converted into JIS X 0213 codes and remain correct.
For example, the glyph at row 33 cell 46 of JIS X 0208 (“”, described above) unifies a few variants due to its right-hand component. In JIS X 0213, two forms (the ones containing the component “”) are unified on plane 1 row 33 cell 46, and the other (containing the component “”) is located at plane 1 row 14 cell 41. Therefore, whether JIS X 0208 row 33 cell 46 should be mapped to JIS X 0213 plane 1 row 33 cell 46 or plane 1 row 14 cell 41 is computationally indeterminate.
For the most part, row m cell n in JIS X 0208 corresponds to plane 1 row m cell n in JIS X 0213; therefore, not much confusion arises in practice. This is because most typefaces have come to use the glyphs exemplified in JIS X 0208, and most users are not consciously aware of the unification criteria.
in ISO/IEC 10646 (UCS) and Unicode. Every kanji in JIS X 0208 corresponds to its own code point in UCS/Unicode’s Basic Multilingual Plane (BMP).
The non-kanji in JIS X 0208 also correspond to their own code points in the BMP. However, for some special characters, some systems implement correspondences that differ from those of UCS/Unicode (which are based on the character names given in JIS X 0208:1997).
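The claim that every assigned code point lands in the BMP can be spot-checked with a short Python sketch. It assumes that Python's `euc_jp` codec is a reasonable proxy for one particular JIS X 0208 to Unicode mapping; as noted above, other systems map some special characters differently, so this shows one mapping, not the only one.

```python
def decodable_code_points():
    """Yield (row, cell, character) for every code point the codec can decode."""
    for row in range(1, 95):
        for cell in range(1, 95):
            raw = bytes([0xA0 + row, 0xA0 + cell])
            try:
                yield row, cell, raw.decode("euc_jp")
            except UnicodeDecodeError:
                continue  # unassigned in the codec's mapping table


assigned = list(decodable_code_points())
outside_bmp = [(r, c) for r, c, ch in assigned if ord(ch) > 0xFFFF]
print(len(assigned), "code points decoded;", len(outside_bmp), "outside the BMP")
```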
Scope of use and compatibility
The character set that JIS X 0208 establishes is primarily intended for information interchange between data processing systems and the devices connected to them, or between data communication systems. This character set can be used for data processing and text processing.
Partial implementations of the character set are not compatible. The original drafting committee of the first standard took care to keep level 1 and level 2 separate, and the second standard then moved some itaiji between the levels; from this it is conjectured that, at least at the time of the first and second standards, implementations limited to the non-kanji and level 1 were envisaged. However, such implementations have never been specified as compatible.
Even though there are provisions in the JIS X 0208:1997 standard concerning compatibility, at the present time, it is generally considered that this standard neither certifies compatibility nor is it an official manufacturing standard that amounts to a declaration of self-compatibility. Consequently, de facto, JIS X 0208-“compatible” products are not considered to exist. Terminology such as and is included in JIS X 0208, but the semantics of these terms vary from person to person.
Code structure
JIS X 0208 codes are fundamentally two bytes of either seven or eight bits each. However, one graphic character, “space”, and every control character are represented with one-byte codes. To represent code points, column/line numbers and kuten numbers are used. To identify a character without depending on a code, character names are used.
Column/Line numbers
To represent the bit combination of a one-byte code, two decimal numbers are used: a column number and a line number. The three high-order bits of a seven-bit byte, or the four high-order bits of an eight-bit byte, form the column number, which ranges from zero to seven or from zero to fifteen respectively. The four low-order bits form the line number, which ranges from zero to fifteen.
For example, the bit combination corresponding to the graphic character “space” is 010 0000 as a 7-bit number and 0010 0000 as an 8-bit number. In column/line notation, this is represented as 2/0.
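The column/line split described above can be sketched in a few lines of Python. The function names `column_line` and `notation` are hypothetical helpers introduced here for illustration, not names taken from the standard.

```python
def column_line(byte: int) -> tuple[int, int]:
    """Split a 7- or 8-bit byte value into (column, line) numbers."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value that fits in one byte")
    column = byte >> 4    # high-order bits: 0-7 for 7-bit codes, 0-15 for 8-bit codes
    line = byte & 0x0F    # four low-order bits: 0-15
    return column, line


def notation(byte: int) -> str:
    """Format a byte value in column/line notation, e.g. 2/0."""
    column, line = column_line(byte)
    return f"{column}/{line}"


# The graphic character "space" (010 0000 as a 7-bit number).
print(notation(0b0100000))
```

Because a 7-bit value simply has a zero in the eighth bit, the same shift and mask cover both the 7-bit and the 8-bit case.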
Code points and code numbers
In a two-byte code, the first of the two bytes similarly gives the group of codes, called a row, and the second gives the individual code within the row, called a cell. A row and a cell together form a kuten point, that is, a code point.
The first and second bytes are each permitted to take the 94 column/line numbers from 2/1 to 7/14. Consequently, there are 94 rows, with 94 cells in each row, for a total of 8836 (94 × 94) code points.
A code point is referenced by a code number. Each row is given a number from 1 to 94, and within each row, each cell is given a number from 1 to 94. A code number is expressed in the form “row-cell”, the row and cell numbers being separated by a hyphen. For example, the character “亜” has a code point at row 16, cell 1, so its code number is represented as “16-01”.
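As a minimal sketch of how a code number relates to the two bytes, the following Python converts “16-01” into a 7-bit byte pair by offsetting row and cell so that 1 lands on column/line 2/1 (0x21) and 94 on 7/14 (0x7E). The helper names are hypothetical. The final cross-check assumes an EUC-JP style 8-bit form (high bit set on each byte) so that Python's `euc_jp` codec can decode the pair.

```python
def parse_code_number(code_number: str) -> tuple[int, int]:
    """Parse a "row-cell" code number such as "16-01" into (row, cell)."""
    row_text, cell_text = code_number.split("-")
    row, cell = int(row_text), int(cell_text)
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError(f"row and cell must both be in 1..94: {code_number!r}")
    return row, cell


def to_seven_bit_pair(row: int, cell: int) -> bytes:
    """Map row/cell 1..94 onto byte values 0x21..0x7E (2/1 to 7/14)."""
    return bytes([0x20 + row, 0x20 + cell])


row, cell = parse_code_number("16-01")
seven_bit = to_seven_bit_pair(row, cell)
print(seven_bit.hex())

# Assumption: an EUC-JP style 8-bit form sets the high bit of each byte.
eight_bit = bytes(b | 0x80 for b in seven_bit)
print(eight_bit.decode("euc_jp"))
```

Other encodings that carry JIS X 0208 arrange the bytes differently, so the offset shown here applies only to the plain 7-bit form and its high-bit variant.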
The correspondence between code numbers and graphic characters is laid out in a graphic character code table of 94 lines and 94 columns, in which row numbers become line numbers and cell numbers become column numbers.
This structure is also used in the Chinese GB 2312 and the Korean KS C 5601 (currently KS X 1001).
Unassigned code points
Among the 2-byte codes, rows 9 to 15 and 85 to 94 are unassigned code points; that is, they are code points with no characters assigned to them. Some cells in other rows are also effectively unassigned.
These empty areas contain code points that should basically not be used. Except when there is prior agreement among the relevant parties, characters (gaiji) for information interchange should not be assigned to the unassigned code points.
Even when assigning characters to unassigned code points, graphic characters defined in the standard should not be assigned to them, and the same character should not be assigned to multiple unassigned code points; characters should not be duplicated in the set.
Furthermore, when assigning characters to unassigned code points, it is necessary to be cautious of unification with regard to kanji glyphs. For example, row 25 cell 66 corresponds to the kanji meaning “high” or “expensive”; both the form with the character meaning “mouth” in the middle and the less common form with a ladder-like construction are subsumed into the same code point. Consequently, limiting point 25-66 to the “mouth” form and assigning the latter “ladder” form to an unassigned code point would technically be in violation of the standard.
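The wholly unassigned rows named in this section can be captured in a small lookup. This is only a sketch: `UNASSIGNED_ROWS` and `row_is_unassigned` are hypothetical names, and the check covers whole rows only. Individual unassigned cells inside otherwise assigned rows would need the full code table, which is not reproduced here.

```python
# Rows 9 to 15 and 85 to 94 have no characters assigned at all.
UNASSIGNED_ROWS = frozenset(range(9, 16)) | frozenset(range(85, 95))


def row_is_unassigned(row: int) -> bool:
    """Return True when the whole row is unassigned in JIS X 0208."""
    if not 1 <= row <= 94:
        raise ValueError(f"row must be in 1..94, got {row}")
    return row in UNASSIGNED_ROWS


for row in (8, 9, 15, 16, 84, 85, 94):
    print(row, row_is_unassigned(row))
```

A decoder could use such a check to flag byte pairs that fall in these rows as probable gaiji rather than standard characters.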
Character names
Each character given a code in this standard is also given a name. By using a character's name, it is possible to identify characters without relying on their codes. The names of characters are coordinated with other character set standards, so for some characters in some character sets, one can decide whether or not they are the same as characters in other character sets.
For example, both the character at ISO/IEC 646 column 4 line 1 and the one at JIS X 0208 row 3 cell 33 have the name “LATIN CAPITAL LETTER A”. Therefore, the character at 4/1 in ISO/IEC 646 and the character at 3-33 in this standard can be concluded to be the same character. Also, for the ISO/IEC 646 International Reference Version, 2/2 (quotation mark), 2/7 (apostrophe), 2/13 (hyphen-minus), and 7/14 (tilde) are characters that do not exist in this standard.
The names of non-kanji characters use uppercase Roman letters, spaces, and hyphens. Non-kanji characters are also given a Japanese common name, but some provisions for these names do not exist.
The names of kanji are set mechanically according to the hexadecimal representation of their code in the Universal Character Set (UCS). The name of a kanji is obtained by prefixing its UCS code with “CJK UNIFIED IDEOGRAPH-”. For example, row 16 cell 1 corresponds to 4E9C in UCS, so its name is “CJK UNIFIED IDEOGRAPH-4E9C”. Kanji are not given Japanese common names.
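The mechanical naming rule for kanji can be sketched as follows. The function name `kanji_character_name` is hypothetical; the cross-check uses Python's standard `unicodedata` module, which reports Unicode character names and happens to follow the same pattern for unified ideographs.

```python
import unicodedata


def kanji_character_name(ucs_code_point: int) -> str:
    """Build a kanji name from the hexadecimal form of its UCS code."""
    return f"CJK UNIFIED IDEOGRAPH-{ucs_code_point:04X}"


name = kanji_character_name(0x4E9C)  # the kanji at row 16 cell 1
print(name)

# Cross-check against the name recorded in Python's Unicode database.
print(name == unicodedata.name(chr(0x4E9C)))
```

Non-kanji names cannot be derived this way, since they are individually assigned rather than generated from a code.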
Overview
JIS X 0208 prescribes a set of 6879 graphic characters that correspond to two-byte codes with either seven or eight bits to the byte; in JIS X 0208, this is called the kanji set, which includes 6355 kanji as well as 524 non-kanji, including characters such as Latin letters, kana, and so forth.
Special characters
- Occupies rows 1 and 2. There are 18 such as the “ideographic space” ( ), and the Japanese comma and period; eight diacritical marks such as dakuten and handakuten; 10 characters for such as the iteration mark; 22 ; 45 ; and 32 unit symbols, which include the currency sign and the postal mark, for a total of 147 characters.
Numerals
- Occupies part of row 3. The ten digits from “0” to “9”.
Latin letters
- Occupies part of row 3. The 26 letters of the English alphabet in uppercase and lowercase form for a total of 52.
Hiragana
- Occupies row 4. Contains 48 unvoiced kana (including the obsolete wi and we), 20 voiced kana (dakuten), 5 semi-voiced kana (handakuten), and 10 small kana for palatalized and assimilated sounds, for a total of 83 characters.
Katakana
- Occupies row 5. There are 86 characters: the katakana equivalents of the hiragana characters, plus the small ka/ke kana (ヵ/ヶ) and the vu kana (ヴ).
Greek letters
- Occupies row 6. The 24 letters of the Greek alphabet in uppercase and lowercase form (minus the final sigma) for a total of 48.
Cyrillic letters
- Occupies row 7. The 33 letters of the Russian alphabet in uppercase and lowercase form for a total of 66.
Box drawing characters
- Occupies row 8. Thin segments, thick segments, and mixed thin and thick segments, 32 total.
Kanji
- The 2965 characters of level 1 from row 16 to row 47, and the 3390 characters of level 2 from row 48 to row 84, for a total of 6355.
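The per-category counts in this overview can be checked against the stated totals with a short calculation. The dictionary keys below are just labels for this sketch; the numbers are the ones given in the list above.

```python
NON_KANJI_COUNTS = {
    "special characters": 147,
    "numerals": 10,
    "Latin letters": 52,
    "hiragana": 83,
    "katakana": 86,
    "Greek letters": 48,
    "Cyrillic letters": 66,
    "box drawing characters": 32,
}
KANJI_COUNTS = {"level 1": 2965, "level 2": 3390}

non_kanji_total = sum(NON_KANJI_COUNTS.values())
kanji_total = sum(KANJI_COUNTS.values())
graphic_total = non_kanji_total + kanji_total
unused_code_points = 94 * 94 - graphic_total  # code points with no character

print(non_kanji_total, kanji_total, graphic_total, unused_code_points)
```

The categories sum to 524 non-kanji and 6355 kanji, matching the 6879 graphic characters stated for the kanji set and leaving 1957 of the 8836 code points without a character.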
Special characters, numerals, and Latin characters
As for the special characters in the kanji set, some characters from the graphic character set of the International Reference Version (IRV) of ISO/IEC 646 are absent from JIS X 0208. These are the aforementioned four characters “QUOTATION MARK”, “APOSTROPHE”, “HYPHEN-MINUS”, and “TILDE”. The former three are each split into several different code points in the kanji set (Nishimura, 1978; JIS X 0221-1:2001 standard, Section 3.8.7). The “TILDE” of IRV has no corresponding character in the kanji set.
In the following table, the ISO/IEC 646 IRV characters in question are compared with their multiple equivalents in JIS X 0208, except for the ISO/IEC 646 IRV character “TILDE”, which is compared with the “WAVE DASH” of JIS X 0208. The entries under the “Symbol” columns utilize UCS/Unicode code points, so the specifics of display may differ.
ISO/IEC 646 IRV | | | JIS X 0208 | | |
---|---|---|---|---|---|
Column/Line | Symbol | Name | Kuten | Symbol | Name |
2/2 | " | QUOTATION MARK | 1-15 | ¨ | DIAERESIS |
 | | | 1-40 | “ | LEFT DOUBLE QUOTATION MARK |
 | | | 1-41 | ” | RIGHT DOUBLE QUOTATION MARK |
 | | | 1-77 | ″ | DOUBLE PRIME |
2/7 | ' | APOSTROPHE | 1-13 | ´ | ACUTE ACCENT |
 | | | 1-38 | ‘ | LEFT SINGLE QUOTATION MARK |
 | | | 1-39 | ’ | RIGHT SINGLE QUOTATION MARK |
 | | | 1-76 | ′ | PRIME |
2/13 | - | HYPHEN-MINUS | 1-30 | ‐ | HYPHEN |
 | | | 1-61 | − | MINUS SIGN |
7/14 | ~ | TILDE | (no corresponding character) | | |
(no corresponding character) | | | 1-33 | 〜 | WAVE DASH |
This means that the kanji set is not upward-compatible with the IRV graphic character set, the most widespread character set in the world; this is counted as one of the weak points of this standard.
Even for the 90 special characters, numerals, and Latin letters that the kanji set and the IRV set have in common, this standard does not follow the arrangement of ISO/IEC 646. These 90 characters are split among rows 1 through 3.
It is thought that these incompatibilities are the reason why the numerals, Latin letters, and so forth in the kanji set came to be treated as “full-width” characters, and why early implementations interpreted them differently from their IRV counterparts.
Ever since the first standard, it has been possible to represent, by the composition of characters, such things as encircled numbers, ligatures for measurement unit names, and Roman numerals; they were not given independent kuten code points. Although individual manufacturers of information systems could endeavor to represent these characters through composition as their customers required, none requested that they be added to the standard, choosing instead to offer them as proprietary gaiji.
In the fourth standard (1997), all these characters were explicitly defined as characters that accompany an advancement of the current position; that is to say, they are spacing characters. Furthermore, it was ruled that they should not be made by the composition of characters. For this reason, it is no longer permitted to represent Latin characters with diacritics at all, with possibly the sole exception of the ångström symbol (Å) at row 2 cell 82.
Hiragana and katakana
The hiragana and katakana in JIS X 0208, unlike those in JIS X 0201, include dakuten and handakuten markings as part of a character. The katakana wi (ヰ) and we (ヱ) (both obsolete in modern Japanese) as well as the small wa (ヮ), not in JIS X 0201, are also included.
The arrangement of kana in JIS X 0208 is different from the arrangement of katakana in JIS X 0201. In JIS X 0201, the syllabary starts with wo (ヲ), followed by the small kana sorted in gojūon order, followed by the full-size kana, also in gojūon order. In JIS X 0208, on the other hand, the kana are sorted first by gojūon order, then in the order “small kana, full-size kana, kana with dakuten, kana with handakuten”, so that each fundamental kana is grouped with its derivatives. This ordering was chosen to simplify the sorting of kana-based dictionary look-ups (Yasuoka, 2006).
As mentioned above, JIS X 0208 did not follow the katakana order previously defined in JIS X 0201. It is thought that the JIS X 0201 katakana came to be treated as “half-width kana” because of this incompatibility with the katakana of this standard. This point is also counted as one of the weaknesses of this standard.
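The grouping of each fundamental kana with its derivatives can be seen by listing consecutive cells of row 4. The sketch below builds EUC-JP style byte pairs (an assumption: 0xA0 added to the row and cell numbers) and decodes them with Python's `euc_jp` codec; `hiragana_cells` is a hypothetical helper name.

```python
HIRAGANA_ROW = 4


def hiragana_cells(first: int, last: int) -> str:
    """Decode the hiragana in row 4, cells first..last (inclusive)."""
    raw = b"".join(
        bytes([0xA0 + HIRAGANA_ROW, 0xA0 + cell])
        for cell in range(first, last + 1)
    )
    return raw.decode("euc_jp")


# Small kana come directly before their full-size counterparts.
print(hiragana_cells(1, 10))
# A base kana is followed by its dakuten and handakuten forms.
print(hiragana_cells(47, 49))
```

Cells 1 to 10 decode to ぁあぃいぅうぇえぉお and cells 47 to 49 to はばぱ, so sorting by code number keeps related kana adjacent.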
Kanji
The sources from which the kanji in this standard were chosen, why they are split into level 1 and level 2, and how they are arranged are all explained in detail in the fourth standard (1997). According to it, the kanji included in the following four kanji listings were reflected in the 6349 characters of the first standard (1978).
- The Information Processing Society of Japan kanji code committee compiled this list in 1971. In the “Correspondence Analysis Results” described below, it is counted as 6086 characters.
- Selected by the Administrative Management Agency of Japan in 1975, it consists of 2817 characters. As data for the selection, the Agency produced a report that contrasted several kanji listings, starting with the “Kanji Listing for Standard Code (Tentative)”; this report is referred to as the “Correspondence Analysis Results” for short.
- One of the kanji listings that make up the “Correspondence Analysis Results”, consisting of 3044 characters. It no longer exists. The original drafting committee did not have the original list; the kanji from it were reflected in the standard by following the “Correspondence Analysis Results”.
- One of the kanji listings that make up the “Correspondence Analysis Results”, consisting of 3251 characters. They are the kanji used in the list of all administrative place names compiled by the Japan Geographic Data Center. The original drafting committee did not investigate the listing itself; the kanji from this list were taken from the “Correspondence Analysis Results”.
The second and third standards added four and two characters to level 2, respectively, bringing the total number of kanji to 6355. In the second standard, character forms were also changed and some characters were transposed between the levels; in the third standard, character forms were changed as well. These changes are described further below.
Level partitioning
For level 1, characters common to multiple kanji listings were chosen, using the tōyō kanji, the tōyō kanji correction draft, and the jinmeiyō kanji as a basis. Also, JIS C 6260 (“To-Do-Fu-Ken (Prefecture) Identification Code”; currently JIS X 0401) and JIS C 6261 (“Identification code for cities, towns and villages”; currently JIS X 0402) were consulted; kanji for nearly all Japanese prefectures, cities, districts, wards, towns, villages, and so forth were intentionally placed in level 1. Furthermore, amendments by experts were added.
Level 2 was dedicated to kanji that appeared in the aforementioned four major listings but were not selected for level 1. As noted below, the kanji of level 1 were ordered by their pronunciation, so some kanji whose pronunciations were difficult to determine were transferred from level 1 to level 2 on that basis (Nishimura, 1978).
Due to these decisions, level 1 for the most part contains the more frequently used kanji and level 2 the less frequently used ones, but these were judged by the standards of the day; over time, some level 2 kanji have become more frequently used, such as one meaning “to soar” and one meaning “to glitter”; conversely, some level 1 kanji have become infrequent, notably the ones meaning “centimeter” and “millimeter”. Also, a few jinmeiyō kanji, having been added after the kanji set was defined, fall into level 2.
Arrangement
The kanji in level 1 are sorted in order of each one’s “representative reading” (i.e. a canonical reading chosen for the purposes of this standard only); this reading may be an on or a kun reading, and readings are sorted in gojūon order. As a general rule, the on (Chinese-sound) reading is considered the representative reading; where a kanji has multiple on readings, the reading judged to be predominant in use frequency is used for the representative reading (JIS C 6226-1978 standard, Section 3.4). For those atypical kanji that do not have on readings, the kun reading was employed as the representative reading. Where a verb kun reading must be used as the representative reading, the ren'yōkei (rather than the shūshikei) form is used.
For example, cells 1 to 41 on row 16 are 41 characters whose representative readings start with a. Of these, 22 characters, including 16-10 (葵: on reading “ki”; kun reading “aoi”) and 16-32 (粟: on readings “zoku” and “shoku”; kun reading “awa”), are placed there on the basis of their kun readings. 16-09 (逢: on reading “hō”, kun reading “a(i)”) and 16-23 (扱: on readings “sō” and “kyū”, kun reading “atsuka(i)”) are just two examples of ren'yōkei-form verbs used for the representative reading.
Where the representative reading is the same between different kanji, a kanji that uses an on reading is placed ahead of one that uses a kun reading. Where the on or kun readings are the same between more than one kanji, they are then ordered by their primary radical and stroke count.
Whether on level 1 or level 2, itaiji are arranged to directly follow their exemplar form. For example, in level 2, right after row 49 cell 88, the immediately following characters deviate from the general rule (stroke count in this case) to include three variants of 49-88.
The kanji in level 2 are arranged in order of primary radical and stroke count. Where these two properties are the same for different kanji, they are then sorted by reading.
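The two ordering rules can be expressed as sort keys. The sketch below is deliberately simplified and rests on several assumptions: the `Entry` records are hypothetical placeholders rather than real kanji data, readings are limited to unvoiced kana so that a plain gojūon string can rank them, and the rule that itaiji follow their exemplar form is not modeled.

```python
from dataclasses import dataclass

# Simplified gojūon sequence; real collation must also handle voiced kana,
# small kana, and long vowels.
GOJUON = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわゐゑをん"


@dataclass(frozen=True)
class Entry:
    label: str         # placeholder identifier, not a real kanji
    reading: str       # representative reading in hiragana
    reading_type: str  # "on" or "kun"
    radical: int       # primary radical number
    strokes: int       # stroke count


def reading_rank(entry: Entry) -> tuple[int, ...]:
    return tuple(GOJUON.index(kana) for kana in entry.reading)


def level1_key(entry: Entry):
    """Reading first, on before kun, then radical and stroke count."""
    on_before_kun = 0 if entry.reading_type == "on" else 1
    return (reading_rank(entry), on_before_kun, entry.radical, entry.strokes)


def level2_key(entry: Entry):
    """Radical and stroke count first, then reading."""
    return (entry.radical, entry.strokes, reading_rank(entry))


entries = [
    Entry("C", "あい", "kun", 9, 8),
    Entry("A", "あ", "on", 30, 10),
    Entry("B", "あ", "on", 7, 7),
    Entry("D", "あい", "on", 61, 13),
]

level1_order = "".join(e.label for e in sorted(entries, key=level1_key))
level2_order = "".join(e.label for e in sorted(entries, key=level2_key))
print(level1_order, level2_order)
```

With these placeholder entries, the level 1 key yields the order B, A, D, C and the level 2 key yields B, C, A, D, showing how the same characters would be arranged differently under the two schemes.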
Kanji from unknown sources
Kuten | Symbol | Classification |
---|---|---|
52-55 | | Unknown |
52-63 | | Unknown |
54-12 | | Source unclear |
55-27 | | Unidentifiable |
57-43 | | Source unclear |
58-83 | | Source unclear |
59-91 | | Source unclear |
60-57 | | Source unclear |
74-12 | | Source unclear |
74-57 | | Source unclear |
79-64 | | Source unclear |
81-50 | | Source unclear |
It has been pointed out that there are kanji in the kanji set that are not found in comprehensive, unabridged kanji dictionaries, and that the sources thereof are unknown. For example, only one year after the first standard was established, Tajima (1979) reported that he had confirmed 63 kanji that were found neither in Shinjigen (a large kanji dictionary published by Kadokawa Shoten) nor in Dai Kan-Wa jiten, and that did not make sense as ryakuji of any sort; he noted that it would be preferable for kanji not available in kanji dictionaries to be selected from definite sources. These kanji came to be known as “ghost characters” or “ghost kanji”, among other names.
The drafting committee for the fourth version of the standard also saw the existence of kanji with sources unknown as a problem, and so made an inquiry into just what kind of sources the drafting committee of the first version referenced. As a result, it was discovered that the original drafting committee had heavily relied on the “Correspondence Analysis Results” to collect kanji. When the drafting committee investigated the “Correspondence Analysis Results”, it became clear that many of the kanji included in the kanji set but not found in exhaustive kanji dictionaries supposedly came from the “Japanese Personality Registration Name Kanji” and “Kanji for National Administrative District Listing” lists mentioned in the “Correspondence Analysis Results”.
It was confirmed that no original text for the “Japanese Personality Registration Name Kanji” referenced in the “Correspondence Analysis Results” exists. For the “National Administrative District Listing”, Sasahara Hiroyuki of the fourth version's drafting committee examined the kanji that appeared on the in-progress development pages for the first standard. The committee also consulted many ancient writings, as well as many examples of personal names in a database of NTT phone books.
Due to this thorough investigation, the committee was able to pare down the number of kanji for which the source cannot be confidently explained to twelve, shown on the table to the right. Of these, it is conjectured that several glyphs came about due to copying errors.
Unification of kanji variants
According to the specifications in the fourth standard (1997), unification is the action of giving the same code point to a character without regard to its different character forms. In the fourth standard, the glyphs allowed are limited; the extent to which particular allographic glyphs are unified into a graphemic code point is clearly defined.
Furthermore, according to the specifications in the standard, a glyph is an abstract notion as to the graphical representation of a graphic character; a character form is the concrete graphical shape that a glyph takes in actuality (e.g. when it is handwritten, printed, or displayed on a screen). For a single glyph, there exists an endless range of possible concretely and/or visibly different character forms. A variation between character forms of one glyph is termed a “design difference”.
The extent to which glyphs are unified to one code point is determined according to that code point's example glyph and the unification criteria that can be applied to that example glyph; that is, the example glyph for a code point applies to that code point, and so does any glyph obtained by replacing parts of the example glyph in accordance with the unification criteria.
For example, the example glyph at 33-46 is composed of radical 9 and the kanji that eventually gave rise to both so kana (hiragana そ and katakana ソ). Also, in unification criterion 101, there are three kanji displayed: the first takes the form most often seen in Japanese; the second contains a more traditional form in which the first two strokes form radical 12 (the kanji numeral for the number 8: 八); and the third is like the second, except that radical 12 is inverted. Consequently, all three forms apply to the code point at row 33 cell 46.
In the fourth standard, including one of the errata for the first printing, there are 186 unification criteria.
When a code point’s example glyph is composed of more than one part glyph, unification criteria can be applied to each part. After a unification criterion is applied to one part glyph, that part cannot have any more unification criteria applied to it. Also, a unification criterion is not allowed to apply if the resulting glyph would coincide with that of another code point entirely.
An example glyph is no more than an example for that code point; it is not a glyph “endorsed” by the standard. Also, the unification criteria need only be used for generally used kanji and for the purpose of assigning things to the code points of this standard. The standard requests that generally unused kanji not be created based on the example glyphs and unification criteria.
The kanji of the kanji set are not chosen completely consistently according to the unification criteria. For example, although 41-7 corresponds both to the form where the third and fourth strokes cross and to the form where they do not, according to unification criterion 72, 20-73 corresponds only to the form where they do not cross, and 80-90 corresponds only to the form where they do.
The terms “unification”, “unification criteria”, and “example glyph” were adopted in the fourth standard. From the first to the third version, kanji and relations between kanji were grouped into three types: “independent”, “compatible”, and “equivalent”; it was explained that the characters recognized as equivalent “consolidate to just one point”. “Equivalence” covered not only kanji with exactly the same shape, but also kanji with differences due to style and kanji where the difference in character form is small.
In the first standard, it was stipulated that “this standard ... does not establish the particulars of character forms” (Section 3.1); it also states that “the aim of this standard is to establish the general idea of characters and their codes; the design of their character forms and such lie outside its scope.” The second and third standards likewise contain notes to the effect that specific designs of character forms lie outside their scope (the note on item 1). The fourth standard also stipulates that “This standard regulates graphic characters as well as their bit patterns, and the use, specific designs of individual characters, and so forth are not within the scope of this standard” (JIS X 0208:1997, item 1).
Unification criteria for compatibility
In the fourth standard, unification criteria for compatibility are defined. Their application is limited to 29 code points whose glyphs vary greatly between JIS C 6226-1978 and the standards from JIS C 6226-1983 onward. For those 29 code points, the glyphs from JIS C 6226-1983 onward are displayed as “A”, and the glyphs from JIS C 6226-1978 as “B”. At each of these code points, either the “A” or the “B” glyph may be used. However, in order to claim compatibility with the standard, whether the “A” or “B” form has been used for each code point must be explicitly noted.
Eight kinds of codes
In JIS X 0208:1997, article 7 in the standard itself as well as Appendices 1 and 2 define a total of eight kinds of codes.
- The terms “code” and “coded character set” are considered synonyms here.
- JIS X 0211 is the national standard that conforms to ISO/IEC 6429.
- The “CL” (control left), “GL” (graphic left), “CR” (control right), and “GR” (graphic right) regions are, in column/line notation, from 0/0 to 1/15, from 2/1 to 7/14, from 8/0 to 9/15, and from 10/1 to 15/14, respectively.
- For each code, 2/0 is assigned the graphic character “SPACE” and 7/15 the control character “DELETE”.
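The column/line notation and the four regions in the notes above can be sketched in a few lines of Python. This is only an illustration: the function names `column_line` and `region` are hypothetical, and the sketch assumes the usual reading of the notation, in which the column is the high four bits of a byte and the line is the low four bits.

```python
def column_line(byte: int) -> str:
    """Format a byte value in column/line notation, e.g. 0x20 -> '2/0'."""
    return f"{byte >> 4}/{byte & 0x0F}"


def region(byte: int) -> str:
    """Classify an 8-bit value according to the regions listed in the notes."""
    if byte == 0x20:
        return "SPACE"       # 2/0
    if byte == 0x7F:
        return "DELETE"      # 7/15
    if 0x00 <= byte <= 0x1F:
        return "CL"          # 0/0 to 1/15
    if 0x21 <= byte <= 0x7E:
        return "GL"          # 2/1 to 7/14
    if 0x80 <= byte <= 0x9F:
        return "CR"          # 8/0 to 9/15
    if 0xA1 <= byte <= 0xFE:
        return "GR"          # 10/1 to 15/14
    return "unassigned"      # 10/0 and 15/15 fall outside the four regions


for value in (0x1B, 0x20, 0x41, 0x7F, 0x8E, 0xA3):
    print(column_line(value), region(value))
```

Each of the GL and GR regions holds 94 positions, which is why a two-byte code built from either region yields 94 × 94 code points.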
7-bit code for kanji
- Stipulated in the standard itself. The C0 set of JIS X 0211 is allotted to the CL region, and the kanji set is allotted to the GL region.
8-bit code for kanji
- Stipulated in the standard itself. The C0 set of JIS X 0211 is allotted to the CL region, and the kanji set is allotted to the GL region. To the CR region, either the C1 set of JIS X 0211 is allotted, or nothing is allotted. Nothing is allotted to the GR region.
International Reference Version + 7-bit code for kanji
- Stipulated in the standard itself. The C0 set is allotted to the CL region. The graphic characters of the International Reference Version (IRV) of ISO/IEC 646 and the kanji set of JIS X 0208 are, respectively, allotted to the GL region according to “SHIFT-IN” and “SHIFT-OUT”.
International Reference Version + 8-bit code for kanji
- Stipulated in the standard itself. The C0 set of JIS X 0211 is allotted to the CL region, the graphic character set of the IRV of ISO/IEC 646 to the GL region, and the kanji set to the GR region. This corresponds to EUC-JP used without the graphic character set for katakana of JIS X 0201 and the supplemental kanji of JIS X 0212.
Latin characters + 7-bit code for kanji
- Stipulated in the standard itself. The IRV graphic character set of IRV + 7-bit code for kanji is replaced by the graphic character set for Latin characters of JIS X 0201.
Latin characters + 8-bit code for kanji
- Stipulated in the standard itself. The IRV graphic character set of IRV + 8-bit code for kanji is replaced by the graphic character set for Latin characters of JIS X 0201.
Shift coded character set
- Stipulated in Appendix 1, “Shift-Coded Representation”. This is Shift JIS.
RFC 1468 coded character set
- Stipulated in Appendix 2, “RFC 1468-Coded Representation”. It resembles ISO-2022-JP, which is governed by RFC 1468. However, in contrast to ISO-2022-JP, which is a code where seven bits go into a byte, the RFC 1468 coded character set is a code where eight bits go into a byte.
Among these coded character sets stipulated in the fourth standard, only the “Shift” coded character set is registered by the IANA.
Escape sequence
It is also possible to use the kanji set by means of code extension techniques. The escape sequences that designate the kanji set to each G-buffer are listed below. Here, “ESC” refers to the control character “ESCAPE”.
Standard | G0 | G1 | G2 | G3 |
---|---|---|---|---|
78 | ESC 2/4 4/0 | ESC 2/4 2/9 4/0 | ESC 2/4 2/10 4/0 | ESC 2/4 2/11 4/0 |
83 | ESC 2/4 4/2 | ESC 2/4 2/9 4/2 | ESC 2/4 2/10 4/2 | ESC 2/4 2/11 4/2 |
90 onward | ESC 2/6 4/0 ESC 2/4 4/2 | ESC 2/6 4/0 ESC 2/4 2/9 4/2 | ESC 2/6 4/0 ESC 2/4 2/10 4/2 | ESC 2/6 4/0 ESC 2/4 2/11 4/2 |
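As an illustration, the entries in the table can be turned into concrete byte strings. The sketch below is not taken from the standard; the helper names `cl` and `escape_bytes` are hypothetical, and it assumes that a column/line pair c/l denotes the byte value 16 × c + l and that ESC is the byte 1/11 (0x1B).

```python
ESC = 0x1B  # the control character ESCAPE, column/line 1/11


def cl(notation: str) -> int:
    """Convert column/line notation such as '2/4' to a byte value."""
    column, line = notation.split("/")
    return int(column) * 16 + int(line)


def escape_bytes(sequence: str) -> bytes:
    """Convert a table entry such as 'ESC 2/4 4/2' to raw bytes."""
    return bytes(ESC if token == "ESC" else cl(token) for token in sequence.split())


# G0 designations for the three rows of the table
g0_78 = escape_bytes("ESC 2/4 4/0")
g0_83 = escape_bytes("ESC 2/4 4/2")
g0_90 = escape_bytes("ESC 2/6 4/0 ESC 2/4 4/2")

print(g0_78, g0_83, g0_90)
```

Under these assumptions the G0 designation for the 1983 row comes out as ESC followed by the ASCII characters `$B`, the same byte sequence that ISO-2022-JP text uses to switch to the kanji set.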
The problem of duplicate encodings
When using the kanji set of this standard with either the graphic character set of the ISO/IEC 646 IRV or the JIS X 0201 graphic character set for Latin characters, the treatment of the characters common to both sets becomes problematic. Unless special measures are taken, the characters included in both sets do not all map to each other one-to-one, and a single character may be given more than one code point; that is, a duplicate encoding may arise.
When a character is common to both sets, JIS X 0208:1997 basically forbids the use of the code point in the kanji set (one of the two code points), eliminating duplicate encodings. Characters that have the same name are judged to be the same character.
For example, both the name of the character corresponding to the bit pattern 4/1 in the ISO/IEC 646 IRV character set and the name of the character corresponding to row 3 cell 33 of the kanji set are “LATIN CAPITAL LETTER A”. In International Reference Version + 8-bit code for kanji, the letter “A” (i.e. “LATIN CAPITAL LETTER A”) is represented both by the bit pattern 4/1 and by the bit pattern corresponding to the kanji set’s row 3 cell 33 (10/3 12/1). The standard forbids the use of the “10/3 12/1” bit pattern, in an attempt to eliminate the duplicate encoding.
In consideration of implementations that treat the characters of the code points in the kanji set as “full-width characters”, distinct from the characters of the IRV character set or the graphic character set for Latin characters, the use of the kanji set code points is permitted, but only for the sake of backwards compatibility. For example, for the purpose of backwards compatibility, it is permitted to consider 10/3 12/1 in International Reference Version + 8-bit code for kanji to correspond to a full-width “A”.
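The relationship between a row-cell number and the bit patterns quoted above can be worked through in code. The following is a minimal sketch with hypothetical helper names (`kuten_to_gl`, `kuten_to_gr`, `column_line`); it assumes that row and cell numbers 1 to 94 map in order onto the positions 2/1 to 7/14 of the GL region, and that the GR form is the same pair of bytes with the high bit set.

```python
def kuten_to_gl(row: int, cell: int) -> bytes:
    """Two-byte GL form (7-bit code) of a row-cell number."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in the range 1..94")
    # Position 1 corresponds to 2/1 (0x21), so add 0x20 to each number.
    return bytes([0x20 + row, 0x20 + cell])


def kuten_to_gr(row: int, cell: int) -> bytes:
    """Two-byte GR form (kanji set in the GR region) of a row-cell number."""
    return bytes(b | 0x80 for b in kuten_to_gl(row, cell))


def column_line(data: bytes) -> str:
    """Render bytes in column/line notation, e.g. b'\\xa3\\xc1' -> '10/3 12/1'."""
    return " ".join(f"{b >> 4}/{b & 0x0F}" for b in data)


print(column_line(kuten_to_gl(3, 33)))  # row 3 cell 33 in the GL region
print(column_line(kuten_to_gr(3, 33)))  # row 3 cell 33 in the GR region
```

For row 3 cell 33 the GR form is 10/3 12/1, matching the bit pattern given in the text, while the first byte of the GL form is 2/3 and the second is 4/1, the same value as the single-byte “A” of the IRV.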
If the kanji set is used along with the IRV graphic character set or the graphic character set for Latin characters, then even if the standard is abided by strictly, the unique encoding of a character is not guaranteed. For example, in the International Reference Version + 8-bit code for kanji, it is valid to represent a hyphen with the bit pattern 2/13 for the character “HYPHEN-MINUS”, as well as with the kanji set’s row 1 cell 30 (bit pattern 10/1 11/14) for the character “HYPHEN”. In addition, the standard does not define which of the two to use for what, and so the hyphen is not given one unique encoding. The same problem affects the minus sign, the quotation marks, and so forth.
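The hyphen case can be observed with an existing decoder. The sketch below uses Python's built-in `euc_jp` codec as a stand-in for the International Reference Version + 8-bit code for kanji; this is an assumption for illustration, since the codec is an EUC-JP implementation with its own Unicode mapping table and is not a conformance reference for JIS X 0208.

```python
import unicodedata

# 2/13: the single-byte character from the IRV graphic character set
single = b"\x2d".decode("euc_jp")

# 10/1 11/14: row 1 cell 30 of the kanji set, placed in the GR region
double = b"\xa1\xbe".decode("euc_jp")

names = [unicodedata.name(ch) for ch in (single, double)]
print(names)
```

Both byte sequences decode without error and yield two differently named characters, so a program that searches or compares text has to decide for itself whether to treat them as the same hyphen.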
Moreover, even when the kanji set is used as a separate code, unique encoding of characters is not guaranteed. In many cases, the full-width “IDEOGRAPHIC SPACE” at row 1 cell 1 and the half-width space (2/0) coexist. How the two should differ is not self-evident, and is not specified in the standard.
History
Within five years of a Japanese Industrial Standard being established, reaffirmed, or revised, the standard undergoes a process of reaffirmation, revision, or withdrawal. Since its establishment, the standard has been revised three times, and at present the fourth standard is in effect.
First standard
The first standard is JIS C 6226-1978, established by the Japanese Minister of International Trade and Industry on 1 January 1978. It is also called 78JIS for short. Entrusted by the Agency of Industrial Science and Technology (AIST), a JIPDEC kanji code standardization research and study committee produced the draft. The committee chairman was Moriguchi Shigeichi.
The first standard contained 108 special characters and 3384 level 2 kanji, and no box-drawing characters. In total, the kanji set was composed of 453 non-kanji and 6349 kanji, for 6802 characters. The standard itself was set in Shaken Co., Ltd’s Ishii Mincho typeface.
Second standard
The second standard, JIS C 6226-1983, revised the first standard on 1 September 1983. It is also called 83JIS. Entrusted by the AIST, a JIPDEC kanji code-related JIS committee produced the draft. The committee chairman was Motooka Tōru.
The draft of the second standard took into consideration factors such as the promulgation of the jōyō kanji, the enforcement of the jinmeiyō kanji, and the standardization of Japanese-language Teletex by the Ministry of Posts and Telecommunications. In addition, the following modifications were made to keep pace with JIS C 6234-1983 (24-pixel matrix printer character forms; presently JIS X 9052).
Addition of special characters
- 39 characters were added to the special characters. These 39 were chosen, following JICST recommendations and standards such as JIS Z 8201-1981 (mathematical symbols) and JIS Z 8202-1982 (quantity, unit, and chemical symbols), from among symbols that could not be represented by composition.
Newly added box-drawing characters
- 32 box-drawing characters were added.
Swapping of itaiji code points
- Code points were swapped for 22 sets of level 1 and level 2 kanji with mutually related itaiji. For example, (level 1’s) row 36 cell 59 in the first standard was moved to (level 2’s) row 52 cell 68; the point originally at row 52 cell 68 was in turn moved to row 36 cell 59.
Additions to the level 2 kanji
- Three characters from level 1 and one character from level 2 were given new code points at previously unassigned code points in row 84 as level 2 kanji. Itaiji of those characters were then placed at the original code points. For example, the character at row 22 cell 38 in the first standard was moved to row 84 cell 1 in the second standard, and a different form not included in the first standard was placed at row 22 cell 38 as a level 1 kanji.
Modification of character forms
- The character forms of approximately 300 kanji were amended.
Among the changes in those 300 or so kanji character forms, many level 1 glyphs that were in the style of the Kangxi Dictionary were changed into variants, especially more simplified forms (e.g. ryakuji and extended shinjitai). For example, two code points that are often the subject of criticism for having been changed greatly are row 18 cell 10 and row 38 cell 34.
There were many smaller changes away from the Kangxi-style variants; for example, row 25 cell 84 lost part of a stroke. Also, where some glyphs for level 1 kanji were not Kangxi-style forms, there were some changed into their Kangxi-style forms; for example, row 80 cell 49 gained part of a stroke (i.e., the same part of the stroke that 25-84 lost).
Because the fourth standard sought to clarify the original intent of the first standard, such differences came to fall within the scope of its unification criteria. The difference in form for the examples noted above (25-84 and 80-49) falls under the parameters for unification criterion 42.
The bulk of the changes to character forms created differences between level 1 and level 2 kanji. Specifically, simplification was applied more often to level 1 kanji than to level 2 kanji; simplifications applied to level 1 kanji were generally not applied to kanji in level 2. The aforementioned 25-84 and 80-49 were likewise treated differently, as the former is in level 1 and the latter is in level 2. Even so, some changes were made regardless of level; for instance, characters containing the “door” and “winter” components were changed with no difference in treatment between level 1 and level 2 kanji.
However, for 29 code points (such as the problematic 18-10 and 38-34 mentioned above), the forms inherited by the fourth standard contradict the original intent of the first. For these, there are special unification criteria to maintain compatibility with the previous standards at these code points.
When the new “X” category for Japanese Industrial Standards (for information-related fields) was introduced, the second standard was re-termed JIS X 0208-1983 on 1 March 1987.
Third standard
The third standard, JIS X 0208-1990, revised the second standard on 1 September 1990. It is also called 90JIS for short. Entrusted by the AIST, a committee at the Japanese Standards Association for the revision of JIS X 0208 created the draft. The committee chairman was Tajima Kazuo.
225 kanji glyphs were changed, and two characters were added to level 2. Some of the changes and the two additions corresponded to the 118 jinmeiyō kanji added in March 1990. The standard itself was set in Heisei Mincho.
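The character counts quoted for the first three standards can be tallied as a quick consistency check. The sketch below uses only the figures given in this article; the variable names are hypothetical, and it assumes that code point swaps and glyph changes leave the total unchanged.

```python
# First standard (78JIS): 453 non-kanji and 6349 kanji
first = 453 + 6349

# Second standard (83JIS): 39 special characters, 32 box-drawing
# characters, and 4 kanji newly assigned in row 84
second = first + 39 + 32 + 4

# Third standard (90JIS): two characters added to level 2
third = second + 2

print(first, second, third)
```

The final figure agrees with the 6879 graphic characters stated for the current standard, since the fourth standard added and removed nothing.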
Fourth standard
The fourth standard, JIS X 0208:1997, revised the third standard on 20 January 1997. It is also called 97JIS for short. Entrusted by the AIST, a JSA committee for research and study of coded character sets produced the draft. The committee chairman was Shibano Kōji.
The basic policies of this revision were to make no changes to the character set, to clarify ambiguous provisions, and to make the standard easier to use. No characters were added or removed, no code points were rearranged, and the example glyphs were left unchanged without exception. However, the stipulations of the standard were completely rewritten and/or supplemented. Whereas the third standard was 65 pages long without the explanations, the fourth standard was 374 pages without the explanations.
The main points of the revision are:
Definition of encoding methods
- Until the third standard, only the encoding method based on JIS X 0202 code extension was defined. This is something unusual as far as coded character sets go. In the fourth standard, encoding methods that do not use escape sequences for the purpose of code extension were defined.
Definition of the general prohibition of the use of unassigned code points and methods of usage for unassigned code points
- The third standard, in an explanation that was not part of the standard proper, read as though it were acceptable to assign gaiji to some unassigned code points. In the fourth standard, it was clarified that use of unassigned code points is generally prohibited. Also, the conditions for the usage of unassigned code points were specified.
General elimination of duplicate encodings
- Each character was given a “character name” that maps to those of other standards. Also, encoding methods to use them together with the ISO/IEC 646’s International Reference Version or JIS X 0201 were specified. When JIS X 0208 is used together with either, among two assigned code points for characters with the same name, only one is permitted; thus, duplicate encodings were generally eliminated.
Investigation into sources of kanji
- Characters included in the standard so far that are found in neither the Kangxi Dictionary nor the Dai Kanwa Jiten were identified. Accordingly, the purpose for which these kanji were included and the sources from which they came during compilation of the first standard were investigated.
Definition of kanji unification criteria
- Based on things such as the materials for the drafting of the first standard, an attempt was made to restore the intent of the first standard for the scope of the glyphs each code point represents. Moreover, the criteria for unifying kanji glyphs were clearly defined.
Inclusion of de facto standards
- By the time of the fourth standard, the encoding methods Shift JIS and ISO-2022-JP had become de facto standards for personal computing and e-mail, respectively. These encoding methods were included as “Shift-Coded Representation” and “RFC 1468-Coded Representation” (described above).
Future
JIS X 0213 (extended kanji) was designed “with the goal being to offer a sufficient character set for the purposes of encoding the modern Japanese language that JIS X 0208 intended to be from the start”; it defines a character set that expands upon the kanji set of JIS X 0208. The drafters of JIS X 0213 recommend migration from JIS X 0208 to JIS X 0213, among the advantages being JIS X 0213’s compatibility with the Hyōgai Kanji Glyph List and with newer jinmeiyō kanji.
Contrary to the expectations of the drafters, adoption of JIS X 0213 has been anything but fast since its enactment in the year 2000. The drafting committee of JIS X 0213:2004 wrote (in the year 2004), “The status where ‘what the majority of information systems can use in common is JIS X 0208 only’ still continues.” (JIS X 0213:2000, Appendix 1:2004, section 2.9.7)
For Microsoft Windows, the predominant operating system (and hence supplying the predominant desktop environment) in the personal computing sector, the JIS X 0213 repertoire has been included since Windows Vista, released in November 2006. Mac OS X has been compatible with JIS X 0213 since version 10.1 (released in 2001). Many Unix-like systems such as Linux can optionally support JIS X 0213. Therefore, it is thought that with time, JIS X 0213 support on personal computers will not be an impediment to its eventual adoption.
Among the drafters of JIS X 0213, there are those who expect to see a mix of JIS X 0208 and JIS X 0213 before any adoption of JIS X 0213 (Satō, 2004). However, JIS X 0208 continues to be used for the present, and many predict it to endure as a standard. There are barriers that need to be overcome if JIS X 0213 is to supplant JIS X 0208 in common usage:
- The character repertoires utilized in Japanese mobile phoneMobile phoneA mobile phone is a device which can make and receive telephone calls over a radio link whilst moving around a wide geographic area. It does so by connecting to a cellular network provided by a mobile network operator...
s at the present time are based on JIS X 0208. There are no officially announced plans to migrate these to JIS X 0213 compatibility. Because mobile phones are now a pervasive aspect of Japanese textual communication (see Japanese mobile phone culture), being a widespread, commonly accessed medium for sending e-mail and accessing the World Wide Web, a lack of adoption for mobile phones deters usage elsewhere.
- JIS X 0213 is not strictly upward-compatible with JIS X 0208. For large-scale archives (e.g. bibliographic databases and Aozora Bunko) that use JIS X 0208 and follow its unification criteria strictly, it is thought that converting all the data to JIS X 0213 while preserving the integrity of the text would be extremely difficult.
- In practice, many systems define and use unassigned code points in JIS X 0208. For example, Windows assigns IBM extended characters and user-defined character areas, and mobile phones assign emoji in some such places. The code points of these gaiji conflict with the code points that JIS X 0213 codes use, so there would be some difficulty in migrating these systems from JIS X 0208 to JIS X 0213. There are also plans to migrate to UCS/Unicode and use the JIS X 0213 repertoire from there, but until a system administrator is able to judge that the implementations of UCS/Unicode surrogate pairs and character compositions are sufficiently stable, he or she is likely to hesitate to use the part of the JIS X 0213 repertoire that requires those implementations.
- The improvements provided by JIS X 0213 are mostly in the realm of characters that are not used as often as the ones already present in JIS X 0208. Because nearly twice as many glyphs must be implemented, and the additional glyphs see comparatively little use, the return on investment can be low in many cases, especially where resources are constrained.
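The use of unassigned JIS X 0208 code points by vendor encodings can be observed with a short sketch. The Python below is only an illustration: `decode_or_none` and `ROW_13_CELL_1` are hypothetical names, and it assumes that Python's built-in `shift_jis` codec follows JIS X 0208 strictly while `cp932` implements Microsoft's Code page 932.

```python
def decode_or_none(raw: bytes, codec: str):
    """Decode raw with the named codec, or return None if the bytes are unassigned."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# Shift JIS byte pair for row 13 cell 1, a cell that JIS X 0208 leaves unassigned.
ROW_13_CELL_1 = b"\x87\x40"

strict_result = decode_or_none(ROW_13_CELL_1, "shift_jis")  # strict JIS X 0208 view
vendor_result = decode_or_none(ROW_13_CELL_1, "cp932")      # vendor-extended view
```

Under these assumptions the strict codec rejects the byte pair while the vendor codec returns a character. Data that depends on such assignments cannot be reinterpreted under another extension of the same code space without checking each of these cells.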
Implementations
Because JIS X 0208 / JIS C 6226 is primarily a character set and not a strictly defined character encoding, several companies have implemented their own encodings of the character set.
- Apple Computer Inc.: MacJapanese
- Fujitsu: JEF kanji code
- Hitachi Ltd.: KEIS
- Microsoft: Code page 932
- NEC: JIPS
ISO/IEC 646 IRV
As noted above, the kanji set is not upwardly compatible with the ISO/IEC 646 IRV graphic character set. The kanji set and the IRV graphic character set can be used together as specified in JIS X 0208 (IRV + 7-bit code for kanji and IRV + 8-bit code for kanji). They can be used together in EUC-JP as well.
JIS X 0201
The kanji set lacks three characters included in JIS X 0201’s graphic character set for Latin characters: 2/2 (QUOTATION MARK), 2/7 (APOSTROPHE), and 2/13 (HYPHEN-MINUS). The kanji set contains all characters included in JIS X 0201’s graphic character set for katakana.
The kanji set and the graphic character set for Latin characters can be used together as specified in JIS X 0208 (Latin characters + 7-bit code for kanji and the Latin characters + 8-bit code for kanji). The kanji set, graphic character set for Latin characters, and JIS X 0201’s graphic character set for katakana can be used together as specified in JIS X 0208 (the shift-coded character set; i.e. Shift JIS). The kanji set and graphic character set for katakana can be used together in EUC-JP.
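How a row-cell code number becomes bytes in these combined schemes can be sketched in a few lines. The Python below is an illustration under stated assumptions, not text from the standard: the function names are hypothetical, and it uses the conventional arithmetic for the 7-bit form (offset 0x20), EUC-JP (high bit set on both bytes), and Shift JIS (rows packed two per lead byte).

```python
def kuten_to_jis(row: int, cell: int) -> bytes:
    """7-bit two-byte form: row and cell numbers 1..94 map onto 2/1..7/14."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def kuten_to_euc_jp(row: int, cell: int) -> bytes:
    """EUC-JP form: the 7-bit bytes with the high bit set."""
    j1, j2 = kuten_to_jis(row, cell)
    return bytes([j1 | 0x80, j2 | 0x80])


def kuten_to_shift_jis(row: int, cell: int) -> bytes:
    """Shift JIS form: two rows share one lead byte, which skips 0xA0..0xDF."""
    j1, j2 = kuten_to_jis(row, cell)
    lead = (j1 + 1) // 2 + 0x70
    if lead >= 0xA0:          # jump over the single-byte katakana range
        lead += 0x40
    if j1 % 2:                # odd first byte: lower half of the trail range
        trail = j2 + 0x1F
        if trail >= 0x7F:     # 0x7F is never used as a trail byte
            trail += 1
    else:                     # even first byte: upper half of the trail range
        trail = j2 + 0x7E
    return bytes([lead, trail])
```

For row 16 cell 1, the three functions return the byte pairs 0x30 0x21, 0xB0 0xA1, and 0x88 0x9F respectively. Python's built-in `euc_jp` and `shift_jis` codecs decode the latter two to the same character.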
JIS X 0212
JIS X 0212 (supplementary kanji) defines additional characters for JIS X 0208 and their code points for the purposes of information processing that requires characters not found in JIS X 0208.
Among the code points that the second version of JIS X 0208 changed, 28 code points in JIS X 0212 reflect the character forms from before the changes. Also, JIS X 0212 reassigns the “closure mark” that JIS X 0208 had assigned as a non-kanji (at row 1 cell 26) as a kanji (at row 16 cell 17). JIS X 0212 has no characters in common with JIS X 0208 other than these.
JIS X 0208 and JIS X 0212 can be used together in EUC-JP. Also, JIS X 0208 and JIS X 0212 are both source standards for UCS/Unicode’s Han unification, so using UCS/Unicode, the kanji from both standards can be made to coexist.
However, in the fourth version of JIS X 0208, the connection to JIS X 0212 was not defined at all. It is believed that this is because the drafting committee of the fourth JIS X 0208 standard had a critical opinion of the selection and identification methods of JIS X 0212. The text of the fourth standard, as well as pointing out the problematic points of the character selection of JIS X 0212, states that “it is thought that not only is character selection impossible, it is also impossible to use together; the connection to JIS X 0212 is not defined at all.” (section 3.3.1)
JIS X 0213
JIS X 0213 (extension kanji) defines a kanji set that expands upon the kanji set of JIS X 0208. According to this standard, it is “designed with the goal being to offer a sufficient character set for the purposes of encoding the modern Japanese language that JIS X 0208 intended to be from the start.”
JIS X 0208 and JIS X 0213 cannot be used at the same time. The kanji set of JIS X 0213 includes all characters that can be represented in the kanji set of JIS X 0208; they are included among the 1183 non-kanji and 10,050 kanji (for a total of 11,233 characters) defined by JIS X 0213.
At first, JIS X 0213 may appear to define or offer as a reference an upwardly compatible coded character set for JIS X 0208. However, in a strict sense, JIS X 0213 is not upwardly compatible with JIS X 0208. The JIS X 0213 drafting committee acknowledged as much (JIS X 0213:2000 section 5.3.2, JIS X 0213:2000 Appendix 1:2004 section 3.2.2).
This is because of the unification criteria that apply to some JIS X 0208 code points; namely, some different kanji glyphs that were represented by one JIS X 0208 code point due to their being unified are given separate code points in JIS X 0213. For this reason, data encoded according to JIS X 0208 codes cannot necessarily be directly converted into JIS X 0213 codes and remain correct.
For example, the glyph at row 33 cell 46 of JIS X 0208 (“”, described above) unifies a few variants due to its right-hand component. In JIS X 0213, two forms (the ones containing the component “”) are unified on plane 1 row 33 cell 46, and the other (containing the component “”) is located at plane 1 row 14 cell 41. Therefore, whether JIS X 0208 row 33 cell 46 should be mapped to JIS X 0213 plane 1 row 33 cell 46 or plane 1 row 14 cell 41 is computationally indeterminate.
For the most part, row m cell n in JIS X 0208 corresponds to plane 1 row m cell n in JIS X 0213; therefore, not much confusion arises in practice. This is because most typefaces have come to use the glyphs exemplified in JIS X 0208, and most users are not consciously aware of the unification criteria.
ISO/IEC 10646 and Unicode
The kanji set of JIS X 0208 is among the original source standards for the Han unification in ISO/IEC 10646 (UCS) and Unicode. Every kanji in JIS X 0208 corresponds to its own code point in UCS/Unicode’s Basic Multilingual Plane (BMP).
The non-kanji in JIS X 0208 also correspond to their own code points in the BMP. However, for some special characters, some systems implement correspondences that differ from those of UCS/Unicode (which are based on the character names given in JIS X 0208:1997).
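A differing correspondence of this kind can be observed with Python's standard codecs. The sketch below is only an illustration: `unicode_mappings` and `ROW_1_CELL_33` are hypothetical names, and it assumes that the built-in `shift_jis` codec follows the name-based mapping while `cp932` follows Microsoft's Code page 932.

```python
# Shift JIS byte pair for row 1 cell 33 of JIS X 0208.
ROW_1_CELL_33 = b"\x81\x60"


def unicode_mappings(raw: bytes, codecs=("shift_jis", "cp932")) -> dict:
    """Map each codec name to the Unicode code point it assigns to raw."""
    return {name: f"U+{ord(raw.decode(name)):04X}" for name in codecs}


mappings = unicode_mappings(ROW_1_CELL_33)
# mappings == {"shift_jis": "U+301C", "cp932": "U+FF5E"}
```

The same cell decodes to WAVE DASH (U+301C) under one codec and to FULLWIDTH TILDE (U+FF5E) under the other, so text that passes through both systems may not round-trip unchanged.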
See also
- JIS coded character sets
- JIS X 0201 “7-bit and 8-bit coded character sets for information interchange”
- JIS X 0202 “Information technology – Character code structure and extension techniques” (ISO/IEC 2022)
- JIS X 0208 “7-bit and 8-bit double byte coded KANJI sets for information interchange”
- JIS X 0211 “Control functions for coded character sets” (ISO/IEC 6429)
- JIS X 0212 “Code of the supplementary Japanese graphic character set for information interchange”
- JIS X 0213 “7-bit and 8-bit double byte coded extended KANJI sets for information interchange”
- JIS X 0221 “Universal Multiple-Octet Coded Character Set (UCS)” (ISO/IEC 10646)
- JIS X 0201
- Extended shinjitai
- Ghost kanji
- Help:Japanese
External links
- The International Register that the IPSJ/ITSCJ supervises.
- Japanese Character Set JIS C 6226-1978
- Japanese Character Set JIS C 6226-1983
- Update Registration 87 Japanese Graphic Character Set for Information Interchange
- Japanese Industrial Standards Committee database search (the latest standard may be read here).
- Japanese Standards Association database search (a copy of the latest standard may be purchased here).
- Unification-related provisions in the JIS X 0208 and 0213 standards
- Cyber Librarian - JIS kanji listing