Shift-JIS
Shift JIS is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1. It is based on character sets defined within the JIS standards JIS X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the double-byte characters). The lead bytes for the double-byte characters are "shifted" around the 63 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the halfwidth katakana characters found in JIS X 0201. Shift JIS can be, and is, used for HTML, since the characters that delimit HTML tags and attributes (<, >, /, ") all lie below 0x40 and so only ever appear as themselves, never as part of a two-byte sequence.
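The byte ranges described above can be checked mechanically. The following Python sketch is illustrative only: the function names are hypothetical, and the lead-byte ranges (0x81–0x9F, 0xE0–0xEF) and trail-byte ranges (0x40–0x7E, 0x80–0xFC) are assumed from the commonly documented layout of unextended Shift JIS.

```python
def classify_first_byte(b):
    """Return the role a byte plays when it starts a Shift JIS character."""
    if b <= 0x7F:
        return "single-byte Roman (ASCII-like)"
    if 0xA1 <= b <= 0xDF:
        return "single-byte halfwidth katakana"
    # Assumed lead-byte ranges for unextended JIS X 0208 characters.
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead byte of a JIS X 0208 character"
    return "unused or extension"


def can_be_trail_byte(b):
    """True if the byte may appear as the second byte of a two-byte character."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


# Count the byte values reserved for halfwidth katakana (0xA1 to 0xDF inclusive).
katakana_count = sum(
    classify_first_byte(b) == "single-byte halfwidth katakana" for b in range(256)
)

# Check that the HTML delimiters < > / " can never be mistaken for a trail byte.
html_safe = not any(can_be_trail_byte(b) for b in b'<>/"')

print(katakana_count, html_safe)
```

Because the trail-byte range starts at 0x40, any byte below that value in a Shift JIS stream is always a complete single-byte character.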
Shift JIS requires an 8-bit clean medium for transmission. It is fully backwards compatible with the legacy JIS X 0201 single-byte encoding, meaning that it supports halfwidth katakana and that any valid JIS X 0201 string is also a valid Shift JIS string. For two-byte characters, however, Shift JIS only guarantees that the first byte has its high bit set (0x80–0xFF); the second byte can be either high or low. The appearance of byte values 0x40–0x7E as second bytes of code words makes reliable Shift JIS detection difficult, because the same values are also used for ASCII characters. By contrast, the competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows a much cleaner and more direct conversion to and from JIS X 0208 code points, as all bytes with the high bit set are parts of a double-byte character and all bytes in the ASCII range represent single-byte characters. The same separation holds for UTF-8, which is a world standard, is better supported by software, and is predicted to fully replace Shift JIS and EUC-JP.
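The overlap between second bytes and ASCII values can be illustrated with a short Python sketch. It assumes CPython's bundled shift_jis and euc_jp codecs; the helper split_sjis is hypothetical, and its lead-byte test (0x81–0x9F and 0xE0–0xFC) is an assumption that also admits the extension area.

```python
def split_sjis(data):
    """Split a Shift JIS byte string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        # Assumed lead-byte ranges; a lead byte consumes the following byte too.
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC
        if is_lead and i + 1 < len(data):
            chars.append(data[i:i + 2])
            i += 2
        else:
            chars.append(data[i:i + 1])
            i += 1
    return chars


so = "\u30bd"  # katakana SO, encoded in Shift JIS as 0x83 0x5C
sjis = so.encode("shift_jis")
euc = so.encode("euc_jp")

# A naive byte search finds 0x5C inside the two-byte character.
print(b"\x5c" in sjis)
# Splitting on character boundaries shows no stand-alone 0x5C exists.
print(b"\x5c" in split_sjis(sjis))
# In EUC-JP both bytes of the same character have the high bit set.
print(all(b >= 0x80 for b in euc))
```

Software that scans Shift JIS text byte by byte, rather than character by character, is exposed to this kind of false match.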
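For a double-byte JIS sequence j1 j2 (each byte in the range 0x21 to 0x7E), the transformation to the corresponding Shift JIS bytes s1 s2 is:
s1 = floor((j1 + 1) / 2) + 112 if 33 <= j1 <= 94
s1 = floor((j1 + 1) / 2) + 176 if 95 <= j1 <= 126
s2 = j2 + 31 + floor(j2 / 96) if j1 is odd
s2 = j2 + 126 if j1 is even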
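A minimal Python sketch of this transformation follows. It assumes the commonly cited arithmetic for mapping a JIS X 0208 byte pair to Shift JIS; the names jis_to_sjis, j1, j2, s1 and s2 are illustrative. The cross-check relies on CPython's euc_jp codec, in which each byte of a JIS X 0208 character is the JIS byte with the high bit set.

```python
def jis_to_sjis(j1, j2):
    """Map a JIS X 0208 byte pair (each 0x21..0x7E) to a Shift JIS byte pair."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("JIS bytes must be in the range 0x21..0x7E")
    # Two JIS rows share one lead byte; rows from 0x5F onward jump past
    # the halfwidth katakana range.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)
    if j1 % 2 == 1:
        # Odd rows fill trail bytes 0x40..0x9E, skipping 0x7F.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even rows fill trail bytes 0x9F..0xFC.
        s2 = j2 + 126
    return s1, s2


def check_against_codecs(ch):
    """Compare the formula with CPython's codecs for one character."""
    euc = ch.encode("euc_jp")
    j1, j2 = euc[0] & 0x7F, euc[1] & 0x7F
    return bytes(jis_to_sjis(j1, j2)) == ch.encode("shift_jis")


print([hex(b) for b in jis_to_sjis(0x30, 0x21)])
print(all(check_against_codecs(ch) for ch in "\u4e9c\u3042\u30bd\u6f22"))
```

The split into odd and even rows is what lets 94 rows of 94 cells fit under 47 lead bytes, each with 188 possible trail bytes.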
There are two areas for expansion. Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, so there is room for more characters there; these are really extensions to JIS X 0208 rather than to Shift JIS itself. The most popular such extension is Windows-31J, otherwise known as Code page 932, popularized by Microsoft, although Microsoft itself does not recognize the Windows-31J name and instead calls that variation "shift_jis". Secondly, Shift JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208, and this space can be, and is, used for yet more characters. The space with lead bytes 0xF5 to 0xF9, for example, is used by Japanese mobile phone operators for pictographs for use in e-mail. (KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4.)
Beyond even this, there have been numerous minor variations on Shift JIS, with individual characters altered here and there. Most of these extensions and variants have no IANA registration, so there is much scope for confusion if the extensions are used. Microsoft Code Page 932 is registered separately from Shift JIS. IBM CCSID 943 has the same extensions as Code Page 932. As with most code pages and encodings, Microsoft, Apple, the Unicode Consortium and most major operating system makers recommend that Unicode be used instead.
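The practical effect of these extensions can be seen by comparing two codecs. The sketch below assumes CPython's bundled codecs, where the name shift_jis covers the unextended character set and cp932 covers Microsoft's variant; other libraries may assign these names differently. The helper encodable is hypothetical.

```python
def encodable(text, encoding):
    """Return True if the text can be encoded with the named codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


circled_one = "\u2460"  # CIRCLED DIGIT ONE, an extension character in Code page 932

# The extended code page accepts the character...
print(encodable(circled_one, "cp932"))
# ...while the unextended codec rejects it.
print(encodable(circled_one, "shift_jis"))
# The extension character occupies a cell that JIS X 0208 leaves empty.
print(circled_one.encode("cp932").hex())
```

Text that round-trips through one variant may therefore fail or change when handled as another, which is the confusion the lack of registration invites.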
Multiple versions
Many different versions of Shift JIS exist, and they conflict at some code points. This is one reason why applications are recommended to use a Unicode encoding such as UTF-8 or UTF-16 instead.
Shift JIS byte map
The chart below gives the detailed meaning of each byte in a Shift JIS encoded stream.
External links
- Shift-JIS A table of the non-ASCII part of the codeset.
- Microsoft's definition of Code Page 932
- Forms of Shift-JIS in ICU (International Components for Unicode)