UTF-EBCDIC
UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.
To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte, so that they can later be mapped to the corresponding EBCDIC control codes. To achieve this, the later bytes of a multi-byte sequence use the format 101XXXXX instead of 10XXXXXX. Because such a byte holds only 5 bits rather than 6, UTF-EBCDIC generally produces larger output than UTF-8 for the same input data.
This transformation leaves the data in an ASCII-based format, so a reversible byte-to-byte transform is then applied using a lookup table, to bring the result as close to normal EBCDIC code pages as feasible. Both steps can easily be reversed to recover the Unicode code points.
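The reversible lookup-table step can be illustrated with a 256-entry permutation and its inverse. The sketch below does not use the real UTF-EBCDIC table from the technical report; as a loudly labelled stand-in it borrows Python's `cp037` codec, which is a different EBCDIC code page, purely to show how a byte-to-byte table and its inverse round-trip. The names `FORWARD`, `REVERSE`, `to_ebcdic_like` and `from_ebcdic_like` are hypothetical.

```python
# STAND-IN table (NOT the UTF-EBCDIC table): entry i is the code page 037
# byte for the Latin-1 character with value i.
FORWARD = bytes(range(256)).decode("latin-1").encode("cp037")

# Inverse table: entry j is the input byte that FORWARD maps to j.
# bytes.index raises ValueError if FORWARD were not a permutation.
REVERSE = bytes(FORWARD.index(b) for b in range(256))


def to_ebcdic_like(data: bytes) -> bytes:
    """Apply the forward byte-to-byte mapping."""
    return data.translate(FORWARD)


def from_ebcdic_like(data: bytes) -> bytes:
    """Undo the mapping with the inverse table."""
    return data.translate(REVERSE)


sample = b"Hello, mainframe"
assert from_ebcdic_like(to_ebcdic_like(sample)) == sample
```

Because the table is a permutation of all 256 byte values, no information is lost, which is what makes the overall encoding reversible.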
This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM's EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.
Codepage layout
There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). The single-byte portion resembles IBM-1047 (EBCDIC 1047) rather than IBM-37 because of the location of the square brackets: CCSID 37 has [ and ] at hex BA and BB, rather than at hex AD and BD respectively.
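The bracket positions in CCSID 37 can be checked with Python's built-in `cp037` codec. This is only a quick sanity check of the code page 37 side of the comparison; it says nothing about IBM-1047 or UTF-EBCDIC themselves.

```python
# Encode the two square brackets with Python's code page 037 codec.
encoded = "[]".encode("cp037")

# Show the byte values in hex: "[" first, then "]".
print(encoded.hex())  # → babb
```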
[Table: UTF-EBCDIC code page layout]
White cells containing a large single-digit number are the start bytes of a sequence of that many bytes. The unbolded hexadecimal code point number shown in the cell is the lowest character value encoded using that start byte. This value can be greater than the value obtained by following the start byte with continuation bytes that are all 0x41 (decimal 65), because that combination may be an invalid overlong form. Because a continuation byte carries a reduced payload (5 bits, compared to 6 bits in UTF-8), sequences of a given length cover different code point ranges than in UTF-8.
Orange cells with one dot are continuation bytes. The hexadecimal number shown after a plus sign ("+") is the value of the 5 bits they add.
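The effect of the smaller continuation payload on the code point ranges can be worked out with a few lines of Python. This is a rough sketch under the assumption that a start byte for an n-byte sequence carries 7 - n payload bits, as in UTF-8; the function name `max_code_point` and its parameters are hypothetical.

```python
def max_code_point(length: int, continuation_bits: int, single_byte_max: int) -> int:
    """Highest value a sequence of `length` bytes can carry.

    continuation_bits: 5 for UTF-EBCDIC, 6 for UTF-8.
    single_byte_max:   0x9F for UTF-EBCDIC, 0x7F for UTF-8.
    """
    if length == 1:
        return single_byte_max
    lead_bits = 7 - length  # assumed: 110yyyyy has 5 bits, 1110zzzz has 4, ...
    total_bits = lead_bits + continuation_bits * (length - 1)
    return (1 << total_bits) - 1


# Compare the upper limit of each sequence length in the two encodings.
for n in range(1, 5):
    print(n, hex(max_code_point(n, 5, 0x9F)), hex(max_code_point(n, 6, 0x7F)))
```

Under these assumptions a two-byte sequence reaches only U+03FF instead of U+07FF, and a four-byte sequence reaches U+3FFFF, so a fifth byte is needed to cover code points up to U+10FFFF.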
External links
- http://www.unicode.org/reports/tr16/ Unicode Technical Report #16: the definition of UTF-EBCDIC