MARC-8 - AbsoluteAstronomy.com

The MARC-8 charset is a MARC standard used in MARC-21 library records. The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form, and they are frequently used in library computer systems. The encoding now known as MARC-8 was introduced in 1968, when the MARC format first came into use. Over the years it has grown to include code points for a large repertoire of characters, including the Latin, Cyrillic, Arabic, Hebrew, and Greek scripts and over 15,000 characters used in writing Chinese, Japanese, and Korean. If a character cannot be represented in MARC-8, the MARC-21 record must be encoded in UTF-8 instead. UTF-8 supports many more characters than MARC-8, and MARC-8 is rarely used outside of library records.

Technical Details

MARC-8 uses a variant of the ISO-2022 encoding. It uses escape sequences to switch to character sets beyond the 7-bit ASCII range.

It generally uses the same logical BiDi ordering as Unicode.

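Combining characters and base characters are stored in a different order than in Unicode: in MARC-8 the combining characters precede the base character, whereas in Unicode they follow it. When a base character carries several combining characters, their MARC-8 order is not always the exact reverse of the order produced by Unicode normalization. The MARC-21 standard describes the MARC-8 to Unicode conversion issues in more detail. The following are some examples.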

Displayed Character	Unicode NFD	MARC-8
á	a ́	́ a
ậ	a ̣ ̂	̂ ̣ a
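
The reordering shown in the table can be sketched in Python. This is a minimal illustration, not a converter: it assumes the MARC-8 bytes have already been mapped one-for-one to Unicode code points and that only the ordering still follows MARC-8 (marks before the base). The function name `marks_after_base` is hypothetical. The final NFC step relies on Unicode canonical reordering, which settles the relative order of marks with different combining classes; marks of the same class keep their stored order, which may or may not be what a given record needs.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move each run of combining marks behind the base character that follows it.

    Assumes `text` holds Unicode code points in MARC-8 order
    (combining marks first, then the base character).
    """
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # marks with no following base are kept as they are
    # NFC applies canonical reordering and then composes where possible.
    return unicodedata.normalize("NFC", "".join(out))


# The two rows of the table above, written with escapes for clarity.
print(marks_after_base("\u0301a"))        # acute accent stored before "a"
print(marks_after_base("\u0302\u0323a"))  # circumflex, dot below, then "a"
```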

Code structure

The ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. In MARC-8, character codes from the 7-bit ASCII graphic range (0x21–0x7E) are referred to as "G0" codes, while codes from the "high ASCII" range (0xA1–0xFE) are referred to as "G1" codes. Graphic character sets are designated and invoked by means of a multiple-byte escape sequence consisting of the escape character, an intermediate character sequence, and a final character, in the form ESC I F.

The following table shows the intermediate byte after the ESC byte (hexadecimal 1B), and the corresponding ASCII characters.

Intermediate Bytes
	G0 set				G1 set
	SBCS		MBCS		SBCS		MBCS
Normal ISO-2022	28	(	24	$	29	)	24 29	$)
Alternate ISO-2022 (additional 63+16 sets)	2C	,	24 2C	$,	2D	-	24 2D	$-

The following table shows the final bytes in hexadecimal and the corresponding ASCII characters after the intermediate bytes.

Final Bytes
Bytes	Characters	Name	Type	Comment
31	1	Chinese, Japanese, Korean (EACC)	MBCS
32	2	Basic Hebrew	SBCS
33	3	Basic Arabic	SBCS
34	4	Extended Arabic	SBCS
42	B	Basic Latin (ASCII)	SBCS
21 45	!E	Extended Latin (ANSEL)	SBCS	The 21 (hex) is technically a second byte of the intermediate segment of this escape sequence.
4E	N	Basic Cyrillic	SBCS
51	Q	Extended Cyrillic	SBCS
53	S	Basic Greek	SBCS
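
The two tables above can be combined into a small parser for the ESC I F form. The sketch below is an illustration only: the function name `parse_designation` and the shape of its return value are assumptions, and the dictionaries contain just the bytes listed in the tables. The "!E" entry is treated as a two-byte final here for simplicity, although the 21 (hex) is formally part of the intermediate segment.

```python
ESC = 0x1B

# Intermediate bytes -> (target set, is_multibyte), from the first table.
INTERMEDIATES = {
    b"(": ("G0", False), b",": ("G0", False),
    b"$": ("G0", True), b"$,": ("G0", True),
    b")": ("G1", False), b"-": ("G1", False),
    b"$)": ("G1", True), b"$-": ("G1", True),
}

# Final bytes -> character set name, from the second table.
FINALS = {
    b"1": "Chinese, Japanese, Korean (EACC)",
    b"2": "Basic Hebrew",
    b"3": "Basic Arabic",
    b"4": "Extended Arabic",
    b"B": "Basic Latin (ASCII)",
    b"!E": "Extended Latin (ANSEL)",
    b"N": "Basic Cyrillic",
    b"Q": "Extended Cyrillic",
    b"S": "Basic Greek",
}


def parse_designation(data: bytes, pos: int = 0):
    """Parse ESC I F at data[pos]; return (target, multibyte, name, next_pos)."""
    if pos >= len(data) or data[pos] != ESC:
        raise ValueError("no escape byte at this position")
    rest = data[pos + 1:]
    # Try longer candidates first so "$)" is not read as "$" followed by ")".
    for inter in sorted(INTERMEDIATES, key=len, reverse=True):
        if not rest.startswith(inter):
            continue
        after = rest[len(inter):]
        for final in sorted(FINALS, key=len, reverse=True):
            if after.startswith(final):
                target, multibyte = INTERMEDIATES[inter]
                next_pos = pos + 1 + len(inter) + len(final)
                return target, multibyte, FINALS[final], next_pos
    raise ValueError("unrecognized escape sequence")


print(parse_designation(b"\x1b(B"))   # single-byte Basic Latin into G0
print(parse_designation(b"\x1b)!E"))  # single-byte ANSEL into G1
```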

EACC is the only multibyte character set in MARC-8; it encodes each CJK character in three bytes, each within the 7-bit ASCII range.

For example, to encode the CJK character U+4EBA (人), the following bytes are needed:

\x1B\x24\x31\x21\x30\x64

The \x1B\x24\x31 switches to EACC/CJK, and the \x21\x30\x64 corresponds to U+4EBA.
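
As a rough sketch of how such a run could be split into characters, the Python below strips the ESC $ 1 designation and cuts the remaining bytes into three-byte units. The function name `eacc_triples` is hypothetical, and the lookup table holds only the single mapping from the example above (bytes 21 30 64); a real converter would need the full EACC mapping published with the MARC 21 specifications.

```python
ESC_EACC_G0 = b"\x1b$1"  # ESC, intermediate "$", final "1"

# Illustrative table with the one mapping given in the example above.
EACC_TO_UNICODE = {b"\x21\x30\x64": "\u4eba"}


def eacc_triples(data: bytes) -> list:
    """Split an EACC run (starting with ESC $ 1) into three-byte character codes."""
    if not data.startswith(ESC_EACC_G0):
        raise ValueError("data does not start with the EACC designation")
    payload = data[len(ESC_EACC_G0):]
    end = payload.find(b"\x1b")  # the run ends at the next escape sequence, if any
    if end != -1:
        payload = payload[:end]
    if len(payload) % 3 != 0:
        raise ValueError("EACC run length is not a multiple of three bytes")
    return [payload[i:i + 3] for i in range(0, len(payload), 3)]


codes = eacc_triples(b"\x1b\x24\x31\x21\x30\x64")
print("".join(EACC_TO_UNICODE[c] for c in codes))
```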

Custom set extension

In addition to the ISO-2022 character sets, the following custom sets are also available. The final byte directly follows the escape byte (hexadecimal 1B); there is no intermediate byte.

Final Bytes
Bytes	Characters	Name	Type	Comment
62	b	Subscript set	SBCS
67	g	Greek Symbol set	SBCS	The alpha, beta, gamma characters normally do not round trip map to Unicode.
70	p	Superscript set	SBCS
73	s	Basic Latin (ASCII)	SBCS
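
Because these designations are only two bytes long (ESC followed by the final byte), a decoder can track them with a very small state machine. The sketch below is an assumption-laden illustration: the function name `split_custom_runs` and the short set labels are made up for the example, it starts in the ASCII set, and it recognizes only the four final bytes from the table above, leaving every other byte in the payload untouched.

```python
ESC = 0x1B

# Final byte -> short label (labels are illustrative, not from the standard).
CUSTOM_FINALS = {
    0x62: "subscript",
    0x67: "greek-symbol",
    0x70: "superscript",
    0x73: "ascii",
}


def split_custom_runs(data: bytes) -> list:
    """Split data into (set_label, payload) runs at the custom ESC F designations."""
    current = "ascii"  # assumed starting set for this sketch
    run = bytearray()
    runs = []
    i = 0
    while i < len(data):
        is_custom = (
            data[i] == ESC and i + 1 < len(data) and data[i + 1] in CUSTOM_FINALS
        )
        if is_custom:
            if run:
                runs.append((current, bytes(run)))
                run = bytearray()
            current = CUSTOM_FINALS[data[i + 1]]
            i += 2  # skip ESC and the final byte
        else:
            run.append(data[i])
            i += 1
    if run:
        runs.append((current, bytes(run)))
    return runs


# "H", then a subscript "2", then back to ASCII for "O".
print(split_custom_runs(b"H\x1bb2\x1bsO"))
```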

External links

MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media - The official MARC-8 standard as maintained by the US Library of Congress

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.