8-bit clean
8-bit clean describes a computer system that correctly handles 8-bit character sets, such as the ISO 8859 series and the UTF-8 encoding of Unicode.
History
Up to the early 1990s, many programs and data transmission channels assumed that all characters would be represented as numbers between 0 and 127 (7 bits). On computers and data links using 8-bit bytes this left the top bit of each byte free for use as a parity bit, flag bit, or metadata control bit. 7-bit systems and data links are unable to handle more complex character codes, which are commonplace in non-English-speaking countries with larger alphabets.
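
As a rough illustration of why the spare top bit matters, the following Python sketch simulates a channel that clears the high-order bit of every byte, as a 7-bit link effectively does. The function names `strip_high_bit` and `is_seven_bit` and the sample strings are hypothetical, chosen only for this example.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing bit 7 of every byte."""
    return bytes(b & 0x7F for b in data)


def is_seven_bit(data: bytes) -> bool:
    """Return True if no byte has its high-order bit set."""
    return all(b < 0x80 for b in data)


ascii_text = "plain text".encode("ascii")
latin1_text = "café".encode("iso-8859-1")  # 'é' is the single byte 0xE9

# Pure ASCII survives the simulated 7-bit channel unchanged.
assert strip_high_bit(ascii_text) == ascii_text

# 0xE9 & 0x7F == 0x69, which is ASCII 'i', so the text is silently corrupted.
garbled = strip_high_bit(latin1_text)
print(garbled.decode("ascii"))  # cafi
```

The corruption is silent: the receiver gets valid ASCII, just not the text that was sent.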
Binary files cannot be transmitted through 7-bit data channels directly. To work around this, binary-to-text encodings have been devised which use only 7-bit ASCII characters. Some of these encodings are uuencoding, Ascii85, SREC, BinHex, Kermit and MIME's Base64. EBCDIC-based systems cannot handle all characters used in uuencoded data. However, the Base64 encoding does not have this problem.
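
The property these encodings share can be checked with a short Python sketch that uses the standard `base64` module. The `payload` value is an arbitrary test input, and Python's `a85encode` is only one implementation of Ascii85, so treat this as an illustration rather than a reference for every variant.

```python
import base64

payload = bytes(range(256))  # every possible byte value, including those >= 0x80

encoded = base64.b64encode(payload)  # Base64 as used by MIME
ascii85 = base64.a85encode(payload)  # Ascii85, as implemented by Python

# Both encodings emit only bytes below 0x80, so they fit a 7-bit channel.
assert all(b < 0x80 for b in encoded)
assert all(b < 0x80 for b in ascii85)

# Decoding recovers the original binary data exactly.
assert base64.b64decode(encoded) == payload
assert base64.a85decode(ascii85) == payload

# Base64 emits 4 characters per 3 input bytes; 256 bytes pad out to 344.
print(len(encoded))  # 344
```

The cost of passing through a 7-bit channel is size: the Base64 form is roughly a third larger than the binary original.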
Perhaps the final 7-bit restriction, primarily imposed due to the pervasive use of the RS-232 protocol for serial ports between devices, notably computers and modems, was lifted in the mid-1990s when RS-232 was largely replaced with Ethernet and USB.
SMTP and NNTP 8-bit cleanness
Historically, various media were used to transfer messages, and some of them supported only 7-bit data, so in the 20th century an 8-bit message stood a good chance of being garbled during transmission. Some Internet implementations nevertheless ignored the formal discouragement of 8-bit data and allowed bytes with the high bit set to pass through.
Many early communications protocol standards, such as RFC 780, RFC 788, RFC 821 for SMTP, RFC 977 for NNTP, RFC 1056, RFC 2821, RFC 5321, were designed to work over such "7-bit" communication links. They specifically mention the use of the ASCII character set "transmitted as a 8-bit byte with the high-order bit cleared to zero", and some of these explicitly restrict all data to 7-bit characters.
For the first few decades of email networks (1971 to the early 1990s), most email messages were plain text in the 7-bit US-ASCII character set. According to RFC 1428, the original RFC 821 definition of SMTP limits Internet Mail to lines (1000 characters or less) of 7-bit US-ASCII characters.
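
The two limits just described can be expressed as a small Python check. This is a sketch under stated assumptions: the function name `violations` is hypothetical, lines are assumed to end in CRLF, and the 1000-character limit is applied to the line content without its CRLF, which may not match how a given server counts.

```python
MAX_LINE = 1000  # limit quoted above; whether CRLF counts toward it is an assumption


def violations(message: bytes, max_line: int = MAX_LINE) -> list:
    """List the reasons a message would not fit a strict 7-bit mail path."""
    problems = []
    for number, line in enumerate(message.split(b"\r\n"), start=1):
        if len(line) > max_line:
            problems.append(f"line {number}: {len(line)} characters")
        if any(b >= 0x80 for b in line):
            problems.append(f"line {number}: byte with high bit set")
    return problems


print(violations(b"Hello\r\nna\xefve\r\n"))
# ['line 2: byte with high bit set']
```

A message for which `violations` returns an empty list uses only 7-bit bytes and stays within the assumed line length.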
Later the format of email messages was redefined in order to support messages that are not entirely US-ASCII text (text messages in character sets other than US-ASCII, and non-text messages, such as audio and images).
The Internet community generally adds features by "extension", allowing communication in both directions between upgraded machines and not-yet-upgraded machines, rather than declaring formerly standards-compliant legacy software to be "broken" and insisting that all software world-wide be upgraded to the latest standard.
In the mid-1990s, people objected to "just send 8 bits (to RFC 821 SMTP servers)", perhaps because of a perception that "just send 8 bits" is an implicit declaration that ISO 8859-1 become the new "standard encoding", forcing everyone in the world to use the same character set. Instead, the recommended way to take advantage of 8-bit-clean links between machines is to use the ESMTP (RFC 1869) 8BITMIME extension.
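
A sending program might apply this "extension" approach roughly as in the Python sketch below: send 8-bit data only when the server has advertised 8BITMIME in its EHLO reply, and otherwise fall back to a 7-bit-safe encoding. The function names, the sample reply from `mail.example.org`, and the choice of Base64 as the fallback are assumptions made for illustration; a real client would also handle quoted-printable and error replies.

```python
def ehlo_keywords(reply: str) -> set:
    """Extract extension keywords from the text of a multi-line EHLO reply."""
    keywords = set()
    for line in reply.splitlines()[1:]:  # the first line is the server greeting
        parts = line[4:].split()         # drop the "250-" or "250 " prefix
        if parts:
            keywords.add(parts[0].upper())
    return keywords


def choose_body_encoding(reply: str, body: bytes) -> str:
    """Pick a MIME content-transfer-encoding label for the body."""
    if all(b < 0x80 for b in body):
        return "7bit"                    # already safe on any link
    if "8BITMIME" in ehlo_keywords(reply):
        return "8bit"                    # server has declared itself 8-bit clean
    return "base64"                      # assumed fallback for legacy servers


# Hypothetical EHLO replies from an upgraded and a not-yet-upgraded server.
upgraded = "250-mail.example.org greets you\r\n250-8BITMIME\r\n250 SIZE 1000000"
legacy = "250-mail.example.org greets you\r\n250 SIZE 1000000"

body = "Grüße".encode("utf-8")
print(choose_body_encoding(upgraded, body))  # 8bit
print(choose_body_encoding(legacy, body))    # base64
```

Because the decision depends on what the server advertises, upgraded and not-yet-upgraded machines can keep exchanging mail, and no character set is imposed on anyone.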
See also
- MIME 8bit encoding
- 8-bit data in Telnet
- 32-bit clean