Ascii85
Ascii85 (also called "Base85") is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data (making the encoded size ¹⁄₄ larger than the original, assuming eight bits per ASCII character), it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data (¹⁄₃ increase, assuming eight bits per ASCII character).

Its main modern use is in Adobe's PostScript and Portable Document Format file formats.

Basic idea

The need for a binary-to-text encoding arises when arbitrary binary data must be sent over preexisting communications protocols that were designed to carry only human-readable text. Those protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), may require line breaks at certain maximum intervals, and may not maintain whitespace. Thus, only the 94 printable ASCII characters are "safe" to use to convey data.

Four bytes can represent 2³² = 4,294,967,296 possible values. Five radix-85 digits provide 85⁵ = 4,437,053,125 possible values, enough for a unique representation of each possible 32-bit value. Five radix-84 digits, by contrast, provide only 84⁵ = 4,182,119,424 values, which is fewer than 2³². Hence 85 is the smallest integral base that can represent four bytes in five characters, which is why it was chosen.
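The arithmetic behind the choice of base can be checked directly. The following is a minimal sketch, not part of any Ascii85 implementation; the names `GROUP_VALUES` and `base` are chosen here for illustration only.

```python
# Number of distinct values a 4-byte group can take.
GROUP_VALUES = 2 ** 32

# Search upward for the smallest base whose five-digit numbers
# cover every possible 4-byte value.
base = 2
while base ** 5 < GROUP_VALUES:
    base += 1

print(base)       # 85
print(84 ** 5)    # 4182119424, fewer than 2**32
print(85 ** 5)    # 4437053125, at least 2**32
```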

When encoding, each group of 4 bytes is taken as a 32-bit binary number, most significant byte first (Ascii85 uses a big-endian convention). This is converted, by repeatedly dividing by 85 and taking the remainder, into 5 radix-85 digits. Then each digit (again, most significant first) is encoded as an ASCII printable character by adding 33 to it, giving the ASCII characters 33 ("!") to 117 ("u").
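The encoding of a single full group can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `encode_group` is an assumption made for this example, and the "z" shortcut is deliberately left out.

```python
def encode_group(group: bytes) -> str:
    """Encode exactly four bytes as five Ascii85 characters (no 'z' shortcut)."""
    if len(group) != 4:
        raise ValueError("expected exactly 4 bytes")
    # Interpret the bytes as one 32-bit number, most significant byte first.
    value = int.from_bytes(group, "big")
    digits = []
    for _ in range(5):
        value, remainder = divmod(value, 85)
        digits.append(remainder)  # collected least significant digit first
    # Emit the most significant digit first, offsetting each digit by 33 ('!').
    return "".join(chr(d + 33) for d in reversed(digits))


print(encode_group(b"Man "))  # 9jqo^
```

For instance, the four bytes "Man " form the number 1,298,230,816, whose radix-85 digits are 24, 73, 80, 78 and 61; adding 33 to each gives the characters "9jqo^".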

Because all-zero data is quite common, an exception is made for the sake of data compression, and an all-zero group is encoded as a single character "z" instead of "!!!!!".

Groups of characters that decode to a value greater than 2³² − 1 (encoded as "s8W-!") will cause a decoding error, as will "z" characters in the middle of a group. White space between the characters is ignored and may occur anywhere to accommodate line-length limitations.
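A decoder that applies these rules might look like the following sketch. It handles only complete five-character groups and the "z" shortcut; the function name `decode_groups` and the choice of `ValueError` for every error case are assumptions made for this example.

```python
def decode_groups(text: str) -> bytes:
    """Decode Ascii85 text consisting of complete groups and 'z' shortcuts."""
    out = bytearray()
    group = []  # radix-85 digits of the group being collected
    for ch in text:
        if ch.isspace():
            continue  # whitespace may appear anywhere and is ignored
        if ch == "z":
            if group:
                raise ValueError("'z' in the middle of a group")
            out += b"\x00\x00\x00\x00"
            continue
        if not "!" <= ch <= "u":
            raise ValueError(f"invalid character {ch!r}")
        group.append(ord(ch) - 33)
        if len(group) == 5:
            value = 0
            for digit in group:
                value = value * 85 + digit
            if value > 0xFFFFFFFF:
                raise ValueError("group decodes to more than 2**32 - 1")
            out += value.to_bytes(4, "big")
            group = []
    if group:
        raise ValueError("incomplete final group")
    return bytes(out)


print(decode_groups("9jq o^\nz"))  # b'Man \x00\x00\x00\x00'
```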

One disadvantage of Ascii85 is that encoded data may contain characters such as backslash and quote, which have special meaning in many programming languages and in some text-based protocols.

btoa version

The original btoa program always encoded full groups (padding the source as necessary), with a prefix line of "xbtoa Begin", and suffix line of "xbtoa End", followed by the original file length (in decimal and hexadecimal) and three 32-bit checksums. The decoder needs to use the file length to see how much of the group was padding.
This program also introduced the special "z" short form for an all-zero group. Version 4.2 added a "y" exception for a group of all ASCII space characters (0x20202020).

Adobe version

Adobe adopted the basic btoa encoding, but with slight changes, and gave it the name Ascii85. In particular, Adobe uses the delimiter "~>" to mark the end of an Ascii85-encoded string, and represents the length by truncating the final group: If the last block of source bytes contains fewer than 4 bytes, the block is padded with up to three null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the output.

The reverse is applied when decoding: the last block is padded to 5 characters with the Ascii85 character "u", and as many bytes as characters were added as padding are omitted from the end of the output (see example).
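The handling of a short final block can be sketched as a pair of helper functions. The names `encode_tail` and `decode_tail` are hypothetical and chosen for this illustration; the sketch covers only the final partial block, not the "~>" delimiter or full groups.

```python
def encode_tail(tail: bytes) -> str:
    """Encode a final block of 1 to 3 bytes using zero padding and truncation."""
    if not 1 <= len(tail) <= 3:
        raise ValueError("tail must be 1 to 3 bytes long")
    pad = 4 - len(tail)
    value = int.from_bytes(tail + b"\x00" * pad, "big")
    chars = []
    for _ in range(5):
        value, digit = divmod(value, 85)
        chars.append(chr(digit + 33))
    chars.reverse()
    # Drop as many characters as padding bytes were added.
    return "".join(chars)[: 5 - pad]


def decode_tail(chars: str) -> bytes:
    """Decode a final block of 2 to 4 characters using 'u' padding."""
    if not 2 <= len(chars) <= 4:
        raise ValueError("tail must be 2 to 4 characters long")
    pad = 5 - len(chars)
    value = 0
    for ch in chars + "u" * pad:
        value = value * 85 + (ord(ch) - 33)
    # Drop as many bytes as padding characters were added.
    return value.to_bytes(4, "big")[: 4 - pad]


print(encode_tail(b"."))   # /c
print(decode_tail("/c"))   # b'.'
```

A block of n bytes therefore always becomes n + 1 characters, which is how the decoder recovers the original length without a separate length field.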

NOTE: The padding is not arbitrary. Converting from binary to base 64 only regroups bits and does not change them or their order (a high bit in binary does not affect the low bits in the base64 representation). In converting a binary number to base85 (85 is not a power of two) high bits do affect the low order base85 digits and conversely. Padding the binary low (with zero bits) while encoding and padding the base85 value high (with 'u's) in decoding assures that the high order bits are preserved (the zero padding in the binary gives enough room so that a small addition is trapped and there is no "carry" to the high bits).
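The effect described in the note can be observed by decoding the same truncated group with two different padding characters. The helper `decode_with_pad` is a hypothetical name used only for this demonstration.

```python
def decode_with_pad(chars: str, pad_char: str) -> bytes:
    """Decode a truncated group after padding it to five characters."""
    pad = 5 - len(chars)
    value = 0
    for ch in chars + pad_char * pad:
        value = value * 85 + (ord(ch) - 33)
    return value.to_bytes(4, "big")[: 4 - pad]


# "/c" is the truncated encoding of a single period.
print(decode_with_pad("/c", "u"))  # b'.'
print(decode_with_pad("/c", "!"))  # b'-'
```

Padding with "!" (digit 0) rounds the 32-bit value down, so the surviving high byte comes out one too small; padding with "u" (digit 84) rounds it up just far enough to restore the original byte.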


In Ascii85-encoded blocks, whitespace and line-break characters may be present anywhere, including in the middle of a 5-character block, but they must be silently ignored.

Adobe's specification does not support the "y" exception.

Example



A quote from Thomas Hobbes's Leviathan:
Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.


When this text is first encoded as US-ASCII and then re-encoded in Ascii85, the result is as follows:


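<~9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>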

Text content  M         a         n         (space)   ...  s         u         r         e
ASCII         77        97        110       32        ...  115       117       114       101
Bit pattern   01001101  01100001  01101110  00100000  ...  01110011  01110101  01110010  01100101


Since the last 4-tuple is incomplete, it must be padded with three zero bytes:
Text content  .         \0        \0        \0
ASCII         46        0         0         0
Bit pattern   00101110  00000000  00000000  00000000


Since three bytes of padding had to be added, the three final characters 'YkO' are omitted from the output.

Decoding is done inversely, except that the last 5-tuple is padded with 'u' characters:
ASCII          /         c         u         u         u
Base 85 (+33)  14 (47)   66 (99)   84 (117)  84 (117)  84 (117)
32-bit value   771,955,124 = 14×85⁴ + 66×85³ + 84×85² + 84×85 + 84
Bit pattern    00101110  00000011  00011001  10110100


Since the input had to be padded with three 'u' characters, the last three bytes of the output are ignored and we end up with the original period.

Note that the input sentence does not contain 4 consecutive zero bytes, so the example does not show the use of the 'z' abbreviation.
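The example can be reproduced with Python's standard `base64` module, whose `a85encode` and `a85decode` functions accept an `adobe=True` flag for the "<~" and "~>" framing. This is offered as a convenient cross-check rather than as a reference implementation; the variable names are chosen for this sketch.

```python
import base64

text = (
    b"Man is distinguished, not only by his reason, but by this singular "
    b"passion from other animals, which is a lust of the mind, that by a "
    b"perseverance of delight in the continued and indefatigable generation "
    b"of knowledge, exceeds the short vehemence of any carnal pleasure."
)

# adobe=True wraps the output in the <~ ... ~> delimiters.
encoded = base64.a85encode(text, adobe=True)
print(encoded.decode("ascii"))

# Decoding the framed text returns the original bytes.
decoded = base64.a85decode(encoded, adobe=True)
print(decoded == text)  # True

# The lone trailing period becomes the two characters "/c".
print(base64.a85encode(b"."))  # b'/c'
```

The 269 input bytes form 67 full groups plus one trailing byte, so the encoded body is 67 × 5 + 2 = 337 characters long before the delimiters are added.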

Compatibility

The Ascii85 encoding is compatible with 7-bit and 8-bit MIME, while having less overhead than Base64.

One potential compatibility issue of Ascii85 is that 'single' and "double" quotation marks, brackets and ampersands (&) cannot be used unescaped in markup languages like XML or SGML.

RFC 1924 version

Published on April 1, 1996, and thus presumably not meant to be taken completely seriously, informational RFC 1924: "A Compact Representation of IPv6 Addresses" by Robert Elz suggests a base-85 encoding of IPv6 addresses. This differs from the scheme used above in that he proposes a different set of 85 ASCII characters, and proposes to do all arithmetic on the 128-bit number, converting it to a single 20-digit base-85 number (internal whitespace not allowed), rather than breaking it into four 32-bit groups.

The proposed character set is, in order, 0–9, A–Z, a–z, and then the 23 characters !#$%&()*+-;<=>?@^_`{|}~. The highest possible representable address, 2¹²⁸−1 = 74×85¹⁹ + 53×85¹⁸ + 5×85¹⁷ + ..., would be encoded as =r54lj&NUUO~Hi%c2ym0.

While the RFC chose a different character set in order to prevent the use of certain problematic characters ("',./:[]\), it still requires escaping for SGML-based protocols, notably for XML.
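The RFC 1924 scheme can be sketched with Python's arbitrary-precision integers. The function names `encode_ipv6_int` and `decode_ipv6_int` are hypothetical, and the address is taken as a plain integer rather than parsed from IPv6 text notation.

```python
# The 85 characters proposed by RFC 1924, in digit order 0..84.
RFC1924_ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~"
)


def encode_ipv6_int(address: int) -> str:
    """Encode a 128-bit integer as a single 20-digit base-85 number."""
    if not 0 <= address < 2 ** 128:
        raise ValueError("address must fit in 128 bits")
    digits = []
    for _ in range(20):
        address, digit = divmod(address, 85)
        digits.append(RFC1924_ALPHABET[digit])
    return "".join(reversed(digits))  # most significant digit first


def decode_ipv6_int(text: str) -> int:
    """Decode a 20-digit base-85 string back into a 128-bit integer."""
    if len(text) != 20:
        raise ValueError("expected exactly 20 characters")
    value = 0
    for ch in text:
        value = value * 85 + RFC1924_ALPHABET.index(ch)
    if value >= 2 ** 128:
        raise ValueError("value does not fit in 128 bits")
    return value


highest = 2 ** 128 - 1
print(encode_ipv6_int(highest))
print(decode_ipv6_int(encode_ipv6_int(highest)) == highest)  # True
```

The leading digits 74, 53 and 5 of the highest address map to "=", "r" and "5" in this alphabet, matching the start of the encoded value quoted above.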

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 