Data conversion
Data conversion is the conversion of computer data from one format to another. Throughout a computer environment, data is encoded in a variety of ways. For example, computer hardware is built on the basis of certain standards, which require that data contain, for example, parity bit checks. Similarly, the operating system is predicated on certain standards for data and file handling. Furthermore, each computer program handles data in a different manner. Whenever any one of these variables is changed, data must be converted in some way before it can be used by a different computer, operating system or program. Even different versions of these elements usually involve different data structures. For example, changing bits from one format to another, usually for the purpose of application interoperability or to make new features available, is a data conversion. Data conversions may be as simple as the conversion of a text file from one character encoding system to another, or more complex, such as the conversion of office file formats or of image and audio file formats.
There are many ways in which data is converted within the computer environment. The conversion may be seamless, as in the case of upgrading to a newer version of a computer program. Alternatively, it may require processing by a special conversion program, or it may involve a complex process of going through intermediary stages or complex "exporting" and "importing" procedures, which may include converting to and from a tab-delimited or comma-separated text file. In some cases, a program may recognise several data file formats at the data input stage and also be capable of storing the output data in a number of different formats. Such a program may be used to convert a file format. If the source format or target format is not recognised, a third program may sometimes be available which permits conversion to an intermediate format, which can then be reformatted using the first program. There are many possible scenarios.
Information basics
Before any data conversion is carried out, the user or application programmer should keep a few basics of computing and information theory in mind. These include:
- Information can easily be discarded by the computer, but adding information takes effort.
- The computer can add information only in a rule-based fashion.
- Upsampling the data or converting to a more feature-rich format does not add information; it merely makes room for that addition, which usually a human must do.
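The asymmetry between discarding and adding information can be seen in a minimal sketch. The function name `to_gray` and the integer Rec. 601 luma weights are assumptions chosen for illustration; nothing above prescribes a particular grayscale formula.

```python
def to_gray(r: int, g: int, b: int) -> int:
    """Collapse an 8-bit RGB pixel to a single 8-bit gray value.

    Assumption: integer Rec. 601 luma weights (0.299, 0.587, 0.114).
    """
    return (299 * r + 587 * g + 114 * b) // 1000


# Two clearly different colors end up as the same gray value.
red = (255, 0, 0)
green = (0, 130, 0)
print(to_gray(*red), to_gray(*green))
```

Because distinct colors collapse to the same gray value, no rule can tell which color a given gray pixel originally had; discarding took one line, while restoring would need information that is no longer in the file.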
For example, a true color image can easily be converted to grayscale, while the opposite conversion is a painstaking process. Converting a Unix text file to a Microsoft (DOS/Windows) text file involves adding characters, but this does not increase the entropy, since it is rule-based; whereas the addition of color information to a grayscale image cannot be done programmatically, since only a human knows which colors are needed for each section of the picture: there are no rules that can be used to automate that process. Converting a 24-bit PNG to a 48-bit one does not add information to it; it only pads existing RGB pixel values with zeroes, so that a pixel with a value of FF C3 56, for example, becomes FF00 C300 5600. The conversion makes it possible to change a pixel to have a value of, for instance, FF80 C340 56A0, but the conversion itself does not do that; only further manipulation of the image can. Converting an image or audio file in a lossy format (like JPEG or Vorbis) to a lossless (like PNG or FLAC) or uncompressed (like BMP or WAV) format only wastes space, since the same image with its loss of original information (the artifacts of lossy compression) becomes the target. A JPEG image can never be restored to the quality of the original lossless image from which it was made, no matter how much the user tries the "JPEG Artifact Removal" feature of his or her image manipulation program.
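The two rule-based conversions mentioned above, line-ending conversion and bit-depth padding, can be sketched as follows. The function names are hypothetical, and the padding follows the zero-fill description in the text rather than any particular image library.

```python
def unix_to_dos(text: bytes) -> bytes:
    """Turn LF line endings into CR LF without doubling existing CR LF pairs."""
    return text.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def widen_pixel(rgb8: tuple) -> tuple:
    """Pad each 8-bit channel with a zero low byte (FF C3 56 -> FF00 C300 5600)."""
    return tuple(channel << 8 for channel in rgb8)


def narrow_pixel(rgb16: tuple) -> tuple:
    """Drop the low byte again; undoes widen_pixel exactly."""
    return tuple(channel >> 8 for channel in rgb16)


wide = widen_pixel((0xFF, 0xC3, 0x56))
print([f"{channel:04X}" for channel in wide])
print(unix_to_dos(b"first line\nsecond line\n"))
```

Both steps are fully determined by a rule and can be undone exactly, which is one way to see that the larger output carries no more information than the input.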
Automatic restoration of information that was lost through a lossy compression process would probably require important advances in artificial intelligence.
Because of these realities of computing and information theory, data conversion is more often than not a complex and error-prone process that requires the help of experts.
Pivotal conversion
Data conversion can occur directly from one format to another, but many applications that convert between multiple formats use a pivotal encoding by way of which any source format is converted to its target. For example, it is possible to convert Cyrillic text from KOI8-R to Windows-1251 using a lookup table between the two encodings, but the modern approach is to convert the KOI8-R file to Unicode first and from that to Windows-1251. This is a more manageable approach: an application specializing in character encoding conversion would otherwise have to keep hundreds of lookup tables, one for each permutation of source and target encoding, whereas keeping lookup tables only between each character set and Unicode scales the number down to a few tens.
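A minimal sketch of the Unicode pivot, using Python's built-in `koi8_r` and `cp1251` codecs. The helper names and the figure of 20 encodings are assumptions for illustration; the table counts assume one table per direction.

```python
def koi8r_to_cp1251(data: bytes) -> bytes:
    """Convert KOI8-R bytes to Windows-1251 bytes by way of Unicode."""
    text = data.decode("koi8_r")   # source encoding -> Unicode (the pivot)
    return text.encode("cp1251")   # Unicode -> target encoding


def direct_tables(n: int) -> int:
    """Lookup tables needed when every encoding maps straight to every other."""
    return n * (n - 1)


def pivot_tables(n: int) -> int:
    """Lookup tables needed when every encoding maps only to and from Unicode."""
    return 2 * n


koi8_bytes = "Привет".encode("koi8_r")
print(koi8r_to_cp1251(koi8_bytes).decode("cp1251"))
print(direct_tables(20), pivot_tables(20))
```

Note that `encode` raises `UnicodeEncodeError` when a character has no counterpart in the target encoding, so a real converter would also need a policy for such characters.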
Pivotal conversion is similarly used in other areas. Office applications, when employed to convert between office file formats, use their internal, default file format as a pivot. For example, a word processor may convert an RTF file to a WordPerfect file by converting the RTF to OpenDocument and then that to WordPerfect format. An image conversion program does not convert a PCX image to PNG directly; instead, when loading the PCX image, it decodes it to a simple bitmap format for internal use in memory, and when commanded to convert to PNG, that memory image is converted to the target format. An audio converter that converts from FLAC to AAC decodes the source file to raw PCM data in memory first, and then performs the lossy AAC compression on that memory image to produce the target file.
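The same decode-to-memory, encode-from-memory structure can be sketched with simple tabular text formats standing in for image or audio codecs. Everything here is an illustrative assumption: the `DECODERS` and `ENCODERS` registries, the `convert` function, and the choice of a list of rows as the in-memory pivot.

```python
import csv
import io
import json


def _read_delimited(text, delimiter):
    # Decode delimited text into the in-memory pivot: a list of rows.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


def _write_delimited(rows, delimiter):
    # Encode the in-memory pivot back into delimited text.
    out = io.StringIO(newline="")
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return out.getvalue()


# One decoder and one encoder per format, instead of one converter per pair.
DECODERS = {
    "csv": lambda text: _read_delimited(text, ","),
    "tsv": lambda text: _read_delimited(text, "\t"),
    "json": json.loads,
}
ENCODERS = {
    "csv": lambda rows: _write_delimited(rows, ","),
    "tsv": lambda rows: _write_delimited(rows, "\t"),
    "json": json.dumps,
}


def convert(text, source, target):
    rows = DECODERS[source](text)   # source format -> in-memory pivot
    return ENCODERS[target](rows)   # in-memory pivot -> target format


print(convert("a\tb\n1\t2\n", "tsv", "json"))
```

Adding a fourth format to this sketch means writing one decoder and one encoder, rather than a converter to and from each of the existing three.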
Lost and inexact data conversion
The objective of data conversion is to maintain all of the data, and as much of the embedded information as possible. This can only be done if the target format supports the same features and data structures present in the source file. Conversion of a word processing document to a plain text file necessarily involves loss of formatting information, because plain text format does not support word processing constructs such as marking a word as boldface. For this reason, conversion to a format that does not support a feature important to the user is rarely carried out, though it may be necessary for interoperability, e.g. converting a file from one version of Microsoft Word to an earlier version so that it can be transferred to and used by others who do not have the later version of Word installed on their computers.
Loss of information can be mitigated by approximation in the target format. There is no way of converting a character like ä to ASCII, since the ASCII standard lacks it, but the information may be retained by approximating the character as ae. Of course, this is not an optimal solution, and can impact operations like searching and copying; and if a language makes a distinction between ä and ae, then that approximation does involve loss of information.
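A minimal sketch of approximation in the target format. Only the ä to ae mapping comes from the text; the other table entries, the `to_ascii` name, and the choice to replace unmapped characters with `?` are assumptions for illustration.

```python
# Hypothetical approximation table; only "ä" -> "ae" is taken from the text.
APPROXIMATIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def to_ascii(text: str) -> str:
    """Approximate known characters, then replace anything still non-ASCII with '?'."""
    approximated = "".join(APPROXIMATIONS.get(ch, ch) for ch in text)
    return approximated.encode("ascii", errors="replace").decode("ascii")


print(to_ascii("Bär"))
# The approximation is not reversible: both inputs give the same output.
print(to_ascii("ä") == to_ascii("ae"))
```

The second line of output illustrates the caveat above: once converted, text that originally contained ä can no longer be told apart from text that contained ae.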
Data conversion can also suffer from inexactitude, the result of converting between formats that are conceptually different. The WYSIWYG paradigm, extant in word processors and desktop publishing applications, versus the structural-descriptive paradigm, found in SGML, XML and many applications derived therefrom, like HTML and MathML, is one example. Using a WYSIWYG HTML editor conflates the two paradigms, and the result is HTML files with suboptimal, if not nonstandard, code. In the WYSIWYG paradigm a double linebreak signifies a new paragraph, as that is the visual cue for such a construct, but a WYSIWYG HTML editor will usually convert such a sequence to <BR><BR>, which is structurally no new paragraph at all. As another example, converting from PDF to an editable word processor format is a tough chore, because PDF records the textual information like engraving on stone, with each character given a fixed position and linebreaks hard-coded, whereas word processor formats accommodate text reflow. PDF does not know of a word space character: the space between two letters and the space between two words differ only in quantity. Therefore, a title with ample letter-spacing for effect will usually end up with spaces in the word processor file, for example INTRODUCTION with spacing of 1 em appearing as I N T R O D U C T I O N in the word processor.
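The letter-spacing problem can be reproduced with a toy model of positioned glyphs. This is a hypothetical heuristic, not the method of any particular PDF converter: the `Glyph` type, the `lay_out` and `rebuild_text` helpers, the 0.6 em glyph width and the 0.25 em gap threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points


def lay_out(word, font_size, letter_spacing_em, glyph_width_em=0.6):
    """Place each character at a fixed position, as a page description would."""
    glyphs, x = [], 0.0
    for ch in word:
        width = glyph_width_em * font_size
        glyphs.append(Glyph(ch, x, width))
        x += width + letter_spacing_em * font_size
    return glyphs


def rebuild_text(glyphs, font_size, gap_ratio=0.25):
    """Guess word spaces: insert one wherever the gap exceeds gap_ratio em."""
    parts, previous_end = [], None
    for glyph in glyphs:
        if previous_end is not None and glyph.x - previous_end > gap_ratio * font_size:
            parts.append(" ")
        parts.append(glyph.char)
        previous_end = glyph.x + glyph.width
    return "".join(parts)


title = lay_out("INTRODUCTION", font_size=12, letter_spacing_em=1.0)
print(rebuild_text(title, font_size=12))
```

With ordinary spacing the same heuristic returns the word intact; with 1 em of letter-spacing every gap looks like a word space, because the positions alone do not say which gaps were meant as spaces.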
Open vs. secret specifications
Successful data conversion requires thorough knowledge of the workings of both source and target formats. Where the specification of a format is unknown, reverse engineering will be needed to carry out conversion. Reverse engineering can achieve a close approximation of the original specifications, but errors and missing features can still result.
Electronics
Data format conversion can also occur at the physical layer of an electronic communication system. Conversion between line codes such as NRZ and RZ can be accomplished when necessary.
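As a rough software model of such a conversion, the sketch below assumes polar signalling (+1 for a 1 bit, -1 for a 0 bit) and represents each bit period as two half-bit samples in the RZ form. The function names and the sampling scheme are assumptions for illustration; real conversion happens in hardware at the signal level.

```python
def nrz_to_rz(nrz_levels):
    """Polar NRZ (one level per bit) -> polar RZ (level, then zero, per bit)."""
    rz = []
    for level in nrz_levels:
        rz.extend([level, 0])  # hold the level for half a bit, then return to zero
    return rz


def rz_to_nrz(rz_samples):
    """Polar RZ -> polar NRZ: keep the first half-bit sample of each bit period."""
    return rz_samples[0::2]


bits = [1, 0, 1, 1]
nrz = [1 if bit else -1 for bit in bits]
rz = nrz_to_rz(nrz)
print(nrz)
print(rz)
```

The round trip through `rz_to_nrz` recovers the NRZ levels unchanged, since both codes carry the same bits and differ only in how each bit period is shaped.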
See also
- Data migration
- Data transformation
- Comparison of programming languages (basic instructions)#Data conversions
- Transcoding