Document file format
A document file format is a text or binary file format for storing documents on storage media, especially for use by computers.
There currently exist a multitude of incompatible document file formats.
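The text-versus-binary distinction in the definition above can be illustrated with a rough heuristic. The sketch below is only an assumption about how one might classify a chunk of bytes; the function name `looks_like_text` is hypothetical, and the rule (no NUL bytes, valid UTF-8) will misclassify some inputs, for example UTF-16 text or a chunk cut in the middle of a multi-byte character.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic guess: True if the bytes plausibly form a UTF-8/ASCII text file."""
    # NUL bytes are common in binary formats and rare in UTF-8 text.
    if b"\x00" in data:
        return False
    try:
        # ASCII is a subset of UTF-8, so this accepts both encodings.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

In practice such a check would be run on the first few kilobytes of a file rather than on the whole file.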
A rough consensus has been established that XML is to be the basis for future document file formats. Examples of XML-based open standards are DocBook, XHTML, and, more recently, the ISO/IEC standards OpenDocument (ISO 26300:2006) and Office Open XML (ISO 29500:2008).
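One practical consequence of basing document formats on XML is that generic XML tools can read them. As a minimal sketch, the Python standard library can pull the title and paragraphs out of a small DocBook-style fragment. The fragment `SAMPLE` and the helper `extract_text` are made up for illustration; real DocBook files usually carry a DOCTYPE or namespace declaration that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-style fragment, not taken from any real document.
SAMPLE = """<article>
  <title>Example</title>
  <para>First paragraph.</para>
  <para>Second paragraph.</para>
</article>"""

def extract_text(xml_source: str) -> tuple[str, list[str]]:
    """Return the article title and the text of each <para> element."""
    root = ET.fromstring(xml_source)
    title = root.findtext("title", default="")
    paragraphs = [p.text or "" for p in root.iter("para")]
    return title, paragraphs

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```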
In 1993, the ITU-T tried to establish a standard for document file formats, known as the Open Document Architecture (ODA), which was supposed to replace all competing document file formats. It is described in ITU-T documents T.411 through T.421, which are equivalent to ISO 8613. It did not succeed.
Page description languages such as PostScript and PDF have become the de facto standard for documents that a typical user should only be able to create and read, not edit. Starting in 2001, PDF-based formats were published as international ISO standards (ISO 15930-1:2001, ISO 19005-1:2005, ISO 32000-1:2008).
HTML is the most widely used open international standard and is also used as a document file format. It has also become an ISO/IEC standard (ISO 15445:2000).
The default binary file format used by Microsoft Word (.doc) has become a widespread de facto standard for office documents, but it is a proprietary format and is not always fully supported by other word processors.
Common document file formats
- ASCII, UTF-8: plain text formats
- AmigaGuide
- .doc for Microsoft Word: structured binary format developed by Microsoft (specifications available since 2008 under the Open Specification Promise)
- DjVu: file format designed primarily to store scanned documents
- DocBook: an XML format for technical documentation
- HTML (.html, .htm): open standard (ISO standard since 2000), used in combination with any image files it refers to
- FictionBook (.fb2): open XML-based e-book format
- Office Open XML (.docx): XML-based standard for office documents, ISO standard since 2008
- OpenDocument (.odt): XML-based standard for office documents, ISO standard since 2006
- OpenOffice.org XML (.sxw): open, XML-based format for office documents
- OXPS: Open XML Paper Specification
- PalmDoc: common handheld document format
- Plucker: widely used, navigable document format for handheld devices
- .pages for Pages
- PDF: open standard for document exchange, with ISO standards from 2001, 2005, and 2008. It is readable on almost every platform with free or open source readers; open source PDF creators are also available.
- Rich Text Format (RTF): document file format developed by Microsoft since 1987 for Microsoft products and cross-platform document interchange
- SYmbolic LinK (SYLK)
- TeX: popular open-source typesetting program and format; first successful mathematical notation language
- TEI: XML format for digital publication
- Troff
- Uniform Office Format: Chinese standard
- WordPerfect (.wpd, .wp, .wp7, .doc) (note: the .doc extension can be confused with the Microsoft Word format)
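Because an extension such as .doc is shared by more than one format in the list above, software often inspects the first bytes of a file instead of trusting its name. The following is a minimal sketch of that idea. The byte signatures are commonly documented ones and are not taken from this article, the table is far from complete, and the names `SIGNATURES` and `sniff_format` are hypothetical.

```python
# (signature, label) pairs matched against the start of the file.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy Word .doc)"),
    (b"\xffWPC", "WordPerfect"),
    (b"PK\x03\x04", "ZIP container (e.g. .docx, .odt)"),
    (b"{\\rtf", "RTF"),
]

def sniff_format(head: bytes) -> str:
    """Guess a document format from the leading bytes of a file."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"

def sniff_file(path: str) -> str:
    """Read only the first bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))
```

A match only narrows the possibilities: a ZIP signature, for instance, does not distinguish .docx from .odt without looking inside the archive.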
See also
- List of file formats
- List of document markup languages
- Comparison of document markup languages
- Open format