File format
A file format is a particular way that information is encoded for storage in a computer file.

Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
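
The conversion between information and bits can be illustrated with a minimal sketch. It assumes plain ASCII text as the "information"; the function names to_bits and from_bits are hypothetical and chosen only for this example.

```python
def to_bits(text: str, encoding: str = "ascii") -> str:
    """Encode text to bytes, then render each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


def from_bits(bits: str, encoding: str = "ascii") -> str:
    """Reverse of to_bits: parse groups of eight binary digits back into text."""
    return bytes(int(group, 2) for group in bits.split()).decode(encoding)


print(to_bits("Hi"))  # 01001000 01101001
```

A different encoding (a different format) would map the same text to a different bit pattern, which is why the reader of a file must know which format was used.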

File formats can be divided into proprietary and open formats.

Generality

Some file formats are designed for very particular sorts of data: PNG files, for example, store bitmapped images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the Ogg format can act as a container for many different types of multimedia, including any combination of audio and/or video, with or without text (such as subtitles), and metadata. A text file can contain any stream of characters, encoded for example as ASCII or Unicode, including possible control characters. Some file formats, such as HTML, Scalable Vector Graphics and the source code of computer software, are also text files with defined syntaxes that allow them to be used for specific purposes.

Specifications

Many file formats, including some of the best-known ones, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular program treats a particular file format correctly. There are, however, two reasons why this is not always the case. First, some file format developers view their specification documents as trade secrets, and therefore do not release them to the public. Second, some file format developers never spend time writing a separate specification document; rather, the format is defined only implicitly, through the program(s) that manipulate data in the format.

Using file formats without a publicly available specification can be costly. Learning how the format works will require either reverse engineering it from a reference implementation or acquiring the specification document for a fee from the format developers. This second approach is possible only when there is a specification document, and typically requires the signing of a non-disclosure agreement. Both strategies require significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.

Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats require the encoding of data with patented algorithms. For example, using compression with the GIF file format requires the use of a patented algorithm, and although the patent owner did not initially enforce it, they later began collecting fees for use of the algorithm. This resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the patent expired in the US in mid-2003, and worldwide in mid-2004. Algorithms are usually held not to be patentable under current European law, which also includes a provision that members "shall ensure that, wherever the use of a patented technique is needed for a significant purpose such as ensuring conversion of the conventions used in two different computer systems or networks so as to allow communication and exchange of data content between them, such use is not considered to be a patent infringement", which would apparently allow implementation of a patented file system where necessary to allow two different computers to interoperate.

Identifying the type of a file

A method is required to determine the format of a particular file within the filesystem; the format is itself an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages.

Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.

Filename extension

One popular method in use by several operating systems, including Windows, Mac OS X, CP/M, DOS, VMS, and VM/CMS, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .htm (or .html), and GIF images by .gif. In the original FAT filesystem, filenames were limited to an eight-character identifier and a three-character extension, which is known as an 8.3 filename. Many formats thus still use three-character extensions, even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse the operating system and consequently users.

One artifact of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it—an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly.
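
Extension-based identification, and the ease with which renaming defeats it, can be sketched as follows. This is a minimal illustration: the table EXTENSION_TYPES and the function type_from_name are hypothetical, and a real operating system consults a much larger registry.

```python
import os.path

# Hypothetical extension table; real systems keep far larger registries.
EXTENSION_TYPES = {
    ".htm": "HTML document",
    ".html": "HTML document",
    ".gif": "GIF image",
    ".txt": "plain text",
}


def type_from_name(filename: str) -> str:
    """Guess a format from the text after the final period in the name."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower(), "unknown")


print(type_from_name("filename.html"))  # HTML document
print(type_from_name("filename.txt"))   # plain text
```

The file's contents are never inspected, so the same bytes are reported as a different format as soon as the name changes.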

This led more recent operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. This separates the user from the complete filename, preventing accidental changes of file type, while still allowing expert users to retain the original functionality by enabling the display of file extensions.

A downside of hiding the extension is that it then becomes possible to have what appear to be two or more identical filenames in the same folder. This is especially true when image files are needed in more than one format for different applications. For example, a company logo may be needed both in .tif format (for publishing) and .gif format (for web sites). With the extensions visible, these would appear as the unique filenames "CompanyLogo.tif" and "CompanyLogo.gif". With the extensions hidden, these would both appear to have the identical filename "CompanyLogo", making it more difficult to determine which to select for a particular application.

A further downside is that hiding such information can become a security risk. On a system that relies on filename extensions, all usable files have such an extension (for example, all JPEG images have ".jpg" or ".jpeg" at the end of their name), so seeing file extensions is a common occurrence and users may depend on them when determining a file's format. With file extensions hidden, a malicious user can create an executable program with an innocent name such as "Holiday photo.jpg.exe". In this case the ".exe" will be hidden and a user will see this file as "Holiday photo.jpg", which appears to be a JPEG image, unable to harm the machine save for bugs in the application used to view it. However, the operating system will still see the ".exe" extension and thus will run the program, which is then able to cause harm. To further trick users, it is possible to store an icon inside the program, as done on Microsoft Windows, in which case the operating system's icon assignment can be overridden with an icon commonly used to represent JPEG images, making such a program look like, and appear to be named as, an image until it is opened. This issue requires users with extensions hidden to be vigilant, and never to open files that seem to have a known extension displayed despite the hiding option being enabled (since such a file must therefore have two extensions, the real one being unknown until hiding is disabled). This presents a practical problem for Windows systems, where extension hiding is turned on by default.
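
The double-extension trick can be detected mechanically. The sketch below is illustrative only: the two extension sets are assumptions made for this example and are far from complete, and the function names are hypothetical.

```python
import os.path

# Illustrative lists only; neither is complete.
EXECUTABLE_EXTENSIONS = {".exe", ".com", ".bat", ".scr"}
DOCUMENT_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".txt", ".pdf"}


def displayed_name(filename: str) -> str:
    """Name as shown when the final extension is hidden."""
    return os.path.splitext(filename)[0]


def looks_disguised(filename: str) -> bool:
    """True if an executable extension hides behind a document-like one."""
    stem, real_ext = os.path.splitext(filename)
    _, shown_ext = os.path.splitext(stem)
    return (real_ext.lower() in EXECUTABLE_EXTENSIONS
            and shown_ext.lower() in DOCUMENT_EXTENSIONS)


print(displayed_name("Holiday photo.jpg.exe"))   # Holiday photo.jpg
print(looks_disguised("Holiday photo.jpg.exe"))  # True
```

A check like this only flags names; it says nothing about what the file actually contains.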

Internal metadata

A second way to identify a file format is to store information regarding the format inside the file itself. Usually, such information is written as one or more binary strings, or as tagged or raw text, placed at fixed, specific locations within the file. Since the easiest place to locate them is at the beginning of the file, this area is usually called a file header when it is longer than a few bytes, or a magic number if it is just a few bytes long.

File header

The metadata contained in a file header are not necessarily stored only at the beginning but might be present in other areas too, often including the end of the file; it depends on the file format or the type of data it contains. Character-based (text) files have character-based human-readable headers, whereas binary formats usually feature binary headers, although that is not a rule: a human-readable file header may require more bytes, but is easily discernible with simple text or hexadecimal editors.
File headers may contain not only the information required by algorithms to identify the file format, but also real metadata about the file and its contents. For example, most image file formats store information about image size, resolution, color space/format and, optionally, authoring information such as who made the image, when and where it was made, and what camera model and shooting parameters were used (if any; cf. Exif). Such metadata may be used by a program reading or interpreting the file, both during loading and afterwards, and can also be used by the operating system to quickly capture information about the file without loading it all into memory.
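
As a minimal sketch of reading header metadata without loading the whole file, the example below extracts the image size from a GIF header. The byte layout used here (a six-byte signature followed by width and height as 16-bit little-endian integers) comes from the GIF specification rather than from this article, and the function name gif_dimensions is hypothetical.

```python
import struct


def gif_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) from the first 10 bytes of a GIF file."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF header")
    # Two unsigned 16-bit little-endian integers follow the signature.
    width, height = struct.unpack("<HH", data[6:10])
    return width, height


# A hand-built header for a 320 x 200 image (not a complete GIF file).
sample = b"GIF89a" + struct.pack("<HH", 320, 200)
print(gif_dimensions(sample))  # (320, 200)
```

Only ten bytes are needed, so a file browser could call such a function on the first block of each file.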

The downsides of the file header as a file-format identification method are at least two. First, at least a few (initial) blocks of the file need to be read in order to gain such information; those could be fragmented across different locations of the same storage medium, thus requiring more seek and I/O time, which is particularly bad when identifying large quantities of files at once (as when a GUI browses a folder with thousands of files and must determine icons or thumbnails for all of them). Second, if the header is binary hard-coded (i.e. the header itself is subject to a non-trivial interpretation in order to be recognized), especially for the sake of protecting metadata content, there is some risk that the file format is misinterpreted at first sight, or even badly written at the source, often resulting in corrupt metadata (which, in extremely pathological cases, might even render the file unreadable).

A more logically sophisticated example of file header is that used in wrapper (or container) file formats.

Magic number

One way to incorporate such metadata, often associated with Unix and its derivatives, is just to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML.

The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, filename and metadata-based methods need check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where filetypes don't lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if a file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand a valid magic number does not guarantee that the file is not corrupt or of a wrong type.
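
A magic-number test can be sketched as a loop over a small signature table. Everything named here is an assumption made for illustration: MAGIC_DATABASE is a tiny stand-in for a real magic database, the PNG signature is taken from the PNG specification rather than from this article, and the HTML check is a loose heuristic.

```python
from typing import Optional

# A tiny, illustrative "magic database": (offset, signature, label).
MAGIC_DATABASE = [
    (0, b"GIF87a", "GIF image (87a)"),
    (0, b"GIF89a", "GIF image (89a)"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"#!", "script with shebang line"),
]


def sniff(data: bytes) -> Optional[str]:
    """Test the data against every entry; return the first match or None."""
    for offset, signature, label in MAGIC_DATABASE:
        if data[offset:offset + len(signature)] == signature:
            return label
    # Text formats such as HTML need a looser, case-insensitive test.
    head = data.lstrip()[:64].lower()
    if head.startswith((b"<html", b"<!doctype html", b"<?xml")):
        return "HTML or XML text (heuristic)"
    return None


print(sniff(b"GIF89a\x01\x00\x01\x00"))  # GIF image (89a)
```

Because every file is compared against every entry, the cost grows with the size of the database, which is the inefficiency described above. An HTML file that opens with a comment or random text would not be recognized by this heuristic at all.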

So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
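
A shebang line can be split into its interpreter and options with a few lines of code. This is a simplified sketch with a hypothetical function name; it treats everything after the interpreter path as a single options string, and actual operating systems differ in how they split that remainder.

```python
from typing import Optional, Tuple


def parse_shebang(first_line: str) -> Optional[Tuple[str, str]]:
    """Split a shebang line into (interpreter, options), or return None."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    interpreter = parts[0]
    options = parts[1] if len(parts) > 1 else ""
    return interpreter, options


print(parse_shebang("#!/bin/sh -e"))  # ('/bin/sh', '-e')
```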

Another operating system using magic numbers is AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in the Hunk executable file format, and also to let individual programs, tools and utilities deal automatically with their saved data files, or any other kind of file, when saving and loading data. This system was then enhanced with the Amiga standard Datatype recognition system. Another method was the FourCC method, originating in OSType on Macintosh, later adapted by Interchange File Format (IFF) and derivatives.

External metadata

A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself.

This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions (for instance, for compatibility with MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.

Note that zip files or archive files solve the problem of handling metadata. A utility program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but now is transmissible as a single file across operating systems by FTP systems or attached to email. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
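
How an archive carries per-file metadata can be seen with Python's standard zipfile module. The sketch below builds a small archive in memory; the member names and contents are made up for the example.

```python
import io
import zipfile

# Build a small archive in memory; the member names and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("docs/readme.txt", "hello " * 100)
    archive.writestr("logo.gif", b"GIF89a")

# Reopen it and list the per-file metadata stored alongside the data.
with zipfile.ZipFile(buffer) as archive:
    listing = {
        info.filename: (info.file_size, info.compress_size, info.date_time)
        for info in archive.infolist()
    }

for name, (size, packed, stamp) in listing.items():
    print(name, size, packed, stamp)
```

Each entry keeps its folder path, original size, compressed size and timestamp inside the single archive file, so this information survives transfer between systems that store file metadata differently.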

Mac OS type-codes

The Mac OS
Mac OS
Mac OS is a series of graphical user interface-based operating systems developed by Apple Inc. for their Macintosh line of computer systems. The Macintosh user experience is credited with popularizing the graphical user interface...

' Hierarchical File System
Hierarchical File System
Hierarchical File System is a file system developed by Apple Inc. for use in computer systems running Mac OS. Originally designed for use on floppy and hard disks, it can also be found on read-only media such as CD-ROMs...

 stores codes for creator
Creator code
A creator code is a mechanism introduced in pre-Mac OS X versions of the Macintosh operating system to link a data file to the application program which created it, in a manner similar to file extensions in other operating systems. Codes are four-byte OSTypes. For example, the creator code of the...

and type
Type code
A type code is the only mechanism used in pre-Mac OS X versions of the Macintosh operating system to denote a file's format, in a manner similar to file extensions in other operating systems. Codes are four-byte OSTypes...

as part of the directory entry for each file. These codes are referred to as OSType
OSType
OSType is the name of a four-byte sequence commonly used as an identifier in Mac OS. While the bytes can have any value, they usually display figures characterized in software programs such as those utilized in ASCII or Mac OS Roman character sets.OSType values are used to identify file data...

s, and for instance a HyperCard
HyperCard
HyperCard is an application program created by Bill Atkinson for Apple Computer, Inc. that was among the first successful hypermedia systems before the World Wide Web. It combines database capabilities with a graphical, flexible, user-modifiable interface. HyperCard also features HyperTalk, written...

 "stack" file has a creator of WILD (from Hypercard's previous name, "WildCard") and a type of STAK. The type code specifies the format of the file, while the creator code specifies the default program to open it with when double-clicked by the user. For example, the user could have several text files all with the type code of TEXT, but which each open in a different program, due to having differing creator codes. RISC OS
RISC OS
RISC OS is a computer operating system originally developed by Acorn Computers Ltd in Cambridge, England for their range of desktop computers, based on their own ARM architecture. First released in 1987, under the name Arthur, the subsequent iteration was renamed as in 1988...

 uses a similar system, consisting of a 12-bit
Bit
A bit is the basic unit of information in computing and telecommunications; it is the amount of information stored by a digital device or other physical system that exists in one of two possible distinct states...

 number which can be looked up in a table of descriptions — e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript
PostScript
PostScript is a dynamically typed concatenative programming language created by John Warnock and Charles Geschke in 1982. It is best known for its use as a page description language in the electronic and desktop publishing areas. Adobe PostScript 3 is also the worldwide printing and imaging...

 file.
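Because an OSType is simply four bytes, it can be handled as either a short string or a 32-bit integer. The sketch below converts between the two, assuming the conventional big-endian byte order and the Mac OS Roman character set; the function names are hypothetical and not part of any Apple API.

```python
import struct


def ostype_to_int(code):
    """Pack a four-character code such as 'STAK' into a 32-bit integer."""
    raw = code.encode("mac_roman")
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four bytes")
    # Assumption: big-endian order, so the first character is the high byte.
    return struct.unpack(">I", raw)[0]


def int_to_ostype(value):
    """Unpack a 32-bit integer back into its four-character form."""
    return struct.pack(">I", value).decode("mac_roman")


# The HyperCard stack example from the text: creator WILD, type STAK.
creator = ostype_to_int("WILD")
file_type = ostype_to_int("STAK")
print(hex(creator), hex(file_type))
print(int_to_ostype(creator), int_to_ostype(file_type))
```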

Mac OS X Uniform Type Identifiers (UTIs)

A Uniform Type Identifier (UTI) is a method used in Mac OS X
Mac OS X
Mac OS X is a series of Unix-based operating systems and graphical user interfaces developed, marketed, and sold by Apple Inc. Since 2002, has been included with all new Macintosh computer systems...

 for uniquely identifying "typed" classes of entity, such as file formats. It was developed by Apple
Apple Computer
Apple Inc. is an American multinational corporation that designs and markets consumer electronics, computer software, and personal computers. The company's best-known hardware products include the Macintosh line of computers, the iPod, the iPhone and the iPad...

 as a replacement for OSType
OSType
OSType is the name of a four-byte sequence commonly used as an identifier in Mac OS. While the bytes can have any value, they usually display figures characterized in software programs such as those utilized in ASCII or Mac OS Roman character sets.OSType values are used to identify file data...

 (type
Type code
A type code is the only mechanism used in pre-Mac OS X versions of the Macintosh operating system to denote a file's format, in a manner similar to file extensions in other operating systems. Codes are four-byte OSTypes...

 & creator code
Creator code
A creator code is a mechanism introduced in pre-Mac OS X versions of the Macintosh operating system to link a data file to the application program which created it, in a manner similar to file extensions in other operating systems. Codes are four-byte OSTypes. For example, the creator code of the...

s).

The UTI is a Core Foundation
Core Foundation
Core Foundation is a C application programming interface in Mac OS X & iOS, and is a mix of low-level routines and wrapper functions...

 string
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

, which uses a reverse-DNS
Reverse-DNS
The Reverse-DNS is a naming convention for the components, packages, and types used by a programming language, system or framework. A characteristic of reverse-DNS strings is that they are based on registered domain names, and are only reversed for sorting purposes...

 naming convention. Some common and standard types use a domain called public (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format
Portable Document Format
Portable Document Format is an open standard for document exchange. This file format, created by Adobe Systems in 1993, is used for representing documents in a manner independent of application software, hardware, and operating systems....

). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
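A conformance hierarchy of this kind can be modelled as a small directed graph. The following Python sketch is not Apple's implementation; it uses a plain dictionary in which only the public.png, public.image and public.data chain comes from the text, while the com.example entries are hypothetical and exist only to show a type belonging to more than one hierarchy.

```python
# Each identifier maps to the supertypes it directly conforms to.
CONFORMS_TO = {
    "public.data": [],
    "public.image": ["public.data"],
    "public.png": ["public.image"],
    # Hypothetical third-party types, present in two hierarchies at once.
    "com.example.document": [],
    "com.example.sketch": ["public.image", "com.example.document"],
}


def conforms_to(uti, supertype, table=CONFORMS_TO):
    """Return True if uti equals supertype or conforms to it transitively."""
    pending = [uti]
    seen = set()
    while pending:
        current = pending.pop()
        if current == supertype:
            return True
        if current in seen:
            continue
        seen.add(current)
        pending.extend(table.get(current, []))
    return False


print(conforms_to("public.png", "public.data"))
print(conforms_to("com.example.sketch", "com.example.document"))
print(conforms_to("public.data", "public.png"))
```

An application that can display any public.image would accept public.png through such a check without knowing about PNG specifically.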

In addition to file formats, UTIs can also be used for other entities which can exist in OS X, including:
  • Pasteboard data
  • Folders
    Directory (file systems)
    In computing, a folder, directory, catalog, or drawer, is a virtual container originally derived from an earlier Object-oriented programming concept by the same name within a digital file system, in which groups of computer files and other folders can be kept and organized.A typical file system may...

     (directories)
  • Translatable types (as handled by the Translation Manager)
  • Bundles
  • Frameworks
  • Streaming data
  • Aliases and symlinks

OS/2 Extended Attributes

The HPFS, FAT12 and FAT16
File Allocation Table
File Allocation Table is a computer file system architecture now widely used on many computer systems and most memory cards, such as those used with digital cameras. FAT file systems are commonly found on floppy disks, flash memory cards, digital cameras, and many other portable devices because of...

 (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types.

The NTFS
NTFS
NTFS is the standard file system of Windows NT, including its later versions Windows 2000, Windows XP, Windows Server 2003, Windows Server 2008, Windows Vista, and Windows 7....

 filesystem also allows OS/2 extended attributes to be stored, as one of the file forks, but this feature is present only to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be parsed entirely by applications.

POSIX extended attributes

On Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...

 and Unix-like
Unix-like
A Unix-like operating system is one that behaves in a manner similar to a Unix system, while not necessarily conforming to or being certified to any version of the Single UNIX Specification....

 systems, the ext2
Ext2
The ext2 or second extended filesystem is a file system for the Linux kernel. It was initially designed by Rémy Card as a replacement for the extended file system ....

, ext3
Ext3
The ext3 or third extended filesystem is a journaled file system that is commonly used by the Linux kernel. It is the default file system for many popular Linux distributions, including Debian...

, ReiserFS
ReiserFS
ReiserFS is a general-purpose, journaled computer file system designed and implemented by a team at Namesys led by Hans Reiser. ReiserFS is currently supported on Linux . Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file system to be included in the standard kernel...

 version 3, XFS
XFS
XFS is a high-performance journaling file system created by Silicon Graphics, Inc. It is the default file system in IRIX releases 5.3 and onwards and later ported to the Linux kernel. XFS is particularly proficient at parallel IO due to its allocation group based design...

, JFS, FFS, and HFS+
HFS Plus
HFS Plus or HFS+ is a file system developed by Apple Inc. to replace their Hierarchical File System as the primary file system used in Macintosh computers . It is also one of the formats used by the iPod digital music player...

 filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique and a value can be accessed through its related name.
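A minimal Python sketch of storing a type hint in an extended attribute follows. It assumes a Linux system, since os.setxattr and os.getxattr are only provided there, and a filesystem mounted with extended attribute support; the attribute name user.mime_type is an assumption chosen for the example. Both helpers fall back quietly when the platform or filesystem cannot store the attribute.

```python
import os
import tempfile


def set_type_attribute(path, value, name="user.mime_type"):
    """Try to store a type hint as an extended attribute; True on success."""
    if not hasattr(os, "setxattr"):      # not available outside Linux
        return False
    try:
        os.setxattr(path, name, value.encode("utf-8"))
        return True
    except OSError:                      # e.g. filesystem without xattr support
        return False


def get_type_attribute(path, name="user.mime_type"):
    """Return the stored type hint, or None if it is absent or unsupported."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, name).decode("utf-8")
    except OSError:
        return None


with tempfile.NamedTemporaryFile() as tmp:
    stored = set_type_attribute(tmp.name, "text/plain")
    print("stored:", stored, "read back:", get_type_attribute(tmp.name))
```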

PRONOM Unique Identifiers (PUIDs)

The PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique and unambiguous identifiers for file formats, which has been developed by The National Archives of the UK
The National Archives (UK)
The National Archives is a UK government department and an executive agency of the Secretary of State for Justice. It is "the UK government's official archive, containing 1,000 years of history"...

 as part of its PRONOM technical registry
PRONOM technical registry
PRONOM is a web-based technical registry to support digital preservation services, developed by The National Archives of the United Kingdom. PRONOM was the first and remains, to date, the only operational public file format registry in the world,, although the "Magic File" repository of the File...

 service. PUIDs can be expressed as Uniform Resource Identifier
Uniform Resource Identifier
In computing, a uniform resource identifier is a string of characters used to identify a name or a resource on the Internet. Such identification enables interaction with representations of the resource over a network using specific protocols...

s using the info:pronom/ namespace. Although not yet widely used outside of UK government and some digital preservation
Digital preservation
Digital preservation is the set of processes, activities and management of digital information over time to ensure its long term accessibility. The goal of digital preservation is to preserve materials resulting from digital reformatting, and particularly information that is born-digital with no...

 programmes, the PUID scheme does provide greater granularity than most alternative schemes.

MIME types

MIME
MIME
Multipurpose Internet Mail Extensions is an Internet standard that extends the format of email to support:* Text in character sets other than ASCII* Non-text attachments* Message bodies with multiple parts...

 types are widely used in many Internet
Internet
The Internet is a global system of interconnected computer networks that use the standard Internet protocol suite to serve billions of users worldwide...

-related applications, and increasingly elsewhere, although their use for on-disk type information is rare. They consist of a standardised system of identifiers (managed by IANA
Internet Assigned Numbers Authority
The Internet Assigned Numbers Authority is the entity that oversees global IP address allocation, autonomous system number allocation, root zone management in the Domain Name System , media types, and other Internet Protocol-related symbols and numbers...

) consisting of a type and a sub-type, separated by a slash
Slash (punctuation)
The slash is a sign used as a punctuation mark and for various other purposes. It is now often called a forward slash , and many other alternative names.-History:...

 — for instance, text/html or image/gif. These were originally intended as a way of identifying what type of file was attached to an e-mail
E-mail
Electronic mail, commonly known as email or e-mail, is a method of exchanging digital messages from an author to one or more recipients. Modern email operates across the Internet or other computer networks. Some early email systems required that the author and the recipient both be online at the...

, independent of the source and target operating systems. MIME types identify files on BeOS
BeOS
BeOS is an operating system for personal computers which began development by Be Inc. in 1991. It was first written to run on BeBox hardware. BeOS was optimized for digital media work and was written to take advantage of modern hardware facilities such as symmetric multiprocessing by utilizing...

, AmigaOS 4.0 and MorphOS
MorphOS
MorphOS is an Amiga-compatible computer operating system. It is a mixed proprietary and open source OS produced for the Pegasos PowerPC processor based computer, PowerUP accelerator equipped Amiga computers, and a series of Freescale development boards that use the Genesi firmware, including the...

, as well as storing unique application signatures for application launching. In AmigaOS and MorphOS the MIME type system works in parallel with the Amiga-specific Datatype system.

MIME types have problems, though: several organisations and individuals have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
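The type/sub-type structure is easy to work with programmatically. The sketch below splits an identifier at the slash and, as one possible lookup, uses Python's standard mimetypes module to guess a MIME type from a file name; the helper name split_mime and the file names are illustrative only.

```python
import mimetypes


def split_mime(identifier):
    """Split 'type/subtype' into its two parts, rejecting malformed input."""
    major, slash, minor = identifier.partition("/")
    if not slash or not major or not minor:
        raise ValueError("expected the form type/subtype: %r" % identifier)
    return major, minor


print(split_mime("text/html"))
print(split_mime("image/gif"))

# Guessing from a file name relies on an extension table, not file contents.
for filename in ("page.html", "picture.gif"):
    guessed, _encoding = mimetypes.guess_type(filename)
    print(filename, guessed)
```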

File format identifiers (FFIDs)

File format identifiers are another, not widely used, way to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An identifier has the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the originating or maintaining organisation (the number represents a value in a database of companies and standards organisations), and the following two digits categorise the type of file in hexadecimal. The final part is either the usual file extension or the international standard number of the format, padded on the left with zeros. For example, the PNG file specification has the FFID 000000001-31-0015948, where 000000001 indicates the ISO organisation, 31 indicates an image file and 0015948 is the standard number.
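The three-part layout described above can be taken apart with a regular expression. This is a sketch based only on the NNNNNNNNN-XX-YYYYYYY form given in the text, not on the Description Explorer software itself; the FFID class and parse_ffid function are hypothetical names.

```python
import re
from typing import NamedTuple


class FFID(NamedTuple):
    organisation: int   # value in the organisation database
    category: int       # file category, written in hexadecimal
    identifier: str     # extension or standard number, zero padding removed


# Nine digits, two hexadecimal digits, then seven characters.
_FFID_PATTERN = re.compile(r"^(\d{9})-([0-9A-Fa-f]{2})-(\w{7})$")


def parse_ffid(text):
    """Split an FFID string into its organisation, category and identifier."""
    match = _FFID_PATTERN.match(text)
    if match is None:
        raise ValueError("not of the form NNNNNNNNN-XX-YYYYYYY: %r" % text)
    organisation, category, identifier = match.groups()
    return FFID(int(organisation), int(category, 16), identifier.lstrip("0"))


# The PNG example from the text.
print(parse_ffid("000000001-31-0015948"))
```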

File content based format identification

Another way to identify the file format, and the least popular, is to examine the file contents for patterns that distinguish one file type from another. The contents of a file are a sequence of bytes, and a byte can take 256 distinct values (0 to 255). Counting how often each byte value occurs, which is often referred to as the byte frequency distribution, therefore yields patterns that can distinguish file types. Many content-based file type identification schemes use byte frequency distributions to build representative models of file types, and then apply statistical and data mining techniques to identify file types.
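A byte frequency distribution is straightforward to compute. The sketch below builds one and compares it against per-type models using a simple sum of absolute differences; that distance measure, the nearest-model rule and the tiny training samples are assumptions made for illustration, whereas published schemes use larger models and more refined statistics.

```python
def byte_frequency(data):
    """Return a 256-element list with the relative frequency of each byte value."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    total = len(data) or 1          # avoid dividing by zero for empty input
    return [count / total for count in counts]


def distance(first, second):
    """Sum of absolute differences between two frequency distributions."""
    return sum(abs(a - b) for a, b in zip(first, second))


def classify(data, models):
    """Pick the model name whose distribution is closest to that of data."""
    profile = byte_frequency(data)
    return min(models, key=lambda name: distance(profile, models[name]))


# Toy models built from invented samples.
models = {
    "text": byte_frequency(b"plain english text sample " * 20),
    "binary": byte_frequency(bytes(range(256))),
}
print(classify(b"another text file", models))
```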

File structure

There are several ways to structure data in a file. The most common ones are described below.

Unstructured formats (raw memory dumps)

Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file.

This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a Pascal
Pascal (programming language)
Pascal is an influential imperative and procedural programming language, designed in 1968/9 and published in 1970 by Niklaus Wirth as a small and efficient language intended to encourage good programming practices using structured programming and data structuring.A derivative known as Object Pascal...

 string is not recognized as such in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

). On the other hand, developing tools for reading and writing these types of files is very simple.
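The trade-off can be seen in a short Python sketch using the standard struct module. The record layout here (a 16-bit id, a 32-bit size and an 8-byte name field) is hypothetical; the point is that reading and writing is trivial, while the result depends on byte order, padding and string conventions that another platform or language may not share.

```python
import struct

# Hypothetical record: 16-bit id, 32-bit size, 8-byte name field.
# "<" fixes little-endian order and disables padding; "@" would instead use
# the native layout of the machine running the code.
RECORD = struct.Struct("<HI8s")


def dump_record(record_id, size, name):
    """Write the record exactly as it would sit in memory."""
    return RECORD.pack(record_id, size, name.encode("ascii"))


def load_record(raw):
    """Read the record back; the layout must be known in advance."""
    record_id, size, name = RECORD.unpack(raw)
    return record_id, size, name.rstrip(b"\0").decode("ascii")


raw = dump_record(7, 1024, "demo")
print(len(raw), load_record(raw))

# Native layout may insert padding, so the same structure can differ in size.
print(struct.calcsize("<HI8s"), struct.calcsize("@HI8s"))

# The same text as a Pascal-style (length-prefixed) and a C-style
# (NUL-terminated) string: the stored bytes differ.
text = b"demo"
pascal_string = bytes([len(text)]) + text
c_string = text + b"\0"
print(pascal_string, c_string)
```

Note that the format has no room for a new field: adding one changes the record size and breaks every existing reader.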

The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.

Chunk-based formats

Electronic Arts
Electronic Arts
Electronic Arts, Inc. is a major American developer, marketer, publisher and distributor of video games. Founded and incorporated on May 28, 1982 by Trip Hawkins, the company was a pioneer of the early home computer games industry and was notable for promoting the designers and programmers...

 and Commodore
Commodore International
Commodore is the commonly used name for Commodore Business Machines , the U.S.-based home computer manufacturer and electronics manufacturer headquartered in West Chester, Pennsylvania, which also housed Commodore's corporate parent company, Commodore International Limited...

-Amiga
Amiga
The Amiga is a family of personal computers that was sold by Commodore in the 1980s and 1990s. The first model was launched in 1985 as a high-end home computer and became popular for its graphical, audio and multi-tasking abilities...

 pioneered this file format in 1985, with their IFF (Interchange File Format
Interchange File Format
Interchange File Format , is a generic container file format originally introduced by the Electronic Arts company in 1985 in order to ease transfer of data between software produced by different companies....

) file format. In this kind of file structure, each piece of data is embedded in a container that holds a signature identifying the data, as well as the length of the data (for binary encoded files). This type of container is called a "chunk". The signature is usually called a chunk id, chunk identifier, or tag identifier.

With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand.
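The skipping behaviour can be sketched in a few lines of Python. The layout used here, a four-byte identifier followed by a four-byte big-endian length and padding to an even size, follows the general IFF convention, but the chunk identifiers NAME, XTRA and BODY are invented for the example and the code is not a complete reader for any real format.

```python
import io
import struct

HEADER = struct.Struct(">4sI")   # chunk id, payload length (big-endian)


def write_chunk(stream, chunk_id, payload):
    """Append one chunk: header, payload, and a pad byte if the length is odd."""
    stream.write(HEADER.pack(chunk_id, len(payload)))
    stream.write(payload)
    if len(payload) % 2:
        stream.write(b"\0")


def read_chunks(stream, known_ids):
    """Yield (id, payload) for known chunks and skip over all others."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break                               # end of file
        chunk_id, length = HEADER.unpack(header)
        if chunk_id in known_ids:
            yield chunk_id, stream.read(length)
        else:
            stream.seek(length, io.SEEK_CUR)    # the length lets us skip blindly
        if length % 2:
            stream.seek(1, io.SEEK_CUR)         # step over the pad byte


# Hypothetical chunk identifiers, for illustration only.
buffer = io.BytesIO()
write_chunk(buffer, b"NAME", b"abc")
write_chunk(buffer, b"XTRA", b"added by a newer version")
write_chunk(buffer, b"BODY", b"data")
buffer.seek(0)
print(list(read_chunks(buffer, {b"NAME", b"BODY"})))
```

An older reader that knows only NAME and BODY still processes the file correctly, which is what makes the format extensible and backward compatible.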

This concept has been adopted repeatedly: by RIFF (the Microsoft-IBM equivalent of IFF), PNG, JPEG
JPEG
In computing, JPEG . The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality....

 storage, DER (Distinguished Encoding Rules
Distinguished Encoding Rules
Distinguished Encoding Rules , is a message transfer syntax specified by the ITU in X.690. The Distinguished Encoding Rules of ASN.1 is an International Standard drawn from the constraints placed on basic encoding rules encodings by X.509. DER encodings are valid BER encodings...

) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and Structured Data Exchange Format (SDXF)
SDXF
SDXF is a data serialization format defined by RFC 3072.It allows arbitrary structured data of different types to be assembled together for exchanging between computers of different architectures....

. Even XML
XML
Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....

 can be considered a kind of chunk-based format, since each data element is surrounded by tags which are akin to chunk identifiers.

Directory-based formats

This is another extensible format that closely resembles a file system (OLE
Object Linking and Embedding
Object Linking and Embedding is a technology developed by Microsoft that allows embedding and linking to documents and other objects. For developers, it brought OLE Control eXtension , a way to develop and use custom user interface elements...

 Documents are actual filesystems), where the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are disk image
Disk image
A disk image is a single file or storage device containing the complete contents and structure representing a data storage medium or device, such as a hard drive, tape drive, floppy disk, CD/DVD/BD, or USB flash drive, although an image of an optical disc may be referred to as an optical disc image...

s, OLE
Object Linking and Embedding
Object Linking and Embedding is a technology developed by Microsoft that allows embedding and linking to documents and other objects. For developers, it brought OLE Control eXtension , a way to develop and use custom user interface elements...

 documents and TIFF images.
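A toy directory-based container can be sketched as follows. The layout (an entry count, then fixed-size directory entries holding a signature, offset and length, then the data itself) is an assumption made for illustration and does not match TIFF, OLE or any disk image format; the signatures TITL and PIXL are likewise invented.

```python
import struct

COUNT = struct.Struct("<I")      # number of directory entries
ENTRY = struct.Struct("<4sII")   # signature, offset of data, length of data


def build_container(items):
    """Build a file image from (signature, payload) pairs."""
    data_start = COUNT.size + ENTRY.size * len(items)
    directory = b""
    body = b""
    for signature, payload in items:
        # Offsets are measured from the start of the file.
        directory += ENTRY.pack(signature, data_start + len(body), len(payload))
        body += payload
    return COUNT.pack(len(items)) + directory + body


def read_entry(blob, wanted):
    """Look up one item through the directory without scanning the data."""
    (count,) = COUNT.unpack_from(blob, 0)
    for index in range(count):
        position = COUNT.size + index * ENTRY.size
        signature, offset, length = ENTRY.unpack_from(blob, position)
        if signature == wanted:
            return blob[offset:offset + length]
    return None


blob = build_container([(b"TITL", b"hello"), (b"PIXL", b"\x01\x02")])
print(read_entry(blob, b"PIXL"), read_entry(blob, b"NONE"))
```

Unlike a chunk-based reader, which walks the file from start to end, a directory-based reader can jump straight to the item it needs.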

See also

  • Audio file format
    Audio file format
    An audio file format is a file format for storing digital audio data on a computer system. This data can be stored uncompressed, or compressed to reduce the file size. It can be a raw bitstream, but it is usually a container format or an audio data format with defined storage layer.-Types of...

  • Chemical file format
    Chemical file format
    This article discusses some common molecular file formats, including usage and converting between them.-Distinguishing formats:Chemical information is usually provided as files or streams and many formats have been created, with varying degrees of documentation. The format can be found by three...

  • Container format (digital)
  • Document file format
    Document file format
    A document file format is a text or binary file format for storing documents on a storage media, especially for use by computers.There currently exist a multitude of incompatible document file formats....

  • DROID file format identification utility
  • File (command), a file type identification utility
  • File Formats, Transformation, and Migration (related wikiversity article)
  • FormatFactory
    FormatFactory
    - External links :* and on CNet on Stern's web site in netzwelt in Computerra...

    , a free omni file format converter.
  • Future proofing
  • Graphics file format summary
  • List of archive formats
  • Image file formats
    Image file formats
    Image file formats are standardized means of organizing and storing digital images. Image files are composed of either pixels, vector data, or a combination of the two. Whatever the format, the files are rasterized to pixels when displayed on most graphic displays...

  • List of file formats
  • List of free file formats
  • List of motion and gesture file formats
  • Magic number (programming)
    Magic number (programming)
    In computer programming, the term magic number has multiple meanings. It could refer to one or more of the following:* A constant numerical or text value used to identify a file format or protocol; for files, see List of file signatures...

  • List of file signatures, or "magic numbers"
  • Object file
    Object file
    An object file is a file containing relocatable format machine code that is usually not directly executable. Object files are produced by an assembler, compiler, or other language translator, and used as input to the linker....

  • Open format
    Open format
    An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical...

  • Windows file types

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 