PRONOM technical registry
PRONOM is a web-based technical registry to support digital preservation services, developed by The National Archives of the United Kingdom. PRONOM was the first and remains, to date, the only operational public file format registry in the world, although the "Magic File" repository of the file command has served this role in a less formal capacity for two decades. Other projects to develop technical registries, including the UK Digital Curation Centre's Representation Information Registry and the Global Digital Format Registry project at Harvard University, are now in progress.
PRONOM's origins lie in a requirement to have access to reliable technical information about the electronic records held by The National Archives. By definition, electronic records are not inherently human-readable: file formats encode information into a form which can only be processed and rendered comprehensible by very specific technological environments. The accessibility of that information is therefore highly vulnerable to technological obsolescence. Technical information about the structure of those file formats, and the software and hardware environments required to support them, is therefore a prerequisite for any digital preservation regime. PRONOM was developed to provide this function, initially as an internal resource for National Archives staff, and subsequently as a public, web-based resource.
Development
The first version of PRONOM was developed by The National Archives digital preservation department in March 2002. PRONOM 2 was released in December 2002, and provided support for the development of multi-lingual versions of the registry. The web-enabling of PRONOM (PRONOM 3) in February 2004 represented the starting point for the development of PRONOM as a major online resource for the international digital preservation community.
PRONOM 4, released in October 2005, includes a significant reworking of the underlying data model to allow the capture of detailed technical information on file formats and support future interoperability with other planned registry systems, and the release of the DROID software for automatic file format identification.
The latest version, PRONOM 5, was a relatively minor update to support improvements to DROID and was released in 2006. A much more substantial update is planned for 2007, which will include the exposure of core PRONOM functions through web services interfaces. This work forms part of the Seamless Flow programme to position The National Archives to receive and manage future government records in electronic formats.
In future, PRONOM may participate as a node in the planned Global Digital Format Registry project.
The National Archives won the 2007 Digital Preservation Award, sponsored by the Digital Preservation Coalition, for its work on PRONOM and DROID.
Services
The core technical registry supports a number of specific services:
The PRONOM registry provides a searchable web database of technical information about file formats and about the software tools and technical environments required to access them. Users can search for formats and software using a variety of criteria, such as format or software name and file extension. PRONOM also holds information about support periods for software products, and can be queried on this basis. In addition to on-screen viewing, registry information can be exported in XML, CSV and printer-friendly formats. The PRONOM website allows users to submit new information for inclusion in PRONOM.
The PRONOM Persistent Unique Identifier (PUID) scheme
The PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique and unambiguous identifiers for records in the PRONOM registry. Such identifiers are fundamental to the exchange and management of digital objects, by allowing human or automated user agents to unambiguously identify, and share that identification of, the representation information required to support access to an object. This is by virtue both of the inherent uniqueness of the identifier, and of its binding to a definitive description of the representation information in a registry such as PRONOM.
At present, the PUID scheme is limited to one particular class of representation information: the format in which a digital object is encoded. Formats were considered a particular priority for such a scheme, as no existing, universally applicable system provides for this. Unix magic numbers and Macintosh data forks do provide some of this functionality, but the same is not true within DOS or Microsoft Windows environments. The three-character file extension is neither standardised nor unique, and is interpreted differently by different environments. Equally, the IANA MIME-type scheme does not provide sufficient granularity or coverage to satisfy the requirements for unique identifiers. The PUID scheme has been developed for the single purpose of providing such identifiers.
The scheme has been adopted as the recommended encoding scheme for describing file formats in the latest version of the UK e-Government Metadata Standard. The scheme is designed to be extensible, and may be expanded in future to include other classes of representation information in PRONOM, such as compression methods, character encoding schemes, and operating systems.
PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace, details of which are available from the info URI registry. Neither the PUID scheme, nor its expression as an info URI, supports any inherent dereferencing mechanism, i.e. a PUID does not resolve to a Uniform Resource Locator. However, The National Archives is planning to develop a range of services to expose PRONOM registry content, including a resolution service for PUIDs.
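The mapping between a PUID and its info URI form can be illustrated with a short Python sketch. It assumes, as an illustration only, that a PUID consists of a namespace (fmt or x-fmt), a slash and a number, for example fmt/18; the function names are hypothetical and are not part of any PRONOM software. In keeping with the absence of a dereferencing mechanism, the conversion is pure string handling and performs no network lookup.

```python
import re

# Assumed PUID shape (illustrative): "fmt" or "x-fmt", a slash, then digits.
_PUID_PATTERN = re.compile(r"(x-fmt|fmt)/[0-9]+")
_INFO_PREFIX = "info:pronom/"


def puid_to_info_uri(puid: str) -> str:
    """Express a PUID as a URI in the info:pronom/ namespace."""
    if not _PUID_PATTERN.fullmatch(puid):
        raise ValueError(f"not a recognised PUID: {puid!r}")
    return _INFO_PREFIX + puid


def info_uri_to_puid(uri: str) -> str:
    """Recover the PUID from an info:pronom/ URI without resolving anything."""
    if not uri.startswith(_INFO_PREFIX):
        raise ValueError(f"not an info:pronom/ URI: {uri!r}")
    puid = uri[len(_INFO_PREFIX):]
    if not _PUID_PATTERN.fullmatch(puid):
        raise ValueError(f"not a recognised PUID: {puid!r}")
    return puid
```

Under these assumptions, puid_to_info_uri("fmt/18") yields "info:pronom/fmt/18", and info_uri_to_puid reverses the conversion.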
DROID
DROID (Digital Record Object Identification) is a software tool developed by The National Archives to perform automated batch identification of file formats. It is one of a planned series of tools utilising PRONOM to provide specific digital preservation services. DROID uses internal (byte sequence) and external (file extension) signatures to identify and report the specific file format versions of digital files. These signatures are stored in an XML signature file, generated from information recorded in the PRONOM technical registry. New and updated signatures are regularly added to PRONOM, and DROID can be configured to automatically download updated signature files from the PRONOM website via web services.
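The distinction between internal and external signatures can be illustrated with a minimal Python sketch. This is not DROID's implementation or signature file format: the Signature class, the identify function and the labels are hypothetical, and the sketch checks only a fixed byte prefix at the start of the file, falling back to the file extension when no byte sequence matches. The two byte prefixes shown are the well-known leading bytes of PDF and PNG files.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional, Tuple


@dataclass(frozen=True)
class Signature:
    label: str                   # hypothetical label, not a real PUID
    magic: bytes                 # internal signature: bytes expected at offset 0
    extensions: Tuple[str, ...]  # external signature: lowercase extensions


# Illustrative table only; a real signature file would be far richer.
SIGNATURES = (
    Signature("example-pdf", b"%PDF-", ("pdf",)),
    Signature("example-png", b"\x89PNG\r\n\x1a\n", ("png",)),
)


def identify(filename: str, data: bytes) -> Optional[Tuple[str, str]]:
    """Return (label, basis), or None when no signature matches.

    basis is "byte sequence" for an internal match and "extension"
    for an external one; internal matches are tried first.
    """
    for sig in SIGNATURES:
        if data.startswith(sig.magic):
            return sig.label, "byte sequence"
    ext = PurePath(filename).suffix.lstrip(".").lower()
    for sig in SIGNATURES:
        if ext in sig.extensions:
            return sig.label, "extension"
    return None
```

Trying the byte sequence first reflects the point made above about file extensions being neither standardised nor unique: in this sketch a misnamed file is still identified by its content, and the extension is used only as a weaker fallback.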
DROID allows files and folders to be selected from a file system for identification. After the identification process has been run, the results can be output in XML, CSV or printer-friendly formats.
DROID is a platform-independent Java tool. It includes a documented, public API, and can be invoked from both GUI and command line interfaces.
Future services
Proposed future services include format risk assessments and preservation planning, and the automated generation of migration pathways for converting between formats.
See also
- Digital curation
- Digital preservation
- File format
- File (command)