Open Archives Initiative Protocol for Metadata Harvesting
OAI-PMH is a protocol developed by the Open Archives Initiative. It is used to harvest (or collect) the metadata descriptions of the records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations.

The protocol is usually just referred to as the OAI Protocol.

OAI-PMH uses XML over HTTP. The current version is 2.0, updated in 2008.
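Because responses are plain XML documents, a harvester can read them with any XML parser. The following is a minimal sketch, not a reference implementation: the sample response is hand-written for illustration, the repository address example.org and the record contents are made up, and the function name parse_records is hypothetical. The namespace URIs are those used by OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and the oai_dc (Dublin Core) format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (illustrative only; example.org is a placeholder).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-06-01T19:20:30Z</responseDate>
  <request verb="GetRecord" identifier="oai:example.org:1"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-05-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Return identifier, datestamp and Dublin Core titles for each record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        header = rec.find("oai:header", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "titles": [t.text for t in rec.iterfind(".//dc:title", NS)],
        })
    return records


if __name__ == "__main__":
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

In a real harvester the XML text would come from an HTTP response rather than a string constant; fetching is left out here so the sketch runs without network access.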

History

In the late 1990s, Herbert Van de Sompel (Ghent University) was working with researchers and librarians at Los Alamos National Laboratory (US) and called a meeting to address interoperability problems among e-print servers and digital repositories. The meeting was held in Santa Fe, New Mexico, in October 1999. A key development from the meeting was the definition of an interface that permitted e-print servers to expose metadata for the papers they held in a structured fashion, so that other repositories could identify and copy papers of interest. This interface/protocol was named the "Santa Fe Convention".

Several workshops were held in 2000 at the ACM Digital Libraries conference and elsewhere to share the ideas from the Santa Fe Convention. It was discovered at the workshops that the problems faced by the e-print community were also shared by libraries, museums, journal publishers, and others who needed to share distributed resources. To address these needs, the Coalition for Networked Information and the Digital Library Federation provided funding to establish an Open Archives Initiative (OAI) secretariat managed by Herbert Van de Sompel and Carl Lagoze. The OAI held a meeting at Cornell University (Ithaca, New York) in September 2000 to improve the interface developed at the Santa Fe Convention. The specifications were refined over e-mail.

OAI-PMH version 1.0 was introduced to the public in January 2001 at a workshop in Washington D.C., and another in February in Berlin, Germany. Subsequent modifications to the XML standard by the W3C required minor modifications to OAI-PMH, resulting in version 1.1. The current version, 2.0, was released in June 2002. It contained several technical changes and enhancements and is not backward compatible.

OAI registries

The OAI Protocol has been widely adopted by digital libraries, institutional repositories, and digital archives. Registering a repository is not mandatory, but it is encouraged.

There are several large registries of OAI-compliant repositories:
  1. The Open Archives list of registered OAI repositories
  2. The OAI registry at University of Illinois at Urbana-Champaign
  3. The Celestial OAI registry
  4. Eprint’s Institutional Archives Registry
  5. Openarchives.eu The European Guide to OAI-PMH compliant repositories in the world
  6. ScientificCommons.org A worldwide service and registry

Uses

Commercial search engines have started using OAI-PMH to acquire more resources. Google is using OAI-PMH to harvest information from the National Library of Australia Digital Object Repository. In 2004, Yahoo! acquired content from OAIster (University of Michigan) that was obtained through metadata harvesting with OAI-PMH. Google accepted OAI-PMH as part of its Sitemap Protocol, but decided to stop doing so in 2008. Wikimedia uses an OAI-PMH repository to provide feeds of Wikipedia and related site updates for search engines and other bulk analysis/republishing endeavors. When thousands of files are harvested every day, OAI-PMH can reduce network traffic and other resource usage through incremental harvesting, in which only records changed since the previous harvest are requested. NASA's Mercury: Metadata Search System uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day.
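Incremental harvesting can be pictured as keeping a checkpoint between runs and asking only for records stamped on or after it. The sketch below is one possible bookkeeping scheme under stated assumptions, not the method used by any of the services named above: the function names are hypothetical, and it assumes all datestamps are UTC strings of the same granularity (for example YYYY-MM-DD), so that plain string comparison orders them chronologically. The verb ListRecords and the arguments metadataPrefix and from are part of OAI-PMH 2.0.

```python
def incremental_params(checkpoint, metadata_prefix="oai_dc"):
    """Build ListRecords arguments; omit 'from' on the very first harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if checkpoint is not None:
        # Only records created, changed or deleted since the checkpoint.
        params["from"] = checkpoint
    return params


def advance_checkpoint(checkpoint, harvested_datestamps):
    """Return the newest datestamp seen so far.

    Assumes same-granularity UTC datestamps, which sort correctly as strings.
    """
    candidates = list(harvested_datestamps)
    if checkpoint is not None:
        candidates.append(checkpoint)
    return max(candidates) if candidates else None


if __name__ == "__main__":
    checkpoint = None
    print(incremental_params(checkpoint))  # first run: full harvest
    checkpoint = advance_checkpoint(
        checkpoint, ["2011-03-01", "2011-03-04", "2011-03-02"]
    )
    print(incremental_params(checkpoint))  # later runs: only newer records
```

Because the from argument is inclusive, records carrying exactly the checkpoint datestamp are returned again on the next run, so a harvester built this way would need to tolerate duplicates.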

The mod_oai project uses OAI-PMH to expose content that is accessible from Apache Web servers to web crawlers.

Software

OAI-PMH is based on a client–server architecture, in which "harvesters" request information on updated records from "repositories". Requests for data can be based on a datestamp range, and can be restricted to named sets defined by the provider. Data providers are required to provide XML metadata in Dublin Core format, and may also provide it in other XML formats.
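A harvester's request is an ordinary HTTP query string. As a hedged illustration, the sketch below builds a ListRecords request restricted by datestamp range and set. The helper name build_list_records_url, the base URL http://example.org/oai and the set name physics are placeholders invented for the example; the verb ListRecords, the arguments metadataPrefix, from, until and set, and the oai_dc prefix for Dublin Core come from OAI-PMH 2.0.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL for a repository's base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # lower bound of the datestamp range
    if until_date:
        params["until"] = until_date    # upper bound of the datestamp range
    if set_spec:
        params["set"] = set_spec        # restrict to a provider-defined set
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_list_records_url(
        "http://example.org/oai",       # placeholder repository address
        from_date="2002-01-01",
        until_date="2002-06-30",
        set_spec="physics",             # placeholder set name
    ))
```

Requesting a different metadata_prefix would select one of the additional XML formats, provided the repository offers it.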

A number of software systems support the OAI-PMH, including Fedora, GNU EPrints from the University of Southampton, Open Journal Systems from the Public Knowledge Project, Desire2Learn, DSpace from MIT, HyperJournal from the University of Pisa, Primo, DigiTool, Rosetta and MetaLib from Ex Libris, DOOR from the eLab in Lugano, Switzerland, panFMP from PANGAEA, SimpleDL from Roaring Development, and jOAI.

Archives

A number of large archives support the protocol, including arXiv and the CERN Document Server.

See also

  • Data format management
  • Digital curation
  • Digital preservation
  • File format
  • Library of Congress Digital Library project
  • Dublin Core, an ISO metadata standard
  • National Digital Information Infrastructure and Preservation Program
  • Metadata Encoding and Transmission Standard, maintained by the Library of Congress
  • Preservation Metadata: Implementation Strategies (PREMIS)
  • LOCKSS
  • Web archiving


The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 