Web archiving

Web archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection. The largest web archiving organization based on a crawling approach is the Internet Archive, which strives to maintain an archive of the entire Web. National libraries, national archives, and various consortia of organizations are also involved in archiving culturally important Web content. Commercial web archiving software and services are also available to organizations that need to archive their own web content for corporate heritage, regulatory, or legal purposes.

Collecting the web

Web archivists generally archive all types of web content including HTML web pages, style sheets, JavaScript, images, and video. They also archive metadata about the collected resources such as access time, MIME type, and content length. This metadata is useful in establishing the authenticity and provenance of the archived collection.
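
For illustration, the following sketch records this kind of metadata while fetching a resource, using only Python's standard library. It is a minimal, assumption-laden example rather than any particular archive's implementation, and the URL is a placeholder.

  # Sketch: capture archival metadata alongside a fetched resource.
  # Real archives record much richer metadata (for example in WARC headers).
  from datetime import datetime, timezone
  from urllib.request import urlopen

  def fetch_with_metadata(url):
      access_time = datetime.now(timezone.utc).isoformat()
      with urlopen(url) as response:
          body = response.read()
      metadata = {
          "url": url,
          "access_time": access_time,                        # when the resource was collected
          "mime_type": response.headers.get("Content-Type"), # as reported by the server
          "content_length": len(body),                       # actual bytes received
          "status": response.status,
      }
      return body, metadata

  body, meta = fetch_with_metadata("https://example.org/")   # placeholder URL
  print(meta)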

Remote harvesting

The most common web archiving technique uses web crawlers to automate the process of collecting web pages. Web crawlers typically view web pages in the same manner that users with a browser see the Web, and therefore provide a comparatively simple method of remotely harvesting web content (a minimal sketch of this approach follows the list below). Examples of web crawlers used for web archiving include:
  • Automated Internet Sessions in biterScripting
  • Heritrix
  • HTTrack
  • Wget
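
As a rough illustration of remote harvesting, the sketch below implements a tiny breadth-first crawler in Python using only the standard library. It is a simplified assumption of how such tools work, not the implementation of Heritrix, HTTrack, or Wget; real crawlers add politeness delays, robots.txt handling, and durable storage formats such as WARC.

  # Minimal breadth-first harvesting sketch (illustrative only).
  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin, urlparse
  from urllib.request import urlopen

  class LinkExtractor(HTMLParser):
      """Collects the href targets of anchor tags."""
      def __init__(self):
          super().__init__()
          self.links = []
      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def harvest(seed, max_pages=50):
      host = urlparse(seed).netloc
      queue, seen, archive = deque([seed]), {seed}, {}
      while queue and len(archive) < max_pages:
          url = queue.popleft()
          try:
              with urlopen(url) as resp:
                  html = resp.read().decode("utf-8", errors="replace")
          except OSError:
              continue                        # skip unreachable pages
          archive[url] = html                 # store the captured page
          parser = LinkExtractor()
          parser.feed(html)
          for link in parser.links:
              absolute = urljoin(url, link)   # resolve relative links
              if urlparse(absolute).netloc == host and absolute not in seen:
                  seen.add(absolute)
                  queue.append(absolute)
      return archive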


On-demand

There are numerous services that may be used to archive web resources "on-demand", using web crawling techniques.
  • Aleph Archives, offers web archiving services for regulatory compliance and eDiscovery, aimed at corporate (Global 500 market), legal, and government clients.
  • Archive-It, a subscription service which allows institutions to build, manage and search their own web archive.
  • Archivethe.net, a shared web-archiving platform operated by the Internet Memory Foundation (formerly European Archive Foundation).
  • BackupURL.com, allows creation of "a copy of any website that you can share and view any time knowing it will last forever." This service is on the Wikipedia blacklist.
  • Compliance WatchDog by SiteQuest Technologies, a subscription service that archives websites and allows users to browse the site as it appeared in the past. It also monitors sites for changes and alerts compliance personnel if a change is detected.
  • freezePAGE snapshots, a free/subscription service. To preserve snapshots, unregistered users must log in every 30 days, and registered users every 60 days.
  • Hanzo Archives, provides web archiving, cloud archiving, and social media archiving software and services for e-discovery, information management, Financial Industry Regulatory Authority (FINRA), United States Securities and Exchange Commission (SEC), and Food and Drug Administration (FDA) compliance, as well as corporate heritage. Hanzo is used by leading organisations in many industries and in national governmental institutions. Web archive access is on-demand in native format, and includes full-text search, annotations, redaction, archive policy and temporal browsing. Hanzo is integrated with leading electronic discovery applications and services.
  • Iterasi, provides enterprise web archiving for compliance, litigation protection, e-discovery and brand heritage to enterprise companies, financial organizations, government agencies and more.
  • Nextpoint, offers an automated cloud-based SaaS for marketing, compliance and litigation-related needs, including electronic discovery.
  • PageFreezer, a subscription SaaS service to archive, replay and search websites, blogs, Web 2.0, Flash and social media for marketing, eDiscovery, and regulatory compliance with the U.S. Food and Drug Administration (FDA), the Financial Industry Regulatory Authority (FINRA), the U.S. Securities and Exchange Commission (SEC), the Sarbanes–Oxley Act, the Federal Rules of Evidence, and records management laws. Archives can be used as legal evidence.
  • Perpetually, creates forensically sound archives of any web page for compliance, regulated public companies, competitive intelligence and institutional memory.
  • Reed Technology Web Archiving Services powered by Iterasi, offers litigation protection, regulatory compliance & eDiscovery in the corporate, legal and government industries.
  • The Web Archiving Service is a subscription service optimized for the academic environment guided by input from librarians, archivists and researchers. WAS provides topical browsing, change comparison and site-by-site control of capture settings and frequency. Developed and hosted by the University of California Curation Center at the California Digital Library.
  • WebCite, a free service specifically for scholarly authors, journal editors and publishers to permanently archive and retrieve cited Internet references.
  • Website-Archive.com, a subscription service. Captures screen-shots of pages, transactions and user journeys using "actual browsers". Screen-shots can be viewed online or downloaded in a monthly archive. Uses Cloud Testing technology.


Database archiving

Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the database content into a standard schema, often using XML. Once stored in that standard format, the archived content of multiple databases can then be made available using a single access system. This approach is exemplified by the DeepArc and Xinq tools developed by the Bibliothèque nationale de France and the National Library of Australia respectively. DeepArc enables the structure of a relational database to be mapped to an XML schema, and the content exported into an XML document. Xinq then allows that content to be delivered online. Although the original layout and behavior of the website cannot be preserved exactly, Xinq does allow the basic querying and retrieval functionality to be replicated.
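
The mapping step can be pictured with a short sketch that exports a relational table to an XML document using Python's standard library. This is a generic illustration of the technique, not DeepArc's actual schema or code; the database file, table, and column names are invented.

  # Sketch: export a relational table to XML, in the spirit of database archiving.
  import sqlite3
  import xml.etree.ElementTree as ET

  conn = sqlite3.connect("site.db")            # hypothetical database file
  conn.row_factory = sqlite3.Row               # rows expose column names
  root = ET.Element("table", name="articles")  # invented table name
  for row in conn.execute("SELECT * FROM articles"):
      record = ET.SubElement(root, "record")
      for column in row.keys():
          field = ET.SubElement(record, column)
          field.text = "" if row[column] is None else str(row[column])
  ET.ElementTree(root).write("articles.xml", encoding="utf-8", xml_declaration=True)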

Transactional archiving

Transactional archiving is an event-driven approach which collects the actual transactions that take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content which was actually viewed on a particular website on a given date. This may be particularly important for organizations that need to comply with legal or regulatory requirements for disclosing and retaining information.

A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bitstreams. A transactional archiving system requires the installation of software on the web server, and cannot therefore be used to collect content from a remote website.
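
The mechanism can be sketched as WSGI middleware that hashes each outgoing response and stores unseen content as a bitstream. This is a hypothetical design for illustration, not a description of any actual transactional archiving product.

  # Sketch: transactional archiving as WSGI middleware (hypothetical design).
  import hashlib, os, time

  class TransactionalArchiver:
      def __init__(self, app, store_dir="archive"):
          self.app = app
          self.store_dir = store_dir
          self.seen = set()                    # hashes of already-stored content
          os.makedirs(store_dir, exist_ok=True)

      def __call__(self, environ, start_response):
          chunks = []
          for chunk in self.app(environ, start_response):
              chunks.append(chunk)
              yield chunk                      # pass the response through unchanged
          body = b"".join(chunks)
          digest = hashlib.sha256(body).hexdigest()
          if digest not in self.seen:          # filter duplicate content
              self.seen.add(digest)
              name = "%d-%s.bin" % (time.time(), digest[:12])
              with open(os.path.join(self.store_dir, name), "wb") as f:
                  f.write(body)                # permanently store the bitstream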

Crawlers

Web archives which rely on web crawling as their primary means of collecting the Web are influenced by the difficulties of web crawling:
  • The robots exclusion protocol may request that crawlers not access portions of a website (see the sketch after this list). Some web archivists may ignore the request and crawl those portions anyway.
  • Large portions of a web site may be hidden in the deep Web. For example, the results page behind a web form lies in the deep Web because most crawlers cannot follow a link to the results page.
  • Crawler traps (e.g., calendars) may cause a crawler to download an infinite number of pages, so crawlers are usually configured to limit the number of dynamic pages they crawl.
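
For the first point, a crawler's robots.txt check can be illustrated with Python's standard urllib.robotparser module; the site and user-agent name below are placeholders.

  # Sketch: consulting the robots exclusion protocol before fetching a URL.
  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser()
  rp.set_url("https://example.org/robots.txt")  # placeholder site
  rp.read()                                     # fetch and parse robots.txt

  url = "https://example.org/private/page.html"
  if rp.can_fetch("MyArchiveBot", url):         # "MyArchiveBot" is an invented user agent
      print("allowed to crawl", url)
  else:
      print("robots.txt disallows", url)        # some archivists choose to crawl anyway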


However, a native-format web archive, i.e., a fully browsable web archive with working links, media, and so on, is only really possible using crawler technology.

The Web is so large that crawling a significant portion of it requires substantial technical resources, and it changes so fast that portions of a website may change before a crawler has even finished crawling it.

General limitations

  • Some web servers are configured to return different pages to web archiver requests than they would in response to regular browser requests. This is typically done to fool search engines into directing more user traffic to a website; it may also be done to avoid accountability, or to provide enhanced content only to those browsers that can display it. (A crude detection sketch follows below.)
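
One crude way to notice such behaviour, sketched here purely as an illustration rather than an established tool, is to fetch the same URL with a crawler-like and a browser-like User-Agent and compare the responses (dynamic pages differ between any two fetches, so a real check would need to be more careful):

  # Sketch: detect servers that answer archivers and browsers differently.
  import hashlib
  from urllib.request import Request, urlopen

  def fingerprint(url, user_agent):
      req = Request(url, headers={"User-Agent": user_agent})
      with urlopen(req) as resp:
          return hashlib.sha256(resp.read()).hexdigest()

  url = "https://example.org/"                              # placeholder URL
  as_bot = fingerprint(url, "MyArchiveBot/1.0")             # invented crawler UA
  as_browser = fingerprint(url, "Mozilla/5.0 (X11; Linux x86_64)")
  if as_bot != as_browser:
      print("server may be cloaking: responses differ by user agent")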


Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web". However, national libraries in many countries do have a legal right to copy portions of the web under an extension of legal deposit.

Some private non-profit web archives that are made publicly accessible, like WebCite or the Internet Archive, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage. WebCite cites a recent lawsuit against Google's caching, which Google won.

Aspects of web curation

Web curation, like any digital curation, entails:
  • Certification of the trustworthiness and integrity of the collection content
  • Collecting verifiable Web assets
  • Providing Web asset search and retrieval
  • Semantic and ontological continuity and comparability of the collection content


Thus, besides the discussion of methods of collecting the Web, methods of providing access, certification, and organization must be included. A set of popular tools addresses these curation steps:

A suite of tools for Web curation by the International Internet Preservation Consortium:

Other open source tools for manipulating web archives:
  • WARC Tools - for creating, reading, parsing and manipulating web archives programmatically (see the sketch after this list)
  • Search Tools - for indexing and searching full-text and metadata within web archives
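
As an example of programmatic manipulation, the sketch below iterates over the records of a WARC file using the open-source warcio library, a separate project used here only for illustration (not the WARC Tools suite itself); the file name is a placeholder.

  # Sketch: read archived HTTP responses from a WARC file with warcio
  # (installable via "pip install warcio").
  from warcio.archiveiterator import ArchiveIterator

  with open("crawl.warc.gz", "rb") as stream:   # placeholder file name
      for record in ArchiveIterator(stream):
          if record.rec_type == "response":     # archived HTTP responses
              uri = record.rec_headers.get_header("WARC-Target-URI")
              body = record.content_stream().read()
              print(uri, len(body), "bytes")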

See also

  • Archive
  • Archive site
  • Digital preservation
  • Heritrix
  • International Internet Preservation Consortium
  • Internet Archive
  • Wayback Machine
  • Library of Congress Digital Library project
  • List of Web archiving initiatives
  • National Digital Information Infrastructure and Preservation Program
  • Pandora Archive
  • Portuguese Web Archive
  • Project MINERVA
  • UK Web Archiving Consortium
  • Virtual artifact
  • WebCite
  • Web crawling

