Content-addressable storage
Content-addressable storage, also referred to as associative storage or abbreviated CAS, is a mechanism for storing information that can be retrieved based on its content, not its storage location. It is typically used for high-speed storage and retrieval
 of fixed content, such as documents stored for compliance with government regulations. Roughly speaking, content-addressable storage is the permanent-storage analogue to content-addressable memory
.

CAS and FCS

Content Addressable Storage (CAS) and Fixed Content Storage (FCS) are different acronyms for the same type of technology. CAS / FCS technology is intended to store data that does not change over time (fixed content). The difference is that CAS typically exposes a digest generated by a cryptographic hash function
 (such as MD5
 or SHA-1) from the document it refers to. If the hash function is weak, this method could be subject to collisions (different documents returning the same hash) in an adversarial environment. The main advantage of CAS / FCS technology is that the location of the actual data and the number of copies are hidden from the user. The metaphor for CAS / FCS is not that of memory and memory locations; the proper metaphor is that of a coat check. The difference is that, with a coat check, once the item has been retrieved it cannot be retrieved again, whereas with CAS / FCS technology a client can retrieve the same data using the same claim check over and over.
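
The collision risk described above can be illustrated with a deliberately weakened hash. The following is a toy sketch under stated assumptions, not how any real CAS product works: `weak_address` is a hypothetical helper that keeps only the first byte of a SHA-256 digest, so only 256 distinct addresses exist and two different documents must eventually share one.

```python
import hashlib


def weak_address(data: bytes) -> str:
    """Toy content address: first byte of SHA-256 only (hypothetical, deliberately weak)."""
    return hashlib.sha256(data).hexdigest()[:2]


def strong_address(data: bytes) -> str:
    """Full SHA-256 digest, shown for comparison."""
    return hashlib.sha256(data).hexdigest()


def find_collision():
    """Return two different documents that share the same weak address."""
    seen = {}  # weak address -> first document that produced it
    counter = 0
    while True:
        doc = f"document-{counter}".encode()
        addr = weak_address(doc)
        if addr in seen:
            return seen[addr], doc
        seen[addr] = doc
        counter += 1


first, second = find_collision()
print(first, second, weak_address(first))
```

With only 256 possible addresses the search is guaranteed to stop within 257 documents. The two documents still have different full digests, which is why the strength and length of the hash matter.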

Content-addressed vs. Location-addressed

When contrasted with content-addressed storage, a typical local or networked storage device
 is referred to as location-addressed. In a location-addressed storage device, each element of data is stored onto the physical medium, and its location recorded for later use. The storage device often keeps a list, or directory, of these locations. When a future request is made for a particular item, the request includes only the location (for example, path and file names) of the data. The storage device can then use this information to locate the data on the physical medium, and retrieve it. When new information is written into a location-addressed device, it is simply stored in some available free space, without regard to its content. The information at a given location can usually be altered or completely overwritten without any special action on the part of the storage device.

Within the scope of this discussion, a good way to think of the above is as container-addressed storage.

The Content Addressable File Store
 (CAFS) was a hardware device developed and sold by International Computers Limited (ICL) in the 1970s and 1980s that provided location-addressed disk storage with built-in search capability. The search logic was incorporated into the disk controller. A query expressed in a high-level query language could be compiled into a search specification that was then sent to the disk controller for execution. Files could also be accessed via the conventional location-addressing mechanism, permitting CAFS to support an IDMS
 CODASYL database and also support content addressing of the same records.

In contrast, when information is stored into a CAS system, the system will record a content address, which is an identifier
 uniquely and permanently linked to the information content itself. A request to retrieve information from a CAS system must provide the content identifier, from which the system can determine the physical location of the data and retrieve it. Because the identifiers are based on content, any change to a data element will necessarily change its content address. In nearly all cases, a CAS device will not permit editing information once it has been stored. Whether it can be deleted is often controlled by a policy.
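
The contrast between the two addressing schemes can be sketched in a few lines. This is a minimal illustration under stated assumptions: the class names `LocationStore` and `ContentStore` are hypothetical, both stores are in-memory dictionaries, and SHA-256 is used as the hash purely as a choice for this example.

```python
import hashlib


class LocationStore:
    """Location-addressed: the caller chooses the name; content can be overwritten."""

    def __init__(self):
        self._by_path = {}

    def write(self, path: str, data: bytes) -> None:
        self._by_path[path] = data  # silently replaces any earlier content

    def read(self, path: str) -> bytes:
        return self._by_path[path]


class ContentStore:
    """Content-addressed: the store derives the identifier from the data itself."""

    def __init__(self):
        self._by_address = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._by_address.setdefault(address, data)  # existing entries are never edited
        return address

    def get(self, address: str) -> bytes:
        return self._by_address[address]


loc = LocationStore()
loc.write("/reports/q1.txt", b"revenue: 10")
loc.write("/reports/q1.txt", b"revenue: 99")  # same location, different content

cas = ContentStore()
a1 = cas.put(b"revenue: 10")
a2 = cas.put(b"revenue: 99")  # changed content yields a different address
```

After these calls the location-addressed store holds only the second version under the path, while the content-addressed store holds both versions, each under its own address.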

While the idea of content-addressed storage is not new, production-quality systems were not readily available until roughly 2003. In mid-2004, the industry group SNIA
 began working with a number of CAS providers to create standard behavior and interoperability guidelines for CAS systems.

Pros and cons

CAS works most efficiently on data that does not change often. It is of particular interest to large organizations that must comply with document-retention laws, such as Sarbanes-Oxley. In these corporations a large volume of documents will be stored for as much as a decade, with no changes and infrequent access. CAS is designed to make searching for a given document's content very quick, and it provides an assurance that the retrieved document is identical to the one originally stored. (If the documents were different, their content addresses would differ.) In addition, since data is stored into a CAS system by what it contains, there is never a situation where more than one copy of an identical document exists in storage. By definition, two identical documents have the same content address, and so point to the same storage location.
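
The single-copy property follows directly from addressing by content, as the sketch below suggests. It is an illustration only: `DedupStore` is a hypothetical in-memory class, and SHA-256 is an assumed hash choice.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory store that writes each distinct content once."""

    def __init__(self):
        self._blobs = {}
        self.writes = 0  # number of times data was physically stored

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        if address not in self._blobs:
            self._blobs[address] = data
            self.writes += 1
        return address

    def stored_copies(self) -> int:
        return len(self._blobs)


store = DedupStore()
policy = b"Retention policy, revision 3"
first = store.put(policy)
second = store.put(bytes(policy))  # an identical document submitted again
third = store.put(b"Retention policy, revision 4")
```

Here `first` and `second` are the same address and only two blobs are ever written, one per distinct revision.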

For data that changes frequently, CAS is not as efficient as location-based addressing. In these cases, the CAS device would need to continually recompute the address of data as it was changed, and the client systems would be forced to continually update information regarding where a given document exists. For random access systems, a CAS would also need to handle the possibility of two initially identical documents diverging, requiring a copy of one document to be created on demand.
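
The cost of frequent change can be made concrete with a small sketch. Everything here is assumed for illustration: `Workspace` is a hypothetical class that plays both roles, with `blobs` standing in for the CAS device and `catalog` for the client-side record of where the current version of each document lives.

```python
import hashlib


def address_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Workspace:
    def __init__(self):
        self.blobs = {}            # content address -> data (the CAS side)
        self.catalog = {}          # document name -> current address (the client side)
        self.address_updates = 0   # how often the client had to repoint a name

    def save(self, name: str, data: bytes) -> str:
        """Store a revision and repoint the catalog entry if the address changed."""
        address = address_of(data)
        self.blobs[address] = data
        if self.catalog.get(name) != address:
            self.catalog[name] = address
            self.address_updates += 1
        return address


ws = Workspace()
draft = b""
for n in range(1, 5):
    draft += f"line {n}\n".encode()
    ws.save("notes.txt", draft)  # every edit produces a new address
```

Four edits produce four stored revisions and four catalog updates, whereas a location-addressed device would simply overwrite one file in place.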

Typical implementation

Paul Carpentier and Jan van Riel coined the term CAS while working at a company called FilePool in the late 1990s. FilePool was acquired in 2001 and became the underpinnings of the first commercially available CAS system, which was introduced as EMC's
 Centera platform. Carpentier and van Riel are now working together again at Caringo, which has introduced advancements in CAS technology with the CAStor content storage software. The Centera CAS system consists of a series of networked nodes (1-U servers running Linux
), divided between storage nodes and access nodes. The access nodes maintain a synchronized directory of content addresses, and the corresponding storage node where each address can be found. When a new data element, or blob (Binary large object
), is added, the device calculates a hash of the content and returns this hash as the blob's content address. The directory is then searched for the hash to check whether identical content is already present. If the content already exists, the device does not need to perform any additional steps; the content address already points to the proper content. Otherwise, the data is passed off to a storage node and written to the physical media.
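
The write path described above might be sketched as follows. This is not Centera's actual implementation or API: `StorageNode` and `AccessNode` are hypothetical classes, the hash is assumed to be SHA-256, and the rule for choosing a storage node (fewest blobs first) is invented for the example.

```python
import hashlib


class StorageNode:
    """Hypothetical storage node: holds blobs keyed by content address."""

    def __init__(self, name: str):
        self.name = name
        self.blobs = {}

    def write(self, address: str, data: bytes) -> None:
        self.blobs[address] = data


class AccessNode:
    """Hypothetical access node: keeps the directory of address -> storage node."""

    def __init__(self, storage_nodes):
        self.storage_nodes = list(storage_nodes)
        self.directory = {}

    def store(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        if address in self.directory:
            return address  # identical content already present: nothing to write
        # Assumed placement rule: pick the node currently holding the fewest blobs.
        node = min(self.storage_nodes, key=lambda n: len(n.blobs))
        node.write(address, data)
        self.directory[address] = node
        return address


nodes = [StorageNode("node-a"), StorageNode("node-b")]
access = AccessNode(nodes)
addr1 = access.store(b"scanned invoice 0001")
addr2 = access.store(b"scanned invoice 0002")
addr3 = access.store(b"scanned invoice 0001")  # duplicate of the first blob
```

The third call returns the address of the first blob without touching a storage node, so the directory ends up with two entries and each node holds one blob.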

When a content address is provided to the device, it first queries the directory for the physical location of the specified content address. The information is then retrieved from a storage node, and the actual hash of the data recomputed and verified. Once this is complete, the device can supply the requested data to the client. Within the Centera system, each content address actually represents a number of distinct data blobs, as well as optional metadata
. Whenever a client adds an additional blob to an existing content block, the system recomputes the content address.
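
The lookup, fetch and verify steps of retrieval can be sketched in the same spirit. The names below (`retrieve`, `IntegrityError`, a storage node modelled as a plain dictionary) are assumptions made for this example, and SHA-256 again stands in for whatever hash a given system uses.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes no longer match their content address."""


def address_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def retrieve(directory: dict, address: str) -> bytes:
    """Look up the address, fetch the bytes, and verify them before returning."""
    node = directory[address]        # step 1: directory lookup (KeyError if unknown)
    data = node[address]             # step 2: read from the storage node
    if address_of(data) != address:  # step 3: recompute and compare the hash
        raise IntegrityError(address)
    return data


node_a = {}     # a storage node modelled as a plain dict
directory = {}  # content address -> storage node

original = b"contract text, version as filed"
addr = address_of(original)
node_a[addr] = original
directory[addr] = node_a

ok = retrieve(directory, addr)

# Simulate silent corruption on the medium.
node_a[addr] = b"contract text, version as altered"
try:
    retrieve(directory, addr)
    corruption_detected = False
except IntegrityError:
    corruption_detected = True
```

Because the address is derived from the content, the check needs no separately stored checksum: the address itself is the expected value.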

To provide additional data security, the Centera access nodes, when no read or write operation is in progress, constantly communicate with the storage nodes, checking the presence of at least two copies of each blob as well as their integrity. Additionally, they can be configured to exchange data with a different, e.g. off-site, Centera system, thereby strengthening the precautions against accidental data loss.
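
A background check of this kind could look roughly like the sketch below. It is a single-pass illustration with assumed names (`scrub`, storage nodes modelled as dictionaries) and an assumed SHA-256 hash; the real system is described as running such checks continuously while idle, which is not modelled here.

```python
import hashlib


def address_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scrub(nodes: dict, min_copies: int = 2) -> dict:
    """Return {address: intact copy count} for blobs with fewer than min_copies intact copies."""
    intact = {}
    for blobs in nodes.values():
        for address, data in blobs.items():
            intact.setdefault(address, 0)
            if address_of(data) == address:  # this copy still matches its address
                intact[address] += 1
    return {a: n for a, n in intact.items() if n < min_copies}


good = b"audit log A"
fragile = b"audit log B"
nodes = {
    "node-a": {address_of(good): good, address_of(fragile): fragile},
    "node-b": {address_of(good): good, address_of(fragile): b"bit rot"},
}
report = scrub(nodes)
```

The report lists only the second blob, with one intact copy, which is the point at which a system would re-replicate it from the surviving copy.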

IBM has another flavor of CAS, which can be software based (Tivoli Storage Manager 5.3) or hardware based (the IBM DR550). The architecture is different in that it is based on a hierarchical storage management
 (HSM) design which provides some additional flexibility such as being able to support not only WORM
 disk but also WORM tape, as well as the migration of data from WORM disk to WORM tape and vice versa. This provides additional flexibility in disaster recovery situations, as well as the ability to reduce storage costs by moving data off disk to tape.

Another typical implementation is from iTernity. iTernity's concept is based on containers, each of which is addressed by its hash value. A container holds a number of fixed-content documents, so a container cannot be changed and its hash value is fixed once it has been written.

Open-source implementations

One of the very first content-addressed storage servers, Venti
, was originally developed for Plan 9 from Bell Labs
 and is now also available for Unix-like systems as part of Plan 9 from User Space
.

A first step towards an open source CAS+ implementation is Twisted Storage. Active development continues on Twisted Storage with a new release being worked on.

While it is generally used as a source code control system, Linus Torvalds' Git program is a userspace CAS filesystem.
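
Git's content addressing is easy to reproduce for the simplest object type. The sketch below assumes a repository using Git's original SHA-1 object format and covers only blobs (file contents); `git_blob_id` is a name chosen for this example, not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob in a SHA-1 repository."""
    # Git hashes a short header ("blob <size>\0") followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
hello_id = git_blob_id(b"hello\n")
print(empty_id)
print(hello_id)
```

The value for the empty blob is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, and results for other inputs can be compared against `git hash-object --stdin`. Two files with the same bytes always map to the same object, regardless of their names or paths.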

Project Honeycomb
 is an open source API
 for CAS systems. The XAM
 interface being developed under the auspices of SNIA is an attempt to create a standard interface for archiving on CAS (and CAS-like) products and projects.

Bitcache is an open source distributed implementation of CAS written in Ruby. Bitcache server has an implementation for Drupal as well.

Camlistore is a recent project to bring the advantages of content-addressable storage "to the masses". It is intended to be used for a wide variety of use cases, including distributed backup; a snapshotted-by-default, version-controlled filesystem; and decentralised, permission-controlled filesharing.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 