Data deduplication
In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent across a link. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy, and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
This type of deduplication differs from that performed by standard file compression algorithms such as LZ77 and LZ78. Whereas these algorithms identify short repeated substrings inside individual files, storage-based data deduplication inspects large volumes of data and identifies large sections, such as entire files or large sections of files, that are identical, in order to store only one copy of each. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.
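The email example above can be reproduced with a minimal sketch. It assumes whole-file deduplication keyed by a SHA-256 digest and takes 1 MB as 10**6 bytes; the function name `deduplicate` and the variable names are illustrative, not taken from any product.

```python
import hashlib

def deduplicate(files):
    """Store each distinct payload once; return (store, references)."""
    store = {}        # digest -> the single stored copy
    references = []   # one digest per logical file, in input order
    for payload in files:
        digest = hashlib.sha256(payload).hexdigest()
        store.setdefault(digest, payload)   # only the first occurrence is kept
        references.append(digest)           # every instance becomes a reference
    return store, references

# 100 copies of the same 1 MB attachment (1 MB assumed to be 10**6 bytes).
attachment = bytes(10**6)
mailbox = [attachment] * 100

store, references = deduplicate(mailbox)
logical_bytes = sum(len(p) for p in mailbox)
stored_bytes = sum(len(p) for p in store.values())
ratio = logical_bytes / stored_bytes
print(f"logical={logical_bytes} stored={stored_bytes} ratio={ratio:.0f}:1")
```

The ratio here ignores the space taken by the references themselves, which is why the text describes it as roughly 100 to 1.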
Benefits
- Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk, a surprisingly common scenario. In the case of data backups, which are routinely performed to protect against data loss, most of the data in a given backup is unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed, or by storing differences between files. Neither approach captures all redundancies, however. Hard linking does not help with large files that have changed only in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added back, or a logo image included in many documents).
- Network data deduplication is used to reduce the absolute number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information.
- Virtual servers benefit from deduplication because it allows nominally separate system files for each virtual server to be coalesced into a single storage space. At the same time, if a given server customizes a file, deduplication will not change the files on the other servers—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved.
Deduplication overview
Deduplication may occur "in-line", as data is flowing, or "post-process", after it has been written.
Post-process deduplication
With post-process deduplication, new data is first stored on the storage device, and a process at a later time analyzes the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, which ensures that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be stored unnecessarily for a short time, which is an issue if the storage system is near full capacity.
In-line deduplication
With in-line deduplication, the hash calculations are performed on the target device as the data enters the device in real time. If the device spots a block that is already stored on the system, it does not store the new block, only a reference to the existing block. The benefit of in-line deduplication over post-process deduplication is that it requires less storage, as data is not duplicated. On the negative side, it is frequently argued that because hash calculations and lookups take so long, data ingestion can be slower, thereby reducing the backup throughput of the device. However, certain vendors with in-line deduplication have demonstrated equipment with similar performance to their post-process deduplication counterparts.
Post-process and in-line deduplication methods are often heavily debated.
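The trade-off between the two approaches can be illustrated with a small sketch. It assumes block-level deduplication keyed by SHA-256; the class names `InlineStore` and `PostProcessStore` and the `optimize` method are hypothetical and only model where the hashing cost and the temporary duplicate storage fall.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class InlineStore:
    """Hashes each block on the write path; duplicates never reach disk."""
    def __init__(self):
        self.blocks = {}   # fingerprint -> block
        self.layout = []   # fingerprints in write order

    def write(self, block: bytes) -> None:
        fp = fingerprint(block)            # cost paid before the write completes
        self.blocks.setdefault(fp, block)
        self.layout.append(fp)

    def blocks_on_disk(self) -> int:
        return len(self.blocks)

class PostProcessStore:
    """Writes raw blocks immediately; a later pass removes duplicates."""
    def __init__(self):
        self.raw = []      # blocks landed but not yet analyzed
        self.blocks = {}
        self.layout = []

    def write(self, block: bytes) -> None:
        self.raw.append(block)             # no hashing on the write path

    def optimize(self) -> None:
        for block in self.raw:
            fp = fingerprint(block)
            self.blocks.setdefault(fp, block)
            self.layout.append(fp)
        self.raw.clear()

    def blocks_on_disk(self) -> int:
        return len(self.raw) + len(self.blocks)

data = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
inline, post = InlineStore(), PostProcessStore()
for block in data:
    inline.write(block)
    post.write(block)

peak_inline = inline.blocks_on_disk()   # duplicates were never stored
peak_post = post.blocks_on_disk()       # duplicates held until optimize() runs
post.optimize()
after_post = post.blocks_on_disk()      # same end state as the in-line store
```

In this toy model both stores end with the same two unique blocks; the difference is that the post-process store briefly holds all four, which is the capacity concern raised above.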
Source versus target deduplication
Another way to think about data deduplication is by where it occurs. When the deduplication occurs close to where data is created, it is often referred to as "source deduplication"; when it occurs near where the data is stored, it is commonly called "target deduplication."
- Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system periodically scans new files, creating hashes and comparing them to hashes of existing files. When files with the same hashes are found, the file copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered to be separate entities, and if one of the duplicated files is later modified, a copy of that file or changed block is created using a system called copy-on-write. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur, resulting in the backups being bigger than the source data.
- Target deduplication is the process of removing duplicates of data in the secondary store. Generally this will be a backup store such as a data repository or a virtual tape library. There are three different ways of performing the deduplication process.
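The copy-on-write behavior described for source deduplication can be sketched as a toy in-memory file system. This is an illustration under simplifying assumptions: it deduplicates whole files at write time rather than by a periodic scan, and the class name `CowFileSystem` and its methods are hypothetical.

```python
import hashlib

class CowFileSystem:
    """Toy file system: names map to shared, reference-counted content."""
    def __init__(self):
        self.names = {}      # file name -> content digest
        self.contents = {}   # digest -> bytes
        self.refcount = {}   # digest -> number of names pointing at it

    def _release(self, digest: str) -> None:
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.contents[digest]

    def write(self, name: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        old = self.names.get(name)
        if old == digest:
            return                       # content unchanged
        if old is not None:
            self._release(old)           # other names keep the old content
        self.contents.setdefault(digest, data)
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        self.names[name] = digest

    def read(self, name: str) -> bytes:
        return self.contents[self.names[name]]

fs = CowFileSystem()
fs.write("a.txt", b"report v1")
fs.write("b.txt", b"report v1")      # deduplicated: shares a.txt's content
shared_copies = len(fs.contents)
fs.write("b.txt", b"report v2")      # modifying b.txt leaves a.txt untouched
```

Unlike a hard link, the second write to `b.txt` does not change what `a.txt` returns; the two names simply stop sharing content.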
Deduplication Methods
One of the most common forms of data deduplication works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that data with the same identification is identical. When the software has either assumed that a chunk with a given identification already exists in the deduplication namespace or actually verified that the two blocks of data are identical, depending on the implementation, it replaces the duplicate chunk with a link.
Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications.
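The identify, link, and read-back cycle described above can be sketched as follows. The sketch assumes fixed 4096-byte chunks and SHA-256 identifications; `ChunkStore`, `verify`, and the method names are hypothetical, and the `verify` flag models the choice between trusting the identification and comparing the bytes.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size for this sketch

class ChunkStore:
    def __init__(self, verify: bool = True):
        self.verify = verify
        self.chunks = {}   # identification (hex digest) -> chunk bytes

    def put(self, chunk: bytes) -> str:
        ident = hashlib.sha256(chunk).hexdigest()
        existing = self.chunks.get(ident)
        if existing is None:
            self.chunks[ident] = chunk
        elif self.verify and existing != chunk:
            # Same identification but different bytes: a hash collision.
            raise ValueError(f"collision on {ident}")
        return ident       # the link recorded in place of the chunk

    def write_file(self, data: bytes) -> list:
        return [self.put(data[i:i + CHUNK_SIZE])
                for i in range(0, len(data), CHUNK_SIZE)]

    def read_file(self, links: list) -> bytes:
        # Read-back: each link is replaced by the chunk it references.
        return b"".join(self.chunks[ident] for ident in links)

store = ChunkStore(verify=True)
original = b"x" * 8192 + b"y" * 4096 + b"x" * 4096
links = store.write_file(original)
restored = store.read_file(links)
```

The file is recorded as four links but only two chunks are stored, and reading the links back reproduces the original bytes.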
- Chunking. Between commercial deduplication implementations, technology varies primarily in chunking method and in architecture. In some systems, chunks are defined by physical layer constraints (e.g. 4KB block size in WAFL). In some systems only complete files are compared, which is called Single Instance Storage or SIS. The most intelligent (but CPU-intensive) method of chunking is generally considered to be sliding-block. In sliding-block chunking, a window is passed along the file stream to seek out more naturally occurring internal file boundaries.
- Client backup deduplication. This is the process where the deduplication hash calculations are initially created on the source (client) machines. Files that have identical hashes to files already in the target device are not sent; the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids data being unnecessarily sent across the network, thereby reducing traffic load.
- Primary storage and secondary storage. By definition, primary storage systems are designed for optimal performance, rather than lowest possible cost. The design criterion for these systems is to increase performance, at the expense of other considerations. Moreover, primary storage systems are much less tolerant of any operation that can negatively impact performance. Also by definition, secondary storage systems contain primarily duplicate, or secondary, copies of data. These copies of data are typically not used for actual production operations and as a result are more tolerant of some performance degradation, in exchange for increased efficiency.
To date, data deduplication has predominantly been used with secondary storage systems. The reasons for this are two-fold. First, data deduplication requires overhead to discover and remove the duplicate data. In primary storage systems, this overhead may impact performance. The second reason why deduplication is applied to secondary data is that secondary data tends to have more duplicate data. Backup applications in particular commonly generate significant portions of duplicate data over time.
Data deduplication has been deployed successfully with primary storage in some cases where the system design does not require significant overhead or impact performance.
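The sliding-block chunking mentioned in the list above can be illustrated with a content-defined chunker. This is a simplified sketch, not any vendor's algorithm: it assumes a polynomial rolling hash over a 16-byte window and cuts a chunk whenever the low 8 bits of the hash are zero, and all constants and function names are illustrative choices.

```python
import random

WINDOW = 16                      # bytes in the sliding window (assumed)
MASK = 0xFF                      # cut when the low 8 bits of the hash are zero
MIN_CHUNK = 32                   # avoid degenerate tiny chunks (assumed)
BASE, MOD = 257, (1 << 31) - 1   # rolling-hash parameters (assumed)

def content_defined_chunks(data: bytes) -> list:
    """Split data at boundaries chosen by content rather than by offset."""
    top = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window
    chunks, h, start = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data: bytes, size: int = 256) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(1 << 16))
edited = b"!" + original         # one byte inserted at the front

cdc_shared = set(content_defined_chunks(original)) & set(content_defined_chunks(edited))
fixed_shared = set(fixed_chunks(original)) & set(fixed_chunks(edited))
print(len(cdc_shared), len(fixed_shared))
```

Because the boundaries follow the content, inserting a byte shifts them along with the data, so the edited stream still shares chunks with the original. With fixed-size chunks every boundary moves relative to the content and nothing matches, which is the motivation for the extra CPU cost.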
Drawbacks and concerns
Whenever data is transformed, concerns arise about potential loss of data. By definition, data deduplication systems store data differently from how it was written. As a result, users are concerned with the integrity of their data. The various methods of deduplicating data all employ slightly different techniques. However, the integrity of the data will ultimately depend upon the design of the deduplicating system and the quality of the implementation of its algorithms. As the technology has matured over the past decade, the integrity of most of the major products has been well proven.
One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends upon the hash function used, and although the probabilities are small, they are always nonzero.
Thus, the concern arises that data corruption can occur if a hash collision occurs and no additional means of verification is used to check whether the data actually differs. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity.
The hash functions used include standards such as SHA-1, SHA-256 and others. In most cases these provide a far lower probability of data loss than the risk of an undetected or uncorrected hardware error, even per petabyte (1,000 terabytes) of data.
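To give a feel for why accidental collisions are considered unlikely, the sketch below evaluates the standard birthday-bound approximation, in which n chunks hashed to b bits collide with probability of about n(n-1)/2 divided by 2^b. It assumes an ideal, uniformly distributed hash and a 4096-byte chunk size, both of which are illustrative assumptions, and it says nothing about deliberately constructed collisions.

```python
import math

def log2_collision_probability(num_chunks: int, hash_bits: int) -> float:
    """log2 of the birthday-bound estimate n*(n-1)/2 / 2**hash_bits."""
    pairs = num_chunks * (num_chunks - 1) // 2
    return math.log2(pairs) - hash_bits

PETABYTE = 10**15     # 1,000 terabytes, decimal units
CHUNK_SIZE = 4096     # assumed chunk size; real systems vary
n = PETABYTE // CHUNK_SIZE

estimates = {name: log2_collision_probability(n, bits)
             for name, bits in (("SHA-1", 160), ("SHA-256", 256))}
for name, exponent in estimates.items():
    print(f"{name}: about 2^{exponent:.0f} for one petabyte of 4 KB chunks")
```

Under these assumptions the estimate is on the order of 2^-85 for a 160-bit hash and 2^-181 for a 256-bit hash.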
Some cite the computational resource intensity of the process as a drawback of data deduplication. However, this is rarely an issue for stand-alone devices or appliances, as the computation is completely offloaded from other systems. This can be an issue when the deduplication is embedded within devices providing other services.
To improve performance, many systems use both weak and strong hashes. Weak hashes are much faster to calculate, but there is a greater risk of a hash collision. Systems that use weak hashes subsequently calculate a strong hash and use it to determine whether the data is actually the same. Note that the system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing, and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance.
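One way the weak-then-strong scheme could be arranged is sketched below. It assumes Adler-32 (from `zlib`) as the weak hash and SHA-256 as the strong hash, and it computes strong hashes lazily, only when a weak hash matches; the class `TwoTierIndex` and its counter are hypothetical.

```python
import hashlib
import zlib

class TwoTierIndex:
    def __init__(self):
        self.buckets = {}        # weak hash -> list of [chunk, strong digest or None]
        self.strong_hashes = 0   # how many strong hashes were computed

    def _strong(self, chunk: bytes) -> bytes:
        self.strong_hashes += 1
        return hashlib.sha256(chunk).digest()

    def add(self, chunk: bytes) -> bool:
        """Store the chunk if new; return True when it is a duplicate."""
        bucket = self.buckets.setdefault(zlib.adler32(chunk), [])
        if not bucket:
            bucket.append([chunk, None])   # weak miss: no strong hash needed yet
            return False
        digest = self._strong(chunk)       # weak hit: the strong hash decides
        for entry in bucket:
            if entry[1] is None:
                entry[1] = self._strong(entry[0])
            if entry[1] == digest:
                return True
        bucket.append([chunk, digest])     # weak collision between different data
        return False

index = TwoTierIndex()
results = [index.add(c) for c in (b"alpha", b"beta", b"alpha", b"gamma")]
```

In this run only the repeated chunk triggers strong hashing (once for the new chunk and once for the stored candidate), whereas hashing every chunk with the strong function would take four computations.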
Another area of concern with deduplication is the related effect on snapshots, backup, and archival, especially where deduplication is applied against primary storage (for example, inside a NAS filer). Reading files out of a storage device causes full rehydration of the files, so any secondary copy of the data set is likely to be larger than the primary copy. In terms of snapshots, if a file is snapshotted prior to deduplication, the post-deduplication snapshot will preserve the entire original file. This means that although storage capacity for primary file copies will shrink, capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although deduplication is a version of compression, it works in tension with traditional compression. Deduplication achieves better efficiency against smaller data chunks, whereas compression achieves better efficiency against larger chunks. The goal of encryption is to eliminate any discernible patterns in the data. Thus encrypted data will have 0% gain from deduplication, even though the underlying data may be redundant.
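The effect of encryption on deduplication can be demonstrated with a toy cipher. The `toy_encrypt` function below XORs the data with a SHA-256 counter keystream; it is an illustration only, is not a secure cipher, and stands in for whatever encryption a real system would apply. The keys, nonces, and chunk size are arbitrary assumptions.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size

def toy_encrypt(plaintext: bytes, key: bytes, nonce: bytes) -> bytes:
    """XOR with a SHA-256 counter keystream. Illustration only, not secure."""
    out = bytearray()
    for offset in range(0, len(plaintext), 32):
        block = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        out.extend(p ^ k for p, k in zip(plaintext[offset:offset + 32], block))
    return bytes(out)

def unique_chunks(*streams: bytes) -> int:
    seen = set()
    for data in streams:
        for i in range(0, len(data), CHUNK_SIZE):
            seen.add(hashlib.sha256(data[i:i + CHUNK_SIZE]).digest())
    return len(seen)

plaintext = bytes(CHUNK_SIZE) * 4   # four identical chunks
copy_a = toy_encrypt(plaintext, b"key-a", b"nonce-1")
copy_b = toy_encrypt(plaintext, b"key-b", b"nonce-2")

plain_unique = unique_chunks(plaintext, plaintext)   # two redundant copies
cipher_unique = unique_chunks(copy_a, copy_b)        # the same copies, encrypted
```

The two plaintext copies reduce to a single stored chunk, while the encrypted copies share nothing: every one of their eight chunks is distinct even though the underlying data is fully redundant.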
Scaling has also been a challenge for dedupe systems because the hash table or dedupe namespace needs to be shared across storage devices. If there are multiple disk backup devices in an infrastructure with discrete dedupe namespaces, then space efficiency is adversely affected. A namespace shared across devices - called Global Dedupe - preserves space efficiency, but is technically challenging from a reliability and performance perspective.
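The space-efficiency cost of discrete namespaces can be shown with a small count. The sketch assumes each device sees a list of chunks and indexes them by SHA-256; the device names, chunk contents, and the function `stored_chunks` are made up for illustration.

```python
import hashlib

def stored_chunks(streams_by_device: dict, shared_namespace: bool) -> int:
    """Count chunks kept when devices share one index or each keep their own."""
    if shared_namespace:
        index = set()
        for stream in streams_by_device.values():
            index.update(hashlib.sha256(c).digest() for c in stream)
        return len(index)
    return sum(len({hashlib.sha256(c).digest() for c in stream})
               for stream in streams_by_device.values())

streams = {
    "device-1": [b"os-image", b"mail-db", b"os-image"],
    "device-2": [b"os-image", b"logs"],
}
discrete = stored_chunks(streams, shared_namespace=False)
shared = stored_chunks(streams, shared_namespace=True)
```

With separate namespaces the chunk common to both devices is stored twice, so four chunks are kept instead of three. The sketch ignores the reliability and performance costs of keeping a shared index consistent across devices, which the text identifies as the hard part.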
Deduplication ultimately reduces redundancy. If this is not expected and planned for, it may undermine the underlying reliability of the system. (Compare this, for example, to the LOCKSS storage architecture, which achieves reliability through multiple copies of data.)
See also
- Capacity optimization
- Single-instance storage
- Content-addressable storage
- Delta encoding
- Linked Data
- Record linkage
- Identity resolution