Correlation database
Encyclopedia
A Correlation database is a database management system
Database management system
A database management system is a software package with computer programs that control the creation, maintenance, and use of a database. It allows organizations to conveniently develop databases for various applications by database administrators and other specialists. A database is an integrated...

 (DBMS) that is data model independent and designed to efficiently handle unplanned, ad hoc
Ad hoc
Ad hoc is a Latin phrase meaning "for this". It generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes. Compare A priori....

 queries in an analytical system environment. It was developed in 2005 by database architect Joseph Foley, whose background includes more than 30 years in data warehousing and business intelligence
Business intelligence
Business intelligence mainly refers to computer-based techniques used in identifying, extracting, and analyzing business data, such as sales revenue by products and/or departments, or by associated costs and incomes....

 research and development work across a variety of industries.

Unlike relational database management systems
Relational database management system
A relational database management system is a database management system that is based on the relational model as introduced by E. F. Codd. Most popular databases currently in use are based on the relational database model....

, which use a records-based storage approach, or column-oriented databases
Column-oriented DBMS
A column-oriented DBMS is a database management system that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items....

 which use a column-based storage method, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values.

Structure of the Correlation DBMS

Because a correlation DBMS stores each unique data value only once, the physical database size is significantly smaller than relational or column-oriented databases, without the use of data compression
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....

 techniques. Above approximately 30GB, a correlation DBMS may become smaller than the raw data set.

The VBS model used by a CDBMS consists of three primary physical sets of objects that are stored and managed:
  • a data dictionary
    Data dictionary
    A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format." The term may have one of several closely related meanings pertaining to...

     (metadata)
    Metadata
    The term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...

    ;

  • an indexing and linking data set (additional metadata); and

  • the actual data values that comprise the stored information.


In the VBS model, each unique value in the raw data is stored only once; therefore, the data is always normalized
Database normalization
In the design of a relational database management system , the process of organizing data to minimize redundancy is called normalization. The goal of database normalization is to decompose relations with anomalies in order to produce smaller, well-structured relations...

 at the level of unique values. This eliminates the need to normalize data sets in the logical schema.

Data values are stored together in ordered sets based on data types: all integers in one set, characters in another, etc. This optimizes the data handling processes that access the values.

In addition to typical data values, the data value store contains a special type of data for storing relationships between tables. This functions similarly to foreign keys
Foreign Keys
Foreign Keys is the second album released in 1985 by musician Jandek, and his eleventh overall. It finds him returning to the band sound with a vengeance, and is the first all-band Jandek album, with no acoustic numbers whatsoever...

 in RDBMS structures, but with a CDBMS, the relationship is known by the dictionary and stored as a data value, making navigation between tables completely automatic.

The data dictionary contains typical metadata plus additional statistical data about the tables, columns and occurrences of values in the logical schema
Logical schema
A Logical Schema is a data model of a specific problem domain expressed in terms of a particular data management technology. Without being specific to a particular database management product, it is in terms of either relational tables and columns, object-oriented classes, or XML tags...

. It also maintains information about the relationships between the logical tables. The index and linking storage includes all of the data used to locate the contents of a record from the ordered values in the data store.

While not a RAM
Random-access memory
Random access memory is a form of computer data storage. Today, it takes the form of integrated circuits that allow stored data to be accessed in any order with a worst case performance of constant time. Strictly speaking, modern types of DRAM are therefore not random access, as data is read in...

-based storage system, a CBMDS is designed to use as much RAM as the operating system
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...

 can provide. For large databases, additional RAM improves performance. Generally, 4GB of RAM will provide optimized access times up to about 100 million records. 8GB of RAM is adequate for databases up to 10 times that size. Because the incremental RAM consumed decreases as the database grows, 16GB of RAM will generally support databases containing up to approximately 20 billion records.

Comparison of DBMS Storage Structures

The sample records shown below illustrate the physical differences in the storage structures used in relational, column-oriented and correlation databases.
Cust ID Name City State
12222 ABC Corp Minneapolis MN
19434 A1 Mfg Duluth MN
20523 J&J Inc St. Paul MN

Storage in RDBMS

The record-based structure used in an RDBMS stores data in with elements of the row most near each other. Variations like clustered indexing
Index (database)
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...

 may change the sequence of the rows, but all rows, columns and values will be stored as in the table. The above table might be stored as:

12222,ABC Corp,Minneapolis,MN;19434,A1 Mfg,Duluth,MN;20523,J&J Inc,St. Paul,MN

Storage in column-oriented databases

In the column-based structure, elements of the same column are stored adjacent to each other. Consecutive duplicates within a single column may be automatically removed or compressed efficiently.

12222,19434,20523;ABC Corp,A1 Mfg,J&J Inc;Minneapolis,Duluth,St.Paul;MN,MN,MN

Storage in CDBMS

In the VBS structure used in a CDBMS, each unique value is stored once and given an abstract (numeric) identifier, regardless of the number of occurrences or locations in the original data set. The original dataset is then constructed by referencing those logical identifiers. The correlation index may resemble the storage below. Note that the value "MN" which occurs multiple times in the data above is only included once. As the amount of repeat data grows, this benefit multiplies.

1:12222,2:19434,3:20523,4:ABC Corp,5:A1 Mfg,6:J&J Inc,7:Minneapolis,8:Duluth,9:St.Paul,10:MN

The records in our example table above can then be expressed as:

11:[1,4,7,10],12:[2,5,8,10],13:[3,6,9,10]

It's worth noting that this correlation process is a form of Database Normalization
Database normalization
In the design of a relational database management system , the process of organizing data to minimize redundancy is called normalization. The goal of database normalization is to decompose relations with anomalies in order to produce smaller, well-structured relations...

. Just as one can achieve some benefits of column-oriented storage within an RDBMS, so too can one achieve some benefits of the correlation database through database normalization. However, in a traditional RDBMS this normalization process requires work in the form of table configuration, stored procedures, and SQL statements. We say that a database is a correlation database when it naturally expresses a fully normalized schema without this extra configuration. As a result, a correlation database may have more focused optimizations for this fully normalized structure.

It's also worth noting that this correlation process is similar to what occurs in a text-search oriented Inverted index
Inverted index
In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents...

.

Advantages and Disadvantages of the CDBMS

For analytical data warehouse
Data warehouse
In computing, a data warehouse is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.A data warehouse...

 applications, a CDBMS has several advantages over alternative database structures. First, because the database engine itself indexes all data and auto-generates its own schema on the fly while loading, it can be implemented quickly and is easy to update. There is no need for physical pre-design and no need to ever restructure the database. Second, a CDBMS enables creation and execution of complex queries such as associative queries ("show everything that is related to x") that are difficult if not impossible to model in SQL
SQL
SQL is a programming language designed for managing data in relational database management systems ....

. The primary advantage of the CDBMS is that it is optimized for executing ad hoc queries - queries not anticipated during the data warehouse design phase.

A CDBMS has two drawbacks in comparison to database alternatives. Unlike relational databases, which can be used in a wide variety of applications, a correlation database is designed specifically for analytical applications and does not provide transaction management features; it cannot be used for transactional processing
Online transaction processing
Online transaction processing, or OLTP, refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing...

. Second, because it indexes all data during the load process, the physical load speed of a CDBMS is slower than relational or column-oriented structures. However, because it eliminates the need for logical or physical pre-design, the overall "time to use" of a CDBMS is generally similar to or somewhat faster than alternative structures.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK