Column-oriented DBMS
Encyclopedia
A column-oriented DBMS is a database management system
Database management system
A database management system is a software package with computer programs that control the creation, maintenance, and use of a database. It allows organizations to conveniently develop databases for various applications by database administrators and other specialists. A database is an integrated...

 (DBMS) that stores its content by column rather than by row. This has advantages for data warehouse
Data warehouse
In computing, a data warehouse is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.A data warehouse...

s and library catalogues where aggregates are computed over large numbers of similar data items.

It is possible to achieve some benefits of column-oriented and row-oriented organization with any database. By denoting one as column-oriented we are referring to both the ease of expression of a column-oriented structure and the focus on optimizations for column-oriented workloads. This approach is in contrast to row-oriented or row store databases and with correlation database
Correlation database
A Correlation database is a database management system that is data model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment...

s, which use a value-based storage structure.

Description

A database program must show its data as two-dimensional tables, of columns and rows, but store it as one-dimensional strings. For example, a database might have this table.
EmpId Lastname Firstname Salary
1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000


This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary).

This table exists in the computer's memory (RAM) and storage (hard drive). Although RAM and hard drives differ mechanically, the computer's operating system abstracts them. Still, the database must coax its two-dimensional table into a one-dimensional series of bytes, for the operating system to write to either the RAM, or hard drive, or both.

A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.

1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.

1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;

This is a simplification. Partitioning
Partition (database)
A partition is a division of a logical database or its constituting elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons....

, indexing
Index (database)
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...

, caching, views, OLAP cube
OLAP cube
An OLAP cube is a data structure that allows fast analysis of data. It can also be defined as the capability of manipulating and analyzing data from multiple perspectives...

s, and transactional systems such as write ahead logging
Write ahead logging
In computer science, write-ahead logging is a family of techniques for providing atomicity and durability in database systems....

 or multiversion concurrency control
Multiversion concurrency control
Multiversion concurrency control , in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory.For instance, a database will...

 all dramatically affect the physical organization. That said, online transaction processing
Online transaction processing
Online transaction processing, or OLTP, refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing...

 (OLTP)-focused RDBMS systems are more row-oriented, while online analytical processing (OLAP)-focused systems are a balance of row-oriented and column-oriented.

Benefits

Comparisons between row-oriented and column-oriented systems are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other delays in computers. Sometimes, reading a megabyte of sequentially stored data takes no more time than one random access. Further, because seek time is improving at a slow rate relative to CPU power (see Moore's Law
Moore's Law
Moore's law describes a long-term trend in the history of computing hardware: the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years....

), this focus will likely continue on systems reliant on hard-disks for storage. Following is a set of over-simplified observations which attempt to paint a picture of the trade-offs between column- and row-oriented organizations. Unless, of course, the application can be reasonably assured to fit most/all data into memory, in which case huge optimizations are available from in-memory database systems.
  1. Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
  2. Column-oriented systems are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
  3. Row-oriented systems are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
  4. Row-oriented systems are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.


In practice, row-oriented architectures are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column stores are well-suited for OLAP
OLAP
In computing, online analytical processing, or OLAP , is an approach to swiftly answer multi-dimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining...

-like workloads (e.g., data warehouse
Data warehouse
In computing, a data warehouse is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.A data warehouse...

s) which typically involve a smaller number of highly complex queries over all data (possibly terabyte
Terabyte
The terabyte is a multiple of the unit byte for digital information. The prefix tera means 1012 in the International System of Units , and therefore 1 terabyte is , or 1 trillion bytes, or 1000 gigabytes. 1 terabyte in binary prefixes is 0.9095 tebibytes, or 931.32 gibibytes...

s). However, there are a number of proven row-based OLAP RDBMS that handles terabytes, or even petabyte
Petabyte
A petabyte is a unit of information equal to one quadrillion bytes, or 1000 terabytes. The unit symbol for the petabyte is PB...

s of data, such as Teradata
Teradata
Teradata Corporation is a vendor specializing in data warehousing and analytic applications. Its products are commonly used by companies to manage data warehouses for analytics and business intelligence purposes. Teradata was formerly a division of NCR Corporation, with the spinoff from NCR on...

.

Compression

Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data. For example, many popular modern compression schemes, such as LZW or run-length encoding
Run-length encoding
Run-length encoding is a very simple form of data compression in which runs of data are stored as a single data value and count, rather than as the original run...

, make use of the similarity of adjacent data to compress. While the same techniques may be used on row-oriented data, a typical implementation will achieve less effective results.

To improve compression, several implementations (such as Vertica
Vertica
Vertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker, and Andrew Palmer; its President and CEO is Christopher P. Lynch. HP announced it would acquire the company in February 2011. On March 22, 2011, HP completed...

) sort the rows. For example, using bitmap indexes, sorting can improve compression by an order of magnitude. To maximize the compression benefits of the lexicographical order with respect to run-length encoding
Run-length encoding
Run-length encoding is a very simple form of data compression in which runs of data are stored as a single data value and count, rather than as the original run...

, it is best to use low-cardinality columns as the first sort keys. For example, given a table with columns sex, age, name, it would be best to sort first on the value sex (cardinality of two), then age (cardinality of <150), then name.

Columnar compression achieves a reduction in disk space at the expense of efficiency of retrieval. Retrieving all data from a single row is more efficient when that data is located in a single location, such as in a row-oriented architecture. Further, the greater adjacent compression achieved, the more difficult random-access may become, as data might need to be uncompressed to be read. Therefore, column-oriented architectures are sometimes enriched by additional mechanisms aimed at minimizing the need for access to compressed data.

Implementations

Column stores or transposed files have been implemented from the early days of DBMS development, beginning in the 1970s. For example, Statistics Canada implemented the RAPID system in 1976 and used it for processing and retrieval of the Canadian Census of Population and Housing as well as several other statistical applications. RAPID was shared with other statistical organizations throughout the world and used widely in the 1980s. It continued to be used by Statistics Canada until the 1990s.

For many years, only the Sybase IQ
Sybase IQ
Sybase IQ is a relational database software system used for business intelligence and data warehousing, produced by Sybase.-Features:As a column-oriented DBMS, Sybase IQ stores data tables as sections of columns of data rather than as rows of data...

 product was commercially available in the column-oriented DBMS class. However, that has changed rapidly in the last few years with many open source and commercial implementations.

Current examples of column-oriented DBMSs include:

Commercial:
  • 1010data
    1010data
    1010data is a privately-held company that provides a cloud-based software platform and associated services for complex business analytics on and database publishing of very large data sets...

    's Tenbase database
  • Alterian's Engine
  • Aster Data Systems
    Aster Data Systems
    Aster Data Systems is a data management and analysis software company headquartered in San Carlos, California. It was founded in 2005 and acquired in 2011.-Products:...

  • EXASOL
    EXASOL
    EXASOL is an analytic database management software company. Its product is called EXASolution, a RDBMS. EXASOL regularly publishes results of the TPC-H benchmark....

  • FAME
    FAME (database)
    FAME is a time series database from SunGard which comes along with a suite of APIs and domain-specific programming language.- History :...

  • FluidDB
  • Hive Intelligence Ltd Hex Engine
  • Infobright
    Infobright
    Infobright is a commercial provider of column-oriented relational database software with a focus in machine-generated data. The company's head office is located in Toronto, Canada. Most of its research and development is based in Warsaw, Poland.-History:...

     Enterprise Edition
  • KDB
  • Kickfire
    Kickfire
    Kickfire, Inc. was an analytic database appliance manufacturer. It was acquired by Teradata in August 2010. The Kickfire appliance utilizes FPGA hardware acceleration techniques for SQL databases.Features:...

  • Oracle Retail Predictive Application Server (RPAS)
  • Paraccel
    Paraccel
    ParAccel, Inc. is a vendor in the data warehouse appliance market category. Vendors in this category provide a purpose-built database management system used for data warehousing, business intelligence and analytic processing and, according to Hackathorn and White, there are varying degrees of...

     Analytic Database
  • SAND CDBMS
    SAND CDBMS
    SAND CDBMS is a column-oriented DBMS software system optimized for business intelligence applications, delivering the data warehousing component, developed by SAND Technology Inc.-Company History:SAND Technology was founded in 1983....

  • SAP HANA
  • SenSage
    SenSage
    SenSage Inc. is a privately held data warehouse software provider headquartered in San Francisco, California. SenSage serves enterprises who use the software to capture and store event data so that it can be consolidated, searched and analyzed to generate reports that detect fraud, analyze...

  • Sybase IQ
    Sybase IQ
    Sybase IQ is a relational database software system used for business intelligence and data warehousing, produced by Sybase.-Features:As a column-oriented DBMS, Sybase IQ stores data tables as sections of columns of data rather than as rows of data...

  • Vectorwise
  • Vertica
    Vertica
    Vertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker, and Andrew Palmer; its President and CEO is Christopher P. Lynch. HP announced it would acquire the company in February 2011. On March 22, 2011, HP completed...



Free and open source software
Free and open source software
Free and open-source software or free/libre/open-source software is software that is liberally licensed to grant users the right to use, study, change, and improve its design through the availability of its source code...

:
  • C-Store
    C-Store
    C-Store was a database management system based on a column-oriented DBMS developed by a team at Brown University, Brandeis University and the Massachusetts Institute of Technology, including Michael Stonebraker, Stanley Zdonik, and Samuel Madden ....

  • InfiniDB Community Edition
  • Infobright
    Infobright
    Infobright is a commercial provider of column-oriented relational database software with a focus in machine-generated data. The company's head office is located in Toronto, Canada. Most of its research and development is based in Warsaw, Poland.-History:...

     Community Edition
  • LucidDB
    LucidDB
    LucidDB is an Open Source database purpose-built to power Data Warehouses, OLAP servers and Business Intelligence systems.LucidDB is the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence...

  • Metakit
    Metakit
    Metakit is an embedded database library with a small footprint. It fills the gap between flat-file, relational, object-oriented, and tree-structured databases, supporting relational joins, serialization, nested structures, and instant schema evolution...

  • MonetDB
    MonetDB
    MonetDB is an open source column-oriented database management system developed at the Centrum Wiskunde & Informatica in the Netherlands.It was designed to provide high performance on complex queries against large databases,e.g...

  • S programming language
    S programming language
    S is a statistical programming language developed primarily by John Chambers and Rick Becker and Allan Wilks of Bell Laboratories...

     and GNU R incorporate column-oriented data structures for statistical analyses
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK