RCFile
Encyclopedia
Big data
refers to fast growing and huge data sets that cannot be easily handled by traditional databases
, including parallel databases
. Big data sets are stored, managed and analyzed in large and scalable distributed systems, where data processing model is based on the MapReduce
framework. One important issue to address is how to store (or place) increasingly large data sets in the distributed systems to prepare for their accesses and analytics in a fast and effective way. Since the data movement can be very expensive, the initial placement of data sets is crucial to the entire process of big data analytic.
RCfile (Record Columnar File) is a data placement structure that is designed and implemented for big data analytics under the MapReduce
programming environment. RCFile aims to meet the following four basic requirements of big data analytics in large scale distributed systems: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) a strong adaptivity to dynamic data access patterns.
RCFile is implemented as a part of Hive
, a data warehouse infrastructure, on top of Hadoop
. It is also adopted in HCatalog project (formerly known as Howl), a table and storage management service for Hadoop. RCFile has been widely used in big data analytics community where Hadoop is supported as the basic system environment. The largest user of RCFile is the Facebook
, which has one of the busiest data processing system in the world.
RCFile is also a result of basic research from Facebook data Infrastructure team, ICT/CAS
and The Ohio State University. A research paper entitled “RCFile: a Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse systems” was published and presented in ICDE’ 11.
To serialize the table, RCFile partitions this table first horizontally and then vertically, instead of only partitioning the table horizontally like the row-oriented DBMS (row-store) or only partitioning the table vertically like the column-oriented DBMS
(column-store). The horizontal partitioning will first partition the table into multiple row groups based on the row-group size, which is a user-specified value determining the size of each row group. For example, the table mentioned above can be partitioned to two row groups.
Then, in every row group, RCFile partitions the data vertically like column-store. Thus, the table will be serialized as:
Row Group 1 Row Group 2
11, 21, 31; 41, 51;
12, 22, 32; 42, 52;
13, 23, 33; 43, 53;
14, 24, 34; 44, 54;
Big data
Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing...
refers to fast growing and huge data sets that cannot be easily handled by traditional databases
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...
, including parallel databases
Parallel database
A parallel database system seeks to improve performance through parallelization of various operations, such as loading data, building indexes and evaluating queries. Although data may be stored in a distributed fashion, the distribution is governed solely by performance considerations...
. Big data sets are stored, managed and analyzed in large and scalable distributed systems, where data processing model is based on the MapReduce
MapReduce
MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Parts of the framework are patented in some countries....
framework. One important issue to address is how to store (or place) increasingly large data sets in the distributed systems to prepare for their accesses and analytics in a fast and effective way. Since the data movement can be very expensive, the initial placement of data sets is crucial to the entire process of big data analytic.
RCfile (Record Columnar File) is a data placement structure that is designed and implemented for big data analytics under the MapReduce
MapReduce
MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Parts of the framework are patented in some countries....
programming environment. RCFile aims to meet the following four basic requirements of big data analytics in large scale distributed systems: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) a strong adaptivity to dynamic data access patterns.
RCFile is implemented as a part of Hive
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix...
, a data warehouse infrastructure, on top of Hadoop
Hadoop
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data...
. It is also adopted in HCatalog project (formerly known as Howl), a table and storage management service for Hadoop. RCFile has been widely used in big data analytics community where Hadoop is supported as the basic system environment. The largest user of RCFile is the Facebook
Facebook
Facebook is a social networking service and website launched in February 2004, operated and privately owned by Facebook, Inc. , Facebook has more than 800 million active users. Users must register before using the site, after which they may create a personal profile, add other users as...
, which has one of the busiest data processing system in the world.
RCFile is also a result of basic research from Facebook data Infrastructure team, ICT/CAS
Chinese Academy of Sciences
The Chinese Academy of Sciences , formerly known as Academia Sinica, is the national academy for the natural sciences of the People's Republic of China. It is an institution of the State Council of China. It is headquartered in Beijing, with institutes all over the People's Republic of China...
and The Ohio State University. A research paper entitled “RCFile: a Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse systems” was published and presented in ICDE’ 11.
Description of basic structure
In a relational database, data is organized as two-dimensional tables. For example, a table in a database consists of 4 columns (c1 to c4):c1 | c2 | c3 | c4 |
---|---|---|---|
11 | 12 | 13 | 14 |
21 | 22 | 23 | 24 |
31 | 32 | 33 | 34 |
41 | 42 | 43 | 44 |
51 | 52 | 53 | 54 |
To serialize the table, RCFile partitions this table first horizontally and then vertically, instead of only partitioning the table horizontally like the row-oriented DBMS (row-store) or only partitioning the table vertically like the column-oriented DBMS
Column-oriented DBMS
A column-oriented DBMS is a database management system that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items....
(column-store). The horizontal partitioning will first partition the table into multiple row groups based on the row-group size, which is a user-specified value determining the size of each row group. For example, the table mentioned above can be partitioned to two row groups.
c1 | c2 | c3 | c4 |
---|---|---|---|
11 | 12 | 13 | 14 |
21 | 22 | 23 | 24 |
31 | 32 | 33 | 34 |
c1 | c2 | c3 | c4 |
---|---|---|---|
41 | 42 | 43 | 44 |
51 | 52 | 53 | 54 |
Then, in every row group, RCFile partitions the data vertically like column-store. Thus, the table will be serialized as:
Row Group 1 Row Group 2
11, 21, 31; 41, 51;
12, 22, 32; 42, 52;
13, 23, 33; 43, 53;
14, 24, 34; 44, 54;