Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Hive is also included in Amazon Elastic MapReduce on Amazon Web Services.

Features

Apache Hive supports analysis of large datasets stored in Hadoop-compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. To accelerate queries, it provides indexes, including bitmap indexes.
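
To illustrate why a bitmap index can speed up filtering, the following is a minimal Python sketch of the general technique. It is not Hive's implementation; the column names and sample values are hypothetical. Each distinct value gets one bitmap (here a Python integer), and a filter combining two columns becomes a bitwise AND.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Return the row ids whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical columns with few distinct values, the case bitmap indexes suit.
country = ["US", "CA", "US", "MX", "CA", "US"]
device = ["web", "web", "mobile", "web", "mobile", "mobile"]

country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# WHERE country = 'US' AND device = 'mobile' becomes a single bitwise AND.
hits = matching_rows(country_idx["US"] & device_idx["mobile"])
print(hits)  # [2, 5]
```

An OR across values of the same column works the same way with the `|` operator.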

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.

Currently, three file formats are supported in Hive: TEXTFILE, SEQUENCEFILE and RCFILE.

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select, but only offers basic support for indexes. Also, HiveQL lacks support for transactions and materialized views, and has only limited subquery support.
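
The two extensions named above can be sketched roughly as follows. The HiveQL appears only in comments, and the Python models its effect in memory; the table and column names are hypothetical. The sketch assumes that a multitable insert feeds several targets from one scan of the source, which is how the feature is usually described rather than something stated in this article.

```python
from collections import Counter

# Hypothetical source table page_views with columns (user_id, country).
page_views = [
    ("u1", "US"),
    ("u2", "CA"),
    ("u1", "US"),
    ("u3", "US"),
]


def multi_table_insert(rows):
    """Model a multitable insert: one pass over the source feeds two targets.

    Roughly corresponds to:
      FROM page_views
      INSERT OVERWRITE TABLE views_by_country
        SELECT country, COUNT(*) GROUP BY country
      INSERT OVERWRITE TABLE views_by_user
        SELECT user_id, COUNT(*) GROUP BY user_id;
    """
    views_by_country = Counter()
    views_by_user = Counter()
    for user_id, country in rows:  # the single scan of the source
        views_by_country[country] += 1
        views_by_user[user_id] += 1
    return dict(views_by_country), dict(views_by_user)


def create_table_as_select(rows, predicate):
    """Model CREATE TABLE us_views AS SELECT * FROM page_views WHERE ...;"""
    return [row for row in rows if predicate(row)]


by_country, by_user = multi_table_insert(page_views)
us_views = create_table_as_select(page_views, lambda row: row[1] == "US")
print(by_country, by_user, len(us_views))
```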

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.
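
As a rough illustration of that compilation model, the Python sketch below runs a two-job plan in memory. It is a toy under stated assumptions, not Hive's compiler or Hadoop's API: `run_mapreduce`, the job names, and the `page_views` rows are all hypothetical, and the plan is the simplest possible graph, a chain in which the ordering job depends on the grouping job.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_mapreduce(records, mapper, reducer):
    """Run one toy MapReduce job in memory: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # reducers see keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output


# Query being modelled (hypothetical table and columns):
#   SELECT country, COUNT(*) AS n FROM page_views GROUP BY country ORDER BY n DESC

def count_map(row):
    yield row["country"], 1


def count_reduce(country, ones):
    yield {"country": country, "n": sum(ones)}


def order_map(row):
    # Negate the count so ascending key order yields descending counts.
    yield (-row["n"], row["country"]), row


def order_reduce(key, rows):
    yield from rows


jobs = {
    "group_by": (count_map, count_reduce),
    "order_by": (order_map, order_reduce),
}
# Each job maps to the set of jobs whose output it needs.
dependencies = {"group_by": set(), "order_by": {"group_by"}}


def execute(plan, deps, source):
    """Run the jobs in dependency order, feeding each job's output to the next."""
    order = list(TopologicalSorter(deps).static_order())
    data = source
    for name in order:
        mapper, reducer = plan[name]
        data = run_mapreduce(data, mapper, reducer)
    return order, data


page_views = [{"country": c} for c in ["US", "MX", "US", "CA", "MX", "US"]]
order, result = execute(jobs, dependencies, page_views)
print(order)
print(result)
```

A real plan can branch and join; this sketch only passes data along a single chain, so it would need per-job inputs to model a wider graph.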

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 