Pig (programming language)
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a higher-level notation, similar to what SQL provides for relational database management systems (RDBMS). Pig Latin can be extended with user-defined functions (UDFs), which the user can write in Java and then call directly from the language.

Pig was originally developed at Yahoo Research around 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.

Below is an example of a "Word Count" program in Pig Latin:


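-- Load the input; each record is one line of text.
A = load '/tmp/my-copy-of-all-pages-on-internet';
-- Split each line into tokens and emit one record per token.
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- Keep only tokens made up entirely of word characters.
C = filter B by word matches '\\w+';
-- Group identical words together, then count each group.
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
-- Sort by frequency, most common first, and write the result.
F = order E by count desc;
store F into '/tmp/number-of-words-on-internet';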

The above program generates tasks that can run in parallel, distributed across thousands of machines in a Hadoop cluster, to count how often each word occurs in a data set such as "all the webpages on the internet".
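
To make the data flow of the script easier to follow, the same steps can be sketched as a small single-process Python function. This is only an illustrative analogue, not how Pig executes the script: the function name `word_count` is hypothetical, the input is assumed to be an in-memory list of lines rather than a file in Hadoop, and `TOKENIZE` is approximated by splitting on whitespace.

```python
import re
from collections import Counter


def word_count(lines):
    """Single-process analogue of relations A to F in the Pig Latin script."""
    # A: each input line is one record (assumed to be already loaded).
    # B: split every line into tokens and flatten to one record per token.
    #    Assumption: whitespace splitting stands in for Pig's TOKENIZE.
    tokens = [tok for line in lines for tok in line.split()]
    # C: keep tokens that consist entirely of word characters,
    #    mirroring `word matches '\\w+'` (a whole-string match).
    words = [tok for tok in tokens if re.fullmatch(r"\w+", tok)]
    # D and E: group identical words and count the size of each group.
    counts = Counter(words)
    # F: order by count, highest first, as (count, word) pairs.
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for count, word in word_count(["the cat saw the dog", "the dog ran"]):
        print(count, word)
```

On a cluster, Pig would compile the grouping and ordering steps into MapReduce jobs that run across many machines; the sketch above only shows what each relation contains.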
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.