Hadoop
Encyclopedia
Apache Hadoop is a software framework
that supports data-intensive distributed applications
under a free license
. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google
's MapReduce
and Google File System
(GFS) papers.
Hadoop is a top-level Apache
project being built and used by a global community of contributors, written in the Java
programming language. Yahoo!
has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
Hadoop was created by Doug Cutting
, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch
search engine project.
files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community.
For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, so reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.
A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes, and compute-only worker nodes; these are normally only used in non-standard applications.
Hadoop requires JRE 1.6 or higher. The standard startup and shutdown scripts require ssh
to be set up between nodes in the cluster.
In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the filesystem index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent.
to communicate between each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB
), across multiple machines. It achieves reliability by replicating
the data across multiple hosts, and hence does not require RAID
storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX
compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files.
The HDFS does not provide high availability
, because an HDFS filesystem instance requires one unique server, the name node. This is a single point of failure
for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster. The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the Primary Namenode's directory information, which is then saved to local/remote directories. These checkpointed images can be used to restart a failed Primary Namenode without having to replay the entire journal of filesystem actions, the edit log to create an up-to-date directory structure.
An advantage of using the HDFS is data awareness between the jobtracker and tasktracker. The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. An example of this would be if node A contained data (x,y,z) and node B contained data (a,b,c). The jobtracker will schedule node B to perform map/reduce tasks on (a,b,c) and node A would be scheduled to perform map/reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this advantage is not always available. This can have a significant impact on the performance of job completion times, which has been demonstrated when running data intensive jobs.
Another limitation of HDFS is that it cannot be directly mounted
by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace
(FUSE) virtual file system
has been developed to address this problem, at least for Linux
and some other Unix
systems.
File access can be achieved through the native Java API, the Thrift
API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface
, or browsed through the HDFS-UI webapp
over HTTP.
Hadoop can work directly with any distributed file system
that can be mounted by the underlying operating system simply by using a file:// URL; however, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data; this is information which Hadoop-specific filesystem bridges can provide.
Out-of-the-box, this includes Amazon S3, and the CloudStore
filestore, through s3:// and kfs:// URLs directly.
A number of third party filesystem bridges have also been written, none of which are currently in Hadoop distributions. These may offer superior availability or scalability, and possibly a more general-purpose filesystem than HDFS, which is biased towards large files and only offers a subset of the expected semantics of a Posix Filesystem: no locking, or writing to anywhere other than the tail of a file.
engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine
process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty
and can be viewed from a web browser.
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off. In earlier versions of Hadoop, all active work was lost when a JobTracker restarted.
Known limitations of this approach are:
, and optional 5 scheduling priorities to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, and added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).
The fair scheduler was developed by Facebook
. The goal of the fair scheduler is to provide fast response times for small jobs and QoS
for production jobs. The fair scheduler has three basic concepts.
By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.
There is no preemption
once a job is running.
database, the Apache Mahout
machine learning
system, and the Apache Hive
Data Warehouse
system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop included:
cluster
and produces data that is now used in every Yahoo! Web search query.
There are multiple Hadoop clusters at Yahoo!, and no HDFS filesystems or MapReduce jobs are split across multiple datacenters. Every hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.
On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production. Yahoo! contributes back all work it does on Hadoop to the open-source community, the company's developers also fix bugs and provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.
Facebook
In 2010 Facebook
claimed that they have the largest Hadoop cluster in the world with 21 PB
of storage. On July 27, 2011 they announced the data has grown to 30 PB.
(EC2) and Amazon Simple Storage Service (S3). As an example The New York Times
used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).
There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient, as the S3 filesystem is remote and delays returning from every write operation until the data are guaranteed not to be lost. This removes the locality advantages of Hadoop, which schedules work near data to save on network load.
in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce. Apache Hive
, which is built on top of Hadoop for providing data warehouse services, is also offered in Elastic MapReduce.
Support for using Spot Instances was later added in August 2011. Elastic MapReduce is fault tolerant for slave failures, and it is recommended to only run the Task Instance Group on spot instances to take advantage of the lower cost while maintaining availability.
and Google
announced an initiative in 2007 to use Hadoop to support university courses in distributed computer programming.
In 2008 this collaboration, the Academic Cloud Computing Initiative (ACCI), partnered with the National Science Foundation
to provide grant funding to academic researchers interested in exploring large-data applications. This resulted in the creation of the Cluster Exploratory (CLuE) program.
environments. Instead of setting up a dedicated Hadoop cluster, an existing compute farm can be used if the resource manager of the cluster is aware of the Hadoop jobs, and thus Hadoop jobs can be scheduled like other jobs in the cluster.
was released in 2008, and running Hadoop on Sun Grid
(Sun's on-demand utility computing
service) was possible. In the initial implementation of the integration, the CPU-time scheduler has no knowledge of the locality of the data. Unfortunately, this means that the processing is not always done on the same rack as the data; this was a key feature of the Hadoop Runtime. An improved integration with data-locality was announced during the Sun HPC Software Workshop '09.
In 2008-2009 Sun
released the Hadoop Live CD OpenSolaris
project, which allows running a fully functional Hadoop cluster using a live CD
. This distribution includes Hadoop 0.19 -as of April 2010 there has not been an updated release.
Software framework
In computer programming, a software framework is an abstraction in which software providing generic functionality can be selectively changed by user code, thus providing application specific software...
that supports data-intensive distributed applications
Distributed computing
Distributed computing is a field of computer science that studies distributed systems. A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal...
under a free license
Free software
Free software, software libre or libre software is software that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with restrictions that only ensure that further recipients can also do...
. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...
's MapReduce
MapReduce
MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Parts of the framework are patented in some countries....
and Google File System
Google File System
Google File System is a proprietary distributed file system developed by Google Inc. for its own use. It is designed to provide efficient, reliable access to data using large clusters of commodity hardware...
(GFS) papers.
Hadoop is a top-level Apache
Apache Software Foundation
The Apache Software Foundation is a non-profit corporation to support Apache software projects, including the Apache HTTP Server. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.The Apache Software Foundation is a decentralized community of developers...
project being built and used by a global community of contributors, written in the Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...
programming language. Yahoo!
Yahoo!
Yahoo! Inc. is an American multinational internet corporation headquartered in Sunnyvale, California, United States. The company is perhaps best known for its web portal, search engine , Yahoo! Directory, Yahoo! Mail, Yahoo! News, Yahoo! Groups, Yahoo! Answers, advertising, online mapping ,...
has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
Hadoop was created by Doug Cutting
Doug Cutting
Douglass Read Cutting is an advocate and creator of open-source search technology. He originated Lucene and, with Mike Cafarella, Nutch, both open-source search technology projects which are now managed through the Apache Software Foundation. He holds a bachelor's degree from Stanford University....
, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch
Nutch
Nutch is an effort to build an open source web search engine based on Lucene Java for the search and index component.- Features :Nutch is coded entirely in the Java programming language, but data is written in language-independent formats...
search engine project.
Architecture
Hadoop consists of the Hadoop Common, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JARJAR (file format)
In software, JAR is an archive file format typically used to aggregate many Java class files and associated metadata and resources into one file to distribute application software or libraries on the Java platform.JAR files are built on the ZIP file format and have the .jar file extension...
files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community.
For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, so reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.
A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes, and compute-only worker nodes; these are normally only used in non-standard applications.
Hadoop requires JRE 1.6 or higher. The standard startup and shutdown scripts require ssh
Secure Shell
Secure Shell is a network protocol for secure data communication, remote shell services or command execution and other secure network services between two networked computers that it connects via a secure channel over an insecure network: a server and a client...
to be set up between nodes in the cluster.
In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the filesystem index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent.
Hadoop Distributed File System
The HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes form the HDFS cluster. The situation is typical because each node does not require a datanode to be present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPCRemote procedure call
In computer science, a remote procedure call is an inter-process communication that allows a computer program to cause a subroutine or procedure to execute in another address space without the programmer explicitly coding the details for this remote interaction...
to communicate between each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB
Megabyte
The megabyte is a multiple of the unit byte for digital information storage or transmission with two different values depending on context: bytes generally for computer memory; and one million bytes generally for computer storage. The IEEE Standards Board has decided that "Mega will mean 1 000...
), across multiple machines. It achieves reliability by replicating
Replication (computer science)
Replication is the process of sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility. It could be data replication if the same data is stored on multiple storage devices, or...
the data across multiple hosts, and hence does not require RAID
RAID
RAID is a storage technology that combines multiple disk drive components into a logical unit...
storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX
POSIX
POSIX , an acronym for "Portable Operating System Interface", is a family of standards specified by the IEEE for maintaining compatibility between operating systems...
compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files.
The HDFS does not provide high availability
High availability
High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period....
, because an HDFS filesystem instance requires one unique server, the name node. This is a single point of failure
Single point of failure
A single point of failure is a part of a system that, if it fails, will stop the entire system from working. They are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.-Overview:Systems can be made...
for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster. The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the Primary Namenode's directory information, which is then saved to local/remote directories. These checkpointed images can be used to restart a failed Primary Namenode without having to replay the entire journal of filesystem actions, the edit log to create an up-to-date directory structure.
An advantage of using the HDFS is data awareness between the jobtracker and tasktracker. The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. An example of this would be if node A contained data (x,y,z) and node B contained data (a,b,c). The jobtracker will schedule node B to perform map/reduce tasks on (a,b,c) and node A would be scheduled to perform map/reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this advantage is not always available. This can have a significant impact on the performance of job completion times, which has been demonstrated when running data intensive jobs.
Another limitation of HDFS is that it cannot be directly mounted
Mount (computing)
Mounting takes place before a computer can use any kind of storage device . The user or their operating system must make it accessible through the computer's file system. A user can access only files on mounted media.- Mount point :A mount point is a physical location in the partition used as a...
by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace
Filesystem in Userspace
Filesystem in Userspace is a loadable kernel module for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code...
(FUSE) virtual file system
Virtual file system
A virtual file system or virtual filesystem switch is an abstraction layer on top of a more concrete file system. The purpose of a VFS is to allow client applications to access different types of concrete file systems in a uniform way...
has been developed to address this problem, at least for Linux
Linux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
and some other Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
systems.
File access can be achieved through the native Java API, the Thrift
Thrift (protocol)
Thrift is an interface definition language that is used to define and create services for numerous languages. It is used as a remote procedure call framework and was developed at Facebook for "scalable cross-language services development"...
API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface
Command-line interface
A command-line interface is a mechanism for interacting with a computer operating system or software by typing commands to perform specific tasks...
, or browsed through the HDFS-UI webapp
Web application
A web application is an application that is accessed over a network such as the Internet or an intranet. The term may also mean a computer software application that is coded in a browser-supported language and reliant on a common web browser to render the application executable.Web applications are...
over HTTP.
Other Filesystems
By May 2011, the list of supported filesystems included:- HDFS: Hadoop's own rack-aware filesystem. This is designed to scale to tens of petabytes of storage and runs on top of the filesystems of the underlying operating systemOperating systemAn operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...
s. - Amazon S3 filesystem. This is targeted at clusters hosted on the Amazon Elastic Compute CloudAmazon Elastic Compute CloudAmazon Elastic Compute Cloud is a central part of Amazon.com's cloud computing platform, Amazon Web Services . EC2 allows users to rent virtual computers on which to run their own computer applications...
server-on-demand infrastructure. There is no rack-awareness in this filesystem, as it is all remote. - CloudStoreCloudStoreCloudStore is Kosmix's C++ implementation of Google File System. It parallels the Hadoop project, which is implemented in Java. CloudStore supports incremental scalability, replication, checksumming for data integrity, client side fail-over and access from C++, Java and Python...
(previously Kosmos Distributed File System), which is rack-aware. - FTP Filesystem: this stores all its data on remotely accessible FTP servers.
- Read-only HTTP and HTTPSHttpsHypertext Transfer Protocol Secure is a combination of the Hypertext Transfer Protocol with SSL/TLS protocol to provide encrypted communication and secure identification of a network web server...
file systems.
Hadoop can work directly with any distributed file system
Distributed file system
Network file system may refer to:* A distributed file system, which is accessed over a computer network* Network File System , a specific brand of distributed file system...
that can be mounted by the underlying operating system simply by using a file:// URL; however, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data; this is information which Hadoop-specific filesystem bridges can provide.
Out-of-the-box, this includes Amazon S3, and the CloudStore
CloudStore
CloudStore is Kosmix's C++ implementation of Google File System. It parallels the Hadoop project, which is implemented in Java. CloudStore supports incremental scalability, replication, checksumming for data integrity, client side fail-over and access from C++, Java and Python...
filestore, through s3:// and kfs:// URLs directly.
A number of third party filesystem bridges have also been written, none of which are currently in Hadoop distributions. These may offer superior availability or scalability, and possibly a more general-purpose filesystem than HDFS, which is biased towards large files and only offers a subset of the expected semantics of a Posix Filesystem: no locking, or writing to anywhere other than the tail of a file.
- In 2009 IBMIBMInternational Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
discussed running Hadoop over the IBM General Parallel File System. The source code was published in October 2009. - In April 2010, Parascale published the source code to run Hadoop against the Parascale filesystem.
- In April 2010, Appistry released a Hadoop filesystem driver for use with its own CloudIQ Storage product.
- In June 2010, HP discussed a location-aware IBRIX FusionIBRIX FusionIBRIX Fusion is a scalable parallel file system combined with integrated logical volume manager, availability features and a management interface. The software was produced, sold, and supported by IBRIX Incorporated of Billerica, Massachusetts. HP announced on July 17, 2009 that it had reached a...
filesystem driver. - In May 2011, MapRMapRMapR is a San Jose, California-based enterprise software company that develops and sells Apache Hadoop-derived software. The company contributes to Apache Hadoop projects like HBase, Pig , Apache Hive, and Apache ZooKeeper...
Technologies, Inc. announced the availability of an alternate filesystem for Hadoop, which replaced the HDFS file system with a full random-access read/write file system, with advanced features like snaphots and mirrors, and get rid of the single point of failureSingle point of failureA single point of failure is a part of a system that, if it fails, will stop the entire system from working. They are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.-Overview:Systems can be made...
issue of the default HDFS NameNode.
JobTracker and TaskTracker: the MapReduce engine
Above the file systems comes the MapReduceMapReduce
MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Parts of the framework are patented in some countries....
engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine
Java Virtual Machine
A Java virtual machine is a virtual machine capable of executing Java bytecode. It is the code execution component of the Java software platform. Sun Microsystems stated that there are over 4.5 billion JVM-enabled devices.-Overview:...
process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty
Jetty (web server)
Jetty is a pure Java-based HTTP client/server, WebSocket client/server and servlet container developed as a free and open source project as part of the Eclipse Foundation...
and can be viewed from a web browser.
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off. In earlier versions of Hadoop, all active work was lost when a JobTracker restarted.
Known limitations of this approach are:
- The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The Job Tracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system loadLoad (computing)In UNIX computing, the system load is a measure of the amount of work that a computer system performs. The load average represents the average system load over a period of time...
of the allocated machine, and hence its actual availability. - If one TaskTracker is very slow, it can delay the entire MapReduce job - especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative-execution enabled, however, a single task can be executed on multiple slave nodes.
Scheduling
By default Hadoop uses FIFOFIFO
FIFO is an acronym for First In, First Out, an abstraction related to ways of organizing and manipulation of data relative to time and prioritization...
, and optional 5 scheduling priorities to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, and added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).
Fair scheduler
The fair scheduler was developed by Facebook
Facebook
Facebook is a social networking service and website launched in February 2004, operated and privately owned by Facebook, Inc. , Facebook has more than 800 million active users. Users must register before using the site, after which they may create a personal profile, add other users as...
. The goal of the fair scheduler is to provide fast response times for small jobs and QoS
Quality of service
The quality of service refers to several related aspects of telephony and computer networks that allow the transport of traffic with special requirements...
for production jobs. The fair scheduler has three basic concepts.
- Jobs are grouped into PoolsPool (Computer science)A pool in computer science is a set of initialised resources that are kept ready to use, rather than allocated and destroyed on demand. A client of the pool will request an object from the pool and perform operations on the returned object...
. - Each pool is assigned a guaranteed minimum share.
- Excess capacity is split between jobs.
By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
Capacity scheduler
The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.
- Jobs are submitted into queues.
- Queues are allocated a fraction of the total resource capacity.
- Free resources are allocated to queues beyond their total capacity.
- Within a queue a job with a high level of priority will have access to the queue's resources.
There is no preemption
Preemption (computing)
In computing, preemption is the act of temporarily interrupting a task being carried out by a computer system, without requiring its cooperation, and with the intention of resuming the task at a later time. Such a change is known as a context switch...
once a job is running.
Other applications
The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBaseHBase
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS , providing BigTable-like capabilities for Hadoop...
database, the Apache Mahout
Apache Mahout
Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform...
machine learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...
system, and the Apache Hive
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix...
Data Warehouse
Data warehouse
In computing, a data warehouse is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.A data warehouse...
system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop included:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
Yahoo!
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on more than 10,000 core LinuxLinux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
cluster
Cluster (computing)
A computer cluster is a group of linked computers, working together closely thus in many respects forming a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks...
and produces data that is now used in every Yahoo! Web search query.
There are multiple Hadoop clusters at Yahoo!, and no HDFS filesystems or MapReduce jobs are split across multiple datacenters. Every hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.
On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production. Yahoo! contributes back all work it does on Hadoop to the open-source community, the company's developers also fix bugs and provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.
Facebook
Facebook is a social networking service and website launched in February 2004, operated and privately owned by Facebook, Inc. , Facebook has more than 800 million active users. Users must register before using the site, after which they may create a personal profile, add other users as...
claimed that they have the largest Hadoop cluster in the world with 21 PB
Petabyte
A petabyte is a unit of information equal to one quadrillion bytes, or 1000 terabytes. The unit symbol for the petabyte is PB...
of storage. On July 27, 2011 they announced the data has grown to 30 PB.
Other users
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of the notable users include:Hadoop on Amazon EC2/S3 services
It is possible to run Hadoop on Amazon Elastic Compute CloudAmazon Elastic Compute Cloud
Amazon Elastic Compute Cloud is a central part of Amazon.com's cloud computing platform, Amazon Web Services . EC2 allows users to rent virtual computers on which to run their own computer applications...
(EC2) and Amazon Simple Storage Service (S3). As an example The New York Times
The New York Times
The New York Times is an American daily newspaper founded and continuously published in New York City since 1851. The New York Times has won 106 Pulitzer Prizes, the most of any news organization...
used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).
There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient, as the S3 filesystem is remote and delays returning from every write operation until the data are guaranteed not to be lost. This removes the locality advantages of Hadoop, which schedules work near data to save on network load.
Amazon Elastic MapReduce
Elastic MapReduce was introduced by AmazonAmazon.com
Amazon.com, Inc. is a multinational electronic commerce company headquartered in Seattle, Washington, United States. It is the world's largest online retailer. Amazon has separate websites for the following countries: United States, Canada, United Kingdom, Germany, France, Italy, Spain, Japan, and...
in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce. Apache Hive
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix...
, which is built on top of Hadoop for providing data warehouse services, is also offered in Elastic MapReduce.
Support for using Spot Instances was later added in August 2011. Elastic MapReduce is fault tolerant for slave failures, and it is recommended to only run the Task Instance Group on spot instances to take advantage of the lower cost while maintaining availability.
Hadoop at Google and IBM
IBMIBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
and Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...
announced an initiative in 2007 to use Hadoop to support university courses in distributed computer programming.
In 2008 this collaboration, the Academic Cloud Computing Initiative (ACCI), partnered with the National Science Foundation
National Science Foundation
The National Science Foundation is a United States government agency that supports fundamental research and education in all the non-medical fields of science and engineering. Its medical counterpart is the National Institutes of Health...
to provide grant funding to academic researchers interested in exploring large-data applications. This resulted in the creation of the Cluster Exploratory (CLuE) program.
Running Hadoop in compute farm environments
Hadoop can also be used in compute farms and high-performance computingHigh-performance computing
High-performance computing uses supercomputers and computer clusters to solve advanced computation problems. Today, computer systems approaching the teraflops-region are counted as HPC-computers.-Overview:...
environments. Instead of setting up a dedicated Hadoop cluster, an existing compute farm can be used if the resource manager of the cluster is aware of the Hadoop jobs, and thus Hadoop jobs can be scheduled like other jobs in the cluster.
Grid Engine Integration
Integration with Sun Grid EngineSun Grid Engine
Oracle Grid Engine, previously known as Sun Grid Engine , previously known as CODINE or GRD , is an open source batch-queuing system, developed and supported by Sun Microsystems...
was released in 2008, and running Hadoop on Sun Grid
Sun Grid
Sun Cloud is an on-demand Cloud computing service operated by Sun Microsystems, a subsidiary of Oracle Corporation. The Sun Cloud Compute Utility provides access to a substantial computing resource over the Internet for US$1 per CPU-hour...
(Sun's on-demand utility computing
Utility computing
Utility computing is the packaging of computing resources, such as computation, storage and services, as a metered service similar to a traditional public utility...
service) was possible. In the initial implementation of the integration, the CPU-time scheduler has no knowledge of the locality of the data. Unfortunately, this means that the processing is not always done on the same rack as the data; this was a key feature of the Hadoop Runtime. An improved integration with data-locality was announced during the Sun HPC Software Workshop '09.
In 2008-2009 Sun
Sun Microsystems
Sun Microsystems, Inc. was a company that sold :computers, computer components, :computer software, and :information technology services. Sun was founded on February 24, 1982...
released the Hadoop Live CD OpenSolaris
OpenSolaris
OpenSolaris was an open source computer operating system based on Solaris created by Sun Microsystems. It was also the name of the project initiated by Sun to build a developer and user community around the software...
project, which allows running a fully functional Hadoop cluster using a live CD
Live CD
A live CD, live DVD, or live disc is a CD or DVD containing a bootable computer operating system. Live CDs are unique in that they have the ability to run a complete, modern operating system on a computer lacking mutable secondary storage, such as a hard disk drive...
. This distribution includes Hadoop 0.19 -as of April 2010 there has not been an updated release.
Condor Integration
The Condor High-Throughput Computing System integration was presented at the Condor Week conference in 2010.Commercially supported Hadoop-related products
There are a number of companies offering commercial implementations and/or providing support for Hadoop.- ClouderaClouderaCloudera Inc. is a Palo Alto-based enterprise software company which provides Apache Hadoop-based software and services. It contributes to Hadoop and related Apache projects and provides a distribution for Hadoop for the enterprise. Cloudera has two products: and...
offers CDH (Cloudera's Distribution including Apache Hadoop) and Cloudera Enterprise. - IBMIBMInternational Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
offers InfoSphere BigInsights based on Hadoop in both a basic and enterprise edition. - In March 2011, Platform ComputingPlatform ComputingPlatform Computing is a privately held software company that is primarily known for its job scheduling product, Load Sharing Facility . It was founded in 1992 in Toronto, Ontario, Canada and is currently headquartered in Markham, Ontario with 11 branch offices across the United States, Europe and...
announced support for the Hadoop MapReduce API in its SymphonySymphony (software)Platform Symphony is a High-performance computing software system developed by Platform Computing, the company that developed Load Sharing Facility . Focusing on the Financial Services Industry , Symphony is designed to deliver scalability and enhances performance for compute-intensive risk and...
software. - In May 2011, MapRMapRMapR is a San Jose, California-based enterprise software company that develops and sells Apache Hadoop-derived software. The company contributes to Apache Hadoop projects like HBase, Pig , Apache Hive, and Apache ZooKeeper...
Technologies, Inc. announced the availability of their distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop. The MapR product includes most Hadoop eco-system components and adds capabilities such as snapshots, mirrors, NFS access and full read-write file semantics. - Silicon Graphics InternationalSilicon Graphics InternationalSilicon Graphics International Corp. , is an American manufacturer of computer hardware and software, including high-performance computing solutions, x86-based servers for datacenter deployment, and visualization products...
offers Hadoop optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services. - EMCEMC CorporationEMC Corporation , a Financial Times Global 500, Fortune 500 and S&P 500 company, develops, delivers and supports information infrastructure and virtual infrastructure hardware, software, and services. EMC is headquartered in Hopkinton, Massachusetts, USA.Former Intel executive Richard Egan and his...
released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011. The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBaseHBaseHBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS , providing BigTable-like capabilities for Hadoop...
, HiveApache HiveApache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix...
, and the ZooKeeperApache ZooKeeperApache ZooKeeper is a software project of the Apache Software Foundation, providing an open source centralized configuration service and naming registry for large distributed systems. ZooKeeper is a sub project of Hadoop....
configuration service. The enterprise edition is an offering based on the MapRMapRMapR is a San Jose, California-based enterprise software company that develops and sells Apache Hadoop-derived software. The company contributes to Apache Hadoop projects like HBase, Pig , Apache Hive, and Apache ZooKeeper...
product, and offers proprietary features such as snapshots and wide area replication. - In June 2011, Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users.
- Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App EngineGoogle App EngineGoogle App Engine is a platform as a service cloud computing platform for developing and hosting web applications in Google-managed data centers. It virtualizes applications across multiple servers,...
. - In Oct 2011, OracleOracle CorporationOracle Corporation is an American multinational computer technology corporation that specializes in developing and marketing hardware systems and enterprise software products – particularly database management systems...
announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise LinuxOracle Enterprise LinuxOracle Linux, formerly known as Oracle Enterprise Linux, is a Red Hat Enterprise Linux-compatible distribution, repackaged and sold by Oracle, available under the GNU General Public License since late 2006....
, the R programming language, and a NoSQLNosqlIn computing, NoSQL is a broad class of database management systems that differ from the classic model of the relational database management system in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally...
database with the Exadata hardware. - Dovestech has released Ocean Sync Hadoop Management Software Freeware Edition. The software allows users to control and monitor all aspects of an Hadoop cluster. The website for this tool is OceanSync.com.
ASF's view on the use of "Hadoop" in product names
The Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache Hadoop. The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.Papers
Papers that influence birth and growth of Hadoop and big data processing.- 2004 MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat from Google Lab. THis paper inspired Doug Cutting to start open-source implementation of Map-Reduce framework. He named it Hadoop, after his son's toy elephant.
- 2005 From Databases to Dataspaces: A New Abstraction for Information Management, the authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system’s understanding of the data.
- 2006 Bigtable: A Distributed Storage System for Structured Data from Google Lab.
- 2008 H-store: a high-performance, distributed main memory transaction processing system
- 2009 MAD Skills: New Analysis Practices for Big Data
- 2011 Apache Hadoop Goes Realtime at Facebook
See also
- NutchNutchNutch is an effort to build an open source web search engine based on Lucene Java for the search and index component.- Features :Nutch is coded entirely in the Java programming language, but data is written in language-independent formats...
- an effort to build an open source search engine based on Lucene and Hadoop. Also created by Doug Cutting. - DatameerDatameerDatameer, Inc., headquartered in San Mateo, California, is a software development companyDatameer specializes in analysis of large volumes of data for business users of Apache Hadoop...
Analytics Solution (DAS) – data source integration, storage, analytics engine and visualization - HBaseHBaseHBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS , providing BigTable-like capabilities for Hadoop...
- BigTableBigTableBigTable is a compressed, high performance, and proprietary database system built on Google File System , Chubby Lock Service, SSTable and a few other Google technologies; it is currently not distributed nor is it used outside of Google, although Google offers access to it as part of their Google...
-model database. - HypertableHypertableHypertable is an open source database inspired by publications on the design of Google's BigTable. The project is based on experience of engineers who were solving large-scale data-intensive tasks for many years....
- HBase alternative - MapReduceMapReduceMapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Parts of the framework are patented in some countries....
- Hadoop's fundamental data filtering algorithm - Apache MahoutApache MahoutApache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform...
- Machine Learning algorithms implemented on Hadoop - Apache Cassandra - A column-oriented database that supports access from Hadoop
- HPCCHPCCHPCC , also known as DAS , is a Data Intensive Computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing Big...
- LexisNexisLexisNexisLexisNexis Group is a company providing computer-assisted legal research services. In 2006 it had the world's largest electronic database for legal and public-records related information...
Risk Solutions High Performance Computing Cluster - Sector/SphereSector/SphereSector/Sphere is an open source software suite for high-performance distributed data storage and processing. It can be broadly compared to Google's GFS/MapReduce stack. Sector is a distributed file system targeting data storage over a large number of commodity computers...
- Open source distributed storage and processing - Cloud computingCloud computingCloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility over a network ....
- Big dataBig dataBig data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing...
- Data Intensive ComputingData Intensive ComputingData-Intensive Computing is a class of parallel computing applications which use a data parallel approach to processing large volumes of data typically terabytes or petabytes in size and typically referred to as Big Data...
External links
- Introducing Apache Hadoop: The Modern Data Operating System — lecture given at Stanford UniversityStanford UniversityThe Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is a private research university on an campus located near Palo Alto, California. It is situated in the northwestern Santa Clara Valley on the San Francisco Peninsula, approximately northwest of San...
by Co-Founder and CTO of Cloudera, Amr Awadallah (video archive). - michael-noll.com's tutorials/