HPCC
HPCC, also known as DAS (Data Analytics Supercomputer), is a data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing Big Data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.
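
The flavor of ECL can be conveyed with a minimal sketch; the logical file name and field names here are hypothetical, and the program simply declares what to compute, leaving parallel execution to the platform:

    // Hypothetical log layout; '~demo::weblogs' is an assumed logical file name.
    Line := RECORD
      STRING text;
    END;
    logs := DATASET('~demo::weblogs', Line, CSV);
    // Declarative filter: no loops or explicit partitioning logic required.
    errors := logs(text[1..5] = 'ERROR');
    OUTPUT(COUNT(errors));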

Introduction

Many organizations have large amounts of data collected and stored in massive datasets which must be processed and analyzed to provide business intelligence, improve products and services for customers, or meet other internal data processing requirements. For example, Internet companies need to process data collected by Web crawlers as well as logs, click data, and other information generated by Web services. Parallel relational database technology has not proven to be cost-effective or to provide the high performance needed to analyze massive amounts of data in a timely manner. As a result, several organizations developed technology to utilize large clusters of commodity servers to provide high-performance computing capabilities for processing and analysis of massive datasets. Clusters can consist of hundreds or even thousands of commodity machines connected by high-bandwidth networks. Examples of this type of cluster technology include Google's MapReduce, Apache Hadoop, Aster Data Systems, Sector/Sphere, and the LexisNexis HPCC platform.

High Performance Computing

High-Performance Computing (HPC) describes computing environments which use supercomputers and computer clusters to address complex computational requirements, support applications with significant processing time requirements, or process significant amounts of data. Supercomputers have generally been associated with scientific research and compute-intensive problems, but supercomputer technology is increasingly appropriate for both compute-intensive and data-intensive applications. A newer trend in supercomputer design for high-performance computing is the use of clusters of independent processors connected in parallel. Many computing problems are suitable for parallelization; often a problem can be divided so that each independent processing node works on a portion of the problem in parallel by simply dividing the data to be processed, with the final results for each portion then combined. This type of parallelism is often referred to as data-parallelism, and data-parallel applications are a potential solution to petabyte-scale data processing requirements. Data-parallelism can be defined as a computation applied independently to each item of a data set, which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance, which may result in performance improvements of several orders of magnitude.
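
In ECL, such a data-parallel computation is written once and applied across the cluster. As a hedged sketch (the record layout and logical file name are assumptions for illustration), the following aggregates sales totals by state; each node processes its own partition of the data and the partial results are combined:

    // Hypothetical sales layout; '~demo::sales' is an assumed logical file.
    Sale := RECORD
      STRING2    state;
      DECIMAL8_2 amount;
    END;
    sales := DATASET('~demo::sales', Sale, THOR);
    // The aggregation is applied independently to each node's data, so the
    // degree of parallelism scales with the volume of data.
    byState := TABLE(sales, {state, DECIMAL12_2 total := SUM(GROUP, amount)}, state);
    OUTPUT(byState);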

Commodity Computing Clusters

The resulting economies of scale in using multiple independent processing nodes for supercomputer design to address high-performance computing requirements led directly to the implementation of commodity computing clusters. A computer cluster is a group of individual computers linked by high-speed communications in a local area network topology, using technology such as gigabit Ethernet switches or InfiniBand, and incorporating system software which provides an integrated parallel processing environment for applications with the capability to divide processing among the nodes in the cluster. Cluster configurations can not only improve the performance of applications which use a single computer, but also provide higher availability and reliability, and are typically much more cost-effective than single supercomputer systems with equivalent performance. The key to the capability, performance, and throughput of a computing cluster is the system software and tools used to provide the parallel job execution environment. Programming languages with implicit parallel processing features and a high degree of optimization are also needed to ensure high-performance results as well as high programmer productivity. Clusters allow the data used by an application to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data.
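
To make the partitioning concrete, here is a hedged ECL sketch (the record layout and file name are assumptions): DISTRIBUTE spreads records across the nodes of the cluster, and LOCAL operations then run independently on each node's partition:

    Person := RECORD
      UNSIGNED8 id;
      STRING25  lastName;
    END;
    people := DATASET('~demo::people', Person, THOR);  // hypothetical file
    // Hash-partition the records across the cluster nodes by id.
    spread := DISTRIBUTE(people, HASH32(id));
    // LOCAL sort and dedup run on each node's partition, with no cross-node traffic.
    deduped := DEDUP(SORT(spread, id, LOCAL), id, LOCAL);
    OUTPUT(COUNT(deduped));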


Commodity computing clusters are configured using commercial off-the-shelf (COTS) PC components. Rack-mounted servers or blade servers, each with local memory and disk storage, are often used as processing nodes to allow high-density, small-footprint configurations which facilitate the use of very high-speed communications equipment to connect the nodes (Figure 1). Linux is widely used as the operating system for computer clusters.

HPCC System Architecture

The HPCC system architecture includes two distinct cluster processing environments, each of which can be optimized independently for its parallel data processing purpose. The first of these platforms is called a Data Refinery. Its overall purpose is the general processing of massive volumes of raw data of any type for any purpose, but it is typically used for data cleansing and hygiene, extract, transform, load (ETL) processing of the raw data, record linking and entity resolution, large-scale ad hoc complex analytics, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications. The Data Refinery is also referred to as Thor, a reference to the Norse god of thunder whose large hammer is symbolic of crushing large amounts of raw data into useful information. A Thor cluster is similar in its function, execution environment, filesystem, and capabilities to the Google and Hadoop MapReduce platforms.

Figure 2 shows a representation of a physical Thor processing cluster which functions as a batch job execution engine for scalable data-intensive computing applications. In addition to the Thor master and slave nodes, additional auxiliary and common components are needed to implement a complete HPCC processing environment.
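
A hedged sketch of a typical Thor batch job in ECL follows (the layouts and file names are assumptions); it links raw records to a reference file, a simple form of the record linking described above, with Thor executing the join in parallel across the slave nodes:

    RawRec := RECORD
      UNSIGNED8 id;
      STRING40  rawName;
    END;
    RefRec := RECORD
      UNSIGNED8 id;
      STRING40  cleanName;
    END;
    raw := DATASET('~demo::raw', RawRec, THOR);        // hypothetical files
    ref := DATASET('~demo::reference', RefRec, THOR);
    OutRec := RECORD
      UNSIGNED8 id;
      STRING40  cleanName;
    END;
    OutRec linkIt(RawRec l, RefRec r) := TRANSFORM
      SELF.id := l.id;
      SELF.cleanName := r.cleanName;
    END;
    // The join is distributed across the Thor slave nodes.
    linked := JOIN(raw, ref, LEFT.id = RIGHT.id, linkIt(LEFT, RIGHT));
    OUTPUT(linked, , '~demo::linked', OVERWRITE);
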
The second of the parallel data processing platforms is called Roxie and functions as a rapid data delivery engine. This platform is designed as an online high-performance structured query and analysis platform or data warehouse, delivering the parallel data access requirements of online applications through Web services interfaces and supporting thousands of simultaneous queries and users with sub-second response times. Roxie utilizes a distributed indexed filesystem to provide parallel processing of queries, using an execution environment and filesystem optimized for high-performance online processing. A Roxie cluster is similar in its function and capabilities to Hadoop with HBase and Hive capabilities added, and provides near-real-time, predictable query latencies. Both Thor and Roxie clusters utilize the ECL programming language for implementing applications, increasing continuity and programmer productivity.

Figure 3 shows a representation of a physical Roxie processing cluster which functions as an online query execution engine for high-performance query and data warehousing applications. A Roxie cluster includes multiple nodes with server and worker processes for processing queries; an additional auxiliary component called an ESP server which provides interfaces for external client access to the cluster; and additional common components which are shared with a Thor cluster in an HPCC environment. Although a Thor processing cluster can be implemented and used without a Roxie cluster, an HPCC environment which includes a Roxie cluster should also include a Thor cluster. The Thor cluster is used to build the distributed index files used by the Roxie cluster and to develop online queries which will be deployed with the index files to the Roxie cluster.
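
The Thor-to-Roxie workflow can be sketched in ECL as follows; the layouts, file names, and query parameter are assumptions, and in practice the index build runs as a Thor job while the query is compiled and published separately to Roxie:

    Person := RECORD
      UNSIGNED8 id;
      STRING25  lastName;
      STRING15  firstName;
    END;
    people := DATASET('~demo::people', Person, THOR);  // hypothetical file
    // Built on Thor: a distributed index keyed by lastName, with id and
    // firstName carried as payload fields.
    idx := INDEX(people, {lastName}, {id, firstName}, '~demo::people.key');
    BUILD(idx);

    // Published to Roxie: the STORED value becomes a Web-service query
    // parameter, and the filter is satisfied by a keyed read of the index.
    STRING25 searchName := '' : STORED('searchName');
    OUTPUT(idx(lastName = searchName));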

HPCC Software Architecture

The HPCC software architecture incorporates the Thor and Roxie clusters as well as common middleware components, an external communications layer, client interfaces which provide both end-user services and system management tools, and auxiliary components to support monitoring and to facilitate loading and storing of filesystem data from external sources. An HPCC environment can include only Thor clusters, or both Thor and Roxie clusters. The overall HPCC software architecture is shown in Figure 4.

See also

  • High-Performance Computing
  • Supercomputer
  • Computer cluster
  • COTS
  • List of important publications in concurrent, parallel, and distributed computing
  • Parallel Computing
  • Distributed Computing
  • Parallel programming model
  • Data parallelism
  • Big Data
  • Implicit parallelism
  • Declarative programming
  • Data Intensive Computing
