Kepler scientific workflow system
Kepler is a free software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows.
Kepler's facilities provide process and data monitoring, provenance information, and high-speed data movement solutions. Workflows in general, and scientific workflows in particular, are directed graphs where the nodes represent discrete computational components, and the edges represent paths along which data and results can flow between components.
In Kepler, the nodes are called 'Actors' and the edges are called 'channels'. Kepler includes a graphical user interface for composing workflows in a desktop environment, a runtime engine for executing workflows within the GUI and independently from a command line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid. The Kepler system principally targets the use of a workflow metaphor for organizing computational tasks that are directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.
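
The graph view of a workflow can be illustrated with a minimal Python sketch. This is not Kepler's API: the Workflow class, its methods, and the three example actors are hypothetical names chosen for illustration. It only shows actors as nodes, channels as directed edges, and data flowing along the channels when each actor fires once.

```python
from collections import defaultdict, deque


class Workflow:
    """A toy workflow: actors are nodes, channels are directed edges."""

    def __init__(self):
        self.actors = {}                    # actor name -> callable
        self.channels = defaultdict(list)   # source name -> [destination names]
        self.inputs = defaultdict(list)     # destination name -> [source names]

    def add_actor(self, name, func):
        self.actors[name] = func

    def connect(self, source, destination):
        """Add a channel carrying the output of source to destination."""
        self.channels[source].append(destination)
        self.inputs[destination].append(source)

    def run(self):
        """Fire each actor once, in an order that respects the channels."""
        pending = {name: len(self.inputs[name]) for name in self.actors}
        ready = deque(name for name, count in pending.items() if count == 0)
        results = {}
        while ready:
            name = ready.popleft()
            # An actor receives the outputs of its upstream actors as arguments.
            args = [results[src] for src in self.inputs[name]]
            results[name] = self.actors[name](*args)
            for dst in self.channels[name]:
                pending[dst] -= 1
                if pending[dst] == 0:
                    ready.append(dst)
        return results


wf = Workflow()
wf.add_actor("read", lambda: [3.0, 4.0, 5.0])
wf.add_actor("mean", lambda xs: sum(xs) / len(xs))
wf.add_actor("report", lambda m: f"mean = {m}")
wf.connect("read", "mean")
wf.connect("mean", "report")
results = wf.run()
```

Here results maps each actor name to the value it produced. The sketch assumes an acyclic graph in which every actor fires exactly once; real workflow engines also handle streaming and cyclic graphs.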

Scientific workflow

A scientific workflow is the process of combining data and processes into a configurable, structured set of steps that implement semi-automated computational solutions of a scientific problem. Scientific workflow systems often provide graphical user interfaces to combine different technologies along with efficient methods for using them, and thus increase the efficiency of the scientists.

Access to scientific data

Kepler provides direct access to scientific data that has been archived in many of the commonly used data archives. For example, Kepler provides access to data stored in the Knowledge Network for Biocomplexity (KNB) Metacat server and described using Ecological Metadata Language (EML). Additional data sources that are supported include data accessible using the DiGIR protocol, the OPeNDAP protocol, GridFTP, JDBC, the Storage Resource Broker (SRB), and others.

Models of Computation

Kepler differs from many of the other bioinformatics workflow management systems in that it separates the structure of the workflow model from its model of computation, such that different models for the computation of the workflow can be bound to a given workflow graph. Kepler inherits several common models of computation from the Ptolemy system, including Synchronous Data Flow (SDF), Continuous Time (CT), Process Network (PN), and Dynamic Data Flow (DDF), among others.
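
The separation of workflow structure from its model of computation can be sketched in Python. The two functions below are hypothetical and are not the Ptolemy or Kepler directors; they only loosely echo a statically scheduled model and a process-network model, and both are applied to the same unchanged list of stages.

```python
import queue
import threading

# One fixed workflow structure: two stages connected in a line (made-up actors).
STAGES = [lambda x: x * 2, lambda x: x + 1]


def sequential_director(stages, tokens):
    """Statically scheduled execution: fire the stages in order for every token."""
    out = []
    for token in tokens:
        for stage in stages:
            token = stage(token)
        out.append(token)
    return out


def process_network_director(stages, tokens):
    """Concurrent execution: one thread per stage, connected by FIFO queues."""
    done = object()  # sentinel marking the end of the token stream
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, inbox, outbox):
        while True:
            token = inbox.get()
            if token is done:
                outbox.put(done)  # pass the end-of-stream marker downstream
                return
            outbox.put(stage(token))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for token in tokens:
        queues[0].put(token)
    queues[0].put(done)

    out = []
    while True:
        token = queues[-1].get()
        if token is done:
            break
        out.append(token)
    for t in threads:
        t.join()
    return out


tokens = [1, 2, 3]
sequential_result = sequential_director(STAGES, tokens)
network_result = process_network_director(STAGES, tokens)
```

Both directors produce the same outputs for this graph; what differs is how and when the stages execute, which is the aspect a model of computation controls.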

Hierarchical workflows

Kepler supports hierarchy in workflows, which allows complex tasks to be composed of simpler components. This feature allows workflow authors to build re-usable, modular components that can be saved for use across many different workflows.
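
Hierarchy of this kind can be sketched in Python as a composite that exposes the same interface as an atomic step. The Actor and CompositeActor classes below are hypothetical stand-ins written for this example, not Kepler's classes.

```python
class Actor:
    """An atomic step wrapping a single function (hypothetical class)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, value):
        return self.func(value)


class CompositeActor:
    """A sub-workflow that looks like a single actor from the outside."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def fire(self, value):
        # Pass the value through each child in turn.
        for child in self.children:
            value = child.fire(value)
        return value


# A reusable component built from two simpler ones.
normalize = CompositeActor("normalize", [
    Actor("strip", str.strip),
    Actor("lower", str.lower),
])

# The composite is reused as one step inside a larger workflow.
pipeline = CompositeActor("clean_and_split", [
    normalize,
    Actor("split", str.split),
])

words = pipeline.fire("  Pinus Ponderosa  ")
```

Because a composite offers the same fire method as an atomic actor, it can be nested to any depth and reused across workflows without the enclosing workflow knowing its internals.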

Workflow semantics

Kepler provides a model for the semantic annotation of workflow components using terms drawn from an ontology. These annotations support many advanced features, including improved search capabilities, automated workflow validation, and improved workflow editing.
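
One way annotations can improve search is by letting a query for a general term also match components tagged with more specific terms. The Python sketch below assumes a tiny made-up ontology and made-up component names; it does not reflect Kepler's actual annotation model or any real ontology.

```python
# Hypothetical mini-ontology: each term maps to its parent term.
PARENT = {
    "AirTemperature": "Temperature",
    "WaterTemperature": "Temperature",
    "Temperature": "Measurement",
    "Precipitation": "Measurement",
}

# Hypothetical components annotated with ontology terms.
ANNOTATIONS = {
    "ReadBuoyData": {"WaterTemperature"},
    "ReadWeatherStation": {"AirTemperature", "Precipitation"},
    "PlotSeries": set(),
}


def is_a(term, ancestor):
    """True if term equals ancestor or descends from it in the ontology."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def search(ancestor):
    """Names of components annotated with the term or any of its subterms."""
    return sorted(
        name for name, terms in ANNOTATIONS.items()
        if any(is_a(t, ancestor) for t in terms)
    )


temperature_components = search("Temperature")
precipitation_components = search("Precipitation")
```

A plain keyword search for "Temperature" would depend on how each component happens to be named; walking the term hierarchy finds both readers even though neither is tagged with the general term directly.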

Sharing workflows

Kepler components can be shared by exporting the workflow or component into a Kepler Archive (KAR) file, which is an extension of the JAR file format from Java. Once a KAR file is created, it can be emailed to colleagues, shared on web sites, or uploaded to the Kepler Component Repository. The Component Repository is a centralized system for sharing Kepler workflows that is accessible via both a web portal and a web service interface. Users can directly search for and utilize components from the repository from within the Kepler workflow composition GUI.
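
Since the JAR format is ZIP-based, an archive of this kind can in principle be inspected with ordinary ZIP tooling. The Python sketch below rests on that assumption and uses only the standard zipfile module. The two helper functions are hypothetical, and the archive built here is an in-memory stand-in with made-up entry names, not the layout of a real KAR file.

```python
import io
import zipfile


def list_archive_entries(archive):
    """Return the entry names stored in a ZIP-based archive."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


def read_manifest(archive):
    """Return the text of META-INF/MANIFEST.MF, or None if it is absent."""
    with zipfile.ZipFile(archive) as zf:
        try:
            return zf.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            return None


# Build a small stand-in archive in memory; the entry names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    zf.writestr("example-workflow.xml", "<entity name=\"example\"/>")

buffer.seek(0)
entries = list_archive_entries(buffer)
buffer.seek(0)
manifest = read_manifest(buffer)
```

Both helpers also accept a file path in place of the in-memory buffer, so the same calls could be pointed at an archive on disk.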

Provenance

Provenance is a critical concept in scientific workflows, since it allows scientists to understand the origin of their results, to repeat their experiments, and to validate the processes that were used to derive data products. In order for a workflow to be reproduced, provenance information must be recorded that indicates where the data originated, how it was altered, and which components and what parameter settings were used. This will allow other scientists to re-conduct the experiment, confirming the results.
Little support exists in current systems to allow end-users to query provenance information in scientifically meaningful ways, in particular when advanced workflow execution models go beyond simple DAGs (as in process networks).
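
The kind of record described above (where the data originated, which component altered it, and with what parameter settings) can be sketched in Python. The record fields, function names, and data source below are assumptions made for illustration and do not describe Kepler's provenance format.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(value):
    """Stable short hash of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_with_provenance(component, func, inputs, parameters, source, log):
    """Run one step and append a provenance record for it to log."""
    output = func(inputs, **parameters)
    log.append({
        "component": component,           # which component ran
        "parameters": parameters,         # the settings it ran with
        "data_source": source,            # where the input data originated
        "input_hash": fingerprint(inputs),
        "output_hash": fingerprint(output),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return output


def scale(values, factor):
    """Example component: multiply every value by factor."""
    return [v * factor for v in values]


log = []
scaled = run_with_provenance(
    component="scale",
    func=scale,
    inputs=[1, 2, 3],
    parameters={"factor": 10},
    source="example.csv (hypothetical)",
    log=log,
)
```

With such a record, another scientist can rerun the named component with the logged parameters on the same input and compare the output hash to confirm the result.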

Kepler history

The Kepler Project was created in 2002 by members of the Science Environment for Ecological Knowledge (SEEK) project and the Scientific Data Management (SDM) project. The project was founded by researchers at the National Center for Ecological Analysis and Synthesis (NCEAS) at the University of California, Santa Barbara and the San Diego Supercomputer Center at the University of California, San Diego. Kepler extends Ptolemy II, which is a software system for modeling, simulation, and design of concurrent, real-time, embedded systems developed at UC Berkeley. Collaboration on Kepler quickly grew as members of various scientific disciplines realized the benefits of scientific workflows for analysis and modeling and began contributing to the system. As of 2008, Kepler collaborators come from many science disciplines, including ecology, molecular biology, genetics, physics, chemistry, conservation science, oceanography, hydrology, library science, computer science, and others.

See also

  • Taverna workbench (Kepler's European counterpart)
  • Discovery Net
  • VisTrails
  • LONI Pipeline
  • Bioinformatics workflow management systems
  • DataONE Investigator Toolkit

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.