Scientific workflow system
A scientific workflow system is a specialized form of workflow management system designed to compose and execute a series of computational or data manipulation steps, or a workflow, in a scientific application. Bioinformatics workflow management systems are a specialized form of scientific workflow system that focuses on one domain of science, bioinformatics.

The rising interest in scientific workflow systems has coincided with rising interest in e-Science technologies and applications, and in grid computing. The vision of e-Science is that of distributed scientists being able to collaborate on conducting large-scale scientific experiments and knowledge discovery applications using distributed systems of computing resources, data sets, and devices. Scientific workflow systems play an important role in enabling this vision.

There are many motives for differentiating scientific workflows from traditional business process workflows. These include:
  • providing an easy-to-use environment for individual application scientists themselves to create their own workflows
  • providing interactive tools for the scientists enabling them to execute their workflows and view their results in real time
  • simplifying the process of sharing and reusing workflows between the scientists
  • enabling scientists to track the provenance of the workflow execution results and the workflow creation steps


By focusing on the scientists, the design of a scientific workflow system shifts away from workflow scheduling activities, which grid computing environments typically consider in order to optimize the execution of complex computations on predefined resources, and toward a domain-specific view of which data types, tools and distributed resources should be made available to the scientists and how they can be made easily accessible.

Scientific workflows

The simplest computerized scientific workflows are scripts that call in data, programs, and other inputs and produce outputs that might include visualizations and analytical results. These may be implemented in programs such as R or MATLAB, or using a scripting language such as Python or Perl with a command-line interface.
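A script-based workflow of this kind might look like the following minimal Python sketch. It is an illustration only: the step names (`load_measurements`, `analyze`, `report`), the `value` column name and the sample data are all assumptions made for the example, not part of any particular workflow system.

```python
import csv
import io
import statistics


def load_measurements(csv_text):
    """Step 1: parse CSV text into floats (the 'value' column is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row["value"]) for row in reader]


def analyze(values):
    """Step 2: compute summary statistics for the loaded data."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def report(summary):
    """Step 3: format the analytical result as plain text."""
    return "\n".join(f"{key}: {value:.3f}" for key, value in summary.items())


def run_workflow(csv_text):
    """Chain the steps: each step's output is the next step's input."""
    return report(analyze(load_measurements(csv_text)))


if __name__ == "__main__":
    sample = "value\n1.0\n2.0\n3.0\n4.0\n"  # hypothetical input data
    print(run_workflow(sample))
```

Keeping each step as a separate function makes it easier to re-run or replace a single stage, which is the same concern that dedicated workflow systems address at a larger scale.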

Scientific workflows are now recognized as a crucial element of the cyberinfrastructure, facilitating e-Science. Typically sitting on top of a middleware layer, scientific workflows are a means by which scientists can model, design, execute, debug, re-configure and re-run their analysis and visualization pipelines. Part of the established scientific method is to create a record of the origins of a result, how it was obtained, the experimental methods used, machine calibrations and parameters, and so on. The same holds in e-Science, except that provenance data are a record of the workflow activities invoked, services and databases accessed, data sets used, and so forth. Such information helps scientists interpret their own workflow results and helps other scientists establish trust in the experimental result.
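The kind of provenance record described above can be sketched in a few lines of Python. This is a hedged illustration, not the mechanism of any specific workflow system: the helper names (`fingerprint`, `run_with_provenance`), the record fields and the two toy steps are assumptions chosen for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj):
    """Short, stable hash of a JSON-serializable value, used to identify a data set."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_with_provenance(steps, data):
    """Run (name, function, params) steps in order and log one record per step."""
    log = []
    for name, func, params in steps:
        result = func(data, **params)
        log.append({
            "activity": name,               # workflow activity invoked
            "parameters": params,           # settings used for this run
            "input": fingerprint(data),     # data set consumed
            "output": fingerprint(result),  # data set produced
            "finished_at": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, log


def scale(values, factor):
    """Hypothetical step: multiply every value by a factor."""
    return [v * factor for v in values]


def threshold(values, minimum):
    """Hypothetical step: keep only values at or above a minimum."""
    return [v for v in values if v >= minimum]


if __name__ == "__main__":
    steps = [
        ("scale", scale, {"factor": 2}),
        ("threshold", threshold, {"minimum": 4}),
    ]
    result, log = run_with_provenance(steps, [1, 2, 3])
    print(result)
    print(json.dumps(log, indent=2))
```

Because each record stores the fingerprint of its input and output, a reader can check that the output of one activity is exactly the input of the next, which is one simple way to trace how a result was obtained.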

Examples

There are many examples of scientific workflow systems:
  • Discovery Net: one of the earliest examples of a scientific workflow system
  • Galaxy: initially targeted at genomics
  • Kepler scientific workflow system
  • PipeLine Pilot
  • Taverna workbench: widely used in bioinformatics
  • Triana
  • KNIME

In addition to the workflow systems themselves, communities such as the social networking site myExperiment have developed to facilitate sharing and collaborative development of scientific workflows. Galaxy provides collaborative mechanisms for editing and publishing workflow definitions and workflow results directly on the Galaxy installation.

See also

  • Bioinformatics workflow management systems
  • e-Science
  • Grid computing
  • Knowledge discovery
  • Workflow

External links

  • A taxonomy of scientific workflow systems for grid computing, from the ACM SIGMOD Record
  • Scientific workflow systems - can one size fit all? Paper in CIBEC'08 comparing the features of multiple scientific workflow systems
  • List of software tools related to scientific workflows on the DataONE website
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 