Bioinformatics workflow management systems
A bioinformatics workflow management system is a specialized form of workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or a workflow, in a specific domain of science, bioinformatics.

There are currently many different workflow systems. Some have been developed more generally as scientific workflow systems for use by scientists from many different disciplines, such as astronomy and earth science.
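The idea of composing and executing a workflow can be illustrated with a minimal Python sketch. It is not taken from any of the systems listed in this article: the step names, the `run_workflow` helper and the toy sequence analysis are all hypothetical, chosen only to show steps being run in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load(results):
    # Hypothetical first step: normalise the raw input sequence.
    return results["raw"].strip().upper()


def gc_content(results):
    # Fraction of G and C bases in the loaded sequence.
    seq = results["load"]
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(results):
    # Complement each base and reverse the order.
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(results["load"]))


def report(results):
    # Final step: combine the outputs of the two analysis steps.
    return f"GC={results['gc_content']:.2f} revcomp={results['reverse_complement']}"


# The workflow itself: each step name maps to its function and to the
# set of steps that must finish before it can run.
STEPS = {
    "load": load,
    "gc_content": gc_content,
    "reverse_complement": reverse_complement,
    "report": report,
}
DEPENDENCIES = {
    "load": set(),
    "gc_content": {"load"},
    "reverse_complement": {"load"},
    "report": {"gc_content", "reverse_complement"},
}


def run_workflow(steps, dependencies, inputs):
    """Run every step after the steps it depends on; return results and order."""
    results = dict(inputs)
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = steps[name](results)
    return results, order


if __name__ == "__main__":
    results, order = run_workflow(STEPS, DEPENDENCIES, {"raw": " atgc\n"})
    print(order)
    print(results["report"])
```

In this sketch the two analysis steps depend only on `load`, so a real system would be free to run them in parallel or on different machines; the sketch simply runs them one after another.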

Examples

  • Anduril is an open source component-based workflow framework for scientific data analysis developed at the University of Helsinki. Anduril provides an execution engine written in Java, a large number of components for bioinformatics analysis, and the AndurilScript language to create and manage workflows.
  • BioBike is a biocomputing platform based upon the KnowOS (Knowledge Operating System) e-science technology. Written entirely in Lisp, KnowOS's main distinguishing feature is "through-the-browser" programmability.
  • BioExtract harnesses the power of online informatics tools for creating and customizing workflows. Users can query online sequence data, analyze it using an array of informatics tools, create and share custom workflows for repeated analysis, and save the resulting data and workflows in standardized reports.
  • BioManager is a bioinformatic data management and analysis workflow developed by the University of Sydney.
  • CellProfiler is open source modular image analysis software developed at the Broad Institute. Capable of handling hundreds of thousands of images, it contains advanced algorithms for image analysis of cell-based assays and is optimized for high-throughput work. The software allows the user to construct a pipeline of individual modules; each module performs an image processing step, such as image loading, object identification, or feature extraction.
  • Discovery Net (circa 2000) is one of the earliest examples of scientific workflow systems. It was the winner of the “Most Innovative Data Intensive Application Award” at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed genome annotation pipeline for a Malaria genome case study. The Discovery Net system originated from a £2m EPSRC-funded project of the same name at Imperial College London, which investigated the development of an e-Science platform for scientific discovery from the data generated by a wide variety of high-throughput devices. Many of the features of the system (architecture features, visual front-end, simplified access to remote Web and Grid Services and inclusion of a workflow store) were considered novel at the time, and have since found their way into other academic and commercial systems.
  • Ergatis is a web-based system used to create, run, and monitor reusable bioinformatics analysis pipelines. It contains pre-built components for common bioinformatics analysis tasks, such as BLAST searches or storing data in a Chado database. These components can be arranged graphically to create highly configurable pipelines.
  • Galaxy is an open source workflow system developed at Penn State and Emory University. Galaxy is available as a free public web server and as downloadable software. Galaxy stresses ease of use and sharing and persisting analyses.
  • GenePattern is a genomic analysis platform developed at the Broad Institute of MIT & Harvard that provides access to more than 150 tools for gene expression analysis, proteomics, SNP analysis, RNA-seq, flow cytometry, and common data processing tasks. A web-based interface provides access to these tools and allows the creation of multi-step analysis pipelines that enable reproducible in silico research.
  • Geodise (Grid Enabled Optimisation and Design Search for Engineering) was developed at the University of Southampton.
  • Kepler enables scientists in a variety of disciplines, such as biology, ecology and astronomy, to compose and execute workflows. Kepler is based on the Ptolemy II system for heterogeneous, concurrent modeling and design. Ptolemy II was developed by the members of the Ptolemy project at the University of California, Berkeley. Although not originally intended for scientific workflows, it provides a mature platform for building and executing workflows, and supports multiple models of computation.
  • LONI Pipeline is a Java-based distributed graphical data-analysis environment for constructing, validating, executing and disseminating scientific workflows. As the LONI Pipeline references all data, services and tools as external objects, it directly allows resource interoperability without the need for rebuilding the software.
  • Medicel Integrator Workflow is a cluster-enabled bioinformatics workflow design and execution application. It can be used stand-alone or integrated with a biology data warehouse.
  • Pegasus, developed at the Information Sciences Institute at the University of Southern California, is a flexible framework that enables the mapping of complex scientific workflows onto the grid.
  • Pegasys is software for executing and integrating analyses of biological sequences, developed by the University of British Columbia.
  • Pipeline Pilot is Accelrys’ scientific informatics platform, which streamlines data integration and analysis by using a visual programming language (similar to LabVIEW) to build a pipeline that transforms any number of inputs (raw data) into any number of outputs.
  • Taverna workbench is an open source workflow system that enables scientists (typically, though not exclusively, in bioinformatics) to compose and execute scientific workflows. It has been developed as part of a £5.5m EPSRC project called myGrid based at the University of Manchester. Independently, other researchers have created programming-by-example workflow development tools that are interoperable with Taverna.
  • Triana is an open source problem solving environment developed at Cardiff University that combines an intuitive visual interface with powerful data analysis tools.
  • Wildfire is a distributed, Grid-enabled workflow construction and execution environment. It has a graphical user interface for constructing and running workflows. Wildfire borrows user interface features from Jemboss and adds a drag-and-drop interface allowing the user to compose EMBOSS (and other) programs into workflows. For execution, Wildfire uses GEL, the underlying workflow execution engine, which can exploit available parallelism on multiple-CPU machines, including Beowulf-class clusters and Grids.
  • Sight is a web-agent-oriented workflow platform that has historically had extensive means to integrate websites with ordinary web forms and HTML responses (WSDL is also supported). The system has a GUI-based workflow composer that supports modules with multiple ports and allows a module to access data from modules that stand earlier in the workflow. Sight was developed at the University of Ulm in Java and is currently released under the GPL.
  • RetroGuide is a query framework for querying retrospective bioinformatics data.
  • UGENE Workflow Designer is an open source visual environment designed for building and executing bioinformatics workflows. The main purpose of the system is to provide a user-friendly GUI for creating computational workflows that can be executed on commodity hardware as well as on high-performance clusters and supercomputers.
  • HCDC is an open source workflow system developed at ETH Zurich that focuses on large-scale image-based biological experiments. It includes a large collection of components for multiwell plate handling (96, 384, ...).
  • Mobyle is a framework and web portal specifically aimed at the integration of bioinformatics software and databanks. Mobyle is the successor of Pise and the RPBS server, previous systems that provided web environments to define and execute bioinformatics analyses.
  • Remora is a web server implemented according to the BioMoby web-service specifications, providing life science researchers with an easy-to-use workflow generator and launcher, a repository of predefined workflows and a survey system.

External links

This paper from the ACM SIGMOD Record reviews some of the above workflow systems.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 