Sequence assembly
In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA or gene transcripts (ESTs).
The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only shredded pieces. Besides the confusion introduced by shredding the book, the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.
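The shredder analogy can be made concrete with a small simulation. The following is only an illustrative sketch: the function name `shotgun_reads`, its parameters, and the uniform substitution-error model are assumptions made for this example and do not describe any particular sequencing instrument.

```python
import random


def shotgun_reads(genome, n_reads, read_len, error_rate, seed=0):
    """Sample fixed-length reads at random positions, with substitution errors.

    Hypothetical helper: real reads vary in length and have
    technology-specific error patterns that are not modelled here.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    reads = []
    for _ in range(n_reads):
        # "Shredding": pick a random start position in the original sequence.
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        # "Typos": replace a base with a different one at the given rate.
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads
```

With `error_rate=0.0` every read is an exact substring of the input; raising the rate introduces the kind of "typos" an assembler has to tolerate.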
Genome assemblers
The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small viruses over plasmids to bacteria and finally eukaryotes), the assembly programs used in these genome projects needed increasingly sophisticated strategies to handle:
- terabytes of sequencing data which need processing on computing clusters;
- identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms exponentially;
- and errors in the fragments from the sequencing instruments, which can confound assembly.
Faced with the challenge of assembling the first larger eukaryotic genomes, the fruit fly Drosophila melanogaster in 2000 and the human genome just a year later, scientists developed assemblers like Celera Assembler and Arachne able to handle genomes of 100-300 million base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS was launched to bring together all the innovations in genome assembly technology under an open source framework.
EST assemblers
EST assembly differs from genome assembly in several ways. The sequences for EST assembly are the transcribed mRNA of a cell and represent only a subset of the whole genome. At first glance, the underlying algorithmic problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, mainly in the inter-genic parts. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, cells tend to have a certain number of genes that are constantly expressed in very high amounts (housekeeping genes), which again leads to the problem of similar sequences present in high amounts in the data set to be assembled. Furthermore, genes sometimes overlap in the genome (sense-antisense transcription), and should ideally still be assembled separately. EST assembly is also complicated by features like (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphisms, recoding, and post-transcriptional modification.
De-novo vs. mapping assembly
In sequence assembly, two different types can be distinguished:
- de-novo: assembling short reads to create full-length (sometimes novel) sequences (see de novo transcriptome assembly)
- mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence
In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly because the assembly algorithm needs to compare every read with every other read (an operation that has a complexity of O(n²) but can be reduced to O(n log n)). Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as a template (perhaps with the names of the main characters and a few locations changed), de-novo assemblies are harder in that one would not know beforehand whether the result would become a science book, a novel, a catalogue, etc.
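To illustrate why a backbone helps, here is a minimal mapping sketch in which each read is compared only with the backbone and never with the other reads. The names `best_position` and `mapping_assembly` are hypothetical, and placing reads by fewest mismatches without gaps is a deliberate simplification for this example; it is not how any specific mapping assembler is known to work.

```python
from collections import Counter


def best_position(read, backbone):
    """Offset on the backbone where the read has the fewest mismatches."""
    def mismatches(offset):
        window = backbone[offset:offset + len(read)]
        return sum(r != w for r, w in zip(read, window))
    # Only ungapped placements are tried; ties go to the leftmost offset.
    return min(range(len(backbone) - len(read) + 1), key=mismatches)


def mapping_assembly(reads, backbone):
    """Build a consensus that follows the reads wherever they cover the backbone."""
    columns = [Counter() for _ in backbone]
    for read in reads:
        start = best_position(read, backbone)
        for i, base in enumerate(read):
            columns[start + i][base] += 1
    # Majority vote per column; keep the backbone base where no read maps.
    return "".join(
        col.most_common(1)[0][0] if col else ref
        for col, ref in zip(columns, backbone)
    )
```

The result is similar to the backbone but not necessarily identical: where the reads consistently disagree with the template, the consensus follows the reads.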
Influence of technological changes
The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems, as the underlying algorithms show quadratic or even exponential complexity behaviour with respect to both the number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly, as shorter reads are more difficult to use with repeats or near-identical repeats.
In the earliest days of DNA sequencing, scientists could only obtain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand.
In 1977, the dideoxy termination method (also known as Sanger sequencing) was invented, and until shortly after 2000 the technology was improved up to a point where fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome shotgun sequencing projects where the reads
- are about 800–900 bases long
- contain sequencing artifacts like sequencing and cloning vectors
- have error rates between 0.5 and 10%
With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger ones, like the human genome with approximately 35 million reads, already needed large computing farms and distributed computing.
By 2004/2005, pyrosequencing had been brought to commercial viability by 454 Life Sciences. This new sequencing method generated reads much shorter than those from Sanger sequencing: initially about 100 bases, now 400-500 bases. However, due to its much higher throughput and lower cost compared with Sanger sequencing, the adoption of this technology by genome centers pushed the development of sequence assemblers able to deal with this new type of sequence. The sheer amount of data, coupled with technology-specific error patterns in the reads, delayed the development of assemblers; at the beginning in 2004 only the Newbler assembler from 454 was available. Presented in mid-2007, the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler that could assemble 454 reads and mixtures of 454 reads and Sanger reads; using sequences from different sequencing technologies was subsequently termed hybrid assembly.
Since 2006, the Illumina (previously Solexa) technology has been available and is able to generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project, which needed several years to be produced on hundreds of sequencing machines. Illumina was initially limited to a read length of only 36 bases, making it less suitable for de novo assembly (such as de novo transcriptome assembly), but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 300-400 bp clone. Presented at the end of 2007, the SHARCGS assembler by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads, quickly followed by a number of others.
Later, new technologies like SOLiD from Applied Biosystems were released, and new technologies (e.g. IonTorrent, PacBio) continue to emerge at a rapid rate.
Greedy algorithm
Given a set of sequence fragments, the object is to find the shortest common supersequence.
- calculate pairwise alignments of all fragments
- choose two fragments with the largest overlap
- merge chosen fragments
- repeat the previous two steps until only one fragment is left
The result is not guaranteed to be an optimal solution to the problem.
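The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production assembler: the function names `overlap` and `greedy_assemble` are made up for this example, and the pairwise alignment step is replaced by exact suffix-prefix matching, so sequencing errors and reverse complements are ignored.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair with the largest overlap until one fragment is left."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Score every ordered pair and keep the one with the largest overlap.
        size, a, b = max(
            ((overlap(a, b), a, b) for a, b in permutations(frags, 2)),
            key=lambda t: t[0],
        )
        # Merge the chosen pair (plain concatenation when size == 0).
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[size:])
    return frags[0] if frags else ""
```

For example, `greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATG"])` merges the two fragments sharing `GCGT` first and then appends the third. Because every round rescans all pairs, the sketch is far too slow for real data sets, and on inputs with repeats the locally best merge can lead to a longer final sequence than necessary.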
Available assemblers
The following table lists assemblers that have a de-novo assembly capability on at least one of the supported technologies.
Name | Type | Technologies | Author | Presented / Last updated | Licence* | Homepage |
---|---|---|---|---|---|---|
ABySS | (large) genomes | Solexa, SOLiD | Simpson, J. et al. | 2008 / 2011 | OS | link |
ALLPATHS-LG | (large) genomes | Solexa, SOLiD | Gnerre, S. et al. | 2011 | OS | link |
AMOS | genomes | Sanger, 454 | Salzberg, S. et al. | 2002? / 2008? | OS | link |
Celera WGA Assembler / CABOG | (large) genomes | Sanger, 454, Solexa | Myers, G. et al.; Miller G. et al. | 2004 / 2010 | OS | link |
CLC Genomics Workbench | genomes | Sanger, 454, Solexa, SOLiD | CLC bio | 2008 / 2010 | C | link |
Cortex | genomes | Solexa, SOLiD | Iqbal, Z. et al. | 2011 | OS | link |
DNA Dragon | genomes | Illumina, SOLiD, Complete Genomics, 454, Sanger | SequentiX | 2011 | C | link |
DNAnexus | genomes | Illumina, SOLiD, Complete Genomics | DNAnexus | 2011 | C | link |
Edena | genomes | Solexa | D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. | 2008 | C | link |
Euler | genomes | Sanger, 454 | Pevzner, P. et al. | 2001 / 2006? | (C / NC-A?) | link |
Euler-sr | genomes | 454, Solexa | Chaisson, MJ. et al. | 2008 | NC-A | link |
Forge | (large) genomes, EST, metagenomes | 454, Solexa, SOLiD, Sanger | Platt, DM, Evers, D. | 2010 | OS | link |
Geneious | genomes | Sanger, 454, Solexa | Biomatters Ltd | 2009 / 2010 | C | link |
Graph Constructor | (large) genomes | Sanger, 454, Solexa, SOLiD | Convey Computer Corporation | 2011 | C | link |
IDBA (Iterative De Bruijn graph short read Assembler) | (large) genomes | Sanger, 454, Solexa | Yu Peng, Henry C. M. Leung, Siu-Ming Yiu, Francis Y. L. Chin | 2010 | (C / NC-A?) | link |
MIRA (Mimicking Intelligent Read Assembly) | genomes, ESTs | Sanger, 454, Solexa | Chevreux, B. | 1998 / 2011 | OS | link |
NextGENe | (small genomes?) | 454, Solexa, SOLiD | Softgenetics | 2008 | C | link |
Newbler | genomes, ESTs | 454, Sanger | 454/Roche | 2009 | C | link |
PASHA | (large) genomes | Illumina | Liu, Schmidt, Maskell | 2011 | OS | link |
Phrap | genomes | Sanger, 454 | Green, P. | 2002 / 2003 / 2008 | C / NC-A | link |
TIGR Assembler | genomic | Sanger | - | 1995 / 2003 | OS | ftp://ftp.jcvi.org/pub/software/assembler/ |
Ray | genomes | Illumina, mix of Illumina and 454, paired or not | Sébastien Boisvert, François Laviolette & Jacques Corbeil. | 2010 | OS [GNU General Public License] | link |
Sequencher | genomes | traditional and next generation sequence data | Gene Codes Corporation | 1991 / 2009 / 2011 | C | link |
SeqMan NGen | (large) genomes, exomes, transcriptomes, metagenomes, ESTs | Illumina, ABI SOLiD, Roche 454, Ion Torrent, Solexa, Sanger | DNASTAR | 2007 / 2011 | C | link |
SHARCGS | (small) genomes | Solexa | Dohm et al. | 2007 / 2007 | OS | link |
SOPRA | genomes | Illumina, SOLiD, Sanger, 454 | Dayarian, A. et al. | 2010 / 2011 | OS | link |
SSAKE | (small) genomes | Solexa (SOLiD? Helicos?) | Warren, R. et al. | 2007 / 2007 | OS | link |
SOAPdenovo | genomes | Solexa | Li, R. et al. | 2009 / 2009 | Closed | link |
Staden gap4 package | BACs | Sanger | Staden et al. | 1991 / 2008 | OS | link |
Taipan | (small) genomes | Illumina | Schmidt, B. et al. | 2009 | OS | link |
VCAKE | (small) genomes | Solexa (SOLiD?, Helicos?) | Jeck, W. et al. | 2007 / 2007 | OS | link |
Phusion assembler | (large) genomes | Sanger | Mullikin JC, et al. | 2003 | OS | link |
Quality Value Guided SRA (QSRA) | genomes | Sanger, Solexa | Bryant DW, et al. | 2009 | OS | link |
Velvet | (small) genomes | Sanger, 454, Solexa, SOLiD | Zerbino, D. et al. | 2007 / 2009 | OS | link |
See also
- Sequence alignment
- Genome assembly
- GenomeABC : A server for Benchmarking of Genome Assemblers.