Phrap
Phrap is a widely used program for DNA sequence assembly. It is part of the Phred-Phrap-Consed package.
History
Phrap was originally developed by Prof. Phil Green for the assembly of cosmids in large-scale cosmid shotgun sequencing within the Human Genome Project. Phrap has been widely used for many different sequence assembly projects, including bacterial genome assemblies and EST assemblies.
Phrap was written as a command-line program for easy integration into automated data workflows in genome sequencing centers. For users who want to use Phrap from a graphical interface, the commercial programs MacVector (for Mac OS X only) and CodonCode Aligner (for Mac OS X and Microsoft Windows) are available.
Methods
A detailed (albeit partially outdated) description of the Phrap algorithms can be found in the Phrap documentation. A recurring thread within the Phrap algorithms is the use of Phred quality scores. Phrap used quality scores to mitigate a problem that other assembly programs had struggled with at the beginning of the Human Genome Project: correctly assembling frequent imperfect repeats, in particular Alu sequences. Phrap uses quality scores to tell whether observed differences in repeated regions are likely to be due to random ambiguities in the sequencing process, or more likely to be due to the sequences coming from different copies of the Alu repeat. Typically, Phrap had no problems differentiating between the different Alu copies in a cosmid, and correctly assembled the cosmids (or, later, BACs).
The logic is simple: a base call with a high probability of being correct should never be aligned with another high-quality but different base. However, Phrap does not rule out such alignments entirely, and the cross_match gap and alignment penalties used while looking for local alignments are not always optimal for typical sequencing errors and a search for overlapping (contiguous) sequences. (Affine gaps are helpful for homology searches but not usually for sequencing error alignment.) Phrap attempts to classify chimeras, vector sequences and low-quality end regions all in a single alignment and will sometimes make mistakes. Furthermore, Phrap builds the assembly in more than one internal round, and later rounds are less stringent (a greedy algorithm).
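The reasoning about discrepant bases described above can be illustrated with a small sketch. This is not Phrap's actual scoring scheme: the function name, the input layout (two already aligned, gap-free reads) and the quality threshold of 30 are all assumptions made for illustration. The sketch only counts positions where two aligned reads disagree even though both base calls are of high quality, which is the kind of evidence that points to different repeat copies rather than sequencing errors.

```python
def count_high_quality_discrepancies(bases_a, quals_a, bases_b, quals_b,
                                     threshold=30):
    """Count mismatches where BOTH base calls have quality >= threshold.

    Hypothetical helper: inputs are two aligned reads of equal length
    (no gaps), given as base strings plus per-base Phred quality lists.
    The threshold of 30 is an illustrative assumption, not a Phrap default.
    """
    if not (len(bases_a) == len(quals_a) == len(bases_b) == len(quals_b)):
        raise ValueError("aligned reads and quality lists must have equal length")
    count = 0
    for base_a, q_a, base_b, q_b in zip(bases_a, quals_a, bases_b, quals_b):
        if base_a != base_b and q_a >= threshold and q_b >= threshold:
            count += 1
    return count


# Two mismatches: one between low-quality calls (probably a sequencing
# error), one between high-quality calls (suggests different repeat copies).
n = count_high_quality_discrepancies(
    "ACGTA", [40, 12, 40, 45, 40],
    "ATGCA", [40, 40, 40, 50, 40],
)
print(n)
```

A real assembler would weigh such evidence over a whole alignment instead of applying a hard cutoff; the cutoff here just keeps the example short.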
These design choices were helpful in the 1990s, when the program was originally written (at Washington University in Saint Louis, USA), but are less so now. Phrap appears error-prone in comparison with newer assemblers like Euler, and it cannot use mate-pair information directly to guide assembly or to assemble past perfect repeats. Phrap is not free software, so it has not been extended and enhanced like less restricted open-source sequence assembly software.
Quality-based consensus sequences
Another use of Phred quality scores by Phrap that contributed to the program's success was the determination of consensus sequences using sequence qualities. In effect, Phrap automated a step that was a major bottleneck in the early phases of the Human Genome Project: determining the correct consensus sequence at all positions where the assembled sequences had discrepant bases. This approach had been suggested by Bonfield and Staden in 1995, and was implemented and further optimized in Phrap. At any consensus position with discrepant bases, Phrap examines the quality scores of the aligned sequences to find the highest-quality sequence. In the process, Phrap takes confirmation of local sequence by other reads into account, after considering direction and sequencing chemistry.
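The basic selection step can be sketched as follows. This is a deliberately simplified illustration, not Phrap's implementation: the function name and the representation of an alignment column as a list of (base, quality) pairs are assumptions, and the sketch ignores confirmation by other reads, read direction and sequencing chemistry.

```python
def pick_consensus_base(column):
    """Return the (base, quality) pair with the highest quality.

    Hypothetical helper: `column` is a list of (base, phred_quality)
    tuples, one per read covering this consensus position.
    """
    if not column:
        raise ValueError("column must contain at least one base call")
    return max(column, key=lambda call: call[1])


# Three reads cover the position; two low-quality reads say "A",
# one high-quality read says "G". The high-quality call wins.
base, quality = pick_consensus_base([("A", 12), ("G", 45), ("A", 20)])
print(base, quality)
```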
The mathematics of this approach is rather simple, since Phred quality scores are logarithmically linked to error probabilities. This means that the quality scores of confirming reads can simply be added, as long as the error distributions are sufficiently independent. To satisfy this independence criterion, reads must typically be in different directions, since peak patterns that cause base-calling errors are often identical when a region is sequenced several times in the same direction.
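A short sketch shows why the scores add. It uses the standard Phred definition, in which a quality Q corresponds to an error probability of 10^(-Q/10), and makes the simplifying assumption that a confirmed base call is wrong only if both independent reads are wrong. The function names are illustrative and not part of Phrap.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def combined_quality(q_forward, q_reverse):
    """Quality of a base confirmed by two independent reads.

    Assumes independent errors (for example, reads from opposite
    strands), so the error probabilities multiply. Because the scores
    are logarithms of those probabilities, this equals q_forward + q_reverse.
    """
    p_both_wrong = phred_to_error_prob(q_forward) * phred_to_error_prob(q_reverse)
    return error_prob_to_phred(p_both_wrong)


# A Q20 call (1 error in 100) confirmed by an independent Q30 call
# (1 error in 1,000) gives about Q50 (1 error in 100,000).
print(round(combined_quality(20, 30), 6))
```

If the two reads were in the same direction, their errors would tend to be correlated and the product of probabilities would overstate the confidence, which is why the independence condition matters.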
If a consensus base is covered by both high-quality sequence and (discrepant) low-quality sequence, Phrap's selection of the higher-quality sequence will in most cases be correct. Phrap then assigns the confirmed base quality to the consensus sequence base. This makes it easy (a) to find consensus regions that are not covered by high-quality sequence (these regions will also have low consensus quality), and (b) to quickly calculate a reasonably accurate estimate of the error rate of the consensus sequence. This information can then be used to direct finishing efforts, for example re-sequencing of problem regions.
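Both uses of consensus qualities can be sketched in a few lines. The function names, the quality threshold of 20 and the example values are assumptions for illustration; the only fact relied on is the Phred relation between a quality Q and an error probability of 10^(-Q/10).

```python
def expected_errors(consensus_quals):
    """Expected number of wrong bases, summing per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in consensus_quals)


def low_quality_regions(consensus_quals, threshold=20):
    """Return half-open (start, end) index ranges with quality < threshold.

    The threshold of 20 is an illustrative assumption.
    """
    regions = []
    start = None
    for i, q in enumerate(consensus_quals):
        if q < threshold:
            if start is None:
                start = i          # a low-quality run begins here
        elif start is not None:
            regions.append((start, i))  # the run ended before position i
            start = None
    if start is not None:
        regions.append((start, len(consensus_quals)))
    return regions


quals = [40, 10, 15, 40, 40, 5]
print(low_quality_regions(quals))               # candidate regions for finishing
print(expected_errors(quals) / len(quals))      # estimated per-base error rate
```

The regions returned by such a scan are the natural targets for finishing work such as re-sequencing.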
The combination of accurate, base-specific quality scores and a quality-based consensus sequence was a critical element in the success of the Human Genome Project. Phred and Phrap, and similar programs that picked up on the ideas pioneered by these two programs, enabled the assembly of large parts of the human genome (and many other genomes) at an accuracy (less than 1 error in 10,000 bases) that was substantially higher than the typical accuracy of carefully hand-edited sequences previously submitted to the GenBank database.