RNA-seq, also called "Whole Transcriptome Shotgun Sequencing" ("WTSS") and dubbed "a revolutionary tool for transcriptomics", refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content, a technique that is quickly becoming invaluable in the study of diseases like cancer. Thanks to the deep coverage and base-level resolution provided by next-generation sequencing instruments, RNA-seq provides researchers with efficient ways to measure transcriptome data experimentally, allowing them to get information such as how different alleles of a gene are expressed, detect post-transcriptional mutations or identify gene fusions.


The introduction of next-generation sequencing, or high-throughput sequencing, technologies opened new doors in the field of DNA sequencing. As understanding of these technologies becomes more widespread and new tools are developed, innovative ways of applying them are being created.

Given the low input requirements of high-throughput sequencing technologies, together with their deep coverage and base-scale resolution, their use has expanded to the field of transcriptomics. Transcriptomics is an area of research characterizing the RNA transcribed from a particular genome under investigation. Although transcriptomes are more dynamic than genomic DNA, these molecules provide direct access to gene regulation and protein information. Sequencing transcriptomes is not a new idea. Various methods have been developed previously to directly determine cDNA sequences, based mostly around traditional (and more expensive) Sanger sequencing, while others include methodologies such as serial analysis of gene expression (SAGE), cap analysis gene expression (CAGE) and massively parallel signature sequencing (MPSS).


Transcriptome sequencing (RNA-seq) can be done with a variety of platforms to test a wide range of ideas and hypotheses. For example, recent applications include sequencing mammalian transcriptomes with the Illumina Genome Analyzer platform, profiling stem cell transcriptomes with ABI SOLiD sequencing, and discovering SNPs in maize with 454 Sequencing (454 Life Sciences). Even though each platform has its technical differences, the information gathered from each is of the same nature.

RNA Poly(A) Library

Creation of a sequence library can vary from platform to platform in high-throughput sequencing. Each platform has several kits designed to build different types of libraries and to adapt the resulting sequences to the specific requirements of its instruments.

However, due to the nature of the template being analyzed, there are commonalities within each technology. Frequently, in mRNA analysis the 3' polyadenylated (poly(A)) tail is targeted in order to ensure that coding RNA is separated from noncoding RNA. This can be accomplished simply with poly(T) oligos covalently attached to a given substrate. Presently many studies utilize magnetic beads for this step. The Protocol Online website provides a list of several protocols relating to mRNA isolation.

Studies including portions of the transcriptome outside poly(A) RNAs have shown that when using poly(T) magnetic beads, the flow-through RNA (non-poly(A) RNA) can yield important noncoding RNA gene discovery which would have otherwise gone unnoticed.

Also, since ribosomal RNA represents over 90% of the RNA within a given cell, studies have shown that its removal via probe hybridization increases the capacity to retrieve data from the remaining portion of the transcriptome.

The next step is reverse transcription. Randomly primed reverse transcription has a 5' bias, and secondary structures influence primer binding sites; hydrolysis of the RNA into fragments of 200-300 nucleotides prior to reverse transcription reduces both problems simultaneously. However, there are trade-offs with this method: although the overall body of the transcripts is efficiently converted to DNA, the 5' and 3' ends are less so. Depending on the aim of the study, researchers may choose to apply or ignore this step.

Once the cDNA is synthesized, it can be further fragmented to reach the desired fragment length as specified in Table 1. The template is now ready to be prepared for the desired sequencing method.

Next-generation sequencing

High-throughput sequencing technologies generate millions of short reads from a library of nucleotide sequences. Whether the sequences come from DNA, RNA, or a mixture, the sequencing mechanism of each platform does not vary. The most used technologies and some of their characteristics are shown in the following table.
| | 454 Sequencing | Illumina | SOLiD |
|---|---|---|---|
| Sequencing chemistry | Pyrosequencing | Polymerase-based sequence-by-synthesis | Ligation-based sequencing |
| Amplification approach | Emulsion PCR | Bridge amplification | Emulsion PCR |
| Paired end separation | 3 kb | 200 bp | 3 kb |
| Mb per run | 100 Mb | 300 Gb | 3000 Mb |
| Time per paired end run | 7 hours | 7-14 days | 5 days |
| Read length (update) | 250 bp (400 bp) | 100 bp (50-100 bp) | 35 bp (35-50 bp) |
| Cost per run | $8,438 USD | $11,750 USD | $17,447 USD |
| Cost per Mb | $84.39 USD | $1.00 USD | $5.81 USD |

Table 1. Comparing metrics and performance of next-generation DNA sequencers

Direct RNA Sequencing (DRS™) from Helicos' Company

Methods for in-depth characterization of transcriptomes and quantification of transcript levels have emerged as valuable tools for understanding cellular physiology and human disease biology, and have begun to be utilized in various clinical diagnostic applications. Current methods, however, typically require RNA to be converted to complementary DNA (cDNA) via reverse transcription prior to measurements. This step has been shown to introduce biases and artifacts that may interfere with both the proper characterization and quantification of transcripts. To enable a virtually unbiased view and quantification of transcriptomes, Helicos has developed and commercialized the single molecule Direct RNA Sequencing (DRS™) technology. DRS™ is the first and currently the only technology that can sequence RNA molecules directly in a massively parallel manner without RNA conversion to cDNA or other biasing sample manipulations such as ligation and amplification.

Transcriptome alignment

Due to the small size of the short reads (for the Illumina Genome Analyzer this can be around 36 bases), de novo assembly may be difficult, though some software does exist, such as Velvet. Reads this short cannot share the large overlaps needed to easily reconstruct the original sequences, and the deep coverage makes the computing power required to track all the possible alignments prohibitive. This can be somewhat overcome by obtaining larger sequences from the same sample using other techniques such as Sanger sequencing, and using the larger reads as a "skeleton" or a "template" to help assemble reads in difficult regions (e.g. regions with repetitive sequences).

The recommended approach is to align the millions of reads to a "reference genome". There are many tools available for aligning genomic reads to a reference genome (sequence alignment tools); however, special attention is needed when aligning a transcriptome to a genome, mainly when dealing with genes having intronic regions.

As discussed above, the sequence libraries are created by extracting mRNA via its poly(A) tail, which is added to the mRNA molecule post-transcriptionally, so splicing has already taken place. Therefore, the created library and the short reads obtained cannot come from intronic sequences. As a result, when trying to align these short reads to a reference genome, only short reads aligning entirely inside exonic regions will be matched, while short reads spanning exon-exon junctions will not.

A possible method to work around this is to try to align the unaligned short reads using a proxy genome generated with known exonic sequences. This need not cover whole exons, only enough so that the short reads can match on both sides of the exon-exon junction with minimum overlap. The use of paired-end sequencing has been mentioned as a good solution to alignment problems since, besides giving longer reads, it also provides information about the strand.
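The proxy-genome idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular aligner: the function name `build_junction_library`, the `min_overlap` parameter and the toy exon sequences are all hypothetical. Each proxy sequence keeps `read_length - min_overlap` bases from either side of a junction, so a read that aligns to it must cover at least `min_overlap` bases of both exons.

```python
def build_junction_library(exons, read_length, min_overlap=4):
    """Build proxy sequences spanning exon-exon junctions of one gene.

    exons: list of exon sequences in transcript (5' to 3') order.
    Returns a dict mapping (upstream_index, downstream_index) to the
    junction sequence. All ordered exon pairs are included so that
    exon-skipping isoforms are also represented (an assumption of this
    sketch).
    """
    if not 0 < min_overlap < read_length:
        raise ValueError("min_overlap must be between 1 and read_length - 1")
    flank = read_length - min_overlap  # bases kept on each side
    junctions = {}
    for i in range(len(exons) - 1):
        for j in range(i + 1, len(exons)):
            left = exons[i][-flank:]   # 3' end of the upstream exon
            right = exons[j][:flank]   # 5' end of the downstream exon
            junctions[(i, j)] = left + right
    return junctions


# Toy example with made-up exon sequences and very short reads.
library = build_junction_library(
    ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"], read_length=6, min_overlap=2
)
print(library[(0, 1)])
```

Reads left unaligned against the reference genome would then be aligned against these junction sequences with an ordinary short-read aligner.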

Several software packages exist for short read alignment, and recently specialized algorithms for transcriptome alignment have been developed, e.g. TopHat and Cufflinks.

Gene expression

The characterization of gene expression in cells via measurement of mRNA levels has long been of interest to researchers. It has been shown that, due to post-transcriptional gene regulation events (such as RNA interference), there is not a strong correlation between the abundance of mRNA and that of the related proteins. Even so, measuring mRNA concentration levels is still a useful tool in determining how the transcriptional machinery of the cell is affected in the presence of external signals (e.g. drug treatment), or how cells differ between a healthy state and a diseased state.

Microarray approach

Prior to RNA-seq, DNA microarrays were unchallenged as the experiment of choice for transcriptome analysis. Although many experiments still use microarrays to generate exciting results, and the time needed to retrieve results for a given sample is shorter, intrinsic experimental limitations of microarrays seem to make RNA-seq the method of choice. One important limitation is that sequence information is a prerequisite for detecting and ultimately evaluating transcripts. As research in the field of RNA-seq is growing steadily with promising and consistent results, one must now consider, "Is this the beginning of the end for microarrays?"

Still, for many applications, microarrays are the method of choice. Not only are they 10-100 times cheaper when compared at the same resolution of accuracy (less than $100 per array for high-throughput applications), but RNA-seq protocols still suffer from unknown biases such as those implied by the required ligation steps. Another limitation for quantitative expression profiling is that, by design, high-abundance transcripts (such as those from housekeeping genes) are responsible for the majority of the sequencing data. In a typical sample, 5% of the genes give rise to 75% of the reads sequenced. As a consequence, it is hard to measure the abundances of the remaining genes reliably, and the majority of transcript measurements are very noisy.
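The concentration of reads in a small fraction of genes can be checked on any table of per-gene read counts. The following Python sketch shows one way to do so; the function name `top_fraction_read_share` and the toy counts are hypothetical and chosen only for illustration.

```python
import math


def top_fraction_read_share(read_counts, top_fraction=0.05):
    """Return the share of all reads contributed by the most highly
    expressed genes.

    read_counts: iterable of per-gene read counts.
    top_fraction: fraction of genes (ranked by read count) to consider.
    """
    counts = sorted(read_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads in the sample")
    # Always look at one gene at least, even for very small gene sets.
    n_top = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:n_top]) / total


# Toy sample: 20 genes, one of which dominates the library.
toy_counts = [760] + [10] * 19
print(top_fraction_read_share(toy_counts))
```

A share close to the 75% quoted above would indicate that most of the sequencing depth is being spent on a handful of abundant transcripts.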

Coverage as measure of expression

Expression can be deduced via RNA-seq to the extent at which a sequence is retrieved. Transcriptome studies in yeast show that in this experimental setting, a fourfold coverage is required for amplicons to be classified and characterized as an expressed gene. When the transcriptome is fragmented prior to cDNA synthesis, the number of reads corresponding to the particular exon normalized by its length in vivo yields gene expression levels which correlate with those obtained through qPCR.
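Both ideas in this paragraph, a coverage threshold and length normalization, reduce to simple arithmetic. The Python sketch below is an illustration only: the function names, the reads-per-kilobase unit and the toy numbers are assumptions, and only the fourfold threshold comes from the text above.

```python
def reads_per_kilobase(read_counts, exon_lengths):
    """Normalize read counts by exon length (reads per 1000 bases).

    read_counts: dict mapping exon name to number of mapped reads.
    exon_lengths: dict mapping exon name to exon length in bases.
    """
    return {
        exon: read_counts[exon] / (exon_lengths[exon] / 1000)
        for exon in read_counts
    }


def is_expressed(n_reads, read_length, feature_length, min_coverage=4.0):
    """Apply a coverage threshold: average coverage is the total number
    of sequenced bases divided by the length of the feature."""
    coverage = n_reads * read_length / feature_length
    return coverage >= min_coverage


# Two hypothetical exons with equal read counts but different lengths.
counts = {"exon_1": 200, "exon_2": 200}
lengths = {"exon_1": 500, "exon_2": 2000}
print(reads_per_kilobase(counts, lengths))
print(is_expressed(n_reads=50, read_length=36, feature_length=400))
```

Without the length normalization, the longer exon would appear as highly expressed as the shorter one simply because it offers more positions for reads to fall on.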

Single nucleotide variation discovery

Transcriptome single nucleotide variation has been analyzed in maize on the Roche 454 sequencing platform. Directly from the transcriptome analysis, around 7000 single nucleotide polymorphisms (SNPs) were recognized. Following Sanger sequence validation, the researchers were able to conservatively obtain almost 5000 valid SNPs covering more than 2400 maize genes. This impressive transcriptome analysis is currently being applied to cancer research and microbiology which could quite possibly lead to new forms of medicine.


Coverage/depth can affect the mutations seen, and given that everything is expression-centric, an allele might not be detected either because it is not in the genome or because it is not being expressed. At the same time, RNA-seq can yield more information than just the existence of a heterozygous gene, as it can also help in estimating the expression of each allele. In association studies, genotypes are associated with disease, and expression levels can also be associated with disease. Using RNA-seq, we can measure the relationship between these two associated variables, that is, the relative proportion in which each allele is expressed.
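The relative expression of two alleles at a heterozygous site can be estimated from the reads that carry each allele. The sketch below is a minimal illustration; the exact binomial test against a balanced 50:50 expectation is an assumption of this example rather than a method stated in the text, and the function names are hypothetical.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads at a site that carry the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def binomial_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for the hypothesis that both
    alleles are expressed in proportion p : (1 - p)."""
    n = ref_reads + alt_reads
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum the probability of every outcome that is no more likely than
    # the observed one (small tolerance for floating-point ties).
    tail = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical site covered by 30 reference and 10 alternative reads.
print(allelic_ratio(30, 10))
print(binomial_two_sided_p(30, 10))
```

A small p-value suggests skewed expression of one allele, although low coverage limits how much can be concluded from any single site.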

The depth of sequencing required for specific applications can be extrapolated from a pilot experiment.

Germline vs expressed alleles

The only way to be absolutely sure of the individual's mutations is to compare the transcriptome sequences to the germline DNA sequence. This enables the distinction of homozygous genes versus skewed expression of one of the alleles and it can also provide information about genes that were not expressed in the transcriptomic experiment.

Post-transcriptional SNVs

Having the matching genomic and transcriptomic sequences of an individual can also help in detecting post-transcriptional edits: if the individual is homozygous for a gene, but the gene's transcript carries a different allele, this indicates a post-transcriptional modification event.
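The comparison described here can be written as a simple filter over matched genotype and transcript calls. The following Python sketch assumes hypothetical inputs: a dictionary of germline genotypes and a dictionary of alleles observed in the RNA reads, both keyed by position. It ignores sequencing error, which a real analysis would have to model.

```python
def candidate_rna_edits(genotypes, transcript_alleles):
    """Report positions where the germline is homozygous but the
    transcript shows a different allele.

    genotypes: dict mapping position to a tuple of two germline alleles.
    transcript_alleles: dict mapping position to the set of alleles
        observed in RNA reads at that position.
    Returns a dict mapping position to the sorted unexpected alleles.
    """
    candidates = {}
    for position, (allele_1, allele_2) in genotypes.items():
        if allele_1 != allele_2:
            # Heterozygous site: a second allele in the RNA is expected.
            continue
        observed = transcript_alleles.get(position, set())
        unexpected = observed - {allele_1}
        if unexpected:
            candidates[position] = sorted(unexpected)
    return candidates


# Made-up calls at three positions.
germline = {100: ("A", "A"), 200: ("C", "T"), 300: ("G", "G")}
rna = {100: {"A", "G"}, 200: {"C"}, 300: {"G"}}
print(candidate_rna_edits(germline, rna))
```

Positions missing from the transcript data are skipped, since a gene that is not expressed provides no evidence either way.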

mRNA-centric single nucleotide variants (SNVs) are generally not considered a representative source of functional variation in cells, mainly because these mutations disappear with the mRNA molecule. However, the fact that efficient DNA correction mechanisms do not apply to RNA molecules can cause them to appear more often. This has been proposed as the source of certain prion diseases, also known as transmissible spongiform encephalopathies (TSEs).

Fusion gene detection

Caused by different structural modifications in the genome, fusion genes have gained attention because of their relationship with cancer. The ability of RNA-seq to analyze a sample's whole transcriptome in an unbiased fashion makes it an attractive tool to find these kinds of common events in cancer.

The idea follows from the process of aligning the short transcriptomic reads to a reference genome. Most of the short reads will fall within one complete exon, and a smaller but still large set would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes. This would be evidence of a possible fusion event; however, because of the length of the reads, this could prove to be very noisy. An alternative approach is to use paired-end reads, where a potentially large number of paired reads would map each end to a different exon, giving better coverage of these events (see figure). Nonetheless, the end result consists of multiple and potentially novel combinations of genes, providing an ideal starting point for further validation.
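The paired-end approach can be sketched as counting read pairs whose two ends map to different genes. The Python example below is an illustration under assumptions: the input format (one tuple of gene names per read pair, with `None` for an unmapped end), the function name and the `min_support` threshold are all hypothetical.

```python
from collections import Counter


def fusion_candidates(read_pairs, min_support=2):
    """Count read pairs whose ends map to two different genes.

    read_pairs: iterable of (gene_of_end_1, gene_of_end_2) tuples, with
        None marking an end that did not map to any gene.
    min_support: minimum number of supporting pairs required to report
        a gene pair, used here as a crude noise filter.
    Returns a dict mapping a sorted gene pair to its supporting count.
    """
    support = Counter()
    for gene_1, gene_2 in read_pairs:
        if gene_1 is None or gene_2 is None or gene_1 == gene_2:
            continue
        # Sort so that (A, B) and (B, A) count as the same candidate.
        support[tuple(sorted((gene_1, gene_2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Made-up mappings for five read pairs.
pairs = [
    ("geneA", "geneB"),
    ("geneB", "geneA"),
    ("geneC", "geneC"),
    ("geneD", None),
    ("geneE", "geneF"),
]
print(fusion_candidates(pairs))
```

Gene pairs that pass the filter are only candidates; as noted above, they are a starting point for further validation rather than confirmed fusions.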

Some considerations

The information gathered when sequencing a sample's transcriptome in this way has many of the same limitations as other RNA expression analysis pipelines. Mainly, the information gathered is:

a) Tissue specific: Gene expression is not uniform throughout an organism's cells; it is strongly dependent on the tissue type being measured;

b) Time dependent: During a cell's lifetime and context, its gene expression levels change.

Because of this, care must be taken when drawing conclusions from the sequencing experiment, as some of the information gathered might not be representative of the individual itself.

An example of this arises during SNV discovery, as the mutations discovered are more precisely the mutations being expressed. That is, observing a position that appears homozygous for a non-reference allele does not necessarily mean that this is the individual's genotype; it could just mean that the gene copy with the reference allele is not being expressed in that tissue and/or at the time the sample was acquired.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.