RNA-Seq

From Wikipedia, the free encyclopedia

RNA-Seq, also called "Whole Transcriptome Shotgun Sequencing" ^[1] ("WTSS") and dubbed "a revolutionary tool for transcriptomics" ^[2], refers to the use of high-throughput sequencing technologies to sequence cDNA in order to obtain information about a sample's RNA content. The technique is quickly becoming invaluable in the study of diseases such as cancer ^[3]. Thanks to the deep coverage and base-level resolution provided by next-generation sequencing instruments, RNA-Seq gives researchers an efficient way to measure transcriptome data experimentally, allowing them to determine how different alleles of a gene are expressed, detect post-transcriptional mutations, or identify gene fusions ^[3].

1 Introduction
2 Methods
3 Analysis
4 References

Introduction

The introduction of next-generation sequencing (high-throughput sequencing) technologies opened new doors in the field of DNA sequencing. As understanding of these technologies becomes more widespread and new tools are developed, innovative ways of applying them are also being created ^[4].

Because high-throughput sequencing technologies require only small amounts of nucleotide sequence material and offer deep coverage and base-scale resolution, their use has expanded to the field of transcriptomics ^[2]. Transcriptomics is the area of research that characterizes the RNA transcribed from a particular genome under investigation. Although transcriptomes are more dynamic than genomic DNA, these molecules provide direct access to gene regulation and protein information. Sequencing transcriptomes is not a new idea. Various methods were developed previously to determine cDNA sequences directly, most of them based on traditional (and more expensive) Sanger sequencing, while others include serial analysis of gene expression (SAGE), cap analysis gene expression (CAGE) and massively parallel signature sequencing (MPSS).

Transcriptome sequencing (RNA-Seq) can be done on a variety of platforms to test a wide range of ideas and hypotheses. Recent applications include sequencing mammalian transcriptomes on the Illumina Genome Analyzer platform ^[5], profiling stem cell transcriptomes with ABI SOLiD sequencing ^[6], and discovering SNPs in maize with 454 Sequencing from 454 Life Sciences ^[7]. Even though each platform has its technical particularities, the information gathered from each is of the same nature.

Methods

RNA Poly(A) Library

Next-generation sequencing

High-throughput sequencing technologies generate millions of short reads from a library of nucleotide sequences. Whether the library comes from DNA, RNA, or a mixture, the sequencing mechanism of each platform does not vary. The most widely used technologies and some of their characteristics are shown in the following table ^[2] ^[9].

Metric	454 Sequencing ^[10]	Illumina ^[11]	SOLiD ^[12]
Sequencing chemistry	Pyrosequencing	Polymerase-based sequence-by-synthesis	Ligation-based sequencing
Amplification approach	Emulsion PCR	Bridge amplification	Emulsion PCR
Paired end separation	3 kb	up to 10 kb	3 kb
Mb per run	100 Mb	1300 Mb	3000 Mb
Time per paired end run	7 hours	4 days	5 days
Read length (update)	250 bp (400)	35, 75 and 100 bp	35 and 50 bp
Cost per run	$8,438 USD	$8,950 USD	$17,447 USD
Cost per Mb	$84.39 USD	$5.97 USD	$5.81 USD

Table 1. Comparing metrics and performance of next-generation DNA sequencers ^[9]

RNA-Seq mapping of short reads in exon-exon junctions.

Transcriptome alignment

Because the reads are short (around 42 bases for the Illumina Genome Analyzer), de novo assembly may be difficult, although some software exists for it, such as Velvet. Reads cannot share the large overlaps needed to reconstruct the original sequences easily, and the deep coverage makes the computing power required to track all possible alignments prohibitive ^[13]. This can be partly overcome by obtaining longer sequences from the same sample with other techniques, such as Sanger sequencing, and using these longer reads as a "skeleton" or "template" to help assemble reads in difficult regions (e.g. regions with repetitive sequences).
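The overlap problem can be illustrated with a toy de Bruijn graph, the data structure named in the title of the Velvet reference. The following is a minimal sketch, not Velvet's actual implementation: the function names, the reads and the choice of k are all made up for illustration, and sequencing errors and coverage are ignored.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Link each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while exactly one successor exists."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle, as a repeat would produce
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads drawn from one short toy sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

The walk stops as soon as a node has more than one successor or closes a cycle, which is what happens in repetitive regions and is one reason longer "template" reads help there.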

The recommended approach is to align the millions of reads to a "reference genome". Many tools are available for aligning genomic reads to a reference genome (sequence alignment tools). However, special attention is needed when aligning a transcriptome to a genome, particularly for genes that contain introns.

The sequence libraries are created by extracting mRNA through its poly(A) tail, which is added to the mRNA molecule post-transcriptionally, so splicing has already taken place. The resulting library and the short reads obtained from it therefore cannot come from intronic sequences. When these short reads are aligned to a reference genome, only reads lying entirely inside exonic regions will be matched, while reads that span exon-exon junctions will not, because the two exons are separated by an intron in the genome.

One way to work around this is to align the still-unaligned short reads against a proxy genome built from known exonic sequences. The proxy need not cover whole exons, only enough sequence on each side of an exon-exon junction for a short read to match across it with a minimum overlap. Paired-end sequencing has also been suggested as a good solution to alignment problems because, besides effectively giving longer reads, it provides information about the strand ^[5].
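The proxy-genome idea can be sketched as follows. This is an illustrative toy under stated assumptions, not the method of any particular aligner: the sequences are invented, `junction_library` and `align_exact` are hypothetical helper names, only consecutive exons are joined, and alignment is reduced to exact string matching.

```python
def junction_library(exons, read_length, min_overlap):
    """Build one proxy sequence per pair of consecutive exons.

    Each proxy keeps `read_length - min_overlap` bases on either side of the
    junction, so a read matching it must cover at least `min_overlap` bases
    of both exons.
    """
    flank = read_length - min_overlap
    library = {}
    for i in range(len(exons) - 1):
        left, right = exons[i][-flank:], exons[i + 1][:flank]
        library[(i, i + 1)] = left + right
    return library


def align_exact(read, reference):
    """Return the 0-based offset of an exact match, or None if there is none."""
    pos = reference.find(read)
    return pos if pos >= 0 else None


# Toy gene: two exons separated by an intron in the genome (all made up).
exons = ["ATGGCCAAGT", "CCGTTAGGCA"]
intron = "GTAAGTTTTTCAG"
genome = exons[0] + intron + exons[1]

exonic_read = "GGCCAAGT"    # lies entirely inside exon 0
junction_read = "AAGTCCGT"  # last 4 bases of exon 0 + first 4 of exon 1

library = junction_library(exons, read_length=8, min_overlap=2)
print(align_exact(exonic_read, genome))
print(align_exact(junction_read, genome))
print(align_exact(junction_read, library[(0, 1)]))
```

The junction read finds no match in the genome but does match the proxy sequence. Covering alternative splicing would require proxies for non-consecutive exon pairs as well, which this sketch omits.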

Several software packages exist for short read alignment, and specialized algorithms for transcriptome alignment have recently been developed, e.g. TopHat ^[14].

Analysis

Gene expression

The characterization of gene expression in cells via measurement of mRNA levels has long been of interest to researchers. Although other post-transcriptional gene regulation events (such as RNA interference) mean that the abundance of an mRNA is not strongly correlated with that of the related protein ^[15], measuring mRNA concentration is still a useful tool for determining how the transcriptional machinery of the cell is affected by external signals (e.g. drug treatment), or how cells differ between a healthy state and a diseased state.

Microarray approach

Prior to RNA-Seq, DNA microarrays were unchallenged as the experiment of choice for transcriptome analysis. Many experiments still use microarrays, which return results for a given sample more quickly, but intrinsic experimental limitations of microarrays seem to make RNA-Seq the method of choice. One important limitation, among others, is that sequence information is a prerequisite for detecting and evaluating transcripts ^[16]. As research in the field of RNA-Seq grows steadily with promising and consistent results, one must now ask: "Is this the beginning of the end for microarrays?" ^[17].

Coverage as measure of expression

Expression can be deduced from RNA-Seq data by the extent to which a sequence is retrieved. Transcriptome studies in yeast ^[18] show that, in this experimental setting, fourfold coverage is required for amplicons to be classified and characterized as an expressed gene. When the transcriptome is fragmented prior to cDNA synthesis, the number of reads corresponding to a particular exon, normalized by its length in vivo, yields gene expression levels that correlate with those obtained through qPCR ^[19].
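Length normalization can be written out in a few lines. The sketch below uses invented read counts and exon lengths. The additional scaling by the total number of mapped reads is an assumption added here, in the spirit of the reads-per-kilobase-per-million (RPKM) convention, and is not something the text above specifies.

```python
def reads_per_kilobase(read_counts, lengths_bp):
    """Divide each exon's read count by its length in kilobases."""
    return {
        name: read_counts[name] / (lengths_bp[name] / 1000)
        for name in read_counts
    }


def rpkm(read_counts, lengths_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads (assumed convention)."""
    return {
        name: read_counts[name] * 1e9 / (lengths_bp[name] * total_mapped_reads)
        for name in read_counts
    }


# Hypothetical exons with equal read counts but different lengths.
counts = {"exonA": 500, "exonB": 500}
lengths = {"exonA": 2000, "exonB": 500}

rpk = reads_per_kilobase(counts, lengths)
scaled = rpkm(counts, lengths, total_mapped_reads=2_000_000)
print(rpk)
print(scaled)
```

Although both exons collect the same number of reads, the shorter one comes out with four times the normalized expression, which is the effect the normalization is meant to capture.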

Single nucleotide variation discovery

Transcriptome single nucleotide variation has been analyzed in maize on the Roche 454 sequencing platform ^[7]. Directly from the transcriptome analysis, around 7000 single nucleotide polymorphisms (SNPs) were recognized. Following Sanger sequence validation, the researchers were able to conservatively obtain almost 5000 valid SNPs covering more than 2400 maize genes. This kind of transcriptome analysis is currently being applied to cancer research and microbiology, which could lead to new forms of medicine.

Coverage/depth

Coverage (depth) affects which mutations are seen. Because the data are expression-centric, an allele might go undetected either because it is not in the genome or because it is not being expressed. At the same time, RNA-Seq can reveal more than the existence of a heterozygous gene, since it can also help estimate the expression of each allele. In association studies, both genotypes and expression levels can be associated with disease. Using RNA-Seq, the relationship between these two variables can be measured, that is, the proportion in which each allele is being expressed.
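The proportion in which two alleles are expressed can be estimated from the reads covering a heterozygous site. The following is a minimal sketch with made-up read counts. The exact two-sided binomial test against an expected 50:50 ratio is an assumption introduced here for illustration and is not a method described by the sources cited above.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads carrying the reference allele, or None if uncovered."""
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return ref_reads / total


def binomial_two_sided_p(ref_reads, alt_reads, expected=0.5):
    """Exact two-sided binomial p-value for deviation from `expected`."""
    n = ref_reads + alt_reads
    probs = [
        comb(n, k) * expected**k * (1 - expected) ** (n - k)
        for k in range(n + 1)
    ]
    observed = probs[ref_reads]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical site: 30 reads with the reference allele, 10 with the alternative.
ratio = allelic_ratio(30, 10)
p_value = binomial_two_sided_p(30, 10)
print(ratio, p_value)
```

At low coverage the same ratio would not be distinguishable from balanced expression, which is one concrete way in which depth limits what can be concluded.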

The depth of sequencing required for specific applications can be extrapolated from a pilot experiment. ^[19]

Germline vs expressed alleles

The only way to be absolutely sure of an individual's mutations is to compare the transcriptome sequences with the germline DNA sequence. This makes it possible to distinguish homozygous genes from skewed expression of one of the alleles, and it can also provide information about genes that were not expressed in the transcriptomic experiment.

Post-transcriptional SNVs

Having the matching genomic and transcriptomic sequences of an individual can also help in detecting post-transcriptional edits ^[2]: if the individual is homozygous for a gene but the gene's transcript carries a different allele, a post-transcriptional modification event is inferred.
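The comparison between germline and expressed alleles can be summarized as a small decision rule. This is a sketch under assumptions: `classify_site` is a hypothetical helper, the `min_reads` threshold is arbitrary, and a real analysis would also have to account for sequencing and alignment errors.

```python
def classify_site(germline_genotype, rna_allele_counts, min_reads=4):
    """Compare the germline genotype at one site with the alleles seen in RNA.

    germline_genotype: pair of alleles from DNA sequencing, e.g. ("A", "G").
    rna_allele_counts: mapping from allele to number of supporting RNA reads.
    min_reads: hypothetical minimum read support for calling an allele expressed.
    """
    dna_alleles = set(germline_genotype)
    expressed = {a for a, n in rna_allele_counts.items() if n >= min_reads}
    if not expressed:
        return "not expressed or insufficient coverage"
    if expressed - dna_alleles:
        # An allele present in the transcript but absent from the germline DNA.
        return "candidate post-transcriptional change"
    if len(dna_alleles) == 2 and len(expressed) == 1:
        return "heterozygous in DNA, one allele expressed"
    return "consistent with germline"


print(classify_site(("A", "A"), {"A": 20, "G": 15}))
print(classify_site(("A", "G"), {"G": 30}))
print(classify_site(("A", "G"), {"A": 12, "G": 10}))
print(classify_site(("A", "G"), {"A": 1}))
```

Without the germline genotype, the second case would look like a homozygous site in the transcriptome data alone.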

mRNA-centric single nucleotide variants (SNVs) are generally not considered a representative source of functional variation in cells, mainly because these mutations disappear with the mRNA molecule. However, because efficient DNA correction mechanisms do not apply to RNA molecules, such variants may appear more often. This has been proposed as the source of certain prion diseases ^[20], also known as TSEs or transmissible spongiform encephalopathies.

RNA-Seq mapping of short reads over exon-exon junctions. Depending on where each end maps, the read can be defined as a trans or a cis event.

Fusion gene detection

Some considerations

The information gathered when sequencing a sample's transcriptome in this way has many of the same limitations as other RNA expression analysis pipelines. Mainly, the information gathered is:

a) Tissue specific: Gene expression is not uniform throughout an organism's cells; it is strongly dependent on the tissue type being measured;

b) Time dependent: During a cell's lifetime and context, its gene expression levels change.

Because of this, care must be taken when drawing conclusions from the sequencing experiment, as some of the information gathered might not be representative of the individual itself.

An example of this arises during SNV discovery, because the mutations discovered are, more precisely, the mutations being expressed. Observing a location that appears homozygous for a non-reference allele does not necessarily mean that this is the individual's genotype; it could simply mean that the gene copy with the reference allele is not being expressed in that tissue and/or at the time the sample was acquired.

References

[1] Ryan D. Morin, Matthew Bainbridge, Anthony Fejes, Martin Hirst, Martin Krzywinski, Trevor J. Pugh, Helen McDonald, Richard Varhol, Steven J.M. Jones, and Marco A. Marra (2008). "Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing". BioTechniques 45 (1): 81–94. PMID 18611170. http://www.bcgsc.ca/about/pubann/biotechniques-publication-2008-44-8.
[2] Wang Z, Gerstein M, Snyder M (January 2009). "RNA-Seq: a revolutionary tool for transcriptomics". Nature Reviews Genetics 10 (1): 57–63. doi:10.1038/nrg2484. PMID 19015660. http://www.nature.com/nrg/journal/v10/n1/abs/nrg2484.html.
[3] Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM (January 2009). "Transcriptome sequencing to detect gene fusions in cancer". Nature. doi:10.1038/nature07638. PMID 19136943. http://www.nature.com/nature/journal/vaop/ncurrent/abs/nature07638.html.
[4] Samuel Marguerat, Brian T. Wilhelm and Jürg Bähler (2008). "Next-generation sequencing: applications beyond genomes". Biochemical Society Transactions 36: 1091–1096. doi:10.1042/BST0361091. PMID 18793195. http://www.biochemsoctrans.org/bst/036/1091/bst0361091.htm.
[5] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008). "Mapping and quantifying mammalian transcriptomes by RNA-Seq". Nature Methods 5 (7): 621–628. doi:10.1038/nmeth.1226. PMID 18516045. http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1226.html.
[6] Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM (2008). "Stem cell transcriptome profiling via massive-scale mRNA sequencing". Nature Methods 5 (7): 613–619. doi:10.1038/nmeth.1223. PMID 18516046. http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1223.html.
[7] Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS (2007). "SNP discovery via 454 transcriptome sequencing". The Plant Journal 51 (5): 910–918. doi:10.1111/j.1365-313X.2007.03193.x. PMID 17662031. http://www3.interscience.wiley.com/journal/118488674/abstract.
[8] http://www.protocol-online.org/prot/Molecular_Biology/RNA/RNA_Extraction/mRNA_Isolation/index.html
[9] Mardis ER (2008). "The impact of next-generation sequencing technology on genetics". Trends in Genetics 24 (3): 142–149. doi:10.1016/j.tig.2007.12.007. PMID 18262675. http://dx.doi.org/10.1016/j.tig.2007.12.007.
[10] http://www.454.com/applications/transcriptome-sequencing.asp
[11] http://www.illumina.com/pages.ilmn?ID=204
[12] http://solid.appliedbiosystems.com/
[13] Zerbino DR, Birney E (2008). "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs". Genome Research 18 (5): 821–829. doi:10.1101/gr.074492.107. PMID 18349386. http://genome.cshlp.org/content/18/5/821.full.
[14] Cole Trapnell, Lior Pachter and Steven Salzberg (2009). "TopHat: discovering splice junctions with RNA-Seq". Bioinformatics 25: 1105–1111. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/9/1105?etoc.
[15] Greenbaum D, Colangelo C, Williams K, Gerstein M (2003). "Comparing protein abundance and mRNA expression levels on a genomic scale". Genome Biology 4 (9): 117. doi:10.1186/gb-2003-4-9-117. PMID 12952525. http://genomebiology.com/2003/4/9/117.
[16] Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008). "RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays". Genome Research 18 (9): 1509–1517. doi:10.1101/gr.079558.108. PMID 18550803. http://genome.cshlp.org/content/18/9/1509.
[17] Shendure J (2008). "The beginning of the end for microarrays?". Nature Methods 5 (7): 585–587. doi:10.1038/nmeth0708-585. PMID 18587314. http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth0708-585.html.
[18] Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008). "The transcriptional landscape of the yeast genome defined by RNA sequencing". Science 320 (5881): 1344–1349. doi:10.1126/science.1158441. PMID 18451266. http://www.sciencemag.org/cgi/content/abstract/320/5881/1344.
[19] Li H, Lovci MT, Kwon YS, Rosenfeld MG, Fu XD, Yeo GW (2008). "Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model". Proc Natl Acad Sci U S A 105 (51): 20179–84. PMID 19088194. http://www.pnas.org/content/105/51/20179.long.
[20] Garcion E, Wallace B, Pelletier L, Wion D (2004). "RNA mutagenesis and sporadic prion diseases". Journal of Theoretical Biology 230 (2): 271–274. doi:10.1016/j.jtbi.2004.05.014. PMID 15302558. http://dx.doi.org/10.1016/j.jtbi.2004.05.014.
[21] Teixeira MR (2006). "Recurrent fusion oncogenes in carcinomas". Critical Reviews in Oncogenesis 12 (3-4): 257–271. PMID 17425505. http://www.begellhouse.com/journals/439f422d0783386a,1371844864dc630c,7499e7bf0e7ad511.html.
