RNA-seq Analysis Exercise

Background

RNA-Seq is a recently developed approach for transcriptome profiling that uses deep-sequencing technologies.

In general, a population of RNA (total or fractionated, such as polyA+) is converted to a library of cDNA fragments with adaptors attached to one or both ends. Each molecule, with or without amplification, is then sequenced in a high-throughput manner to obtain short sequences from one end (single-end sequencing) or both ends (pair-end sequencing).

RNA-seq_workflow.jpg

Exercise

Dataset

Here we will use a RNA-seq dataset to identify differentially expressed genes between brain and adrenal samples.

Tissue

RNA-Seq reads forward

RNA-Seq reads reverse

Brain

brain1

brain2

Adrenal

adrenal1

adrenal2

  • Brain1: Forward RNA-seq reads from brain tissue

  • Brain2: Reverse RNA-seq reads from brain tissue

  • Adrenal1: Forward RNA-seq reads from adrenal tissue

  • Adrenal2 : Reverse RNA-seq reads from adrenal tissue

The datasets are paired-end 50bp reads from adrenal and brain tissues. The sampled reads map mostly to a 500Kb region of chromosome 19, positions 3-3.5 million (chr19:3000000-3500000).

The figure below illustrates all the steps of the RNA-Seq analysis workflow. A larger version is available in pdf.

RNA-seq_diagram.png

Run Galaxy and upload data

  1. Connect to http://orione.crs4.it/

  2. On the upper panel click on Shared Data -> Data Libraries, then on Hands-on workshop, and then on RNA-Seq.

  3. Select the files:
    • brain1: brain - forward reads
    • brain2: brain - reverse reads
    • adrenal1: adrenal - forward reads
    • adrenal2: adrenal - reverse reads
    • iGenomes_UCSC_hg19_chr19_gene_annotation.gtf: human genome annotation (chr19 only)
    • hg_chr19.fa: human reference genome (chr19 only)
  4. Import them to a new history (e.g. Workshop: RNA-Seq).

Now, on the upper panel click on Analyze data and select the above-mentioned history amongst the Saved Histories. Then, in the right panel you will find the imported datasets, that will be used in the next steps of the analysis.

Quality control

As a first step of the analysis you should always perform a quality control check to ensure that the raw data looks good and there are no problems or biases in your data which may affect your downstream analysis and conclusions.
To check the quality of your FASTQ files:

  1. From the left panel select the NGS: quality control -> FastQC:Read QC reports using FastQC tool.

  2. From the drop-down list select one the four FASTQ files (e.g. brain1).

  3. Type an appropriate name for your output file (i.e. if the input is brain1, the title could be brain1 QC).

  4. Click "Execute"
  5. Looking at QC results, do you think is it necessary to trim possible bad reads?
  6. If needed (it should be...), trim brain1:

    1. Open NGS: manipulation -> FASTQ Trimmer by column (pay attention, there are many trimmers in Galaxy!)

    2. Type 10 as offset from 5' end
    3. Click "Execute"
    4. When the tool finishes, rename the output file (e.g. brain1 trimmed).

Map processed reads

The next step is the mapping of the processed reads to the reference genome. The major challenge when mapping RNA-seq reads is that the reads, because they come from RNA, often cross splice junction boundaries; splice junctions are not present in a genome's sequence, and hence typical NGS mappers such as Bowtie and BWA are not ideal without modifying the genome sequence. Instead, it is better to use a mapper such as TopHat that is designed to map RNA-seq reads.

RNA-Seq-alignment.png

  1. Use the RNA-seq -> TopHat for Illumina tool to map RNA-seq reads to the hg19 genome. To reduce the time of execution, we will only align reads against chr19 of the human genome. Since the reads are paired, you will need to set the value of mean inner distance between pairs: this is the average distance in basepairs between reads, not the total insert/fragment size.

    1. Select brain1 trimmed as the first RNA-seq FastQ file (forward-reads).

    2. Use a genome from history.

    3. Select hg19_chr19.fa as the reference genome.

    4. Select paired-end as the library.

    5. Select brain2 file as the second RNA-seq FASTQ file (reverse-reads).

    6. Use a mean inner distance of 110.

    7. Leave the default TopHat settings.

  2. Repeat the same steps for the adrenal files, using adrenal1 (forward reads) and adrenal2 (reverse reads).

  3. Look at the tool documentation (just below the Execute button) to understand the datasets that TopHat produces.

Q: Which is the content of the TopHat output files?

Q: How many splice junctions did TopHat find?

Q: Are most splice junctions supported by more or less than 10 reads?
Hint: the score column (5th column) in the splice junctions dataset is useful for answering this question.

Location of splice junctions on the UCSC Genome Browser

The location of splice junctions produced by TopHat may be visualized on the UCSC Genome Browser selecting the display at UCSC main in the history/dataset panel.

Go to region 'chr19:3,000,000-3,500,000'

You should be able to see:

  • The splice junctions produced by TopHat.

  • How TopHat's junctions correspond to the annotated genes.

Summary questions

Q: Are you able to find a gene locus where there are mapped reads, and then find an example of a splice junction between 2 known exons? Hint: to display the mapped reads on UCSC, select display at UCSC main of your TopHat BAM output file in your history/panel.

The figure below shows an example of a splice junction with the reads spanning the two exons. From top to bottom: junction predicted by TopHat; aligned reads; known junctions from UCSC.

ucsc-junction.png

Q: Are you able to find an example where a known splice junction in RefSeq has not been found by TopHat?

Assemble and analyze transcripts

After mapping the reads, the next step is to assemble the reads into complete transcripts that can be analyzed for differential expression and phenomena such as splicing events and transcriptional start sites.

  1. Run the RNA-seq -> Cufflinks tool on the TopHat accepted hits on adrenal BAM dataset. This will assemble the reads into transcripts. See the tool documentation to figure out the output datasets produced by Cufflinks.

  2. Run the RNA-seq -> Cufflinks tool on the TopHat accepted hits on brain BAM dataset.

  3. Run the RNA-seq -> Cuffmerge tool on the two datasets of assembled transcripts produced by Cufflinks from Step 1:

    1. Select Cufflinks on brain: assembled transcripts as the GTF input file;

    2. Click on Add new additional GTF Input Files.

    3. Select Cufflinks on adrenal: assembled transcripts as the additional GTF input file.

    4. Use the UCSC genes annotation iGenomes_UCSC_hg19_chr19_gene_annotation.gtf as the reference annotation.

    Cuffmerge produces a merged transcripts dataset that includes all transcript in both datasets. The merged set of transcripts are needed in subsequent steps.
  4. Run the RNA-seq -> Cuffdiff tool with the following parameters:

    • the merged transcripts by Cuffmerge;

    • in Condition 1:
      • Name: brain

      • Replicate 1: Tophat accepted hits dataset for brain;
    • in Condition 2:
      • Name: adrenal

      • Replicate 1: Tophat accepted hits dataset for adrenal.
    This will perform differential expression analyses on the two samples.

    {i} Be prepared!: 15 output files will appear in your history!

    Here follows a short description of the output files. In bold the output files we'll use in this practical:

Output

Description

transcript FPKM tracking

FPKM (Fragments Per Kilobase of transcript per Million mapped reads, i.e. normalized expression) at transcript level

transcript differential expression testing

Differential expression testing for transcripts: FPKM of one group vs FPKM of the other.

gene FPKM tracking

FPKM at gene level. Tracks the summed FPKM of transcripts sharing the same gene_id.

gene differential expression testing

Differential expression testing for genes. Tests differences in the summed FPKM of transcripts the same gene_id.

TSS groups FPKM tracking

FPKM at TSS (transcription start site) level. Tracks the summed FPKM of transcripts sharing the same tss_id.

TSS groups differential expression testing

Differential expression testing for primary transcripts. Tests differences in the summed FPKM of transcripts sharing the sane tss_id.

CDS FPKM tracking

FPKM at CDS level. Tracks the summed FPKM of transcripts sharing the same coding sequence, independent of tss_id.

CDS FPKM differential expression testing

Differential expression testing for coding sequence (CDS). Tests differences in the summed FPKM of transcripts sharing the same coding sequence, independent of tss_id.

CDS overloading diffential expression testing

Differential coding output: For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here.

promoters differential expression testing

Differential promoter use: For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here.

splicing differential expression testing

Differential splicing tests: For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.

TSS groups read group tracking

Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing the same tss_id in each replicate.

CDS read group tracking

Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing the same coding sequence, independent of tss_id in each replicate.

genes read group tracking

Gene read group tracking. Tracks the summed expression and counts of transcripts sharing the same gene_id in each replicate.

isoforms read group tracking

Transcript read group tracking.

Browse the Cuffdiff documentation for more details.

Look at the transcript FPKM tracking dataset at the top of your history. The second column contains the class code of each transcript.
Q: Find an entry for a novel isoform (class code 'j') and an entry for an isoform that matches a reference isoform (class code '=').

Q: What is the nearest gene and transcription start site for each entry and what is each transcript's FPKM (Fragments Per Kilobase of exon per Million fragments mapped) value?

Look at transcript differential expression testing dataset (second from the top of your history).
Q: What are the top 10 hits with best p-values?
Hint: To answer this question run the Filter and Sort -> Sort tool by selecting:

  1. the transcript differential expression testing file,

  2. the column c12 (corresponding to p-values),

  3. select General numeric sort as sorting flavor (which can also handle numbers in scientific notation),

  4. and finally Ascending order.

The output of TopHat in BAM format containing the alignment of the reads to the genome can be viewed with the UCSC Genome Browser as you did before.

References

  • Trapnell, C., et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7(3), 562-578 Paper

  • Trapnell, C., et al. (2013) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31(1), 46-53 Paper

  • TopHat: a spliced read mapper for RNA-Seq Website

  • Cufflinks: transcript assembly, differential expression, and differential regulation for RNA-Seq Website

  • CummeRbund: visualization of RNA-Seq differential analysis in R Website

BioWiki: RNA-Seq_Tutorial (last edited 2014-09-25 14:20:15 by NicolaSoranzo)