In the early days of microarray analysis, people were happy if they got a handful of differentially-expressed genes that they could validate or follow-up. However, with later technologies (and depending on the experimental setup) we might have thousands of statistically-significant results, which no-one has the time to follow-up. Also, we might be interested in pathways / mechanisms that are altered and not just individual genes.
In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities.
There are two different approaches one might use, and we will cover the theory behind both
The question we are asking here is;
“Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?”
Where Theme X could be genes belonging to a particular GO (Gene Onotology) term.
Let’s imagine that we have a bag full of balls. Each balls represents a gene in the gene universe. - Paint the balls representing our selected list grey, and paint the rest red.
In this small example, we can define;
Now, lets select a number (say, 12) of the balls at random without seeing into the bag and look at what we get
We have picked, at random, 8 grey balls. Using simulations, we can repeat the process and look at how many grey we get.
The distribution of the data shows what the most-likely values are
We can count how many times each value is observed
trials
4 5 6 7 8 9 10 11 12
1 3 73 403 1333 2638 3168 1922 459
Dividing by the number of trials gives a probability of sorts
trials
4 5 6 7 8 9 10 11 12
0.0001 0.0003 0.0073 0.0403 0.1333 0.2638 0.3168 0.1922 0.0459
The probability of getting at least a certain number can also be computed
4 5 6 7 8 9 10 11 12
0.0001 0.0004 0.0077 0.0480 0.1813 0.4451 0.7619 0.9541 1.0000
4 5 6 7 8 9 10 11 12
0.9999 0.9996 0.9923 0.9520 0.8187 0.5549 0.2381 0.0459 0.0000
Back to our example, the distribution of balls can be expressed as a contingency table, on which we can use a Fisher’s exact test
Total grey balls: 10 Total in subset: 12
The formula for Fishers exact test is;
\[ p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} \]
or less formally;
P = (ways of choosing grey balls) X (ways of non-grey balls amongst subset) / ways of choosing subset
Be careful of how you define the universe
What genes were candidates for selection as universe?
Just changing the size of the universe alone can have a massive effect on the p-value
In order to get a larger set of differentially-expressed genes across the whole genome, we will use a different dataset where gene counts have already been computed
In the study of Brooks et al. 2011, the Pasilla (PS) gene, Drosophila homologue of the Human splicing regulators Nova-1 and Nova-2 Proteins, was depleted in Drosophila melanogaster by RNAi. The authors wanted to identify exons that are regulated by Pasilla gene using RNA sequencing data. Total RNA was isolated and used for preparing either single-end or paired-end RNA-seq libraries for treated (PS depleted) samples and untreated samples. These libraries were sequenced to obtain a collection of RNA sequencing reads for each sample. The effects of Pasilla gene depletion on splicing events can then be analyzed by comparison of RNA sequencing data of the treated (PS depleted) and the untreated samples. The genome of Drosophila melanogaster is known and assembled. It can be used as reference genome to ease this analysis. In a reference based RNA-seq data analysis, the reads are aligned (or mapped) against a reference genome, Drosophila melanogaster here, to significantly improve the ability to reconstruct transcripts and then identify differences of expression between several conditions.
To save time and memory, pre-counted files are available online (https://zenodo.org/record/1185122#.WqGfbnW0PCI)
c7 < 0.05 and (c3 > 1.0 or c3 < -1.0)
(Optional) Sort by a column:
We can sort the table by values in a particular column (eg. log fold change)
We will be using the online tool GOrilla
This is the list of all the genes you tested in your RNAseq pipeline.
Drosophila melanogaster
Two unranked lists of genes
DE_Genes.txt
as the Target set.Background.txt
as the Background set.Process
Search Enriched GO terms
For these tests, we don’t have to specify a statistical threshold and use the test statistics (log fold-change) from all genes as the input to the test. The popular Gene Set Enrichment Analysis (GSEA) uses this approach. These tests can be used to detect differential expression for a group of genes, even when the effects are too small or there is too little data to detect the genes individually.
gsea
Subramanian et al, PNAS 2005
The Broad institute provides a version of GSEA that can be run via a java application. Another option is GeneTrail which has the option to do a GSEA-style analysis.
c3 != 'NA'
DESeq2 result file...
as inputc1 c3
as the columns to cutgsea-input.txt
)Go to the GeneTrail website
gsea-input.txt