Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law, Stephane Ballereau, Oscar Rueda, Ashley Sawle Based on the course RNAseq analysis in R delivered on May 11/12th 2016 and modified by Cancer Research Uk Cambridge Centre for the Functional Genomics Autumn School 2017
This material has been created using the following resources:
This section also uses code from Stephen Turner’s guide to fgsea https://stephenturner.github.io/deseq-to-fgsea/
In the early days of microarray analysis, people were happy if they got a handful of differentially-expressed genes that they could validate or follow-up. However, with later technologies (and depending on the experimental setup) we might have thousands of statistically-significant results, which no-one has the time to follow-up. Also, we might be interested in pathways / mechanisms that are altered and not just individual genes.
In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities.
There are two different approaches one might use, and we will cover the theory behind both
The fgsea package is a free implementation of the Broad’s GSEA software and is described in more detail in the package vignette “fast preranked gene set enrichment analysis (GSEA)”:
The GSEA analysis is performed by:
“After establishing the ES for each gene set across the phenotype, GSEA reiteratively randomizes the sample labels and retests for enrichment across the random classes. By performing repeated class label randomizations, the ES for each gene set across the true classes can be compared to the ES distribution from the random classes. Those gene sets that significantly outperform iterative random class permutations are considered significant.”
The article describing the original software is available here and there is also a commentary on GSEA.
In addition to the GSEA software the Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). Unfortunately, these are collections of human genes. However, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institutes Bioinformatics service and made avaialble for download. These gene sets use Entrez ID as their identifier.
library(fgsea)
library(tidyverse)
results_annotated <- read_csv("tumour_vs_normal_DESeq_annotated.csv")
An appealing feature of the method is that it does not require us to impose arbitrary cut-offs on the dataset to decide what is differentially-expressed or not. The steps in producing the input required for GSEA are i) retrieving the ranked statistics ii) naming each one according to Entrez ID.
gseaInput <- results_annotated %>%
filter(!is.na(ENTREZID), !is.na(stat)) %>%
arrange(stat)
ranks <- pull(gseaInput,stat)
names(ranks) <- gseaInput$ENTREZID
The Walter+Eliza Hall Institutes Bioinformatics service have made mouse versions of the MSigDB datasets available for download. This should already be available in the Robjects
folder.
download.file("http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata", destfile = "Robjects/human_H_v5p2.rdata")
trying URL 'http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata'
Content type 'text/plain; charset=UTF-8' length 22168 bytes (21 KB)
==================================================
downloaded 21 KB
load("Robjects/human_H_v5p2.rdata")
pathways <- Hs.H
Conduct analysis:
fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize = 500, nperm=1000)
dim(fgseaRes)
[1] 50 8
#head(fgseaRes)
The results table gives the names of each pathway that was tested and the stats from doing the test. We can make this into a “tidy” object with the following code.
fgseaResTidy <- fgseaRes %>%
as_tibble() %>%
arrange(desc(NES))
# Show in a nice table:
fgseaResTidy
library(ggplot2)
ggplot(fgseaResTidy, aes(reorder(pathway, NES), NES)) +
geom_col(aes(fill=padj<0.05)) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="Hallmark pathways NES from GSEA")
The enrichment plot will show where the genes belonging to a particular gene set are towards the top or the bottom of the genelist, and how the enrichment score is calculated across the dataset.
Here we show the enrichment plot for the pathway with the most positive enrichment score.
plotEnrichment(pathways[["HALLMARK_MYC_TARGETS_V1"]],
ranks)
names(pathways)
[1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB"
[2] "HALLMARK_HYPOXIA"
[3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS"
[4] "HALLMARK_MITOTIC_SPINDLE"
[5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING"
[6] "HALLMARK_TGF_BETA_SIGNALING"
[7] "HALLMARK_IL6_JAK_STAT3_SIGNALING"
[8] "HALLMARK_DNA_REPAIR"
[9] "HALLMARK_G2M_CHECKPOINT"
[10] "HALLMARK_APOPTOSIS"
[11] "HALLMARK_NOTCH_SIGNALING"
[12] "HALLMARK_ADIPOGENESIS"
[13] "HALLMARK_ESTROGEN_RESPONSE_EARLY"
[14] "HALLMARK_ESTROGEN_RESPONSE_LATE"
[15] "HALLMARK_ANDROGEN_RESPONSE"
[16] "HALLMARK_MYOGENESIS"
[17] "HALLMARK_PROTEIN_SECRETION"
[18] "HALLMARK_INTERFERON_ALPHA_RESPONSE"
[19] "HALLMARK_INTERFERON_GAMMA_RESPONSE"
[20] "HALLMARK_APICAL_JUNCTION"
[21] "HALLMARK_APICAL_SURFACE"
[22] "HALLMARK_HEDGEHOG_SIGNALING"
[23] "HALLMARK_COMPLEMENT"
[24] "HALLMARK_UNFOLDED_PROTEIN_RESPONSE"
[25] "HALLMARK_PI3K_AKT_MTOR_SIGNALING"
[26] "HALLMARK_MTORC1_SIGNALING"
[27] "HALLMARK_E2F_TARGETS"
[28] "HALLMARK_MYC_TARGETS_V1"
[29] "HALLMARK_MYC_TARGETS_V2"
[30] "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"
[31] "HALLMARK_INFLAMMATORY_RESPONSE"
[32] "HALLMARK_XENOBIOTIC_METABOLISM"
[33] "HALLMARK_FATTY_ACID_METABOLISM"
[34] "HALLMARK_OXIDATIVE_PHOSPHORYLATION"
[35] "HALLMARK_GLYCOLYSIS"
[36] "HALLMARK_REACTIVE_OXIGEN_SPECIES_PATHWAY"
[37] "HALLMARK_P53_PATHWAY"
[38] "HALLMARK_UV_RESPONSE_UP"
[39] "HALLMARK_UV_RESPONSE_DN"
[40] "HALLMARK_ANGIOGENESIS"
[41] "HALLMARK_HEME_METABOLISM"
[42] "HALLMARK_COAGULATION"
[43] "HALLMARK_IL2_STAT5_SIGNALING"
[44] "HALLMARK_BILE_ACID_METABOLISM"
[45] "HALLMARK_PEROXISOME"
[46] "HALLMARK_ALLOGRAFT_REJECTION"
[47] "HALLMARK_SPERMATOGENESIS"
[48] "HALLMARK_KRAS_SIGNALING_UP"
[49] "HALLMARK_KRAS_SIGNALING_DN"
[50] "HALLMARK_PANCREAS_BETA_CELLS"
Challenge 1
- What pathway has the most extreme negative enrichment score? How can you identify this pathway from the results table?
- Make the enrichment plot for this pathway
The names of genes involved in particular pathways can be extracted from the pathways
object. We can then use these genes to make a heatmap, or other visualisation, to show how these genes partition the dataset.
dds <- readRDS("Robjects/dds.rds")
Loading required package: DESeq2
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall,
clusterEvalQ, clusterExport, clusterMap, parApply,
parCapply, parLapply, parLapplyLB, parRapply,
parSapply, parSapplyLB
The following objects are masked from ‘package:dplyr’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename,
cbind, colMeans, colnames, colSums, dirname, do.call,
duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, lengths, Map,
mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, Position, rank, rbind, Reduce,
rowMeans, rownames, rowSums, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which,
which.max, which.min
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:dplyr’:
first, rename
The following object is masked from ‘package:tidyr’:
expand
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Attaching package: ‘IRanges’
The following objects are masked from ‘package:dplyr’:
collapse, desc, slice
The following object is masked from ‘package:purrr’:
reduce
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages
'citation("pkgname")'.
Loading required package: DelayedArray
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following objects are masked from ‘package:Biobase’:
anyMissing, rowMedians
The following object is masked from ‘package:dplyr’:
count
Loading required package: BiocParallel
Attaching package: ‘DelayedArray’
The following objects are masked from ‘package:matrixStats’:
colMaxs, colMins, colRanges, rowMaxs, rowMins,
rowRanges
The following object is masked from ‘package:purrr’:
simplify
The following objects are masked from ‘package:base’:
aperm, apply
library(DESeq2)
library(pheatmap)
my_genes <- filter(results_annotated, ENTREZID %in% pathways[["HALLMARK_MYC_TARGETS_V1"]]) %>%
pull(GeneID)
vsd <- vst(dds)
mat <- assay(vsd)[my_genes,]
mat <- mat - rowMeans(mat)
dim(mat)
[1] 197 27
sampleinfo <- data.frame(colData(dds))[,c("Batch","Status","gleason_score")]
colnames(mat) <- rownames(sampleinfo) <- dds$ID
pheatmap(mat,
annotation_col = sampleinfo)
GOseq is a method to conduct Gene Ontology (GO) analysis suitable for RNA-seq data as it accounts for the gene length bias in detection of over-representation (GOseq article)
From the GOseq vignette:
“GO analysis of RNA-seq data requires the use of random sampling in order to generate a suitable null distribution for GO category membership and calculate each category’s significance for over representation amongst DE genes. … In most cases, the Wallenius distribution can be used to approximate the true null distribution, without any significant loss in accuracy. The goseq package implements this approximation as its default option.”
First, create list of DEGs. We don’t have to be too strict about our criteria for a gene to be differentially-expressed, as too few genes will not give us many enriched pathways.
Mapping between gene identifiers and pathways is done via pre-built Bioconductor packages. In our dataset, each gene that we have tested has a valid gene symbol. Therefore we will tell goseq
to use the gene symbol (the GeneID
) column in the annotated data frame to map to pathways and calculate gene lengths.
genes <- results_annotated$padj < 0.05 & !is.na(results_annotated$padj)
names(genes) <- results_annotated$GeneID
The nullp
will fit a model to account for gene length biases in our data. In order to use the human genome version hg38 we need the TxDb.Hsapiens.UCSC.hg38.knownGene
package to be installed.
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
library(goseq)
pwf <- nullp(genes, "hg38","geneSymbol")
Then we conduct gene set enrichment analysis with the goseq
function.
goseq_res <- goseq(pwf, "hg38","geneSymbol",test.cats="GO:BP")
head(goseq_res)
clusterProfiler
is another Bioconductor package for over-representation analysis. It’s main advantage is that it provides some nice visualisation methods.
Firstly, we can identify over-represented GO terms and visualise these as a network.
library(clusterProfiler)
universe <- results_annotated %>% pull(ENSEMBL)
sigGenes <- results_annotated %>%
filter(padj < 0.05, !is.na(ENSEMBL)) %>% pull(ENSEMBL)
enrich_go <- enrichGO(
gene= sigGenes,
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP",
universe = universe,
qvalueCutoff = 0.05,
readable=TRUE
)
enrich_go_tidy <- enrich_go %>%
slot("result") %>%
tibble::as.tibble()
enrich_go_tidy
A dot plot can show us the most enriched pathways, and the size of each.
dotplot(enrich_go)
emapplot(enrich_go)
There are also many online tools that one could use to perform a gene set or ontology analysis.
The tools generally require your input genes lists to be uploaded as a simple text file. In this final challenge, we will create some files that you might use in one of these tools.
This file has one column which lists all the gene names present in the analysis. Gene Symbols are commonly used, although a tool may accept Ensembl or Refseq names
This file has one column which lists the genes that passed the threshold for statistical significance (e.g. p-value less than 0.05) in your analysis. Gene Symbols are commonly used, although a tool may accept Ensembl or Refseq names
Challenge
Create two text files that can be imported into online tools for further analysis 1. A list of background genes 2. A list of differentially expressed genes 3. Load these files into GOrilla for analysis HINT: the
write.table
function is able to write a data frame to a txt file in R. You will need to set the appropriate arguments to make sure that a text file with only one column is created.