We start by loading our results from the previous week
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: ‘MatrixGenerics’
The following objects are masked from ‘package:matrixStats’:
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
library(dplyr)
Attaching package: ‘dplyr’
The following object is masked from ‘package:Biobase’:
combine
The following object is masked from ‘package:matrixStats’:
count
The following objects are masked from ‘package:GenomicRanges’:
intersect, setdiff, union
The following object is masked from ‘package:GenomeInfoDb’:
intersect
The following objects are masked from ‘package:IRanges’:
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from ‘package:S4Vectors’:
first, intersect, rename, setdiff, setequal, union
The following objects are masked from ‘package:BiocGenerics’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
## Read the annotated results
results_tgf <- readRDS("Robjects/results_TGF_vs_CTR_annotated_BACKUP.rds")
## Read the counts that we produced previously
dds <- readRDS("Robjects/dds_BACKUP.rds")
The standard volcano plot can be produced with the
ggplot2
package
library(ggplot2)
results_tgf %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj))) + geom_point()
library(org.Hs.eg.db)
Loading required package: AnnotationDbi
Attaching package: ‘AnnotationDbi’
The following object is masked from ‘package:dplyr’:
select
ecm_anno <- AnnotationDbi::select(org.Hs.eg.db,
keys = "GO:0030198",
keytype = "GO",
columns=c("ENSEMBL","SYMBOL"))
'select()' returned 1:many mapping between keys and columns
ecm_anno
ecm_ids <- pull(ecm_anno, "ENSEMBL")
results_tgf %>%
mutate(ECM_Gene = ENSEMBL %in% ecm_ids) %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj),col=ECM_Gene)) + geom_point()
The default colouring does not distinguish the ECM genes as much as
we might like. We could manually adjust it so that genes belonging to
the pathway are coloured in red, and the rest in black. The transparency
(alpha
) can also be adjusted. Below is a suggestion.
results_tgf %>%
mutate(ECM_Gene = ENSEMBL %in% ecm_ids) %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj),col=ECM_Gene,alpha=ECM_Gene)) + geom_point() + scale_color_manual(values=c("black","red")) + scale_alpha_manual(values=c(0.2,1))
The tricky part of making a heatmap is deciding which genes to include. The gene names have to be in ENSEMBL format, as the count matrix we will be plotting has ENSEMBL names in the rows.
In the following example, we chose the genes with most extreme log2 fold-change
# The pheatmap library is needed to make the heatmap
library(pheatmap)
#first, get variance-stablised counts
vsd <- vst(dds)
using 'avgTxLength' from assays(dds), correcting for library size
sampleInfo <- as.data.frame(colData(dds)[,c("condition","Treated")])
## Order results by log2FoldChange and get the most extreme
top_genes_logFC <- results_tgf %>%
arrange(desc(abs(log2FoldChange))) %>%
slice(1:30) %>%
pull(ENSEMBL)
top_genes_logFC_labels <- results_tgf %>%
arrange(desc(abs(log2FoldChange))) %>%
slice(1:30) %>%
pull(SYMBOL)
## get the corresponding geen symbols
pheatmap(assay(vsd)[top_genes_logFC,],
scale = "row",
labels_row = top_genes_logFC_labels)
In order to use the enrichKEGG
function, we will have to
define a set of enriched genes in terms of their ENTREZID
(instead of ENSEMBL
)
sigGenesEntrez <- results_tgf %>%
filter(padj < 0.05, !is.na(ENTREZID)) %>% pull(ENTREZID)
Make sure you check the help for enrichKEGG
, as it uses
different argument names
library(clusterProfiler)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
clusterProfiler v4.12.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
Please cite:
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang,
W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize
multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z
Attaching package: ‘clusterProfiler’
The following object is masked from ‘package:AnnotationDbi’:
select
The following object is masked from ‘package:IRanges’:
slice
The following object is masked from ‘package:S4Vectors’:
rename
The following object is masked from ‘package:stats’:
filter
enrich_kegg <- enrichKEGG(gene = sigGenesEntrez,
organism = "hsa",)
Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
as.data.frame(enrich_kegg)
The same plots can be created from the results. Here is the dot plot.
dotplot(enrich_kegg)
and the upset plot.
enrichplot::upsetplot(enrich_kegg)
For the GSEA analysis using KEGG we can use the same set of ranked statistics, but make sure to name them according to ENTREZID
ranked_genes <- results_tgf %>%
arrange(desc(stat)) %>%
filter(!is.na(stat))
geneList <- pull(ranked_genes, stat)
names(geneList) <- pull(ranked_genes, ENTREZID)
Again, make sure to check the arguments for gseKEGG
.
gse_kegg <- gseKEGG(geneList = geneList,
organism = "hsa")
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
Warning: There are ties in the preranked stats (11.21% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.Warning: There are duplicate gene names, fgsea may produce unexpected results.Warning: For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.leading edge analysis...
done...
as.data.frame(gse_kegg)