Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law, Stephane Ballereau, Oscar Rueda, Ashley Sawle Based on the course RNAseq analysis in R delivered on May 11/12th 2016 and modified by Cancer Research Uk Cambridge Centre for the Functional Genomics Autumn School 2017
This material has been created using the following resources:
DESeq2
Now that we are happy that we have normalised the data and that the quality looks good, we can continue to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use DESeq2
or edgeR
. We will use DESeq2
for the rest of this practical.
**First make sure we have all the objects and libraries loaded*
library(DESeq2)
The previous section walked-through the pre-processing and transformation of the count data. Here, for completeness, we list the minimal steps required to process the data prior to differential expression analysis.
Note that although we spent some time looking at the quality of our data , these steps are not required prior to performing differential expression so are not shown here. Remember, DESeq2
requires raw counts so the vst
transformation is not shown as part of this basic protocol.
raw <- read_tsv("raw_data/selected_prostate_tcga_raw.tsv")
genes <- pull(raw,X)
cts <- as.matrix(raw[,-1])
rownames(cts) <- genes
sampleinfo <- read.delim("meta_data/sampleInfo_corrected.txt")
keep <- rowSums(assay(dds) >= 5) >= 5
dds <- dds[keep,]
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = sampleinfo,
design = ~Status)
dds <- dds[,-which(dds$ID %in% c("N7","N8","N9"))]
dds
We also have the output of the pre-processing section saved as an R object if you didn’t manage to complete these steps.
## Only run if you didn't complete the previous section on pre-processing
dds <- readRDS("Robjects/dds.rds")
We have previously defined the test condition using the design
argument when we created the object. This can be checked using the design
function.
Typically we decide the design for the analysis when we create the DESeq2 objects, but it can be modified prior to the differential expression analysis
colData(dds)
DataFrame with 27 rows and 5 columns
SampleName ID gleason_score Status Batch
<factor> <factor> <factor> <factor> <integer>
TCGA.CH.5753.01A.11R.1580.07 TCGA.CH.5753.01A.11R.1580.07 5753 G9 Tumour 1
TCGA.CH.5761.01A.11R.1580.07 TCGA.CH.5761.01A.11R.1580.07 5761 G9 Tumour 1
TCGA.EJ.5507.01A.01R.1580.07 TCGA.EJ.5507.01A.01R.1580.07 5507 G9 Tumour 1
TCGA.FC.A8O0.01A.41R.A37L.07 TCGA.FC.A8O0.01A.41R.A37L.07 A8O0 G6 Tumour 1
TCGA.G9.6371.01A.11R.1789.07 TCGA.G9.6371.01A.11R.1789.07 6371 G6 Tumour 1
... ... ... ... ... ...
TCGA.EJ.7327.11A.01R.2118.07 TCGA.EJ.7327.11A.01R.2118.07 N3 Normal Normal 1
TCGA.G9.6356.11A.01R.1789.07 TCGA.G9.6356.11A.01R.1789.07 N4 Normal Normal 1
TCGA.G9.6384.11A.01R.1858.07 TCGA.G9.6384.11A.01R.1858.07 N5 Normal Normal 1
TCGA.G9.6499.11A.02R.1965.07 TCGA.G9.6499.11A.02R.1965.07 N6 Normal Normal 2
TCGA.HC.8260.11A.01R.2263.07 TCGA.HC.8260.11A.01R.2263.07 N10 Normal Normal 2
design(dds)
~Status
The function runs a couple of processing steps automatically to adjust for different library size and gene-wise variabiliy, which you can read about in the DESeq2 vignette.
Firstly apply the median ratio normalisation method to compensate for differences in library sizes
dds <- estimateSizeFactors(dds)
estimate the dispersion for each gene
dds <- estimateDispersions(dds)
Apply statistical testing based on the negative binomial distribution.
dds <- nbinomWaldTest(dds)
Fortunately, there is one convenient function that will apply the three steps
de_status <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 947 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
The results of the analysis can be obtained using the results
function and displayed to the screen. Each row is a particular gene measured in the study (i.e. all genes in the organism being studied) and each column reports some aspect of the differential expression analysis for that gene. Note that all genes are reported regardless of whether they are differentially-expressed or not. The results table is sorted according to the gene names rather than the test results.
results(de_status)
The output can be converted into a data frame and manipulated in the usual manner. It is recommended to use dplyr
to manipulate the data frames with the standard set of operations detailed on the dplyr cheatsheet
select
to pick which columns to displayfilter
to restrict the rowsmutate
to add new variables to the data framearrange
to order the data frame according to values of a columnHowever, dpylr
does not like data frame that have rownames. We can use the rownames_to_column
function from the tibble
package to add an extra column that contains the Ensembl gene IDs.
library(dplyr)
library(tibble)
results_status <- as.data.frame(results(de_status)) %>%
rownames_to_column("GeneID")
We can sort the rows by adjusted p-value and then print the first 10 rows.
arrange(results_status, padj) %>%
head(n=10)
GeneID baseMean log2FoldChange lfcSE stat pvalue padj
1 MIR4508 202.93938 6.704474 0.5325059 12.590422 2.383957e-36 4.576959e-32
2 SEMA5B 1258.54783 4.309721 0.3538565 12.179291 4.007503e-34 3.847003e-30
3 HSPA6 3148.75296 -7.457737 0.8235722 -9.055353 1.361266e-19 8.711650e-16
4 NKX2-3 77.94173 5.948540 0.7544276 7.884839 3.149416e-15 1.511641e-11
5 HOXC4 194.33661 4.485504 0.5905854 7.595013 3.077607e-14 1.181740e-10
6 HOXC5 51.46773 4.586805 0.6261784 7.325078 2.387600e-13 7.639921e-10
7 FOXD1 207.75221 4.835241 0.7122362 6.788817 1.130569e-11 3.100827e-08
8 HOXC6 501.89633 3.943275 0.6052434 6.515189 7.259817e-11 1.742265e-07
9 NETO2 471.09916 2.960785 0.4637273 6.384754 1.716737e-10 3.662181e-07
10 CST9 19.63986 5.628379 0.8847921 6.361245 2.001247e-10 3.676494e-07
We can sort the rows and then write the resulting data frame to a file.
arrange(results_status, padj) %>%
write.csv("results/tumour_vs_normal_DESeq_all.csv")
Challenge 1
- Re-run the analysis to find differentially-expressed genes between the Gleason Grades 6 and 9
- Write a csv file that contains just the results for the genes that have a p-value less than 0.05 and a log2 fold change more than 1, or less than -1. HINT: So that we don’t overwrite our results so far, it may be convenient to create a new
DESeqDataSet
object for the new differential expression analysis.
dds_gleason <- dds
design(dds_gleason) <- ......
In this initial analyis DESeq2
has automatically decided which member of our sample groups to use as our baseline (Normal
in this case) so that the log2 fold changes are reported with a positve value meaning higher expression in Tumour
. If we want to change this behaviour we can change the contrast
argument in the results
function
## This should give the same as the table above
results(de_status, contrast=c("Status","Tumour","Normal"))
## Changing the direction of the contrast
results(de_status, contrast=c("Status","Normal","Tumour"))
If we change to performing differential expression analysis on the gleason_score
variable then there are various contrasts that can be made; gleason 6
vs Normal
, Gleason 9
vs Normal
etc. When the results
function is run the table that is displayed is for the contrast Normal vs G6
. The resultsNames
function can tell us which other contrasts we can access.
dds_gleason <- dds
design(dds_gleason) <- ~gleason_score
de_gleason <- DESeq(dds_gleason)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 577 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
resultsNames(de_gleason)
[1] "Intercept" "gleason_score_G9_vs_G6" "gleason_score_Normal_vs_G6"
results_gleason <- data.frame(results(de_gleason))
A venn diagram is a common way of visualising the overlap between two genelists. We need to create a data frame where each column indicates whether each gene is differentially expressed in a particular contrast or not. To create such columns we can do a logical test on the adjusted p-values from our results tables.
venn_data <- data.frame(CellType = results_status$padj<0.05,
Gleason = results_gleason$padj < 0.05)
library(limma)
vennDiagram(venn_data)
DESEq2
allows for more complicated models to be fit to the data. For guidance on how to fit more complicated models you can consult the DESeq2 vignette, the limma user guide or the Bioconductor mailing list.
In particular, DESeq2 allows multi-factor models which can account for other sources of variation in the data such as batches or gender.
Lets suppose that we wanted the different between virgin and lactatin individuals, but controlling for Batch
. The main assumption being that the effect of Status
is the same regardless of Batch
The design for such an analysis would be:-
dds_mf <- dds
design(dds_mf) <- ~Batch+Status
de_mf <- DESeq(dds.mf)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 600 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
results_mf <- results(de_mf,contrast=c("Status","Tumour","Normal"))
results_mf
log2 fold change (MLE): Status Tumour vs Normal
Wald test p-value: Status Tumour vs Normal
DataFrame with 23368 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
1/2-SBSRNA4 53.0443987895977 0.2832355361628 0.226779705469357 1.24894569192864 0.211684937966744
A1BG 221.123479031816 -1.17471447179457 0.508517726511583 -2.31007575655834 0.0208839603703524
A1BG-AS1 29.7317978744033 -1.32549411546739 0.474210774088385 -2.79515816150635 0.00518743297291225
A1CF 7.72356776697792 1.43903927022599 0.730103467956311 1.97100730702473 0.0487230400302148
A2LD1 400.322513053529 0.474744895157193 0.199951414788572 2.37430125542841 0.017582198742755
... ... ... ... ... ...
ZYG11B 2579.01965251234 -0.299743258592344 0.169578593358214 -1.76757721983914 0.0771316091131724
ZYX 8071.18437115632 -0.387691129714366 0.240282043951644 -1.61348356847001 0.106639490283516
ZZEF1 3233.23162325016 -0.251943528887863 0.186782910984024 -1.3488574921579 0.17738274398432
ZZZ3 2204.88768617892 0.149833112009978 0.151518264585969 0.988878221509495 0.322722727159437
tAKR 1.548947012253 -0.0606553024026499 0.781497440271342 -0.0776142048291161 0.938134942648593
padj
<numeric>
1/2-SBSRNA4 0.404505048256582
A1BG 0.0867887267993926
A1BG-AS1 0.0326402675538388
A1CF 0.154263757037365
A2LD1 0.0765231802955422
... ...
ZYG11B 0.209847384936015
ZYX 0.2601979804504
ZZEF1 0.3615648129121
ZZZ3 0.525498455619574
tAKR 0.968830992496522
The DESeq
workflow applies median of ratios normalization that accounts for differences in sequencing depth between samples. The user does not usually need to run this step. However, if you want a matrix of counts for some application outside of Bioconductor the values can be extracted from the dds
object.
dds <- estimateSizeFactors(dds)
countMatrix <-counts(dds, normalized=TRUE)
write.csv(countMatrix,file="processed_data/normalized_counts.csv")
We can also save the DESeq2 results so that we don’t have to repeat the analysis
saveRDS(de_status, file="Robjects/de_status.rds")
saveRDS(de_gleason, file="Robjects/de_gleason.rds")