Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law, Stephane Ballereau, Oscar Rueda, Ashley Sawle Based on the course RNAseq analysis in R delivered on May 11/12th 2016 and modified by Cancer Research Uk Cambridge Centre for the Functional Genomics Autumn School 2017
This material has been created using the following resources:
DESeq2
Now that we are happy that we have normalised the data and that the quality looks good, we can continue to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use DESeq2
or edgeR
. We will use DESeq2
for the rest of this practical.
**First make sure we have all the objects and libraries loaded*
library(DESeq2)
library(tximport)
The previous section walked-through the pre-processing and transformation of the count data. Here, for completeness, we list the minimal steps required to process the data prior to differential expression analysis.
Note that although we spent some time looking at the quality of our data , these steps are not required prior to performing differential expression so are not shown here. Remember, DESeq2
requires raw counts so the vst
transformation is not shown as part of this basic protocol.
dirs <- list.files(file.path("salmon_quant/"))
quant_files <- paste0("salmon_quant/",dirs,"/quant.sf.gz")
names(quant_files) <- dirs
tx2gene <- read.csv("tx2gene.csv")
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene,ignoreTxVersion = TRUE)
sampleinfo <- read.delim("meta_data/sampleInfo_corrected.txt")
rownames(sampleinfo) <- sampleinfo$run
dds <- DESeqDataSetFromTximport(txi,
colData = sampleinfo,
design <- ~CellType)
keep <- rowSums(assay(dds) >= 5) >= 2
dds <- dds[keep,]
We also have the output of the pre-processing section saved as an R object if you didn’t manage to complete these steps.
## Only run if you didn't complete the previous section on pre-processing
load("Robjects/preprocessing.Rdata")
We have previously defined the test condition using the design
argument when we created the object. This can be checked using the design
function.
Typically we decide the design for the analysis when we create the DESeq2 objects, but it can be modified prior to the differential expression analysis
colData(dds)
DataFrame with 12 rows and 4 columns
run Name CellType Status
<factor> <factor> <factor> <factor>
SRR1552444 SRR1552444 MCL1-LA luminal virgin
SRR1552445 SRR1552445 MCL1-LB luminal virgin
SRR1552446 SRR1552446 MCL1-LC luminal pregnancy
SRR1552447 SRR1552447 MCL1-LD luminal pregnancy
SRR1552448 SRR1552448 MCL1-LE luminal lactation
... ... ... ... ...
SRR1552451 SRR1552451 MCL1-DH basal virgin
SRR1552452 SRR1552452 MCL1-DI basal pregnancy
SRR1552453 SRR1552453 MCL1-DJ basal pregnancy
SRR1552454 SRR1552454 MCL1-DK basal lactation
SRR1552455 SRR1552455 MCL1-DL basal lactation
design(dds) <- ~CellType
The function runs a couple of processing steps automatically to adjust for different library size and gene-wise variabiliy, which you can read about in the DESeq2 vignette.
Firstly apply the median ratio normalisation method to compensate for differences in library sizes
dds <- estimateSizeFactors(dds)
estimate the dispersion for each gene
dds <- estimateDispersions(dds)
Apply statistical testing based on the negative binomial distribution.
dds <- nbinomWaldTest(dds)
Fortunately, there is one convenient function that will apply the three steps
de.cellType <- DESeq(dds)
using pre-existing normalization factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
The results of the analysis can be obtained using the results
function and displayed to the screen. Each row is a particular gene measured in the study (i.e. all genes in the organism being studied) and each column reports some aspect of the differential expression analysis for that gene. Note that all genes are reported. At this stage the gene identifiers are not very informative, something we will fix in the next section.
results(de.cellType)
log2 fold change (MLE): CellType luminal vs basal
Wald test p-value: CellType luminal vs basal
DataFrame with 17973 rows and 6 columns
baseMean log2FoldChange
<numeric> <numeric>
ENSMUSG00000000001 4068.33959643508 0.385871582598132
ENSMUSG00000000028 203.536232806184 0.811809379599301
ENSMUSG00000000037 121.264614183815 -1.79043393209216
ENSMUSG00000000056 437.163852367364 0.0464783556140183
ENSMUSG00000000058 3648.16003486305 0.14655749676048
... ... ...
ENSMUSG00000115744 6.5102254866026 -0.196252877440725
ENSMUSG00000115798 8.61324603359873 0.62806875352405
ENSMUSG00000115808 5.69983980001993 -1.87315498117461
ENSMUSG00000115812 3.48530425952456 0.0613115285646093
ENSMUSG00000115814 9.3358310615054 -0.646037485661208
lfcSE stat
<numeric> <numeric>
ENSMUSG00000000001 0.140534932730412 2.74573428186961
ENSMUSG00000000028 0.427713111143038 1.89802313384733
ENSMUSG00000000037 0.325457628706606 -5.50128119352274
ENSMUSG00000000056 0.336482286230572 0.138130170638966
ENSMUSG00000000058 0.498353539580032 0.294083386834144
... ... ...
ENSMUSG00000115744 0.484587611080003 -0.404989465172943
ENSMUSG00000115798 0.52295422261736 1.20100140004721
ENSMUSG00000115808 0.678199955662751 -2.76195090479492
ENSMUSG00000115812 0.669781817316698 0.0915395535970767
ENSMUSG00000115814 0.606181994081262 -1.06574839234602
pvalue padj
<numeric> <numeric>
ENSMUSG00000000001 0.00603756420772203 0.0133708069701353
ENSMUSG00000000028 0.0576930346502584 0.0969433024686542
ENSMUSG00000000037 3.77041320862499e-08 2.2514645684759e-07
ENSMUSG00000000056 0.890137541600907 0.919109240769746
ENSMUSG00000000058 0.768694185261808 0.826103590875375
... ... ...
ENSMUSG00000115744 0.685485254686406 0.757823742959555
ENSMUSG00000115798 0.22975065824174 0.313957596604923
ENSMUSG00000115808 0.00574571175591682 0.0127875558117554
ENSMUSG00000115812 0.927063878924631 0.947700484484525
ENSMUSG00000115814 0.286537395618647 0.376541677087853
The output can be converted into a data frame and manipulated in the usual manner. It is recommended to use dplyr
to manipulate the data frames with the standard set of operations detailed on the dplyr cheatsheet
select
to pick which columns to displayfilter
to restrict the rowsmutate
to add new variables to the data framearrange
to order the data frame according to values of a columnHowever, dpylr
does not like data frame that have rownames. We can use the rownames_to_column
function from the tibble
package to add an extra column that contains the Ensembl gene IDs.
The %>%
symbol refers to the piping operation in R, which is a way of chaining operations together.
library(dplyr)
library(tibble)
results.cellType <- as.data.frame(results(de.cellType)) %>%
rownames_to_column("GeneID")
results.cellType
We can sort the rows by adjusted p-value and then print the first 10 rows.
arrange(results.cellType, padj) %>%
head(n=10)
Or we can sort the rows and then write the resulting data frame to a file.
arrange(results.cellType, padj) %>%
write.csv("basal_vs_luminal_DESeq_all.csv")
arrange(results.cellType, padj) %>%
filter(padj < 0.05) %>%
write.csv("basal_vs_luminal_DESeq_DE.csv")
Challenge 1
- Re-run the analysis to find differentially-expressed genes between the developmental stages virgin and lactation
- Write a csv file that contains results for the genes that have a p-value less than 0.05 and a log2 fold change more than 1, or less than -1. HINT: So that we don’t overwrite our results so far, it may be convenient to create a new
DESeqDataSet
object for the new differential expression analysis.
dds.status <- dds
design(dds.status) <- ......
In this initial analyis DESeq2
has automatically decided which member of our sample groups to use as our baseline (basal
in this case) so that the log2 fold changes are reported with a positve value meaning higher expression in luminal
. If we want to change this behaviour we can change the contrast
argument in the results
function
## This should give the same as the table above
results(de.cellType, contrast=c("CellType","luminal","basal"))
## Changing the direction of the contrast
results(de.cellType, contrast=c("CellType","basal","luminal"))
If we change to performing differential expression analysis on the Status
variable then there are various contrasts that can be made; pregnant
vs lactation
, lactation
vs virgin
etc. When the results
function is run the table that is displayed is for the contrast virgin vs lactate
. The resultsNames
function can tell us which other contrasts we can access.
dds.status <- dds
design(dds.status) <- ~Status
de.status <- DESeq(dds.status)
using pre-existing normalization factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
resultsNames(de.status)
[1] "Intercept" "Status_pregnancy_vs_lactation"
[3] "Status_virgin_vs_lactation"
results.status <- data.frame(results(de.status))
A venn diagram is a common way of visualising the overlap between two genelists. We need to create a data frame where each column indicates whether each gene is differentially expressed in a particular contrast or not. To create such columns we can do a logical test on the adjusted p-values from our results tables.
venn_data <- data.frame(CellType = results.cellType$padj<0.05,
Status = results.status$padj < 0.05)
library(limma)
vennDiagram(venn_data)
Challenge 2
- Use a venn diagram to visualise the overlap in the genes found to be differentially expressed in the
pregnant vs virgin
andlactation vs virgin
contrasts.- How many genes are in common?
DESEq2
allows for more complicated models to be fit to the data. For guidance on how to fit more complicated models you can consult the DESeq2 vignette, the limma user guide or the Bioconductor mailing list.
In particular, DESeq2 allows multi-factor models which can account for other sources of variation in the data such as batches or gender.
Lets suppose that we wanted the different between virgin and lactatin individuals, but controlling for CellType
. The main assumption being that the effect of Status
is the same regardless of CellType
The design for such an analysis would be:-
dds.mf <- dds
design(dds.mf) <- ~CellType+Status
de.mf <- DESeq(dds.mf)
using pre-existing normalization factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
results.mf <- results(de.mf,contrast=c("Status","lactation","virgin"))
results.mf
log2 fold change (MLE): Status lactation vs virgin
Wald test p-value: Status lactation vs virgin
DataFrame with 17973 rows and 6 columns
baseMean log2FoldChange
<numeric> <numeric>
ENSMUSG00000000001 4068.33959643508 -0.157073961462995
ENSMUSG00000000028 203.536232806184 -0.463768112351007
ENSMUSG00000000037 121.264614183815 -0.125203128964674
ENSMUSG00000000056 437.163852367364 0.322627299307911
ENSMUSG00000000058 3648.16003486305 1.61799767940002
... ... ...
ENSMUSG00000115744 6.5102254866026 -0.68542697373131
ENSMUSG00000115798 8.61324603359873 -1.12506788283054
ENSMUSG00000115808 5.69983980001993 1.46274904838166
ENSMUSG00000115812 3.48530425952456 -0.268958050014585
ENSMUSG00000115814 9.3358310615054 -0.0601959612877811
lfcSE stat
<numeric> <numeric>
ENSMUSG00000000001 0.112030929585792 -1.40205889609002
ENSMUSG00000000028 0.538243641584199 -0.861632310204371
ENSMUSG00000000037 0.408710289357286 -0.306337110234149
ENSMUSG00000000056 0.42532383259027 0.758545076919567
ENSMUSG00000000058 0.353421127866108 4.57810117116992
... ... ...
ENSMUSG00000115744 0.642620944270856 -1.06661163138563
ENSMUSG00000115798 0.600733305718197 -1.87282421687189
ENSMUSG00000115808 0.745592394541443 1.96186154672525
ENSMUSG00000115812 0.891160642964858 -0.301806472422045
ENSMUSG00000115814 0.719709561366585 -0.0836392407702365
pvalue padj
<numeric> <numeric>
ENSMUSG00000000001 0.160897660045662 0.298572057592648
ENSMUSG00000000028 0.388889885208587 0.554834325762911
ENSMUSG00000000037 0.759347986603781 0.852902660201997
ENSMUSG00000000056 0.448124738284556 0.60949480228342
ENSMUSG00000000058 4.69215756155008e-06 5.80255648055399e-05
... ... ...
ENSMUSG00000115744 0.286147245969111 0.450595189259171
ENSMUSG00000115798 0.0610926627298804 0.145637759255059
ENSMUSG00000115808 0.049778605778987 0.124363730070539
ENSMUSG00000115812 0.76279959654206 0.855617282341187
ENSMUSG00000115814 0.933343266604123 0.962308323815519
The DESeq
workflow applies median of ratios normalization that accounts for differences in sequencing depth between samples. The user does not usually need to run this step. However, if you want a matrix of counts for some application outside of Bioconductor the values can be extracted from the dds
object.
dds <- estimateSizeFactors(dds)
countMatrix <-counts(dds, normalized=TRUE)
head(countMatrix)
write.csv(countMatrix,file="normalized_counts.csv")
save(de.cellType,de.status,de.mf, file="Robjects/DE.Rdata")
Lun, Aaron T L, Yunshun Chen, and Gordon K Smyth. 2016. “It’s DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR.” Methods in Molecular Biology (Clifton, N.J.) 1418 (January): 391–416. doi:10.1007/978-1-4939-3578-9\_19.