The sequencing reads were quantified using the salmon tool (out of the scope of this course). The output of this tool is kept in the salmon_quant
directory of the course data folder.
We have to use the list.files
function to locate all the files. It can be run in “recursive” mode means that it checks all sub-folders.
## see help for list.files
## recursive checks all sub-folders
## full.names returns the path in addition to the file name
## dirs contain all the sample names
dirs <- list.files("salmon_quant/")
quant_files <- list.files("salmon_quant/",pattern="quant.sf.gz",recursive = TRUE,full.names = TRUE)
## Set nicer names
names(quant_files) <- dirs
quant_files
1_CTR_BC_2
"salmon_quant//1_CTR_BC_2/quant.sf.gz"
2_TGF_BC_4
"salmon_quant//2_TGF_BC_4/quant.sf.gz"
3_IR_BC_5
"salmon_quant//3_IR_BC_5/quant.sf.gz"
4_CTR_BC_6
"salmon_quant//4_CTR_BC_6/quant.sf.gz"
5_TGF_BC_7
"salmon_quant//5_TGF_BC_7/quant.sf.gz"
6_IR_BC_12
"salmon_quant//6_IR_BC_12/quant.sf.gz"
7_CTR_BC_13
"salmon_quant//7_CTR_BC_13/quant.sf.gz"
8_TGF_BC_14
"salmon_quant//8_TGF_BC_14/quant.sf.gz"
9_IR_BC_15
"salmon_quant//9_IR_BC_15/quant.sf.gz"
Two further files are required. 1) the transcript to gene mapping (created previously to save time). 2) the sample information
library(readr)
tx2gene <- read_csv("tx2gene.csv")
-- Column specification -----------------------------------------------
cols(
ENST00000456328 = col_character(),
ENSG00000223972 = col_character()
)
sampleinfo <- read.csv("meta_data/sampleInfo.csv")
The tximport
package is first required to import the quantification files, and summarise them to the gene level.
library(tximport)
# ignoreTxVersion is required to make sure the transcript names match up
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene,ignoreTxVersion = TRUE)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9
transcripts missing from tx2gene: 1
summarizing abundance
summarizing counts
summarizing length
We can now read these data into DESeq2
. It has many possible input formats each of which requires a different function. One of the arguments is a data frame containing sample information. The design specifies which contrast we want to analyse in R’s modeling format (using the ~
symbol). In the simplest case the design argument can correspond to a column in the sample information that we want to compare.
library(DESeq2)
dds <- DESeqDataSetFromTximport(txi,
colData = sampleinfo,
design = ~condition)
some variables in design formula are characters, converting to factors
dds
class: DESeqDataSet
dim: 57914 9
metadata(1): version
assays(2): counts avgTxLength
rownames(57914): ENSG00000000003 ENSG00000000005 ...
ENSG00000284747 ENSG00000284748
rowData names(0):
colnames(9): 1_CTR_BC_2 2_TGF_BC_4 ... 8_TGF_BC_14 9_IR_BC_15
colData names(5): Run condition Name Replicate Treated
It is useful to check the number of reads obtained for each sample. Low read count could indicate technical issues such as poor RNA quality. The number of reads for each sample can be obtained by summing each column in the assay(dds)
data frame. These can be plotted as a bar graph.
library(dplyr)
library(ggplot2)
png("images/lib_size.png")
mutate(sampleinfo, LibSize = colSums(assay(dds))/1e6) %>%
ggplot(aes(x = Name, y = LibSize)) + geom_col(fill="steelblue") + geom_hline(yintercept = 20,col="red",lty=2)
Colouring by condition
reveals some inconsistencies in encoding
library(dplyr)
library(ggplot2)
pca_data <- plotPCA(vsd,intgroup="Treated",returnData = TRUE) %>%
dplyr::rename(Run = name) %>%
left_join(sampleinfo)
Joining, by = c("Treated", "Run")
ggplot(pca_data,aes(x = PC1, y = PC2,col=condition)) + geom_point()
Adding labels to the plot with geom_text
Nicer positioning of the labels is possible with the ggrepel
package.
library(ggrepel)
pca_data <- plotPCA(vsd,intgroup="Treated",returnData = TRUE) %>%
dplyr::rename(Run = name) %>%
left_join(sampleinfo)
Joining, by = c("Treated", "Run")
ggplot(pca_data,aes(x = PC1, y = PC2,col=group,label = Name)) + geom_point() + geom_text_repel()
After identifying a sample swap, we correct the sample sheet and start the data import again. Only the data frame containing the sample information needs to change.
sampleinfo_corrected <- read_tsv("meta_data/sampleInfo_corrected.txt")
-- Column specification -----------------------------------------------
cols(
Run = col_character(),
condition = col_character(),
Name = col_character(),
Replicate = col_double(),
Treated = col_character()
)
dds <- DESeqDataSetFromTximport(txi,
colData = sampleinfo_corrected,
design = ~condition)
some variables in design formula are characters, converting to factorsusing counts and average transcript lengths from tximport
dds
class: DESeqDataSet
dim: 57914 9
metadata(1): version
assays(2): counts avgTxLength
rownames(57914): ENSG00000000003 ENSG00000000005 ...
ENSG00000284747 ENSG00000284748
rowData names(0):
colnames(9): 1_CTR_BC_2 2_TGF_BC_4 ... 8_TGF_BC_14 9_IR_BC_15
colData names(5): Run condition Name Replicate Treated
The PCA plot can be used to verify the new sample groups. A clear separation is seen on the first component between basal and luminal samples (as we would expect).
vsd <- vst(dds)
using 'avgTxLength' from assays(dds), correcting for library size
plotPCA(vsd,intgroup="condition")
sessionInfo()