Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law, Stephane Ballereau, Oscar Rueda, Ashley Sawle Based on the course RNAseq analysis in R delivered on May 11/12th 2016 and modified by Cancer Research Uk Cambridge Centre for the Functional Genomics Autumn School 2017
This material has been created using the following resources:
Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.
There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. Traditionally, these are aligned to a reference genome, then the number of reads mapped to each gene can be counted. More modern approaches such as salmon
quantify transcripts directly and do not require genome aligment to have taken place. Either approach results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, this session will be starting from the count data and getting stuck into analysis.
We will be following a workflow that uses the DESeq2
package. An alternative and well-respected workflow is based on the edgeR and limma packages.
The data for this tutorial comes from a Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival [@Fu2015].
This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates.
The sequencing reads for this experiment were uploaded to the Sequencing Read Archive (SRA) and details of how to obtain the raw reads and process are given in the session “Preparing fastq files for analysis in R”.
The sampleInfo.txt
in the meta_data
folder contains basic information about the samples that we will need for the analysis today. This includes the ID for the sample from SRA, an ID assigned by the researcher, and the cell type and developmental stage for each sample.
# Read the sample information into R
sampleinfo <- read.delim("meta_data/sampleInfo.txt")
View(sampleinfo)
sampleinfo
run Name CellType Status
1 SRR1552444 MCL1-LA basal virgin
2 SRR1552445 MCL1-LB luminal virgin
3 SRR1552446 MCL1-LC Luminal pregnancy
4 SRR1552447 MCL1-LD Luminal pregnancy
5 SRR1552448 MCL1-LE luminal lactation
6 SRR1552449 MCL1-LF luminal lactation
7 SRR1552450 MCL1-DG basal virgin
8 SRR1552451 MCL1-DH luminal virgin
9 SRR1552452 MCL1-DI basal pregnancy
10 SRR1552453 MCL1-DJ basal pregnancy
11 SRR1552454 MCL1-DK basal lactation
12 SRR1552455 MCL1-DL basal lactation
rownames(sampleinfo) <- sampleinfo$run
We are going to use the tximport
package to import the count data into R and collapse the data to the gene level. This requires us to run a function in the following form:-
txi <- tximport(files=..., type="salmon", tx2gene=...)
So we will need to define the files that we want to import and a transcript mapping data frame. The transcript mapping takes the form
| TXNAME | GENEID
1| ENST00000456328.2 | ENSG00000223972.5
2| ENST00000450305.2 | ENSG00000223972.5
3| ENST00000473358.1 | ENSG00000243485.5
4| ENST00000469289.1 | ENSG00000243485.5
5| ENST00000607096.1 | ENSG00000284332.1
6| ENST00000606857.1 | ENSG00000268020.3
tximport
is able to import counts produced by different software, and different workflows are described for each in the tximport vignette.
The samples from this study have been quantified using salmon
. For details on how this is done, please see the previous session on preparing fastq files for analysis.
The script we used to run salmon (run_salmon.sh
) created a separate folder for each sample (named according to the SRA ID), and inside each of these folders we find the salmon quantification file. Note that the salmon analysis produced many other files (e.g. log files), but we will only need the quant.sf.gz
files for analysis.
The function we are going to use to import the salmon files requires a vector
comprising the paths to the files that are to be imported. To construct such a vector we can use the following code chunk. Furthermore, we can name each item in the vector according to the directory name. These names will be used eventually to name the columns of our count matrices.
dirs <- list.files("salmon_quant/")
quant_files <- list.files("salmon_quant/",pattern="quant.sf.gz",recursive = TRUE,full.names = TRUE)
names(quant_files) <- dirs
quant_files
SRR1552444
"salmon_quant//SRR1552444/quant.sf.gz"
SRR1552445
"salmon_quant//SRR1552445/quant.sf.gz"
SRR1552446
"salmon_quant//SRR1552446/quant.sf.gz"
SRR1552447
"salmon_quant//SRR1552447/quant.sf.gz"
SRR1552448
"salmon_quant//SRR1552448/quant.sf.gz"
SRR1552449
"salmon_quant//SRR1552449/quant.sf.gz"
SRR1552450
"salmon_quant//SRR1552450/quant.sf.gz"
SRR1552451
"salmon_quant//SRR1552451/quant.sf.gz"
SRR1552452
"salmon_quant//SRR1552452/quant.sf.gz"
SRR1552453
"salmon_quant//SRR1552453/quant.sf.gz"
SRR1552454
"salmon_quant//SRR1552454/quant.sf.gz"
SRR1552455
"salmon_quant//SRR1552455/quant.sf.gz"
The quant files are simple tab-delimited files that tabulate the counting results for each transcript in our chosen organism. Although we will use a specialised Bioconductor package (tximport
) to import the counts for entire dataset into R, we can inspect the first of the files using the standard read_tsv
function from the readr
package.
library(readr)
quants <- read_tsv(quant_files[1])
Parsed with column specification:
cols(
Name = [31mcol_character()[39m,
Length = [32mcol_double()[39m,
EffectiveLength = [32mcol_double()[39m,
TPM = [32mcol_double()[39m,
NumReads = [32mcol_double()[39m
)
head(quants)
Various methods have been proposed to account for the two main known biases in RNA-seq; library composition and gene length. A nice summary is presented on this blog. The current favourite is TPM (transcripts per million) which is similar in concept to RPKM, but the different order in which operations are applied makes it easier to compare across samples.
The TPM are giving in the table, but we can demonstrate the calculation in a few steps
rpk <- quants$NumReads / quants$EffectiveLength
scale_factor <- sum(rpk) / 1e6
tpm <- rpk / scale_factor
In order for tximport
to give gene-level counts, we need to supply a data frame that can be used to associate each transcript name with a gene identifier. It is important to use a transcript file that corresponds to the name genome build as the file used to count the transcripts.
We can check if the gtf
file exists in the directory we expect by running the file.exists
function; returning TRUE
or FALSE
gtf_file <- "Mus_musculus.GRCm38.91.chr.gtf.gz"
file.exists(gtf_file)
[1] TRUE
If required, we can download from the Ensembl FTP site.
download.file("ftp://ftp.ensembl.org/pub/release-91/gtf/mus_musculus/Mus_musculus.GRCm38.91.chr.gtf.gz",destfile = gtf_file)
[images/download_gtf.png]
If analysing your own data, you will have to locate the gtf file on the Ensembl FTP site. If you enter ftp://ftp.ensembl.org/pub/release-91/gtf
into a web browser you will be able to navigate the site and find your organism of interest. By right-clicking on the name of the gtf you will be able to copy the URL and then paste into RStudio.
gtf_file <- "ensembl_ref/my_ref.gtf"
download.file(PASTE_LINK_FROM_ENSEMBL_HERE,destfile = gtf_file)
The Bioconducor website provides many pre-built transcript databases for some organisms (Human, Mouse,Rat etc) which provide transcript definitions and allow users to query the locations of particular genes, exons and other genomic features. You may find a pre-built package that already has the transcript locations required to create the transcript mapping file. Check out the annotation section of the Bioconductor website - http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData and look for packages starting TxDb...
However, it is quite easy to build such a database if we have a gtf
file using the GenomicFeatures
infrastructure.
## Could take a few minutes to run the makeTxDbFromGFF command
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(gtf_file)
The "phase" metadata column contains non-NA values for features
of type stop_codon. This information was ignored.
The database has a number of predefined “keys” and “columns” that have to be specified when creating a query
keytypes(txdb)
[1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID"
[7] "TXNAME"
columns(txdb)
[1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSPHASE"
[6] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID"
[11] "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
[16] "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART"
[21] "TXSTRAND" "TXTYPE"
Sometimes we would want to query the positions for a limited set of selected genes (perhaps the results of a differential-expression analysis), but in this case we want the gene names that correspond to every transcript in the database. To get the names of all transcripts we can use the keys
function. We then compose the query using the select
function to return a data frame
k <- keys(txdb, keytype="TXNAME")
tx_map <- select(txdb, keys = k, columns="GENEID", keytype = "TXNAME")
'select()' returned 1:1 mapping between keys and columns
head(tx_map)
TXNAME GENEID
1 ENSMUST00000193812 ENSMUSG00000102693
2 ENSMUST00000082908 ENSMUSG00000064842
3 ENSMUST00000192857 ENSMUSG00000102851
4 ENSMUST00000161581 ENSMUSG00000089699
5 ENSMUST00000192183 ENSMUSG00000103147
6 ENSMUST00000193244 ENSMUSG00000102348
Such a data frame should be sufficient to allow us to use the tximport
package. There is a bit of a problem though..
library(tximport)
tx2gene <- tx_map
write.csv(tx2gene,file="tx2gene.csv",row.names = FALSE,quote=FALSE)
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene)
In this case R is reporting a useful error message; the IDs we have supplied in the tx2gene
data frame do not correspond to the transcript names in our files.
table(tx_map$TXNAME %in% quants$Name)
FALSE
133849
Fortunately the authors of tximport
have recognised this as a common problem and added an argument to the tximport
function that can be used to overcome the error.
library(tximport)
tx2gene <- tx_map
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene,ignoreTxVersion = TRUE)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12
transcripts missing from tx2gene: 147
summarizing abundance
summarizing counts
summarizing length
The resulting object is a simple list
structure in R which contains a number of components that we can access using a $
operator
names(txi)
[1] "abundance" "counts" "length"
[4] "countsFromAbundance"
head(txi$counts)
SRR1552444 SRR1552445 SRR1552446 SRR1552447
ENSMUSG00000000001 5443.000 6085.000 5773.000 5166
ENSMUSG00000000003 0.000 0.000 0.000 0
ENSMUSG00000000028 259.000 329.753 267.000 102
ENSMUSG00000000037 66.000 97.999 65.999 42
ENSMUSG00000000049 2.000 3.000 0.000 0
ENSMUSG00000000056 266.253 326.000 589.000 464
SRR1552448 SRR1552449 SRR1552450 SRR1552451
ENSMUSG00000000001 2191 2189 4354.000 4369
ENSMUSG00000000003 0 0 0.000 0
ENSMUSG00000000028 231 206 304.694 259
ENSMUSG00000000037 24 21 182.000 162
ENSMUSG00000000049 0 0 4.000 6
ENSMUSG00000000056 402 360 770.000 522
SRR1552452 SRR1552453 SRR1552454 SRR1552455
ENSMUSG00000000001 4556.000 4079 3550 2800.000
ENSMUSG00000000003 0.000 0 0 0.000
ENSMUSG00000000028 199.000 152 44 63.000
ENSMUSG00000000037 230.000 198 238 192.000
ENSMUSG00000000049 1.000 1 3 2.000
ENSMUSG00000000056 382.998 230 441 398.001
all(rownames(sampleinfo) == colnames(txi$counts))
[1] TRUE
This was a fairly simple fix, but we could have also used some functionality from the tidyverse
collection of packages to make the names compatible.
None of the transcript names are matching up. If we look at the transcript names in the quant
files they have a .
symbol followed by a number after the transcript name. The tidyr
package contains several useful functions for tidying a data frame. One function we will use in particular is to split a column of value that contain multiple pieces of information separated by a common character. In this case the character is “.” but other common separating characters include -
, _
, :
. The separate
function is able to detect which character is being used.
Here we create two new columns, TXNAME
and Number
. The TXNAME
being the column that should match entries in the tx_map
data frame.
library(tidyr)
Attaching package: ‘tidyr’
The following object is masked from ‘package:S4Vectors’:
expand
quants <- separate(quants, Name, c("TXNAME","Number"),remove = FALSE)
head(quants)
Another useful piece of tidyverse
functionality is to join two tables together based on a common column. We want to create a data frame that has the transcript name as it appears in the quant files and the gene name. This can be achieved by a left_join
to ensure that all the transcript IDs in the quant files have an entry in the combined table.
N.B. those already familiar with R might have encountered the merge
function for achieving a similar result.
library(dplyr)
Attaching package: ‘dplyr’
The following object is masked from ‘package:AnnotationDbi’:
select
The following object is masked from ‘package:Biobase’:
combine
The following objects are masked from ‘package:GenomicRanges’:
intersect, setdiff, union
The following object is masked from ‘package:GenomeInfoDb’:
intersect
The following objects are masked from ‘package:IRanges’:
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from ‘package:S4Vectors’:
first, intersect, rename, setdiff, setequal, union
The following objects are masked from ‘package:BiocGenerics’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
quants <- left_join(quants, tx_map, by="TXNAME")
head(quants)
To create the tx2gene
we can pull out the columns that we require from this merged data frame. There is another dplyr
function for doing this that has simpler syntax than the “base R” method of subsetting a data frame.
tx2gene <- dplyr:::select(quants, Name, GENEID)
head(tx2gene)
We might want to further check that there are no “missing values” in the Gene ID column and remove them from the data frame
any(is.na(tx2gene$GENEID))
[1] TRUE
tx2gene <- filter(tx2gene, !is.na(GENEID))
library(tximport)
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12
transcripts missing from tx2gene: 147
summarizing abundance
summarizing counts
summarizing length
We will be using the DESeq2
library to analyse this dataset. As part of the DESeq2 vignette you will see examples of importing count data from different sources. In all cases, raw count data are expected.
For this workflow we are going to import data from tximport
with the DESeqDataSetFromTximport
function along with the sample information that we created earlier.
A design for the experiment also needs to be specified. This will define how the differential expression analysis is carried out, but can be changed at a later stage so we will use CellType
for now as our factor of interest.
library(DESeq2)
dds <- DESeqDataSetFromTximport(txi,
colData = sampleinfo,
design <- ~CellType)
The object contains all the metadata for the experiment, along with the counts.
colData(dds)
DataFrame with 12 rows and 4 columns
run Name CellType Status
<factor> <factor> <factor> <factor>
SRR1552444 SRR1552444 MCL1-LA basal virgin
SRR1552445 SRR1552445 MCL1-LB luminal virgin
SRR1552446 SRR1552446 MCL1-LC Luminal pregnancy
SRR1552447 SRR1552447 MCL1-LD Luminal pregnancy
SRR1552448 SRR1552448 MCL1-LE luminal lactation
... ... ... ... ...
SRR1552451 SRR1552451 MCL1-DH luminal virgin
SRR1552452 SRR1552452 MCL1-DI basal pregnancy
SRR1552453 SRR1552453 MCL1-DJ basal pregnancy
SRR1552454 SRR1552454 MCL1-DK basal lactation
SRR1552455 SRR1552455 MCL1-DL basal lactation
We will be using these raw counts throughout the workshop and transforming them using methods in the DESeq2
package. If TPM values are desired for some other application, we can extract them from the tximport
object. These are the transcript-level TPM values that we saw earlier in the quant files that have been summarised to the gene-level.
tpm <- txi$abundance
write.csv(tpm, file="tpm_values.csv",quote=FALSE)
At the time of writing, TPM is the recommended way of transforming RNA-seq counts to account for gene length and library composition biases. DESeq2 also provides methods for extracting normalised counts as FPKM (fragments per kilobase per million mapped fragments) and FPM (fragments per million mapped fragments) which might be required for some other analysis or visualisation outside of Bioconductor or DESeq2
.
fpm <- fpm(dds)
write.csv(fpm, file="fpm_values.csv",quote=FALSE)
fpkm <- fpkm(dds)
write.csv(fpkm, file="fpkm_values.csv",quote=FALSE)
We can look at a few different plots to check that the data is good quality, and that the samples are as we would expect. First, we can check how many reads we have for each sample in the DESeqDataSet
. The counts themselves are accessed using the assay
function; giving a matrix of counts. The sum of a particular column is therefore the total number of reads for that sample.
sum(assay(dds)[,1])
[1] 24444863
A convenience function colSums
exists for calculating the sum of each column in a matrix, returning a vector
as a result.
colSums(assay(dds))
SRR1552444 SRR1552445 SRR1552446 SRR1552447 SRR1552448 SRR1552449
24444863 25995520 26800109 26378758 29201722 29106858
SRR1552450 SRR1552451 SRR1552452 SRR1552453 SRR1552454 SRR1552455
27070438 25504036 27990146 26185840 24395121 22722335
Challenge 1
- Produce a bar plot to show the Millions of reads for each sample
- Change the names under the plot so they are the “Name” of the sample rather than Run Name
- Add a horizontal line at 20 million reads
HINT: look at the help for the base graphics functions
barplot
andabline
(or try and useggplot2
if you prefer)
The dataset is usually filtered at this stage to remove any genes that are not expressed. Although not strictly required for the DESeq2 differential expression algorithm, it can reduce the time and memory required to perform some of the analysis. Let’s say that for a gene to be “expressed” in a particular sample we need to see 5 or more counts
is_expressed <- assay(dds) >= 5
head(is_expressed)
SRR1552444 SRR1552445 SRR1552446 SRR1552447
ENSMUSG00000000001 TRUE TRUE TRUE TRUE
ENSMUSG00000000003 FALSE FALSE FALSE FALSE
ENSMUSG00000000028 TRUE TRUE TRUE TRUE
ENSMUSG00000000037 TRUE TRUE TRUE TRUE
ENSMUSG00000000049 FALSE FALSE FALSE FALSE
ENSMUSG00000000056 TRUE TRUE TRUE TRUE
SRR1552448 SRR1552449 SRR1552450 SRR1552451
ENSMUSG00000000001 TRUE TRUE TRUE TRUE
ENSMUSG00000000003 FALSE FALSE FALSE FALSE
ENSMUSG00000000028 TRUE TRUE TRUE TRUE
ENSMUSG00000000037 TRUE TRUE TRUE TRUE
ENSMUSG00000000049 FALSE FALSE FALSE TRUE
ENSMUSG00000000056 TRUE TRUE TRUE TRUE
SRR1552452 SRR1552453 SRR1552454 SRR1552455
ENSMUSG00000000001 TRUE TRUE TRUE TRUE
ENSMUSG00000000003 FALSE FALSE FALSE FALSE
ENSMUSG00000000028 TRUE TRUE TRUE TRUE
ENSMUSG00000000037 TRUE TRUE TRUE TRUE
ENSMUSG00000000049 FALSE FALSE FALSE FALSE
ENSMUSG00000000056 TRUE TRUE TRUE TRUE
R is happy to think of logical values (TRUE
or FALSE
) as the integers 0
or 1
. Therefore if we calculate the sum across a particular row it will give the number of samples that gene is expressed in.
sum(is_expressed[1,])
[1] 12
sum(is_expressed[2,])
[1] 0
In a similar manner to colSums
, rowSums
will give the sum of each row in a matrix and return the result as a vector
.
hist(rowSums(is_expressed),main="Number of samples a gene is expressed in",xlab="Sample Count")
It seems that genes are either expressed in all samples, or not expressed at all. We will decide to keep genes that are expressed in at least 2 samples.
keep <- rowSums(assay(dds) >= 5) >= 2
table(keep)
keep
FALSE TRUE
16479 17973
dds <- dds[keep,]
We typically use a boxplot
to visualise difference the distributions of the columns of a numeric data frame. Applying the boxplot
function to the raw counts from our dataset reveals something about the nature of the data; the distributions are dominated by a few genes with very large counts.
boxplot(assay(dds))
boxplot(log10(assay(dds)))
We can use the vst
or rlog
function from DESeq2
to compensate for the effect of different library sizes and put the data on the log\(_2\) scale. The effect is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. For more details see the DESeq2 vignette
# Get log2 counts
vsd <- vst(dds,blind=TRUE)
using 'avgTxLength' from assays(dds), correcting for library size
# Check distributions of samples using boxplots
boxplot(assay(vsd), xlab="", ylab="Log2 counts per million",las=2,main="Normalised Distributions")
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(assay(vsd)), col="blue")
Another use of the transformed data is sample clustering. Here, we apply the dist
function to the transpose of the transformed count matrix to get sample-to-sample distances.
sampleDists <- dist(t(assay(vsd)))
A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples. By re-naming the rows and columns of the distance matrix we can make the plot easier to interpret.
library(RColorBrewer)
library(pheatmap)
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(colData(dds)$CellType, colData(dds)$Status, sep="-")
colnames(sampleDistMatrix) <- colData(dds)$Name
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
col=colors)
Related to the distance matrix heatmap is the (Principal Components Analysis) PCA plot, which shows the samples in the 2D plane spanned by their first two principal components. A principle components analysis is an example of an unsupervised analysis, where we don’t need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers
DESeq2
has a convenient plotPCA
function for making the PCA plot, which makes use of the ggplot2
graphics package.
plotPCA(vsd,intgroup="CellType")
Challenge 2
- Is the
plotPCA
plot based on all genes in the dataset? How can we change how many genes are used for the PCA analysis? Does this significantly change the plot? (HINT: check the documentation for theplotPCA
function.)- Change the
intgroup
parameter so that both CellType and Status are used for grouping. (See the documentation again)- Is there something strange going on with the samples?
- Identify the two samples that don’t appear to be in the right place.
- What other problems can you see with the metadata?
In our unsupervised analysis we should see that the main source of variation is due to biological effects, and not technical variation such as when the libraries were sequenced. If we do observe high technical variation in our data, it is not a complete disaster provided that we have designed our experiment propery. In particular the sva Bioconductor package can correct for batch effects provided that representatives of the groups of interest appear in each batch. Alternatively, the batch or confounding factor may be incorporated into the differential expression analysis.
Hopefully we have spotted a potential sample swap in the dataset. The mislabelled samples are MCL1.DH, which is labelled as luminal but should be basal, and MCL1.LA, which is labelled as basal but should be luminal. Such errors are not uncommon when handling large numbers of samples and sometimes we need to go back to the lab books and verify that a swap has been made. If there is no supporting evidence for a swap then it can be safer to exclude the samples.
Furthermore, the person creating the sample sheet has been inconsistent about the way that values of CellType
and Status
have been entered into the metadata. Such errors can be annoying when labelling plots, but have more serious consequences when attempting to fit statistical models to the data.
library(stringr)
sampleinfo_corrected <- sampleinfo
sampleinfo_corrected <- mutate(sampleinfo_corrected, CellType = str_to_lower(CellType))
sampleinfo_corrected <- mutate(sampleinfo_corrected, Status = str_trim(Status))
sampleinfo_corrected <- mutate(sampleinfo_corrected, CellType = ifelse(Name == "MCL1-DH","basal",CellType))
sampleinfo_corrected <- mutate(sampleinfo_corrected, CellType= ifelse(Name == "MCL1-LA","luminal",CellType))
write.table(sampleinfo_corrected, file="meta_data/sampleInfo_corrected.txt",sep="\t",row.names = FALSE)
Challenge 3
- Re-create the DESeqDataset object to include the corrected sample information
- Re-run the plotPCA function on the new data and verify that the sample groups now look correct
It is common practice when using a series of dplyr
data frame manipulations (such as mutate
above) to create a workflow with the %>%
operation from the magrittr
package. The result is a more-readable chunk of code, although the result is exactly the same.
sampleinfo_corrected <- mutate(sampleinfo, CellType = str_to_lower(CellType)) %>%
mutate(Status = str_trim(Status)) %>%
mutate(CellType = ifelse(Name == "MCL1-DH","basal",CellType)) %>%
mutate(CellType= ifelse(Name == "MCL1-LA","luminal",CellType))
sampleinfo_corrected
run Name CellType Status
1 SRR1552444 MCL1-LA luminal virgin
2 SRR1552445 MCL1-LB luminal virgin
3 SRR1552446 MCL1-LC luminal pregnancy
4 SRR1552447 MCL1-LD luminal pregnancy
5 SRR1552448 MCL1-LE luminal lactation
6 SRR1552449 MCL1-LF luminal lactation
7 SRR1552450 MCL1-DG basal virgin
8 SRR1552451 MCL1-DH basal virgin
9 SRR1552452 MCL1-DI basal pregnancy
10 SRR1552453 MCL1-DJ basal pregnancy
11 SRR1552454 MCL1-DK basal lactation
12 SRR1552455 MCL1-DL basal lactation
The plotPCA
function produces a ggplot2
plot (rather than “base” graphics) and we can also get the function to return the data used to create the plot by changing the returnData
argument.
plot_data <- plotPCA(vsd,intgroup=c("CellType","Status"),returnData=TRUE)
plot_data <- bind_cols(plot_data,sampleinfo_corrected)
With ggplot2
plot can be created by mapping various aesthetics of the plot (colour, shape, x- and y-coordinates) to columns in the data frame.
library(ggplot2)
ggplot(plot_data, aes(x = PC1,y=PC2, col=CellType)) + geom_point()
Having the data in this form allows us to customise the plot in lots of ways
ggplot(plot_data, aes(x = PC1,y=PC2, col=CellType,pch=Status)) + geom_point(size=5)
ggplot(plot_data, aes(x = PC1,y=PC2, col=CellType,pch=Status,label=Name)) + geom_point() + geom_text(alpha=0.4)