Obtaining raw reads
Before we can begin the analysis, we need to make our raw sequencing
data accessible within the same computing environment that we are going
to use for analysis. Assuming that you will be using a computing node
without a GUI, this can be a non-trivial task. For example, we cannot
easily drag-and-drop the files from one location to another or click a
download link. Therefore we will introduce some commands for
transferring files, compressing and de-compressing and checking file
integrity.
Lets create a temporary directory to test these tools:-
cd
mkdir tmp
cd tmp
Use wget
to download from a remote location
If your data are hosted on a website or FTP site, you can use the
wget
command to download to the current working directory.
A small fastq
file is provided on the Sheffield
Bioinformatics Core website as an example. The output updates you on the
progress of the download.
Note that if downloading from an FTP site you might be prompted for a
username and password.
wget http://sbc.shef.ac.uk/data/example_reads.fastq
ls -lh
The contents of the file can be printed to the screen using the
head
command you should have learnt about previously
head -n 4 example_reads.fastq
Details on the FASTQ format
Although it looks complicated (and it is), we can understand the fastq format with
a little decoding. Some rules about the format include…
1 |
Always begins with ‘@’ and then information about the read |
2 |
The actual DNA sequence |
3 |
Always begins with a ‘+’ and sometimes the same info in line 1 |
4 |
Has a string of characters which represent the quality scores; must
have same number of characters as line 2 |
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
Line 4 shows the quality for each nucleotide in the read. Quality is
interpreted as the probability of an incorrect base call (e.g. 1 in 10)
or, equivalently, the base call accuracy (e.g. 90%). To make it possible
to line up each individual nucleotide with its quality score, the
numerical score is converted into a code where each individual character
represents the numerical quality score for an individual nucleotide. For
example, in the line above, the quality score line is:
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
The numerical value assigned to each of these characters depends on
the sequencing platform that generated the reads. The sequencing machine
used to generate our data uses the standard Sanger quality PHRED score
encoding, using Illumina version 1.8 onwards. Each character is assigned
a quality score between 0 and 41 as shown in the chart below.
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
| | | | |
Quality score: 01........11........21........31........41
Each quality score represents the probability that the corresponding
nucleotide call is incorrect. This quality score is logarithmically
based, so a quality score of 10 reflects a base call accuracy of 90%,
but a quality score of 20 reflects a base call accuracy of 99%. These
probability values are the results from the base calling algorithm and
depend on how much signal was captured for the base incorporation.
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
Since this file is so small, we needn’t worry about the file size.
However, in a real life experiment our raw sequencing reads can be many
10s of Gb in size. It is common practice to compress the file so that it
takes less space on disk. The standard way of achieving this in Unix
systems is to use the gzip
command. A side-effect of the
compression is that the resulting file can no longer be printed with
cat
. However, a zcat
command can be used to
the same effect.
##example of compressing the file
gzip example_reads.fastq
## list the directory to see what changed...
ls -lh
## Now have to use zcat to print...
zcat example_reads.fastq | head -n4
Download an archived directory
Another popular technique for compressing and transfering a
collection of files is to organise them into a single file (or
tarball), which can then be compressed. Such a file is also
available on the Sheffield Bioinformatics Core website for demonstration
purposes. The wget
command can be used as before.
wget http://sbc.shef.ac.uk/data/archive.tar.gz
ls
The tar
command with the options zxvf
(look
these up if you want to understand more) can be used to create a
directory structure containing all the files.
tar zxvf archive.tar.gz
ls
Checking the download worked
Having downloaded all your files, you should probably check they have
been downloaded successfully and been corrupted in transit. Partial or
incomplete files can be a source of errors later on in the analysis. If
you are downloading many files, it can be difficult to spot if a file
failed to download. A simple way is to check the file sizes, but this
can be unreliable.
A common way of doing this is to generate an md5sum from the
file. You can think of this as a unique digital barcode of a text file.
The same file will always give the same md5sum. Sequencing vendors will
often calculate these files on their servers and send the results to you
in the form of a text file that you can use to check. Note that for
large fastq files, these can take several minutes to compute.
### show md5sum for a particular file
md5sum fastq/Sample1/Sample1.fastq
## help is available on the function
md5sum --help
To see what would happen if a file didn’t transfer properly, we can
create a copy of Sample1.fastq
containing fewer lines. The
resulting md5sum will be different
## Use > to redirect the output
head fastq/Sample1/Sample1.fastq > fastq/Sample1/Sample1_sub.fastq
md5sum fastq/Sample1/Sample1_sub.fastq
Challenge 1
- Download the pre-computed md5sums available at
http://sbc.shef.ac.uk/data/archive_md5.txt
.
- Print the contents of this file and verify that the sum for
Sample1.fastq is correct
- The md5sum command is able to check the md5sums for a collection of
files against some pre-computed values. Use documentation (and google!)
to discover how to do this for the files we have downloaded.
Obtaining public data and About the example data
A good resource for obtaining public data is the SRA explorer.
In the Search box you can enter the accession number of a study
(usually quoted in the published paper), to obtain download links for
all the fastq files. It will even generate a bash script for you to
run.
The data for this tutorial comes from the paper, Induction of
fibroblast senescence generates a non-fibrogenic myofibroblast phenotype
that differentially impacts on cancer prognosis..
Cancer associated fibroblasts characterized by an myofibroblastic
phenotype play a major role in the tumour microenvironment, being
associated with poor prognosis. We found that this cell population is
made in part by senescent fibroblasts in vivo. As senescent fibroblasts
and myofibroblasts have been shown to share similar tumour promoting
functions in vitro we compared the transcriptosomes of these two
fibroblast types and performed RNA-seq of human foetal foreskin
fibroblasts 2 (HFFF2) treated with 2ng/ml TGF-beta-1 to induce
myofibroblast differentiation or 10Gy gamma irradiation to induce
senescence. We isolated RNA 7 days upon this treatments changing the
medium 3 days before the RNA extraction.
The paper states a particular accession number that can be
used to access the data
This ID can be used to find the data on ArrayExpress, but is also
recognised by SRA explorer (you don’t need the [All
fields] text shown in the screenshot). Gene Expression Omnibus
(GSE..) or Sequencing Read Archive
(SRP…) identifiers are also recognised.
If the identifier is recognised the names of samples belonging to the
dataset will be shown. These can be selected and added to a collection.
From the saved datasets button (top-right) the user can then access
download links.
For this particular dataset we created a download script using the
links provided by SRA explorer. To keep the run-time of the workshop
short (and to keep computing requirements down), each raw file was
subset to a more manageable amount.
The example dataset consists of nine raw (fastq files). The files are
named according to names assigned when the data were uploaded to the
public repository. We can use a sample sheet to associate each of these
database names to something more memorable, and eventually the
biological conditions being studied (beyond the scope of what we will
cover today)
ERR732901_sub |
CTR_1 |
Control |
ERR732901_sub |
TGF_1 |
TGF_Treated |
… |
… |
… |
Assessing Quality using FastQC
In real life, you won’t be assessing the quality of your reads by
visually inspecting your FASTQ files. Rather, you’ll be using a software
program to assess read quality and filter out poor quality reads. We’ll
first use a program called FastQC
to visualize the quality of our reads. Later in our workflow, we’ll use
another program to filter out poor quality reads.
FastQC has a number of features which can give you a quick impression
of any problems your data may have, so you can take these issues into
consideration before moving forward with your analyses. Rather than
looking at quality scores for each individual read, FastQC looks at
quality collectively across all reads within a sample. The image below
shows a FastQC-generated plot that indicates a very high quality
sample:
The x-axis displays the base position in the read, and the y-axis
shows quality scores. In this example, the sample contains reads that
are 40 bp long. For each position, there is a box-and-whisker plot
showing the distribution of quality scores for all reads at that
position. The horizontal red line indicates the median quality score and
the yellow box shows the 2nd to 3rd quartile range. This means that 50%
of reads have a quality score that falls within the range of the yellow
box at that position. The whiskers show the range to the 1st and 4th
quartile.
For each position in this sample, the quality values do not drop much
lower than 32. This is a high quality score. The plot background is also
color-coded to identify good (green), acceptable (yellow), and bad (red)
quality scores.
Now let’s take a look at a quality plot on the other end of the
spectrum.
Here, we see positions within the read in which the boxes span a much
wider range. Also, quality scores drop quite low into the “bad” range,
particularly on the tail end of the reads. The FastQC tool produces
several other diagnostic plots to assess sample quality, in addition to
the one plotted above.
Running FastQC
We will be working with a set of sample data that is located in the
directory (rnaseq_data
). These files have been compressed,
so have the fastq.gz
extension. FastQC is happy with either
compressed on un-compressed files.
Navigate to your FASTQ dataset:
cd /home/dcuser/rnaseq_data
To run the FastQC program, we would normally have to tell our
computer where the program is located. In this particular case, FastQC
has been installed in a location that can be accessed from all
directories.
which fastqc
/usr/bin/fastqc
This is the first program we have seen that is specific to
NGS analysis, so might not be available on all file systems by
default.
FastQC can accept multiple file names as input, so we can use the
*.fastq.gz
wildcard to run FastQC on all of the FASTQ files
in this directory.
fastqc fastq/*.fastq.gz
You will see an automatically updating output message telling you the
progress of the analysis. It will start like this:
Started analysis of ERR732901_sub.fastq.gz
Approx 100% complete for ERR732901_sub.fastq.gz
Analysis complete for ERR732901_sub.fastq.gz
In total, it should take less than 1 minute for FastQC to run on all
of our FASTQ files. When the analysis completes, your prompt will
return. So your screen will look something like this:
Started analysis of ERR732909_sub.fastq.gz
Approx 100% complete for ERR732909_sub.fastq.gz
Analysis complete for ERR732909_sub.fastq.gz
The FastQC program has created several new files within our
/data/fastq
directory.
ls fastq/
ERR732901_sub.fastqc.html ERR732901_sub.fastqc.zip ERR732901_sub.fastq.gz ERR732901_sub.fastqc.html
....
For each input FASTQ file, FastQC has created a .zip
file and a .html
file. The .zip
file extension
indicates that this is actually a compressed set of multiple output
files. We’ll be working with these output files soon. The
.html
file is a stable webpage displaying the summary
report for each of our samples.
We want to keep our data files and our results files separate, so we
will move these output files into a new directory within our
results/
directory. If this directory does not exist, we
will have to create it.
## -p flag stops a message from appearing if the directory already exists
mkdir -p qc
mv fastq/*.html qc/
mv fastq/*.zip qc/
Combining the reports with multiqc
It can be quite tiresome to click through multiple QC reports and
compare the results for different samples. It is useful to have all the
QC plots on the same page so that we can more easily spot trends in the
data.
The multiqc tool has been designed for the tasks of aggregating qc
reports and combining into a single report that is easy to digest.
multiqc
multiqc --help
Challenge 2
Use the multiqc tool to create a single QC report for the dataset.
Look at the help for the tool, and figure out how to run the tool on the
fastqc output we have just generated. Make sure that
multiqc
creates saves the report file in the
qc
folder.
The environment that we are working inside includes a version of the
Firefox web browser. We can open a particular HTML file with firefox
using the command:-
firefox qc/ERR732901_sub.fastqc.html
If this doesn’t work, there is a File Explorer tool available in the
desktop environment that you can use to navigate to, and view the
files.
Challenge 3
Discuss your results with a neighbour. - Which samples most the
highest / lowest number of reads - Which sample(s) looks the best in
terms of per base sequence quality? - Which sample(s) look the
worst?
Click
Here for next part
---
title: "Assessing Read Quality"
author: "Mark Dunning"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
    css: stylesheets/styles.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,eval=FALSE)
```

(Adapted from the Data Carpentry Genomics wrangling materials at:- https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html)

# Overview

Covered in this section

- Using the command line to download files
- Compression and de-compression of files
- Verifying that a download was completed
- Quality assessment of raw sequencing data
- Compiling multiple QC reports into a single file


# Bioinformatics workflows

When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass
through a number of  different tools in order to generate your final desired output. The execution of this set of
tools in a specified order is commonly referred to as a *workflow* or a *pipeline*. 

An example of the workflow we will be using for our RNA-seq analysis is provided below with a brief
description of each step. 

![workflow](images/workflow.png)


1. Quality control - Assessing quality using FastQC
2. Align reads to reference genome 
3. Quantification to obtain gene-level counts post-alignment clean-up
4. Differential Expression

These workflows in bioinformatics adopt a plug-and-play approach in that the output of one tool can be easily
used as input to another tool without any extensive configuration. Having standards for data formats is what 
makes this feasible. Standards ensure that data is stored in a way that is generally accepted and agreed upon 
within the community. The tools that are used to analyze data at different stages of the workflow are therefore 
built under the assumption that the data will be provided in a specific format.  

**Although we are using RNA-seq as a case study, the initial steps for other NGS technologies are highly-similar**

# Obtaining raw reads

Before we can begin the analysis, we need to make our raw sequencing data accessible within the same computing environment that we are going to use for analysis. Assuming that you will be using a computing node without a GUI, this can be a non-trivial task. For example, we cannot easily drag-and-drop the files from one location to another or click a download link. Therefore we will introduce some commands for transferring files, compressing and de-compressing and checking file integrity.

Lets create a temporary directory to test these tools:-

```{bash eval=FALSE}
cd
mkdir tmp
cd tmp
```


## Use `wget` to download from a remote location

If your data are hosted on a website or FTP site, you can use the `wget` command to download to the current working directory. A small `fastq` file is provided on the Sheffield Bioinformatics Core website as an example. The output updates you on the progress of the download.

Note that if downloading from an FTP site you might be prompted for a username and password.

```{bash eval=FALSE}
wget http://sbc.shef.ac.uk/data/example_reads.fastq
ls -lh
```

The contents of the file can be printed to the screen using the `head` command you should have learnt about previously

```{bash eval=FALSE}
head -n 4 example_reads.fastq
```

## Details on the FASTQ format

Although it looks complicated (and it is), we can understand the
[fastq](https://en.wikipedia.org/wiki/FASTQ_format) format with a little decoding. Some rules about the format
include...

|Line|Description|
|----|-----------|
|1|Always begins with '@' and then information about the read|
|2|The actual DNA sequence|
|3|Always begins with a '+' and sometimes the same info in line 1|
|4|Has a string of characters which represent the quality scores; must have same number of characters as line 2|


```
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
```


Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the
probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call
accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality
score, the numerical score is converted into a code where each individual character
represents the numerical quality score for an individual nucleotide. For example, in the line
above, the quality score line is:

```
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
```

The numerical value assigned to each of these characters depends on the
sequencing platform that generated the reads. The sequencing machine used to generate our data
uses the standard Sanger quality PHRED score encoding, using Illumina version 1.8 onwards.
Each character is assigned a quality score between 0 and 41 as shown in
the chart below.

```
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
                   |         |         |         |         |
Quality score:    01........11........21........31........41
```

Each quality score represents the probability that the corresponding nucleotide call is
incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a
base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%.
These probability values are the results from the base calling algorithm and depend on how
much signal was captured for the base incorporation.


```
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
```



Since this file is so small, we needn't worry about the file size. However, in a real life experiment our raw sequencing reads can be many 10s of Gb in size. It is common practice to compress the file so that it takes less space on disk. The standard way of achieving this in Unix systems is to use the `gzip` command. A side-effect of the compression is that the resulting file can no longer be printed with `cat`. However, a `zcat` command can be used to the same effect.

```{bash eval=FALSE}
##example of compressing the file
gzip example_reads.fastq
## list the directory to see what changed...
ls -lh
## Now have to use zcat to print...
zcat example_reads.fastq | head -n4
```


## Download an archived directory

Another popular technique for compressing and transfering a collection of files is to organise them into a single file (or *tarball*), which can then be compressed. Such a file is also available on the Sheffield Bioinformatics Core website for demonstration purposes. The `wget` command can be used as before.

```{bash eval=FALSE}
wget http://sbc.shef.ac.uk/data/archive.tar.gz
ls
```

The `tar` command with the options `zxvf` (look these up if you want to understand more) can be used to create a directory structure containing all the files.

```{bash eval=FALSE}
tar zxvf archive.tar.gz
ls
```



### Checking the download worked

Having downloaded all your files, you should probably check they have been downloaded successfully and been corrupted in transit. Partial or incomplete files can be a source of errors later on in the analysis. If you are downloading many files, it can be difficult to spot if a file failed to download. A simple way is to check the file sizes, but this can be unreliable.

A common way of doing this is to generate an *md5sum* from the file. You can think of this as a unique digital barcode of a text file. The same file will always give the same md5sum. Sequencing vendors will often calculate these files on their servers and send the results to you in the form of a text file that you can use to check. Note that for large fastq files, these can take several minutes to compute.

```{bash eval=FALSE}
### show md5sum for a particular file
md5sum fastq/Sample1/Sample1.fastq
## help is available on the function
md5sum --help
```

To see what would happen if a file didn't transfer properly, we can create a copy of `Sample1.fastq` containing fewer lines. The resulting md5sum will be different

```{bash}
## Use > to redirect the output
head fastq/Sample1/Sample1.fastq > fastq/Sample1/Sample1_sub.fastq
md5sum fastq/Sample1/Sample1_sub.fastq
```

## Challenge 1 {.challenge}

<div class="exercise">
 1. Download the pre-computed md5sums available at `http://sbc.shef.ac.uk/data/archive_md5.txt`. 
 2. Print the contents of this file and verify that the sum for Sample1.fastq is correct
 3. The md5sum command is able to check the md5sums for a collection of files against some pre-computed values. Use documentation (and google!) to discover how to do this for the files we have downloaded.
</div>




##  Obtaining public data and About the example data

A good resource for obtaining public data is the SRA explorer.

- [https://sra-explorer.info/](https://sra-explorer.info/)

In the Search box you can enter the accession number of a study (usually quoted in the published paper), to obtain download links for all the fastq files. It will even generate a bash script for you to run. 

The data for this tutorial comes from the paper, [*Induction of fibroblast senescence generates a non-fibrogenic myofibroblast phenotype that differentially impacts on cancer prognosis.*](http://europepmc.org/article/MED/27992856). 

> Cancer associated fibroblasts characterized by an myofibroblastic phenotype play a major role in the tumour microenvironment, being associated with poor prognosis. We found that this cell population is made in part by senescent fibroblasts in vivo. As senescent fibroblasts and myofibroblasts have been shown to share similar tumour promoting functions in vitro we compared the transcriptosomes of these two fibroblast types and performed RNA-seq of human foetal foreskin fibroblasts 2 (HFFF2) treated with 2ng/ml TGF-beta-1 to induce myofibroblast differentiation or 10Gy gamma irradiation to induce senescence. We isolated RNA 7 days upon this treatments changing the medium 3 days before the RNA extraction.


The paper states a particular *accession number* that can be used to access the data

- **E-MTAB-3101**

This ID can be used to find the data on ArrayExpress, but is also recognised by SRA explorer (you don't need the **[All fields]** text shown in the screenshot). Gene Expression Omnibus (**GSE..**) or Sequencing Read Archive (**SRP...**) identifiers are also recognised.

![](images/sra_explorer.PNG)

If the identifier is recognised the names of samples belonging to the dataset will be shown. These can be selected and added to a collection. From the saved datasets button (top-right) the user can then access download links.

For this particular dataset we created a download script using the links provided by SRA explorer. To keep the run-time of the workshop short (and to keep computing requirements down), each raw file was subset to a more manageable amount.

<div class="information">
The code used to download the example dataset can be found in `/home/dcuser/rnaseq_data/scripts/download_subset.sh`. You do not need to run this as part of the workshop - it is just for information, or if you wanted to repeat the workshop in your own environment.
</div>

The example dataset consists of nine raw (fastq files). The files are named according to names assigned when the data were uploaded to the public repository. We can use a sample sheet to associate each of these database names to something more memorable, and eventually the biological conditions being studied (beyond the scope of what we will cover today)

| fastq     | Sample | Condition   |
|-----------|--------|-------------|
| ERR732901_sub | CTR_1  | Control     |
| ERR732901_sub | TGF_1  | TGF_Treated |
| ...       | ...    | ...         |


# Assessing Quality using FastQC

In real life, you won't be assessing the quality of your reads by visually inspecting your 
FASTQ files. Rather, you'll be using a software program to assess read quality and 
filter out poor quality reads. We'll first use a program called [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to visualize the quality of our reads. 
Later in our workflow, we'll use another program to filter out poor quality reads. 

FastQC has a number of features which can give you a  quick impression of any problems your
data may have, so you can take these issues into consideration before moving forward with your
analyses. Rather than looking at quality scores for each individual read, FastQC looks at
quality collectively across all reads within a sample. The image below shows a FastQC-generated plot that indicates
a very high quality sample:

![good_quality](images/good_quality1.8.png)

The x-axis displays the base position in the read, and the y-axis shows quality scores. In this
example, the sample contains reads that are 40 bp long. For each position, there is a 
box-and-whisker plot showing the distribution of quality scores for all reads at that position.
The horizontal red line indicates the median quality score and the yellow box shows the 2nd to
3rd quartile range. This means that 50% of reads have a quality score that falls within the
range of the yellow box at that position. The whiskers show the range to the 1st and 4th 
quartile.

For each position in this sample, the quality values do not drop much lower than 32. This 
is a high quality score. The plot background is also color-coded to identify good (green),
acceptable (yellow), and bad (red) quality scores.

Now let's take a look at a quality plot on the other end of the spectrum. 

![bad_quality](images/bad_quality1.8.png)

Here, we see positions within the read in which the boxes span a much wider range. Also, quality scores drop quite low into the "bad" range, particularly on the tail end of the reads. The FastQC tool produces several other diagnostic plots to assess sample quality, in addition to the one plotted above. 

## Running FastQC  

We will be working with a set of sample data that is located in the directory (`rnaseq_data`). These files have been compressed, so have the `fastq.gz` extension. FastQC is happy with either compressed on un-compressed files.


Navigate to your FASTQ dataset: 

```{bash eval=FALSE}
cd /home/dcuser/rnaseq_data
```



To run the FastQC program, we would normally have to tell our computer where the program is located. In this particular case, FastQC has been installed in a location that can be accessed from all directories. 

```{bash eval=FALSE}
which fastqc
```

```
/usr/bin/fastqc
```

**This is the first program we have seen that is specific to NGS analysis, so might not be available on all file systems by default.**

FastQC can accept multiple file names as input, so we can use the `*.fastq.gz` wildcard to run FastQC on all of the FASTQ files in this directory.

```{bash eval=FALSE}
fastqc fastq/*.fastq.gz
```


You will see an automatically updating output message telling you the 
progress of the analysis. It will start like this: 


```
Started analysis of ERR732901_sub.fastq.gz
Approx 100% complete for ERR732901_sub.fastq.gz
Analysis complete for ERR732901_sub.fastq.gz

```

In total, it should take less than 1 minute for FastQC to run on all
of our FASTQ files. When the analysis completes, your prompt
will return. So your screen will look something like this:

```
Started analysis of ERR732909_sub.fastq.gz
Approx 100% complete for ERR732909_sub.fastq.gz
Analysis complete for ERR732909_sub.fastq.gz

```

The FastQC program has created several new files within our
`/data/fastq` directory. 

```{bash eval=FALSE}
ls fastq/
```


```
ERR732901_sub.fastqc.html    ERR732901_sub.fastqc.zip     ERR732901_sub.fastq.gz     ERR732901_sub.fastqc.html     
....
```


For each input FASTQ file, FastQC has created a `.zip` file and a
`.html` file. The `.zip` file extension indicates that this is 
actually a compressed set of multiple output files. We'll be working
with these output files soon. The `.html` file is a stable webpage
displaying the summary report for each of our samples.

We want to keep our data files and our results files separate, so we
will move these output files into a new directory within our `results/` directory. If this directory does not exist, we will have to create it.

```{bash eval=FALSE}
## -p flag stops a message from appearing if the directory already exists
mkdir -p qc
mv fastq/*.html qc/
mv fastq/*.zip qc/
```


## Combining the reports with `multiqc`

It can be quite tiresome to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.

The multiqc tool has been designed for the tasks of aggregating qc reports and combining into a single report that is easy to digest.

```{bash eval=FALSE}
multiqc
multiqc --help
```


## Challenge 2 {.challenge}

<div class="exercise"> 
Use the multiqc tool to create a single QC report for the dataset.
Look at the help for the tool, and figure out how to run the tool on the fastqc output we have just generated. Make sure that `multiqc` creates saves the report file in the `qc` folder.
</div>

The environment that we are working inside includes a version of the Firefox web browser. We can open a particular HTML file with firefox using the command:-

```
firefox qc/ERR732901_sub.fastqc.html
```

If this doesn't work, there is a File Explorer tool available in the desktop environment that you can use to navigate to, and view the files.

## Challenge 3 {.challenge}
<div class="exercise"> 
Discuss your results with a neighbour. 
- Which samples most the highest / lowest number of reads
- Which sample(s) looks the best in terms of per base sequence quality? - Which sample(s) look the worst?
</div>

<a href="part3.nb.html" style="font-size: 50px; text-decoration: none">Click Here for next part</a>