Hands-on NGS Analysis in Galaxy and IGV

Sheffield Bioinformatics Core

web : sbc.shef.ac.uk
twitter: SheffBioinfCore
email: bioinformatics-core@sheffield.ac.uk


Tutorial Overview

This tutorial will cover the basics of NGS analysis using Galaxy; a open-source web-based platform for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical NGS analysis and be comfortable with the outputs generated by a sequencing service.You will also use the Integrative Genomics Viewer (IGV) to view the aligned reads in an interactive manner.

Parts of this tutorial are based on the NGS tutorial from the Galaxy Project and the IGV tutorial from the Griffin lab.

The tutorial covers the following steps in our analysis pipeline


Background

Where do the data in this tutorial come from?

The data for this tutorial are publicly-available exome sequencing data downsampled to the BRCA2 region from a fictitious patient. We will use these data throughout the course to call variants, filter and discuss the clinical impact of any mutations.

Section 1: Preparation

Ignore if you have already created a Galaxy account and uploaded the example data in a previous exercise

1. Register as a new user on one of the public Galaxy servers

Make sure you check your email to activate your account

2. Import the data for the workshop.

We can going to import the fastq files for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores. However, as we will see, the representation of the quality scores has changed over time.

You can import the data by:

  1. In the tool panel located on the left, under Basic Tools select Get Data > Upload File. Click on the Paste/Fetch data button on the bottom section of the pop-up window.
  2. Upload the sequence data by selecting the files JoeBlogsBRCAPanel_R1.fastq and JoeBlogsBRCAPanel_R2.fastq. You don’t need to specify the file type or genome build. Galaxy should be able to make a reasonable guess.

  3. You should now have these 2 files in your history:
    • JoeBlogsBRCAPanel_R1.fastq
    • JoeBlogsBRCAPanel_R2.fastq

Section 2: Fastq file format

You can view the files you just uploaded by clicking the eye icon the history item. The first few lines should read as follows

JoeBlogsBRCAPanel_R1.fastq

@HWI-D00461:188:HVGY2BCXY:1:1101:1363:84148/1
TGTGTCATTTCTATTATCTTTGGAACAACCATGAATTAGTCCCTTGGGGTTTTCAAATGCTGCACACTGACTCACACATTTATTTGGTTCTGTTTTTGCCTTCCCTNN
+
DDDDDIHIIIIIHIIIIIIIIIIHIIIIIIIIIHIIIIIIHHIIIIIIIHIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIDHHIHC#

JoeBlogsBRCAPanel_R2.fastq

@HWI-D00461:188:HVGY2BCXY:1:1101:1363:84148/2
TGGAAAGACTTTTGGGGGGGGGAGTATTTTTCTTGTTTCTGGTTTTGGTTTTTTTGATCCGGGAAAGATTTTGTTTTTTGGAGGTTGGACTTTTGGGGAGGGGAAAAN
+
<00<<1111<1<11/0/<///</<0111DF11<<11111<11111<1/1<1D1///<<11<<</<1111<<11<DD1<//<110<11/11<0<0</0<0<<-//<<FE

The first line is the unique identifier for each sequenced read. It can be used to encode information such as the sequencing machine, flow cell and lane that the read was generated from and the physical coordinates on the lane.

The quality scores are ASCII representations of how confident we are that a particular base has been called correctly. Letters that are further along the alphabet indicate higher confidence. This is important when trying to identify types of genome variation such as single base changes, but is also indicative of the overall quality of the sequencing. Different scales have been employed over time (resulting in a different set of characters appearing in the file).

Deriving the Quality Score

First of all, we convert the base-calling probability (p) into a Q score using the formula

Quality Scores to probabilities

  • look-up the ASCII code for each character
  • subtract the offset to get the Q score
  • convert to a probability using the formula:-

\[ p = 10^{-Q/10} \]

Let’s see this calculation for the first few bases of the first read in JoeBlogsBRCAPanel_R1.fastq; DDDDDIHI....

Character Code Minus 33 Offset Probability
D 68 35 0.0003162278
D 68 35 0.0003162278
D 68 35 0.0003162278
D 68 35 0.0003162278
D 68 35 0.0003162278
I 73 40 0.0001000000
H 72 39 0.0001258925
I 73 40 0.0001000000

In practice, we don’t have to convert the values as we have software that will do this automatically


Section 3: Quality assessment with FastQC

FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user’s attention to possible issues.

  • From the left hand tool panel in Galaxy, under GENOMIC FILE MANIPULATION, select FASTQ Quality Control -> FastQC
  • Select one of the FASTQ files as input and Execute the tool.
  • When the tool finishes running, you should have an HTML file in your History. Click on the eye icon to view the various quality metrics.

The most important image is whether the base quality decreases significantly over the length of the read

Good quality data should look something like:-

Look at the generated FastQC metrics for your uploaded fastq files. This data looks pretty good - high per-base quality scores (most above 30).

All is not lost if we observe poor quality bases towards the end of the read. There are a number of trimming options that we can use for NGS data and some of these are available through Galaxy. Check out the Trimmming Reads section of the Galaxy NGS tutorial if you are interested in how we can trim our reads.

It is also worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq) and the organism being sequenced, so some warnings might be expected due to the nature of your experiment. For instance, there are known sequencing composition biases that can occur at the beginning of RNA-seq reads.

Aggregating QC reports with multiqc

For datasets with large numbers of fastq files, it may be useful to aggregate the individual reports into a single combined report.

  • Under GENOMIC FILE MANIPULATION, select FASTQ Quality Control -> MultiQC
  • Make sure Software name is set to FastQC
  • In Results file, select the RawData results files that you have just generated

You should then be able to view the fastqc plots for both the fastq files on the same page.

Section 4: Alignment

We don’t really spend much time look at fastq files, as most of our time is spent with aligned reads. i.e. we have used some software to tell us whereabouts in the genome each read belongs to. This will usually be performed for you as part of a sequencing service, but it is good to get an appreciation of the steps involved.

In this section we map the reads in our FASTQ files to a reference genome.

A plethora of different tools have been written to perform this task, and we will not describe it in detail. Links to some key publications are given below:-

Alignment relies on the reference genome being indexed so that the sequencing reads can be located more efficiently. The genome index is a highly-accessible data structure, and Galaxy includes indices for many popular genomes.

1.Align the example files

  • Find the tool GENOMICS ANALYSIS -> Mapping -> Bowtie2
  • alternatively, type bowtie in the search box
  • In Is this single-end or Paired-end? Select Paired-end
  • Set FastQ file #1 and FastQ file #2 to the two fastq files you uploaded in the previous step
  • Make sure the reference genome is set to Human (Homo sapiens):hg19
  • Press Execute
  • Wait!

The result will be a .bam file that we will describe in the next section. This file is not human-readable, as it is compressed. But we can convert to a readable format for illustration purposes.

2. View the alignments

  1. Click on the eye of the resulting file to view the alignments.

About the bam file format

Unlike most of Bioinformatics, a single standard file format has emerged for aligned reads. Moreover, this file format is consistent regardless of whether you have DNA-seq, RNA-seq, ChIP-seq… data.

The bam file is a compressed, binary, version of a sam file.

The .sam file

  • Sequence Alignment/Map (sam)
  • The output from an aligner such as bwa
  • Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
  • May contain un-mapped reads
  • Potentially large size on disk; ~100s of Gb
    • Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less….
  • Official specification can be obtained online: -
  • We normally work on a compressed version called a .bam file. See later.
  • Header lines starting with an @ character, followed by tab-delimited lines
    • Header gives information about the alignment and references sequences used

The first part of the header lists the names (SN) of the sequences (chromosomes) used in alignment, their length (LN) and a md5sumdigital fingerprint” of the .fasta file used for alignment (M5).


@HD VN:1.0 SO:coordinate
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
.....
.....

We also have a section where we can record the processing steps used to derive the file

@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.1 CL:"/jetstream/scratch0/main/conda/envs/mulled-v1-65d5efe4f1b69ab7166d1a5a5616adebe902133ea3e4c189d87d7de2e21ddc17/bin/bowtie2-align-s --wrapper basic-0 -p 10 -x /cvmfs/data.galaxyproject.org/byhand/hg19/hg19full/bowtie2_index/hg19full -1 input_f.fastq -2 input_r.fastq"
....
....

Next is a tab-delimited section that describes the alignment of each sequence in detail.

HWI-D00461:188:HVGY2BCXY:1:1109:3430:66266  163 13  32889826    42  108M    =   32889969    251 TTGGGACGAGCGCGTCTTCCGTAGTCCCAGTCCAGCGTGGCGGGGGAGCGCCTCACGCCCCGGGTCGCTGCCGCGGCTTCTTGCCCTTTTGTCTCTGCCAACCCCCAC    0D@@?GEHCHHCEHIDHH?1CCHCHI@1<CCCFC@GCCCEHIHCHICHC?HH=GHE1DE<CEHDEHHC<CCH/?HHG/<1<D@11D?G?FGHEHH01D00D;00<DH<    AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:21C86 YS:i:-5 YT:Z:CP
Column Official Name Brief
1 QNAME Sequence ID
2 FLAG Sequence quality expressed as a bitwise flag
3 RNAME Chromosome
4 POS Start Position
5 MAPQ Mapping Quality
6 CIGAR Describes positions of matches, insertions, deletions w.r.t reference
7 RNEXT Ref. name of mate / next read
8 PNEXT Position of mate / next read
9 TLEN Observed Template length
10 SEQ Sequence
11 QUAL Base Qualities

There can also be all manner of optional tags as extra columns introduce by an aligner or downstream analysis tool. A common use is the RG tag which refers back to the read groups in the header.

Sorting and indexing

You will notice from the 3rd column that the reads are ordered according to their start position; whereas the reads in the fastq file were arranged in order that they were generated on the flow cell. By default, bowtie2 produces a bam where the reads are in the same order as the fastq. However, this is rather inconvenient for analysis where we require reads from the same location to be next to each other in the file.

An additional couple of steps have been performed after bowtie2; sorting the file according to genome position and producing an index file. The index file does not provide any useful information for us and cannot be viewed in Galaxy. However, we will need it later on when viewing the data in IGV.

Quality Flags

The “flags” in the sam file can represent useful QC information

  • Read is unmapped
  • Read is paired / unpaired
  • Read failed QC
  • Read is a PCR duplicate (see later)

The combination of any of these properties is used to derive a numeric value

For instance, a particular read has a flag of 163

Derivation

There is a set of properties that a read can possess. If a particular property is observed, a corresponding power of 2 is added multiplied by 1. The final value is derived by summing all the powers of 2.

Flag Value Meaning
69 (= 1 + 4 + 64) The read is paired, is the first read in the pair, and is unmapped.
77 (= 1 + 4 + 8 + 64) The read is paired, is the first read in the pair, both are unmapped.
83 (= 1 + 2 + 16 + 64) The read is paired, mapped in a proper pair, is the first read in the pair, and it is mapped to the reverse strand.
99 (= 1 + 2 + 32 + 64) The read is paired, mapped in a proper pair, is the first read in the pair, and its mate is mapped to the reverse strand.
133 (= 1 + 4 + 128) The read is paired, is the second read in the pair, and it is unmapped.
137 (= 1 + 8 + 128) The read is paired, is the second read in the pair, and it is mapped while its mate is not.
141 (= 1 + 4 + 8 + 128) The read is paired, is the second read in the pair, but both are unmapped.
147 (= 1 + 2 + 16 + 128) The read is paired, mapped in a proper pair, is the second read in the pair, and mapped to the reverse strand.
163 (= 1 + 2 + 32 + 128) The read is paired, mapped in a proper pair, is the second read in the pair, and its mate is mapped to the reverse strand.

See also

Have a CIGAR!

The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string is a way of encoding the match between a given sequence and the position it has been assigned in the genome. It is comprised by a series of letters and numbers to indicate how many consecutive bases have that mapping.

Code Description
M alignment match
I insertion
D deletion
N skipped
S soft-clipping
H hard-clipping

e.g.

  • 68M
    • 68 bases matching the reference
  • 1S67M
    • 1 soft-clipped read followed by 67 matches
  • 15M87N70M90N16M
    • 15 matches following by 87 bases skipped followed by 70 matches etc.

3. Check the alignment stats

We will now generate a few basic statistics about the alignment of our data

  1. Select the tool NGS: SAMtools -> Flagstat
  2. In the BAM File to Convert box choose the bam file produced by bowtie2.

The tool will also report how many PCR Duplicates have been found in the data. But as we haven’t yet run any software to identify such reads, the flagstat output will show 0 reads.

  1. Find the tool SAMtools -> Idxstats
  2. In the BAM file dropdown select the bam file produced by bowtie2

The output of this tool will tell you how many reads aligned to each chromosome in your reference genome. For our example dataset we only have reads from a small region, so shouldn’t expect to see alignments to each chromosome.

About Duplicates

The preparation of a sequencing library requires PCR amplification of your starting material. This can lead to some DNA fragments being over-represented in your data. As our DNA fragments are formed in a random process, and relatively small compared to the number of bases to be sequenced from the genome (3Gb in humans), we tend to think the two DNA fragments that have identical starting and ending position are unlikely to have occurred due to chance. Some software, such as Picard will identify such artefacts and mark them for attention by downstream methods. i.e. they are not completely discarded from the analysis.

4. Mark Duplicates with Picard

  1. Use the tool GENOMICS TOOLKITS -> Picard -> MarkDuplicates
  2. In Select SAM/BAM dataset or dataset collection choose the bam file produced by bowtie2.
  3. What do you notice about the flag values for any reads that have the same start as another read?
  4. Interpret the meaning of these flags using the online tool

Warning the assumption about reads having the same start location being PCR duplicates falls down when we do sequencing for a very specific region of the genome. e.g. targeted sequencing from a panel of cancer genes. Running a tool to mark PCR duplicates on such data would recommend a high proportion of reads be ignored from further analysis.

5. (Optional) Re-run the alignment statistics

  1. Select the tool GENOMIC FILE MANIPULATION -> SAM/BAM -> Samtools flagstat
  2. In the BAM File to Convert box choose the bam file produced by the mark duplicates step

6. Download your bam file

For the next step you will need to download the bam file that you produced after marking duplicates. To do this, you can click the floppy disk icon.

Make sure that you click both Download dataset and Download bam_index

Section 5. Visualising the aligned reads with IGV

Whilst Bioinformatics tools are very powerful and allow you to perform statistical analyses and test hypotheses, there is no substitute for looking at and exploring the data. A trained-eye can quite quickly get a sense of the data quality before any computational analyses have been run. Furthermore, as the person requesting the sequencing, you probably know a lot about the biological context of the samples and what to expect.

We will load the aligned reads that we have just created into IGV and start to get a feel for the process of variant calling

  1. Download IGV

You can download IGV for Windows using this link

Extract the zip file that you have downloaded

Double-Click the file igv to launch IGV

If you are unable to download IGV, you should be able to run a web-app from the Broad Institute website with the same functionality.

https://igv.org/app/

  1. Load a reference Genome and some Data Tracks

By default, IGV should load with Human genome version hg19 already loaded. It is essential that you use the same genome version that the reads were aligned to. You can check / change the genome by clicking the drop-down menu in the upper-left

We can also load extra tracks into the browser that can help us understand our variant calls. We can load data from dbSNP which will tell us about common mutations that already been identified. These can be loaded via File -> Load from Server.. and selecting dbSNP 1.4.7 from the Variation and Repeats section

  1. Navigate around IGV

When IGV loads up we start with a very high-level view of the genome where all chromosomes are visible. Such a view might be useful if we were interested in large copy-number variation across a cohort of samples. However, we are mostly interested in changes at the individual base-level which is not possible to view at this resolution. We need to navigate to a particular region of interest.

Alongside the drop-down menu used to change the genome, there is a drop-down menu to select a particular chromosome and a box where we can enter text. Inside the text box we can enter a particular genome interval of interest

e.g. chr1:10,000-11,000

IGV should now be displaying a region on chromosome 1 from base position 10,000 to 11,000.

At this resolution, we can start to see the genome sequence. Each DNA base is represented by a different colour (A = green, C = blue) and it seems that this region is highly-repetitive; comprising mostly C bases. We can zoom further in using the zoom control in the top right of IGV

We can use the zoom control, and also move along the genome by holding down by mouse button and sliding left-and-right, to navigate to whatever genomic location we want. If we known the name of the gene we want to interrogate we can navigate directly there using the text box

The text box should now update to show the coordinates of BRCA1 (chr17:....). At the bottom of the screen we can now see the gene model for the gene BRCA1. Genes are represented as lines and boxes. Lines represent intronic regions and boxes represent exonic regions. The arrows indicate the direction / strand of transcription for the gene.

  1. Load the aligned reads

Choose File > Load from File…, select the bam file that you downloaded from Galaxy, and click OK. Note that the bam and index files must be in the same directory for IGV to load these properly.

The main display of IGV should now update to hold tracks for the aligned reads from this bam file. It may seem like nothing is being displayed. This is because we are zoomed-out too far to be able to see the reads. Use the zoom function and move along the genome until you get to the first exon of BRCA1. After you have zoomed-in far enough you will start to see some grey rectangles. These are the aligned reads. Note that some downsampling may occur meaning that not all reads are displayed to reduce memory requirements.

If you hover over a particular read, how will see columns from the bam file being displayed such as the mapping quality and information about the paired reads.

Coloured-letters within the read indicate bases that are different to the reference genome. The entire read may be coloured differently to grey, which can indicate different things depending on how the display has been configured. For example, it can highlight paired-reads with an insert size different to that expected (see here). The display of aligned reads can be configured through the menus, as described here.

---
title: ""
author: "Mark Dunning"
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    df_print: paged
    toc: yes
editor_options:
  chunk_output_type: inline
---



# Hands-on NGS Analysis in Galaxy and IGV

### Sheffield Bioinformatics Core
<img src="media/logo-sm.png" align=right>

web : [sbc.shef.ac.uk](http://sbc.shef.ac.uk)  
twitter: [SheffBioinfCore](https://twitter.com/SheffBioinfCore)  
email: [bioinformatics-core@sheffield.ac.uk](bioinformatics-core@sheffield.ac.uk)

-----

# Tutorial Overview

This tutorial will cover the basics of NGS analysis using Galaxy; a open-source web-based platform for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical NGS analysis and be comfortable with the outputs generated by a sequencing service.You will also use the Integrative Genomics Viewer (IGV) to *view* the aligned reads in an interactive manner.

Parts of this tutorial are based on the NGS tutorial from the Galaxy Project and the IGV tutorial from the Griffin lab.

- [Galaxy NGS Tutorial](https://galaxyproject.org/tutorials/ngs/)
- [IGV Tutorial](https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial)

The tutorial covers the following steps in our analysis pipeline

![](media/pipeline.png)

-----

## Background

#### Where do the data in this tutorial come from?
The data for this tutorial are publicly-available exome sequencing data *downsampled* to the BRCA2 region from a fictitious patient. We will use these data throughout the course to call variants, filter and discuss the clinical impact of any mutations. 

# Section 1: Preparation 

**Ignore if you have already created a Galaxy account and uploaded the example data in a previous exercise**

#### 1.  Register as a new user on one of the public Galaxy servers

- https://usegalaxy.org/
- https://usegalaxy.org.au
- https://usegalaxy.eu

**Make sure you check your email to activate your account**

#### 2.  Import the data for the workshop.

We can going to import the [*fastq* files](https://en.wikipedia.org/wiki/FASTQ_format) for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores. However, as we will see, the representation of the quality scores has changed over time.

You can import the data by:

1.  In the tool panel located on the left, under Basic Tools select **Get
    Data > Upload File**. Click on the **Paste/Fetch data** button on the
    bottom section of the pop-up window.
2.  Upload the sequence data by selecting the files `JoeBlogsBRCAPanel_R1.fastq` and `JoeBlogsBRCAPanel_R2.fastq`. You don't need to specify the file type or genome build. Galaxy should be able to make a reasonable guess.


3.  You should now have these 2 files in your history:
    - `JoeBlogsBRCAPanel_R1.fastq`
    - `JoeBlogsBRCAPanel_R2.fastq`



# Section 2: Fastq file format

You can view the files you just uploaded by clicking the **eye icon** the history item. The first few lines should read as follows


**JoeBlogsBRCAPanel_R1.fastq**

```
@HWI-D00461:188:HVGY2BCXY:1:1101:1363:84148/1
TGTGTCATTTCTATTATCTTTGGAACAACCATGAATTAGTCCCTTGGGGTTTTCAAATGCTGCACACTGACTCACACATTTATTTGGTTCTGTTTTTGCCTTCCCTNN
+
DDDDDIHIIIIIHIIIIIIIIIIHIIIIIIIIIHIIIIIIHHIIIIIIIHIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIDHHIHC#
```

**JoeBlogsBRCAPanel_R2.fastq**

```
@HWI-D00461:188:HVGY2BCXY:1:1101:1363:84148/2
TGGAAAGACTTTTGGGGGGGGGAGTATTTTTCTTGTTTCTGGTTTTGGTTTTTTTGATCCGGGAAAGATTTTGTTTTTTGGAGGTTGGACTTTTGGGGAGGGGAAAAN
+
<00<<1111<1<11/0/<///</<0111DF11<<11111<11111<1/1<1D1///<<11<<</<1111<<11<DD1<//<110<11/11<0<0</0<0<<-//<<FE
```


The first line is the unique identifier for each sequenced read. It can be used to encode information such as the *sequencing machine*, *flow cell* and *lane* that the read was generated from and the physical coordinates on the lane.  

The quality scores are [ASCII](http://ascii-code.com/) representations of how confident we are that a particular base has been called correctly. Letters that are further along the alphabet indicate higher confidence. This is important when trying to identify types of genome variation such as single base changes, but is also indicative of the overall quality of the sequencing. Different scales have been employed over time (resulting in a different set of characters appearing in the file). 


### Deriving the Quality Score

First of all, we convert the base-calling probability (p) into a `Q` score using the formula

- Quality scores $$ Q = -10log_{10}p$$
    + Q = 30, p=0.001
    + Q = 20, p=0.01
    + Q = 10, p=0.1
- These numeric quantities are *encoded* as [**ASCII**](http://ascii-code.com/) code
    + At least 33 to get to meaningful characters
    (https://en.wikipedia.org/wiki/FASTQ_format)
    
![](media/phred.png)      

### Quality Scores to probabilities

- look-up the ASCII code for each character
- subtract the offset to get the Q score
- convert to a probability using the formula:-

$$ p = 10^{-Q/10} $$

Let's see this calculation for the first few bases of the first read in `JoeBlogsBRCAPanel_R1.fastq`; `DDDDDIHI....`

Character  | Code | Minus 33 Offset | Probability
------------- | ------------- | ------------- | -------------
D  | 68 | 35 | 0.0003162278
D  | 68 | 35 | 0.0003162278
D  | 68 | 35 | 0.0003162278
D  | 68 | 35 | 0.0003162278
D  | 68 | 35 | 0.0003162278
I  | 73 | 40 | 0.0001000000
H  | 72 | 39 | 0.0001258925
I  | 73 | 40 | 0.0001000000

In practice, we don't have to convert the values as we have software that will do this automatically

-----

# Section 3: Quality assessment with FastQC

[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a popular tool from [Babraham Institute Bioinformatics Group](https://www.bioinformatics.babraham.ac.uk/index.html) used for *quality assessment* of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The [documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/) for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user's attention to possible issues. 

- From the left hand tool panel in Galaxy, under *GENOMIC FILE MANIPULATION*, select *FASTQ Quality Control -> FastQC*
- Select one of the FASTQ files as input and *Execute* the tool.
- When the tool finishes running, you should have an HTML file in your History. Click on the eye icon to view the various quality metrics.

The most important image is whether the base quality decreases significantly over the length of the read

![](media/fastqc-2a.png)

Good quality data should look something like:-

![](media/fastqc-2b.png)



Look at the generated FastQC metrics for your uploaded fastq files. This data looks pretty good - high per-base quality scores (most above 30).

All is not lost if we observe poor quality bases towards the end of the read. There are a number of *trimming* options that we can use for NGS data and some of these are available through Galaxy. Check out the [Trimmming Reads](https://galaxyproject.org/tutorials/ngs/#trimming-reads) section of the Galaxy NGS tutorial if you are interested in how we can trim our reads.

It is also worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq) and the organism being sequenced, so some warnings might be expected due to the nature of your experiment. For instance, there are known sequencing composition biases that can occur at the beginning of RNA-seq reads. 

## Aggregating QC reports with multiqc

For datasets with large numbers of fastq files, it may be useful to aggregate the individual reports into a single combined report. 

- Under *GENOMIC FILE MANIPULATION*, select *FASTQ Quality Control -> MultiQC*
- Make sure *Software name* is set to `FastQC`
- In *Results file*, select the **RawData** results files that you have just generated

You should then be able to view the fastqc plots for both the fastq files on the same page.

# Section 4: Alignment

We don't really spend much time look at *fastq* files, as most of our time is spent with *aligned* reads. i.e. we have used some software to tell us whereabouts in the genome each read belongs to. This will *usually* be performed for you as part of a sequencing service, but it is good to get an appreciation of the steps involved.

In this section we map the reads in our FASTQ files to a reference genome. 

A plethora of different tools have been written to perform this task, and we will not describe it in detail. Links to some key publications are given below:-

  + 2009 Bowtie 1 - [Langmead et al](http://genomebiology.com/content/10/3/R25)
  + 2012 Bowtie 2 - [Langmead and Salzberg](http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.htm)
  + 2009 BWA - [Li and Durbin](http://bioinformatics.oxfordjournals.org/content/25/14/1754.long)
  + 2010 BWA - [Li and Durbin](http://bioinformatics.oxfordjournals.org/content/26/5/589)
  + 2013 BWA-MEM - [Li](http://arxiv.org/abs/1303.3997)

Alignment relies on the reference genome being *indexed* so that the sequencing reads can be located more efficiently. The genome index is a highly-accessible data structure, and Galaxy includes indices for many popular genomes. 

#### 1.Align the example files  

![](media/bowtie2-tool.png)

- Find the tool *GENOMICS ANALYSIS* -> *Mapping* -> *Bowtie2*
  + alternatively, type `bowtie` in the search box
- In *Is this single-end or Paired-end?* Select **Paired-end**
- Set *FastQ file #1* and *FastQ file #2* to the two fastq files you uploaded in the previous step
- Make sure the reference genome is set to **Human (Homo sapiens):hg19**
- Press *Execute*
- Wait!

The result will be a `.bam` file that we will describe in the next section. This file is not human-readable, as it is compressed. But we can convert to a readable format for illustration purposes.

#### 2. View the alignments

1.  Click on the eye of the resulting file to view the alignments.


![](media/bam-alignments.png)

### About the `bam` file format

Unlike most of Bioinformatics, a *single standard* file format has emerged for aligned reads. Moreover, this file format is consistent regardless of whether you have DNA-seq, RNA-seq, ChIP-seq... data. 

The `bam` file is a compressed, binary, version of a `sam` file.

### The `.sam` file

- **S**equence **A**lignment/**M**ap (sam) 
- The output from an aligner such as `bwa`
- Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
- May contain un-mapped reads
- Potentially large size on disk; ~100s of Gb
    + Can be manipulated with standard unix tools; e.g. *cat*, *head*, *grep*, *more*, *less*....
- Official specification can be [obtained online](http://samtools.github.io/hts-specs/SAMv1.pdf): -
- We normally work on a compressed version called a `.bam` file. See later.
- *Header* lines starting with an `@` character, followed by tab-delimited lines
    + Header gives information about the alignment and references sequences used


The first part of the header lists the names (`SN`) of the sequences (chromosomes) used in alignment, their length (`LN`) and a *md5sum* "[digital fingerprint](https://en.wikipedia.org/wiki/Md5sum)" of the `.fasta` file used for alignment (`M5`).

```

@HD VN:1.0 SO:coordinate
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
.....
.....

```


We also have a section where we can record the processing steps used to derive the file
```
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.1 CL:"/jetstream/scratch0/main/conda/envs/mulled-v1-65d5efe4f1b69ab7166d1a5a5616adebe902133ea3e4c189d87d7de2e21ddc17/bin/bowtie2-align-s --wrapper basic-0 -p 10 -x /cvmfs/data.galaxyproject.org/byhand/hg19/hg19full/bowtie2_index/hg19full -1 input_f.fastq -2 input_r.fastq"
....
....

```

Next is a *tab-delimited* section that describes the alignment of each sequence in detail. 

```
HWI-D00461:188:HVGY2BCXY:1:1109:3430:66266	163	13	32889826	42	108M	=	32889969	251	TTGGGACGAGCGCGTCTTCCGTAGTCCCAGTCCAGCGTGGCGGGGGAGCGCCTCACGCCCCGGGTCGCTGCCGCGGCTTCTTGCCCTTTTGTCTCTGCCAACCCCCAC	0D@@?GEHCHHCEHIDHH?1CCHCHI@1<CCCFC@GCCCEHIHCHICHC?HH=GHE1DE<CEHDEHHC<CCH/?HHG/<1<D@11D?G?FGHEHH01D00D;00<DH<	AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:21C86 YS:i:-5 YT:Z:CP

```

![](media/sam-entry-explained.png)

Column | Official Name | Brief
------ | -------------- | -----------
1      | QNAME          | Sequence ID
2      | FLAG           | Sequence quality expressed as a bitwise flag
3      | RNAME          | Chromosome
4      | POS            | Start Position
5      | MAPQ           | Mapping Quality
6      | CIGAR          | Describes positions of matches, insertions, deletions w.r.t reference
7      | RNEXT          | Ref. name of mate / next read
8      | PNEXT          | Position of mate / next read
9      | TLEN           | Observed Template length
10     | SEQ            | Sequence
11     | QUAL           | Base Qualities

There can also be all manner of optional tags as extra columns introduce by an aligner or downstream analysis tool. A common use is the `RG` tag which refers back to the read groups in the header.


### Sorting and indexing

You will notice from the 3rd column that the reads are ordered according to their start position; whereas the reads in the `fastq` file were arranged in order that they were generated on the flow cell. By default, `bowtie2` produces a bam where the reads are in the same order as the `fastq`. However, this is rather inconvenient for analysis where we require reads from the same location to be next to each other in the file.

An additional couple of steps have been performed after bowtie2; sorting the file according to genome position and producing an *index* file. The index file does not provide any useful information for us and cannot be viewed in Galaxy. However, we will need it later on when viewing the data in IGV. 

### Quality Flags

The *"flags"* in the sam file can represent useful QC information

  + Read is unmapped
  + Read is paired / unpaired
  + Read failed QC
  + Read is a PCR duplicate (see later)

The combination of any of these properties is used to derive a numeric value


For instance, a particular read has a flag of 163

![](media/flag-highlight.png)


### Derivation

There is a set of properties that a read can possess. If a particular property is observed, a corresponding power of 2 is added multiplied by 1. The final value is derived by summing all the powers of 2.

![](https://galaxyproject.org/tutorials/ngs/sam_flag.png)

Flag Value | Meaning
---------- | --------------------------------
69 (= 1 + 4 + 64) 	| The read is paired, is the first read in the pair, and is unmapped.
77 (= 1 + 4 + 8 + 64) |	The read is paired, is the first read in the pair, both are unmapped.
83 (= 1 + 2 + 16 + 64) |	The read is paired, mapped in a proper pair, is the first read in the pair, and it is mapped to the reverse strand.
99 (= 1 + 2 + 32 + 64) |	The read is paired, mapped in a proper pair, is the first read in the pair, and its mate is mapped to the reverse strand.
133 (= 1 + 4 + 128) |	The read is paired, is the second read in the pair, and it is unmapped.
137 (= 1 + 8 + 128)  |	The read is paired, is the second read in the pair, and it is mapped while its mate is not.
141 (= 1 + 4 + 8 + 128) |	The read is paired, is the second read in the pair, but both are unmapped.
147 (= 1 + 2 + 16 + 128) |	The read is paired, mapped in a proper pair, is the second read in the pair, and mapped to the reverse strand.
163 (= 1 + 2 + 32 + 128) |	The read is paired, mapped in a proper pair, is the second read in the pair, and its mate is mapped to the reverse strand.



See also

- https://broadinstitute.github.io/picard/explain-flags.html

### Have a CIGAR!


![](media/cigar-highlight.png)

The ***CIGAR*** (**C**ompact **I**diosyncratic **G**apped **Alignment** **R**eport) string is a way of encoding the match between a given sequence and the position it has been assigned in the genome. It is comprised by a series of letters and numbers to indicate how many consecutive bases have that mapping.


 
 Code  | Description
------------- | -------------
M  | alignment match
I  | insertion
D  | deletion
N  | skipped
S  | soft-clipping
H  | hard-clipping


e.g.

- `68M`
    + 68 bases matching the reference
- `1S67M`
    + 1 soft-clipped read followed by 67 matches
- `15M87N70M90N16M`
    + 15 matches following by 87 bases skipped followed by 70 matches etc.


#### 3. Check the alignment stats

We will now generate a few basic statistics about the alignment of our data

1. Select the tool *NGS: SAMtools -> Flagstat* 
2. In the *BAM File to Convert* box choose the bam file produced by bowtie2.


The tool will also report how many ***PCR Duplicates*** have been found in the data. But as we haven't yet run any software to identify such reads, the flagstat output will show 0 reads.

1. Find the tool *SAMtools -> Idxstats*
2. In the *BAM file* dropdown select the bam file produced by bowtie2

The output of this tool will tell you how many reads aligned to each chromosome in your reference genome. For our example dataset we only have reads from a small region, so shouldn't expect to see alignments to each chromosome.



### About Duplicates

The preparation of a sequencing library requires *PCR* amplification of your starting material. This can lead to some DNA fragments being over-represented in your data. As our DNA fragments are formed in a random process, and relatively small compared to the number of bases to be sequenced from the genome (3Gb in humans), we tend to think the two DNA fragments that have identical starting and ending position are unlikely to have occurred due to chance. Some software, such as [Picard](http://broadinstitute.github.io/picard/) will identify such artefacts and *mark* them for attention by downstream methods. i.e. they are not completely discarded from the analysis.

![](media/pcr_dups.png)

#### 4. Mark Duplicates with Picard

1. Use the tool *GENOMICS TOOLKITS -> Picard -> MarkDuplicates*
2. In *Select SAM/BAM dataset or dataset collection* choose the bam file produced by bowtie2.
3. What do you notice about the *flag* values for any reads that have the same *start* as another read? 
4. Interpret the meaning of these flags using the online tool
  + https://broadinstitute.github.io/picard/explain-flags.html

**Warning** the assumption about reads having the same start location being PCR duplicates falls down when we do sequencing for a very specific region of the genome. e.g. targeted sequencing from a panel of cancer genes. Running a tool to mark PCR duplicates on such data would recommend a high proportion of reads be ignored from further analysis.
  
#### 5. (Optional) Re-run the alignment statistics

1. Select the tool *GENOMIC FILE MANIPULATION -> SAM/BAM -> Samtools flagstat* 
2. In the *BAM File to Convert* box choose the bam file produced by the *mark duplicates* step

#### 6. Download your bam file

For the next step you will need to download the `bam` file that you produced after *marking duplicates*. To do this, you can click the floppy disk icon.

![](media/download_bam.png)

**Make sure that you click both Download dataset and Download bam_index**

# Section 5. Visualising the aligned reads with IGV

Whilst Bioinformatics tools are very powerful and allow you to perform statistical analyses and test hypotheses, there is no substitute for **looking at and exploring the data**. A trained-eye can quite quickly get a sense of the data quality before any computational analyses have been run. Furthermore, as the person requesting the sequencing, you probably know a lot about the biological context of the samples and what to expect.

We will load the aligned reads that we have just created into IGV and start to get a feel for the process of variant calling

1. Download IGV

You can download IGV for Windows using this link 

- https://software.broadinstitute.org/software/igv/download

*Extract* the zip file that you have downloaded



![](media/extract_igv.PNG)



Double-Click the file `igv` to launch IGV



![](media/run_igv.PNG)

<font size="5">
If you are unable to download IGV, you should be able to run a web-app from the Broad Institute website with the same functionality.

https://igv.org/app/

</font>


2. Load a reference Genome and some Data Tracks

By default, IGV should load with Human genome version *hg19* already loaded. It is essential that you use the **same genome version** that the reads were aligned to. You can check / change the genome by clicking the drop-down menu in the upper-left

![](media/select_genome.PNG)

We can also load extra *tracks* into the browser that can help us understand our variant calls. We can load data from *dbSNP* which will tell us about common mutations that already been identified. These can be loaded via *File* -> *Load from Server..* and selecting `dbSNP 1.4.7` from the `Variation and Repeats` section

![](media/available_datasets.PNG)


3. Navigate around IGV

When IGV loads up we start with a very high-level view of the genome where *all* chromosomes are visible. Such a view might be useful if we were interested in large copy-number variation across a cohort of samples. However, we are mostly interested in changes at the individual base-level which is not possible to view at this resolution. We need to navigate to a particular region of interest.

Alongside the drop-down menu used to change the genome, there is a drop-down menu to select a particular chromosome and a box where we can enter text. Inside the text box we can enter a particular genome interval of interest

e.g. `chr1:10,000-11,000`


IGV should now be displaying a region on chromosome 1 from base position `10,000` to `11,000`. 

![](media/zoom_region.PNG)


At this resolution, we can start to see the *genome sequence*. Each DNA base is represented by a different colour (A = green, C = blue) and it seems that this region is highly-repetitive; comprising mostly `C` bases. We can zoom further in using the zoom control in the top right of IGV

![](media/how_to_zoom.PNG)

We can use the zoom control, and also move along the genome by holding down by mouse button and sliding left-and-right, to navigate to whatever genomic location we want. If we known the name of the gene we want to interrogate we can navigate directly there using the text box

![](media/select_region.PNG)

The text box should now update to show the coordinates of `BRCA1` (`chr17:....`). At the bottom of the screen we can now see the *gene model* for the gene `BRCA1`.  Genes are represented as lines and boxes. Lines represent intronic regions and boxes represent exonic regions. The arrows indicate the direction / strand of transcription for the gene.

![](media/gene_region.PNG)

4. Load the aligned reads

Choose File > Load from File..., select the bam file that you downloaded from Galaxy, and click OK. Note that the bam and index files must be in the same directory for IGV to load these properly.

![](media/select_bam.PNG)

The main display of IGV should now update to hold tracks for the aligned reads from this bam file. It may seem like nothing is being displayed. This is because we are zoomed-out too far to be able to see the reads. **Use the zoom function and move along the genome until you get to the first exon of BRCA1**. After you have zoomed-in far enough you will start to see some grey rectangles. *These are the aligned reads*. Note that some *downsampling* may occur meaning that not all reads are displayed to reduce memory requirements.

If you hover over a particular read, how will see columns from the bam file being displayed such as the mapping quality and information about the paired reads.

Coloured-letters within the read indicate bases that are different to the reference genome. The entire read may be coloured differently to grey, which can indicate different things depending on how the display has been configured. For example, it can highlight paired-reads with an *insert size* different to that expected ([see here](https://software.broadinstitute.org/software/igv/interpreting_insert_size)). The display of aligned reads can be configured through the menus, as described [here](http://software.broadinstitute.org/software/igv/Preferences#Alignments). 
