Installing Bioinformatics tools can be a major headache and frustration; even for experienced Bioinformaticians. In this brief tutorial we explain how you can run command-line Bioinformatics tools on your own desktop / laptop, or University of Sheffield computing cluster, with minimal setup.
It will be assumed that you have some familiarity with the Unix command-line interface, and know what commands you want to run. You can get a primer from the Software Carpentry organisation for example.
The particular example we will give is for an RNA-seq analysis. If you have a different analysis to perform, please check-out the RNA-seq instructions before seeing what seeing what other software we have made available.
Disclaimer:- this is not a tutorial on how to perform an RNA-seq analysis and which tools to use. You should use this tutorial when you are familiar with the workflow and want to run the tools on your own data.
A Virtual-Machine approach (e.g. using VirtualBox) could be used, but we will consider a solution using “Docker”.
Docker is an open platform for developers to build and ship applications, whether on laptops, servers in a data center, or the cloud. It is a (relatively) painless way for you to install and try out Bioinformatics software. You can think of it as an isolated environment inside your existing operating system where you can install and run software without messing with the main OS.
It is worth bearing in mind that the first approach will use the CPU and RAM from your own machine, so if you do not have adequate resources, some of the analyses may struggle and you might have to consider using a computing cluster.
Choose the appropriate link below to install docker on your machine
Once you have installed Docker using the instructions above, you can open a terminal (Mac) or command prompt (Windows; search for the CMD program) and type the following to check that everything is working
docker run hello-world
hello-world
is a pre-built container that prints a “Hello world” message to the screen. Many popular software (not just Bioinformatics) and pipelines are distributed using docker and in particular the dockerhub website. Our container for RNA-seq analysis is available at sheffieldbioinformatics/rnaseq-training
.
With the default settings, the “container” is isolated from your own machine; we can neither bring files that we create back to our own OS, or analyse our own data.
However, adding an -v
argument allows certain folders on your own OS to be visible within the environment.
Lets assume the files I want to analyse are to be found in the folder /c/work/my_fastq_data
. Here’s what the files look like on my Windows machine
The following command would map that directory to the folder /data
inside the docker container
docker run --rm -it -v /c/work/my_fastq_data:/data sheffieldbioinformatics/rnaseq-training
Notice how the command prompt changes to indicate that I am now the root
user within a different file system
N.B. the other options being used here are –rm to delete the container afterwards, and -it to make it interactive.
We now should be able to see our files with the ls
command on the directory /data/
.
cd /data/
ls
If I now want to run fastqc
to perform a QC check on my files, the fastqc
tool is available to us.
cd /data
fastqc *.fastq.gz
Conveniently the results are appear in the directory on our Windows machine
Other commonly-used tools that are available include:-
## QC
multiqc
## alignment
hisat2
tophat2
bowtie2
STAR
## trimming
cutadapt
## quantification
salmon
kallisto
## countinh
featureCounts
htseq-count
cuffdiff
## general
samtools
The trimmomatic trimming tool can also be executed using:-
java -jar $TRIMMOMATIC
Various tools from the RSEM suite can also be used. To see a full list, type rsem
and press the tab key.
rsem-calculate-expression
rsem-prepare-reference
##etc
The sra-toolkit is available for downloading files directly from the Sequence Read Archive (SRA)
fastq-dump
Additionally, various standard command-line tools are available to help you navigate and manipulate files.
cd
ls
cat
more
cp
wget
unzip
gunzip
tar
When you have finished with analysis, type exit
and close the command prompt.
exit
If you have the need for any RNA-seq software that is not covered by this list, but drop us a line (bioinformatics-core@sheffield.ac.uk).
As described above, docker
can be used to run an RNA-seq analysis on a personal laptop or desktop. However, there are some security implications of docker that prohibits it being installed on a high-performance computing system. An alternative is singularity
, which allows computing environments to be distributed as a single image file. We have created such a singularity image for the analysis of RNA-seq and made it available on sharc (The University of Sheffield’s HPC).
If you don’t already have one, you will need to request an account on sharc.
https://www.sheffield.ac.uk/it-services/research/hpc/register
The UoS website has documentation on connecting to the HPC and transfering data.
https://www.sheffield.ac.uk/it-services/research/hpc/sharc
On my Windows machine I use the free tools putty and WinSCP to connect to the cluster
Firstly, login to sharc using the steps in the documentation and then open an interactive shell. e.g.
ssh USERNAME@sharc.shef.ac.uk
## open an interactive shell with 6Gb of RAM. Change as required
qrshx -l rmem=6G
Standard Unix commands (e.g. ls
, cd
, pwd
, wget
…) are already available when you first login to sharc
. However, the tools specific to NGS analysis will require you to prefix the path to the singularity image before the command to run the tool.
To save typing to path to the singularity image can be saved as a variable
export IMAGE_BASE="singularity exec /usr/local/community/bioinformatics-core/singularity/rnaseq-training.sif"
Lets say that I have some fastq
files in the folder /fastdata/md1mjdx/my_fastqs
then I could run fastqc
as follows.
cd /fastdata/md1mjdx/my_fastqs
$IMAGE_BASE fastqc *.fastq.gz
The tools listed in the previous section are all available, but this time the command must be prefixed with $IMAGE_BASE
. The commands listed below will typically print the help. For details of how to run the command for your analysis consult the corresponding documentation.
$IMAGE_BASE multiqc
$IMAGE_BASE salmon
$IMAGE_BASE kallisto
$IMAGE_BASE hisat2-build
$IMAGE_BASE hisat2
$IMAGE_BASE STAR
$IMAGE_BASE samtools
$IMAGE_BASE featureCounts
$IMAGE_BASE htseq-count
If you are outside the University of Sheffield We are not able to offer a great deal of support for analysis on your institutes’ computing environment. In brief, the singularity
tool will need to be installed, which can then be used to convert the docker container into a corresponding singularity file.
singularity build rnaseq-training.sif docker://sheffieldbioinformatics/rnaseq-training
The command to run the docker container for this is:-
docker run --rm -it -v /c/work/my_fastq_data:/data sheffieldbioinformatics/chipseq-training
Providing the following software
# qc
fastqc
multiqc
# trimming
cutadapt
# alignment
bowtie2
# peak-calling
macs2
# downstream
meme
# general
samtools
bedtools
# SRA toolkit
fastq-dump
The picard suite of tools can also be accessed
java -jar $PICARD
``
### UoS computing cluster
export IMAGE_BASE=“singularity exec /usr/local/community/bioinformatics-core/singularity/chipseq-training.sif” ```
Coming soon….