web : sbc.shef.ac.uk
twitter: [@SheffBioinfCore](https://twitter.com/SheffBioinfCore)
email: bioinformatics-core@sheffield.ac.uk
This tutorial will cover the basics of RNA-seq using Galaxy and Degust; two open-source web-based platforms for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical RNA-seq analysis and be comfortable with the outputs generated by the Bioinformatician.
Sections with this background indicate exercises to be completed during the workshop.
Sections with this background highlight particular shortcuts or other references that might be useful.
Sections with this background give information about error messages you might encounter, or problems that might arise in your own data analysis.
The official Galaxy page has many tutorials on using the service, and examples of other types of analysis that can be performed on the platform.
Those eventually wanting to perform their own RNA-seq analysis (for example in R) should look out for other courses.
From our experience, the most common application of RNA-seq is still to perform a “differential expression” analysis in order to identify genes, and eventually pathways, that are altered between a set of biological conditions. Therefore we will concentrate on this task in the workshop. We will also be considering bulk RNA-seq only, i.e. where our biological sample may comprise a pool of heterogeneous cells. Single-cell approaches are becoming more popular, and although there are some similarities in how these data are processed, a different downstream analysis approach is required.
The Galaxy Training Network provides materials on single-cell analysis, and other applications not covered in this workshop.
Two workflows are possible with RNA-seq data - with the difference being whether one performs an alignment to the reference genome or not.
Recent tools for RNA-seq analysis (e.g. salmon, kallisto) do not require the time-consuming step of whole-genome alignment, and can therefore produce gene-level counts much faster. They also do not require the creation of large bam files, which is useful if constrained by file space on Galaxy.
(image from Harvard Bioinformatics Core)
The data for this tutorial comes from a Journal of Experimental Medicine paper “Itraconazole targets cell cycle heterogeneity in colorectal cancer”. This study examines the expression profiles of two cell lines in response to treatment with itraconazole.
For this tutorial, we will assume that the wet-lab stages of the experiment have been performed. We will demonstrate the steps of quality assessment, alignment, quantification and differential expression testing.
The fastq data for this experiment were made available on the Sequence Read Archive (SRA) with accession SRP144496. For the purposes of this workshop we have created a downsampled dataset.
The experimental design for the dataset is summarised in the table below.
run | name | cell_line | condition |
---|---|---|---|
SRR7108388 | HT55_CONT_1 | HT55 | DMSO |
SRR7108389 | HT55_CONT_2 | HT55 | DMSO |
SRR7108390 | HT55_CONT_3 | HT55 | DMSO |
SRR7108391 | HT55_CONT_4 | HT55 | DMSO |
SRR7108392 | HT55_ITRA_1 | HT55 | ITRACONAZOLE |
SRR7108393 | HT55_ITRA_2 | HT55 | ITRACONAZOLE |
SRR7108394 | HT55_ITRA_3 | HT55 | ITRACONAZOLE |
SRR7108395 | HT55_ITRA_4 | HT55 | ITRACONAZOLE |
SRR7108396 | SW948_CONT_1 | SW948 | DMSO |
SRR7108397 | SW948_CONT_2 | SW948 | DMSO |
SRR7108398 | SW948_CONT_3 | SW948 | DMSO |
SRR7108399 | SW948_CONT_4 | SW948 | DMSO |
SRR7108400 | SW948_ITRA_1 | SW948 | ITRACONAZOLE |
SRR7108401 | SW948_ITRA_2 | SW948 | ITRACONAZOLE |
SRR7108402 | SW948_ITRA_3 | SW948 | ITRACONAZOLE |
SRR7108403 | SW948_ITRA_4 | SW948 | ITRACONAZOLE |
The Sequence Read Archive (SRA) is commonly used to store the raw data from sequencing experiments and can be accessed through the NCBI website. However, the interface is not particularly friendly and the links to download data are not easy to obtain.
An easier alternative exists in the form of SRA Explorer.
The SRA accession (usually found in a paper describing the dataset) can be entered into the Search box, and all the samples belonging to that dataset should be found. Samples of interest can be saved, and upon “checkout” the download links (URLs) will be displayed. A command-line tool such as curl or wget can then be used to download the files locally.
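If you prefer a script to curl or wget, the same download step can be sketched in Python using only the standard library. This is a minimal illustration rather than a recommended tool: the function names are our own, and any URL you pass in should be one copied from the SRA Explorer checkout page.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)


def download_all(urls, out_dir="."):
    """Download each URL into out_dir, skipping files that already exist."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        target = os.path.join(out_dir, filename_from_url(url))
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
```

Calling download_all with the list of links from the checkout page would place one file per sample in the chosen directory.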
Make sure you check your email to activate your account
The data for this course have all been shared on a google drive. If you have not done so already, please go to this directory and download the following files
https://drive.google.com/drive/folders/1RSuvl9shAw12Bj77uYSUdWtkZ5ST5EWi?usp=sharing
SRR7108388.fastq.gz
SRR7108389.fastq.gz
SRR7108392.fastq.gz
SRR7108393.fastq.gz
Homo_sapiens.GRCh38.cdna.all.fa.gz
tx2gene.txt
We are going to import the fastq files for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores. To make the practical quicker, we have downsampled the original fastq files to a quarter of a million reads.
You can import the data by:
navigating to the fastq directory of the zip file that you downloaded from google drive and selecting these two files from the HT55-DMSO condition:
SRR7108388.fastq.gz
SRR7108389.fastq.gz
and these two files from the HT55 ITRACONAZOLE condition:
SRR7108392.fastq.gz
SRR7108393.fastq.gz
Also upload the files Homo_sapiens.GRCh38.cdna.all.fa.gz and tx2gene.txt. These are reference files that we will use later.
SRR7108388.fastq.gz
SRR7108389.fastq.gz
SRR7108392.fastq.gz
SRR7108393.fastq.gz
The annotation files may take a while longer to upload
The .gz
at the end of each file name means that it is compressed (like a zip file).
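To make the fastq format and the .gz compression concrete, here is a small Python sketch that counts the reads in a compressed fastq file. It assumes the common layout of exactly four lines per read (identifier, sequence, separator, qualities); the function name and the tiny in-memory demo are illustrative only.

```python
import gzip
import io


def count_fastq_records(handle):
    """Count reads in an open text handle, assuming four lines per read."""
    n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Tiny in-memory demo: two reads, compressed the same way as a .fastq.gz file.
demo = gzip.compress(b"@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
with gzip.open(io.BytesIO(demo), "rt") as handle:
    print(count_fastq_records(handle))  # prints 2
```

For a real file, replace the in-memory demo with gzip.open("SRR7108388.fastq.gz", "rt").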
You can upload the other files for extra practice if you wish
To save time, the raw data for this dataset have been uploaded to Galaxy. To access the history, use the link below
https://usegalaxy.eu/u/markdunning/h/beginners-rna-seq—salmon
https://usegalaxy.org.au/u/markdunning/h/beginnersrnaseqsalmon - galaxy.org.au
There should be a button to import the history in the top-right corner. The raw data for the practical should now be available to you.
FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools, in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to draw the user’s attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.
Question: Do the data seem to be of reasonable quality?
You can use the documentation to help interpret the plots
If poor-quality bases towards the ends of reads are considered to be a problem, or there is considerable adapter contamination, we can employ various tools to trim our data.
However, a recent paper demonstrated that read trimming is no longer required prior to alignment:- https://www.biorxiv.org/content/10.1101/833962v1
If you also suspect contamination by another organism, or rRNA present in your data, you can use the sortMeRNA tool to remove this artefact.
It can be quite tiresome to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.
The multiqc tool has been designed to aggregate QC reports and combine them into a single report that is easy to digest.
FASTQ Quality Control -> Multiqc
Under Which tool was used generate logs? choose fastqc and select the RawData output from the fastqc run on each of your fastq files.
Question: Repeat the FastQC analysis for the remaining fastq files and combine the reports with multiQC. Do the fastq files seem to have consistently high quality?
Traditionally, workflows for RNA-seq would align reads to a reference genome, and then overlap with known gene coordinates. However, many now prefer to align directly to the transcriptome sequences using a method such as salmon or kallisto. We will demonstrate the salmon protocol.
As we are going to align to a set of transcripts rather than the whole genome, we require a file that contains the sequence of each transcript. This file has been provided in the google drive folder and should have been uploaded to your Galaxy history.
However, it is useful to know where this file came from in case you are not working with Human data. The file was obtained from Ensembl by clicking on the cDNA (FASTA) link for the appropriate organism (Human).
On the next screen, right-click to save the .cdna.all.fa.gz file to your computer.
The FASTA file is a large text file that lists all the transcripts for the given organism and their sequences. You could open this in a standard text editor if you wished to see the contents. The contents are similar to that of a FASTQ file. In essence, a FASTQ file is like a FASTA file with quality scores added for each base.
The identifier line for each sequence (starting with >) names the transcript and the gene it is associated with. Since we obtained the file from Ensembl, the transcript and gene identifiers begin with ENST and ENSG respectively. The numbers after each transcript or gene are the version numbers; the sequence and definition of each transcript / gene can evolve over time.
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000434970.2 cdna chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]
CCTTCCTAC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
By default, salmon will produce counts for each transcript. This might be what we want, but for most standard analyses it is preferable to work at the gene-level. We therefore have to tell salmon how the transcripts in the cDNA file relate to known genes. Such a file can be obtained from biomart.
It is important that the file downloaded from Biomart is edited so that the column headings do not contain any spaces. You can do this in a text editor. The edited file should look like this.
It is important to make sure the version number of your transcript file and the Biomart dataset are the same, otherwise some of the steps downstream might not work as expected.
If you have problems, this mapping file is also provided in the google drive as tx2gene.txt. The contents of the first column have to be in the same format as the transcript names in the fasta file, i.e. in this case the version number must be present.
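Because each identifier line in the Ensembl cDNA file already names both the transcript and its gene, a transcript-to-gene mapping can also be derived from the FASTA file itself. The Python sketch below illustrates the idea; it is not how the provided tx2gene.txt was made (that came from Biomart), and the two-column, tab-separated output without a header is an assumption that may need adjusting to match what your tool expects.

```python
import gzip


def parse_header(line):
    """Extract (transcript_id, gene_id) from an Ensembl cDNA FASTA header."""
    fields = line[1:].split()          # drop the leading '>' and split on spaces
    transcript_id = fields[0]          # e.g. ENST00000415118.1 (version kept)
    gene_id = next(f.split(":", 1)[1] for f in fields if f.startswith("gene:"))
    return transcript_id, gene_id


def write_tx2gene(fasta_gz, out_path):
    """Write one 'transcript<TAB>gene' line per sequence in a gzipped FASTA."""
    with gzip.open(fasta_gz, "rt") as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                out.write("\t".join(parse_header(line)) + "\n")
```

Keeping the version suffix on the transcript identifier means the first column matches the names in the FASTA file exactly.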
The Ensembl gene IDs are not particularly memorable, so it would be highly beneficial to have other annotations at hand to help us interpret the data downstream. We can use the biomart website again to produce a table to aid downstream interpretation.
This time, select only the Gene Stable ID tickbox in the GENE box. Expand the EXTERNAL panel by clicking the “+” next to EXTERNAL, and select HGNC symbol and NCBI gene (formerly Entrezgene) ID.
RNA Analysis -> Salmon quant
tx2gene.txt file from your history
Two jobs will now be queued for each sample fastq file. The Quantification output will contain transcript-level data, and the Gene Quantification output will be at the gene-level. We should expect the number of lines in the Gene Quantification file to be substantially smaller. If not, you will need to check that your transcript mapping file was correct.
The Gene Quantification output from each sample comprises the following columns (taken from the salmon documentation)
Note that we are using a downsampled dataset, so the majority of NumReads will be zero.
Methods for detecting differential expression are likely to want data in the form of a table, where every row is a different gene and each column is a unique biological sample. Before we can proceed we will therefore need to merge our salmon results into a single output. This can be done using the Salmon quantmerge tool.
RNA Analysis -> Salmon quantmerge
Use the +Insert Quant file and names button repeatedly to select each of your Gene Quantification outputs. The One-word sample names text box can be used to create a shorter column name for each output.
Once all the Gene Quantification files have been selected the drop-down menu under Columns should be changed from Length to NumReads.
The first step in merging our salmon output is to produce a table for each sample that contains just the gene name and the number of reads for that gene. This can be done with the advanced cut tool
Text Manipulation -> Advanced Cut columns from a table
These outputs can be merged using the Column join tool
Collection Operations -> Column join on multiple datasets
After the tool has finished you should have an output table with one row for each gene and a column containing the counts for that gene.
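For readers curious about what the merge is doing, the following Python sketch builds the same kind of gene-by-sample table outside Galaxy. It assumes each Gene Quantification output is a tab-separated file with Name and NumReads columns; the function names and the choice to fill missing genes with zero are our own.

```python
import csv


def read_counts(handle, id_col="Name", count_col="NumReads"):
    """Read one quantification table into a dict of gene -> read count."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[count_col]) for row in reader}


def merge_counts(samples):
    """Combine per-sample dicts into a header and one row per gene.

    samples maps a sample name to the dict returned by read_counts.
    Genes absent from a sample are given a count of 0.
    """
    names = list(samples)
    genes = sorted(set().union(*(samples[s].keys() for s in names)))
    rows = [[g] + [samples[s].get(g, 0.0) for s in names] for g in genes]
    return ["gene"] + names, rows
```

Each returned row holds a gene identifier followed by one count per sample, in the order the samples were supplied.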
Exercise: Upload the annotation file from biomart containing Ensembl Gene ID, hgnc and Entrez. Use the column join tool to create a table that allows you to identify the counts for given genes more easily.
If time allows, we will also follow this section
The workflow that people used for many years is summarised in this image from Ting-you Wang’s RNA-seq data analysis page, and may still be preferable if your analysis doesn’t just call for gene-level counts.
Mapping -> HISAT2
In the left tool panel menu, under NGS Analysis, select Mapping > HISAT2 and set the parameters as follows:
SRR7108388.fastq.gz
SRR7108389.fastq.gz
SRR7108392.fastq.gz
SRR7108393.fastq.gz
HTSeq-count creates a count matrix using the number of the reads from each bam file that map to the genomic features. For each feature (a gene for example) a count matrix shows how many reads were mapped to this feature.
Various rules can be used to assign counts to features
To obtain the coordinates of each gene, we can use the UCSC genome browser which is integrated into Galaxy. Unfortunately, the Ensembl FTP site cannot be used as the chromosome naming conventions used in Ensembl are different to the chromosome naming scheme used in the reference genomes supplied by Galaxy (1,2,.. vs chr1, chr2). The alternative would be to download a matching gtf and genome reference sequence from Ensembl and upload both to Galaxy. This would take more time and space on the Galaxy server.
Get Data -> UCSC Main table browser
Selecting the UCSC Main tool from Galaxy will take you to the UCSC table browser. From here we can extract gene coordinates for our genome of interest (hg38) in gtf format for processing with Galaxy.
Click get output and send query to Galaxy to be returned to Galaxy. A new job will be submitted to retrieve the coordinates from UCSC
When you are returned to Galaxy from UCSC it might look like you have lost all the files in your analysis and are no longer logged in.
To solve this, log back in and choose the View all histories option under the History panel.
There should be two “histories”; one containing all the outputs you generated before accessing UCSC, and one containing the UCSC output. At this point you can switch back to your previous history, and drag the box containing the UCSC output to this history.
RNA Analysis > htseq-count
The htseq tool is designed to produce a separate table of counts for each sample. This is not particularly useful for other tools such as Degust (see next section) which require the counts to be presented in a data matrix where each row is a gene and each column is a particular sample in the dataset.
Collection Operations -> Column Join on Collections
ht-seq count files from your history (SRR1552444.htseq, SRR1552450, etc…). Holding the CTRL key allows multiple files to be selected.
Differential expression testing is possible within Galaxy using the DESeq2 tool (for example). However, our particular recommendation is to use Degust for a more interactive experience. For this section, we will be using counts generated on the full dataset, rather than the downsampled data analysed in the previous section. These counts are available in the file GSE114013_salmon_counts.csv from the google drive, or can be downloaded using the link below.
The term differential expression was first used to refer to the process of finding statistically significant genes from a microarray gene expression study.
Such methods were developed on the premise that microarray expression values are approximately normally-distributed when appropriately transformed (e.g. by using a log\(_2\) transformation) so that a modified version of the standard t-test can be used. The same test is applied to each gene under investigation yielding a test statistic, fold-change and p-value. Similar methods have been adapted to RNA-seq data to account for the fact that the data are count-based and do not follow a normal distribution.
Degust is a web tool that can analyse the counts files produced in the step above, to test for differential gene expression. It offers an interactive view of the differential expression results.
The input file is a count matrix where each row is a measured gene, and each column is a different biological sample. Within the tool we can configure which samples belong to the different biological groups of interest. It is important that no normalisation has been applied to the counts that are submitted to DEGUST.
Read counts have to be normalised prior to differential expression testing; Degust performs this step itself, which is why raw counts are supplied. This page from Harvard Bioinformatics summarises the main biases in “raw” RNA-seq counts: sequencing depth, gene length and RNA composition.
Methods such as edgeR (implemented in Degust) and DESeq2 have their own method of normalising counts. You will probably encounter other methods of normalising RNA-seq reads such as RPKM, CPM, TPM etc. This blog provides a nice explanation of the current thinking. As part of the Degust output, you have the option of downloading normalised counts in various formats. Some other online visualisation tools require normalised counts as input, so it is good to have these to hand.
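As a concrete example of one of these measures, counts-per-million (CPM) simply rescales each sample's counts by its total number of reads. The Python sketch below shows the arithmetic for a single sample. It only accounts for sequencing depth, not gene length or RNA composition, so it should not be read as the normalisation that edgeR or DESeq2 apply internally.

```python
def counts_per_million(counts):
    """Convert raw counts for ONE sample (gene -> count) to CPM values."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no reads, CPM is undefined")
    return {gene: count / library_size * 1e6 for gene, count in counts.items()}
```

For example, a gene with 1 read in a sample of 4 reads in total has a CPM of 250000.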
GSE114013_salmon_counts.csv, and click Open.
run | name | cell_line | condition |
---|---|---|---|
SRR7108388 | HT55_CONT_1 | HT55 | DMSO |
SRR7108389 | HT55_CONT_2 | HT55 | DMSO |
SRR7108390 | HT55_CONT_3 | HT55 | DMSO |
SRR7108391 | HT55_CONT_4 | HT55 | DMSO |
SRR7108392 | HT55_ITRA_1 | HT55 | ITRACONAZOLE |
SRR7108393 | HT55_ITRA_2 | HT55 | ITRACONAZOLE |
SRR7108394 | HT55_ITRA_3 | HT55 | ITRACONAZOLE |
SRR7108395 | HT55_ITRA_4 | HT55 | ITRACONAZOLE |
SRR7108396 | SW948_CONT_1 | SW948 | DMSO |
SRR7108397 | SW948_CONT_2 | SW948 | DMSO |
SRR7108398 | SW948_CONT_3 | SW948 | DMSO |
SRR7108399 | SW948_CONT_4 | SW948 | DMSO |
SRR7108400 | SW948_ITRA_1 | SW948 | ITRACONAZOLE |
SRR7108401 | SW948_ITRA_2 | SW948 | ITRACONAZOLE |
SRR7108402 | SW948_ITRA_3 | SW948 | ITRACONAZOLE |
SRR7108403 | SW948_ITRA_4 | SW948 | ITRACONAZOLE |
(Note that the screenshots are for illustration purposes and taken from a different dataset to that being analysed in the tutorial.)
Each dot shows the change in expression in one gene.
Click on the dot to see the gene name. The panel on the right of the MA-plot will also update to show the expression of this gene in each sample in the form of a dot plot.
This is a multidimensional scaling plot which represents the variation between samples. It is a similar concept to a Principal Components Analysis (PCA) plot. The x-axis is the dimension with the highest magnitude. In a standard control/treatment setup, samples should be split along this axis. A desirable plot is shown below:-
Similar to the MA-plot, this shows the significance of each gene (y-axis) and magnitude of change (x-axis).
Initially this will contain the details of all genes present in the dataset. Once the FDR and logFC cut-offs are altered, any genes that do not meet the cut-offs will be removed.
The table can be sorted according to any of the columns (e.g. fold-change or p-value)
Above the genes table is the option to download the results of the current analysis to a csv file. You can also download the R code required to reproduce the analysis by clicking the Show R code box underneath the Options box.
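Once the results have been downloaded, the same FDR and log fold-change cut-offs can be applied outside Degust. The Python sketch below shows the filtering logic on rows that have already been read from the csv file (for example with csv.DictReader). The column names FDR and logFC are assumptions for illustration; check the header of your own download, as the fold-change column may be named after the condition.

```python
def filter_de(rows, fdr_col="FDR", logfc_col="logFC",
              max_fdr=0.05, min_abs_logfc=1.0):
    """Keep rows with FDR below max_fdr and |logFC| above min_abs_logfc."""
    return [
        row for row in rows
        if float(row[fdr_col]) < max_fdr
        and abs(float(row[logfc_col])) > min_abs_logfc
    ]
```

The length of the returned list is the number of genes passing both cut-offs.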
Question: Make a note of how many genes are detected with a FDR < 0.05 and abs logFC > 1. It seems that the differential expression analysis is detecting lots of genes. However, does this tell the whole story about the dataset? What do you think is the main factor separating samples on the x-axis, and thus explaining the most variation in the data?
We will now repeat the analysis, but only for samples from the HT55 cell-line. The correct configuration for this analysis is shown below:-
Exercise: How many genes are differentially-expressed with an FDR < 0.05 and abs logFC > 1? Download this file and rename it to HT55.ITRACONAZOLE_vs_DMSO.csv.
Exercise: Reset the FDR and abs logFC cut-offs back to their defaults (1 and 0 respectively) and download the file. Rename the file to background.csv. We will use this later.
Exercise: Repeat the analysis for SW948 samples and download the table of differentially-expressed results (same FDR and log fold-change) to SW948.ITRACONAZOLE_vs_DMSO.csv.
If you didn’t manage to complete these analyses, you can download the files from here by right-clicking on each link and selecting “Save Link as” (or equivalent). They are also available in the course google drive.
We might sometimes want to compare the lists of genes that we identify using different methods, or genes identified from more than one contrast. In our example dataset we can compare the genes in the contrast of ITRACONAZOLE vs DMSO in HT55 and SW948 cells
The website venny provides a really nice interface for doing this.
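The comparison that a Venn diagram shows can also be computed directly with sets. The short Python sketch below takes two lists of gene identifiers (for instance the gene columns of the HT55 and SW948 result files) and reports which genes are unique to each list and which are shared; the function name and the labels in the returned dictionary are our own.

```python
def compare_gene_lists(list_a, list_b):
    """Split two gene lists into the three regions of a two-way Venn diagram."""
    a, b = set(list_a), set(list_b)
    return {"only_a": a - b, "both": a & b, "only_b": b - a}
```

The sizes of the three sets correspond to the three numbers displayed in a two-circle Venn diagram.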
The final analysis we will perform is to include all the samples, but correct for the differences in cell line. This is achieved by telling Degust about the hidden factors in our dataset. The hidden factor in this dataset is whether the sample is from the HT55 or SW948 cell line. In other words, this is a factor that influences our results but is not one that we wish to compare. We only need to specify which samples are from HT55 and Degust will infer that the other samples belong to a different cell line. Other hidden factors you might need to include could be (depending on the MDS plot) :-
See below for the correct configuration to include the hidden factors.
Exercise: How many genes are identified with an FDR < 0.05 and abs logFC > 1 for this “hidden factor” analysis? How does it compare to the initial comparison of DMSO vs ITRA using all samples?
In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities?
There are two different approaches one might use, and we will cover the theory behind both. The distinction is whether you are happy to use a hard (and arbitrary) threshold to identify DE genes.
“Threshold-based” methods require the application of a statistical threshold to define a list of genes to test (e.g. FDR < 0.01). Then a hypergeometric test or Fisher’s Exact test is generally used. These are typically used in situations where plenty of DE genes have been identified, and people often use quite relaxed criteria for identifying DE genes (e.g. raw rather than adjusted p-values or FDR value).
The question we are asking here is;
“Is the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?”
or
“If I picked a set of genes at random that is the same as the number of DE genes, how many genes from Theme X would I expect to find”?
We can answer this question by knowing
The formula for Fisher’s exact test is:
\[ p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} \]
with:-
Category | is DE | Not DE | Row Total |
---|---|---|---|
In Gene Set | a | b | a + b |
Not in Gene Set | c | d | c + d |
Column Total | a + c | b + d | a + b + c + d = n |
This formula is printed here for your information. The software we use will perform all the calculations for us.
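For those who would like to see the arithmetic, the formula can be evaluated in a few lines of Python using the a, b, c, d labels from the table. The first function computes the probability of one particular table, exactly as in the formula; the second sums over tables with a or more DE genes in the gene set, which is one common way of turning it into a one-sided enrichment p-value. Both function names are our own, and the tools used in this workshop may differ in detail.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of observing exactly this 2x2 table (the formula above)."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def enrichment_p_value(a, b, c, d):
    """One-sided p-value: chance of seeing a or more DE genes in the gene set."""
    n = a + b + c + d
    in_set, not_in_set, n_de = a + b, c + d, a + c
    total = 0
    for x in range(a, min(in_set, n_de) + 1):
        total += comb(in_set, x) * comb(not_in_set, n_de - x)
    return total / comb(n, n_de)
```

With a = 3, b = 1, c = 1, d = 3 the table probability is 16/70 and the one-sided p-value is 17/70.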
In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:- http://www.geneontology.org/
There are several popular online tools for performing enrichment analysis. We will be using the online tool GOrilla to perform the pathways analysis as it is particularly straightforward. It has two modes; the first of which accepts a list of background and target genes.
Homo Sapiens
Two unranked lists of genes
Process
Search Enriched GO terms
You should be presented with a graph of enriched GO terms showing the relationship between the terms. Each GO term is coloured according to its statistical significance.
Below the figure is the results table. This links to more information about each GO term, and lists each gene in the category that was found in your list. The enrichment column gives 4 numbers that are used to determine enrichment (similar to the Fisher exact test we saw earlier)
Exercise: Repeat the GOrilla analysis to find enriched pathways in the HT55: ITRACONAZOLE vs DMSO analysis. What do you notice?
This type of analysis is popular for datasets where differential expression analysis does not reveal many genes that are differentially-expressed on their own. Instead, it seeks to identify genes that as a group have a tendency to be near the extremes of the log-fold changes. The results are typically presented in the following way.
The “barcode”-like panel represents where genes from a particular pathway (HALLMARK_E2F_TARGETS in this case) are located in a gene list ranked from most up-regulated to most down-regulated. The peak in the green curve is used to indicate where the majority of genes are located. If this is shifted to the left or the right it indicates that genes belonging to this gene set have a tendency to be up- or down-regulated.
As such, it does not rely on having to impose arbitrary cut-offs on the data. Instead, we need to provide a measure of the importance of each gene such as its fold-change. These are then used to rank the genes.
The Broad Institute has made this analysis method popular and provides a version of GSEA that can be run via a java application. However, the application can be a bit fiddly to run, so we will use the GeneTrail website instead.
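The idea of a running score along the ranked list can be illustrated with a simplified, unweighted version of the statistic. The Python sketch below walks down the ranked genes, stepping up when a gene belongs to the gene set and down otherwise, and reports the largest deviation from zero. This is a teaching sketch only: the published GSEA method weights each step by the ranking metric and assesses significance by permutation, neither of which is done here.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score for gene_set along a ranked gene list.

    ranked_genes is ordered from most up-regulated to most down-regulated.
    A positive score means the set is concentrated near the top of the list,
    a negative score means it is concentrated near the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and outside the set")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A gene set whose members all sit at the top of the ranking scores 1.0, and one whose members all sit at the bottom scores -1.0.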
Open background.csv in Excel and delete all columns except the SYMBOL and ITRA columns. Hopefully it should recognise your input without any errors, and on the next screen the Set-level statistic should be automatically set to GSEA.
If your data does not get uploaded, double-check that the column heading ITRA has not been pasted into the text box
To make the analysis run faster, you can de-select the GO pathways (biological process, molecular function and cellular component).
For your own analyses, you should consider analysing these categories. We are only de-selecting them here to make things run quicker
After a short wait, you will be able to view and download the results. The tested pathways are grouped into different sources (Kegg, Reactome or Wikipathways)
Each of the significant pathways can be explored in detail, such as showing which genes in that pathway are up- or downregulated.
The Rank of the gene shown is the position of the gene in the ranked list; with 1 being most up-regulated gene. The score is the score used to rank the genes (fold-change in our example).
Exercise: Explore the GeneTrail results. Does the method identify significant pathways for the HT55: ITRACONAZOLE vs DMSO analysis?