Data Wrangling and Processing for Genomics: Glossary

Key Points

Background and Metadata
  • It’s important to record and understand your experiment’s metadata.

Assessing Read Quality
  • Quality encodings vary across sequencing platforms.

  • FastQC and MultiQC can generate quality control reports for sequencing data.

  • Keep your project directories tidy.

  • Files can be copied from the HPC to your own machine for interactive visualisation.
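
Because quality encodings vary across platforms, it helps to see how a single quality character maps to a score. The following is a minimal sketch, assuming Phred+33 encoding; the function name phred_score and the offset variable are hypothetical names chosen for illustration, and data from another platform may need a different offset.

```shell
#!/usr/bin/env bash
# Decode one FASTQ quality character into a Phred quality score.
# ASSUMPTION: Phred+33 encoding. Other encodings use a different offset.
offset=33

phred_score() {
    local char="$1"
    local ascii
    # A leading single quote makes printf emit the character's numeric code.
    ascii=$(printf '%d' "'$char")
    echo $(( ascii - offset ))
}

score_I=$(phred_score "I")
score_hash=$(phred_score "#")

echo "I -> Q${score_I}"      # prints: I -> Q40
echo "# -> Q${score_hash}"   # prints: # -> Q2
```

Under this assumption the character I (ASCII 73) decodes to a score of 40 and # (ASCII 35) decodes to 2.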

Trimming and Filtering
  • The options you set for the command-line tools you use are important!

  • Data cleaning is an essential step in a genomics workflow.

Variant Calling Workflow
  • Bioinformatic command-line tools are collections of commands that can be used to carry out bioinformatic analyses.

  • To use the most powerful bioinformatic tools, you’ll need to use the command line.

  • There are many different file formats for storing genomics data. It’s important to understand what type of information is contained in each file, and how it was derived.

Automating a Variant Calling Workflow
  • We can combine multiple commands into a shell script to automate a workflow.

  • Use echo statements within your scripts to get an automated progress update.

  • We can give names to our output files and directories using variables.
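
The three points above can be sketched together in one short script. This is an illustrative outline only: the sample names, the directory layout, and the output file suffixes are hypothetical, and the real alignment and variant calling commands are replaced by a touch stand-in so that the sketch runs without any bioinformatic tools installed.

```shell
#!/usr/bin/env bash
# ASSUMPTION: sample names and directory layout are invented for illustration.
workdir=$(mktemp -d)
results="${workdir}/results"
mkdir -p "${results}/sam" "${results}/vcf"

for fq in sampleA.fastq sampleB.fastq; do
    # Derive a base name, then build every output name from it.
    base=$(basename "$fq" .fastq)
    sam="${results}/sam/${base}.aligned.sam"
    vcf="${results}/vcf/${base}_final_variants.vcf"

    # An echo statement gives an automated progress update.
    echo "Working with sample ${base}"

    # The workflow's real commands would go here, writing to $sam and $vcf.
    # Stand-in so the sketch runs anywhere:
    touch "$sam" "$vcf"

    echo "Finished ${base}, variants written to ${vcf}"
done
```

Keeping every path in a variable means that a change to the directory layout is made in one place rather than in each command.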

Running Genomics Workflows on HPC
  • Job arrays let us run the same job script over many samples with a single submission.

  • Some tools are able to use multiple threads.
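
As a rough sketch of how a job array and a thread count can fit together, the script below assumes the Slurm scheduler, which these key points do not name; other schedulers use different directives and variables. The sample names are hypothetical. Slurm sets SLURM_ARRAY_TASK_ID for each array task and SLURM_CPUS_PER_TASK when --cpus-per-task is requested; the defaults let the script also run outside the scheduler.

```shell
#!/usr/bin/env bash
#SBATCH --array=1-3
#SBATCH --cpus-per-task=4
# ASSUMPTION: Slurm scheduler; sample names are invented for illustration.

samples=(sampleA sampleB sampleC)

# Fall back to task 1 and a single thread when not running under Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
threads="${SLURM_CPUS_PER_TASK:-1}"

# Array task IDs start at 1 here, while bash arrays are indexed from 0.
sample="${samples[$((task_id - 1))]}"

echo "Task ${task_id}: processing ${sample} with ${threads} thread(s)"
# A multi-threaded tool would receive ${threads} through its own threads option.
```

Each array task picks a different sample from the list, so one submission covers all three samples.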

Workshop Wrap-up
  • Many reproducible pipelines and workflows are already available.

  • There is no need to reinvent the wheel.

Glossary

FIXME