Background and Metadata

Overview

Teaching: 10 min
Exercises: 5 min

Questions

What data are we using?

Why is this experiment important?

Objectives

Understand why we study human genomes.

Understand the data set.

Background

We are going to use a sequencing dataset from healthy humans.

What is genome sequencing?
- Genome sequencing (sometimes called next-generation sequencing (NGS) or high-throughput sequencing) is the process by which small stretches of an individual’s DNA are “read” to see which bases (A, T, C or G) they are composed of. These reads are then compared to a reference genome to see where they originated and which mutations are present. Mutations, differences in DNA sequence between individuals, are not always harmful and can be responsible for normal variation such as eye colour. However, some mutations have the potential to lead to the progression of disease. The most popular method of sequencing is that employed by Illumina, which is demonstrated in this short video.

What is the 1000 Genomes Project?
- The 1000 Genomes Project was established in 2008 to study variation in the human genome and provide a solid foundation on which to build an understanding of genetic variation in the human population.

Why is the 1000 Genomes Project important?
- The data generated for the 1000 Genomes Project can be incorporated into many healthcare studies. For example, when identifying mutations in a diseased individual, we can use mutations identified among healthy individuals to narrow down our search for potential disease-causing mutations.
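The narrowing-down idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each variant is represented as a tuple of chromosome, position, reference base and alternate base, and every value below is made up. The function name `candidate_variants` is hypothetical, and real analyses work with variant files and dedicated tools rather than hand-typed sets.

```python
# Each variant is (chromosome, position, reference base, alternate base).
# All values below are invented for illustration; they are not real data.
patient_variants = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 2500, "C", "T"),
    ("chr7", 4200, "G", "A"),
}
healthy_variants = {
    ("chr1", 1000, "A", "G"),
    ("chr7", 4200, "G", "A"),
}


def candidate_variants(patient, healthy):
    """Keep only the patient variants that were not seen in healthy individuals."""
    return patient - healthy  # set difference


candidates = candidate_variants(patient_variants, healthy_variants)
print(sorted(candidates))
```

Variants that also occur in healthy individuals are dropped, leaving a shorter list of candidates to investigate further.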

The Data

We have selected three individuals from the 1000 Genomes Project and will be working with a subset of the data for these individuals. This is to make the tools and workflows run in a reasonable amount of time. When analysing your own data, the same steps can be applied, although they will take much longer to complete.

View the Metadata

The metadata file associated with this lesson can be downloaded directly here (right-click and Save Link as) or viewed on GitHub. If you would like to know details of how the file was created, you can look at some notes and sources here.

This metadata describes the samples sequenced as part of the dataset. The columns represent:

Column	Description
Sample name	Sample name
Sex	Sex
Biosample ID
Population code	Short code for the population
Population name	Longer, descriptive name for the population
Superpopulation code	Grouping of populations from a similar geographic area (e.g. continent)
Population elastic ID
Data collections	Which datasets the sample belongs to

Challenge

Based on the metadata, can you answer the following questions using a spreadsheet such as Excel?

How many rows and how many columns are in this data?

How many different super populations are there?

How many different populations exist with European origin?

Solution

3116 rows and 9 columns

Nine different super populations

Five populations within Europe
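The same questions can also be answered with a short script instead of a spreadsheet. The sketch below uses only the Python standard library and runs on a tiny made-up table, not on the lesson's metadata file. It assumes a tab-separated file whose headers match the column names in the table above and that European populations carry the super population code `EUR`; check both against the real file. The function name `summarise` is hypothetical.

```python
import csv
import io

# Toy stand-in for the metadata file: three invented rows, NOT the lesson data.
# Headers follow the column table above; the real file may spell them differently.
TOY_METADATA = (
    "Sample name\tSex\tPopulation code\tSuperpopulation code\n"
    "toy_sample_1\tfemale\tGBR\tEUR\n"
    "toy_sample_2\tmale\tFIN\tEUR\n"
    "toy_sample_3\tfemale\tYRI\tAFR\n"
)


def summarise(handle, delimiter="\t"):
    """Return the counts asked for in the three challenge questions."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = list(reader)  # data rows only; the header line is not included
    superpopulations = {row["Superpopulation code"] for row in rows}
    european = {
        row["Population code"]
        for row in rows
        if row["Superpopulation code"] == "EUR"  # assumed code for Europe
    }
    return {
        "rows": len(rows),
        "columns": len(reader.fieldnames),
        "superpopulations": len(superpopulations),
        "european_populations": len(european),
    }


summary = summarise(io.StringIO(TOY_METADATA))
print(summary)
```

To run it on a downloaded copy of the metadata, replace the `io.StringIO(...)` argument with an open file handle, for example `open(path, newline="")`, where `path` is wherever you saved the file. Note that the row count here excludes the header line, which may differ from the count a spreadsheet reports.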

Creating and editing metadata

The metadata for a project is usually entered by hand using software such as Microsoft Excel. When creating such metadata, bear in mind some common errors that can be inadvertently introduced and that complicate computational analysis. If you are unsure about these, consult the following Data Carpentry materials: https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/index.html
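Some hand-entry problems can be caught automatically before analysis. The sketch below is a minimal illustration, not a complete validator. The two checks it performs (stray whitespace around a value, and values that differ only in capitalisation) are picks made for this example rather than a list taken from the linked materials. The table is made up and the function name `find_problems` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented table containing two deliberate hand-entry slips.
TOY_METADATA = (
    "Sample name\tSex\n"
    "toy_sample_1\tfemale\n"
    "toy_sample_2\tFemale\n"   # capitalisation differs from the row above
    "toy_sample_3 \tmale\n"    # trailing space after the sample name
)


def find_problems(handle, delimiter="\t"):
    """Report stray whitespace and inconsistent capitalisation per column."""
    problems = []
    spellings = defaultdict(set)  # (column, lowercased value) -> spellings seen
    reader = csv.DictReader(handle, delimiter=delimiter)
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, value in row.items():
            if column is None or value is None:
                continue  # skip ragged rows in this simple sketch
            if value != value.strip():
                problems.append(f"line {line_number}, {column}: stray whitespace")
            spellings[(column, value.strip().lower())].add(value.strip())
    for (column, _), seen in sorted(spellings.items()):
        if len(seen) > 1:
            problems.append(f"{column}: inconsistent capitalisation {sorted(seen)}")
    return problems


problems = find_problems(io.StringIO(TOY_METADATA))
for problem in problems:
    print(problem)
```

Checks like these matter because a program treats `female` and `Female`, or `toy_sample_3` with and without a trailing space, as different values.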

Key Points

It’s important to record and understand your experiment’s metadata.
