Background and Metadata

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • What data are we using?

  • Why is this experiment important?

Objectives
  • Understand why human genomes are studied.

  • Understand the data set.

Background

We are going to use a sequencing dataset from healthy humans.

The Data

View the Metadata

The metadata file associated with this lesson can be downloaded directly here (right-click and choose "Save Link As") or viewed on GitHub. For details of how the file was created, see the notes and sources here.

This metadata describes the samples sequenced as part of the dataset. The columns represent:

Column                 Description
Sample name            Sample name
Sex                    Sex
Biosample ID
Population code        Short code for the population
Population name        Longer, descriptive name for the population
Superpopulation code   Grouping of populations from a similar geographic area (e.g. continent)
Population elastic ID
Data collections       Which datasets the sample belongs to

Challenge

Based on the metadata, can you answer the following questions using a spreadsheet program such as Excel?

  1. How many rows and how many columns are in this data?
  2. How many different superpopulations are there?
  3. How many different populations are of European origin?

Solution

  1. 3116 rows and 9 columns
  2. Nine different superpopulations
  3. Five populations within Europe
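
The same questions can also be answered with a short script instead of a spreadsheet. The following is a minimal sketch, not part of the original lesson. It assumes the metadata is tab-separated, that the column headers match the names in the table above, and that European samples carry the superpopulation code "EUR". All three are assumptions to check against the real file. The rows in `toy` are made up for illustration and are not taken from the lesson's metadata.

```python
import csv
import io


def summarize_metadata(text, delimiter="\t", superpop="EUR"):
    """Return basic counts for a delimited metadata table given as a string.

    The delimiter and the superpopulation code are assumptions; adjust them
    to match the actual file.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    columns = reader.fieldnames or []

    # Distinct, non-empty superpopulation codes.
    superpops = {r["Superpopulation code"] for r in rows if r["Superpopulation code"]}

    # Distinct population codes within the chosen superpopulation.
    pops_in_superpop = {
        r["Population code"] for r in rows if r["Superpopulation code"] == superpop
    }

    return {
        "rows": len(rows),  # data rows, not counting the header line
        "columns": len(columns),
        "superpopulations": len(superpops),
        "populations_in_superpop": len(pops_in_superpop),
    }


# Made-up rows for illustration only.
toy = (
    "Sample name\tSex\tPopulation code\tSuperpopulation code\n"
    "S1\tfemale\tGBR\tEUR\n"
    "S2\tmale\tFIN\tEUR\n"
    "S3\tmale\tYRI\tAFR\n"
)
summary = summarize_metadata(toy)
print(summary)
```

To run this on the downloaded file, read its contents into a string (for example with `open(path, encoding="utf-8").read()`, where `path` is wherever you saved it) and pass that string to `summarize_metadata`.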

Creating and editing metadata

The metadata for a project is usually entered by hand using software such as Microsoft Excel. When creating such metadata, bear in mind some common errors that can be introduced inadvertently and that complicate computational analysis. If you are unsure about these, consult these materials from Data Carpentry: https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/index.html
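
Some hand-entry errors can be caught automatically before analysis. The sketch below is one possible approach, not a summary of the linked materials: the choice of checks (stray whitespace, empty cells, values that differ only in capitalisation, rows with the wrong number of fields) is our own assumption. The function name `find_metadata_problems` and the rows in `toy` are hypothetical, and the input is assumed to be tab-separated.

```python
import csv
import io
from collections import defaultdict


def find_metadata_problems(text, delimiter="\t"):
    """Return a list of (line_number, column, message) for suspicious cells."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    problems = []
    spellings = defaultdict(set)  # (column, lowercased value) -> spellings seen

    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, value in row.items():
            # DictReader uses a None key for extra fields and a None value
            # for missing ones.
            if column is None or value is None:
                problems.append((line_number, column, "wrong number of fields"))
                continue
            stripped = value.strip()
            if value != stripped:
                problems.append((line_number, column, "leading or trailing whitespace"))
            if stripped == "":
                problems.append((line_number, column, "empty cell"))
            else:
                spellings[(column, stripped.lower())].add(stripped)

    # Report values that differ only in capitalisation, e.g. "female" / "Female".
    for (column, _), seen in spellings.items():
        if len(seen) > 1:
            message = "inconsistent capitalisation: " + ", ".join(sorted(seen))
            problems.append((None, column, message))

    return problems


# Made-up rows for illustration only.
toy = (
    "Sample name\tSex\n"
    "S1\tfemale\n"
    "S2\tFemale \n"
    "S3\t\n"
)
problems = find_metadata_problems(toy)
for problem in problems:
    print(problem)
```

An empty cell is not always an error, since some columns may legitimately have no value for a sample, so treat the output as a list of cells to review rather than a list of mistakes.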

Key Points

  • It’s important to record and understand your experiment’s metadata.