2018-10-24

Introduction

Overview

  • What is statistics? Why do we need it?
  • Understanding data types
  • Worked examples and exercises of various forms of "t-test"
    • using online tools

The point of statistics

  • Rarely feasible to study the whole population that we are interested in, so we take a sample instead
  • Assume that data collected represents a larger population
  • Use sample data to make conclusions about the overall population

Beginning a study

  • Which samples to include?
    • Randomly selected?
    • Generalisability
  • Always think about the statistical analysis
    • Randomised comparisons, or biased?
    • Any dependency between measurements?
    • Data type?
    • Distribution of data? (Normally distributed? Skewed? Bimodal?…)

Generalisability

  • How samples are selected affects interpretation
    • What is the population that the results apply to?
    • How widely applicable will the study be?
  • Statistical methods assume random samples
  • Do not extrapolate beyond range of the data
    • i.e. don’t assume results apply to anything not represented in the data
  • Examples:
    • Males only, no idea about females
    • Adults only, no idea about children
    • 1 litter of mice, no idea about other litters

Data types

  • Several different categorisations
  • Simplest:
    • Categorical (nominal)
    • Categorical with ordering (ordinal)
    • Discrete
    • Continuous

Nominal

  • Most basic type of data
  • Three requirements:
    • The same value is assigned to all members of a level
    • The same value is not assigned to different levels
    • Each observation is assigned to only one level
  • Boils down to a yes/no answer
  • e.g. Surgery type, smoker / non-smoker, eye colour, dead/alive, ethnicity.

Ordinal

  • Mutually exclusive fixed categories
  • Implicit order
  • Can say one category higher than another
    • But not how much higher
  • Example: stress level 1 = low … 7 = high
  • Others: Grade, stage, treatment response, education level, pain level.

Discrete

  • Fixed categories, can only take certain values
  • Like ordinal but with well-defined distances
    • Can be treated as continuous if range is large
  • Anything counted (cardinal) is discrete
    • how many?
  • Examples: number of tumours, shoe size, hospital admissions, number of side effects, medication dose, CD4 count, viral load, reads.

Continuous

  • Final type of data
  • Anything that is measured, can take any value
  • May have finite or infinite range
  • Zero may be meaningful (a true zero) or arbitrary: this decides whether ratios, or only differences, make sense
    • Care required with interpretation
  • Given any two observations, another value can always fit between them
  • Examples: Height, weight, blood pressure, temperature, operation time, blood loss, age.

Data types

  • Several different categorisations
  • Simplest:
    • Categorical (nominal) – yes/no
    • Categorical with ordering (ordinal) – implicit order
    • Discrete – only takes certain values; counts (cardinal)
    • Continuous – measurements; finite/infinite range

Measurements: Dependent / Independent?

  • Measurements of gene expression taken from each of 20 individuals
  • Are any measurements more closely related than others?
    • Siblings/littermates?
    • Same individual measured twice?
    • Batch effects?
  • If there is no reason to suspect dependence, assume independent observations

Continuous Data - Normal Distribution

Continuous Data - Normal Distribution?

Continuous Data - Descriptive Statistics

  • Measures of location and spread

  • Mean and standard deviation

\(\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}\)

\(s.d. = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\)
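
As a quick check of the two formulas, here is a minimal Python sketch that computes them directly and compares the results with the standard library. The function names and the values in `data` are made up for illustration and do not come from the slides.

```python
import math
import statistics

def sample_mean(xs):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(xs)
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))

# Hypothetical values, for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_mean(data))          # 5.0
print(round(sample_sd(data), 2))  # 2.14

# The standard library functions give the same answers
assert math.isclose(sample_mean(data), statistics.mean(data))
assert math.isclose(sample_sd(data), statistics.stdev(data))
```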

Continuous Data - Descriptive Statistics

  • Median: middle value
  • Lower quartile: median of the bottom half of the data
  • Upper quartile: median of the top half of the data
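
A minimal Python sketch of these definitions. It assumes one common convention: with an odd number of values, the middle value is left out of both halves. Other conventions exist (for example, `statistics.quantiles` interpolates), so software may report slightly different quartiles. The helper name `quartiles` and the sample values are illustrative only.

```python
import statistics

def quartiles(xs):
    """Return (lower quartile, median, upper quartile) using medians of halves."""
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]            # bottom half (middle value excluded if n is odd)
    upper = s[len(s) - half:]   # top half
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

# Hypothetical values, for illustration only
print(quartiles([7, 1, 3, 9, 5, 11, 13]))   # (3, 7, 11)
```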

Continuous Data - Descriptive Statistics (Example)

  • e.g. Number of Facebook friends for 7 colleagues
    • 311, 345, 270, 310, 243, 5300, 11
  • Mean and standard deviation

    \(\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = 970\)

    \(s.d. = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \approx 1912.57\)

  • Median and Interquartile range
    • 11, 243, 270, 310, 311, 345, 5300

Continuous Data - Descriptive Statistics (Example)

  • e.g. Number of Facebook friends for 7 colleagues
    • 311, 345, 270, 310, 243, 530, 11
  • Mean and standard deviation

    \(\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \approx 289\)

    \(s.d. = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \approx 153.79\)

  • Median and Interquartile range
    • 11, 243, 270, 310, 311, 345, 530

Continuous Data - Descriptive Statistics (Example)

  • e.g. Number of Facebook friends for 7 colleagues
    • 311, 345, 270, 310, 243, 530, 11
  • Mean and standard deviation: low breakdown point

    \(\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \approx 289\)

    \(s.d. = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \approx 153.79\)

  • Median and Interquartile range: robust to outliers
    • 11, 243, 270, 310, 311, 345, 530
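
The effect of a single extreme value can be checked directly. This Python sketch (standard library only) summarises the two versions of the friends data used in the examples, one with 530 and one with 5300. The quartiles use the median-of-halves convention, which is an assumption about how the interquartile range is meant to be computed; the helper name `summarise` is illustrative.

```python
import statistics

def summarise(xs):
    """Mean, sample sd, median and interquartile range of a list of numbers."""
    s = sorted(xs)
    half = len(s) // 2
    q1 = statistics.median(s[:half])            # median of the bottom half
    q3 = statistics.median(s[len(s) - half:])   # median of the top half
    return {
        "mean": statistics.mean(s),
        "sd": statistics.stdev(s),              # n - 1 denominator
        "median": statistics.median(s),
        "iqr": q3 - q1,
    }

friends = [311, 345, 270, 310, 243, 530, 11]
friends_outlier = [311, 345, 270, 310, 243, 5300, 11]

for label, data in [("530", friends), ("5300", friends_outlier)]:
    r = summarise(data)
    print(f"{label:>5}: mean={r['mean']:.0f} sd={r['sd']:.2f} "
          f"median={r['median']} IQR={r['iqr']}")
```

The mean and standard deviation move from about 289 and 153.79 to 970 and 1912.57, while the median (310) and the interquartile range (345 - 243 = 102) are identical for both versions.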

Categorical data

  • Summarised by counts and percentages
  • Examples
    • 19/82 (23%) subjects had a Grade IV tumour
    • 48/82 (59%) subjects had diarrhoea as an adverse event
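
A summary of this kind can be produced with a simple tally. The sketch below uses Python's `collections.Counter`; the list of grades is invented for illustration and is not the study data quoted above.

```python
from collections import Counter

# Hypothetical tumour grades for 10 subjects, for illustration only
grades = ["II", "IV", "III", "II", "IV", "I", "III", "III", "II", "IV"]

counts = Counter(grades)
n = len(grades)

# Report each level as count / total (percentage)
for level in sorted(counts):
    k = counts[level]
    print(f"Grade {level}: {k}/{n} ({100 * k / n:.0f}%)")
```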

Hypothesis tests - basic setup

  • Formulate a null hypothesis, \(H_0\)
    • Example: the difference in gene expression before and after treatment = 0
  • Calculate a test statistic from the data under the null hypothesis
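  • Compare the test statistic to its theoretical distribution under the null hypothesis
    • is it more extreme than expected?
    • the "p-value": the probability of a result at least this extreme if \(H_0\) is true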
  • Either reject or do not reject the null hypothesis
    • "Absence of evidence is not evidence of absence" (Altman and Bland, 1995)
  • (Correction for multiple testing)
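
The steps above can be sketched in code. To stay within the Python standard library, this sketch swaps in an exact sign-flip permutation test for paired differences rather than a t-test; the logic (null hypothesis, test statistic, comparison with the null distribution, decision) is the same. The `diffs` values and the 0.05 significance level are invented for illustration.

```python
from itertools import product
from statistics import mean

# Hypothetical after-minus-before differences in expression for 8 individuals
diffs = [1.2, 0.8, -0.3, 1.5, 0.9, 0.4, 1.1, -0.2]

# Step 1: H0 says the true mean difference is 0.
# Step 2: the test statistic is the observed mean difference.
observed = mean(diffs)

# Step 3: under H0 each difference is equally likely to be positive or
# negative, so enumerating all 2^n sign patterns gives the null distribution.
null_stats = [
    mean([s * abs(d) for s, d in zip(signs, diffs)])
    for signs in product([1, -1], repeat=len(diffs))
]

# Two-sided p-value: share of sign patterns at least as extreme as observed
# (the small tolerance guards against floating-point ties).
p_value = sum(abs(t) >= abs(observed) - 1e-12 for t in null_stats) / len(null_stats)

# Step 4: reject H0 only if p is below the chosen significance level.
alpha = 0.05
print(f"mean difference = {observed:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```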

Hypothesis tests - Example

  • The Lady Tasting Tea - Randomised Experiment by Fisher
  • Randomly ordered 8 cups of tea
    • 4 were prepared by first adding milk
    • 4 were prepared by first adding tea
  • Task: Lady had to select the 4 cups of one particular method

Hypothesis tests - Example

  • \(H_0\): Lady had no such ability
  • Test Statistic: number of successes in selecting the 4 cups
  • Result: Lady got all 4 cups correct
  • Conclusion: Reject the null hypothesis
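
The strength of this evidence can be quantified. Under \(H_0\) every choice of 4 cups out of 8 is equally likely, so the number of correct picks follows a hypergeometric distribution. A short Python sketch (the function and parameter names are illustrative):

```python
from math import comb

def prob_correct(k, cups=8, milk_first=4):
    """P(exactly k of the chosen cups are truly milk-first) under pure guessing.

    The lady picks as many cups as there are milk-first cups (4 of 8).
    """
    others = cups - milk_first
    return comb(milk_first, k) * comb(others, milk_first - k) / comb(cups, milk_first)

# Null distribution of the test statistic (number of correct picks)
for k in range(5):
    print(k, round(prob_correct(k), 4))

# All 4 correct: 1 way out of C(8, 4) = 70
print(prob_correct(4))   # 1/70, about 0.014
```

At a conventional 5% significance level, 4 out of 4 (p = 1/70) leads to rejecting \(H_0\). Getting 3 or more correct has probability 17/70, about 0.24, and would not.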

Hypothesis tests - Errors

  • Many factors that may affect our results
    • significance level, sample size, difference of interest, variability of the observations
  • Be aware of issues of multiple testing
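
To see why multiple testing matters, this Python sketch computes the chance of at least one false positive across m tests when every null hypothesis is true, together with a Bonferroni-adjusted threshold. Independence of the tests and the choice of Bonferroni are assumptions made for the illustration; the slides do not name a specific correction.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / m

# With alpha = 0.05 the familywise rate grows quickly with the number of tests
for m in (1, 10, 100):
    fwer = familywise_error_rate(0.05, m)
    print(f"m={m:>3}: familywise error rate={fwer:.3f}, "
          f"Bonferroni threshold={bonferroni_threshold(0.05, m):.4f}")
```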

Easier way to remember