2018-10-24

## Overview

• What is statistics? Why do we need it?
• Understanding data types
• Worked examples and exercises on various forms of the "t-test"
• Using online tools

## The point of statistics

• Rarely feasible to study the whole population that we are interested in, so we take a sample instead
• Assume that data collected represents a larger population
• Use sample data to make conclusions about the overall population
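The idea of using a sample to learn about a population can be sketched with a small simulation. The population here (10,000 simulated heights), the sample size of 200 and the seed are all made-up assumptions for illustration, not data from the course.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated heights in cm (illustrative values only).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Draw a simple random sample of 200 individuals, without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

With a random sample, the sample mean should land close to the population mean, which is what justifies drawing conclusions from the sample.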

## Beginning a study

• Which samples to include?
• Randomly selected?
• Generalisability
• Always think about the statistical analysis
• Randomised comparisons, or biased?
• Any dependency between measurements?
• Data type?
• Distribution of data? (Normally distributed? Skewed? Bimodal?…)

## Generalisability

• How samples are selected affects interpretation
• What is the population that the results apply to?
• How widely applicable will the study be?
• Statistical methods assume random samples
• Do not extrapolate beyond range of the data
• i.e. don’t assume results apply to anything not represented in the data
• Examples:
• Males only, no idea about females
• 1 litter of mice, no idea about other litters

## Data types

• Several different categorisations
• Simplest:
• Categorical (nominal)
• Categorical with ordering (ordinal)
• Discrete
• Continuous

## Nominal

• Most basic type of data
• Three requirements:
• Same value assigned to all members of a level
• Same value not assigned to different levels
• Each observation assigned to only one level
• Boils down to yes/no answer
• e.g. Surgery type, smoker / non-smoker, eye colour, dead/alive, ethnicity.

## Ordinal

• Mutually exclusive fixed categories
• Implicit order
• Can say one category higher than another
• But not how much higher
• Example: stress level 1 = low … 7 = high
• Others: Grade, stage, treatment response, education level, pain level.

## Discrete

• Fixed categories, can only take certain values
• Like ordinal but with well-defined distances
• Can be treated as continuous if range is large
• Anything counted (cardinal) is discrete
• how many?
• Examples: number of tumours, shoe size, hospital admissions, number of side effects, medication dose, CD4 count, viral load, reads.

## Continuous

• Final type of data
• Anything that is measured, can take any value
• May have finite or infinite range
• Zero may be meaningful: ratios, differences
• Care required with interpretation
• Between any two observed values, another value is always possible
• Examples: Height, weight, blood pressure, temperature, operation time, blood loss, age.

## Data types

• Several different categorisations
• Simplest:
• Categorical (nominal) – yes/no
• Categorical with ordering (ordinal) – implicit order
• Discrete – only takes certain values; counts (cardinal)
• Continuous – measurements; finite/infinite range

## Measurements: Dependent / Independent?

• Measurements of gene expression taken from each of 20 individuals
• Are any measurements more closely related than others?
• Siblings/littermates?
• Same individual measured twice?
• Batch effects?
• If no reason, assume independent observations

## Continuous Data - Descriptive Statistics

• Measures of location and spread

• Mean and standard deviation

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

$$\mathrm{s.d.} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}$$
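As a rough sketch, the two formulas can be written out directly in Python and checked against the standard library's `statistics.stdev`. The function names and the data values are made up for illustration.

```python
import math
import statistics

def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation, dividing by (n - 1) as in the formula above."""
    n = len(values)
    centre = sample_mean(values)
    squared_deviations = sum((x - centre) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

print(sample_mean(data))
print(sample_sd(data), statistics.stdev(data))  # the two values should agree
```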

## Continuous Data - Descriptive Statistics

• Median: middle value
• Lower quartile: median of the bottom half of the data
• Upper quartile: median of the top half of the data
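A minimal sketch of these definitions follows. It assumes that, for an odd number of observations, the middle value is left out of both halves; other conventions exist and give slightly different quartiles. The function name and the data are hypothetical.

```python
import statistics

def quartiles_by_halves(values):
    """Return (lower quartile, median, upper quartile).

    Assumption: with an odd number of values, the middle value is
    excluded from both the bottom half and the top half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    bottom_half = ordered[:half]
    top_half = ordered[-half:]
    return (
        statistics.median(bottom_half),
        statistics.median(ordered),
        statistics.median(top_half),
    )

print(quartiles_by_halves([13, 1, 9, 5, 3, 11, 7]))
print(quartiles_by_halves([1, 2, 3, 4]))
```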

## Continuous Data - Descriptive Statistics (Example)

• e.g. Number of Facebook friends for 7 colleagues
• 311, 345, 270, 310, 243, 5300, 11
• Mean and standard deviation

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = 970$$

$$\mathrm{s.d.} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} = 1912.57$$

• Median and Interquartile range
• 11, 243, 270, 310, 311, 345, 5300

## Continuous Data - Descriptive Statistics (Example)

• e.g. Number of Facebook friends for 7 colleagues
• 311, 345, 270, 310, 243, 530, 11
• Mean and standard deviation

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \approx 289$$

$$\mathrm{s.d.} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} = 153.79$$

• Median and Interquartile range
• 11, 243, 270, 310, 311, 345, 530

## Continuous Data - Descriptive Statistics (Example)

• e.g. Number of Facebook friends for 7 colleagues
• 311, 345, 270, 310, 243, 530, 11
• Mean and standard deviation: low breakdown point (a single outlier can change them substantially)

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \approx 289$$

$$\mathrm{s.d.} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} = 153.79$$

• Median and interquartile range: robust to outliers
• 11, 243, 270, 310, 311, 345, 530 (median = 310, the same as when the largest value is 5300)
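The contrast between the two versions of the Facebook friends data (largest value 5300 or 530) can be reproduced with Python's `statistics` module. The `summarise` helper is a made-up name, and the quartiles rely on the module's default "exclusive" method, which for these 7 values matches the median-of-halves definition.

```python
import statistics

with_outlier = [311, 345, 270, 310, 243, 5300, 11]
without_outlier = [311, 345, 270, 310, 243, 530, 11]

def summarise(values):
    """Location and spread, using both pairs of summary statistics."""
    lower_q, _, upper_q = statistics.quantiles(values, n=4)  # default method
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # divides by n - 1
        "median": statistics.median(values),
        "iqr": upper_q - lower_q,
    }

summary_with = summarise(with_outlier)
summary_without = summarise(without_outlier)

for label, summary in [("5300", summary_with), ("530", summary_without)]:
    print(label, {name: round(value, 2) for name, value in summary.items()})
```

The mean and standard deviation shift substantially between the two data sets, while the median and interquartile range stay the same.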

## Categorical data

• Summarised by counts and percentages
• Examples
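As one illustration, counts and percentages for a categorical variable can be tabulated with `collections.Counter`. The smoking-status values below are invented for the sketch.

```python
from collections import Counter

# Hypothetical categorical data for 10 participants.
smoking_status = [
    "non-smoker", "smoker", "non-smoker", "non-smoker", "smoker",
    "non-smoker", "non-smoker", "smoker", "non-smoker", "non-smoker",
]

counts = Counter(smoking_status)
total = sum(counts.values())
percentages = {level: 100 * count / total for level, count in counts.items()}

for level, count in counts.most_common():
    print(f"{level}: {count} ({percentages[level]:.0f}%)")
```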

## Hypothesis tests - basic setup

• Formulate a null hypothesis, $$H_0$$
• Example: the difference in gene expression before and after treatment = 0
• Calculate a test statistic from the data under the null hypothesis
• Compare the test statistic to its theoretical distribution under $$H_0$$
• Is it more extreme than expected?
• The "p-value": the probability of a result at least this extreme if $$H_0$$ is true
• Either reject or do not reject the null hypothesis
• "Absence of evidence is not evidence of absence" (Altman and Bland, 1995)
• (Correction for multiple testing)
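The steps above can be sketched for the before/after example. To stay within the standard library, this sketch uses an exact sign-flip randomisation test rather than a t-test: under the null hypothesis of no difference, each paired difference is equally likely to be positive or negative. The expression values are invented and the variable names are assumptions.

```python
from itertools import product

# Hypothetical gene expression for 6 individuals, before and after treatment.
before = [5.1, 4.8, 6.0, 5.5, 5.2, 4.9]
after = [5.9, 5.4, 6.3, 6.4, 5.7, 5.6]
differences = [a - b for a, b in zip(after, before)]

def mean(values):
    return sum(values) / len(values)

# Test statistic: the mean paired difference.
observed = mean(differences)

# Under H0 the sign of each difference is arbitrary, so enumerate every
# possible assignment of signs and see how often the statistic is at
# least as extreme as the one observed (two-sided).
count_extreme = 0
n_patterns = 0
for signs in product([1, -1], repeat=len(differences)):
    flipped = mean([s * d for s, d in zip(signs, differences)])
    n_patterns += 1
    if abs(flipped) >= abs(observed) - 1e-12:  # tolerance for float rounding
        count_extreme += 1

p_value = count_extreme / n_patterns
print(f"observed mean difference: {observed:.2f}, p-value: {p_value}")
```

Here every individual increased, so only the all-positive and all-negative sign patterns are as extreme as the observed data, giving a p-value of 2/64.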

## Hypothesis tests - Example

• The Lady Tasting Tea - Randomised Experiment by Fisher
• Randomly ordered 8 cups of tea
• 4 were prepared by first adding milk
• 4 were prepared by first adding tea

## Hypothesis tests - Example

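• $$H_0$$: the Lady cannot tell whether milk or tea was added first
• Test statistic: number of successes in selecting the 4 milk-first cups
• Result: Lady got all 4 cups correct
• Conclusion: reject the null hypothesis (the chance of all 4 correct by guessing is 1/70 ≈ 0.014)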
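A short sketch of the calculation behind this conclusion: if the Lady is only guessing, every way of choosing 4 cups out of 8 is equally likely, so the number of correct choices follows a hypergeometric distribution. The function names are made up for illustration.

```python
from math import comb

CUPS = 8
MILK_FIRST = 4

def prob_correct(k):
    """Probability of picking exactly k of the 4 milk-first cups by guessing."""
    ways = comb(MILK_FIRST, k) * comb(CUPS - MILK_FIRST, MILK_FIRST - k)
    return ways / comb(CUPS, MILK_FIRST)

def p_value(k_observed):
    """Probability of at least k_observed correct cups under the null hypothesis."""
    return sum(prob_correct(k) for k in range(k_observed, MILK_FIRST + 1))

for k in range(MILK_FIRST + 1):
    print(f"{k} correct: probability {prob_correct(k):.4f}")

print(f"p-value for all 4 correct: {p_value(4):.4f}")
```

With only 70 possible selections, getting all 4 right by chance has probability 1/70, whereas 3 or more correct would have probability 17/70 and would not be convincing evidence.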

## Hypothesis tests - Errors

• Many factors may affect our results
• significance level, sample size, difference of interest, variability of the observations
• Be aware of issues of multiple testing
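To illustrate why multiple testing matters, this sketch computes the chance of at least one false positive across several tests when every null hypothesis is true. It assumes the tests are independent, and it shows the Bonferroni adjustment as one common correction; the slides do not name a specific method, so that choice is an assumption.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests,
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

alpha = 0.05
for n_tests in (1, 5, 20, 100):
    fwer = family_wise_error_rate(alpha, n_tests)
    threshold = bonferroni_threshold(alpha, n_tests)
    print(f"{n_tests:>3} tests: P(any false positive) = {fwer:.3f}, "
          f"Bonferroni threshold = {threshold:.4f}")
```

With 20 tests at the 5% level, the chance of at least one false positive is already about 64%.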