Acknowledgement

These materials are adapted from a course developed at the Cancer Research UK Cambridge Institute by Mark Dunning, Matthew Eldridge and Thomas Carroll.

Aims and objectives

Although R is well-regarded as a tool for performing statistical analysis, this workshop will not explicitly teach stats. Instead, we introduce the tools that will allow you to manipulate and interrogate your data, getting it into a form on which you can run statistical tests.

Entering commands in R

The traditional way to enter R commands is via the Terminal, or using the console in RStudio (the bottom-left panel when RStudio opens for the first time). However, this doesn’t automatically keep track of the steps you have performed.

We’ll be working in an R Notebook. These files are a type of R Markdown document, which allows us to combine R code with markdown (a documentation language), providing a framework for literate programming. In an R Notebook, R code chunks can be executed independently and interactively, with output visible immediately beneath the input.

Let’s try this now!

print("Hello World")
[1] "Hello World"

R can be used as a calculator to compute simple sums:

2 + 2
[1] 4
2 - 2
[1] 0
4 * 3
[1] 12
10 / 2
[1] 5

The answer is displayed at the console with a [1] in front of it. The number inside the square brackets is the position (index) of the first value shown on that line of output; here the answer contains only one value, so the index is 1. We will talk about dealing with lists of numbers shortly…

In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.

2 + 2 * 3
[1] 8
2 + (2 * 3)
[1] 8
(2 + 2) * 3
[1] 12
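The same ordering applies to division and powers. The lines below are a small additional sketch (not part of the original course material) showing a few more cases, including one that often surprises people: the power operator ^ is applied before a leading minus sign.

```r
2 + 6 / 3      # division first: 2 + 2
# [1] 4
2 * 3^2        # power first: 2 * 9
# [1] 18
(2 * 3)^2      # brackets first: 6^2
# [1] 36
-2^2           # power first, then the minus sign: -(2^2)
# [1] -4
(-2)^2         # brackets make the minus sign part of the number
# [1] 4
```

If you are ever unsure of the order, adding brackets makes the intention explicit and costs nothing.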

R is capable of more complicated arithmetic such as trigonometry and logarithms, like the functions you would find on a scientific calculator. Of course, R also has a plethora of statistical operations, as we will see.

pi
[1] 3.141593
sin(pi/2)
[1] 1
cos(pi)
[1] -1
tan(2)
[1] -2.18504
log(1)
[1] 0
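Note that log gives the natural logarithm (base e). As a brief additional sketch, base R also provides log10 and log2 for other bases, exp for the inverse of the natural logarithm, and sqrt for square roots.

```r
log10(1000)    # logarithm to base 10
# [1] 3
log2(8)        # logarithm to base 2
# [1] 3
exp(0)         # e raised to the power 0
# [1] 1
log(exp(2))    # log and exp undo each other
# [1] 2
sqrt(16)       # square root
# [1] 4
```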

We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of variables.

Variables

A variable is a letter or word which takes (or contains) a value. We use the assignment operator, <-, to create a variable and store some value in it.

x <- 10
x
[1] 10
myNumber <- 25
myNumber
[1] 25

We can also perform arithmetic on variables using functions:

sqrt(myNumber)
[1] 5

We can add variables together:

x + myNumber
[1] 35

We can change the value of an existing variable:

x <- 21
x
[1] 21

We can set one variable to equal the value of another variable:

x <- myNumber
x
[1] 25
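It is worth knowing that this copies the value at the moment of assignment; the two variables are not linked afterwards. A minimal sketch, re-using the variable names from above:

```r
myNumber <- 25
x <- myNumber      # x now holds the value 25

myNumber <- 100    # changing myNumber afterwards...
x                  # ...does not change x
# [1] 25
myNumber
# [1] 100
```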

When we are feeling lazy we might give our variables short names (x, y, i, etc.), but it is better practice to give them meaningful names. There are some restrictions on variable names: they cannot start with a number or contain spaces or characters such as ‘-’ (although . and _ are allowed). Names that R already uses for in-built functions and values, such as c, T and mean, should also be avoided.

Naming variables is a matter of taste. Some conventions exist, such as separating words with _ or using camelCaps. Whatever convention you decide on, stick with it!
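The sketch below illustrates a few names that R accepts and a couple that it rejects. The variable names themselves (tumour_size and so on) are made up for illustration only.

```r
# Names that R accepts
tumour_size <- 12.5     # words separated by an underscore
tumourSize  <- 12.5     # camelCaps
sample2     <- "B"      # a number is fine after the first character

# Names that R rejects (commented out so that this chunk still runs):
# 2sample <- "B"        # starts with a number
# tumour-size <- 12.5   # R reads this as tumour minus size

tumour_size == tumourSize
# [1] TRUE
```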

Functions

Functions in R perform operations on arguments (the input(s) to the function). We have already used:

sin(x)
[1] -0.1323518

This returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses, i.e. curved brackets (), and are separated by commas.

Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments. seq is a function for generating a numeric sequence from and to particular numbers. Type ?seq to get the help page for this function.

seq(from = 3, to = 20, by = 4)
[1]  3  7 11 15 19
seq(3, 20, 4)
[1]  3  7 11 15 19
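To see why naming arguments is the safer option, here is a short additional sketch. Named arguments can be supplied in any order, whereas unnamed arguments are matched purely by position, so putting them in the wrong order silently changes the result.

```r
# Named arguments can be given in any order
seq(by = 4, to = 20, from = 3)
# [1]  3  7 11 15 19

# Unnamed arguments are matched by position:
# here R takes from = 4, to = 20, by = 3
seq(4, 20, 3)
# [1]  4  7 10 13 16 19
```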

Arguments can have default values, meaning we do not need to specify values for these in order to run the function.

rnorm is a function that will generate a series of values from a normal distribution. In order to use the function, we need to tell R how many values we want:

## this will produce a random set of numbers, so everyone will get a different set of numbers
rnorm(n=10)
 [1]  0.3184546  0.4258766 -0.7887335  0.9537790  1.0488053 -0.1807940 -0.6858420 -0.7510702
 [9] -0.4441776 -0.8333266
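If you ever need to get the same "random" numbers again (for example, to compare results with a colleague), base R provides the set.seed function. This is an addition to the original material, and the seed value 42 used here is arbitrary.

```r
set.seed(42)            # fix the starting point of the random number generator
first <- rnorm(n = 5)

set.seed(42)            # reset to the same starting point
second <- rnorm(n = 5)

# The two sets of numbers are exactly the same
identical(first, second)
# [1] TRUE
```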

The normal distribution is defined by a mean (average) and standard deviation (spread). However, in the above example we didn’t tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page.

(N.B. sometimes a help page describes more than one function.)

?rnorm

In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the mean and sd arguments. It is important that we get the spelling of these arguments exactly right, otherwise R will give an error message, or (worse?) do something unexpected.

rnorm(n=10, mean=2,sd=3)
 [1] -1.848138745  4.496898840  0.800220927  6.228661609 -0.007691493 -1.158266478  1.296587362
 [8]  2.076839744 -5.156370303  6.217972143
rnorm(10, 2, 3)
 [1]  6.4283034  0.5024763 -0.7756072  2.4975459  1.4379876 -1.4794491  6.8735414  3.3915993
 [9]  5.2656520  1.9844090
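As an illustration of what happens when an argument name is misspelt, the sketch below deliberately types men instead of mean. The call is wrapped in tryCatch, a base R function not otherwise covered in this material, purely so that the chunk keeps running and we can look at the error message; in normal use you would simply see the error at the console.

```r
# 'men' is a deliberate misspelling of 'mean'
result <- tryCatch(
  rnorm(n = 10, men = 2, sd = 3),
  error = function(e) conditionMessage(e)   # keep the text of the error
)

# result now holds an "unused argument" message rather than random numbers
result
```

rnorm rejects the unknown argument outright. Other functions may be more forgiving and quietly ignore it, which is the "something unexpected" to watch out for.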

In the examples above, seq and rnorm both output a series of numbers. In R this is called a vector, and it is the most fundamental data type.

Just as we can save single numbers as a variable, we can also save a vector. In fact a single number is still a vector.

my_seq <- seq(from = 3, to = 20, by = 4)

The arithmetic operations we have seen can be applied to these vectors in exactly the same way as to a single number.

my_seq + 2
[1]  5  9 13 17 21
my_seq * 2
[1]  6 14 22 30 38
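As a further sketch (an addition to the original material), the length function reports how many values a vector holds, which confirms that a single number is simply a vector of length one. Arithmetic between two vectors of the same length is carried out element by element.

```r
my_seq <- seq(from = 3, to = 20, by = 4)

length(my_seq)     # number of values in the vector
# [1] 5
length(10)         # a single number is a vector of length 1
# [1] 1

# Each value is added to the value in the same position
my_seq + my_seq
# [1]  6 14 22 30 38
```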



Exercise

  • What is the value of pi to 3 decimal places?
    • see the help for round ?round
  • How can we create a sequence from 2 to 20 comprising 5 equally-spaced numbers?
    • i.e. not specifying the by argument and getting R to work out the intervals
    • check the help page for seq ?seq
  • Create a variable containing 1000 random numbers with a mean of 2 and a standard deviation of 3
    • what is the maximum and minimum of these numbers?
    • what is the average?
    • HINT: see the help pages for functions min, max and mean
## Type your code to answer the exercises in here



Saving your notebook

If you want to re-visit your code at any point, you will need to save a copy.

File > Save >

Packages in R

So far we have used functions that are available with the base distribution of R; the functions you get with a clean install of R. The open-source nature of R encourages others to write their own functions for their particular data-type or analyses.

Packages are distributed through repositories. The most common ones are CRAN and Bioconductor. CRAN alone has many thousands of packages.

  • The meta cran website can be used to browse packages available in CRAN
  • Bioconductor packages can be browsed here

CRAN and Bioconductor have some level of curation, so they should be the first place to look. Researchers sometimes make their packages available on GitHub. However, there is no straightforward way of searching GitHub for a particular package and no guarantee of quality.

The Packages tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that are available once that package has been loaded.

There are functions for installing packages within R. If your package is part of the main CRAN repository, you can use install.packages.

We will be using a set of tidyverse R packages in this practical. To install them, we would run:

## You should already have installed these as part of the course setup

install.packages("readr")
install.packages("ggplot2")
install.packages("dplyr")
# to install the entire set of tidyverse packages, we can do install.packages("tidyverse"). But this will take some time
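If you are not sure whether these packages are already installed, one possible check is sketched below. It is an addition to the original material: it uses the base R functions requireNamespace and vapply, and the variable names pkgs, installed and missing_pkgs are made up for illustration. The install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Package names taken from the install commands above
pkgs <- c("readr", "ggplot2", "dplyr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of any packages that still need installing
missing_pkgs <- pkgs[!installed]
missing_pkgs

# install.packages(missing_pkgs)   # uncomment to install only what is missing
```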

A package may have several dependencies: other R packages from which it uses functions or data types (re-using code from other packages is strongly encouraged). If this is the case, the other R packages will be located and installed too.

So long as you stick with the same version of R, you won’t need to repeat this install process.

Once a package is installed, the library function is used to load it and make its functions / data available in your current R session. You need to do this every time you start a new RStudio session. Let’s go ahead and load the readr package so we can import some data.

## readr is a package for importing spreadsheets into R
library(readr)
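To see what library actually does, the sketch below uses the tools package as a stand-in, since it ships with every R installation but is not loaded when a session starts. This example is an addition to the original material and the file names are hypothetical. It also shows the package::function notation, which calls a single function without loading the whole package.

```r
# 'tools' ships with R but is not attached when a session starts
library(tools)

# After library(), the package appears on the search path...
"package:tools" %in% search()
# [1] TRUE

# ...and its functions can be called by name
file_ext("measurements.csv")
# [1] "csv"

# A single function can also be called using package::function
tools::file_ext("measurements.tsv")
# [1] "tsv"
```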

Dealing with data

The tidyverse is an eco-system of packages that provides a consistent, intuitive system for data manipulation and visualisation in R.