These materials are adapted from a course developed at Cancer Research UK Cambridge Institute by Mark Dunning, Matthew Eldridge and Thomas Carroll.
Although R is well-regarded as a tool for performing statistical analysis, this workshop will not explicitly teach stats. Instead, we introduce the tools that allow you to manipulate and interrogate your data, and to get it into a form on which you can run statistical tests.
The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left panel when RStudio opens for the first time). This doesn’t automatically keep track of the steps you have run.
We’ll be working in an R Notebook. These files are a type of R Markdown document, which allows us to combine R code with markdown, a documentation language, providing a framework for literate programming. In an R Notebook, R code chunks can be executed independently and interactively, with output visible immediately beneath the input.
Let’s try this now!
print("Hello World")
[1] "Hello World"
R can be used as a calculator to compute simple sums
2 + 2
[1] 4
2 - 2
[1] 0
4 * 3
[1] 12
10 / 2
[1] 5
The answer is displayed at the console with a [1] in front of it. The 1 inside the square brackets is the position of the first value shown on that line of output; here the answer contains only one value, so the line starts at position 1. We will talk about dealing with lists of numbers shortly…
In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.
2 + 2 * 3
[1] 8
2 + (2 * 3)
[1] 8
(2 + 2) * 3
[1] 12
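The same ordering rules extend to exponents and to operators of equal precedence. A minimal sketch, with numbers chosen only for illustration, showing how brackets change the result:

```r
# Exponents (^) are evaluated before multiplication, which is evaluated before addition
2 + 3 * 2^2        # evaluated as 2 + (3 * (2^2))

# Brackets override the default order
(2 + 3) * 2^2
((2 + 3) * 2)^2

# Multiplication and division have equal precedence and are evaluated left to right
12 / 3 * 2         # evaluated as (12 / 3) * 2
```

If in doubt about the order, adding brackets costs nothing and makes the intention explicit.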
R is capable of more complicated arithmetic, such as trigonometry and logarithms, like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations, as we will see.
pi
[1] 3.141593
sin(pi/2)
[1] 1
cos(pi)
[1] -1
tan(2)
[1] -2.18504
log(1)
[1] 0
We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of variables.
A variable is a letter or word which takes (or contains) a value. We use the assignment operator, <-, to create a variable and store some value in it.
x <- 10
x
[1] 10
myNumber <- 25
myNumber
[1] 25
We can also perform arithmetic on variables using functions:
sqrt(myNumber)
[1] 5
We can add variables together:
x + myNumber
[1] 35
We can change the value of an existing variable:
x <- 21
x
[1] 21
We can set one variable to equal the value of another variable:
x <- myNumber
x
[1] 25
When we are feeling lazy we might give our variables short names (x, y, i, etc.), but a better practice would be to give them meaningful names. There are some restrictions on creating variable names. They cannot start with a number or contain characters such as ‘-’ or spaces. Naming variables the same as in-built functions and values in R, such as c, T and mean, should also be avoided.
Naming variables is a matter of taste. Some conventions exist, such as separating words with _ (as in my_number) or using camelCaps (as in myNumber). Whatever convention you decide on, stick with it!
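As a small illustration of the naming rules, the sketch below creates the same value under three valid naming styles and then uses the base R function make.names to show how R would repair a name that breaks the rules. The variable names are made up for this example.

```r
# Valid names are built from letters, digits, . and _ and do not start with a digit
sample_count <- 12   # words separated by an underscore
sampleCount  <- 12   # camelCaps
sample.count <- 12   # words separated by a dot

# A valid name is returned unchanged by make.names()
make.names("sample_count")

# A name starting with a digit and containing '-' is not valid;
# make.names() returns a repaired version that R would accept
make.names("2nd-sample")
```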
Functions in R perform operations on arguments (the input(s) to the function). We have already used:
sin(x)
[1] -0.1323518
This returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses (curved brackets) and separated by commas.
Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments. seq is a function for generating a numeric sequence from and to particular numbers. Type ?seq to get the help page for this function.
seq(from = 3, to = 20, by = 4)
[1] 3 7 11 15 19
seq(3, 20, 4)
[1] 3 7 11 15 19
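To make the difference between named and unnamed arguments concrete, here is a minimal sketch using the same seq call. The variable names a and b are only for illustration.

```r
# Named arguments can be supplied in any order and give the same result
a <- seq(from = 3, to = 20, by = 4)
b <- seq(by = 4, to = 20, from = 3)
identical(a, b)

# Unnamed arguments are matched by position, so changing the order changes the meaning
seq(3, 20, 4)   # from = 3, to = 20, by = 4
seq(4, 20, 3)   # from = 4, to = 20, by = 3
```

This is why naming the arguments is the safer habit: a call with named arguments does not depend on remembering their order.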
Arguments can have default values, meaning we do not need to specify values for these in order to run the function.
rnorm is a function that will generate a series of values from a normal distribution. In order to use the function, we need to tell R how many values we want:
## this will produce a random set of numbers, so everyone will get a different set of numbers
rnorm(n=10)
[1] 0.3184546 0.4258766 -0.7887335 0.9537790 1.0488053 -0.1807940 -0.6858420 -0.7510702
[9] -0.4441776 -0.8333266
The normal distribution is defined by a mean (average) and standard deviation (spread). However, in the above example we didn’t tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page (N.B. sometimes a help page describes more than one function).
?rnorm
In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the mean and sd arguments. It is important that we get the spelling of these arguments exactly right, otherwise R will give an error message, or (worse?) do something unexpected.
rnorm(n = 10, mean = 2, sd = 3)
[1] -1.848138745 4.496898840 0.800220927 6.228661609 -0.007691493 -1.158266478 1.296587362
[8] 2.076839744 -5.156370303 6.217972143
rnorm(10, 2, 3)
[1] 6.4283034 0.5024763 -0.7756072 2.4975459 1.4379876 -1.4794491 6.8735414 3.3915993
[9] 5.2656520 1.9844090
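Because rnorm draws random values, the two calls above give different numbers even though their arguments are the same. If you need the same numbers each time, for example to share a result with a colleague, one option is the base R function set.seed. The sketch below is illustrative only; the seed value 42 and the variable names are arbitrary choices.

```r
# Fix the state of the random number generator before drawing values
set.seed(42)
first_draw <- rnorm(n = 5, mean = 2, sd = 3)

# Resetting the same seed reproduces exactly the same values
set.seed(42)
second_draw <- rnorm(n = 5, mean = 2, sd = 3)

identical(first_draw, second_draw)
```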
In the examples above, seq and rnorm were both outputting a series of numbers, which is called a vector in R and is the most fundamental data type.
Just as we can save single numbers as a variable, we can also save a vector. In fact a single number is still a vector.
my_seq <- seq(from = 3, to = 20, by = 4)
The arithmetic operations we have seen can be applied to these vectors, exactly as they are to a single number. The operation is applied to each value in turn.
my_seq + 2
[1] 5 9 13 17 21
my_seq * 2
[1] 6 14 22 30 38
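A few more operations on the same vector, as a minimal sketch. It uses the base R function length, which has not been introduced above, to show that a single number really is a vector of length one.

```r
my_seq <- seq(from = 3, to = 20, by = 4)

# How many values does the vector contain?
length(my_seq)

# A single number is a vector of length 1
length(10)

# Two vectors of the same length are combined element by element
my_seq + my_seq

# Functions such as sqrt are also applied to every value
sqrt(my_seq)
```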
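What is the value of pi to 3 decimal places? See the round function (?round).
How can a sequence be created without specifying the by argument, getting R to work out the intervals? See ?seq.
See also the functions min, max and mean.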
## Type your code to answer the exercises in here
If you want to re-visit your code at any point, you will need to save a copy.
File > Save >
So far we have used functions that are available with the base distribution of R; the functions you get with a clean install of R. The open-source nature of R encourages others to write their own functions for their particular data-type or analyses.
Packages are distributed through repositories. The most common ones are CRAN and Bioconductor. CRAN alone has many thousands of packages.
CRAN and Bioconductor have some level of curation so should be the first place to look. Researchers sometimes make their packages available on GitHub. However, there is no straightforward way of searching GitHub for a particular package and no guarantee of quality.
The Packages tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that are available once that package has been loaded.
There are functions for installing packages within R. If your package is part of the main CRAN repository, you can use install.packages.
We will be using a set of tidyverse R packages in this practical. To install them, we would run:
## You should already have installed these as part of the course setup
install.packages("readr")
install.packages("ggplot2")
install.packages("dplyr")
# to install the entire set of tidyverse packages, we can do install.packages("tidyverse"). But this will take some time
A package may have several dependencies: other R packages from which it uses functions or data types (re-using code from other packages is strongly encouraged). If this is the case, the other R packages will be located and installed too.
So long as you stick with the same version of R, you won’t need to repeat this install process.
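If you are not sure whether the packages are already installed, one way to check before installing is sketched below. It assumes the three package names used in this practical and uses the base R function requireNamespace; the variable names are made up for this example, and the install line is left commented out so that nothing is downloaded unintentionally.

```r
# Packages used in this practical
needed <- c("readr", "ggplot2", "dplyr")

# requireNamespace() returns TRUE if a package is installed and can be loaded
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
missing_pkgs <- needed[!is_installed]
print(missing_pkgs)

# Uncomment to install only the missing packages
# install.packages(missing_pkgs)
```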
Once a package is installed, the library function is used to load a package and make its functions / data available in your current R session. You need to do this every time you start a new RStudio session. Let’s go ahead and load the readr package so we can import some data.
## readr is a package for importing spreadsheets into R
library(readr)
The tidyverse is an ecosystem of packages that provides a consistent, intuitive system for data manipulation and visualisation in R.