These materials are adapted from a course developed at Cancer Research UK Cambridge Institute by Mark Dunning, Matthew Eldridge and Thomas Carroll.

Although R is well-regarded as a tool for performing statistical analysis, this workshop will not explicitly teach stats. Instead we introduce the tools that allow you to manipulate and interrogate your data, getting it into a form on which you can run statistical tests.

The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left panel when RStudio opens for the first time). This doesn’t automatically keep track of the steps you did.

We’ll be working in an **R Notebook**. These files are a type of R Markdown
document, which allows us to **combine R code with** markdown,
**a documentation language**, providing a framework for literate
programming. In an R Notebook, R code chunks can be executed
independently and interactively, with output visible immediately beneath
the input.

Let’s try this now!

`print("Hello World")`

`[1] "Hello World"`

R can be used as a calculator to compute simple sums:

`2 + 2`

`[1] 4`

`2 - 2`

`[1] 0`

`4 * 3`

`[1] 12`

`10 / 2`

`[1] 5`

The answer is displayed at the console with a `[1]` in front of it. The
`1` inside the square brackets is the position (index) of the first value
shown on that line; in this case the answer contains only one value. We
will talk about dealing with lists of numbers shortly…

In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.

`2 + 2 * 3`

`[1] 8`

`2 + (2 * 3)`

`[1] 8`

`(2 + 2) * 3`

`[1] 12`

R is capable of more complicated arithmetic, such as trigonometry and logarithms, like you would find on a scientific calculator. Of course, R also has a plethora of statistical operations, as we will see.

`pi`

`[1] 3.141593`

`sin(pi/2)`

`[1] 1`

`cos(pi)`

`[1] -1`

`tan(2)`

`[1] -2.18504`

`log(1)`

`[1] 0`
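As a brief illustration of the logarithm functions: in base R, `log` computes the natural logarithm by default, while `log10`, `log2` and the `base` argument of `log` cover other bases. The variable names below are our own, chosen only for this sketch.

```r
# log() is the natural logarithm, so it undoes exp()
natural <- log(exp(2))

# Logarithms in other bases
base_ten <- log10(1000)
base_two <- log2(8)
custom_base <- log(81, base = 3)

# Display each result
natural
base_ten
base_two
custom_base
```

If you need a base other than e, 2 or 10, the `base` argument is probably the simplest route.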

We can only go so far with performing simple calculations like this.
Eventually we will need to store our results for later use. For this, we
need to make use of *variables*.

A variable is a letter or word which takes (or contains) a value. We
use the assignment ‘operator’, `<-`, to create a variable
and store some value in it.

```
x <- 10
x
```

`[1] 10`

```
myNumber <- 25
myNumber
```

`[1] 25`

We can also perform arithmetic on variables using functions:

`sqrt(myNumber)`

`[1] 5`

We can add variables together:

`x + myNumber`

`[1] 35`

We can change the value of an existing variable:

```
x <- 21
x
```

`[1] 21`

We can set one variable to equal the value of another variable:

```
x <- myNumber
x
```

`[1] 25`

When we are feeling lazy we might give our variables short names
(`x`, `y`, `i`… etc.), but a better
practice would be to give them meaningful names. There are some
restrictions on creating variable names. They cannot start with a number
or contain spaces or characters such as `-`. Naming variables
the same as built-in functions or values in R, such as `c`,
`T` or `mean`, should also be avoided.

Naming variables is a matter of taste. Some conventions exist, such as
separating words with `_` or using
*c*amel*C*aps. Whatever convention you decide on, stick with
it!
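To see why reusing a built-in name is best avoided, here is a small sketch using `T`, which R provides as a shorthand for `TRUE`. R lets us overwrite it without any warning, so code that relies on `T` later in the session could quietly misbehave.

```r
# T is a built-in shorthand for TRUE
isTRUE(T)

# Nothing stops us from overwriting it with our own variable...
T <- 0
isTRUE(T)   # now FALSE, which could silently break later code

# Removing our variable makes the built-in value visible again
rm(T)
isTRUE(T)
```

The three calls to `isTRUE` return `TRUE`, `FALSE` and `TRUE` in turn. Writing `TRUE` and `FALSE` in full avoids this problem altogether.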

**Functions** in R perform operations on
**arguments** (the input(s) to the function). We have
already used:

`sin(x)`

`[1] -0.1323518`

This returns the sine of x. In this case the function has one
argument: **x**. Arguments are always contained in
parentheses (curved brackets, **()**) and separated by
commas.

Arguments can be named or unnamed, but if they are unnamed they must
be ordered (we will see later how to find the right order). The names of
the arguments are determined by the author of the function and can be
found in the help page for the function. When testing code, it is easier
and safer to name the arguments. `seq` is a function for
generating a numeric sequence *from* and *to* particular
numbers. Type `?seq` to get the help page for this
function.

`seq(from = 3, to = 20, by = 4)`

`[1] 3 7 11 15 19`

`seq(3, 20, 4)`

`[1] 3 7 11 15 19`
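One consequence worth seeing in code: named arguments can be given in any order, whereas unnamed arguments are matched purely by position. The sketch below uses variable names of our own (`named`, `reordered`, `swapped`) to compare the results.

```r
# Named arguments: the order does not matter
named <- seq(from = 3, to = 20, by = 4)
reordered <- seq(by = 4, to = 20, from = 3)
identical(named, reordered)   # TRUE

# Unnamed arguments are matched by position (from, to, by),
# so swapping two of them silently gives a different sequence
swapped <- seq(4, 20, 3)
swapped
```

Here `swapped` starts at 4 and goes up in steps of 3, which is probably not what was intended; naming the arguments guards against this kind of slip.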

Arguments can have *default* values, meaning we do not need to
specify values for these in order to run the function.

`rnorm` is a function that will generate a series of
values from a *normal distribution*. In order to use the
function, we need to tell R how many values we want:

```
## this will produce a random set of numbers, so everyone will get a different set of numbers
rnorm(n=10)
```

```
[1] 0.3184546 0.4258766 -0.7887335 0.9537790 1.0488053 -0.1807940 -0.6858420 -0.7510702
[9] -0.4441776 -0.8333266
```

The normal distribution is defined by a *mean* (average) and
*standard deviation* (spread). However, in the above example we
didn’t tell R what mean and standard deviation we wanted. So how does R
know what to do? All arguments to a function and their default values
are listed in the help page.

(*N.B. sometimes help pages can describe more than one
function*)

`?rnorm`

In this case, we see that the defaults for mean and standard
deviation are 0 and 1. We can change the function to generate values
from a distribution with a different mean and standard deviation using
the `mean` and `sd` *arguments*. It is
important that we get the spelling of these arguments exactly right,
otherwise R will give an error message, or (worse?) do something
unexpected.

`rnorm(n = 10, mean = 2, sd = 3)`

```
[1] -1.848138745 4.496898840 0.800220927 6.228661609 -0.007691493 -1.158266478 1.296587362
[8] 2.076839744 -5.156370303 6.217972143
```

`rnorm(10, 2, 3)`

```
[1] 6.4283034 0.5024763 -0.7756072 2.4975459 1.4379876 -1.4794491 6.8735414 3.3915993
[9] 5.2656520 1.9844090
```
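Because `rnorm` draws random values, the numbers above will differ every time the code is run. If you want results that can be reproduced, one common approach is to call the base R function `set.seed` immediately before generating the numbers. The seed value `42` and the variable names below are arbitrary choices for this sketch.

```r
# Fix the state of the random number generator, then draw 5 values
set.seed(42)
first_draw <- rnorm(n = 5, mean = 2, sd = 3)

# Resetting the same seed reproduces exactly the same values
set.seed(42)
second_draw <- rnorm(n = 5, mean = 2, sd = 3)

identical(first_draw, second_draw)   # TRUE
```

Without the second call to `set.seed`, the two draws would almost certainly differ.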

In the examples above, `seq` and `rnorm` were
both outputting a series of numbers, which is called a *vector*
in R and is the most fundamental data type.

Just as we can save single numbers as a variable, we can also save a vector. In fact, a single number is still a vector (of length one).

`my_seq <- seq(from = 3, to = 20, by = 4)`

The arithmetic operations we have seen can be applied to these vectors exactly as they are to a single number; the operation is applied to every element.

`my_seq + 2`

`[1] 5 9 13 17 21`

`my_seq * 2`

`[1] 6 14 22 30 38`
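A short sketch of the points above, using the base R function `length` to count the values in a vector. The variable `doubled` is our own name, introduced for illustration.

```r
my_seq <- seq(from = 3, to = 20, by = 4)

# A vector knows how many values it holds
length(my_seq)   # 5

# A single number is a vector too, of length one
length(10)       # 1

# Adding two vectors of the same length works element by element
doubled <- my_seq + my_seq
doubled
```

Here `doubled` holds the same values as `my_seq * 2`, since each element is added to itself.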

- What is the value of `pi` to 3 decimal places?
  - see the help for `round`: `?round`
- How can we create a sequence from 2 to 20 comprised of 5 equally-spaced numbers?
  - i.e. not specifying the `by` argument and getting R to work out the intervals
  - check the help page for seq: `?seq`
- Create a *variable* containing 1000 random numbers with a *mean* of 2 and a *standard deviation* of 3
  - what is the maximum and minimum of these numbers?
  - what is the average?
  - HINT: see the help pages for functions `min`, `max` and `mean`

```
## Type your code to answer the exercises in here
```

If you want to re-visit your code at any point, you will need to save a copy.

**File > Save > **

So far we have used functions that are available with the
*base* distribution of R; the functions you get with a clean
install of R. The open-source nature of R encourages others to write
their own functions for their particular data-type or analyses.

Packages are distributed through *repositories*. The
most-common ones are CRAN and Bioconductor. CRAN alone has many
thousands of packages.

CRAN and Bioconductor have some level of curation, so they should be the first place to look. Researchers sometimes make their packages available on GitHub. However, there is no straightforward way of searching GitHub for a particular package and no guarantee of quality.

The **Packages** tab in the bottom-right panel of
RStudio lists all packages that you currently have installed. Clicking
on a package name will show a list of functions that are available once that
package has been loaded.

There are functions for installing packages within R. If your package
is part of the main **CRAN** repository, you can use
`install.packages`.

We will be using a set of `tidyverse` R packages in this
practical. To install them, we would run:

```
## You should already have installed these as part of the course setup
install.packages("readr")
install.packages("ggplot2")
install.packages("dplyr")
# to install the entire set of tidyverse packages, we can do install.packages("tidyverse"). But this will take some time
```

A package may have several *dependencies*; other R packages
from which it uses functions or data types (re-using code from other
packages is strongly-encouraged). If this is the case, the other R
packages will be located and installed too.

**So long as you stick with the same version of R, you won’t
need to repeat this install process.**

Once a package is installed, the `library` function is
used to load a package and make its functions / data available in your
current R session. *You need to do this every time you start a new
RStudio session*. Let’s go ahead and load the `readr` package so
we can import some data.

```
## readr is a package to import spreadsheets into R
library(readr)
```
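If you are unsure whether the packages for this practical are already installed, one possible check is sketched below. It uses the base R function `requireNamespace`, which reports whether a package can be loaded without attaching it to your session. The variable names (`pkgs`, `is_installed`, `missing_pkgs`) are our own, and the install step is left commented out so that nothing is installed by accident.

```r
# Packages used in this practical
pkgs <- c("readr", "ggplot2", "dplyr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All packages are installed")
}
```

This only reports what is missing; you would still call `library` to load each package you want to use.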

The **tidyverse**
is an ecosystem of packages that provides a consistent, intuitive
system for data manipulation and visualisation in R.