1 What/Why?

R is a language and environment for statistical computing and graphics. In more practical terms:

You’re already used to programming since spreadsheets are code:

=SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,MATCH($AL$7&$AL$6,$data.$A$1:$AMJ$1,0)-29,1,2))/1000+SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,RIGHT(V$53,2)*8-7,1,2))/1000-SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,MATCH($AL$7&$AL$6,$data.$A$1:$AMJ$1,0)-37,1,2))/1000

This is a formula from an actual spreadsheet in use by a company. This code is a nightmare to maintain, and it’s very difficult to check whether there are any errors in it. Spreadsheets create a soup of data storage, processing and presentation (i.e. numbers, formulas and graphs). This hinders transparency, and as spreadsheets get more complex it becomes very difficult to trace how the data analysis is actually done. With R, it’s possible to use code to read from data files and then automatically generate reports in the form of Word documents, PDFs, or HTML pages. Separating your data, processing and presentation of results into separate components will help you to stay organized.

2 First Steps:

2.1 Setting up R and RStudio

You need to have both R and RStudio set up. You can use R without RStudio, but if you do, the interface will look like the one in the left-hand image below, which can be a bit intimidating. RStudio runs on top of R and provides a more visual interface, in addition to greatly assisting you with various aspects of programming.

[Images: the plain R interface (left) vs. R + RStudio (right)]

2.1.1 University Computer

On the university computers you should be able to find RStudio in the Start Menu by looking for Mathematics & Statistics -> R. If there are several versions, just pick the one with the highest version number. You should not have to install R or RStudio on the university computers.

2.1.2 Personal Computer

On your own computer you will have to install both R and RStudio.

2.1.3 Open RStudio

Once you open up RStudio, you’ll see that it has four panels, which are:

Top Left - Code - here you can open and work on different script files

Top Right - Environment/History

  • This shows all the variables and functions that are currently loaded in the workspace (the “Environment”)
  • You can click on the variables to see their values, which can be useful to inspect how the code is operating.
  • The History shows all of the commands that you have run.

Bottom Left - Console

  • R commands can be run here
  • You can also find documentation for a command by typing ?commandName. For example, if you type ?sum you’ll see the documentation for the sum function.

Bottom Right - Files/Plots/Packages/Help

  • Files - This shows everything in your current working directory
  • Plots - Once you start plotting, multiple plots will be stored here. There are arrows in this view that allow you to navigate between multiple plots
  • Packages - Shows all the packages installed and currently loaded
  • Help - Shows documentation for various functions.

You can run lines of code by highlighting them and then clicking on “Run” above. You can run all the code at once by doing Code -> Run Region -> Run All.

2.1.4 Install Required Packages

Once R and RStudio are installed, open up RStudio and install the necessary packages for R. Note that with R, the words “package” and “library” are often used interchangeably. Today we’ll need to install the ggplot2 library to do the plotting.

2.1.4.1 Installation Steps

  • In the bottom right quadrant of RStudio, locate the Packages tab, and click on Install:
  • Then type in the name of the package ggplot2. You should see it auto-complete as you type:
  • Click on Install and make sure that Install dependencies is checked:
  • You should then see statements like this in the console on the bottom left quadrant:
  • As you can see in the console, you can also install packages just by typing:
install.packages("ggplot2")
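
If you rerun a script often, you may not want to reinstall the package every time. The sketch below is one possible approach rather than a required step: it only calls install.packages for packages that requireNamespace cannot find. The variable names needed, is_installed and missing_pkgs are made up for this illustration.

```r
# Packages this tutorial needs (ggplot2 is the only one used today).
needed <- c("ggplot2")

# requireNamespace() returns FALSE (instead of stopping with an error)
# when a package is not installed.
is_installed <- sapply(needed, requireNamespace, quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Only install what is actually missing.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```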

3 How to read this tutorial

Everything shown in the large gray boxes below is code that you can run by copy/pasting into the Console tab in RStudio. For example:

print("hello world")
## [1] "hello world"

You can also collect these statements in a new R file (File -> New File -> R Script) which you can then run.

If you do create a new R file, you can run the code by selecting one of the options under the Run button:

Gray bits of text like this usually refer to individual R commands, package names, or the exact name of things that you will see in the user interface.

3.1 Always always always first specify options(stringsAsFactors = FALSE)

Make sure to always run this command when you start using R:

options(stringsAsFactors = FALSE)

We’ll cover what this is in a later practical, but for now it’s important to specify the stringsAsFactors option whenever running code, as you may get confusing results without it. In short, versions of R before 4.0.0 assume by default that text columns in your data are factors (categorical variables), which isn’t necessarily the case. From R 4.0.0 onwards FALSE is already the default, so running the command there does no harm.
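
To see what the option changes, the small sketch below builds the same one-column data frame twice. Here stringsAsFactors is passed directly to data.frame instead of being set through options, so the result is the same in any version of R. The names with_factors and without_factors are just for illustration.

```r
# Text column converted to a factor (categorical variable)
with_factors <- data.frame(z = c("a", "b", "c"), stringsAsFactors = TRUE)

# Text column kept as plain text
without_factors <- data.frame(z = c("a", "b", "c"), stringsAsFactors = FALSE)

class(with_factors$z)      # "factor"
class(without_factors$z)   # "character"
```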

3.2 Assigning values to variables

We can assign the value of 1 to the variable a by doing:

a = 1

And then can make sure that the value has been assigned:

a
## [1] 1

Instead of using =, we can also use <- which does the same thing. You will often see people use both forms.

a <- 1

Although you assigned a number to the variable a, you can also assign it other types of data, such as text:

a = "this is some text"
a
## [1] "this is some text"
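
If you are ever unsure what kind of value a variable currently holds, the base R function class will tell you. A minimal sketch (the variable names first_type and second_type are only for illustration):

```r
a <- 1
first_type <- class(a)
first_type    # "numeric"

# Re-using the same name replaces the old value, and the type changes with it
a <- "this is some text"
second_type <- class(a)
second_type   # "character"
```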

3.3 Vectors

A vector is an ordered collection of values of the same type. You can create one by combining values with the c() function:

a  = c(3, 4, 5, 6, 7)
a
## [1] 3 4 5 6 7

Since this is a sequence from 3 to 7, we can also write it like:

a  = c(3:7)
a
## [1] 3 4 5 6 7
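
Once you have a vector, you can pick out individual elements and apply operations to all elements at once. The short sketch below uses only base R; the variable names doubled and large are made up for this example.

```r
a <- c(3, 4, 5, 6, 7)

a[2]          # second element: 4 (R starts counting at 1)
length(a)     # number of elements: 5
sum(a)        # 25

# Arithmetic is applied to every element
doubled <- a * 2
doubled       # 6 8 10 12 14

# A condition inside [] keeps only the elements for which it is TRUE
large <- a[a > 4]
large         # 5 6 7
```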

3.4 Matrices

Initialize a matrix of zeros with three columns and two rows. Then set the value at row 1 and column 2 to 3:

b = matrix(0, ncol=3, nrow=2)
b[1,2] = 3
b
##      [,1] [,2] [,3]
## [1,]    0    3    0
## [2,]    0    0    0
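
The same [row, column] notation can also be used to read values back out. Leaving the row or the column empty selects everything along that dimension. A small sketch using the same matrix:

```r
b <- matrix(0, ncol = 3, nrow = 2)
b[1, 2] <- 3

dim(b)     # number of rows and columns: 2 3
b[1, 2]    # single value at row 1, column 2: 3
b[1, ]     # the whole first row: 0 3 0
b[, 2]     # the whole second column: 3 0
```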

You can append a matrix to the end of another matrix using rbind, which binds the rows together.

b = matrix(0, ncol=3, nrow=2)
c = matrix(1, ncol=3, nrow=2)
d = rbind(b,c)
d
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    1    1    1
## [4,]    1    1    1

You can also bind the columns together using cbind:

cbind(b,c)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    1    1    1
## [2,]    0    0    0    1    1    1

R also has functions that run operations such as mean and sum on matrix rows and columns:

rowMeans(d)
## [1] 0 0 1 1
rowSums(d)
## [1] 0 0 3 3
colMeans(d)
## [1] 0.5 0.5 0.5
colSums(d)
## [1] 2 2 2
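
For operations that don’t have a ready-made row or column version, base R provides the more general apply function. Its second argument says whether to work over rows (1) or columns (2). The sketch below rebuilds d so that it can be run on its own; row_means and col_max are illustrative names.

```r
# Same matrix as above: two rows of zeros on top of two rows of ones
d <- rbind(matrix(0, ncol = 3, nrow = 2),
           matrix(1, ncol = 3, nrow = 2))

# Equivalent to rowMeans(d)
row_means <- apply(d, 1, mean)
row_means   # 0 0 1 1

# Largest value in each column
col_max <- apply(d, 2, max)
col_max     # 1 1 1
```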

3.5 Data Frames

Data frames are conceptually similar to matrices, although you can mix in different types of data per column. In the example below, the first two columns are numbers, while the last column is text.

a = data.frame(x = c(1:3),
               y = c(4:6),
               z = c("a", "b", "c"))
a
##   x y z
## 1 1 4 a
## 2 2 5 b
## 3 3 6 c

Data frames are similar to matrices in that you can access elements based on their row and column indices. For example, to get the element in the 2nd row and 3rd column:

a[2,3]
## [1] "b"

We can also get just the 2nd row:

a[2,]
##   x y z
## 2 2 5 b

Or just the 3rd column:

a[,3]
## [1] "a" "b" "c"

One of the nice things about data frames is that you can use the names of the columns (combined with the $ sign) to directly access the values in that column. So if we want to see the values of only the z column, we can use a$z:

a$z
## [1] "a" "b" "c"

You can also add a new column to an existing data frame:

a$t = c(10, 13, 17)
a
##   x y z  t
## 1 1 4 a 10
## 2 2 5 b 13
## 3 3 6 c 17
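
Column names can also be combined with row selection, which is a common way to filter a data frame. In the sketch below, the condition before the comma picks the rows, and the optional part after the comma picks the columns. The data frame is rebuilt here so the example runs on its own; big_t and big_t_names are made-up names.

```r
a <- data.frame(x = 1:3,
                y = 4:6,
                z = c("a", "b", "c"),
                stringsAsFactors = FALSE)
a$t <- c(10, 13, 17)

# Keep only the rows where t is larger than 12 (rows 2 and 3)
big_t <- a[a$t > 12, ]
big_t

# Same rows, but only the z column
big_t_names <- a[a$t > 12, "z"]
big_t_names   # "b" "c"
```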

We’ll now look at the mtcars data set that is included with R. If you type ?mtcars in the console, you’ll see its documentation. The first few rows of the mtcars data frame show data in several columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The head command just shows us the top few rows, and you can also use the tail command to look at the bottom rows.

For specific columns, we can use the mean and sd functions to find the average and standard deviation.

mean(mtcars$mpg)
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948
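
Often you want these statistics per group rather than for the whole column. One way to do this in base R, shown here as a sketch and not the only option, is the aggregate function with a formula: mpg ~ cyl reads as “mpg grouped by cyl”. The name mpg_by_cyl is made up for this example.

```r
# Quick overview of a single column (minimum, quartiles, mean, maximum)
summary(mtcars$mpg)

# Average fuel economy for each number of cylinders
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl
```

The result is itself a data frame with one row per value of cyl, so you can use it like any other data frame.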

4 Creating Plots

We first need to load the library that we’ll be using for the rest of this tutorial:

library(ggplot2)

4.1 ggplot2

Many visualizations in R are created using the ggplot2 library. What’s interesting about this library is the way in which it allows you to construct visualizations. The gg in ggplot2 stands for the Grammar of Graphics. The idea is that when you create plots, you are basically writing sentences of the form:

Here's my data frame + Here are the x and y columns + Apply this kind of plot to that data + These are the axis labels + here are some more additional transformations

The syntax may look strange at first, but it’s a very modular approach, and you can create very complex visualizations just by adding new parts to these sentences.
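
As a rough sketch of how such a sentence maps onto code, each part below is joined to the next with a + sign. The axis label text and the choice of theme_minimal() are only illustrative; any other labels or theme would work the same way.

```r
library(ggplot2)

p <- ggplot(mtcars,                        # here's my data frame
            aes(x = hp, y = mpg)) +        # here are the x and y columns
  geom_point() +                           # apply this kind of plot
  labs(x = "Horsepower",
       y = "Miles per gallon") +           # these are the axis labels
  theme_minimal()                          # an additional transformation

# Typing the name of the plot draws it in the Plots tab
p
```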

4.2 Scatter plot

To understand this, we can first do a simple scatter plot. You’ll notice with the syntax that we first start with the mtcars data frame, then we specify which columns are to be associated with the x and y values, and then we specify that we want to plot the data as points by adding + geom_point().

ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()

In the following examples, you may see the code split over multiple lines. The two statements below are equivalent, but spreading a command over multiple lines can make it more readable by separating the code into its functional pieces.

ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()

ggplot(mtcars, 
       aes(x=hp, y=mpg)) + 
  geom_point()

We can also use values from other columns to modify particular attributes of the points. For example, we can set the color of the points to indicate the number of cylinders:

ggplot(mtcars, aes(x=hp, y=mpg, colour=cyl)) + geom_point()

We can set the size of the points based on the weight of the car:

ggplot(mtcars, aes(x=hp, y=mpg, colour=cyl, size=wt)) + geom_point()
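
Because cyl is stored as a number, ggplot2 colors the points along a continuous gradient. If you would rather have one distinct color per number of cylinders, one option is to wrap the column in factor() so that it is treated as a category. The legend title set with labs is an illustrative choice.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(colour = "Cylinders")   # nicer legend title than "factor(cyl)"
p
```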

4.3 Boxplot

For this example, we need to specify x=factor(cyl) to make sure that there is a boxplot drawn for each unique value of cyl.

ggplot(mtcars, aes(x=factor(cyl), y=mpg)) + geom_boxplot()

4.4 Histogram

Histogram of the number of cars with a particular fuel economy value:

ggplot(mtcars, aes(x=mpg)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also change the bin width:

ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth=5)
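
Instead of the width of each bin, you can also specify how many bins you want using the bins argument of geom_histogram. The value 10 below is an arbitrary choice for illustration.

```r
library(ggplot2)

# Divide the range of mpg values into 10 bins
p <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p
```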

4.5 Bar charts

Count the number of cars for each number of cylinders:

ggplot(mtcars, 
       aes(x=cyl)) + 
  geom_bar()
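
To check the heights of the bars, you can do the same count yourself with the base R table function. The name cyl_counts is made up for this sketch.

```r
# Number of rows in mtcars for each value of cyl
cyl_counts <- table(mtcars$cyl)
cyl_counts
# 4 cylinders: 11 cars, 6 cylinders: 7 cars, 8 cylinders: 14 cars
```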

4.6 Pie chart

Pie charts can be created as well, although they require a few more steps. Part of the reason for this is that many data visualization experts discourage their use since other types of visualizations can communicate the information more effectively.

The general strategy to create a pie chart using ggplot2 is to first create a bar chart and then to use polar coordinates to turn the bars into a circle.

In other words, we start with this:

ggplot(mtcars, 
       aes(x=factor(1), fill=factor(cyl))) + 
  geom_bar(width=1)

To explain what’s going on:

  • x=factor(1) - This places the bars at the same location, which allows them to be stacked
  • fill=factor(cyl) - The fill color for the bars is based on the value of cyl
  • geom_bar(width=1) - This is needed so that there isn’t a hole in the plot when we use the code in the step below.
  • Note: the height of the bars is counting the number of cars (i.e. number of rows in the mtcars data frame) with a specific value for cyl. In other words, the size of the pie slices is not based on actual numeric values in mtcars.

We then turn this into a pie chart by adding + coord_polar(theta="y"):

# this does a count
ggplot(mtcars, 
       aes(x=factor(1), fill=factor(cyl))) + 
  geom_bar(width=1) + coord_polar(theta="y")

If we want to create a pie chart where the size of the slices corresponds to actual values in the data and not just to counts of things with the same values, we need to take a slightly different approach.

Here we have a data frame listing types of animals and values associated with them:

animals = data.frame(animal_type = c("chickens", "cows", "pigs"),
                     farm_count = c(20, 10, 5))
animals
##   animal_type farm_count
## 1    chickens         20
## 2        cows         10
## 3        pigs          5

We then add y=farm_count and stat="identity" to make sure that this is plotted correctly. We also use theme_void() to remove the axis labels that we saw in the previous plot.

Using stat="identity" means that the size of the pie slices is based on the values contained in the data, and not on the count of things with the same values.

ggplot(animals, 
       aes(x=factor(1), y=farm_count, fill=factor(animal_type))) + 
  geom_bar(width=1, stat="identity") + 
  coord_polar(theta="y") + 
  theme_void()
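
Since pie charts are often discouraged, here is a sketch of the same data as an ordinary bar chart. It uses geom_col, which ggplot2 provides as a shorthand for geom_bar(stat="identity"). The share column is an extra added here for illustration, in case you want to report proportions as well.

```r
library(ggplot2)

animals <- data.frame(animal_type = c("chickens", "cows", "pigs"),
                      farm_count = c(20, 10, 5))

# Proportion of the total for each animal type
animals$share <- animals$farm_count / sum(animals$farm_count)

# Bar heights are taken directly from the farm_count column
p <- ggplot(animals, aes(x = animal_type, y = farm_count)) +
  geom_col()
p
```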

5 Reading CSV Files

The following example is based on CO2 emissions data from the UNFCCC, specifically the “CO2 excluding LULUCF” Excel spreadsheet which we’ve transformed into a CSV file.

CSV stands for “comma-separated values” which means that you represent tabular data by using commas to separate values from different columns:

animal,farm_count
chickens,20
cows,10
pigs,5

While you can technically read Excel files into R, reading CSV files is usually simpler and faster, as CSV is a very simple data format.
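
To get a feel for the format, the sketch below writes a small data frame to a CSV file and reads it back in. It uses a temporary file so that nothing is left behind in your working directory; csv_path and animals_again are made-up names.

```r
animals <- data.frame(animal = c("chickens", "cows", "pigs"),
                      farm_count = c(20, 10, 5))

# Write to a temporary file; row.names = FALSE leaves out the row numbers
csv_path <- tempfile(fileext = ".csv")
write.csv(animals, csv_path, row.names = FALSE)

# Look at the raw text in the file, one element per line
readLines(csv_path)

# Read the file back into a data frame
animals_again <- read.csv(csv_path, stringsAsFactors = FALSE)
animals_again
```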

Now we’ll load in an example data file and create several plots with it.

For this, you’ll need the UNFCCC_CO2_Emissions.csv file. To get it, right click on this link: UNFCCC_CO2_Emissions.csv and select “Save Target As” or “Save Link As” to save it to your computer.

Internet Explorer might try to save this as “UNFCCC_CO2_Emissions.txt”. Make sure to save it as “UNFCCC_CO2_Emissions.csv”, or adjust your code so that it reads the correct file.

One thing you need to check is your working directory. This is the directory where R looks for any files. You can set this in RStudio via Session -> Set Working Directory -> Choose Directory.

Make sure that this is set to the directory where you have placed the UNFCCC_CO2_Emissions.csv file.
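
If you are not sure whether the working directory is set correctly, you can check from the console before trying to read the file. This is a small sketch using base R functions; the variable names csv_name and found are only for illustration.

```r
# Show the current working directory
getwd()

# Check whether R can see the file from there
csv_name <- "UNFCCC_CO2_Emissions.csv"
found <- file.exists(csv_name)
found

# If it can't, list the CSV files that R does see, to help spot the problem
if (!found) {
  print(list.files(pattern = "\\.csv$"))
}
```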

df = read.csv(file="UNFCCC_CO2_Emissions.csv")

Create line plots per country of the amount of CO2 emissions:

ggplot(df, aes(x=Year, y=CO2_Mt, colour=Country)) + geom_line()

Create a stacked area chart showing how each country’s CO2 emissions contribute to the total:

ggplot(df, 
       aes(x=Year, y=CO2_Mt, fill=Country)) + 
  geom_area()

Same plot, but using geom_line(aes(ymax=CO2_Mt), position="stack", size=0.1) to add black lines to help better distinguish the individual countries.

ggplot(df, 
       aes(x=Year, y=CO2_Mt, fill=Country)) + 
  geom_area() + 
  geom_line(aes(ymax=CO2_Mt), position="stack", size=0.1)

In the previous plots, it’s a bit difficult to distinguish countries with similar colors. We can also use facet_wrap to create plots for individual countries.

ggplot(df, aes(x=Year, y=CO2_Mt)) + geom_line() + facet_wrap(~Country, scales="free_y")

The plot above shows the variation, but you’ll notice that the minimum value on the y scale is not set to zero. This means that the variation observed may not actually be that big when considering the overall amount of emissions. To fix this, we update our code to use ymin=0 so that we can get a picture of the absolute magnitude of emissions.

ggplot(df, aes(x=Year, y=CO2_Mt, ymin=0)) + geom_line() + facet_wrap(~Country, scales="free_y")
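
Another way to make sure that zero is included on the y axis is the expand_limits function from ggplot2. The sketch below is an alternative, not a replacement for the code above. So that it can be run without the CSV file, it uses a tiny data frame called toy with invented numbers and placeholder country names; only the column names match the real data.

```r
library(ggplot2)

# Invented values, for illustration only (not real emissions data)
toy <- data.frame(Year = rep(2000:2002, times = 2),
                  Country = rep(c("Country A", "Country B"), each = 3),
                  CO2_Mt = c(100, 102, 101, 10, 11, 12))

p <- ggplot(toy, aes(x = Year, y = CO2_Mt)) +
  geom_line() +
  facet_wrap(~Country, scales = "free_y") +
  expand_limits(y = 0)   # make every y axis stretch down to zero
p
```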

6 Reference Materials

For further information beyond what is covered in this practical, you can refer to the resources below. If you are having trouble understanding the contents of the practical, these can be quite useful.

6.1 Cheat sheets

These sheets are important and summarize much of what you will need to know about R for this course:

  • R cheat sheet
  • ggplot2 cheat sheet
  • dplyr cheat sheet

6.2 Additional Materials

For a more basic step-by-step introduction, you can install the swirl package:

install.packages("swirl")

You can then work through tutorials by doing:

library(swirl)
swirl()

Swirl is interesting because it teaches you R interactively from within the R console. You’ll see examples like this:

| To assign the result of 5 + 7 to a new variable called x, you type x <- 5 +
| 7. This can be read as 'x gets 5 plus 7'. Give it a try now.