R is a language and environment for statistical computing and graphics. In more practical terms:
You’re already used to programming since spreadsheets are code:
=SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,MATCH($AL$7&$AL$6,$data.$A$1:$AMJ$1,0)-29,1,2))/1000+SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,RIGHT(V$53,2)*8-7,1,2))/1000-SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,MATCH($AL$7&$AL$6,$data.$A$1:$AMJ$1,0)-37,1,2))/1000
This is a formula from an actual spreadsheet in use by a company. This code is a nightmare to maintain, and it’s very difficult to check whether it contains any errors. Spreadsheets create a soup of data storage, processing and presentation (i.e. numbers, formulas and graphs). This hinders transparency, and as spreadsheets get more complex, it becomes very difficult to trace how the data analysis is actually done. With R, it’s possible to use code to read from data files and then automatically generate reports in the form of Word documents, PDFs, or HTML pages. Separating your data, processing and presentation of results into separate components will help you to stay organized.
You need to have both R and RStudio set up. You can use R without RStudio, but if you do, the interface will look like the one in the left image below, which can be a bit intimidating. RStudio runs on top of R and provides a more visual interface, in addition to greatly assisting you with various aspects of programming.
[Images: the plain R interface (left) vs. R + RStudio (right)]
On the university computers you should be able to find RStudio in the Start Menu by looking for Mathematics & Statistics -> R. If there are several versions, just pick the one with the highest version number. You should not have to install R or RStudio on the university computers.
On your own computer you will have to install both R and RStudio.
Once you open up RStudio, you’ll see that it has four panels, which are:
Top Left - Code - here you can open and work on different script files
Top Right - Environment/History
Bottom Left - Console - if you type ?sum here, you’ll see the documentation for the sum function
Bottom Right - Files/Plots/Packages/Help
You can run lines of code by highlighting them and then clicking on “Run” above. You can run all the code at once by doing Code -> Run Region -> Run All.
Once R and RStudio are installed, open up RStudio and install the necessary packages for R. Note that with R, the words “package” and “library” are often used interchangeably. Today we’ll need to install the ggplot2 library to do the plotting.
Go to the Packages tab, and click on Install.
Type in ggplot2. You should see it auto-complete as you type.
Click Install and make sure that Install dependencies is checked.
You can also install the package by running this in the Console:
install.packages("ggplot2")
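If you rerun the same script often, it can be convenient to install a package only when it is missing. The following is a minimal sketch of that idea; `needs_install` is a hypothetical helper name made up for this example, built on the base R function `requireNamespace()`.

```r
# Hypothetical helper (not part of R): TRUE if the package is not installed yet
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Only download ggplot2 when it is missing
if (needs_install("ggplot2")) {
  install.packages("ggplot2")
}
```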
Everything shown in the large gray boxes below is code that you can run by copy/pasting into the Console tab in RStudio. For example:
print("hello world")
## [1] "hello world"
You can also collect these statements in a new R file (File -> New File -> R Script), which you can then run with the Run button.
Gray bits of text like this usually refer to individual R commands, package names, or the exact name of things that you will see in the user interface.
Make sure to always run this command when you start using R:
options(stringsAsFactors = FALSE)
We’ll cover what this is in a later practical, but for now it’s important to specify the stringsAsFactors option whenever running code, as you may get confusing results without it. In short, versions of R before 4.0.0 assume by default that text columns in your data are factors (categorical variables), which isn’t necessarily the case. From R 4.0.0 onward, FALSE is already the default, so running the command does no harm.
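To see what the option changes, here is a small sketch that builds the same data frame twice and passes `stringsAsFactors` explicitly, so the result does not depend on your R version. The names `as_text` and `as_factor` are made up for this example.

```r
# The same text column, stored once as plain text and once as a factor
as_text   <- data.frame(z = c("a", "b", "c"), stringsAsFactors = FALSE)
as_factor <- data.frame(z = c("a", "b", "c"), stringsAsFactors = TRUE)

class(as_text$z)     # "character"
class(as_factor$z)   # "factor"
```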
We can assign the value of 1 to the variable a by doing:
a = 1
We can then check that the value has been assigned:
a
## [1] 1
Instead of using =, we can also use <-, which does the same thing here. You will often see both forms used.
a <- 1
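As a quick check, the sketch below assigns the same value with both operators and compares the results. It also shows one place where `=` behaves differently: inside a function call it names an argument instead of creating a variable.

```r
a = 1
b <- 1
identical(a, b)        # TRUE

# Inside a function call, = names the argument x of mean();
# it does not create a variable called x
mean(x = c(1, 2, 3))   # 2
```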
Although you assigned a number to the variable a, you can also assign different types of data to it, like text:
a = "this is some text"
a
## [1] "this is some text"
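If you are unsure what kind of data a variable currently holds, the base R function `class()` will tell you. A small sketch:

```r
a = 1
class(a)   # "numeric"

a = "this is some text"
class(a)   # "character"

a = TRUE
class(a)   # "logical"
```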
a = c(3, 4, 5, 6, 7)
a
## [1] 3 4 5 6 7
Since this is a sequence from 3 to 7, we can also write it like:
a = c(3:7)
a
## [1] 3 4 5 6 7
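Once you have a vector like this, you can pick out elements and run calculations on all of its values at once. The following sketch uses only base R; note that R starts counting positions at 1.

```r
a = c(3, 4, 5, 6, 7)

a[2]        # the second element: 4
length(a)   # the number of elements: 5
sum(a)      # 25
a * 2       # every element doubled: 6 8 10 12 14
a[a > 4]    # only the elements larger than 4: 5 6 7
```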
Initialize a matrix of zeros with three columns and two rows. Then set the value at row 1 and column 2 to 3:
b = matrix(0, ncol=3, nrow=2)
b[1,2] = 3
b
##      [,1] [,2] [,3]
## [1,]    0    3    0
## [2,]    0    0    0
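The same square-bracket notation can also read values back out. As a short sketch, leaving the row or the column position empty selects everything along that dimension:

```r
b = matrix(0, ncol=3, nrow=2)
b[1,2] = 3

dim(b)   # number of rows and columns: 2 3
b[1,]    # the whole first row: 0 3 0
b[,2]    # the whole second column: 3 0
```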
You can append a matrix to the end of another matrix using rbind, which means that you will bind the rows together.
b = matrix(0, ncol=3, nrow=2)
c = matrix(1, ncol=3, nrow=2)
d = rbind(b,c)
d
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    1    1    1
## [4,]    1    1    1
You can also bind the columns together using cbind:
cbind(b,c)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    1    1    1
## [2,]    0    0    0    1    1    1
R also has functions that run operations such as mean and sum on matrix rows and columns:
rowMeans(d)
## [1] 0 0 1 1
rowSums(d)
## [1] 0 0 3 3
colMeans(d)
## [1] 0.5 0.5 0.5
colSums(d)
## [1] 2 2 2
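These four functions are shortcuts for common cases. For other operations, base R provides `apply()`, which runs any function over the rows (second argument 1) or the columns (second argument 2) of a matrix. A minimal sketch that rebuilds the matrix d from above:

```r
d = rbind(matrix(0, ncol=3, nrow=2), matrix(1, ncol=3, nrow=2))

apply(d, 1, mean)   # per row, gives the same values as rowMeans(d)
apply(d, 2, sum)    # per column, gives the same values as colSums(d)
apply(d, 1, max)    # any function works, here the maximum per row: 0 0 1 1
```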
Data frames are conceptually similar to matrices, although you can mix in different types of data per column. In the example below, the first two columns are numbers, while the last column is text.
a = data.frame(x = c(1:3),
               y = c(4:6),
               z = c("a", "b", "c"))
a
##   x y z
## 1 1 4 a
## 2 2 5 b
## 3 3 6 c
Data frames are similar to matrices in that you can access elements based on their row and column indices. For example, to get the element in the 2nd row and 3rd column:
a[2,3]
## [1] "b"
We can also get just the 2nd row:
a[2,]
##   x y z
## 2 2 5 b
Or just the 3rd column:
a[,3]
## [1] "a" "b" "c"
One of the nice things about data frames is that you can use the names of the columns (combined with the $ sign) to directly access the values in that column. So if we want to see the values of only the z column, we can use a$z:
a$z
## [1] "a" "b" "c"
You can also add a new column to an existing data frame:
a$t = c(10, 13, 17)
a
##   x y z  t
## 1 1 4 a 10
## 2 2 5 b 13
## 3 3 6 c 17
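The indexing and the $ notation can be combined to select rows by a condition or to compute a new column from existing ones. A small sketch using the same data frame; the column name `total` is made up for this example.

```r
a = data.frame(x = c(1:3),
               y = c(4:6),
               z = c("a", "b", "c"))

a[a$x > 1, ]          # only the rows where x is larger than 1
a[, c("x", "z")]      # columns selected by name
a$total = a$x + a$y   # new column computed from two existing columns
a$total               # 5 7 9
nrow(a)               # number of rows: 3
```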
We’ll look at the mtcars data set that is included with R. If you type ?mtcars in the console, you’ll see more documentation. Looking at the first few lines of the mtcars data frame, we see data in several columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The head command just shows us the top few rows, and you can also use the tail command to look at the bottom rows.
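Both commands accept a second argument with the number of rows to show. The sketch below also uses a few other base R functions that help to get a first impression of a data frame:

```r
head(mtcars, 3)   # the first 3 rows
tail(mtcars, 3)   # the last 3 rows
nrow(mtcars)      # number of rows: 32
ncol(mtcars)      # number of columns: 11
names(mtcars)     # the column names
```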
For specific columns, we can use the mean and sd functions to find the average and standard deviation.
mean(mtcars$mpg)
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948
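A few related base R functions follow the same pattern. As a sketch, `summary()` gives several statistics at once, and `tapply()` applies a function separately to each group, here the mean fuel economy per number of cylinders. The variable name `mpg_per_cyl` is made up for this example.

```r
summary(mtcars$mpg)   # minimum, quartiles, median, mean and maximum
median(mtcars$mpg)

# Mean of mpg, calculated separately for each value of cyl
mpg_per_cyl = tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_per_cyl
```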
We first need to load the library that we’ll be using for the rest of this tutorial:
library(ggplot2)
Many visualizations in R are created using the ggplot2 library. What’s interesting about this library is the way in which it allows you to construct visualizations. The gg in ggplot2 stands for the Grammar of Graphics. The idea is that when you create plots, you are basically writing sentences that are of the form:
Here's my data frame
+ Here are the x and y columns
+ Apply this kind of plot to that data
+ These are the axis labels
+ here are some more additional transformations
The syntax may look strange at first, although it’s a very modular approach, and you can create very complex visualizations just by adding new parts to these sentences.
To understand this, we can first do a simple scatter plot. You’ll notice with the syntax that we first start with the mtcars data frame, then we specify which columns are to be associated with the x and y values, and then we specify that we want to plot the data as points by adding + geom_point().
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()
In the following examples, you may see the code split over multiple lines. The two statements below are equivalent; spreading a command over multiple lines can make it more readable by separating the code into its different functional pieces.
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()
ggplot(mtcars,
aes(x=hp, y=mpg)) +
geom_point()
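Because a plot is built up by adding parts, you can also store it in a variable and extend it step by step, which mirrors the sentence structure described earlier. A minimal sketch, where the variable name `p` and the label texts are choices made for this example; `labs()` is the ggplot2 function for setting axis labels.

```r
library(ggplot2)

p = ggplot(mtcars, aes(x=hp, y=mpg))                  # data frame + x and y columns
p = p + geom_point()                                  # the kind of plot
p = p + labs(x="Horsepower", y="Miles per gallon")    # the axis labels
p                                                     # typing the name draws the plot
```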
We can also use values from other columns to modify particular attributes of the points. For example, we can set the color of the points to indicate the number of cylinders:
ggplot(mtcars, aes(x=hp, y=mpg, colour=cyl)) + geom_point()
We can set the size of the points based on the weight of the car:
ggplot(mtcars, aes(x=hp, y=mpg, colour=cyl, size=wt)) + geom_point()
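Since cyl is stored as a number, ggplot2 colors the points along a continuous gradient. If you would rather have one distinct color per number of cylinders, one option is to wrap the column in `factor()`, as in this sketch:

```r
library(ggplot2)

# factor(cyl) treats 4, 6 and 8 as separate categories instead of a numeric range
p = ggplot(mtcars, aes(x=hp, y=mpg, colour=factor(cyl))) + geom_point()
p
```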
For this example, we need to specify x=factor(cyl) to make sure that there is a boxplot drawn for each unique value of cyl.
ggplot(mtcars, aes(x=factor(cyl), y=mpg)) + geom_boxplot()
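To see what `factor()` does to the column, you can inspect it outside of the plot. A short sketch, where `cyl_groups` is a name made up for this example:

```r
class(mtcars$cyl)               # "numeric"

cyl_groups = factor(mtcars$cyl)
class(cyl_groups)               # "factor"
levels(cyl_groups)              # the distinct categories: "4" "6" "8"
```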
Histogram of the number of cars with a particular fuel economy value:
ggplot(mtcars, aes(x=mpg)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also change the bin width:
ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth=5)
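Besides binwidth, geom_histogram() also accepts a bins argument, which sets the number of bins directly instead of their width. A small sketch, with the value 10 chosen arbitrarily:

```r
library(ggplot2)

# Divide the range of mpg into 10 bins of equal width
p = ggplot(mtcars, aes(x=mpg)) + geom_histogram(bins=10)
p
```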
Count the number of cars with specific numbers of cylinders:
ggplot(mtcars,
aes(x=cyl)) +
geom_bar()
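The heights of these bars can be checked without a plot by using the base R function `table()`, which counts how often each value occurs:

```r
counts = table(mtcars$cyl)
counts
##  4  6  8
## 11  7 14
```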
Pie charts can be created as well, although they require a few more steps. Part of the reason for this is that many data visualization experts discourage their use since other types of visualizations can communicate the information more effectively.
The general strategy to create a pie chart using ggplot2 is to first create a bar chart and then to use polar coordinates to turn the bars into a circle.
In other words, we start with this:
ggplot(mtcars,
aes(x=factor(1), fill=factor(cyl))) +
geom_bar(width=1)
To explain what’s going on:
x=factor(1) - This places the bars at the same location, which allows them to be stacked.
fill=factor(cyl) - The fill color for the bars is based on the value of cyl.
geom_bar(width=1) - This is needed so that there isn’t a hole in the plot when we use the code in the step below.
The height of each bar is a count of the number of rows (in the mtcars data frame) with a specific value for cyl. In other words, the size of the pie slices is not based on actual numeric values in mtcars.
We then turn this into a pie chart by adding + coord_polar(theta="y"):
# this does a count
ggplot(mtcars,
aes(x=factor(1), fill=factor(cyl))) +
geom_bar(width=1) + coord_polar(theta="y")
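If you want to know which share of the pie each slice represents, one way is to calculate it from the counts with the base R function `prop.table()`. A minimal sketch, with variable names made up for this example:

```r
counts = table(mtcars$cyl)     # number of cars per value of cyl
shares = prop.table(counts)    # the same counts as fractions of the total
round(100 * shares, 1)         # percentage of the pie per slice
```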
If we want to create a pie chart where the size of the slices corresponds to actual values in the data and not just to counts of things with the same values, we need to take a slightly different approach.
Here we have a data frame listing types of animals and values associated with them:
animals = data.frame(animal_type = c("chickens", "cows", "pigs"),
farm_count = c(20, 10, 5))
animals
##   animal_type farm_count
## 1    chickens         20
## 2        cows         10
## 3        pigs          5
We then add y=farm_count and stat="identity" to make sure that this is plotted correctly. We also use theme_void() to remove the axis labels that we saw in the previous plot.
Using stat="identity" means that the size of the pie slices is based on the values contained in the data, and not on the count of things with the same values.
ggplot(animals,
aes(x=factor(1), y=farm_count, fill=factor(animal_type))) +
geom_bar(width=1, stat="identity") +
coord_polar(theta="y") +
theme_void()
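As an alternative sketch, ggplot2 also has geom_col(), which uses the values in the data as bar heights without needing stat="identity". Assuming a version of ggplot2 that includes geom_col(), the following should produce the same pie chart:

```r
library(ggplot2)

animals = data.frame(animal_type = c("chickens", "cows", "pigs"),
                     farm_count = c(20, 10, 5))

# geom_col() takes the bar heights directly from the y column
p = ggplot(animals,
           aes(x=factor(1), y=farm_count, fill=factor(animal_type))) +
  geom_col(width=1) +
  coord_polar(theta="y") +
  theme_void()
p
```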
The following example is based on CO2 emissions data from the UNFCCC, specifically the “CO2 excluding LULUCF” Excel spreadsheet which we’ve transformed into a CSV file.
CSV stands for “comma-separated values”, which means that you represent tabular data by using commas to separate values from different columns:
animal,farm_count
chickens,20
cows,10
pigs,5
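To try this out without creating a file, read.csv can also read from a piece of text via its text argument. A small sketch using the example above, where `csv_text` is a name made up for this example:

```r
csv_text = "animal,farm_count
chickens,20
cows,10
pigs,5"

# The first line becomes the column names, every other line becomes a row
animals = read.csv(text = csv_text)
animals$farm_count        # 20 10 5
sum(animals$farm_count)   # 35
```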
While you can technically read Excel files into R, reading CSV files is much faster, as CSV is a very simple data format.
Now we’ll load in an example data file and create several plots with it.
For this, you’ll need the UNFCCC_CO2_Emissions.csv file. To get it, right click on this link: UNFCCC_CO2_Emissions.csv and select “Save Target As” or “Save Link As” to save it to your computer.
Internet Explorer might try to save this as “UNFCCC_CO2_Emissions.txt”. Make sure to save it as “UNFCCC_CO2_Emissions.csv”, or adjust your code so that it reads the correct file.
One thing you need to check is your working directory. This is the directory where R looks for any files. You can set this in RStudio via Session -> Set Working Directory -> Choose Directory. Make sure that this is set to the directory where you have placed the UNFCCC_CO2_Emissions.csv file.
df = read.csv(file="UNFCCC_CO2_Emissions.csv")
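If reading the file fails, the working directory is the most likely cause. The following sketch checks from within R whether the file can be found before reading it, and then shows the first rows and the column types. The variable name `csv_file` is made up for this example.

```r
csv_file = "UNFCCC_CO2_Emissions.csv"

getwd()   # shows the current working directory

if (file.exists(csv_file)) {
  df = read.csv(file=csv_file)
  print(head(df))   # the first few rows
  str(df)           # the column names and their types
} else {
  message("Cannot find ", csv_file, " in ", getwd())
}
```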
Create line plots per country of the amount of CO2 emissions:
ggplot(df, aes(x=Year, y=CO2_Mt, colour=Country)) + geom_line()
Create a stacked area chart showing how each country’s CO2 emissions contribute to the total:
ggplot(df,
aes(x=Year, y=CO2_Mt, fill=Country)) +
geom_area()
Same plot, but using geom_line(aes(ymax=CO2_Mt), position="stack", size=0.1) to add black lines to help better distinguish the individual countries.
ggplot(df,
aes(x=Year, y=CO2_Mt, fill=Country)) +
geom_area() +
geom_line(aes(ymax=CO2_Mt), position="stack", size=0.1)
In the previous plots, it’s a bit difficult to distinguish countries with similar colors. We can also use facet_wrap to create plots for individual countries.
~Country - create individual plots per distinct value in the Country column.
scales="free_y" - each plot will have its y axis scaled individually. This helps to view the trends from countries with lower CO2 emissions.
ggplot(df, aes(x=Year, y=CO2_Mt)) + geom_line() + facet_wrap(~Country, scales="free_y")
The plot above shows the variation, but you’ll notice that the minimum value on the y scale is not set to zero. This means that the variation observed may not actually be that big when considering the overall amount of emissions. To fix this, we update our code to use ymin=0 so that we can get a picture of the absolute magnitude of emissions.
ggplot(df, aes(x=Year, y=CO2_Mt, ymin=0)) + geom_line() + facet_wrap(~Country, scales="free_y")
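If ymin=0 does not have the desired effect in your version of ggplot2, another option is the ggplot2 function expand_limits(), which stretches an axis so that it includes a given value. The sketch below uses the built-in mtcars data so that it runs without the CSV file; the same + expand_limits(y=0) can be added to the emissions plot.

```r
library(ggplot2)

# One panel per value of cyl, each with its own y axis that is forced to include 0
p = ggplot(mtcars, aes(x=hp, y=mpg)) +
  geom_point() +
  facet_wrap(~cyl, scales="free_y") +
  expand_limits(y=0)
p
```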
For further information beyond what is covered in this practical, you can refer to the resources below. If you are having trouble understanding the contents of the practical, these can be quite useful.
These cheat sheets are important and summarize much of what you will need to know about R for this course:
R cheat sheet
ggplot2 cheat sheet
dplyr cheat sheet - for reshaping, combining, grouping, and summarizing data frames. We will cover dplyr in a later practical.
For a more basic step-by-step introduction, you can install the swirl package:
install.packages("swirl")
You can then work through tutorials by doing:
library(swirl)
swirl()
Swirl is interesting since it guides you in learning R, within the R console. You’ll see examples like this:
| To assign the result of 5 + 7 to a new variable called x, you type x <- 5 +
| 7. This can be read as 'x gets 5 plus 7'. Give it a try now.