For this practical, we need to open up RStudio and install the necessary packages for R. Note that with R, the terms “package” and “library” are often used interchangeably and they mean the same thing.
Today we’ll need to install the tidyverse package. The tidyverse package is a sort of “meta-package” that installs a collection of other packages which are very useful for working with R. In other words, instead of installing twenty different packages individually, you just install this one package and it grabs everything that you need. The online book R for Data Science goes into much more detail about how these packages can be used, and may be useful to refer to in the future, especially if your thesis project involves a lot of data analysis in R.
For this practical, there are two particular packages contained within tidyverse that we will be using:
dplyr, which makes it easy to do very complex operations with data frames.
ggplot2, which allows you to plot data using a variety of visualizations.
In the bottom right quadrant of RStudio, locate the Packages tab, and click on Install:
Then type in the name of the package, tidyverse. You should see it auto-complete as you type:
Click on Install and make sure that Install dependencies is checked:
The installation may take a minute or two because of the number of different packages it has to install. You should then see statements like this in the console on the bottom left quadrant:
Once you see package ‘tidyverse’ successfully unpacked and MD5 sums checked, then things have been installed successfully.
You can also install packages just by typing:
install.packages("tidyverse")
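If you re-run a script often, you may not want to reinstall the package every time. The following is a minimal sketch of one way to install only when a package is missing. The helper names is_installed and install_if_missing are made up for this illustration; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: install a package only when it is not available yet
install_if_missing <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
  invisible(is_installed(pkg))
}

# Uncomment to use it for this practical:
# install_if_missing("tidyverse")
```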
We first need to load the library that we’ll be using for the rest of this tutorial:
library(tidyverse)
After running this command, you may see messages like this:
For this practical, these warnings are normal and won’t cause a problem.
If later on you see an error that says something like could not find function, then that means that you forgot to load the library and that you need to make sure to run the library(tidyverse) command.
There are several data types you’ll frequently use when working with R. In the previous practical we discussed the first four of these.
numeric - integers, real numbers
character - “a”, “hello world”
matrix
data.frame
logical - boolean values of TRUE, FALSE
factor - typically used to indicate categories/labels for data observations
Within a single data frame, we can have different data types per column. Note that in the data frame below we use factors in the type_of_person column to label people as doesn't like cheese and likes cheese. In the likes_cheese column, we encode the same information but as a series of TRUE and FALSE values.
If we have multiple labels such as doesn't like cheese, likes Gouda cheese, likes stinky cheese, then it makes sense to use factors since they can support as many distinct categories as you wish to have.
df = data.frame(person = c("Alice", "Bob", "Carol"), # character
type_of_person = factor(c("doesn't like cheese", "likes cheese", "likes cheese")), # factor
favourite_number = c(4, 8, 9), # numeric
likes_cheese = c(FALSE, TRUE, TRUE)) # logical
When working with R, you may sometimes need to check which data types you are working with. This is typically necessary when you run into some strange error or your calculation doesn’t seem to give you the correct results.
To do this, use class() to see which data type you are working with:
# Some logical values
a = TRUE
class(a)
## [1] "logical"
b = c(TRUE, FALSE, TRUE)
class(b)
## [1] "logical"
# numeric
a = 3
class(a)
## [1] "numeric"
b = 3.14:10
class(b)
## [1] "numeric"
# character
a = "3"
class(a)
## [1] "character"
b = c("hello", "this", "is", "a", "vector", "of", "characters")
class(b)
## [1] "character"
You can also use class() on the columns of a data frame.
class(df$person)
## [1] "character"
class(df$type_of_person)
## [1] "factor"
class(df$favourite_number)
## [1] "numeric"
class(df$likes_cheese)
## [1] "logical"
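Checking columns one at a time gets tedious for wider data frames. As a small sketch, the base R function sapply() can apply class() to every column in one go. The data frame is rebuilt here so the example runs on its own, and stringsAsFactors = FALSE is an addition to keep the person column as character in older versions of R as well.

```r
# Same example data frame as above, rebuilt so this snippet is self-contained
df <- data.frame(person = c("Alice", "Bob", "Carol"),
                 type_of_person = factor(c("doesn't like cheese", "likes cheese", "likes cheese")),
                 favourite_number = c(4, 8, 9),
                 likes_cheese = c(FALSE, TRUE, TRUE),
                 stringsAsFactors = FALSE)

# Apply class() to each column; the result is a named character vector
col_classes <- sapply(df, class)
col_classes
```

The str(df) command gives a similar overview together with the first few values of each column.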
Occasionally you may find that you are working with data that is not of the correct data type. You can fix this by using one of the following functions:
as.numeric()
as.character()
as.factor()
as.matrix()
as.data.frame()
We can convert the character "3" into the number 3 and vice versa:
as.numeric("3")
## [1] 3
as.character(3)
## [1] "3"
Factors are a bit of a strange data type since they’re generally used to represent categories of things. If we create a factor for a vector we see:
as.factor(c(3, 4, 7, 2, 4, 3))
## [1] 3 4 7 2 4 3
## Levels: 2 3 4 7
The part about Levels: 2 3 4 7 shows that it’s treating these numbers as labels or categories. In other words, these values would be used to say things like Category 2 = “brown hair”, Category 3 = “black hair”, etc.
We can also convert factors into characters, in which case we’ll have a character vector:
as.character(df$type_of_person)
## [1] "doesn't like cheese" "likes cheese" "likes cheese"
You can also convert them to numbers, although this is not necessarily useful.
as.numeric(df$type_of_person)
## [1] 1 2 2
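Occasionally when you read in a data file, R may interpret numbers as factors. Calling as.numeric() directly on a factor returns the internal category codes rather than the original numbers, so you have to convert the factors first to characters and then to numbers: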
a = as.factor(c(3, 4, 7, 2, 4, 3))
class(a) # first check if this is a factor
## [1] "factor"
as.numeric(a) # these don't look like the numbers
## [1] 2 3 4 1 3 2
as.character(a) # these look like the numbers, but they're in character form
## [1] "3" "4" "7" "2" "4" "3"
as.numeric(as.character(a)) # this is what we need
## [1] 3 4 7 2 4 3
a <- as.numeric(as.character(a)) # use this to overwrite the original value
a
## [1] 3 4 7 2 4 3
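For reference, there is a second common idiom for the same conversion, which converts the levels once and then indexes them with the factor. This is only a sketch of an alternative; the as.numeric(as.character(a)) form above gives the same result.

```r
a <- as.factor(c(3, 4, 7, 2, 4, 3))

# levels(a) is the character vector "2" "3" "4" "7".
# Convert those labels to numbers, then use the factor itself as an index
# so that every element picks up the number belonging to its level.
b <- as.numeric(levels(a))[a]
b
## [1] 3 4 7 2 4 3
```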
Relational operators allow us to compare values and see if these values meet some condition or not. In the previous practical we used some of these to help with subsetting sets of elements from vectors, matrices and data frames. Below we show more of the operators and give a more detailed overview of how they work.
Operators you will commonly use are:
!=
==
>, <
>=, <=
%in%
Assign a value to a so we can then try out different relational operators on it.
a <- 5
!= not equal to
a != 6
## [1] TRUE
a != 5
## [1] FALSE
When you use these relational operators on a vector (or a matrix or data frame), R evaluates each element in the vector, resulting in a vector of TRUE and FALSE values:
b = c(3, 7, 2, 1, 4)
b != 7
## [1] TRUE FALSE TRUE TRUE TRUE
Again, we can use this to return the subset of elements matching the criteria.
b[b != 7]
## [1] 3 2 1 4
And we can see that the results above are the same as the statement below:
b[c(TRUE, FALSE, TRUE, TRUE, TRUE)]
## [1] 3 2 1 4
== equal to
a == 5
## [1] TRUE
>, < greater than, less than
a > 5
## [1] FALSE
>=, <= greater/less than or equal
a <= 5
## [1] TRUE
%in% value is in a list
a %in% 1:10
## [1] TRUE
! negation
This takes the opposite of the value that comes after it:
(a == 5)
## [1] TRUE
!(a == 5)
## [1] FALSE
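Both %in% and ! also work element by element on vectors, which makes them handy for subsetting. A short sketch using the same vector b as earlier (the variable names in_set and kept are just for illustration):

```r
b <- c(3, 7, 2, 1, 4)

# TRUE for every element of b that appears in the set c(1, 7)
in_set <- b %in% c(1, 7)
in_set
## [1] FALSE  TRUE FALSE  TRUE FALSE

# Negate the test to keep only the elements that are NOT in the set
kept <- b[!in_set]
kept
## [1] 3 2 4
```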
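Two forms are easy to mistype when you mean to compare values:
= - used for assignment (and for naming function arguments), not for comparison; use == to test for equality
=>, =< - incorrect forms of greater/less than or equal; use >= and <= instead
In the previous practical, we talked about subsetting elements within a data frame, although you’ll often need to do much more complex operations.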
One package that helps with this is the dplyr package. The main idea with dplyr is that you start off with data in a data frame, and then you perform a series of operations, or data transformations, which are connected together via the %>% symbol which acts as a sort of “pipe” through which data flows. The philosophy behind dplyr is that it provides you with a way to specify a set of “verbs” describing what you want to do with your data.
In very general terms, with the dplyr library we can create a series of statements that look like:
some data frame %>%
operation 1 %>%
operation 2
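To make the pipe less mysterious, the sketch below compares a piped statement with the equivalent ordinary function call. It uses the base R functions head() and nrow() so that it does not depend on any dplyr verbs; the variable names are only for illustration.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

# x %>% f(y) passes x as the first argument of f, i.e. it is the same as f(x, y)
piped  <- mtcars %>% head(3)
nested <- head(mtcars, 3)
identical(piped, nested)
## [1] TRUE

# With two steps, the piped version reads left to right,
# while the nested version has to be read from the inside out
n_piped  <- mtcars %>% head(3) %>% nrow()
n_nested <- nrow(head(mtcars, 3))
```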
select()
In the following example, we use select() to return a data frame consisting of the mpg and cyl columns.
mtcars %>% select(mpg, cyl)
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
## Duster 360 14.3 8
## Merc 240D 24.4 4
## Merc 230 22.8 4
## Merc 280 19.2 6
## Merc 280C 17.8 6
## Merc 450SE 16.4 8
## Merc 450SL 17.3 8
## Merc 450SLC 15.2 8
## Cadillac Fleetwood 10.4 8
## Lincoln Continental 10.4 8
## Chrysler Imperial 14.7 8
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Toyota Corona 21.5 4
## Dodge Challenger 15.5 8
## AMC Javelin 15.2 8
## Camaro Z28 13.3 8
## Pontiac Firebird 19.2 8
## Fiat X1-9 27.3 4
## Porsche 914-2 26.0 4
## Lotus Europa 30.4 4
## Ford Pantera L 15.8 8
## Ferrari Dino 19.7 6
## Maserati Bora 15.0 8
## Volvo 142E 21.4 4
Note that you can also write the same command as shown below, where you create a new line after the %>% symbol. This can be useful if you have a large number of operations.
mtcars %>%
select(mpg, cyl)
Note that if you try to start a new line with the %>% symbol, you will get an error. This symbol always has to be at the end of the previous line, like we showed above.
mtcars
%>% select(mpg, cyl)
## Error: <text>:2:3: unexpected SPECIAL
## 1: mtcars
## 2: %>%
## ^
filter()
We can use filter() to only return rows matching some condition. Below we only keep the rows which have a value of hp below 90.
mtcars %>%
filter(hp < 90)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 4 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 5 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Within a single filter() statement you can have multiple comparisons separated by a comma. Here we only keep rows where the value of hp is between 100 and 200.
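mtcars %>%
filter(100 < hp, hp <= 200)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 6 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 7 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 8 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 9 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 10 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 11 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 12 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 13 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 14 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 15 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 16 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2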
As mentioned, we can chain together multiple operations. Below we combine the filter() and select() statements from above.
mtcars %>%
filter(100 < hp, hp <= 200) %>%
select(mpg, cyl)
## mpg cyl
## 1 21.0 6
## 2 21.0 6
## 3 21.4 6
## 4 18.7 8
## 5 18.1 6
## 6 19.2 6
## 7 17.8 6
## 8 16.4 8
## 9 17.3 8
## 10 15.2 8
## 11 15.5 8
## 12 15.2 8
## 13 19.2 8
## 14 30.4 4
## 15 19.7 6
## 16 21.4 4
The command above is basically of the form:
mtcars %>%
“keep these rows” %>%
“keep these columns”
In the filter() statement, only rows which match all of the specified conditions are returned. For example, the command below won’t return any rows since a car can’t have a hp value that is simultaneously below 100 and at least 200.
mtcars %>%
filter(hp < 100, hp >= 200) %>%
select(mpg, cyl)
## [1] mpg cyl
## <0 rows> (or 0-length row.names)
However, we can use filter() to find all cars which have a hp below 100 or at least 200. For this we use the | symbol which specifies a logical “or”. R will read this as meaning something like filter(hp < 100 “or” hp >= 200).
mtcars %>%
filter(hp < 100 | hp >= 200) %>%
select(mpg, cyl)
## mpg cyl
## 1 22.8 4
## 2 14.3 8
## 3 24.4 4
## 4 22.8 4
## 5 10.4 8
## 6 10.4 8
## 7 14.7 8
## 8 32.4 4
## 9 30.4 4
## 10 33.9 4
## 11 21.5 4
## 12 13.3 8
## 13 27.3 4
## 14 26.0 4
## 15 15.8 8
## 16 15.0 8
By using the & symbol you can specify a logical “and”. This is basically what the comma in one of the previous examples was doing, and we could rewrite filter(100 < hp, hp <= 200) as filter(100 < hp & hp <= 200) since we’re looking for rows where hp is above 100 and less than or equal to 200.
mtcars %>%
filter(100 < hp & hp <= 200) %>%
select(mpg, cyl)
## mpg cyl
## 1 21.0 6
## 2 21.0 6
## 3 21.4 6
## 4 18.7 8
## 5 18.1 6
## 6 19.2 6
## 7 17.8 6
## 8 16.4 8
## 9 17.3 8
## 10 15.2 8
## 11 15.5 8
## 12 15.2 8
## 13 19.2 8
## 14 30.4 4
## 15 19.7 6
## 16 21.4 4
Filter can also be used on multiple columns. Note that we create a separate clause (hp >= 250 | hp < 100) to contain a logical “or” statement. The reason for the parentheses is that this is a single condition that we want to match. For the other conditions, we want rows where the value of cyl is 4, and the mpg is greater than 30.
mtcars %>%
filter((hp >= 250 | hp < 100), # condition 1
cyl == 4, # condition 2
mpg > 30) %>% # condition 3
select(mpg, cyl)
## mpg cyl
## 1 32.4 4
## 2 30.4 4
## 3 33.9 4
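The %in% operator from the relational operators section also works inside filter(), which can be shorter than chaining several == tests with |. A small sketch under that assumption (the variable name small_engines is only for illustration):

```r
library(dplyr)

# Keep cars with 4 or 6 cylinders; equivalent to filter(cyl == 4 | cyl == 6)
small_engines <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%
  select(mpg, cyl)

nrow(small_engines)
## [1] 18
```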
The order of the statements will sometimes matter and you should pay attention to this. For example, note that if we put the select() statement before the filter() statement, we will get an error since our select(mpg, cyl) statement only keeps the mpg and cyl columns, while the filter() statement needs information on the hp column. This is what is meant by Evaluation error: object 'hp' not found. In other words, you will sometimes run into errors with dplyr when your statements aren’t in the correct sequence.
mtcars %>%
select(mpg, cyl) %>%
filter((hp >= 250 | hp < 100), # condition 1
cyl == 4, # condition 2
mpg > 30) # condition 3
## Error in filter_impl(.data, quo): Evaluation error: object 'hp' not found.
summarise()
The functions filter() and select() allow us to select specific rows and columns, although we usually also want to do some sort of statistical calculations over the values in a data frame.
Below we use summarize() to find the average and standard deviation of all the horsepower (hp) values. Only a single row is returned since these calculations are performed over the entire set of rows. The summarize() function (which can also be spelled summarise(); both spellings are the same function) creates new columns based on the names you specify, and multiple columns can be created by separating them with a comma.
The commands are of the form
summarize(new_column_name = operation)
summarize(new_column_name = operation,
another_new_column_name = another operation)
Find the average hp
mtcars %>% summarize(avg_hp = mean(hp))
## avg_hp
## 1 146.6875
Note that you should always precede these statements with your data frame (i.e. mtcars %>%). Without this, dplyr won’t know where to find the values for hp.
summarize(avg_hp = mean(hp))
## Error in mean(hp): object 'hp' not found
Find the average hp and standard deviation of hp:
mtcars %>% summarize(avg_hp = mean(hp), sd_hp = sd(hp))
## avg_hp sd_hp
## 1 146.6875 68.56287
Find the average fuel efficiency (mpg = “miles per gallon”) for different ranges of horsepower (hp):
mtcars %>% filter(hp < 100) %>% summarise(avg_mpg = mean(mpg))
## avg_mpg
## 1 26.83333
mtcars %>% filter(100 <= hp, hp < 200) %>% summarise(avg_mpg = mean(mpg))
## avg_mpg
## 1 19.21875
mtcars %>% filter(200 <= hp, hp < 300) %>% summarise(avg_mpg = mean(mpg))
## avg_mpg
## 1 13.15
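Repeating the same filter() and summarise() pair for every range works, but it can also be done in one statement. The sketch below is one possible approach and uses three functions that are not covered in this practical: cut() from base R to label each car with its hp range, mutate() to add that label as a new column, and group_by() to make summarise() work per group. The column name hp_range and the break points are choices made for this illustration.

```r
library(dplyr)

mpg_by_hp <- mtcars %>%
  # label each row with its hp range: [0,100), [100,200), [200,300), [300,400)
  mutate(hp_range = cut(hp, breaks = c(0, 100, 200, 300, 400), right = FALSE)) %>%
  # summarise() now returns one row per hp_range instead of one row overall
  group_by(hp_range) %>%
  summarise(avg_mpg = mean(mpg))

mpg_by_hp
```

The first three rows should match the three separate results above (26.83333, 19.21875 and 13.15).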
Many visualizations in R are created using the ggplot2 library. What’s interesting about this library is the way in which it allows you to construct visualizations. The gg in ggplot2 stands for the Grammar of Graphics. The idea is that when you create plots, you are basically writing sentences that are of the form:
Here's my data frame
+ Here are the x and y columns
+ Apply this kind of plot to that data
+ These are the axis labels
+ here are some more additional transformations
The syntax may look strange at first, although it’s a very modular approach, and you can create very complex visualizations just by adding new parts to these sentences.
We’ll first start off by writing code that doesn’t produce a useful plot yet, but it at least shows how we build up these statements. First we have the ggplot() function. You need to call this whenever you want to make a plot.
ggplot()
Next we need to specify which data frame we want to use, via the data= argument. In this case we use the mtcars data frame which we first showed in the previous practical.
ggplot(data=mtcars)
At this point, ggplot knows that we want to plot a data frame, but it’s not sure which columns it should look at.
We now have to use aes() to map what are called the aesthetics. Here we tell ggplot which columns are to be used for the x and y values in the plot.
Note that we don’t use quotation marks for the column names, i.e. aes(x=mpg, y=cyl) is the correct syntax, while aes(x="mpg", y="cyl") will not give you the results that you want.
ggplot(data=mtcars, aes(x=mpg, y=cyl))
Now we see a blank grid, whose ranges correspond to the min/max values of the cyl and mpg columns. At this point ggplot knows what we want to plot, but not how to plot it. To do this, we need to add one of the “Geoms” as referred to on the Data Visualization with ggplot2 Cheat Sheet. These are the functions which begin with geom_ and look like geom_point, geom_line, geom_boxplot, etc.
In other words, you will be writing a lot of statements of the form:
ggplot(data = <your data frame>, aes(x = <column in your data frame>, y = <another column in your data frame>)) + geom_<point, line, etc.>()
To understand this, we can first do a simple scatter plot. You’ll notice with the syntax that we first start with the mtcars data frame, then we specify which columns are to be associated with the x and y values, and then we specify that we want to plot the data as points by adding + geom_point().
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()
In the following examples, you may see the code examples split over multiple lines. The two statements below are actually equivalent, but by spreading the commands over multiple lines it can sometimes help to make things more readable by separating the code into its different functional pieces.
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()
ggplot(mtcars,
aes(x=hp, y=mpg)) +
geom_point()
Note that the + can’t be at the beginning of the line, otherwise you will get an error. It always has to appear at the end of the line, just like with the %>% used for dplyr.
Incorrect:
ggplot(mtcars,
aes(x=hp, y=mpg))
+ geom_point()
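Because a plot is built up by adding pieces, you can also store a partial plot in a variable and add to it later. This is a small sketch of that pattern; the variable name p is arbitrary.

```r
library(ggplot2)

# Start with only the data and the aesthetics; nothing is drawn as points yet
p <- ggplot(mtcars, aes(x=hp, y=mpg))

# Add the geom in a separate step; the + still sits at the end of the expression
p <- p + geom_point()

# Typing the variable name on its own line draws the plot
p
```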
We can also use values from other columns to modify particular attributes of the points. For example, we can set the color of the points to indicate the number of cylinders:
ggplot(mtcars, aes(x=hp, y=mpg, colour=cyl)) + geom_point()
We can also set the size of the points based on the weight of the car:
ggplot(mtcars, aes(x=hp, y=mpg, colour=cyl, size=wt)) + geom_point()
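In the plots above, cyl is a numeric column, so ggplot uses a continuous color gradient even though there are only three distinct values (4, 6 and 8). If you would rather have one distinct color per value, one option is to wrap the column in factor(), as sketched here:

```r
library(ggplot2)

# factor(cyl) turns the numbers into categories, giving three discrete colors
p <- ggplot(mtcars, aes(x=hp, y=mpg, colour=factor(cyl))) + geom_point()
p
```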
Although it’s not directly useful for the mtcars data set, you should know that ggplot makes it easy for you to overlay multiple visualizations. Below we create both a scatter plot and a line plot.
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + geom_line()
Because we specify the data frame and aes() in the ggplot() function, we don’t have to specify them again for each geom since ggplot is already aware of them. This saves you from having to write more extensive code like this:
ggplot() +
geom_point(data=mtcars, aes(x=hp, y=mpg)) +
geom_line(data=mtcars, aes(x=hp, y=mpg))
For this example, we need to specify x=factor(cyl) to make sure that there is a boxplot drawn for each unique value of cyl.
ggplot(mtcars, aes(x=factor(cyl), y=mpg)) + geom_boxplot()
Without this step, only a single boxplot will be drawn:
ggplot(mtcars, aes(x=cyl, y=mpg)) + geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
Histogram of the number of cars with a particular fuel economy value:
ggplot(mtcars, aes(x=mpg)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also change the bin width:
ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth=5)
Count the number of cars with specific numbers of cylinders
ggplot(mtcars,
aes(x=cyl)) +
geom_bar()
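If you want to double check the bar heights, the same counts can be computed as a table with dplyr. This sketch uses count(), a dplyr function that is not otherwise covered in this practical:

```r
library(dplyr)

# One row per distinct value of cyl; the n column holds the number of cars
cyl_counts <- mtcars %>% count(cyl)
cyl_counts
##   cyl  n
## 1   4 11
## 2   6  7
## 3   8 14
```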
Pie charts can be created as well, although they require a few more steps. Part of the reason for this is that many data visualization experts discourage their use since other types of visualizations can communicate the information more effectively.
The general strategy to create a pie chart using ggplot2 is to first create a bar chart and then to use polar coordinates to turn the bars into a circle.
In other words, we start with this:
ggplot(mtcars,
aes(x=factor(1), fill=factor(cyl))) +
geom_bar(width=1)
To explain what’s going on:
x=factor(1) - This places the bars at the same location, which allows them to be stacked
fill=factor(cyl) - The fill color for the bars is based on the value of cyl
geom_bar(width=1) - This is needed so that there isn’t a hole in the plot when we use the code in the step below.
Note that geom_bar() here counts the number of rows (in the mtcars data frame) with a specific value for cyl. In other words, the size of the pie slices is not based on actual numeric values in mtcars.
We then turn this into a pie chart by adding + coord_polar(theta="y"):
# this does a count
ggplot(mtcars,
aes(x=factor(1), fill=factor(cyl))) +
geom_bar(width=1) + coord_polar(theta="y")
If we want to create a pie chart where the size of the slices corresponds to actual values in the data and not just to counts of rows with the same values, we need to take a slightly different approach.
Here we have a data frame listing types of animals and values associated with them:
animals = data.frame(animal_type = c("chickens", "cows", "pigs"),
farm_count = c(20, 10, 5))
animals
## animal_type farm_count
## 1 chickens 20
## 2 cows 10
## 3 pigs 5
We then add y=farm_count and stat="identity" to make sure that this is plotted correctly. We also use theme_void() to remove the axis labels that we saw in the previous plot.
Using stat="identity" means that the size of the pie slices is based on the values contained in the data, and not on the count of rows with the same values.
ggplot(animals,
aes(x=factor(1), y=farm_count, fill=factor(animal_type))) +
geom_bar(width=1, stat="identity") +
coord_polar(theta="y") +
theme_void()
If you want to specify a specific color for the points, you need to place that within the geom_ statement. If colour is inside the ggplot() function, then the color you get will not be the one you specify:
# these points are not blue
ggplot(mtcars, aes(x=hp, y=mpg, colour="blue")) + geom_point()
# these points are not blue either
ggplot(mtcars, aes(x=hp, y=mpg), colour="blue") + geom_point()
To fix this, colour needs to be specified inside geom_point() and outside the aes() function:
# these points are blue
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point(colour="blue")
If you want to make the points for one group red and those for another group blue, then one way to achieve this is the following. We use dplyr to create data frames where cyl is 4 and 6 respectively, and pass each one to its own geom_point() call via data=, with colour outside of aes(). Note that ggplot() is left empty here since we are plotting two different data frames.
ggplot() + geom_point(data = mtcars %>% filter(cyl == 4), aes(x=hp, y=mpg), colour="red") +
geom_point(data = mtcars %>% filter(cyl == 6), aes(x=hp, y=mpg), colour="blue")
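Another way to get the same red and blue points, sketched here as an alternative rather than as the approach used in this practical, is to map colour to factor(cyl) inside aes() and then choose the colors yourself with scale_colour_manual(). This also gives you a legend.

```r
library(dplyr)
library(ggplot2)

# Keep only the 4 and 6 cylinder cars, colour by category,
# then assign a specific color to each category label
p <- ggplot(mtcars %>% filter(cyl %in% c(4, 6)),
            aes(x=hp, y=mpg, colour=factor(cyl))) +
  geom_point() +
  scale_colour_manual(values = c("4" = "red", "6" = "blue"))
p
```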
ggplot also allows you to add horizontal, vertical and diagonal lines. These are useful if you wish to indicate some level or threshold in the data.
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + geom_hline(yintercept = 10)
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + geom_vline(xintercept = 100)
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + geom_abline(intercept=5, slope=0.1)
You can overwrite the default axis labels by using xlab() and ylab(). A title can also be added by using ggtitle().
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + xlab("Horsepower") + ylab("Miles Per Gallon") +
ggtitle("Horsepower vs. Miles Per Gallon for Some Sports Cars")
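ggplot2 also has a labs() function that sets several labels in a single call. The sketch below should give the same labels as the xlab(), ylab() and ggtitle() combination; the shortened title is just to keep the example compact.

```r
library(ggplot2)

# labs() takes the axis labels and the title as named arguments
p <- ggplot(mtcars, aes(x=hp, y=mpg)) +
  geom_point() +
  labs(x = "Horsepower",
       y = "Miles Per Gallon",
       title = "Horsepower vs. Miles Per Gallon")
p
```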
The following example is based on CO2 emissions data from the UNFCCC, specifically the “CO2 excluding LULUCF” Excel spreadsheet which we’ve transformed into a CSV file.
CSV stands for “comma-separated values” which means that you represent tabular data by using commas to separate values from different columns:
animal,farm_count
chickens,20
cows,10
pigs,5
While you can technically read Excel files into R, reading CSV files is much faster since CSV is a very simple data format.
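To see how the text above maps onto a data frame, the sketch below writes those four lines to a temporary file and reads them back with read.csv(). The temporary file and the variable names are only there to keep the example self-contained.

```r
# Write the small CSV example to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("animal,farm_count",
             "chickens,20",
             "cows,10",
             "pigs,5"),
           csv_path)

# The first line becomes the column names, each later line becomes a row
animals_from_csv <- read.csv(file = csv_path)
animals_from_csv
##     animal farm_count
## 1 chickens         20
## 2     cows         10
## 3     pigs          5
```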
Now we’ll load in an example data file. For this, you’ll need the UNFCCC_CO2_Emissions.csv file. To get it, right click on this link: UNFCCC_CO2_Emissions.csv and select “Save Target As” or “Save Link As” to save it to your computer.
Internet Explorer might try to save this as “UNFCCC_CO2_Emissions.txt”. Make sure to save it as “UNFCCC_CO2_Emissions.csv”, or adjust your code so that it reads the correct file name.
One thing you need to check is your working directory. This is the directory where R looks for any files. You can set this in RStudio via Session -> Set Working Directory -> Choose Directory
Note that for this step, you’re selecting the directory that you just placed the csv file in, not the actual csv file itself (R won’t let you select the file since it only wants to know the directory). In other words, you’re just telling R that when you mention a file it should look there. Without this, you would have to specify the entire directory structure leading to that file.
Make sure that this is set to the directory where you have placed the UNFCCC_CO2_Emissions.csv file.
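You can also check this from the console. As a quick sketch, getwd() prints the current working directory and file.exists() tells you whether R can see the file from there (the variable names are only for illustration):

```r
# Show the directory where R currently looks for files
getwd()

# TRUE if the CSV file is in the working directory, FALSE otherwise
csv_file <- "UNFCCC_CO2_Emissions.csv"
found <- file.exists(csv_file)

if (found) {
  message("Found ", csv_file)
} else {
  message("Could not find ", csv_file, " in ", getwd())
}
```

The working directory can also be set in code with setwd(), passing the path of the directory as a character string.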
df <- read.csv(file="UNFCCC_CO2_Emissions.csv")
After running this, df appears in the Environment tab, and we can see that it has 1012 rows (observations) and 3 variables (columns).
Use summary to get a quick overview of what is in the data:
summary(df)
## Country Year CO2_Mt
## Length:1012 Min. :1990 Min. : 0.079
## Class :character 1st Qu.:1995 1st Qu.: 27.274
## Mode :character Median :2001 Median : 63.196
## Mean :2001 Mean : 496.211
## 3rd Qu.:2007 3rd Qu.: 362.777
## Max. :2012 Max. :6116.441
From this we see that it contains data on years from 1990 until 2012.
If you have read everything in correctly, you should get the following values for the average and standard deviation of the CO2_Mt column.
mean(df$CO2_Mt)
## [1] 496.2108
sd(df$CO2_Mt)
## [1] 1132.462
Create line plots per country of the amount of CO2 emissions:
ggplot(df, aes(x=Year, y=CO2_Mt, colour=Country)) + geom_line()
Create a stacked area chart showing how each country’s CO2 emissions contributes to the total:
ggplot(df,
aes(x=Year, y=CO2_Mt, fill=Country)) +
geom_area()
Same plot, but using geom_line(aes(ymax=CO2_Mt), position="stack", size=0.1) to add black lines to help better distinguish the individual countries.
ggplot(df,
aes(x=Year, y=CO2_Mt, fill=Country)) +
geom_area() +
geom_line(aes(ymax=CO2_Mt), position="stack", size=0.1)
## Warning: Ignoring unknown aesthetics: ymax
In the previous plots, it’s a bit difficult to distinguish countries with similar colors. We can also use facet_wrap to create plots for individual countries.
~Country - create individual plots per distinct value in the Country column.
scales="free_y" - each plot will have its y axis scaled individually. This helps to view the trends from countries with less CO2 emissions.
ggplot(df, aes(x=Year, y=CO2_Mt)) + geom_line() + facet_wrap(~Country, scales="free_y")
The plot above shows the variation, but you’ll notice that the minimum value on the y scale is not set to zero. This means that the variation observed may not actually be that big when considering the overall amount of emissions. To fix this, we update our code to use ymin=0 so that we can get a picture of the absolute magnitude of emissions.
ggplot(df, aes(x=Year, y=CO2_Mt, ymin=0)) + geom_line() + facet_wrap(~Country, scales="free_y")
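ggplot2 also provides expand_limits(), which is another way to make sure a given value is included on an axis. The sketch below shows the idea on a tiny data frame whose numbers are made up purely for illustration (they are not taken from the UNFCCC file), so that it can be run without the CSV.

```r
library(ggplot2)

# Made-up values for two fictional countries; NOT real emissions data
toy <- data.frame(Country = rep(c("A", "B"), each = 3),
                  Year = rep(c(1990, 2000, 2010), times = 2),
                  CO2_Mt = c(50, 55, 60, 900, 950, 1000))

# expand_limits(y = 0) forces every facet's y axis to include zero
p <- ggplot(toy, aes(x=Year, y=CO2_Mt)) +
  geom_line() +
  facet_wrap(~Country, scales="free_y") +
  expand_limits(y = 0)
p
```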