1 Monte Carlo simulation

1.1 Quick review

We have a function \(x = f(u,v)\) where:

  • \(u\) has uncertainty \(s_u\)
  • \(v\) has uncertainty \(s_v\)

We then:

  • Construct a normal distribution \(f_{n,u}\) using the mean found for \(u\) and a standard deviation \(s_u\)
  • Construct a normal distribution \(f_{n,v}\) using the mean found for \(v\) and a standard deviation \(s_v\)
  • Take a random value \(u_1\) from \(f_{n,u}\) and \(v_1\) from \(f_{n,v}\) and calculate \(x_1 = f(u_1, v_1)\)
  • Repeat n times
  • Take the average (or median) of all \(x_i\) for \(i = 1, \ldots, n\); the standard deviation of the \(x_i\) gives an estimate of the uncertainty in \(x\)

1.2 Example

Assume that for a research project you are told to calculate the uncertainty for the following formula:

\[f(p,q) = \frac{p \cdot q}{p + q}\]

For both \(p\) and \(q\) we have evidence of the following means and standard deviations:

  • \(p\): \(\mu = 2\) and \(\sigma = 0.5\)
  • \(q\): \(\mu = 5\) and \(\sigma = 1\)

We assume that the uncertainties for \(p\) and \(q\) follow normal distributions.

# set the number of samples
n <- 1000
# take n samples from normal distributions with different means and standard deviations
p <- rnorm(n, mean = 2, sd = 0.5)
q <- rnorm(n, mean = 5, sd = 1)

Calculate values for \(f(p,q) = \frac{p \cdot q}{p + q}\)

f <- p*q/(p+q)

Plot the histogram, using 100 bins (use breaks=100)

hist(f, breaks=100)

We now calculate the mean:

f_bar <- mean(f)
f_bar
## [1] 1.394162

and then calculate the standard deviation:

sig <- sd(f)
sig
## [1] 0.2719069

Below we can see how both the mean and the standard deviation converge on a single value as we increase the number of samples drawn from the distributions we defined for \(p\) and \(q\) (see pages 33–34 of Lecture 7 for more explanation). Note the log scale on the x axis.
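
The convergence plot itself is not reproduced here, but a rough sketch of how you could make one yourself (this is an illustration, not the exact code from the lecture) is:

# sample sizes from 10 up to 100000, roughly evenly spaced on a log scale
n_values <- round(10^seq(1, 5, by = 0.25))
means <- numeric(length(n_values))
sds   <- numeric(length(n_values))
for (i in seq_along(n_values)) {
  p_i <- rnorm(n_values[i], mean = 2, sd = 0.5)
  q_i <- rnorm(n_values[i], mean = 5, sd = 1)
  f_i <- p_i * q_i / (p_i + q_i)
  means[i] <- mean(f_i)
  sds[i]   <- sd(f_i)
}
# plot the mean and standard deviation against the number of samples, log x axis
plot(n_values, means, log = "x", type = "b", xlab = "number of samples", ylab = "mean of f")
plot(n_values, sds,   log = "x", type = "b", xlab = "number of samples", ylab = "sd of f")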

1.3 Exercise

Now you’ve been given a different formula:

\[f(u,v) = u^2 + u + v\]

…and you’ve been told to perform a monte carlo simulation with 10000 samples this time. We again assume that the variables u and v are normally distributed with the following means and standard deviations:

  • \(u\): \(\mu = 1.2\) and \(\sigma = 3\)
  • \(v\): \(\mu = 5\) and \(\sigma = 1.7\)

Now try to recreate the following results on your own.

Plot the histogram of \(x = f(u,v) = u^2 + u + v\)

Calculate mean(x) and sd(x) for \(x = f(u,v)\)

## [1] 16.51136
## [1] 16.35522
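
For reference, a minimal sketch along these lines will reproduce this kind of result (your exact numbers will differ slightly because of the random sampling):

# 10000 samples from the distributions given above
n <- 10000
u <- rnorm(n, mean = 1.2, sd = 3)
v <- rnorm(n, mean = 5, sd = 1.7)
x <- u^2 + u + v
hist(x, breaks = 100)
mean(x)
sd(x)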

2 Working with Real Data Sets

2.1 Example

For this example, you’ll need a csv file. To get it, right click on this link: Guerry.csv and select “Save Target As” or “Save Link As” to save it to your computer.

Internet Explorer might try to save this as “Guerry.txt”; make sure to save it as “Guerry.csv”, or adjust your code so that it reads the correct file.

One thing you need to check is your working directory. This is the directory where R looks for any files. You can set this in RStudio via Session -> Set Working Directory -> Choose Directory.
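
You can also set the working directory from the console with setwd(); the path below is only a placeholder, so replace it with the folder where you actually saved the file:

setwd("C:/path/to/your/folder")   # hypothetical path - use your own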

Once you’ve done all this, you can now read in the data file:

data1 <- read.csv(file ="Guerry.csv")

Note that every time you read in a .csv file you will have a data frame, even if there is only a single column present.

As mentioned in the previous practicals, we use summary to get an idea about what is in the data frame:

summary(data1)
##       dept         Region      Department   Crime_pers      Crime_prop   
##  Min.   :  1.00   C   :17   Ain     : 1   Min.   : 2199   Min.   : 1368  
##  1st Qu.: 24.25   E   :17   Aisne   : 1   1st Qu.:14156   1st Qu.: 5933  
##  Median : 45.50   N   :17   Allier  : 1   Median :18748   Median : 7595  
##  Mean   : 46.88   S   :17   Ardeche : 1   Mean   :19754   Mean   : 7843  
##  3rd Qu.: 66.75   W   :17   Ardennes: 1   3rd Qu.:25938   3rd Qu.: 9182  
##  Max.   :200.00   NA's: 1   Ariege  : 1   Max.   :37014   Max.   :20235  
##                             (Other) :80                                  
##     Literacy       Donations        Infants         Suicides     
##  Min.   :12.00   Min.   : 1246   Min.   : 2660   Min.   :  3460  
##  1st Qu.:25.00   1st Qu.: 3447   1st Qu.:14300   1st Qu.: 15463  
##  Median :38.00   Median : 5020   Median :17142   Median : 26744  
##  Mean   :39.26   Mean   : 7076   Mean   :19050   Mean   : 36523  
##  3rd Qu.:51.75   3rd Qu.: 9447   3rd Qu.:22682   3rd Qu.: 44058  
##  Max.   :74.00   Max.   :37015   Max.   :62486   Max.   :163241  
##                                                                  
##   MainCity      Wealth         Commerce         Clergy     
##  1:Sm :10   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  2:Med:66   1st Qu.:22.25   1st Qu.:21.25   1st Qu.:22.25  
##  3:Lg :10   Median :43.50   Median :42.50   Median :43.50  
##             Mean   :43.50   Mean   :42.80   Mean   :43.43  
##             3rd Qu.:64.75   3rd Qu.:63.75   3rd Qu.:64.75  
##             Max.   :86.00   Max.   :86.00   Max.   :86.00  
##                                                            
##  Crime_parents    Infanticide    Donation_clergy    Lottery     
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:22.25   1st Qu.:22.25   1st Qu.:22.25   1st Qu.:22.25  
##  Median :43.50   Median :43.50   Median :43.50   Median :43.50  
##  Mean   :43.50   Mean   :43.51   Mean   :43.50   Mean   :43.50  
##  3rd Qu.:64.75   3rd Qu.:64.75   3rd Qu.:64.75   3rd Qu.:64.75  
##  Max.   :86.00   Max.   :86.00   Max.   :86.00   Max.   :86.00  
##                                                                 
##    Desertion      Instruction     Prostitutes        Distance    
##  Min.   : 1.00   Min.   : 1.00   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:22.25   1st Qu.:23.25   1st Qu.:   6.0   1st Qu.:121.4  
##  Median :43.50   Median :41.50   Median :  33.0   Median :200.6  
##  Mean   :43.50   Mean   :43.13   Mean   : 141.9   Mean   :208.0  
##  3rd Qu.:64.75   3rd Qu.:64.75   3rd Qu.: 113.8   3rd Qu.:289.7  
##  Max.   :86.00   Max.   :86.00   Max.   :4744.0   Max.   :539.2  
##                                                                  
##       Area          Pop1831     
##  Min.   :  762   Min.   :129.1  
##  1st Qu.: 5401   1st Qu.:283.0  
##  Median : 6070   Median :346.2  
##  Mean   : 6147   Mean   :378.6  
##  3rd Qu.: 6816   3rd Qu.:444.4  
##  Max.   :10000   Max.   :989.9  
## 

If we take a closer look at the results, we see that the dept column is numeric, since summary displays its minimum, quartiles, median, mean and maximum.

      dept    
 Min.   :  1.00
 1st Qu.: 24.25
 Median : 45.50
 Mean   : 46.88
 3rd Qu.: 66.75
 Max.   :200.00

However, for Region, we don’t see quantiles, etc but we are looking at counts of values, as this column contains factors indicating the particular region (C, E, N, S or W). Also note that NA's: 1 tells us that we have one NA which indicates a missing value (Not Available). We’ll discuss later in more detail how these can be dealt with in R.

 Region
C   :17
E   :17
N   :17
S   :17
W   :17
NA's: 1

We can find the number of variables (columns)

length(data1)
## [1] 23

or by:

ncol(data1)
## [1] 23

Similarly, the number of rows can be found via:

nrow(data1)
## [1] 86

And we can find the dimension of the data frame (i.e. the number of rows and columns) via:

dim(data1)
## [1] 86 23

The data you have just loaded in is from the Guerry R package, so to get an idea about the variables we need to install the package:

install.packages("Guerry")

Load the library:

library(Guerry)

and then bring up the help information that relates to this specific data frame which is included in the package:

help(Guerry)

This data set contains statistics about French departments. As you can see, there are several different types of columns:

  • A few categorical variables: Region, MainCity
  • A few numerical variables: e.g., Crime_pers, Distance
  • A few rank variables: Donation_clergy, etc. These show which department ranks 1st, 2nd, and so on.

2.1.1 Select a numerical variable and plot a histogram

library(ggplot2)
ggplot(data1, aes(x=Infants)) + geom_histogram(bins=50)

2.1.2 Make a box plot of Literacy per city size (MainCity)

ggplot(data1, aes(x=MainCity, y=Literacy)) + geom_boxplot()

2.1.3 Managing outliers

In the plot above, we see that the box for the small cities looks strange, as it is quite skewed. We now take a closer look.

First we want to know how many observations fall into the small, medium and large city categories:

a <- table(data1$MainCity)
a
## 
##  1:Sm 2:Med  3:Lg 
##    10    66    10

This shows that we have 10 observations for small cities.

2.1.3.1 Creating subsets of data frames

Next we want to see the actual data points for the small cities only. To do this, we’ll use the dplyr package. The dplyr library takes a very similar approach to what we saw with ggplot: we start off with data in a data frame, and then we perform a series of operations or data transformations which are connected together via the %>% symbol, which acts as a sort of “pipe” through which data flows.

In very general terms, with the dplyr library we can create a series of statements that look like:

some data frame %>% operation 1 %>% operation 2

First we need to install and load the dplyr package

install.packages("dplyr")
library(dplyr)

In the following example, we take a subset of the mtcars data set. We first keep a selection of rows by using filter, and then use select to keep certain columns.

mtcars %>% 
  filter(hp >= 250 | hp < 100) %>% 
  select(mpg, cyl)

In other words:

  • filter(hp >= 250 | hp < 100) - keep all rows where hp is greater than or equal to 250 or is less than 100.
    • Note that the | symbol specifies a logical “or”. This just means that you want to find values matching hp >= 250 or hp < 100
    • The & symbol is a logical “and”. You could modify the statement above to use hp >= 250 & hp < 100, although this would return no rows since a value can’t be both greater than 250 and less than 100.
  • select(mpg, cyl) - keep only the mpg and cyl columns.
  • The command above is basically of the form: mtcars %>% “keep these rows” %>% “keep these columns”

We can also filter on multiple columns by adding commas in the filter function.

mtcars %>% 
  filter((hp >= 250 | hp < 100), 
         cyl == 4, 
         mpg > 30) %>% 
  select(mpg, cyl)
##    mpg cyl
## 1 32.4   4
## 2 30.4   4
## 3 33.9   4

A few notes:

  • The == symbol in cyl == 4 means that we check if the value is equal to 4.
  • We use parentheses in (hp >= 250 | hp < 100) to group these two conditions together, as we want to match either of these conditions.
  • In the filter function, each of these statements should be set up so that it returns TRUE or FALSE, as shown in the examples below. Only rows with TRUE for all of the tests specified in filter will be returned.

# set the value for a variable
hp_val = 300
# is this bigger than or equal to 250?
hp_val >= 250
## [1] TRUE
# is this bigger than or equal to 250 or less than 100?
(hp_val >= 250 | hp_val < 100)
## [1] TRUE
# is it equal to 300?
hp_val == 300
## [1] TRUE

Regarding NA values:

  • If you are doing a numeric comparison (e.g. hp >= 250) using the filter function, then rows with NA values will be filtered out.
  • If you want to do a numeric comparison and also keep NA values, then you need to specify something like hp >= 250 | is.na(hp). The function is.na() will return TRUE for any NA values. A small example is shown below.
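
This behaviour can be seen with a tiny made-up data frame (the values are purely illustrative):

df_na <- data.frame(hp = c(300, NA, 90))
# the row with NA is dropped by the plain comparison
df_na %>% filter(hp >= 250)
# the row with NA is kept when we test for it explicitly
df_na %>% filter(hp >= 250 | is.na(hp))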

Now we use dplyr to make subsets of data1

small_data <- data1 %>% filter(MainCity ==  "1:Sm") %>% select(Crime_prop)
print(small_data)
##    Crime_prop
## 1        7289
## 2        8174
## 3       10263
## 4        9597
## 5       20235
## 6        6170
## 7        5990
## 8        9539
## 9        7770
## 10       7566

2.1.4 Filter outliers and make a boxplot of crime vs city size

From above, we can see that there’s a large outlier in the small cities, and this would disturb a test like ANOVA. We need to remove this.

We then use the filter function with dplyr to keep all rows where we don’t have a combination of a small city and a Crime_prop greater than 15000.

data2 = data1 %>% filter(!(MainCity == "1:Sm" & Crime_prop > 15000))

You’ll notice that the filter statement is a bit more complex this time. Starting from the middle of the statement:

  • & is again our logical “and” which means that we’re looking for situations where the conditions on either side are both true.
  • MainCity == "1:Sm" and Crime_prop > 15000 show that we are trying to match rows where the values meet these conditions.
  • Most important is the ! symbol on the outside of this clause, which says that we only want to keep rows that don’t match the conditions specified in the clause (MainCity == "1:Sm" & Crime_prop > 15000)

Now we can see that the boxplot looks more reasonable.

ggplot(data2, aes(x=MainCity, y=Crime_prop)) + geom_boxplot()

2.1.5 Select any row and column by indices

You can also create subsets of data frames by using numeric indices that indicate the row and column numbers. This technique will work with matrices as well. Note in the examples below that in some cases you’re specifying a sequence of numbers (5:9 = 5,6,7,8,9), while in other cases, you are specifying a set of numbers (c(1,5,9) is just the numbers 1, 5 and 9). You can also mix in the names of the columns with numeric indices as well.

b <- data1[1:3, 5:9]
b
##   Crime_prop Literacy Donations Infants Suicides
## 1      15890       37      5098   33120    35039
## 2       5521       51      8901   14572    12831
## 3       7925       13     10973   17044   114121
c <- data1[c(1,5,9), 7]
c
## [1] 5098 6962 3608
d <- data1[c(1,5,9),c("Region", "Infants")]
d
##   Region Infants
## 1      E   33120
## 5      E   23076
## 9      E   18642

Note that both b and d are data frames, although c is a vector of integers.

2.1.6 Select individual columns

x <- data1$Crime_prop

or

x <- data1$Crime_prop[1:10]  #just select the first 10 rows

2.1.7 Make the whole data frame into a matrix

data1_m <- as.matrix(data1)

We can get the total number of all elements in the matrix:

length(data1_m)
## [1] 1978

Note that we can’t access columns via $ anymore. In other words, we can’t do data1_m$Crime_prop
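
If you need a column from the matrix, you can instead index it by name or by position, for example:

data1_m[, "Crime_prop"]
data1_m[, 5]   # the fifth column, which is Crime_prop

Also note that a matrix can only hold a single data type, so as.matrix() converts everything here to character strings, because data1 contains text columns such as Department.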

2.1.8 Make a table of city size vs region

tab = table(data1$MainCity, data1$Region)
tab
##        
##          C  E  N  S  W
##   1:Sm   1  4  0  3  2
##   2:Med 16 11 13 12 13
##   3:Lg   0  2  4  2  2

3 Dates

3.1 Integers to Dates

Dates (not involving times of the day) are natively stored on your computer as integer numbers, with 1 Jan 1970 = 0 and 1 corresponding to 1 day. This is why you need to specify origin="1970-01-01" in the example below.

For example, the number 365 corresponds to January 1, 1971. To tell R that a number is a date, and not just a number, you use the command as.Date:

as.Date(365, origin="1970-01-01")
## [1] "1971-01-01"

3.2 Strings to Dates

R can also convert strings to dates

as.Date("1971-01-01")   # always use the order of year, month, day
## [1] "1971-01-01"

3.3 Vectors of Date Ranges

You can make vectors filled with series of dates using seq

t = seq(from = as.Date("2000-11-30"), to = as.Date("2000-12-26"), by="1 day")
t
##  [1] "2000-11-30" "2000-12-01" "2000-12-02" "2000-12-03" "2000-12-04"
##  [6] "2000-12-05" "2000-12-06" "2000-12-07" "2000-12-08" "2000-12-09"
## [11] "2000-12-10" "2000-12-11" "2000-12-12" "2000-12-13" "2000-12-14"
## [16] "2000-12-15" "2000-12-16" "2000-12-17" "2000-12-18" "2000-12-19"
## [21] "2000-12-20" "2000-12-21" "2000-12-22" "2000-12-23" "2000-12-24"
## [26] "2000-12-25" "2000-12-26"

by= can be, e.g., "1 month", "2 years", "5 days", etc.

Now try to make a weekly time vector from 5 June 2005 to 20 August 2006. You should see:

##  [1] "2005-06-05" "2005-06-12" "2005-06-19" "2005-06-26" "2005-07-03"
##  [6] "2005-07-10" "2005-07-17" "2005-07-24" "2005-07-31" "2005-08-07"
## [11] "2005-08-14" "2005-08-21" "2005-08-28" "2005-09-04" "2005-09-11"
## [16] "2005-09-18" "2005-09-25" "2005-10-02" "2005-10-09" "2005-10-16"
## [21] "2005-10-23" "2005-10-30" "2005-11-06" "2005-11-13" "2005-11-20"
## [26] "2005-11-27" "2005-12-04" "2005-12-11" "2005-12-18" "2005-12-25"
## [31] "2006-01-01" "2006-01-08" "2006-01-15" "2006-01-22" "2006-01-29"
## [36] "2006-02-05" "2006-02-12" "2006-02-19" "2006-02-26" "2006-03-05"
## [41] "2006-03-12" "2006-03-19" "2006-03-26" "2006-04-02" "2006-04-09"
## [46] "2006-04-16" "2006-04-23" "2006-04-30" "2006-05-07" "2006-05-14"
## [51] "2006-05-21" "2006-05-28" "2006-06-04" "2006-06-11" "2006-06-18"
## [56] "2006-06-25" "2006-07-02" "2006-07-09" "2006-07-16" "2006-07-23"
## [61] "2006-07-30" "2006-08-06" "2006-08-13" "2006-08-20"
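
If you get stuck, one possible solution follows the same pattern as the example above:

seq(from = as.Date("2005-06-05"), to = as.Date("2006-08-20"), by = "1 week")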

In general, you can treat date & time vectors as normal vectors, which means that you can use them in plots, add and subtract them, etc.
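
For example, using the vector t defined above:

t[2] - t[1]    # difference between two dates, reported in days
t + 7          # shift every date one week later
range(t)       # earliest and latest dates in the vector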

4 Dates and Times

POSIXct does more or less the same as Date, but it stores the time as the number of seconds since midnight GMT on 1970-01-01. As a result, it allows us to store data which expresses the time of day on a specific date.

t.str = "2013-12-19 10:17:07"    # default format
t.obj = as.POSIXct(t.str) 
t.obj
## [1] "2013-12-19 10:17:07 CET"

If your data are in any format other than the default one, you can tell R which format the data are in:

t.str1 = "19-12-2013 10:17:07"
t.obj1 = as.POSIXct(t.str1, format="%d-%m-%Y %H:%M:%S")
t.obj1
## [1] "2013-12-19 10:17:07 CET"

"%d-%m-%Y %H:%M:%S" says that we have a time that is specified in the form of day, month and year (separated by -) which is then followed by a space and the hours, minutes and seconds (separated by :).

To get an overview of all the format codes, type help(strptime). Below is a table which summarizes commonly used codes.

Code Description Example
%a Abbreviated weekday name Mon
%A Full weekday name Monday
%b Abbreviated month name Jan
%B Full month name January
%c Date and time. Locale-specific on output, “%a %b %e %H:%M:%S %Y” on input.
%d Day of the month as decimal number (01–31) 01
%H Hours as decimal number (00–23) 16
%I Hours as decimal number (01–12) 08
%j Day of year as decimal number (001–366) 234
%m Month as decimal number (01–12) 07
%M Minute as decimal number (00–59) 12
%p AM/PM indicator AM
%S Second as integer (00–59) 35
%U Week of the year as decimal number (00–53) using Sunday as the first day of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%w Weekday as decimal number (0–6, Sunday is 0). 1
%W Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%x Date. Locale-specific on output, “%y/%m/%d” on input.
%X Time. Locale-specific on output, “%H:%M:%S” on input.
%y Year without century (00–99) 91
%Y Year with century 1991
%z Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC
%Z Time zone abbreviation as a character string PST

4.1 Changing format of date

You can change the format of the output date using the codes mentioned above.

Code Description Example
%a Abbreviated weekday name Mon
%d Day of the month as decimal number (01–31) 01
%b Abbreviated month name Jan
%Y Year with century 1991

t.obj1
## [1] "2013-12-19 10:17:07 CET"
format(t.obj1, "%a %d %b. %Y")
## [1] "Thu 19 Dec. 2013"

4.2 Adding and subtracting dates

You can add and subtract POSIXt objects and perform logical operations on them. If you have a vector of POSIXt objects, you can use the min(), max() and range() functions.

t.str2 = "2013-12-19 18:20:07"
t.obj2 = as.POSIXct(t.str2)

Subtract two date times:

t.obj1 - t.obj2
## Time difference of -8.05 hours
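
A few more quick illustrations using the objects defined above:

t.obj2 + 3600             # adding a number adds seconds, so this is one hour later
t.obj1 < t.obj2           # TRUE, since t.obj1 is earlier in the day
range(c(t.obj1, t.obj2))  # earliest and latest of a vector of times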

4.3 Exercise 1

Download the Pr_20May1.csv to your computer in the same way that you did for the other csv file in this practical.

This data set contains a time series of monthly temperature data from Jan. 1946 to Dec. 2014; however, the file you just downloaded only has a single column with the temperatures and does not include a column corresponding to these dates.

Fix this by making a vector with the dates corresponding to each data point, and make a plot of the temperatures over time.
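
If you get stuck, a sketch of one possible approach is shown below; the column name temperature is hypothetical, so check the actual name with names(temps) after reading in the file:

temps <- read.csv(file = "Pr_20May1.csv")
# one date per month from Jan 1946 to Dec 2014
dates <- seq(from = as.Date("1946-01-01"), to = as.Date("2014-12-01"), by = "1 month")
plot(dates, temps$temperature, type = "l")   # temperature is a hypothetical column name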

5 Managing issues in real data

When dealing with real data sets, you will commonly run into several issues, some of which are described below.

5.1 Inconvenient column names

You may find that the column names for a data frame are excessively long, or may contain extra spaces or strange characters. You can fix this by using the colnames command.

With a hypothetical data frame called input1, you can change the first and second column names by doing:

colnames(input1)[1] <- "x"  # gives the first column the name "x"
colnames(input1)[2] <- "y"      # gives the second column the name "y"

If there are only two columns in the entire data frame, then you could do:

colnames(input1) = c("x", "y")

If there are more than two columns, then you have to specify the indices, otherwise you will get an error.

colnames(input1)[c(1:2)] = c("x", "y")

5.2 Replacing values with NA

Real data sets often have missing values, which are represented in R as NA. In some data sets you may see them represented in other forms, like NaN (not a number), or by an unrealistic number (e.g., -999). For R to handle missing values correctly, you have to recode them as NA; if they are not represented like that, R does not recognize them as missing and will try to process them as actual numbers.

Let’s say that for a data frame we have called df, NA values are currently represented as -999. If we type into the console df == -999, we’ll get a matrix of TRUE, FALSE or NA values (if NA is already present).

To replace these -999 values, we use the command below which says that for all locations in the data frame with a value of -999, we should assign to those locations the value NA instead.

df[df == -999] <- NA
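
As a quick illustration with a small made-up data frame:

df <- data.frame(a = c(1, -999, 3), b = c(-999, 5, 6))
df[df == -999] <- NA
df
#    a  b
# 1  1 NA
# 2 NA  5
# 3  3  6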

5.2.1 Exercise 1

For this exercise, download the Carbon.csv data and read it into R as a data frame called input1.

After loading the csv file into R, try to replace the -999 values with NA. If you run summary before the replacement, you should see:

##           date_start elementary.carbon.from.fossil.fuels     BB_EC        
##  01/03/12 12:35: 1   Min.   :0.1077                      Min.   :0.01385  
##  01/06/11 13:30: 1   1st Qu.:0.2289                      1st Qu.:0.02643  
##  02/05/11 15:15: 1   Median :0.3446                      Median :0.06442  
##  02/12/11 09:15: 1   Mean   :0.5839                      Mean   :0.12673  
##  06/06/11 11:45: 1   3rd Qu.:0.6550                      3rd Qu.:0.14614  
##  07/07/11 13:30: 1   Max.   :2.2389                      Max.   :0.66726  
##  (Other)       :17                                                        
##      FF_OC               BB_OC               BIO_OC      
##  Min.   :-999.0000   Min.   :-999.0000   Min.   :0.2535  
##  1st Qu.:   0.2770   1st Qu.:   0.1179   1st Qu.:0.4915  
##  Median :   0.4028   Median :   0.3220   Median :0.7457  
##  Mean   : -42.8049   Mean   : -42.8112   Mean   :1.1032  
##  3rd Qu.:   0.5454   3rd Qu.:   0.7306   3rd Qu.:1.2399  
##  Max.   :   2.2815   Max.   :   3.3353   Max.   :4.0950  
##                                                          
##     FF_WIOC              C_WIOC           FF_WSOC        
##  Min.   :-999.0000   Min.   :0.04875   Min.   :-0.16657  
##  1st Qu.:   0.1436   1st Qu.:0.12869   1st Qu.: 0.09897  
##  Median :   0.1950   Median :0.26482   Median : 0.17624  
##  Mean   : -86.5462   Mean   :0.42041   Mean   : 0.28935  
##  3rd Qu.:   0.3089   3rd Qu.:0.42538   3rd Qu.: 0.35644  
##  Max.   :   1.4266   Max.   :1.80312   Max.   : 1.05329  
##                                                          
##      C_WSOC         
##  Min.   :-999.0000  
##  1st Qu.:   0.4852  
##  Median :   0.8561  
##  Mean   : -42.1394  
##  3rd Qu.:   1.5079  
##  Max.   :   4.2263  
## 

After replacing them, using summary you should see that several columns now have NA values:

##           date_start elementary.carbon.from.fossil.fuels     BB_EC        
##  01/03/12 12:35: 1   Min.   :0.1077                      Min.   :0.01385  
##  01/06/11 13:30: 1   1st Qu.:0.2289                      1st Qu.:0.02643  
##  02/05/11 15:15: 1   Median :0.3446                      Median :0.06442  
##  02/12/11 09:15: 1   Mean   :0.5839                      Mean   :0.12673  
##  06/06/11 11:45: 1   3rd Qu.:0.6550                      3rd Qu.:0.14614  
##  07/07/11 13:30: 1   Max.   :2.2389                      Max.   :0.66726  
##  (Other)       :17                                                        
##      FF_OC            BB_OC             BIO_OC          FF_WIOC       
##  Min.   :0.1173   Min.   :0.06923   Min.   :0.2535   Min.   :0.08086  
##  1st Qu.:0.2867   1st Qu.:0.12886   1st Qu.:0.4915   1st Qu.:0.17058  
##  Median :0.4328   Median :0.32871   Median :0.7457   Median :0.19604  
##  Mean   :0.6585   Mean   :0.65190   Mean   :1.1032   Mean   :0.35414  
##  3rd Qu.:0.5580   3rd Qu.:0.77068   3rd Qu.:1.2399   3rd Qu.:0.31171  
##  Max.   :2.2815   Max.   :3.33533   Max.   :4.0950   Max.   :1.42661  
##  NA's   :1        NA's   :1                          NA's   :2        
##      C_WIOC           FF_WSOC             C_WSOC      
##  Min.   :0.04875   Min.   :-0.16657   Min.   :0.2886  
##  1st Qu.:0.12869   1st Qu.: 0.09897   1st Qu.:0.5394  
##  Median :0.26482   Median : 0.17624   Median :0.8566  
##  Mean   :0.42041   Mean   : 0.28935   Mean   :1.3542  
##  3rd Qu.:0.42538   3rd Qu.: 0.35644   3rd Qu.:1.5549  
##  Max.   :1.80312   Max.   : 1.05329   Max.   :4.2263  
##                                       NA's   :1

5.2.2 Exercise 2

Using the same Carbon.csv data set:

  • Make a time vector from the date_start column. If you type class(input1$date_start) you’ll see that it’s currently represented as a factor.
  • Make a new vector that formats the date so that it just shows the month (see the examples above). You may also have to use as.Date() to convert the date_start values from a factor to a Date object.

If you do this correctly, you should see:

months = format(as.Date(input1$date_start), "%B")
months
##  [1] "February"  "April"     "April"     "April"     "May"      
##  [6] "May"       "June"      "June"      "July"      "July"     
## [11] "August"    "August"    "September" "September" "October"  
## [16] "October"   "November"  "November"  "December"  "January"  
## [21] "February"  "March"     "August"

We can now add this vector directly to the data frame:

input1$months = months

5.3 Creating Category Variables

Sometimes real data sets don’t directly contain all the information you want. For example, you might have data covering a year and wonder whether there is a difference between summer and winter, or between day and night, but you only have a vector with the date and time.

We may want to create new category variables which could be useful for ANOVA, box plots, etc.

For this example, assume that input1$y varies between 0 and 5. Given different ranges of y, we can specify categories of "low", "medium" and "high":

x1 <- c()   # start with empty vector
x1[input1$y < 1] <- "low"
x1[(input1$y >=1) & (input1$y < 3)] <- "medium"
x1[input1$y>=3] <- "high"

x1 is now a vector of the same length as input1$y and consists of the values "low", "medium" and "high".
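
The same categories can also be created with the built-in cut() function; this is just an alternative sketch, and note that cut() returns a factor rather than a character vector:

x1 <- cut(input1$y, breaks = c(-Inf, 1, 3, Inf), labels = c("low", "medium", "high"), right = FALSE)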

5.3.1 Exercise 3

Using the Carbon.csv data set, make a vector that correctly assigns the seasons to your data set (use the meteorological definition of the seasons, e.g. winter is Dec, Jan, Feb; spring is Mar, Apr, May; etc.).

Note that you can use the syntax below for matching text values.

x1[input1$y %in% c("big", "large", "enormous")] <- "huge"

5.3.2 Exercise 4

Now add the vector of months and the vector of seasons to your data frame.
