Assume that for a research project you are told to calculate the uncertainty for the following formula:
\[f(p,q) = \frac{p \cdot q}{p + q}\]
For both \(p\) and \(q\) we have evidence of the following means and standard deviations (these are the values used in the code below):

Variable | Mean | Standard deviation |
---|---|---|
\(p\) | 2 | 0.5 |
\(q\) | 5 | 1 |

We assume that the uncertainties for \(p\) and \(q\) follow normal distributions.
# take n samples
n <- 1000
# take n samples from normal distributions with different means and standard deviations
p <- rnorm(n, mean = 2, sd = 0.5)
q <- rnorm(n, mean = 5, sd = 1)
Calculate values for \(f(p,q) = \frac{p \cdot q}{p + q}\)
f <- p*q/(p+q)
Plot the histogram, using 100 bins (use breaks=100):
hist(f, breaks=100)
We now calculate the mean:
f_bar <- mean(f)
f_bar
## [1] 1.394162
and then calculate the standard deviation:
sig <- sd(f)
sig
## [1] 0.2719069
Below we can see how both the mean and the standard deviation converge on a single value as we increase the number of samples drawn from the distributions we defined for \(p\) and \(q\) (see pages 33 and 34 of Lecture 7 for more explanation). Note the log scale for the x axis.
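A convergence plot like the one described above could be produced along the following lines. This is a sketch, not the lecture's exact code: the seed and the set of sample sizes are choices made here for reproducibility.

```r
# sketch: recompute the mean and sd of f for growing, log-spaced sample sizes
set.seed(42)                           # fix the seed so the sketch is reproducible
ns <- round(10^seq(1, 5, by = 0.25))   # sample sizes from 10 to 100000
res <- sapply(ns, function(n) {
  p <- rnorm(n, mean = 2, sd = 0.5)
  q <- rnorm(n, mean = 5, sd = 1)
  f <- p*q/(p+q)
  c(mean = mean(f), sd = sd(f))
})
# note the log scale for the x axis
plot(ns, res["mean", ], log = "x", type = "b",
     xlab = "number of samples", ylab = "mean of f")
```

The same `res["sd", ]` row can be plotted to see the standard deviation converge as well.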
Now you’ve been given a different formula:
\[f(u,v) = u^2 + u + v\]
…and you've been told to perform a Monte Carlo simulation with 10000 samples this time. We again assume that the variables u and v are normally distributed with the following means and standard deviations:
Now try to recreate the following results on your own.
Plot the histogram of \(x = f(u,v) = u^2 + u + v\).
Calculate mean(x) and sd(x) for \(x = f(u,v)\). You should see:
## [1] 16.51136
## [1] 16.35522
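The recipe is the same as in the first exercise. The sketch below shows the structure only; the means and standard deviations used here are placeholders (the table with the actual values belongs just above), so substitute the assignment's values before comparing with the results shown.

```r
# outline of the Monte Carlo recipe for f(u,v) = u^2 + u + v;
# the means/sds below are placeholders, not the assignment's values
n <- 10000
set.seed(1)
u <- rnorm(n, mean = 3, sd = 2)  # placeholder mean and sd
v <- rnorm(n, mean = 5, sd = 1)  # placeholder mean and sd
x <- u^2 + u + v
hist(x, breaks = 100)
mean(x)
sd(x)
```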
For this example, you’ll need a csv file. To get it, right click on this link: Guerry.csv and select “Save Target As” or “Save Link As” to save it to your computer.
Internet Explorer might try to save this as "Guerry.txt"; make sure to save it as "Guerry.csv", or adjust your code so that it reads the correct file name.
One thing you need to check is your working directory. This is the directory where R looks for any files. You can set this in RStudio via Session -> Set Working Directory -> Choose Directory.
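You can also check or set the working directory from the console. The paths below are examples only; use the folder where you actually saved the file.

```r
# print the current working directory (where R looks for files)
getwd()
# setting it from the console; these paths are examples only
# setwd("C:/Users/yourname/Documents/practical")   # Windows
# setwd("~/Documents/practical")                   # macOS / Linux
```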
Once you’ve done all this, you can now read in the data file:
data1 <- read.csv(file = "Guerry.csv")
Note that every time you read in a .csv file you will have a data frame, even if there is only a single column present.
As mentioned in the previous practicals, we use summary to get an idea about what is in the data frame:
summary(data1)
## dept Region Department Crime_pers Crime_prop
## Min. : 1.00 C :17 Ain : 1 Min. : 2199 Min. : 1368
## 1st Qu.: 24.25 E :17 Aisne : 1 1st Qu.:14156 1st Qu.: 5933
## Median : 45.50 N :17 Allier : 1 Median :18748 Median : 7595
## Mean : 46.88 S :17 Ardeche : 1 Mean :19754 Mean : 7843
## 3rd Qu.: 66.75 W :17 Ardennes: 1 3rd Qu.:25938 3rd Qu.: 9182
## Max. :200.00 NA's: 1 Ariege : 1 Max. :37014 Max. :20235
## (Other) :80
## Literacy Donations Infants Suicides
## Min. :12.00 Min. : 1246 Min. : 2660 Min. : 3460
## 1st Qu.:25.00 1st Qu.: 3447 1st Qu.:14300 1st Qu.: 15463
## Median :38.00 Median : 5020 Median :17142 Median : 26744
## Mean :39.26 Mean : 7076 Mean :19050 Mean : 36523
## 3rd Qu.:51.75 3rd Qu.: 9447 3rd Qu.:22682 3rd Qu.: 44058
## Max. :74.00 Max. :37015 Max. :62486 Max. :163241
##
## MainCity Wealth Commerce Clergy
## 1:Sm :10 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 2:Med:66 1st Qu.:22.25 1st Qu.:21.25 1st Qu.:22.25
## 3:Lg :10 Median :43.50 Median :42.50 Median :43.50
## Mean :43.50 Mean :42.80 Mean :43.43
## 3rd Qu.:64.75 3rd Qu.:63.75 3rd Qu.:64.75
## Max. :86.00 Max. :86.00 Max. :86.00
##
## Crime_parents Infanticide Donation_clergy Lottery
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.:22.25 1st Qu.:22.25 1st Qu.:22.25 1st Qu.:22.25
## Median :43.50 Median :43.50 Median :43.50 Median :43.50
## Mean :43.50 Mean :43.51 Mean :43.50 Mean :43.50
## 3rd Qu.:64.75 3rd Qu.:64.75 3rd Qu.:64.75 3rd Qu.:64.75
## Max. :86.00 Max. :86.00 Max. :86.00 Max. :86.00
##
## Desertion Instruction Prostitutes Distance
## Min. : 1.00 Min. : 1.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:22.25 1st Qu.:23.25 1st Qu.: 6.0 1st Qu.:121.4
## Median :43.50 Median :41.50 Median : 33.0 Median :200.6
## Mean :43.50 Mean :43.13 Mean : 141.9 Mean :208.0
## 3rd Qu.:64.75 3rd Qu.:64.75 3rd Qu.: 113.8 3rd Qu.:289.7
## Max. :86.00 Max. :86.00 Max. :4744.0 Max. :539.2
##
## Area Pop1831
## Min. : 762 Min. :129.1
## 1st Qu.: 5401 1st Qu.:283.0
## Median : 6070 Median :346.2
## Mean : 6147 Mean :378.6
## 3rd Qu.: 6816 3rd Qu.:444.4
## Max. :10000 Max. :989.9
##
If we take a closer look at the results, we see that the dept column is numeric, since summary displays its minimum, quartiles, median, mean and maximum:
dept
Min. : 1.00
1st Qu.: 24.25
Median : 45.50
Mean : 46.88
3rd Qu.: 66.75
Max. :200.00
However, for Region we don't see quartiles; instead we see counts of values, as this column contains factors indicating the particular region (C, E, N, S or W). Also note that NA's: 1 tells us that we have one NA, which indicates a missing value (Not Available). We'll discuss later in more detail how these can be dealt with in R.
Region
C :17
E :17
N :17
S :17
W :17
NA's: 1
We can find the number of variables (columns) via:
length(data1)
## [1] 23
or by:
ncol(data1)
## [1] 23
Similarly, the number of rows can be found via:
nrow(data1)
## [1] 86
And we can find the dimension of the data frame (i.e. the number of rows and columns) via:
dim(data1)
## [1] 86 23
The data you have just loaded in is from the Guerry R package, so to get an idea about the variables we need to install the package:
install.packages("Guerry")
Load the library:
library(Guerry)
and then bring up the help information that relates to this specific data frame which is included in the package:
help(Guerry)
This data set contains statistics about French departments. As you can see, there are several different types of columns:

- Region, MainCity - factors
- Crime_pers, Distance - numeric values
- Donation_clergy, etc. - rankings; these show which department is 1st, 2nd, etc.

Plot the histogram of the Infants column:

ggplot(data1, aes(x=Infants)) + geom_histogram(bins=50)

Literacy per city size (MainCity):

ggplot(data1, aes(x=MainCity, y=Literacy)) + geom_boxplot()
In the plot above, we see that the box for the small cities looks strange, as it is quite skewed. We now take a closer look.
First we want to know how many observations are in the small, medium and large cities:
a <- table(data1$MainCity)
a
##
## 1:Sm 2:Med 3:Lg
## 10 66 10
This shows that we have 10 observations for small cities.
Next we want to see the actual data points in the small cities only. To do this, we'll use the dplyr package. The dplyr library takes a very similar approach to what we saw with ggplot. We start off with data in a data frame, and then we perform a series of operations or data transformations which are connected together via the %>% symbol, which acts as a sort of "pipe" through which data flows.
In very general terms, with the dplyr library we can create a series of statements that look like:
some data frame %>%
operation 1 %>%
operation 2
First we need to install and load the dplyr package:
install.packages("dplyr")
library(dplyr)
In the following example, we take a subset of the mtcars data set. We first keep a selection of rows by using filter, and then use select to keep certain columns.
mtcars %>%
filter(hp >= 250 | hp < 100) %>%
select(mpg, cyl)
In other words:

- filter(hp >= 250 | hp < 100) - keep all rows where hp is greater than or equal to 250 or is less than 100.
  - The | symbol specifies a logical "or". This just means that you want to find values matching hp >= 250 or hp < 100.
  - The & symbol is a logical "and". You could modify the statement above to use hp >= 250 & hp < 100, although this would return no rows since a value can't be both greater than or equal to 250 and less than 100.
- select(mpg, cyl) - keep only the mpg and cyl columns.

In short, the statement reads: mtcars %>% "keep these rows" %>% "keep these columns".

We can also filter on multiple columns by adding commas in the filter function.
function.
mtcars %>%
filter((hp >= 250 | hp < 100),
cyl == 4,
mpg > 30) %>%
select(mpg, cyl)
## mpg cyl
## 1 32.4 4
## 2 30.4 4
## 3 33.9 4
A few notes:

- The == symbol in cyl == 4 means that we check if the value is equal to 4.
- We use parentheses in (hp >= 250 | hp < 100) to group these two conditions together, as we want to match either of these conditions.
- Within the filter function, each of these statements should be set up so that it returns TRUE or FALSE, as shown in the examples below. Only rows with TRUE for all of the tests specified in filter will be returned.

# set the value for a variable
hp_val = 300
# is this bigger than or equal to 250?
hp_val >= 250
## [1] TRUE
# is this bigger than or equal to 250 or less than 100?
(hp_val >= 250 | hp_val < 100)
## [1] TRUE
# is it equal to 300?
hp_val == 300
## [1] TRUE
Regarding NA values:

- If you filter on a column (e.g. hp >= 250) using the filter function, then NA values will be filtered out.
- If you want to keep NA values, then you need to specify something like hp >= 250 | is.na(hp). The function is.na() will return TRUE for any NA values.

Now we use dplyr to make subsets of data1:
small_data <- data1 %>% filter(MainCity == "1:Sm") %>% select(Crime_prop)
print(small_data)
## Crime_prop
## 1 7289
## 2 8174
## 3 10263
## 4 9597
## 5 20235
## 6 6170
## 7 5990
## 8 9539
## 9 7770
## 10 7566
From above, we can see that there’s a large outlier in the small cities, and this would disturb a test like ANOVA. We need to remove this.
We then use the filter function with dplyr to keep all rows where we don't have a combination of a small city and a Crime_prop greater than 15000.
data2 = data1 %>% filter(!(MainCity == "1:Sm" & Crime_prop > 15000))
You'll notice that the filter statement is a bit more complex this time. Starting from the middle of the statement:

- & is again our logical "and", which means that we're looking for situations where the conditions on either side are both true.
- MainCity == "1:Sm" and Crime_prop > 15000 show that we are trying to match rows where the values meet these conditions.
- We place the ! symbol on the outside of this clause, which says that we only want to keep rows that don't match the conditions specified in the clause (MainCity == "1:Sm" & Crime_prop > 15000).
Now we can see that the boxplot looks more reasonable.
ggplot(data2, aes(x=MainCity, y=Crime_prop)) + geom_boxplot()
You can also create subsets of data frames by using numeric indices that indicate the row and column numbers. This technique will work with matrices as well. Note in the examples below that in some cases you're specifying a sequence of numbers (5:9 = 5,6,7,8,9), while in other cases you are specifying a set of numbers (c(1,5,9) is just the numbers 1, 5 and 9). You can also use the names of the columns together with numeric row indices.
b <- data1[1:3, 5:9]
b
## Crime_prop Literacy Donations Infants Suicides
## 1 15890 37 5098 33120 35039
## 2 5521 51 8901 14572 12831
## 3 7925 13 10973 17044 114121
c <- data1[c(1,5,9), 7]
c
## [1] 5098 6962 3608
d <- data1[c(1,5,9),c("Region", "Infants")]
d
## Region Infants
## 1 E 33120
## 5 E 23076
## 9 E 18642
Note that both b and d are data frames, while c is a vector of integers.
We can also extract a single column as a vector using the $ operator:

x <- data1$Crime_prop
or
x <- data1$Crime_prop[1:10] #just select the first 10 rows
We can also convert the data frame to a matrix:

data1_m <- as.matrix(data1)
We can get the total number of all elements in the matrix:
length(data1_m)
## [1] 1978
Note that we can't access columns via $ anymore. In other words, we can't do data1_m$Crime_prop.
We can use table to count the combinations of two factors:

tab = table(data1$MainCity, data1$Region)
tab
##
## C E N S W
## 1:Sm 1 4 0 3 2
## 2:Med 16 11 13 12 13
## 3:Lg 0 2 4 2 2
Useful resources:

- The Date and POSIXct classes are relevant for this practical. We don't use the lubridate library for this example, although it does similar things.

Dates, not involving times of the day, are stored natively on your computer as integer numbers, with 1 Jan 1970 = 0 and 1 corresponding to 1 day. Due to this, you need to specify origin="1970-01-01" in the example below.
For example, the number 365 corresponds to January 1, 1971. To tell R that a number is a date, not just a number, you use the command as.Date:
as.Date(365, origin="1970-01-01")
## [1] "1971-01-01"
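You can see this internal representation directly by converting a Date back to a number:

```r
# a Date object is stored internally as the number of days since 1970-01-01
as.numeric(as.Date("1970-01-02"))  # 1 day after the origin, so: 1
as.numeric(as.Date("1971-01-01"))  # 365 (1970 has 365 days)
```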
R can also convert strings to dates
as.Date("1971-01-01") # always use the order of year, month, day
## [1] "1971-01-01"
You can make vectors filled with series of dates using seq:
t = seq(from = as.Date("2000-11-30"), to = as.Date("2000-12-26"), by="1 day")
t
## [1] "2000-11-30" "2000-12-01" "2000-12-02" "2000-12-03" "2000-12-04"
## [6] "2000-12-05" "2000-12-06" "2000-12-07" "2000-12-08" "2000-12-09"
## [11] "2000-12-10" "2000-12-11" "2000-12-12" "2000-12-13" "2000-12-14"
## [16] "2000-12-15" "2000-12-16" "2000-12-17" "2000-12-18" "2000-12-19"
## [21] "2000-12-20" "2000-12-21" "2000-12-22" "2000-12-23" "2000-12-24"
## [26] "2000-12-25" "2000-12-26"
The by= argument can be e.g. 1 month, 2 years, 5 days, etc.
Now try to make a weekly time vector from 5 June 2005 to 20 August 2006. You should see:
## [1] "2005-06-05" "2005-06-12" "2005-06-19" "2005-06-26" "2005-07-03"
## [6] "2005-07-10" "2005-07-17" "2005-07-24" "2005-07-31" "2005-08-07"
## [11] "2005-08-14" "2005-08-21" "2005-08-28" "2005-09-04" "2005-09-11"
## [16] "2005-09-18" "2005-09-25" "2005-10-02" "2005-10-09" "2005-10-16"
## [21] "2005-10-23" "2005-10-30" "2005-11-06" "2005-11-13" "2005-11-20"
## [26] "2005-11-27" "2005-12-04" "2005-12-11" "2005-12-18" "2005-12-25"
## [31] "2006-01-01" "2006-01-08" "2006-01-15" "2006-01-22" "2006-01-29"
## [36] "2006-02-05" "2006-02-12" "2006-02-19" "2006-02-26" "2006-03-05"
## [41] "2006-03-12" "2006-03-19" "2006-03-26" "2006-04-02" "2006-04-09"
## [46] "2006-04-16" "2006-04-23" "2006-04-30" "2006-05-07" "2006-05-14"
## [51] "2006-05-21" "2006-05-28" "2006-06-04" "2006-06-11" "2006-06-18"
## [56] "2006-06-25" "2006-07-02" "2006-07-09" "2006-07-16" "2006-07-23"
## [61] "2006-07-30" "2006-08-06" "2006-08-13" "2006-08-20"
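One possible solution, for checking against your own attempt:

```r
# weekly dates from 5 June 2005 to 20 August 2006
t_week <- seq(from = as.Date("2005-06-05"), to = as.Date("2006-08-20"),
              by = "1 week")
length(t_week)  # 64 dates, matching the output above
```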
In general, you can treat date & time vectors as normal vectors, which means that you can use them in plots, add and subtract them, etc.
POSIXct does more or less the same as Date, but it stores the time as the number of seconds since midnight GMT on 1970-01-01. As a result, this allows us to store the time of day on a specific date.
t.str = "2013-12-19 10:17:07" # default format
t.obj = as.POSIXct(t.str)
t.obj
## [1] "2013-12-19 10:17:07 CET"
If you need to read in data that are not in the default format, you can tell R which format the data are in:
t.str1 = "19-12-2013 10:17:07"
t.obj1 = as.POSIXct(t.str1, format="%d-%m-%Y %H:%M:%S")
t.obj1
## [1] "2013-12-19 10:17:07 CET"
"%d-%m-%Y %H:%M:%S"
says that we have a time that is specified in the form of day, month and year (separated by -
) which is then followed by a space and the hours, minutes and seconds (separated by :
).
To get an overview of all the format types, type help(strptime). Below is a table which summarizes commonly used codes.
Code | Description | Example |
---|---|---|
%a | Abbreviated weekday name | Mon |
%A | Full weekday name | Monday |
%b | Abbreviated month name | Jan |
%B | Full month name | January |
%c | Date and time. Locale-specific on output, “%a %b %e %H:%M:%S %Y” on input. | |
%d | Day of the month as decimal number (01–31) | 01 |
%H | Hours as decimal number (00–23) | 16 |
%I | Hours as decimal number (01–12) | 08 |
%j | Day of year as decimal number (001–366) | 234 |
%m | Month as decimal number (01–12) | 07 |
%M | Minute as decimal number (00–59) | 12 |
%p | AM/PM indicator | AM |
%S | Second as integer (00–59) | 35 |
%U | Week of the year as decimal number (00–53) using Sunday as the first day of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention. | |
%w | Weekday as decimal number (0–6, Sunday is 0). | 1 |
%W | Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention. | |
%x | Date. Locale-specific on output, “%y/%m/%d” on input. | |
%X | Time. Locale-specific on output, “%H:%M:%S” on input. | |
%y | Year without century (00–99) | 91 |
%Y | Year with century | 1991 |
%z | Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC | |
%Z | Time zone abbreviation as a character string | PST |
You can change the format of the output date using the codes mentioned above.
Code | Description | Example |
---|---|---|
%a | Abbreviated weekday name | Mon |
%d | Day of the month as decimal number (01–31) | 01 |
%b | Abbreviated month name | Jan |
%Y | Year with century | 1991 |
t.obj1
## [1] "2013-12-19 10:17:07 CET"
format(t.obj1, "%a %d %b. %Y")
## [1] "Thu 19 Dec. 2013"
You can add and subtract POSIXt objects and perform logical operations on them. If you have a vector of POSIXt objects, you can use the min(), max() and range() functions.
t.str2 = "2013-12-19 18:20:07"
t.obj2 = as.POSIXct(t.str2)
Subtract two date times:
t.obj1 - t.obj2
## Time difference of -8.05 hours
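The difference above is reported in a unit R picks automatically; with difftime() you can force a specific unit. The two timestamps are recreated below with an explicit UTC time zone so the result does not depend on your locale (the document's output shows CET).

```r
# recreate the two timestamps with a fixed time zone for reproducibility
t1 <- as.POSIXct("2013-12-19 10:17:07", tz = "UTC")
t2 <- as.POSIXct("2013-12-19 18:20:07", tz = "UTC")
difftime(t2, t1, units = "mins")  # Time difference of 483 mins
```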
Download the Pr_20May1.csv to your computer in the same way that you did for the other csv file in this practical.
This data contains a time series of monthly temperature data from Jan. 1946 to Dec. 2014; however, the data set you just downloaded only has a single column with the temperatures and does not include a column corresponding to these dates.
Fix this by making a vector with the dates corresponding to each data point, and make a plot of the temperatures over time.
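A possible sketch for building the date vector is shown below. Reading the file and plotting are left as comments, since the column name inside Pr_20May1.csv is an assumption here.

```r
# monthly dates from Jan 1946 to Dec 2014
dates <- seq(from = as.Date("1946-01-01"), to = as.Date("2014-12-01"),
             by = "1 month")
length(dates)  # 828 months = 69 years * 12
# temps <- read.csv("Pr_20May1.csv")          # single-column data frame
# plot(dates, temps[[1]], type = "l",
#      xlab = "date", ylab = "temperature")   # column index assumed
```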
When dealing with real data sets, you will commonly run into several issues.
You may find that the column names for a data frame are excessively long, or may contain extra spaces or strange characters. You can fix this by using the colnames command.
With a hypothetical data frame called input1
, you can change the first and second column names by doing:
colnames(input1)[1] <- "x" # gives the first column the name "x"
colnames(input1)[2] <- "y" # gives the second column the name "y"
If there are only two columns in the entire data frame, then you could do:
colnames(input1) = c("x", "y")
If there are more than two columns, then you have to specify the indices, otherwise you will get an error.
colnames(input1)[c(1:2)] = c("x", "y")
Real data may often have missing values, which are represented in R as NA. In some data sets you may see them represented in other forms, like NaN (not a number) or by an unrealistic number (e.g., -999). For R to handle the missing values, you have to rename them as NA; if they are not named like that, then R does not recognize them and will try to process them as actual numbers.
Let's say that for a data frame we have called df, NA values are currently represented as -999. If we type into the console df == -999, we'll get a matrix of TRUE, FALSE or NA values (if NA is already present).
To replace these -999 values, we use the command below, which says that for all locations in the data frame with a value of -999, we should assign to those locations the value NA instead.
df[df == -999] <- NA
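To see this replacement in action on a small, made-up data frame:

```r
# toy data frame with -999 marking missing values
df <- data.frame(a = c(1.2, -999, 3.4), b = c(-999, 5.6, 7.8))
df[df == -999] <- NA   # replace every -999 with NA
sum(is.na(df))         # 2 missing values in total
```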
For this exercise, download the Carbon.csv data into R as a data frame called input1.
After loading the csv file into R, try to replace the -999 values with NA. If you run summary before the replacement, you should see:
## date_start elementary.carbon.from.fossil.fuels BB_EC
## 01/03/12 12:35: 1 Min. :0.1077 Min. :0.01385
## 01/06/11 13:30: 1 1st Qu.:0.2289 1st Qu.:0.02643
## 02/05/11 15:15: 1 Median :0.3446 Median :0.06442
## 02/12/11 09:15: 1 Mean :0.5839 Mean :0.12673
## 06/06/11 11:45: 1 3rd Qu.:0.6550 3rd Qu.:0.14614
## 07/07/11 13:30: 1 Max. :2.2389 Max. :0.66726
## (Other) :17
## FF_OC BB_OC BIO_OC
## Min. :-999.0000 Min. :-999.0000 Min. :0.2535
## 1st Qu.: 0.2770 1st Qu.: 0.1179 1st Qu.:0.4915
## Median : 0.4028 Median : 0.3220 Median :0.7457
## Mean : -42.8049 Mean : -42.8112 Mean :1.1032
## 3rd Qu.: 0.5454 3rd Qu.: 0.7306 3rd Qu.:1.2399
## Max. : 2.2815 Max. : 3.3353 Max. :4.0950
##
## FF_WIOC C_WIOC FF_WSOC
## Min. :-999.0000 Min. :0.04875 Min. :-0.16657
## 1st Qu.: 0.1436 1st Qu.:0.12869 1st Qu.: 0.09897
## Median : 0.1950 Median :0.26482 Median : 0.17624
## Mean : -86.5462 Mean :0.42041 Mean : 0.28935
## 3rd Qu.: 0.3089 3rd Qu.:0.42538 3rd Qu.: 0.35644
## Max. : 1.4266 Max. :1.80312 Max. : 1.05329
##
## C_WSOC
## Min. :-999.0000
## 1st Qu.: 0.4852
## Median : 0.8561
## Mean : -42.1394
## 3rd Qu.: 1.5079
## Max. : 4.2263
##
After replacing them, run summary again; you should see that several columns now have NA values:
## date_start elementary.carbon.from.fossil.fuels BB_EC
## 01/03/12 12:35: 1 Min. :0.1077 Min. :0.01385
## 01/06/11 13:30: 1 1st Qu.:0.2289 1st Qu.:0.02643
## 02/05/11 15:15: 1 Median :0.3446 Median :0.06442
## 02/12/11 09:15: 1 Mean :0.5839 Mean :0.12673
## 06/06/11 11:45: 1 3rd Qu.:0.6550 3rd Qu.:0.14614
## 07/07/11 13:30: 1 Max. :2.2389 Max. :0.66726
## (Other) :17
## FF_OC BB_OC BIO_OC FF_WIOC
## Min. :0.1173 Min. :0.06923 Min. :0.2535 Min. :0.08086
## 1st Qu.:0.2867 1st Qu.:0.12886 1st Qu.:0.4915 1st Qu.:0.17058
## Median :0.4328 Median :0.32871 Median :0.7457 Median :0.19604
## Mean :0.6585 Mean :0.65190 Mean :1.1032 Mean :0.35414
## 3rd Qu.:0.5580 3rd Qu.:0.77068 3rd Qu.:1.2399 3rd Qu.:0.31171
## Max. :2.2815 Max. :3.33533 Max. :4.0950 Max. :1.42661
## NA's :1 NA's :1 NA's :2
## C_WIOC FF_WSOC C_WSOC
## Min. :0.04875 Min. :-0.16657 Min. :0.2886
## 1st Qu.:0.12869 1st Qu.: 0.09897 1st Qu.:0.5394
## Median :0.26482 Median : 0.17624 Median :0.8566
## Mean :0.42041 Mean : 0.28935 Mean :1.3542
## 3rd Qu.:0.42538 3rd Qu.: 0.35644 3rd Qu.:1.5549
## Max. :1.80312 Max. : 1.05329 Max. :4.2263
## NA's :1
Using the same Carbon.csv data set:

- Take a look at the dates (the date_start column). If you type class(input1$date_start), you'll see that it's currently represented as a factor.
- Use as.Date() to convert the date_start values from a factor to a Date object.

If you do this correctly, you should see:
months = format(as.Date(input1$date_start), "%B")
months
## [1] "February" "April" "April" "April" "May"
## [6] "May" "June" "June" "July" "July"
## [11] "August" "August" "September" "September" "October"
## [16] "October" "November" "November" "December" "January"
## [21] "February" "March" "August"
We can now add this vector directly to the data frame:
input1$months = months
Sometimes real data sets don’t directly contain all the information you want. For example, you have data covering a year and you are wondering if there is a difference between summer and winter, or between day and night; but you only have a vector with date and time.
We may want to create new category variables which could be useful for ANOVA, box plots, etc.
For this example, assume that input1$y varies between 0 and 5. Given different ranges of y, we can specify the categories "low", "medium" and "high":
x1 <- c() # start with empty vector
x1[input1$y < 1] <- "low"
x1[(input1$y >= 1) & (input1$y < 3)] <- "medium"
x1[input1$y >= 3] <- "high"
x1 is now a vector of the same length as input1$y and consists of the values "low", "medium" and "high".
Using the Carbon.csv data set, make a vector that correctly assigns the seasons to your data set (use the meteorological definition of the seasons, e.g. winter is Dec, Jan, Feb; spring is Mar, Apr, May; etc.).
Note that you can use the syntax below for matching text values.
x1[input1$y %in% c("big", "large", "enormous")] <- "huge"
Now add the vector of months and the vector of seasons to your data frame.
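One possible sketch for the season assignment, assuming you have a vector of full English month names like the one created earlier. A small stand-in vector is used here so the example is self-contained; in the practical, use your own months vector.

```r
# stand-in for the real months vector, so the example runs on its own
months <- c("February", "April", "December")
seasons <- c()  # start with an empty vector
seasons[months %in% c("December", "January", "February")]  <- "winter"
seasons[months %in% c("March", "April", "May")]            <- "spring"
seasons[months %in% c("June", "July", "August")]           <- "summer"
seasons[months %in% c("September", "October", "November")] <- "autumn"
seasons  # "winter" "spring" "winter"
```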
We can also use dplyr for reshaping, combining, grouping, and summarizing data frames.