We need to install the forecast
package for the ma()
(moving average) function which we’ll use later in this practical.
install.packages("forecast")
Now load the required libraries for this practical:
library(tidyverse)
library(forecast)
Date
and POSIXct
are the classes relevant for this practical. We don’t use the lubridate
library in this example, although it does similar things.
Dates that don’t involve a time of day are natively stored on your computer as integer numbers, with 1 Jan 1970 corresponding to 0 and an increment of 1 corresponding to 1 day. Because of this, you need to specify origin="1970-01-01"
in the example below.
For example, the number 365 corresponds to January 1, 1971. To tell R that a number is a date and not just a number, you use as.Date
:
as.Date(365, origin="1970-01-01")
## [1] "1971-01-01"
R can also convert strings to dates
as.Date("1971-01-01") # always use the order of year, month, day
## [1] "1971-01-01"
You can make vectors filled with series of dates using seq
t = seq(from = as.Date("2000-11-30"), to = as.Date("2000-12-26"), by="1 day")
t
## [1] "2000-11-30" "2000-12-01" "2000-12-02" "2000-12-03" "2000-12-04"
## [6] "2000-12-05" "2000-12-06" "2000-12-07" "2000-12-08" "2000-12-09"
## [11] "2000-12-10" "2000-12-11" "2000-12-12" "2000-12-13" "2000-12-14"
## [16] "2000-12-15" "2000-12-16" "2000-12-17" "2000-12-18" "2000-12-19"
## [21] "2000-12-20" "2000-12-21" "2000-12-22" "2000-12-23" "2000-12-24"
## [26] "2000-12-25" "2000-12-26"
by=
can be, e.g., 1 month, 2 years, 5 days, etc.
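For instance, a short sketch of another step size (the dates here are just for illustration):
seq(from = as.Date("2000-01-01"), by = "2 weeks", length.out = 4)
## [1] "2000-01-01" "2000-01-15" "2000-01-29" "2000-02-12"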
In general, you can treat date and time vectors as normal vectors, which means that you can use them in plots, add and subtract them, etc.
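For example, a quick sketch of date arithmetic (the dates are made up and not used later in this practical):
as.Date("2000-02-01") - as.Date("2000-01-01")
## Time difference of 31 days
as.Date("2000-01-01") + 31
## [1] "2000-02-01"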
POSIXct
does more or less the same as Date
, but it stores the time as the number of seconds since midnight GMT on 1970-01-01. As a result, it allows us to store data that record both a date and a time of day.
t.str = "2013-12-19 10:17:07" # default format
t.obj = as.POSIXct(t.str)
t.obj
## [1] "2013-12-19 10:17:07 CET"
If your data are in a format other than the default, you can tell R which format they are in:
t.str1 = "19-12-2013 10:17:07"
t.obj1 = as.POSIXct(t.str1, format="%d-%m-%Y %H:%M:%S")
t.obj1
## [1] "2013-12-19 10:17:07 CET"
"%d-%m-%Y %H:%M:%S"
says that we have a time that is specified in the form of day, month and year (separated by -
) which is then followed by a space and the hours, minutes and seconds (separated by :
).
To get an overview of all the format types type ?strptime
. Below is a table which summarizes commonly used codes.
Code | Description | Example |
---|---|---|
%a | Abbreviated weekday name | Mon |
%A | Full weekday name | Monday |
%b | Abbreviated month name | Jan |
%B | Full month name | January |
%c | Date and time. Locale-specific on output, “%a %b %e %H:%M:%S %Y” on input. | |
%d | Day of the month as decimal number (01–31) | 01 |
%H | Hours as decimal number (00–23) | 16 |
%I | Hours as decimal number (01–12) | 08 |
%j | Day of year as decimal number (001–366) | 234 |
%m | Month as decimal number (01–12) | 07 |
%M | Minute as decimal number (00–59) | 12 |
%p | AM/PM indicator | AM |
%S | Second as integer (00–59) | 35 |
%U | Week of the year as decimal number (00–53) using Sunday as the first day of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention. | |
%w | Weekday as decimal number (0–6, Sunday is 0). | 1 |
%W | Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention. | |
%x | Date. Locale-specific on output, “%y/%m/%d” on input. | |
%X | Time. Locale-specific on output, “%H:%M:%S” on input. | |
%y | Year without century (00–99) | 91 |
%Y | Year with century | 1991 |
%z | Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC | |
%Z | Time zone abbreviation as a character string | PST |
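As a small illustration (assuming an English locale so that %B matches the full month name), several of these codes can be combined in a single format string when parsing:
as.POSIXct("19 December 2013, 10:17", format = "%d %B %Y, %H:%M") # parsed as 2013-12-19 10:17:00 in your local time zone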
You can change the format of the output date using the codes mentioned above.
Code | Description | Example |
---|---|---|
%a | Abbreviated weekday name | Mon |
%d | Day of the month as decimal number (01–31) | 01 |
%b | Abbreviated month name | Jan |
%Y | Year with century | 1991 |
t.obj1
## [1] "2013-12-19 10:17:07 CET"
format(t.obj1, "%a %d %b. %Y")
## [1] "Thu 19 Dec. 2013"
You can add and subtract POSIXt objects and perform logical operations on them. If you have a vector of POSIXt objects, you can use the min()
, max()
and range()
functions.
t.str2 = "2013-12-19 18:20:07"
t.obj2 = as.POSIXct(t.str2)
Subtract two date times:
t.obj1 - t.obj2
## Time difference of -8.05 hours
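As a sketch, min(), max() and range() work on a vector of POSIXct values as well (the times below are made up for illustration):
times = as.POSIXct(c("2013-12-19 10:17:07", "2013-12-19 18:20:07", "2013-12-20 08:00:00"))
min(times)               # the earliest date-time in the vector
max(times) - min(times)  # time difference between the latest and earliest date-time
range(times)             # the earliest and latest date-times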
The example we’ll use throughout this practical is based on Google Trends data for the number of daily searches for the term “sustainability” for the period January 1, 2014 until October 8, 2017. Once you read it in, you’ll have a data frame with a single column volume
which is specified in the csv file itself.
sustainability = read.csv("https://raw.githubusercontent.com/cbdavis/DASM/master/2017/data/sustainability.csv")
Tvalues = data.frame(volume = sustainability$volume)
We can create a sequence of dates (using seq
), starting at one date with a defined time interval for each step.
Tvalues$t = seq(from = as.Date("2014-01-01"), to = as.Date("2017-10-08"), by = "1 day")
Now we plot the data:
ggplot(Tvalues, aes(x=t, y=volume)) + geom_line()
From this we can see that there’s a weekly pattern, a twice-yearly seasonal pattern with peaks in the spring and fall, and a yearly pattern.
Make a weekly time vector from 5 June 2005 to 20 August 2006. You should see:
## [1] "2005-06-05" "2005-06-12" "2005-06-19" "2005-06-26" "2005-07-03"
## [6] "2005-07-10" "2005-07-17" "2005-07-24" "2005-07-31" "2005-08-07"
## [11] "2005-08-14" "2005-08-21" "2005-08-28" "2005-09-04" "2005-09-11"
## [16] "2005-09-18" "2005-09-25" "2005-10-02" "2005-10-09" "2005-10-16"
## [21] "2005-10-23" "2005-10-30" "2005-11-06" "2005-11-13" "2005-11-20"
## [26] "2005-11-27" "2005-12-04" "2005-12-11" "2005-12-18" "2005-12-25"
## [31] "2006-01-01" "2006-01-08" "2006-01-15" "2006-01-22" "2006-01-29"
## [36] "2006-02-05" "2006-02-12" "2006-02-19" "2006-02-26" "2006-03-05"
## [41] "2006-03-12" "2006-03-19" "2006-03-26" "2006-04-02" "2006-04-09"
## [46] "2006-04-16" "2006-04-23" "2006-04-30" "2006-05-07" "2006-05-14"
## [51] "2006-05-21" "2006-05-28" "2006-06-04" "2006-06-11" "2006-06-18"
## [56] "2006-06-25" "2006-07-02" "2006-07-09" "2006-07-16" "2006-07-23"
## [61] "2006-07-30" "2006-08-06" "2006-08-13" "2006-08-20"
Download the file Practical7_time_series.txt and load it into R. This contains daily data on the internet traffic from November 20, 2004 until January 27, 2005.
ts()
We can also use the ts
function to create time series objects. The relevant syntax is shown below and you can learn more by typing ?ts
into the console.
ts(data = NA, start = 1, end = numeric(), frequency = 1,
deltat = 1)
Arguments:
- data - a vector or matrix of the observed time-series values.
- start - the time of the first observation. Either a single number or a vector of two integers, which specify a natural time unit and a (1-based) number of samples into the time unit. See the examples for the use of the second form.
- end - the time of the last observation, specified in the same way as start.
- frequency - the number of observations per unit of time (e.g. 12 for monthly data, 4 for quarterly data, or 365 for daily data with a yearly cycle).
- deltat - the fraction of the sampling period between successive observations; e.g., 1/12 for monthly data. Only one of frequency or deltat should be provided.

Again we use the sustainability search data. We are using freq = 365, which means that there are 365 observations per cycle (i.e. per year), so the period of the seasonal variation is just 365.
t_ser <- ts(Tvalues$volume, start = c(2014, 1), freq = 365)
plot(t_ser)
You can also define a time series without a start date. Now the values on the x axis correspond to the cycle number, which in this case refers to years, with the first observation falling in cycle 1. Looking at the x axis, we see that it runs to almost 5, i.e. almost 4 years of observations.
t_ser1 <- ts(Tvalues$volume, freq = 365)
plot(t_ser1)
Using the data from Practical7_time_series.txt, make a time series out of it. Assume that the starting point is week 47.
We will now use the ma
function for moving-average smoothing. To do this, make sure to load the forecast
library:
library(forecast)
The formula for a centered moving average of order \(m = 2k + 1\) is:
\[ \hat{x}(t) = \frac{1}{m}\sum_{j=-k}^{k} x(t + j\,\Delta t) \]
The main idea is that you take a set of \(m\) data points, with \(k\) points to the left and \(k\) points to the right of \(x(t)\), and average them together with \(x(t)\).
The syntax for the moving average function in R is:
ma(x, order, centre=TRUE)
Arguments:
- x - Univariate time series
- order - Order of moving average smoother
- centre - If TRUE, then the moving average is centered for even orders.

Here we create a single moving average on the sustainability search data we loaded in above.
ma5 = ma(Tvalues$volume, order = 5)
To demonstrate what is being calculated, the third element of the moving average is the average of elements 1:5 in the original data (k=2, so we include data 2 places to the left and right). The seventh element of the moving average is the average of elements 5:9 in the original data.
ma5[3]
## [1] 39
mean(Tvalues$volume[1:5])
## [1] 39
ma5[7]
## [1] 66.2
mean(Tvalues$volume[5:9])
## [1] 66.2
Here we create a new data frame by adding new columns to Tvalues
that contain two moving averages. For this, we use the dplyr
package along with the mutate
function to add two new columns: ma5
and ma13
.
Tvalues_with_ma = Tvalues %>%
mutate(ma5 = ma(volume, order = 5),
ma13 = ma(volume, order = 13))
Here we use head
with n=10
to look at the first ten rows of the resulting data frame:
head(Tvalues_with_ma, n=10)
## volume t ma5 ma13
## 1 24 2014-01-01 NA NA
## 2 49 2014-01-02 NA NA
## 3 50 2014-01-03 39.0 NA
## 4 33 2014-01-04 47.8 NA
## 5 39 2014-01-05 52.0 NA
## 6 68 2014-01-06 58.0 NA
## 7 70 2014-01-07 66.2 55.15385
## 8 80 2014-01-08 72.4 60.00000
## 9 74 2014-01-09 65.8 63.30769
## 10 70 2014-01-10 61.4 65.84615
We can now see the effects of the different values of order
in the ma
function. Specifically:
- order = 5 means that 5 data points are averaged (2 to the left and 2 to the right of the data point)
- order = 13 means that 13 data points are averaged (6 to the left and 6 to the right of the data point)

The first few values of ma5 and ma13 are NA due to this averaging on the left and right of each data point. For example, the first data point does not have any data points to its left, so its moving average is defined as NA.
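As a quick check (a sketch using the Tvalues_with_ma data frame created above), the number of NA values in each column matches the number of points dropped at the two ends of the series:
sum(is.na(Tvalues_with_ma$ma5))   # 4 in total: 2 at the start and 2 at the end
sum(is.na(Tvalues_with_ma$ma13))  # 12 in total: 6 at the start and 6 at the end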
Below we show how moving averages compare to the original data, where we plot a moving average of length 7 (1 week), and another moving average of length 90 (3 months). Note that in the geom_line
statements, we only specify the value for y
since the data frame and x
have already been specified with ggplot(Tvalues_with_ma, aes(x=t)).
ggplot(Tvalues_with_ma, aes(x=t)) +
geom_line(aes(y = volume,
color="original data"),
size=2) +
geom_line(aes(y = ma(volume,
order = 7),
color="ma7")) +
geom_line(aes(y = ma(volume,
order = 90),
color="ma90")) +
xlim(c(as.Date("2014-01-01"),
as.Date("2014-06-01"))) # Just show a limited range of data, otherwise the plot is too cluttered
Calculate a moving average with order = 3, 7, and 15 and add it to the plot.
Again use the data from Practical7_time_series.txt with the time values added in as shown above.
The warning messages that you see are related to NA
values generated based on the different orders of moving averages.
Sometimes you want to calculate the difference between time series values where you offset one set of values by a particular time lag. For example, if you wanted to analyze the yearly growth of a country’s population, you would subtract the previous year’s population from each year’s population.
An easy way to do this is with the lag
command and dplyr
. The lag
function simply takes as input a vector and shifts its values by the number specified. Take for example a vector containing the numbers 1, 2, …, 10:
a = c(1:10)
print(a)
## [1] 1 2 3 4 5 6 7 8 9 10
lag(a, 3)
## [1] NA NA NA 1 2 3 4 5 6 7
As you can see we push all the values over three places and fill in the left side with NA
. The values 8, 9 and 10 are discarded since they are pushed out.
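As a side note (not needed for the rest of this practical), dplyr also provides lead(), which shifts the values in the opposite direction:
lead(a, 3)
## [1] 4 5 6 7 8 9 10 NA NA NA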
Here we shift the values for volume
in sustainability
by seven days. We use the mutate()
function to create a new column named lag_7
in the data frame.
sustainability_search_volume_lag7 = sustainability %>% mutate(lag_7 = lag(volume, 7))
Looking at the first few rows of the data frame, we can see how NA
values are inserted at first, with the lagged values starting at row 8:
head(sustainability_search_volume_lag7, n=15)
## volume lag_7
## 1 24 NA
## 2 49 NA
## 3 50 NA
## 4 33 NA
## 5 39 NA
## 6 68 NA
## 7 70 NA
## 8 80 24
## 9 74 49
## 10 70 50
## 11 35 33
## 12 48 39
## 13 77 68
## 14 87 70
## 15 92 80
Below we visually investigate whether there might be a correlation between the original data and versions of it shifted by different lag amounts. If they are correlated, we would expect to see the points arranged in nearly a straight line.
Instead of creating a new data frame for each different lag value, we just pass the results directly to ggplot using statements like data = sustainability %>% mutate(lag_x = lag(volume, 3)).
## data shifted 3 days
ggplot(data = sustainability %>% mutate(lag_x = lag(volume, 3)),
aes(x=volume, y=lag_x)) +
geom_point()
## data shifted 7 days
ggplot(data = sustainability %>% mutate(lag_x = lag(volume, 7)),
aes(x=volume, y=lag_x)) +
geom_point()
## data shifted 3 months (84 days = 7 days * 12 weeks)
ggplot(data = sustainability %>% mutate(lag_x = lag(volume, 84)),
aes(x=volume, y=lag_x)) +
geom_point()
## data shifted 6 months (182 days = 7 days * 26 weeks)
ggplot(data = sustainability %>% mutate(lag_x = lag(volume, 182)),
aes(x=volume, y=lag_x)) +
geom_point()
## data shifted 12 months (364 days)
ggplot(data = sustainability %>% mutate(lag_x = lag(volume, 364)),
aes(x=volume, y=lag_x)) +
geom_point()
Note that changing the lag by one day (365 instead of 364) can make a big difference, depending on the nature of your time series data. In our case, this is due to a very strong weekly pattern.
## data shifted 12 months (365 days)
ggplot(data = sustainability %>% mutate(lag_x = lag(volume, 365)),
aes(x=volume, y=lag_x)) +
geom_point()
## Warning: Removed 365 rows containing missing values (geom_point).
Again use the data from Practical7_time_series.txt with the time values added in as shown above. Using a lag of 7, show the actual correlation (scatter) plot.
For a time series, the lag 1 correlation is the correlation of the time series \(x(t)\) with the time series shifted by one time unit, \(x(t + \Delta t)\).
For the autocorrelation function (ACF) we plot the autocorrelation coefficient \(r(k\,\Delta t)\) against the lag \(k\).
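For reference, the sample autocorrelation coefficient at lag \(k\) (the quantity estimated by acf below) can be written as
\[ r(k) = \frac{\sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{N} (x_t - \bar{x})^2} \]
where \(N\) is the number of observations and \(\bar{x}\) is the mean of the series.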
The basic syntax of the acf
function is:
acf(x, lag.max = NULL,
type = c("correlation", "covariance", "partial"),
plot = TRUE, na.action = na.fail)
Arguments:
- x - a univariate or multivariate numeric time series object or a numeric vector or matrix.
- lag.max - maximum lag at which to calculate the acf. Default is 10*log10(N/m) where N is the number of observations and m the number of series. Will be automatically limited to one less than the number of observations in the series.
- type - character string giving the type of acf to be computed. Allowed values are "correlation" (the default), "covariance" or "partial". Will be partially matched. In this practical we use type="correlation".
- plot - logical. If TRUE (the default) the acf is plotted.
- na.action - function to be called to handle missing values. na.pass can be used.

With the default na.action, you will get the following error if one of the values in the data is NA:
Error in na.fail.default(as.ts(x)) : missing values in object
To get around this, you can add na.action=na.pass.
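For example, a minimal sketch using the ma5 column created earlier (which contains NA values at its ends):
acf(Tvalues_with_ma$ma5, type = "correlation", lag.max = 30, na.action = na.pass)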
Using the sustainability search data, we estimate the autocorrelation with a maximum lag of 30.
acf_T = acf(Tvalues$volume, type = "correlation", lag.max = 30)
This shows that the time series data is highly correlated on a weekly basis.
Now we try the same with a lag of 400:
acf_T = acf(Tvalues$volume, type = "correlation", lag.max = 400)
This shows that we have a very strong repeating weekly pattern throughout the year, while there is also a correlation between the first and second halves of the year, in addition to a correlation for a yearly pattern.
If you want to look at the value at a specific lag, you can index into acf_T$acf (note that the first element corresponds to lag 0, so acf_T$acf[10] below is the value at lag 9):
acf_10 = acf_T$acf[10]
print(acf_10)
## [1] 0.03084689
Calculate the autocorrelation function (ACF) of the time series data Practical7_time_series.txt
and plot it with a maximum lag of 30.
Time series decomposition allows us to analytically split up a single time series into three different time series that represent the overall trend, a seasonal component and random noise.
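In the additive case the model is \(x(t) = T(t) + S(t) + R(t)\), where \(T\) is the trend, \(S\) the seasonal component and \(R\) the random remainder; in the multiplicative case the components are multiplied instead, \(x(t) = T(t) \cdot S(t) \cdot R(t)\).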
The basic syntax is:
decompose(x, type = c("additive", "multiplicative"), filter = NULL)
- x - A time series.
- type - The type of seasonal component. Can be abbreviated.
- filter - A vector of filter coefficients in reverse time order (as for AR or MA coefficients), used for filtering out the seasonal component. If NULL, a moving average with symmetric window is performed.

By default, an additive model is used if no value is specified for type.
Here we use the sustainability search data.
## create a time series object with frequency of 7, since there is a strong weekly pattern in the data
# this means that we consider a "season" or period to be 7 days.
t_ser1 <- ts(sustainability$volume, freq = 7)
output = decompose(t_ser1)
plot(output)
The underlying data is available as a set of four vectors:
- output$x - the observed data
- output$seasonal - the seasonal variation
- output$trend - the overall trend
- output$random - the random noise

Now deseasonalize the data. This shows the noise (random) + the trend. The numbers on the Time axis of the plot refer to weeks, since we specified freq=7 when creating the time series object above. However, when we specify the lag (lag.max), this doesn’t refer to the period of the data, but rather to the interval on which it was sampled (a lag of 7 means seven days).
## subtract the seasonal component from the original time series
t_des <- output$x - output$seasonal
## plot the deseasonalized data
plot(t_des)
## Find the ACF of the random component of the time series data
acf_des <- acf(as.numeric(output$random), type = "correlation", lag.max = 400, na.action=na.omit)
Note that if you don’t use as.numeric as shown above, the lag axis of the ACF plot will be expressed in cycles (weeks here, since we created the time series with freq = 7) rather than in days, the interval at which the data were sampled (as in the plot above).
We see that there’s also a seasonal component in the trend
data, and we can run acf()
on that as well:
acf_des <- acf(output$trend, type = "correlation", lag.max = 400, na.action=na.omit)
What would happen if we analyzed a time series composed of completely random values?
## 20 "years" of random values uniformly distributed
t_ser_random <- ts(runif(20*12), freq = 12)
output = decompose(t_ser_random)
plot(output)
The decompose method does show a seasonal component, but it’s important to note the scale: the range of the random values is much greater than the range of the seasonal values.
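A quick numeric sketch of this, comparing the ranges of the two components directly:
range(output$seasonal, na.rm = TRUE)  # the seasonal component spans a relatively small range
range(output$random, na.rm = TRUE)    # the random component spans a much larger range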
Decompose the time series from Practical7_time_series.txt
using classical decomposition.
Hint: you need to construct an appropriate time series object first using ts()
. From the discussion about this data set above, you should be able to pick the correct frequency for the time series object.
The basic syntax for the stl
function is:
stl(x, s.window,
t.window = NULL,
l.window,
robust = FALSE)
Arguments:
- x - univariate time series to be decomposed. This should be an object of class “ts” with a frequency greater than one.
- s.window - either the character string “periodic” or the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7, according to Cleveland et al. This has no default.
- t.window - the span (in lags) of the loess window for trend extraction, which should be odd. If NULL, the default, nextodd(ceiling((1.5*period) / (1-(1.5/s.window)))), is taken.
- l.window - the span (in lags) of the loess window of the low-pass filter used for each subseries. Defaults to the smallest odd integer greater than or equal to frequency(x), which is recommended since it prevents competition between the trend and seasonal components. If not an odd integer, its given value is increased to the next odd one.
- robust - logical indicating whether robust fitting should be used in the loess procedure.

In the stl function there is an inner loop that separates the seasonal and trend components. This works in principle similarly to classical decomposition, except that it uses more sophisticated smoothing methods.
Therefore you can set a smoothing window (l.window
) for the seasonal component. This is by default equal to the next odd integer which is greater than or equal to the cycle length. For the purposes of this practical, it is better to leave it like this.
You can set a window (s.window
) over how many cycles the seasonal cycle can vary. In the classical decomposition all the seasonal cycles are simply averaged. This is done here by a more complicated weighted-average/fitting method. This number should be odd and >=7.
You can set a window (t.window
) that determines how smooth the trend should be. (If the trend is smoother then you get more variability in the remainder). It is recommended to set t.window
>= 1.5*m/(1 – 1/s.window
), where m
is the value set for the frequency =
parameter of your time series.
You can play around with these parameters, to get the trend estimate you like.
There is an optional outer loop, that provides robustness. This is recommended if you have many outliers or not-well behaved time series. You set this to TRUE
or FALSE
using the robust
parameter in the stl
function.
We again use the sustainability search data.
t_ser1 <- ts(sustainability$volume, freq = 7)
output1 = stl(t_ser1, s.window = 25, t.window = (1.5 * 7)/(1 - 1/7))
plot(output1)
On the right of the plot you will see several gray bars. These give an indication of the scale of the different plots. For example, the width of the bar on the trend
plot corresponds to the same range as the smaller gray bar on the data
plot. In other words, the trend is a little smaller than the data
values. We can also see that the trend is only a little bigger than the noise (i.e. the remainder).
To extract the different parts of the decomposed time series, we will need to use a different syntax than we did with the decompose
function. First we look at the first few rows of the time.series
matrix associated with output1
:
head(output1$time.series)
## Time Series:
## Start = c(1, 1)
## End = c(1, 6)
## Frequency = 7
## seasonal trend remainder
## 1.000000 15.43121 30.69492 -22.1261323
## 1.142857 12.37592 36.15903 0.4650536
## 1.285714 -2.38138 41.62314 10.7582442
## 1.428571 -31.89869 46.81114 18.0875561
## 1.571429 -21.23226 51.99914 8.2331211
## 1.714286 12.00211 56.48830 -0.4904088
In order to access these, we need to use the following syntax:
output1$time.series[,"seasonal"]
output1$time.series[,"trend"]
output1$time.series[,"remainder"]
Note that the original data is not included in output1
. In this example, we would have to access it using sustainability$volume
which contains the vector of values.
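As a sketch (not part of the original output above), you can check that the three components add back up to the original data, since this is an additive decomposition:
reconstructed = rowSums(output1$time.series)  # seasonal + trend + remainder
all.equal(as.numeric(reconstructed), sustainability$volume)  # should be TRUE, up to floating point tolerance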
We now take the remainder and look at its autocorrelation:
t_rem <- output1$time.series[,"remainder"]
acf_rem <-acf(as.numeric(t_rem), type = "correlation", lag.max = 400)
This shows us that there is still a seasonal (yearly) pattern remaining in the random component of the time series.
Decompose the time series from Practical7_time_series.txt
using stl
decomposition. Again use ts
and an appropriate value for frequency
as you did in the previous exercise above.
Play with the parameters s.window
and t.window
and observe how this influences the trend, the remainder, and the seasonal pattern.
Note that the results you get may vary a bit from what you see below based on the values you choose for s.window
and t.window
.
Calculate and plot the de-trended time series.
Plot and interpret the ACF of the de-trended time series:
With the help of the ACF investigate if the remainder still contains information, or if it is random noise.
There is an indication of a weak correlation at lag 1 and a weak anticorrelation at lags 2 and 3. This might indicate a causal relationship, i.e., that a day with a lot of internet traffic is followed by another day with a lot of internet traffic. But since the correlation is quite weak, it might also be due to an imperfect decomposition or to outliers (e.g. the first week, which has a much higher than average weekly cycle).