We need to install the forecast package for the ma (moving average) function, which we'll use later in this practical.
install.packages("forecast")
Now load the required libraries for this practical:
library(ggplot2)
library(dplyr)
library(forecast)
For this example, download the Pr_20May1.csv data set and load it into R. Once you read it in, you'll have a data frame with a single column x, which is specified in the csv file itself.
Tvalues = read.csv("Pr_20May1.csv")
As we showed in the previous practical, we can create a sequence of dates (using seq), starting at one date with a defined time interval for each step.
Tvalues$t = seq(from = as.Date("1946-01-01"), to = as.Date("2014-12-01"), by = "1 month")
Now we plot the data:
ggplot(Tvalues, aes(x=t, y=x)) + geom_line()
Download the file Practical7_time_series.txt and load it into R. This contains daily data on internet traffic from November 20, 2004 until January 27, 2005.
ts()
We can also use the ts function to create time series objects. The relevant syntax is shown below, and you can learn more by typing ?ts into the console.
ts(data = NA, start = 1, end = numeric(), frequency = 1,
   deltat = 1)
Arguments:
data - a vector or matrix of the observed time-series values.
start - the time of the first observation. Either a single number or a vector of two integers, which specify a natural time unit and a (1-based) number of samples into the time unit. See the examples for the use of the second form.
end - the time of the last observation, specified in the same way as start.
frequency - the number of observations per unit of time; e.g., 12 for monthly data or 4 for quarterly data.
deltat - the fraction of the sampling period between successive observations; e.g., 1/12 for monthly data. Only one of frequency or deltat should be provided.
To perform the same operation as shown in the previous example above, we can also do:
t_ser <- ts(Tvalues$x, start = c(1946, 1), freq = 12)
plot(t_ser)
You can also define a time series without a start date. Using frequency = 12 means that you have 12 observations per cycle, so the period of the seasonal variation is 12. Now the values on the x axis correspond to the cycle number, which in this case refers to the number of years after the first observation.
t_ser1 <- ts(Tvalues$x, freq = 12)
plot(t_ser1)
We will now use the ma function for moving-average smoothing. To do this, make sure the forecast library is loaded:
library(forecast)
The formula for creating a moving average is:
\(T_{t} = \frac{1}{m} \sum_{j=-k}^{k} x(t + j)\) with \(m = 2k + 1\)
The main idea is that you take a set of \(m\) data points, with \(k\) points to the left and \(k\) points to the right of \(x(t)\), and average them together with \(x(t)\).
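To connect the formula to code, here is a minimal sketch (not part of the required steps) that computes a centered moving average directly with base R's stats::filter, assuming k = 2 so that m = 5. Where both are defined, it should agree with ma(Tvalues$x, order = 5) from the forecast package, which we use next.
k <- 2
m <- 2 * k + 1  # m = 2k + 1, as in the formula
# apply equal weights 1/m symmetrically around each point (sides = 2)
ma_by_hand <- stats::filter(Tvalues$x, rep(1 / m, m), sides = 2)
head(ma_by_hand)  # the first k values are NA: no neighbours to the left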
The syntax for the moving average function in R is:
ma(x, order, centre=TRUE)
Arguments:
x - Univariate time series.
order - Order of moving average smoother.
centre - If TRUE, then the moving average is centered for even orders.
Here we create a single moving average on the Pr_20May1.csv data we loaded in above.
ma5 = ma(Tvalues$x, order = 5)
To demonstrate what is being calculated, the third element of the moving average is the average of elements 1:5 in the original data (k=2, so we include data 2 places to the left and right). The seventh element of the moving average is the average of elements 5:9 in the original data.
ma5[3]
## [1] 59.8
mean(Tvalues$x[1:5])
## [1] 59.8
ma5[7]
## [1] 146
mean(Tvalues$x[5:9])
## [1] 146
Here we create a new data frame, where we add new columns to Tvalues that contain two moving averages. For this, we use the dplyr package along with the mutate function to add two new columns: ma5 and ma13.
Tvalues_with_ma = Tvalues %>%
mutate(ma5 = ma(x, order = 5),
ma13 = ma(x, order = 13))
Here we use head with n=10 to look at the first ten rows of the resulting data frame:
head(Tvalues_with_ma, n=10)
## x t ma5 ma13
## 1 12 1946-01-01 NA NA
## 2 49 1946-02-01 NA NA
## 3 33 1946-03-01 59.8 NA
## 4 89 1946-04-01 84.4 NA
## 5 116 1946-05-01 108.6 NA
## 6 135 1946-06-01 134.0 NA
## 7 170 1946-07-01 146.0 83.84615
## 8 160 1946-08-01 143.8 78.53846
## 9 149 1946-09-01 131.4 75.92308
## 10 105 1946-10-01 99.4 78.84615
We can now see the effects of the different values of order in the ma function. Specifically:
order = 5 means that 5 data points are averaged (2 to the left and 2 to the right of the data point)
order = 13 means that 13 data points are averaged (6 to the left and 6 to the right of the data point)
The first few values of ma5 and ma13 are NA due to this averaging on the left and right of the data point. For example, the first data point does not have another data point to the left, so the average is defined as NA.
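As a quick sanity check (a sketch, not required for the practical), you can list the positions where the moving average is undefined; for order = 5 these should be the first two and last two elements:
which(is.na(ma5))  # expect 1, 2 and the last two indices of the series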
Now we can plot everything and see how the moving averages compare to the original data. Note that in the geom_line statements, we only specify the value for y, since the data frame and x have already been specified with ggplot(Tvalues_with_ma, aes(x=t)).
ggplot(Tvalues_with_ma, aes(x=t)) +
geom_line(aes(y=x, color="original data"), size=2) +
geom_line(aes(y=ma5, color="ma5")) +
geom_line(aes(y=ma13, color="ma13")) +
xlim(c(as.Date("1980-01-01"),
as.Date("1984-01-01"))) # Just show data from 1980-1984, otherwise the plot is too cluttered
Calculate moving averages with order = 3, 7, and 15 and add them to the plot.
Again use the data from Practical7_time_series.txt with the time values added in as shown above.
The warning messages that you see are related to the NA values generated by the different orders of moving averages.
An easy way to create a lagged version of a time series is with the lag function from dplyr. The lag function simply takes a vector as input and shifts its values by the number specified. Take for example a vector containing the numbers 1, 2, …, 10:
a = c(1:10)
print(a)
## [1] 1 2 3 4 5 6 7 8 9 10
lag(a, 3)
## [1] NA NA NA 1 2 3 4 5 6 7
As you can see, we push all the values over three places and fill in the left side with NA. The values 8, 9 and 10 are discarded, since they are pushed out.
Here we shift the values for x in Tvalues by three months. We use the mutate function to create a new column named lag_x in the data frame.
Tvalues_lag3 = Tvalues %>% mutate(lag_x = lag(x, 3))
Looking at the first few rows of the data frame, we can see how NA values are inserted at first, with the lagged values starting at row 4:
head(Tvalues_lag3)
## x t lag_x
## 1 12 1946-01-01 NA
## 2 49 1946-02-01 NA
## 3 33 1946-03-01 NA
## 4 89 1946-04-01 12
## 5 116 1946-05-01 49
## 6 135 1946-06-01 33
Below we visually investigate whether there might be a correlation between the original data and lagged versions of it. If they are correlated, we would expect to see the points arranged in nearly a straight line.
Instead of creating a new data frame for each different lag value, we just pass the result directly to ggplot using, for example, data = Tvalues %>% mutate(lag_x = lag(x, 3)):
# data shifted 3 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 3)),
aes(x=x, y=lag_x)) +
geom_point()
# data shifted 6 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 6)),
aes(x=x, y=lag_x)) +
geom_point()
# data shifted 10 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 10)),
aes(x=x, y=lag_x)) +
geom_point()
# data shifted 12 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 12)),
aes(x=x, y=lag_x)) +
geom_point()
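If you want to quantify what the scatter plots suggest rather than judging by eye, one option (a sketch, using dplyr's lag as above) is to compute the correlation coefficient directly, dropping the NA values that the lag introduces:
# correlation between the series and itself shifted by 12 months
cor(Tvalues$x, lag(Tvalues$x, 12), use = "complete.obs")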
Again use the data from Practical7_time_series.txt with the time values added in as shown above. Using a lag of 10, show the actual correlation (scatter) plot.
For a time series, a lag 1 correlation is the correlation of the time series \(x(t)\) with the time series shifted by one time unit, \(x(t + \Delta t)\).
For the autocorrelation function, we plot the autocorrelation coefficient \(r(k \Delta t)\) against the lag \(k\).
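As a rough illustration of the idea, you could compute the lag 1 correlation by hand, correlating the series with itself shifted by one time step. Note this is only a sketch: the acf function below uses a slightly different estimator (a common mean and a divisor of n), so the values will not match exactly.
x <- Tvalues$x
n <- length(x)
cor(x[1:(n - 1)], x[2:n])  # approximate lag 1 autocorrelation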
The basic syntax of the acf function is:
acf(x, lag.max = NULL,
type = c("correlation", "covariance", "partial"),
plot = TRUE, na.action = na.fail)
Arguments:
x - a univariate or multivariate numeric time series object or a numeric vector or matrix.
lag.max - maximum lag at which to calculate the acf. Default is 10*log10(N/m) where N is the number of observations and m the number of series. Will be automatically limited to one less than the number of observations in the series.
type - character string giving the type of acf to be computed. Allowed values are "correlation" (the default), "covariance" or "partial". Will be partially matched.
plot - logical. If TRUE (the default) the acf is plotted.
na.action - function to be called to handle missing values. na.pass can be used.
Note that with the default na.action = na.fail, you will get the error
Error in na.fail.default(as.ts(x)) : missing values in object
if one of the values in the data is NA. To get around this, you can add na.action = na.pass.
Using the Pr_20May1.csv data, we estimate the autocorrelation with a maximum lag of 64.
Tvalues = read.csv("Pr_20May1.csv")
acf_T = acf(Tvalues, type = "correlation", lag.max = 64)
If you want to look at the value at a specific lag, you can index into acf_T$acf. Note that the first element corresponds to lag 0, so element 10 is the autocorrelation at lag 9:
acf_lag9 = acf_T$acf[10]
print(acf_lag9)
## [1] 0.007269122
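To avoid counting positions by hand, you can also index via the stored lag values; this sketch returns the autocorrelation at exactly lag 10:
acf_T$acf[acf_T$lag == 10]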
Calculate the autocorrelation function (ACF) of the time series data in Practical7_time_series.txt and plot it with a maximum lag of 30.
decompose()
The basic syntax is:
decompose(x, type = c("additive", "multiplicative"), filter = NULL)
Arguments:
x - A time series.
type - The type of seasonal component. Can be abbreviated.
filter - A vector of filter coefficients in reverse time order (as for AR or MA coefficients), used for filtering out the seasonal component. If NULL, a moving average with symmetric window is performed.
By default, an additive model is used if no value is specified for type.
Here we use the Pr_20May1.csv data, which we've loaded into Tvalues above.
# create a time series object with frequency of 12, since we're dealing with monthly data
t_ser1 <- ts(Tvalues$x, freq = 12)
output = decompose(t_ser1)
plot(output)
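For comparison, if the seasonal variation scaled with the level of the series (and all values were strictly positive), you could request a multiplicative model instead; a minimal sketch:
output_mult = decompose(t_ser1, type = "multiplicative")
plot(output_mult)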
The underlying data is available as a set of four vectors:
output$x - the observed data
output$seasonal - the seasonal variation
output$trend - the overall trend
output$random - the random noise
Now deseasonalize the data. This leaves the noise (random) plus the trend:
# subtract the seasonal component from the original time series
t_des <- output$x - output$seasonal
# plot the deseasonalized data
plot(t_des)
# Find the ACF of the deseasonalized time series data
acf_des <- acf(as.numeric(t_des), type = "correlation", lag.max = 64)
Note that if you don't use as.numeric as shown above, the values on the x axis will be based on the cycle number, which for this data set represents the year. In the plot above, the numbers correspond to the months.
acf_des <- acf(t_des, type = "correlation", lag.max = 64)
What we see above is some weak coherence in the data, i.e. there is still a signal, but not a strong one.
To examine the relation between the observed, random, seasonal and trend data, below you can see the histograms of each component.
What we see is that the range of the observed values is much greater than that of the random and the trend values. This gives us some confidence in the output of the analysis. Now what would happen if we analyzed a time series composed of completely random values?
# 20 "years" of random values uniformly distributed
t_ser_random <- ts(runif(20*12), freq = 12)
output = decompose(t_ser_random)
plot(output)
The decompose function does show a seasonal component, but it's important to note the scale: the range of the random values is much greater than the range of the seasonal values.
Now looking at the histograms of the components for the decomposition on random values:
What we see now is that the range of the random values is nearly as large as that of the observed values (since they're all random anyway). As a result, we can't have much confidence in the overall trend or the seasonal component. In other words, if the random component is very large, then we should doubt the analysis.
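One way to check this numerically instead of reading it off the histograms is to compare the spread of each component; a sketch:
# range (max - min) of each component of the decomposition
sapply(list(observed = output$x,
            seasonal = output$seasonal,
            trend    = output$trend,
            random   = output$random),
       function(comp) diff(range(comp, na.rm = TRUE)))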
Decompose the time series from Practical7_time_series.txt using classical decomposition.
Hint: you need to construct an appropriate time series object first using ts(). From the discussion about this data set above, you should be able to pick the correct frequency for the time series object.
stl()
The basic syntax for the stl function is:
stl(x, s.window,
t.window = NULL,
l.window,
robust = FALSE)
Arguments:
x - univariate time series to be decomposed. This should be an object of class "ts" with a frequency greater than one.
s.window - either the character string "periodic" or the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7, according to Cleveland et al. This has no default.
t.window - the span (in lags) of the loess window for trend extraction, which should be odd. If NULL, the default, nextodd(ceiling((1.5*period) / (1-(1.5/s.window)))), is taken.
l.window - the span (in lags) of the loess window of the low-pass filter used for each subseries. Defaults to the smallest odd integer greater than or equal to frequency(x), which is recommended since it prevents competition between the trend and seasonal components. If not an odd integer, its given value is increased to the next odd one.
robust - logical indicating if robust fitting should be used in the loess procedure.
How it works:
In the stl function there is an inner loop that separates the seasonal and trend components. This works in principle similarly to classical decomposition, except that it uses more sophisticated smoothing methods.
Therefore you can set a smoothing window (l.window) for the seasonal component. This is by default equal to the next odd integer which is greater than or equal to the cycle length. For the purposes of this practical, it is better to leave it like this.
You can set a window (s.window) over how many cycles the seasonal cycle can vary. In the classical decomposition all the seasonal cycles are simply averaged; here this is done by a more complicated weighted-average/fitting method. This number should be odd and >= 7.
You can set a window (t.window) that determines how smooth the trend should be. (If the trend is smoother, then you get more variability in the remainder.) It is recommended to set t.window >= 1.5*m/(1 - 1.5/s.window), where m is the value set for the frequency parameter of your time series.
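As a worked example (a sketch): for monthly data (m = 12) and s.window = 15, the values used further below, the recommended minimum works out to:
1.5 * 12 / (1 - 1.5 / 15)  # = 20, so the next odd integer, 21, satisfies the recommendation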
You can play around with these parameters to get the trend estimate you like.
There is an optional outer loop that provides robustness. This is recommended if you have many outliers or a not-well-behaved time series. You set this to TRUE or FALSE using the robust parameter in the stl function.
We again use the Tvalues data, which is from Pr_20May1.csv.
t_ser1 <- ts(Tvalues$x, freq = 12)
output1 = stl(t_ser1, s.window = 15, t.window = 12*15+1)
plot(output1)
On the right of the plot you will see several gray bars. These give an indication of the scale of the different plots. For example, the width of the bar on the trend plot corresponds to the same range as the smaller gray bar on the data plot. In other words, the trend is much smaller than the data values. We can also see that the trend is much smaller than the noise (i.e. the remainder).
To extract the different parts of the decomposed time series, we will need to use a different syntax than we did with the decompose function. First we look at the first few rows of the time.series matrix associated with output1:
head(output1$time.series)
## seasonal trend remainder
## [1,] -69.38701 96.89005 -15.5030329
## [2,] -74.09804 96.87220 26.2258403
## [3,] -55.75831 96.85435 -8.0960401
## [4,] -17.03611 96.83650 9.1996174
## [5,] 19.80396 96.81865 -0.6226057
## [6,] 49.33438 96.80080 -11.1351754
In order to access these, we need to use the following syntax:
output1$time.series[,"seasonal"]
output1$time.series[,"trend"]
output1$time.series[,"remainder"]
Note that the original data is not included in output1. In this example, we would have to access it using Tvalues$x, which contains the vector of values. Alternatively, we could use t_ser1 if we want the values which are associated with specific times. In other words, Tvalues$x is a time series of temperatures (with no explicit information about when the temperatures were observed), while t_ser1 is a time series of temperatures with months associated with each observation.
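As with decompose, you can combine these pieces; for instance, a sketch of a seasonally adjusted (deseasonalized) series from the stl output:
# subtract the seasonal component from the original time series object
t_des_stl <- t_ser1 - output1$time.series[, "seasonal"]
plot(t_des_stl)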
We now take the remainder and examine its autocorrelation:
t_rem <- output1$time.series[,"remainder"]
acf_rem <- acf(as.numeric(t_rem), type = "correlation", lag.max = 64)
With this we can see some coherence: for example, if one month was warmer than average, the next month is likely to be warmer than average as well. This is separate from the seasonal pattern. Part of the reason for this is that a heat wave lasting a week or two could span the border between two months, raising the average temperature for both. It could also be that an unusually warm May causes a warmer June, especially in coastal areas. This would be due to the ocean having a higher starting temperature in June, when we would expect increased solar radiation due to the changing angle of the earth.
Decompose the time series from Practical7_time_series.txt using stl decomposition. Again use ts and an appropriate value for frequency as you did in the previous exercise above.
Play with the parameters s.window and t.window and observe how these influence the trend, the remainder, and the seasonal pattern.
Note that the results you get may vary a bit from what you see below, based on the values you choose for s.window and t.window.
Calculate and plot the de-trended time series.
Plot and interpret the ACF of the de-trended time series:
With the help of the ACF, investigate whether the remainder still contains information or whether it is random noise.
There is an indication of a weak correlation at lag 1 and a weak anticorrelation at lags 2 and 3. This might indicate a causal relationship, i.e., that a day with a lot of internet traffic is followed by another day with a lot of internet traffic. But since the correlation is quite weak, it might also be due to imperfect decomposition or to outliers (e.g. the first week, which has a much higher than average weekly cycle).