1 Load and Install Required Libraries

We need to install the forecast package for the ma (moving average) function which we’ll use later in this practical.

install.packages("forecast")

Now load the required libraries for this practical:

library(ggplot2)
library(dplyr)
library(forecast)

2 Creating Time Series Objects

For this example, download the Pr_20May1.csv data set and load it into R. Once you read it in, you’ll have a data frame with a single column, x, whose name is specified in the csv file itself.

Tvalues = read.csv("Pr_20May1.csv")

As we showed in the previous practical, we can create a sequence of dates (using seq), starting at one date with a defined time interval for each step.

Tvalues$t = seq(from = as.Date("1946-01-01"), to = as.Date("2014-12-01"), by = "1 month")

Now we plot the data:

ggplot(Tvalues, aes(x=t, y=x)) + geom_line()

2.1 Exercise

Download the file Practical7_time_series.txt and load it into R. This contains daily data on internet traffic from November 20, 2004 until January 27, 2005. A sketch of one possible approach is given after the task list below.

  • Add to the data a vector of the times corresponding to each data point
  • Create a line plot of the amount of traffic per day
  • From visually inspecting the plot, estimate the period of the seasonal trend. In other words, how often do you think the fluctuations occur?
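
A sketch of one possible approach (the data frame name traffic, the read.table() arguments, and the column name x are assumptions here; adjust them to match the actual layout of the file):

# read the traffic values; the header and separator settings are
# assumptions that depend on the actual file layout
traffic = read.table("Practical7_time_series.txt", header = FALSE, col.names = "x")

# daily observations from November 20, 2004 to January 27, 2005
traffic$t = seq(from = as.Date("2004-11-20"), to = as.Date("2005-01-27"), by = "1 day")

ggplot(traffic, aes(x = t, y = x)) + geom_line()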

2.2 Using ts()

We can also use the ts function to create time series objects. The relevant syntax is shown below and you can learn more by typing ?ts into the console.

ts(data = NA, start = 1, end = numeric(), frequency = 1,
   deltat = 1)

Arguments:

  • data - a vector or matrix of the observed time-series values.
  • start - the time of the first observation. Either a single number or a vector of two integers, which specify a natural time unit and a (1-based) number of samples into the time unit. See the examples for the use of the second form.
  • end - the time of the last observation, specified in the same way as start.
  • frequency - the number of observations per unit of time. Some common examples of values are:
    • 12 - monthly data with a yearly cycle
    • 7 - daily data with a weekly cycle
    • 24 - hourly measurements within a day
  • deltat - the fraction of the sampling period between successive observations; e.g., 1/12 for monthly data. Only one of frequency or deltat should be provided.

To perform the same operation as in the previous example, we can also do:

t_ser <- ts(Tvalues$x, start = c(1946, 1), freq = 12)
plot(t_ser)

You can also define a time series without a start date. Using frequency = 12 means that you have 12 observations per cycle, so the period of the seasonal variation is 12. Now the values on the x axis correspond to the cycle number, which in this case refers to the number of years after the first observation.

t_ser1 <- ts(Tvalues$x, freq = 12)
plot(t_ser1)

3 Moving-average smoothing

We will now use the ma function for moving-average smoothing. To do this, make sure to load the forecast library:

library(forecast)

3.1 Reminder

The formula for creating a moving average is:

\(T_{t} = \frac{1}{m} \sum_{j=-k}^{k} x(t + j)\) with \(m = 2k + 1\)

The main idea is that you take a set of \(m\) data points, with \(k\) points to the left and \(k\) points to the right of \(x(t)\), and average them together with \(x(t)\).
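
As a small worked example with a made-up vector, for \(m = 3\) (\(k = 1\)) each smoothed value is the mean of a data point and its two immediate neighbours:

# T(2) averages x(1), x(2) and x(3); T(3) averages x(2), x(3) and x(4)
x = c(2, 4, 8, 16, 32)
mean(x[1:3])  # 14/3, approximately 4.67
mean(x[2:4])  # 28/3, approximately 9.33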

The syntax for the moving average function in R is:

ma(x, order, centre=TRUE)

Arguments:

  • x - Univariate time series
  • order - Order of moving average smoother
  • centre - If TRUE, then the moving average is centered for even orders.
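
The centre argument only matters for even orders, where the window cannot be symmetric around a single time point; with centre = TRUE the result is kept aligned with the original time points by averaging two adjacent windows (a so-called 2 x m moving average). A small sketch of the difference:

# odd order: a simple symmetric average around each point
ma(c(1, 3, 5, 7, 9), order = 3)

# even order with centre = TRUE: a 2 x 4 moving average; for a linear
# sequence such as 1:10 the centered result reproduces the original values
ma(1:10, order = 4, centre = TRUE)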

3.2 Example

Here we create a single moving average on the Pr_20May1.csv data we loaded in above.

ma5 = ma(Tvalues$x, order = 5)

To demonstrate what is being calculated, the third element of the moving average is the average of elements 1:5 in the original data (k=2, so we include data 2 places to the left and right). The seventh element of the moving average is the average of elements 5:9 in the original data.

ma5[3]
## [1] 59.8
mean(Tvalues$x[1:5])
## [1] 59.8
ma5[7]
## [1] 146
mean(Tvalues$x[5:9])
## [1] 146

Here we create a new data frame by adding two moving averages to Tvalues as new columns. For this, we use the mutate function from the dplyr package to add the two columns ma5 and ma13.

Tvalues_with_ma = Tvalues %>% 
  mutate(ma5 = ma(x, order = 5), 
         ma13 = ma(x, order = 13))

Here we use head with n=10 to look at the first ten rows of the resulting data frame:

head(Tvalues_with_ma, n=10)
##      x          t   ma5     ma13
## 1   12 1946-01-01    NA       NA
## 2   49 1946-02-01    NA       NA
## 3   33 1946-03-01  59.8       NA
## 4   89 1946-04-01  84.4       NA
## 5  116 1946-05-01 108.6       NA
## 6  135 1946-06-01 134.0       NA
## 7  170 1946-07-01 146.0 83.84615
## 8  160 1946-08-01 143.8 78.53846
## 9  149 1946-09-01 131.4 75.92308
## 10 105 1946-10-01  99.4 78.84615

We can now see the effects of the different values of order in the ma function. Specifically:

  • order = 5 means that 5 data points are averaged (2 to the left and right of the data point)
  • order = 13 means that 13 data points are averaged (6 to the left and right of the data point)

The first and last few values of ma5 and ma13 are NA because the centered average needs data points on both sides. For example, the first data point has no data points to its left, so its moving average is defined as NA.

Now we can plot everything and see how the moving averages compare to the original data. Note that in the geom_line statements, we only specify the value for y, since the data frame and x have already been specified with ggplot(Tvalues_with_ma, aes(x=t)).

ggplot(Tvalues_with_ma, aes(x=t)) + 
  geom_line(aes(y=x, color="original data"), size=2) + 
  geom_line(aes(y=ma5, color="ma5")) + 
  geom_line(aes(y=ma13, color="ma13")) + 
  xlim(c(as.Date("1980-01-01"), 
         as.Date("1984-01-01"))) # Just show data from 1980-1984, otherwise the plot is too cluttered

3.3 Exercise

Calculate moving averages with order = 3, 7, and 15 and add them to the plot.

Again use the data from Practical7_time_series.txt with the time values added in as shown above.

The warning messages that you see are related to the NA values generated at the edges by the different orders of moving average.

4 Lagging time values

4.1 Example

An easy way to do this is with the lag function from dplyr. The lag function takes a vector as input and shifts its values by the number of places specified. Take for example a vector containing the numbers 1, 2, …, 10:

a = c(1:10)
print(a)
##  [1]  1  2  3  4  5  6  7  8  9 10
lag(a, 3)
##  [1] NA NA NA  1  2  3  4  5  6  7

As you can see, all the values are pushed over three places and the left side is filled in with NA. The values 8, 9 and 10 are pushed out and discarded.

Here we shift the values for x in Tvalues by three months. We use the mutate function to create a new column named lag_x in the data frame.

Tvalues_lag3 = Tvalues %>% mutate(lag_x = lag(x, 3))

Looking at the first few rows of the data frame, we can see how NA values are inserted at first, with the lagged values starting at row 4:

head(Tvalues_lag3)
##     x          t lag_x
## 1  12 1946-01-01    NA
## 2  49 1946-02-01    NA
## 3  33 1946-03-01    NA
## 4  89 1946-04-01    12
## 5 116 1946-05-01    49
## 6 135 1946-06-01    33

Below we visually investigate whether there might be a correlation between the original data and lagged versions of it. If they are correlated, we would expect to see the points arranged in nearly a straight line.

Instead of creating a new data frame for each lag value, we pass the result directly to ggplot using data = Tvalues %>% mutate(lag_x = lag(x, 3)).

# data shifted 3 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 3)), 
       aes(x=x, y=lag_x)) + 
  geom_point()

# data shifted 6 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 6)), 
       aes(x=x, y=lag_x)) + 
  geom_point()

# data shifted 10 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 10)), 
       aes(x=x, y=lag_x)) + 
  geom_point()

# data shifted 12 months
ggplot(data = Tvalues %>% mutate(lag_x = lag(x, 12)), 
       aes(x=x, y=lag_x)) + 
  geom_point()
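
As a numeric counterpart to these scatter plots, we could also compute the correlation coefficients directly (a sketch; use = "complete.obs" drops the rows where lag() inserted NA values):

# correlation of x with its lagged versions for several lag amounts
sapply(c(3, 6, 10, 12),
       function(k) cor(Tvalues$x, lag(Tvalues$x, k), use = "complete.obs"))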

4.2 Exercise

Again use the data from Practical7_time_series.txt with the time values added in as shown above. Using a lag of 10, show the actual correlation (scatter) plot.

5 Auto- and Cross- Covariance and -Correlation Function Estimation

5.1 Reminder of the lecture

For a time series, a lag 1 correlation is the correlation of the time series \(x(t)\) with the time series shifted by one time unit, \(x(t + \Delta t)\).

For the autocorrelation function, we plot the autocorrelation coefficient \(r(k \Delta t)\) against the lag \(k\).
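
For reference, the sample autocorrelation coefficient at lag \(k\) is estimated as \(r_k = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}\), which is the quantity computed by acf with type = "correlation".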

The basic syntax of the acf function is:

acf(x, lag.max = NULL,
    type = c("correlation", "covariance", "partial"), 
    plot = TRUE, na.action = na.fail)

Arguments:

  • x - a univariate or multivariate numeric time series object or a numeric vector or matrix.
  • lag.max - maximum lag at which to calculate the acf. Default is 10*log10(N/m) where N is the number of observations and m the number of series. Will be automatically limited to one less than the number of observations in the series.
  • type - character string giving the type of acf to be computed. Allowed values are "correlation" (the default), "covariance" or "partial". Will be partially matched.
    • For this practical, we will only use type="correlation".
  • plot - logical. If TRUE (the default) the acf is plotted.
  • na.action - function to be called to handle missing values. na.pass can be used.
    • One thing to be aware of is that the acf function will fail with the error Error in na.fail.default(as.ts(x)) : missing values in object if any value in the data is NA. To get around this, you can add na.action = na.pass, as illustrated below.
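
As a quick illustration, the Tvalues_with_ma data frame created earlier has NA values at the edges of its ma5 column, so its ACF can only be computed with na.pass:

# without na.action = na.pass this call would stop with the error shown above
acf(Tvalues_with_ma$ma5, type = "correlation", na.action = na.pass)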

5.2 Example

Using the Pr_20May1.csv data, we estimate the autocorrelation with a maximum lag of 64. We re-read the file so that Tvalues again contains only the numeric column x; the Date column we added earlier would get in the way of acf.

Tvalues = read.csv("Pr_20May1.csv")
acf_T = acf(Tvalues, type = "correlation", lag.max = 64)

If you want to look at the value at a specific lag, you can index into acf_T$acf. Note that the first element corresponds to lag 0, so acf_T$acf[10] is the autocorrelation at lag 9; the value at lag 10 would be acf_T$acf[11].

acf_10 = acf_T$acf[10]
print(acf_10)
## [1] 0.007269122

5.3 Exercise

Calculate the autocorrelation function (ACF) of the time series data Practical7_time_series.txt and plot it with a maximum lag of 30.

6 Decomposition

The basic syntax is:

decompose(x, type = c("additive", "multiplicative"), filter = NULL)

By default, an additive model is used if no value is specified for type.
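
For example, a multiplicative model would be requested as follows (a sketch; a multiplicative decomposition is only sensible for strictly positive data):

# request a multiplicative instead of an additive model
output_mult = decompose(t_ser1, type = "multiplicative")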

Here we use the Pr_20May1.csv data which we’ve loaded into Tvalues above.

6.1 Classical

6.1.1 Example

# create a time series object with frequency of 12, since we're dealing with monthly data
t_ser1 <- ts(Tvalues$x, freq = 12)
output = decompose(t_ser1)
plot(output)

The underlying data is available as a set of four vectors:

  • output$x - the observed data
  • output$seasonal - the seasonal variation
  • output$trend - the overall trend
  • output$random - the random noise
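
Since an additive model was used, these components sum back to the observed data; a quick sketch to verify:

# for the additive model, x = seasonal + trend + random
# (the result is NA at the ends, where the moving-average trend is undefined)
head(output$x - (output$seasonal + output$trend + output$random), n = 10)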

Now we deseasonalize the data. The result shows the noise (random) plus the trend:

# subtract the seasonal component from the original time series
t_des <- output$x - output$seasonal
# plot the deseasonalized data
plot(t_des)

# Find the ACF of the deseasonalized time series data
acf_des <- acf(as.numeric(t_des), type = "correlation", lag.max = 64)

Note that if you don’t use as.numeric as shown above, the values on the x axis will be based on the cycle number, which for this data set represents the year. In the plot above, the numbers correspond to the months.

acf_des <- acf(t_des, type = "correlation", lag.max = 64)

What we see above is a small amount of coherence in the data, i.e. there is still some signal, but not very much.

To examine the relation between the observed, random, seasonal and trend data, below you can see the histograms of each component.
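
A minimal sketch of one way such histograms could be drawn with base R (the original practical may have produced them differently):

# draw a histogram of each decomposition component for comparison
par(mfrow = c(2, 2))  # 2 x 2 grid of plots
hist(as.numeric(output$x), main = "observed", xlab = "")
hist(as.numeric(output$seasonal), main = "seasonal", xlab = "")
hist(na.omit(as.numeric(output$trend)), main = "trend", xlab = "")
hist(na.omit(as.numeric(output$random)), main = "random", xlab = "")
par(mfrow = c(1, 1))  # reset the plotting grid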

What we see is that the range of the observed values is much greater than that of the random and the trend values. This gives us some confidence in the output of the analysis. Now what would happen if we analyzed a time series composed of completely random values?

# 20 "years" of random values uniformly distributed
t_ser_random <- ts(runif(20*12), freq = 12)
output = decompose(t_ser_random)
plot(output)

The decompose method does show a seasonal component, but it’s important to note the scale: the range of the random values is much greater than the range of the seasonal values.

Now look at the histograms of the components for the decomposition of the random values (the same histogram sketch shown above can be reused here):

What we see now is that the range of the random values is nearly as large as that of the observed values (since they’re all random anyway). As a result, we can’t have much confidence in the overall trend or in the seasonal pattern. In other words, if the random component is very large, then we should doubt the analysis.

6.1.2 Exercise

Decompose the time series from Practical7_time_series.txt using classical decomposition.

Hint: you need to construct an appropriate time series object first using ts(). From the discussion about this data set above, you should be able to pick the correct frequency for the time series object.

6.2 Seasonal Decomposition of Time Series by Loess (stl)

6.2.1 Reminder

The basic syntax for the stl function is:

stl(x, s.window, 
    t.window = NULL,
    l.window, 
    robust = FALSE)

Arguments:

  • x - univariate time series to be decomposed. This should be an object of class “ts” with a frequency greater than one.
  • s.window - either the character string “periodic” or the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7, according to Cleveland et al. This has no default.
  • t.window - the span (in lags) of the loess window for trend extraction, which should be odd. If NULL, the default, nextodd(ceiling((1.5*period) / (1-(1.5/s.window)))), is taken.
  • l.window - the span (in lags) of the loess window of the low-pass filter used for each subseries. Defaults to the smallest odd integer greater than or equal to frequency(x) which is recommended since it prevents competition between the trend and seasonal components. If not an odd integer its given value is increased to the next odd one.
  • robust - logical indicating if robust fitting be used in the loess procedure.

How it works:

In the stl function there is an inner loop that separates the seasonal and trend components. This works in principle similarly to classical decomposition, except that it uses more sophisticated smoothing methods.

You can set a smoothing window (l.window) for the seasonal component. By default this is the next odd integer greater than or equal to the cycle length. For the purposes of this practical, it is best to leave it at this default.

You can set a window (s.window) controlling over how many cycles the seasonal cycle may vary. In the classical decomposition, all the seasonal cycles are simply averaged; here this is done by a more sophisticated weighted-average/fitting method. This number should be odd and >= 7.

You can set a window (t.window) that determines how smooth the trend should be. (If the trend is smoother, then you get more variability in the remainder.) It is recommended to set t.window >= 1.5*m/(1 - 1.5/s.window), where m is the value of the frequency parameter of your time series; this matches the default given above. For example, with m = 12 and s.window = 15 this gives 18/0.9 = 20, so the next odd value, 21, would be a natural starting point.

You can play around with these parameters, to get the trend estimate you like.

There is an optional outer loop, that provides robustness. This is recommended if you have many outliers or not-well behaved time series. You set this to TRUE or FALSE using the robust parameter in the stl function.

6.2.2 Example

We again use the Tvalues data which is from Pr_20May1.csv.

t_ser1 <- ts(Tvalues$x, freq = 12)
output1 = stl(t_ser1, s.window = 15, t.window = 12*15+1)
plot(output1)

On the right of the plot you will see several gray bars. These give an indication of the relative scale of the different panels: each bar represents the same range in the units of the data. For example, the bar on the trend panel corresponds to the same range as the smaller gray bar on the data panel. In other words, the variation in the trend is much smaller than the variation in the data values. We can also see that the trend is much smaller than the noise (i.e. the remainder).

To extract the different parts of the decomposed time series, we will need to use a different syntax than we did with the decompose function. First we look at the first few rows of the time.series matrix associated with output1:

head(output1$time.series)
##       seasonal    trend   remainder
## [1,] -69.38701 96.89005 -15.5030329
## [2,] -74.09804 96.87220  26.2258403
## [3,] -55.75831 96.85435  -8.0960401
## [4,] -17.03611 96.83650   9.1996174
## [5,]  19.80396 96.81865  -0.6226057
## [6,]  49.33438 96.80080 -11.1351754

In order to access these, we need to use the following syntax:

  • output1$time.series[,"seasonal"]
  • output1$time.series[,"trend"]
  • output1$time.series[,"remainder"]

Note that the original data is not included in output1. In this example, we would have to access it using Tvalues$x, which contains the plain vector of values. Alternatively, we could use t_ser1 if we want the values associated with specific times. In other words, Tvalues$x is a vector of temperatures (with no explicit information about when the temperatures were observed), while t_ser1 is a time series with a month associated with each observation.
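
For example, a deseasonalized series can be rebuilt from these pieces (a sketch using the objects defined above):

# subtract the stl seasonal component from the original time series
t_des1 <- t_ser1 - output1$time.series[,"seasonal"]
plot(t_des1)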

We now take the remainder and look at its autocorrelation:

t_rem <- output1$time.series[,"remainder"]
acf_rem <- acf(as.numeric(t_rem), type = "correlation", lag.max = 64)

With this we can see some coherence: if one month was warmer than average, the next month is likely to be warmer than average as well. This is separate from the seasonal pattern. Part of the reason for this is that a heat wave lasting a week or two could span the border between two months, raising the average temperature of both. It could also be that an unusually warm May causes a warmer June, especially in coastal areas: the ocean would have a higher starting temperature in June, at the same time as we would expect increased solar radiation due to the changing angle of the earth.

6.2.3 Exercise

Decompose the time series from Practical7_time_series.txt using stl decomposition. Again use ts and an appropriate value for frequency, as you did in the previous exercise above.

Play with the parameters s.window and t.window and observe how this influences the trend, the remainder, and the seasonal pattern.

Note that the results you get may vary a bit from what you see below based on the values you choose for s.window and t.window.

Calculate and plot the de-trended time series.

Plot and interpret the ACF of the de-trended time series:

With the help of the ACF investigate if the remainder still contains information, or if it is random noise.

There is an indication of a weak correlation at lag 1 and a weak anticorrelation at lags 2 and 3. This might indicate a causal relationship, i.e. that a day with much internet traffic is followed by another day with much internet traffic. But since the correlation is quite weak, it might also be due to imperfect decomposition or to outliers (e.g. the first week, which has a much higher than average weekly cycle).