Today, we continue our exploration of multi-step time-series forecasting with `torch`. This post is the third in a series.

Initially, we covered the basics of recurrent neural networks (RNNs), and trained a model to predict the very next value in a sequence. We also found we could forecast quite a few steps ahead by feeding back individual predictions in a loop.

Next, we built a model “natively” for multi-step prediction. A small multi-layer perceptron (MLP) was used to project RNN output to several time points in the future.

Of both approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch to it: When the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between those. (Imagine a weather forecast for ten days that never got updated.)

Now, we’d like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: It’s exactly the kind of situation we see with machine translation or summarization.

Quite fittingly, the types of models employed to these ends are named sequence-to-sequence models (often abbreviated *seq2seq*). In a nutshell, they split up the task into two parts: an encoding and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first try. But the decoder has more information at its disposal: At each iteration, its processing is based on the previous prediction as well as the previous state. That previous state will be the encoder’s when a loop is started, and its own ever thereafter.

Before discussing the model in detail, we need to adapt our data input mechanism.

We continue working with `vic_elec`, provided by `tsibbledata`.

Again, the dataset definition in the current post looks a bit different from the way it did before; it’s the shape of the target that differs. This time, `y` equals `x`, shifted to the left by one.

The reason we do this is owed to the way we are going to train the network. With *seq2seq*, people often use a technique called “teacher forcing” where, instead of feeding back its own prediction into the decoder module, you pass it the value it *should* have predicted. To be clear, this is done during training only, and to a configurable degree.
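To make the shift concrete, here is a minimal base-R sketch on a made-up series. The names `toy_series` and `toy_timesteps` are illustrative assumptions, not part of the post’s code; only the index arithmetic mirrors what the dataset’s `.getitem()` does.

```r
# Hypothetical toy series standing in for the standardized demand values
toy_series <- c(10, 20, 30, 40, 50, 60)
toy_timesteps <- 4

start <- 1
end <- start + toy_timesteps - 1
lag <- 1

# input window, and target window shifted to the left by one
x <- toy_series[start:end]
y <- toy_series[(start + lag):(end + lag)]

print(x)
print(y)
```

Apart from its last element, `y` repeats values that are already present in `x`.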

```
library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)
library(zeallot)

# one week of half-hourly measurements, for input as well as forecast
n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

# standardize all sets with training-set statistics
train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",

  initialize = function(x, n_timesteps, sample_frac = 1) {
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)

    n <- length(self$x) - self$n_timesteps - 1

    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
  },

  .getitem = function(i) {
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1

    list(
      x = self$x[start:end],
      # target is the input window, shifted to the left by one
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
  },

  .length = function() {
    length(self$starts)
  }
)
```

Dataset as well as dataloader instantiations can then proceed as before.

```
batch_size <- 32
train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)
valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)
test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
```
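As a rough illustration of what `sample_frac` does, the following base-R sketch repeats the start-index computation from `initialize()` on a made-up series. The toy sizes (`toy_series`, `toy_timesteps`) are assumptions chosen so that the numbers are easy to check by hand.

```r
set.seed(777)

# Hypothetical toy values; the post uses a full year of half-hourly data
toy_series <- rnorm(21)
toy_timesteps <- 4
sample_frac <- 0.5

# number of admissible window starts, computed as in initialize()
n <- length(toy_series) - toy_timesteps - 1

# keep a random, sorted subset of the start positions
starts <- sort(sample.int(n = n, size = n * sample_frac))

length(starts)
```

With `sample_frac = 0.5`, half of the admissible windows are kept, and each epoch iterates over that same subset.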

Technically, the model consists of three *modules*: the aforementioned encoder and decoder, and the *seq2seq* module that orchestrates them.

### Encoder

The encoder takes its input and runs it through an RNN. Of the two things returned by a recurrent neural network, outputs and state, so far we’ve only been using the output. This time, we do the opposite: We throw away the outputs, and only return the state.

If the RNN in question is a GRU (and assuming that of the outputs, we take just the final time step, which is what we’ve been doing throughout), there really is no difference: The final state equals the final output. If it’s an LSTM, however, there is a second kind of state, the “cell state”. In that case, returning the state instead of the final output will carry more information.

```
encoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
  },

  forward = function(x) {
    x <- self$rnn(x)

    # return last states for all layers
    # per layer, a single tensor for GRU, a list of two tensors for LSTM
    x <- x[[2]]
    x
  }
)
```

### Decoder

In the decoder, just like in the encoder, the main component is an RNN. In contrast to previously-shown architectures, though, it does not just return a prediction. It also reports back the RNN’s final state.

```
decoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }

    self$linear <- nn_linear(hidden_size, 1)
  },

  forward = function(x, state) {
    # input to forward:
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)

    # break up RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden is (1, batch_size, hidden_size), like the incoming state
    c(output, next_hidden) %<-% x

    output <- output$squeeze(2)
    output <- self$linear(output)

    list(output, next_hidden)
  }
)
```

### `seq2seq` module

`seq2seq` is where the action happens. The plan is to encode once, then call the decoder in a loop.

If you look back to the decoder’s `forward()`, you see that it takes two arguments: `x` and `state`.

Depending on the context, `x` corresponds to one of three things: final input, preceding prediction, or prior ground truth.

- The very first time the decoder is called on an input sequence, `x` maps to the final input value. This is different from a task like machine translation, where you would pass in a start token. With time series, though, we’d like to continue where the actual measurements stop.

- In further calls, we want the decoder to continue from its most recent prediction. It is only logical, thus, to pass back the preceding forecast.

- That said, in NLP a technique called “teacher forcing” is commonly used to speed up training. With teacher forcing, instead of the forecast we pass the actual ground truth, the thing the decoder should have predicted. We do that only in a configurable fraction of cases and, naturally, only while training. The rationale behind this technique is that without this kind of re-calibration, consecutive prediction errors can quickly erase any remaining signal.

`state`, too, is polyvalent. But here, there are just two possibilities: encoder state and decoder state.

- The first time the decoder is called, it is “seeded” with the final state from the encoder. Note how this is *the only time* we make use of the encoding.

- From then on, the decoder’s own previous state will be passed. Remember how it returns two values, forecast and state?
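The control flow just described can be sketched without any neural network at all. In the following base-R toy, `toy_decoder()` is a hypothetical stand-in (a running average, not an RNN), and `decode()` is an illustrative helper, not a `torch` function. The sketch only shows where the final input, the previous prediction, the ground truth, and the two kinds of state enter the loop.

```r
# Hypothetical stand-in for the decoder: no RNN, just a running average
toy_decoder <- function(input, state) {
  new_state <- 0.5 * state + 0.5 * input
  list(pred = new_state, state = new_state)
}

decode <- function(last_input, encoder_state, y, n_forecast, teacher_forcing_ratio) {
  outputs <- numeric(n_forecast)

  # first call: final input value plus encoder state
  out <- toy_decoder(last_input, encoder_state)
  outputs[1] <- out$pred

  for (t in 2:n_forecast) {
    # later calls: ground truth or previous prediction, plus the decoder's own state
    teacher_forcing <- runif(1) < teacher_forcing_ratio
    input <- if (teacher_forcing) y[t - 1] else out$pred
    out <- toy_decoder(input, out$state)
    outputs[t] <- out$pred
  }

  outputs
}

# never use ground truth (as in validation), versus always use it
free_running <- decode(1, 0, y = rep(1, 4), n_forecast = 4, teacher_forcing_ratio = 0)
always_forced <- decode(1, 0, y = rep(1, 4), n_forecast = 4, teacher_forcing_ratio = 1)
```

In this toy setup, the free-running forecasts stay where the first prediction left them, while the forced ones are pulled towards the ground truth at every step.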

```
seq2seq_module <- nn_module(

  initialize = function(type, input_size, hidden_size, n_forecast,
                        num_layers = 1, encoder_dropout = 0) {
    self$encoder <- encoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers)
    self$n_forecast <- n_forecast
  },

  forward = function(x, y, teacher_forcing_ratio) {
    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)

    # encode current input sequence
    hidden <- self$encoder(x)

    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden)

    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out

    # store first prediction
    outputs[ , 1] <- pred$squeeze(2)

    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {
      # call decoder on either ground truth or previous prediction, plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state)

      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[ , t] <- pred$squeeze(2)
    }
    outputs
  }
)

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)
```

The training procedure is *mainly* unchanged. We do, however, need to decide about `teacher_forcing_ratio`, the proportion of decoding steps we want to perform re-calibration on. In `valid_batch()`, this should always be `0`, while in `train_batch()`, it’s up to us (or rather, experimentation). Here, we set it to `0.3`.

```
optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()

  loss$item()
}

# no teacher forcing during validation
valid_batch <- function(b, teacher_forcing_ratio = 0) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)

  loss$item()
}

for (epoch in 1:num_epochs) {
  net$train()
  train_loss <- c()

  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })

  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))

  net$eval()
  valid_loss <- c()

  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })

  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
```

```
Epoch 1, training: loss: 0.37961
Epoch 1, validation: loss: 1.10699
Epoch 2, training: loss: 0.19355
Epoch 2, validation: loss: 1.26462
# ...
# ...
Epoch 49, training: loss: 0.03233
Epoch 49, validation: loss: 0.62286
Epoch 50, training: loss: 0.03091
Epoch 50, validation: loss: 0.54457
```

It’s interesting to compare performance for different settings of `teacher_forcing_ratio`. With a setting of `0.5`, training loss decreases a lot more slowly; the opposite is seen with a setting of `0`. Validation loss, however, is not affected significantly.

The code to inspect test-set forecasts is unchanged.

```
net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)

  test_preds[[i]] <- preds
  i <<- i + 1
})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

# pad each forecast with NAs so it lines up with the January 2014 series
test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)

# undo the standardization before plotting
preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)

preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()
```
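The padding arithmetic used for plotting is easy to get wrong, so here is a small base-R sketch with made-up sizes. All names and numbers (`n_obs`, `toy_timesteps`, `toy_forecast`, `toy_mean`, `toy_sd`, `offset`) are hypothetical. The pattern follows the second forecast above: pad on both sides until there is one entry per observation, then undo the standardization.

```r
# Hypothetical toy sizes; in the post these are n_timesteps, n_forecast
# and the number of rows in the January 2014 data
n_obs <- 10
toy_timesteps <- 3
toy_forecast <- 2
toy_mean <- 100
toy_sd <- 20

# a standardized forecast produced from the window that starts at offset 4
offset <- 4
pred <- c(0.5, -1)

# pad on both sides so the vector has one entry per observation
padded <- c(
  rep(NA, toy_timesteps + offset),
  pred,
  rep(NA, n_obs - offset - toy_timesteps - toy_forecast)
)

# undo the standardization; NA entries stay NA
rescaled <- padded * toy_sd + toy_mean
print(rescaled)
```

Because `NA` propagates through arithmetic, only the forecast positions end up with values, which is why the plotted forecast lines cover just part of the month.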

Comparing this to the forecast obtained from last time’s RNN-MLP combo, we don’t see much of a difference. Is this surprising? To me it is. If asked to speculate about the reason, I would probably say this: In all of the architectures we’ve used so far, the main carrier of information has been the final hidden state of the RNN (the *one and only* RNN in the two previous setups, the *encoder* RNN in this one). It will be interesting to see what happens in the last part of this series, when we augment the encoder-decoder architecture by *attention*.

Thanks for reading!

Photo by Suzuha Kozuki on Unsplash