Part 1: Echo State Networks
Echo state networks (ESNs) are a kind of recurrent neural network (RNN) whose recurrent weights are randomly initialized and then left fixed. This turns out to be sufficient for reasonably good performance on some time series tasks, and because only the linear readout is trained, there is an analytical solution for training the network. The general setting is given below.
Let's say we have a single input signal $U(t)$ and a single output signal $Y(t)$. We feed the input into a system that performs computation through an encoding matrix $W_{in}$, and read the result out through a decoding matrix $W_{out}$; only $W_{out}$ will be learned. For us, the system is a very simple model parameterized primarily by a matrix $W_{res}$. We will be performing temporal signal processing, so at each time step $W_{res}$ is applied to the internal state of the reservoir. We also apply $\tanh$ activations, so we will call these states neurons and the associated internal state signal their activations. The picture to have in mind is the following:
from PIL import Image
img = Image.open('esn.png');
img
(Image from "Long-Short Term Echo State Network for Time Series Prediction" by Kaihong Zheng et al.)
Specifically, denoting the neurons of the reservoir as $R$ we will consider a network with the following internal dynamics:
$R(t) = (1-\alpha)R(t-1) + \alpha \tanh(U(t)W_{in} + R(t-1)W_{res} + \beta)$
where $\alpha$ sets the bleed-through between time steps, and $\beta$ is a bias term. In the cells below, we include the bias by concatenating a $1$ with the input data.
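Before wrapping this into functions, here is a minimal sketch of a single update step of the recurrence above; the sizes, the input value, and the seed are arbitrary and chosen only for illustration.
import numpy as np

np.random.seed(0)
inSize, resSize, alpha = 1, 5, 0.7
Win = np.random.rand(inSize+1, resSize) - 0.5   # The +1 row carries the bias.
Wres = np.random.rand(resSize, resSize) - 0.5
R = np.zeros((1, resSize))                      # Reservoir state R(t-1).
u = 0.3                                         # One input sample U(t).
# R(t) = (1-alpha)R(t-1) + alpha*tanh(U(t)W_in + R(t-1)W_res + beta),
# with the bias beta folded in by concatenating a 1 onto the input.
R = (1 - alpha)*R + alpha*np.tanh(np.dot(np.hstack((1, u)), Win) + np.dot(R, Wres))
print(R.shape)  # (1, 5)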
def ESN_init(inSize, outSize, resSize, alpha, sparsity):
    Win = np.random.rand(inSize+1, resSize) - 0.5    # Encoding matrix (the +1 row handles the bias).
    Wres = np.random.rand(resSize, resSize) - 0.5    # Matrix describing the reservoir.
    # Technical details of echo state networks: we tune the sparsity of the reservoir
    # and normalize Wres by its spectral radius.
    Wres[np.random.rand(resSize, resSize) > sparsity] = 0
    spec_rad = max(abs(np.linalg.eig(Wres)[0]))
    Wres /= spec_rad
    return Win, Wres, inSize, resSize, alpha
def reservoir(data, Win, Wres, inSize, resSize, alpha):
    """
    alpha: a bleed-through constant that controls the memory of the neurons
    """
    datamatrix = np.zeros((data.shape[0], 1+inSize+resSize))
    # Initialize the neurons to some default value.
    R = .1*(np.ones((1, resSize)) - 0.5)
    for t in range(data.shape[0]):
        u = data[t]
        R = ((1 - alpha)*R +           # Bleed-through from the previous state.
             alpha*np.tanh(            # Activation function.
                 # New input and feedback from the previous time step.
                 np.dot(np.hstack((1, u)), Win) + np.dot(R, Wres)))
        datamatrix[t] = np.append(np.append(1, u), R)
    return datamatrix
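As a quick sanity check of the shapes (with arbitrary parameters and seed), the reservoir should return one row per time step and one column each for the bias, the input, and every neuron:
import numpy as np

np.random.seed(0)
Echo = ESN_init(inSize=1, outSize=1, resSize=5, alpha=0.7, sparsity=0.9)
states = reservoir(np.random.rand(100), *Echo)
print(states.shape)  # (100, 7): 1 bias column + 1 input + 5 neurons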
Let's see what this model can do, starting with NARMA10.
NARMA10
NARMA is an acronym for Nonlinear AutoRegressive Moving Average, and NARMA10 in particular is defined by the following time series:
$y(t) = 0.3\,y(t-1) + 0.05\,y(t-1)\sum_{i=1}^{10} y(t-i) + 1.5\,U(t-1)\,U(t-10) + 0.1,$ where $U(t)$ is a uniform random variable.
It's commonly used to benchmark the performance of RNNs, especially ESNs. To do this, the reservoir is run with inputs $U(t)$ and trained on $Y(t)$ up to some time $T$; using the analytic training solution in the next text cell, the reservoir is then used to predict the values of $Y(t)$ for $t>T$. We'll use it in this notebook to showcase the performance of these networks. It defines a temporal data processing task where the input is drawn uniformly at random from $[0, 0.5]$ and processed by the nonlinear, autoregressive function above. The generator is implemented as:
def fetchData(n):
    """Return a particular instance of NARMA10, on n inputs."""
    u = 0.5 * np.random.uniform(size=(n+10))
    y = np.zeros(shape=(n+10))
    for i in range(10, n+10):
        y[i] = 0.3 * y[i-1] + 0.05 * y[i-1] * np.sum(y[i-10:i]) + 1.5 * u[i-1] * u[i-10] + 0.1
    return (u[10:], y[10:])
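As a rough check that the generator matches the recurrence above (the index and seed below are arbitrary), the returned arrays should satisfy the NARMA10 update at any index past the first ten:
import numpy as np

np.random.seed(1)
u, y = fetchData(1000)
t = 500  # Any index >= 10 works.
rhs = 0.3*y[t-1] + 0.05*y[t-1]*np.sum(y[t-10:t]) + 1.5*u[t-1]*u[t-10] + 0.1
print(np.isclose(y[t], rhs))  # True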
We have a 1D input and output signal, and we select the remaining parameters heuristically. The one technical detail in the next cell is that we choose $W_{out}$ to minimize the loss of the network on our training data. The reservoir states are simply vectors of real numbers, so if we choose the squared error as our loss function, the optimal weight matrix is given by the standard least squares solution.
$$ \begin{align} W_{out} = (X^TX)^{-1}X^TY \end{align} $$
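In the code below we compute this via np.linalg.pinv; when the data matrix has full column rank, the pseudoinverse solution coincides with the normal-equations formula above. A toy check (random matrices of arbitrary size) is:
import numpy as np

np.random.seed(2)
X = np.random.rand(200, 7)   # Tall data matrix; full column rank with probability 1.
Y = np.random.rand(200)
W_pinv = np.dot(np.linalg.pinv(X), Y)
W_normal = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, Y))
print(np.allclose(W_pinv, W_normal))  # True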
We also measure reservoir performance with a typical metric - the normalized root mean square error.
$$ \begin{align} NRMSE = \frac{1}{\sigma_Y}\sqrt{\overline{(Y-\hat{Y})^2}} \end{align} $$
def test_NARMA10(resSize, inSize=1, outSize=1, train_cycles=11000, test_cycles=1000,
                 alpha=0.7, sparsity=0.9):
    Echo = ESN_init(inSize, outSize, resSize, alpha, sparsity)
    data_train, Y_train = fetchData(train_cycles)
    data_test, Y_test = fetchData(test_cycles)
    train = np.array(reservoir(data_train, *Echo))
    # Standard least squares solution.
    Wout = np.dot(np.linalg.pinv(train), Y_train)
    test = reservoir(data_test, *Echo)
    Yhat = np.dot(test, Wout)
    # We drop the initial datapoints because the reservoir needs to "warm up".
    NRMSE = np.sqrt(np.divide(np.mean(np.square(Y_test[50:] - Yhat[50:])), np.var(Y_test[50:])))
    return NRMSE, Y_test, Yhat, test
Now we will just plot the estimate from the reservoir, given the input data, against the desired output in the test set. We seed our examples so that we can sensibly compare performance improvements when we're running single experiments. Additionally, in this notebook, we set the reservoir size, fairly arbitrarily, to 5. With more neurons we can expect better performance, but the effects we are interested in can be observed at this scale, which has the added benefit of being quick to compute.
import numpy as np
import matplotlib.pyplot as plt
from itertools import chain, combinations
from functools import reduce
def plot_NARMA(resSize=5):
    np.random.seed(137)
    NRMSE, Y_test, Yhat, _ = test_NARMA10(resSize)
    plt.figure(figsize=(18, 6))
    plt.yscale('log')
    plt.plot(Yhat, color='red', linewidth=5, label='Single ESN Prediction')
    plt.plot(Y_test, color='green', linestyle=":", linewidth=2, label='Target Value')
    plt.ylim(.2, 1)
    plt.ylabel("NARMA(t)")
    plt.xlabel("Test Sample (t)")
    plt.title("Test Performance")
    print(f"Normalized root mean squared error is {NRMSE}.")
    return Yhat, Y_test
Yhat1, Y_test1 = plot_NARMA()
Normalized root mean squared error is 0.7496595380852676.
This isn't too bad, but it doesn't guarantee that the reservoir is learning. One simple way to get relatively good performance is to simply output the previous time step: as long as the function doesn't change much between steps, the error stays relatively small. One way to check whether this is happening is to see how correlated the predicted change is with the actual change. We plot this below, and crop outliers out of the plot --- just for visualization! The red line is y=x.
plt.scatter(Yhat1[1:] - Yhat1[:-1], Y_test1[1:] - Y_test1[:-1])
plt.xlabel("Predicted difference between steps")
plt.ylabel("Actual difference between steps")
plt.plot(np.linspace(-.3, .3), np.linspace(-.3, .3), color='r', label="y=x")
plt.tight_layout()
plt.xlim(-.3, .3)
plt.ylim(-.3, .3)
plt.legend();
First, we see there is noise, since the cloud has some width. Second, we see it's not aligned with y=x, but we at least see correlation. Let's see if we can improve this.
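To make the comparison with the "just repeat the previous value" strategy concrete, we can also compute the NRMSE of that persistence baseline on the same test series (a rough check, reusing Y_test1 from above):
import numpy as np

target = Y_test1[1:]
persistence = Y_test1[:-1]   # Predict y(t) with y(t-1).
nrmse_persistence = np.sqrt(np.mean((target - persistence)**2) / np.var(target))
print(f"Persistence baseline NRMSE: {nrmse_persistence:.3f}")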
Powerset Signals
Given some input signal, the learner is trying to make sense of the signal and process it in some way. In this work we fix the network size and instead engineer additional features. One reason to do this is that making the network larger is either too expensive or slows down computation; I was motivated by quantum computation, where this is the case. Primarily, we are interested in understanding whether the engineered features we consider can be useful.
Given $n$ neurons, a simple way to generate new features is to take, for each subset of neurons in the power set, the pointwise product of their activation signals. This multiplication produces linearly independent signals, and it is the simplest way I can think of to quickly create a very large collection of signals. There are other ways of creating new functions from an existing collection, but one motivation for the power set is the following:
Consider two-outcome functions $f_1(x), \ldots, f_n(x) \in \{0, 1\}$ (like the outcomes of circuits executing Boolean logic). Sums of these functions are contained in their linear span. What about products? For $k > 0$ we have $f_i^{k}(x) = f_i(x)$, so the span of the power-set products contains all polynomials of these functions. Generally speaking, short of computing functions exactly, approximating them by sums of high-degree polynomials (as in a Taylor expansion) is the best we can do.
That being said, we are not in the binary setting, and there are many, many more functions we could construct. For this tutorial, we are interested in showing the effect of noise on any collection of functions we add, and so we will stick with the power set as an example.
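As a toy illustration of the idempotence argument (two made-up binary signals): squaring a 0/1 signal leaves it unchanged, so the products over subsets already span polynomials in these functions.
import numpy as np

f1 = np.array([0., 1., 1., 0., 1.])
f2 = np.array([1., 1., 0., 0., 1.])
print(np.array_equal(f1**3, f1))          # True: f_i^k = f_i for binary signals.
# A polynomial such as 2*f1 + 3*f1*f2 - f2**2 collapses onto the power-set products:
poly = 2*f1 + 3*f1*f2 - f2**2
basis = np.stack([np.ones_like(f1), f1, f2, f1*f2], axis=1)
coeffs, *_ = np.linalg.lstsq(basis, poly, rcond=None)
print(np.allclose(np.dot(basis, coeffs), poly))  # True: poly lies in the span.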
def reservoir(data, Win, Wres, inSize, resSize, alpha, powerset):
    """
    alpha: a bleed-through constant that controls the memory of the neurons
    powerset: if True, append the pointwise products of every non-empty subset of neuron signals
    """
    datamatrix = np.zeros((data.shape[0], 1+inSize+resSize))
    # Initialize the neurons to some default value.
    R = .1*(np.ones((1, resSize)) - 0.5)
    for t in range(data.shape[0]):
        u = data[t]
        R = (1 - alpha)*R + alpha*np.tanh(np.dot(np.hstack((1, u)), Win) + np.dot(R, Wres))
        datamatrix[t] = np.append(np.append(1, u), R)
    num_points = datamatrix.shape[0]
    if not powerset:
        return datamatrix
    else:
        # Add the power set of signals: for every non-empty subset of neuron columns,
        # take the pointwise product over time.
        s = list(datamatrix.T)[2:]
        power_signals = [reduce(lambda a, b: a*b, el, np.ones(num_points))
                         for el in list(chain.from_iterable(
                             combinations(s, r) for r in range(len(s)+1)))[1:]]
        power_signals = list(datamatrix.T)[0:2] + power_signals
        return np.array(power_signals).T
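A quick shape check with arbitrary parameters: with 5 neurons, the power-set version should return 2 + (2^5 - 1) = 33 columns (the empty product is dropped), versus 7 without it.
import numpy as np

np.random.seed(3)
Echo = ESN_init(inSize=1, outSize=1, resSize=5, alpha=0.7, sparsity=0.9)
print(reservoir(np.random.rand(50), *Echo, powerset=False).shape)  # (50, 7)
print(reservoir(np.random.rand(50), *Echo, powerset=True).shape)   # (50, 33)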
Moving forward, we also introduce regularization, in the spirit of ridge regression, both to avoid overfitting and to help with the fact that, as we add more signals from the power set, the data matrix that constitutes the least squares problem becomes very poorly conditioned.
def test_NARMA10(resSize, inSize=1, outSize=1, train_cycles=9000, test_cycles=1000, alpha=0.7,
                 sparsity=0.9, powerset=True):
    Echo = ESN_init(inSize, outSize, resSize, alpha, sparsity)
    data_train, Y_train = fetchData(train_cycles)
    data_test, Y_test = fetchData(test_cycles)
    train = np.array(reservoir(data_train, *Echo, powerset))
    # Regularize by adding a small value to the diagonal of the data matrix before the
    # pseudoinverse; this plays the role of ridge regression, helping with overfitting
    # and with the ill-conditioned train matrix.
    coeff = 1E-8
    ridge = np.zeros(train.shape)
    np.fill_diagonal(ridge, coeff)
    # Least squares solution on the regularized data matrix.
    Wout = np.dot(np.linalg.pinv(train + ridge), Y_train)
    test = reservoir(data_test, *Echo, powerset)
    Yhat = np.dot(test, Wout)
    NRMSE = np.sqrt(np.divide(np.mean(np.square(Y_test[50:] - Yhat[50:])), np.var(Y_test[50:])))
    return NRMSE, Y_test, Yhat, test
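For reference, the cell above regularizes by perturbing the diagonal of the data matrix before taking the pseudoinverse; a textbook ridge (Tikhonov) readout would instead regularize the normal equations. A minimal sketch of that alternative (not what we run below, shown with the same small coefficient) is:
import numpy as np

def ridge_readout(X, Y, coeff=1E-8):
    """Solve (X^T X + coeff*I) W = X^T Y for the readout weights W."""
    return np.linalg.solve(np.dot(X.T, X) + coeff*np.eye(X.shape[1]), np.dot(X.T, Y))

# Usage inside test_NARMA10 would be: Wout = ridge_readout(train, Y_train)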
Now we can look again to see if the error has decreased. Whereas we previously had 7 signals (2 + 5), we now have 33 (2 + $2^5 - 1$, since the empty product is dropped), so we expect better performance!
Yhat2, Y_test2 = plot_NARMA()
Normalized root mean squared error is 0.5437949184260549.
We can again look at the correlation: the cloud is both narrower, so there is less variation, and better aligned with y=x, meaning the reservoir is doing better at prediction.
plt.scatter(Yhat2[1:] - Yhat2[:-1], Y_test2[1:] - Y_test2[:-1])
plt.xlabel("Predicted difference between steps")
plt.ylabel("Actual difference between steps")
plt.plot(np.linspace(-.3, .3), np.linspace(-.3, .3), color='r', label="y=x")
plt.tight_layout()
plt.xlim(-.3, .3)
plt.ylim(-.3, .3)
plt.legend();
The performance gain is noticeable between the two, with the NRMSE decreasing from roughly 0.75 to 0.54. We can visualize this improvement as a shift in the histogram of the squared errors at each point in time, with the power set reservoir outperforming the standard reservoir and generally having its errors shifted left.
plt.hist((Yhat1-Y_test1)**2, bins=np.logspace(-10, .1), alpha=0.5, label='no power set')
plt.hist((Yhat2-Y_test2)**2, bins=np.logspace(-10, .1), alpha=0.5, label='power set')
plt.xscale('log')
plt.legend()
plt.title("Histogram of squared errors per test point");
This shift is small, but noticeable! We've constrained the size of the reservoirs in this notebook to keep the simulations easy, but, in general, the power set adds an extremely large set of new functions that should be able to drastically reduce the error.