Part 3: Analysis
In this post we will analyze the data generated in the previous posts. In a perfect zero-noise setting, we might expect a power law decay [1, 2] in the NRMSE with the number of signals. When the effect of noise is factored in, the behavior of the system changes, reflecting the degraded effectiveness of each signal due to the noise. We will empirically observe the form of this degradation, as well as offer a simple analytic model.
[1] - Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).
[2] - Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).
A simple model is to assume that each sample is corrupted with noise with some probability $p$. This degradation can then be quantified by counting combinations of signals that remain unaffected by noise. Using simple combinatorial logic, there are ${n \choose k}$ product signals we can form by multiplying together subsets of $k$ of the $n$ signals, and the probability that such a product remains noise-free is $(1-p)^k$. That is, the expected number of uncorrupted signals $s(n)$ is given roughly as:
$s(n) = \sum_{k=1}^n {n\choose k} (1-p)^k$
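As a quick sanity check, here is a small snippet that evaluates $s(n)$ directly; the particular values of $n$ and $p$ below are arbitrary and chosen only for illustration.
from math import comb

def expected_uncorrupted_signals(n, p):
    """s(n) = sum_{k=1}^{n} (n choose k) (1 - p)^k."""
    return sum(comb(n, k) * (1 - p)**k for k in range(1, n + 1))

# Arbitrary illustrative values of the noise probability p and number of signals n.
for p in (0.1, 0.5, 0.9):
    print(f"p = {p}:", [round(expected_uncorrupted_signals(n, p), 1) for n in (5, 10, 20)])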
When the summands in $s(n)$ above are sharply peaked around a single value of $k$, the expected number of uncorrupted signals will consequently scale polynomially in $n$, since ${n \choose k} = \frac{n(n-1)\cdots(n-k+1)}{k!}$ is a degree-$k$ polynomial in $n$. Of course, if the peak $k$ is proportional to $n$, then the sum is in fact exponential; whether this happens depends on how large $p$ is, and we discuss this below. Intuitively, there is a trade-off between the growth of the binomial coefficient and the decay due to the exponentiated probability, which produces a peak in the summands. Below we plot the values of the terms ${n \choose k}(1-p)^k$ for each noise level, normalized so that the curves are comparable.
import numpy as np
from math import comb
import matplotlib.pyplot as plt
import scipy.linalg

num_points = 10   # number of noise strengths to sweep over
num_terms = 100   # n, the total number of signals
p_values = np.linspace(.05, 1, num_points)
# Terms (n choose k)(1-p)^k for k = 1..n, one column per noise strength p.
vals = np.array([[comb(num_terms, i + 1) * (1 - p)**(i + 1) for i in range(num_terms)] for p in p_values]).T
# Normalize each column so the curves are comparable (small constant added for numerical stability).
vals /= (scipy.linalg.norm(vals, axis=0) + .01)
# Peak index (the dominant degree k) for each noise strength.
np.argmax(vals, axis=0)
plt.plot(vals, label=np.round(p_values, 2))
plt.xlabel('polynomial power k')
plt.ylabel('normalized value, proportional to (n choose k)(1-p)^k')
plt.title('Counting the expected number of signals associated with binomial coefficients')
plt.legend(title='noise strength')
plt.savefig('fig1')
from PIL import Image
img = Image.open('fig1.png');
img.convert("RGB")
We can then take the index of the maximum term and plot the relationship between that peak index, i.e. the degree $k$ with the largest expected number of uncorrupted signals, and the noise strength. As the noise increases, the distribution shifts to the left, and at $p=1$ all signals are corrupted. As $p$ approaches $0$, the number of uncorrupted signals grows exponentially, because the peak sits near $k = n/2$ and ${n \choose n/2}$ is approximately exponential in $n$. Hence the scaling is not monomial for small probabilities, and we will ignore that regime for the rest of this notebook (the sweeps start at $p=.05$).
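For reference, a standard Stirling-type estimate for the central binomial coefficient (taking $n$ even) makes the exponential growth explicit:
${n \choose n/2} \approx 2^n \sqrt{\frac{2}{\pi n}}$
so when essentially every subset survives the noise, the middle terms alone already contribute exponentially many signals.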
plt.plot(p_values, np.argmax(vals, axis=0))
plt.title("peak index of binomial coefficients vs. noise")
plt.xlabel("noise strength")
plt.ylabel("index");
We see that as the noise strength increases, the terms become peaked around a lower-degree polynomial. That is, noise effectively degrades the combinatorial sum to a low-degree monomial.
In this notebook, we have thus far assumed that with some probability $p$ the signal is completely corrupted, whereas in the code in the previous post we used a probability $q$ of an error on a single output value. One simple change that might make this model more realistic is to consider the functional relationship between $p$ and $q$. If the reservoir has a memory of length $T$, that is, it can only remember the past $T$ inputs, and these errors are independent, we might expect a relationship like $q^T = p$. Hence, to compare with our plots later on, which are a function of the per-datum error rate, we might instead plot against $p^{1/T}$. Below we plot this for $T=10$, i.e. a scaling of $p^{1/10}$.
vals = np.array([[comb(num_terms, i + 1) * (1 - p**.1)**(i + 1) for i in range(num_terms)] for p in p_values]).T
vals /= (scipy.linalg.norm(vals, axis=0)+.01) #adding small number just for numerical stability
np.argmax(vals, axis=0)
plt.plot(p_values, np.argmax(vals, axis=0))
plt.title("peak index of binomial coefficients vs. noise")
plt.xlabel("noise strength")
plt.ylabel("index");
Now for the fun part: does this agree with our data? We consider a simple power-law model for the performance, $\mathrm{NRMSE}(n) = \frac{a}{n^b} + c$, where $n$ is the reservoir size, $a$ is a constant set by the performance at some particular reservoir size (e.g. $n=1$), and $b$ is the decay exponent we are interested in. We fix the asymptote $c$ by hand; it could in principle be fit as well, but for consistency and ease we set it by eye. This asymptote is not the focus of this tutorial, and so will not be studied in detail, but it may be attributed to the span of the reservoir's natural functions, given by the neural activations and their product signals, not fully containing the desired NARMA10 function. This could be the case if, for instance, the NARMA10 series varied much more rapidly than a typical neural activation, or required more memory than the reservoir can provide.
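To build intuition before fitting, the cell below simply plots this model for a few decay exponents $b$; the values of $a$, $b$, and $c$ here are arbitrary and only meant to show the shape of the curves ($c=0.8$ matches the asymptote we set by eye below).
import numpy as np
import matplotlib.pyplot as plt

n = np.linspace(1, 7, 200)
a, c = 0.2, 0.8               # illustrative amplitude and asymptote
for b in (0.5, 1.0, 2.0):     # illustrative decay exponents
    plt.plot(n, a / n**b + c, label=f'b = {b}')
plt.xlabel('Reservoir Size')
plt.ylabel('NRMSE (model)')
plt.title('Shape of a/n^b + c for a few decay exponents')
plt.legend()
plt.show()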
Now we are going to load the data from the simulations in the last post, and see how well it agrees with this simple analysis.
# Necessary imports
import pickle
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import factorial
from math import comb
# Load the dataset
file_name = 'webpost.pkl'
with open(file_name, 'rb') as f:
    data = pickle.load(f)
# To start, we organize the data we've gathered, extracting minimum values associated with each noise.
min_values_by_noise = {}
for (noise, _), pairs in data.items():
    if noise not in min_values_by_noise:
        min_values_by_noise[noise] = {}
    for x, y in pairs:
        min_values_by_noise[noise][x] = min(y, min_values_by_noise[noise].get(x, y))
# Reduce the data for plotting
reduced_data = {}
for noise, x_y_pairs in min_values_by_noise.items():
    x_values_sorted = sorted(x_y_pairs.keys())
    y_values_sorted = [x_y_pairs[x] for x in x_values_sorted]
    reduced_data[noise] = y_values_sorted
    plt.plot(x_values_sorted, y_values_sorted, marker='x', color='r')
# Plot the theoretical curves and compare them to the numerical data
alpha = 0.2
x = np.linspace(0, 7, 100)
# Generate and plot theoretical lines
asymptote = .8
num = 1000
for i in range(4, num, int(num/20)):
    plt.plot(x, (1-asymptote)/(x/1000 + 1)**(i/2) + asymptote, alpha=alpha, color='k')
# Final plot settings
plt.title('Loss vs. Reservoir Size')
plt.xlabel('Reservoir Size')
plt.ylabel('Loss')
plt.plot([], [], marker='x', color='r', label='Numerics')
plt.plot([], [], color='k', label='Theory')
plt.legend()
plt.savefig('fig2')
plt.clf()
# Now, we fit the data to a model and find actual parameters.
def model(x, a, b):
    return a/x**b + asymptote
fits = []
# Perform the fit for each dataset
for key in reduced_data:
    x = np.array(sorted(min_values_by_noise[key].keys()))
    y = np.array(reduced_data[key])
    popt, pcov = curve_fit(model, x, y, maxfev=1000000, p0=(1, 1))
    fits.append((key, popt))
    x_fine = np.linspace(1, 7, 400)
    # Plot the fitted function
    plt.scatter(x, y)
    plt.plot(x_fine, model(x_fine, *popt), 'r-')
plt.xlabel('Reservoir Size')
plt.ylabel('Loss')
plt.title('Loss vs. Reservoir Size, Fits')
plt.savefig('fig3')
plt.clf();
from PIL import Image
img = Image.open('fig2.png');
img.convert("RGB")
Note that the fit becomes worse at the bottom. These are the curves where the approximation breaks down, and where the functional form should become approximately exponential. The rest of the lines look similar enough for the purpose of this tutorial, and we have omitted serious statistical analysis - error bars would be helpful! We can perform numerical fits to this model, and plot the fit parameters against the model we described above for the polynomial powers.
img = Image.open('fig3.png');
img.convert("RGB")
It's not bad! We again note the divergence from the fit at the bottom, which also appears to be systematic for large reservoir sizes. Since we have made a few decisions by hand, like the asymptote, and not computed error bars, I would take these defects with a grain of salt.
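Since we skipped error bars, here is a minimal sketch of how one might estimate them from the repeated runs, assuming the second component of each key in `data` indexes independent runs; the spread shown is the run-to-run standard deviation of all NRMSE values, which is only indicative since the analysis above uses the minimum over runs.
from collections import defaultdict
import numpy as np
import matplotlib.pyplot as plt

# Collect every NRMSE value per (noise, reservoir size) across runs.
samples = defaultdict(list)
for (noise, _), pairs in data.items():
    for size, nrmse in pairs:
        samples[(noise, size)].append(nrmse)
stats = {k: (np.mean(v), np.std(v)) for k, v in samples.items()}

# Example: mean +/- std for one noise level (whichever sorts first).
some_noise = sorted({k[0] for k in stats})[0]
sizes = sorted(k[1] for k in stats if k[0] == some_noise)
means = [stats[(some_noise, s)][0] for s in sizes]
stds = [stats[(some_noise, s)][1] for s in sizes]
plt.errorbar(sizes, means, yerr=stds, fmt='o')
plt.xlabel('Reservoir Size')
plt.ylabel('Loss')
plt.title(f'Run-to-run spread at noise = {some_noise}')
plt.show()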
In the following cell we extract the fit parameters and compare them to our simple model.
# We analyze the decay power versus noise power and plot the results.
# Extract the fitting parameters
xs = [fit[0] for fit in fits]
ys = [fit[1] for fit in fits]
# Sort the fits for plotting
xys = sorted(zip(xs, ys), key=lambda x: x[0])
xs = [el[0] for el in xys]
ys = [el[1] for el in xys]
# Select the range for plotting
c, d = 12, -1
x = xs[c:d]
y = (np.array(ys).T)[1][c:d]
We saw earlier that the fits do not do particularly well for the bottom curves, where we expect the behavior to be exponential and hence not well fit by a single monomial, so we drop those points using the slice indices $c$ and $d$. Now we regenerate the analytic model from above.
# Generate a smooth line for the heuristic argument
x_fine = np.linspace(min(x), max(x), 100)
scale = 100
num = len(x_fine)
p_values = np.linspace(.05, 1, num)
vals = np.array([[comb(scale, i + 1) * (1 - p**.1)**(i + 1) for i in range(scale)] for p in p_values]).T
offset = 3
max_power = offset + np.argmax(vals, axis=0)
A few details:
- As noted before, we have adjusted the sensitivity to $p$ to instead be $p^{1/10}$. As mentioned, the noise model in the simulations wasn't as simple as the one proposed here; we added Gaussian noise to each point in the time series independently, so it isn't surprising that the system is more sensitive to the noise.
- The range of $p$ itself is off by a factor of ten (the noise values we actually simulated were about ten times smaller), which amounts to a rescaling and should depend on the details of the noise model, i.e. Gaussian noise should behave differently than uniform noise.
- We have parameterized the model with an offset of the peak index (fit by hand), which could be attributed to the width of the distributions.
Now we plot the comparison!
# Plot the decay power
plt.scatter(x, y/y[0], label="Data")
# Plot the heuristic line
plt.plot(x_fine, max_power/max_power[0], label="Heuristic Prediction")
plt.legend()
plt.title("Fit Polynomial Decays vs. Heuristic Prediction")
plt.xlabel("noise strength")
plt.ylabel("power law scaling coefficient\n normalized to first point")
plt.savefig('fig4')
plt.clf()
# A small noise rate becomes non-meaningful at some point. With only 1000 points, the ability to distinguish errors diminishes.
# In a zero-noise scenario, behavior tends towards an exponential decay; detecting deviations from this trend requires larger reservoir sizes.
from PIL import Image
img = Image.open('fig4.png');
img.convert("RGB")
And we see rough agreement! This is a simple model, but it highlights how we might expect exponential scaling to arise in the zero-noise limit, and polynomial scaling in the noisy setting. In fact, while this model is simple enough that the exponential reduction in performance can be analytically predicted, understanding the general behavior in the presence of more general noise is more complicated, and is the subject of the theory developed in [3].
[3] - Polloreno, Anthony M. "Limits to Reservoir Learning."
So, in summary: in this blog post we started with a simple learning model given by echo state networks and introduced the standard computational task NARMA10. We then identified a small collection of tunable and interpretable parameters, fixed most of them except for the error rate and the reservoir size, and used them to get a better understanding of how these systems perform. Thanks for reading!