Sloppy (and non-sloppy) eigenvalues from many fields. (a) our systems biology model; (b) quantum Monte Carlo variational wavefunction; (c) radioactive decay for a mixture of twelve common nuclides; (d) fit of 48 exponential decays; (e) random 48×48 matrix (GOE, not sloppy); (f) product of five 48×48 random matrices (not sloppy, but exponentially spread); (g) multiple linear regression fit of a plane to data in 48 dimensions (not sloppy); and (h) a polynomial fit to data (sloppy).
Is sloppiness some kind of emergent property that naturally arises whenever you fit a model with many parameters to data? Why do these very different kinds of systems all share the same weird sloppy features?
In approaching this problem, we were inspired by random matrix theory. The eigenvalues of large matrices arising in many fields (nuclear physics, nanophysics, acoustics, quantum chaos, number theory, quantum gravity, signal processing, ...) all share key universal statistical features. In particular, our eigenvalue plot above looks very similar to plots you would find for random matrices, except that the random matrix plots would have a linear vertical scale, while ours are plotted on a log scale. (So, for example, columns (e) and (g) in our plot are examples of eigenvalue distributions studied in random matrix theory.) The random matrix eigenvalues have many stiff directions and only a few sloppy ones, and they are not evenly spread over many decades: they are clumped mostly in one decade, with a few smaller ones.
So, our sloppy model Hessians are different from random matrices, but they too all share common features. Random matrix theory predicts the properties of particular matrices arising in the real world in terms of a universality class of matrices constructed by a particular rule (for example, filling all the entries with random numbers chosen from a normal distribution, and then symmetrizing). The particular matrix usually isn't constructed this way at all, but the behavior of its eigenvalues and many of its other properties are nicely described by the typical behavior in this class of matrices.
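If you want to see that construction rule in action, here is a minimal numpy sketch (the 48×48 size, seed, and variable names are just illustrative, not taken from the original work) that builds a GOE-style matrix by filling the entries with normal random numbers and symmetrizing, then looks at how widely its eigenvalues are spread:

import numpy as np

rng = np.random.default_rng(0)
N = 48

# Fill every entry with a random number drawn from a normal distribution...
M = rng.normal(size=(N, N))
# ...and then symmetrize: the GOE construction rule described above.
H_goe = (M + M.T) / 2

eigs = np.linalg.eigvalsh(H_goe)
# The eigenvalues cluster in a single band of width of order sqrt(N);
# they do not spread over many decades the way sloppy eigenvalues do.
print(eigs.min(), eigs.max())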
Can we come up with a sloppy universality class of models fit to data? We knew that not all models are sloppy. In particular, sloppiness is characteristic of models whose collective behavior is fit to data. (If you design a special experiment to measure each parameter separately, then your Hessian will be diagonal, with each parameter being an eigendirection with its own non-sloppy eigenvalue.) So, as the definition of our universality class, we assume that every data point yi depends on the parameters θj in a completely symmetric way:
yi(θ1,θ2,θ3) = yi(θ2,θ3,θ1)
(This permutation symmetry does hold for the model of fitting exponentials, but it is not typical of sloppy systems, just as most large matrices aren't actually generated by drawing a Gaussian random number for each matrix entry.) We also wanted the different parameters to have similar effects (so that they can be traded for one another, as in fitting polynomials), so we assumed that all of the parameters are close to a common value θ0:
θj = θ0 + εj
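To make this concrete, here is a hedged numpy sketch of one simple member of this class, a version of the exponential-fitting model mentioned above: yi = Σj exp(-θj ti), which treats the parameters symmetrically, with all the θj close to a common θ0. (The sample times, parameter values, and seed are purely illustrative, and the Hessian below is the standard Gauss-Newton JᵀJ approximation to the least-squares Hessian.) Its eigenvalues already show the sloppy pattern, spread over many decades:

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.1, 5.0, 30)        # sample times t_i for the data points
theta0, N = 1.0, 6                   # common value and number of parameters
eps = 0.1 * rng.normal(size=N)       # small offsets: theta_j = theta0 + eps_j
theta = theta0 + eps

# Model y_i(theta) = sum_j exp(-theta_j * t_i), symmetric under permuting the theta_j.
# Jacobian of the model: J_ij = d y_i / d theta_j = -t_i * exp(-theta_j * t_i)
J = -t[:, None] * np.exp(-np.outer(t, theta))

# Gauss-Newton approximation to the least-squares Hessian
H = J.T @ J
print(np.linalg.eigvalsh(H))         # eigenvalues span many decades: a sloppy spectrum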
With these assumptions, we could compute the Hessian describing the sensitivity of the model behavior to changes in the parameters. It factored into a product of four matrices:
H = Vᵀ Aᵀ A V
(where Mᵀ is the transpose of a matrix M, exchanging rows and columns). Here the matrix A depends on the model, but its properties are not very important. The matrix V, though, is the famous Vandermonde matrix:
V = [ 1          1          ...   1
      ε1         ε2         ...   εN
      ε1^2       ε2^2       ...   εN^2
      ...        ...              ...
      ε1^(d-1)   ε2^(d-1)   ...   εN^(d-1) ]

with one column for each of the N parameters and one row for each power 0, 1, ..., d-1 (d rows in all).
The Vandermonde matrix is famous, I think, because you can guess its determinant (for the square case N=d). That means that everyone learns about it in math class when they learn about determinants. We are taught that the determinant is a big sum of terms, each ± a product of matrix elements, with one element chosen from each row and each column. So our determinant will be a polynomial in the epsilons. We are also taught that the determinant is zero if any two columns are equal. So, the polynomial must have a factor (εi - εj) for every pair i > j. Fiddling around, you can show that the determinant is precisely the product of these factors:
det(V) = ∏i>j (εi - εj) ∝ ε^(N(N-1)/2)
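You can check that product formula numerically. Here is a small sketch (the size and seed are arbitrary) comparing numpy's determinant of a square Vandermonde matrix against the product of the pairwise differences:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
N = 5
eps = rng.normal(size=N)

# Square Vandermonde matrix: column j holds the powers 1, eps_j, eps_j^2, ..., eps_j^(N-1)
V = np.vander(eps, N, increasing=True).T

det_direct = np.linalg.det(V)
# Product over all pairs i > j of (eps_i - eps_j)
det_formula = np.prod([eps[i] - eps[j] for j, i in combinations(range(N), 2)])

print(det_direct, det_formula)       # the two agree up to rounding error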
Now, if the εs are small, the determinant of V is tiny. Since the determinant of a product of matrices is the product of the determinants, and since the determinant of A is boring, we find that the determinant of our Hessian (which contains V twice, so det(V) enters squared),
det(H) ∝ ε^(N(N-1)),
is very small if ε is small (so our parameters are all close to one another).
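You can also check this scaling numerically: shrinking all of the εj by a common factor s should shrink det(H) by s^(N(N-1)), whatever the model-dependent matrix A happens to be. A hedged sketch, with an arbitrary random A and a square Vandermonde V chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(3)
N = 4
eps = rng.normal(size=N)
A = rng.normal(size=(N, N))          # an arbitrary, "boring" model-dependent matrix

def hessian(eps, A):
    # H = V^T A^T A V, with V the (square) Vandermonde matrix built from the eps
    V = np.vander(eps, len(eps), increasing=True).T
    return V.T @ A.T @ A @ V

s = 0.1
ratio = np.linalg.det(hessian(s * eps, A)) / np.linalg.det(hessian(eps, A))
print(ratio, s ** (N * (N - 1)))     # both come out ~1e-12 for N = 4, s = 0.1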
This tells us that our model is sloppy. First, remember from when we looked at polynomial fits that the determinant of the Hessian told us how skewed the transformation was from the original parameters to the parameter combinations governing model behavior? The more skewed, the sloppier. Second, the determinant is the product of all of the eigenvalues, so a tiny determinant means that there must be some tiny eigenvalues (sloppy eigendirections in parameter space).
We also argued that the eigenvalue distribution will have a very strong level-repulsion, leading to eigenvalues that are equally spaced on our logarithmic vertical axis when ε gets small. (Our argument relied upon a mathematical conjecture, which was later proven by Ari Turner, then a grad student at Harvard, and also by Bryan Chen, then a grad student at U Penn.)
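That equal spacing on a log scale is also easy to see numerically in the same toy setup (again with an arbitrary random A and a square Vandermonde V, purely for illustration): when the εj are small, successive eigenvalue ratios of H = Vᵀ Aᵀ A V all come out large and of broadly similar size.

import numpy as np

rng = np.random.default_rng(4)
N = 6
eps = 0.15 * rng.normal(size=N)      # parameters clustered close together
A = rng.normal(size=(N, N))          # arbitrary model-dependent matrix

V = np.vander(eps, N, increasing=True).T
H = V.T @ A.T @ A @ V

eigs = np.sort(np.linalg.eigvalsh(H))[::-1]
print(eigs)                          # eigenvalues spread over many decades
print(eigs[:-1] / eigs[1:])          # successive ratios all large and broadly similar,
# i.e. roughly equal spacing on a log scale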
So, the key weird features of the eigenvalue spectra of sloppy models can be explained mathematically, at least for this family of models fit to data.
James P. Sethna, sethna@lassp.cornell.edu; This work was supported by the Division of Materials Research of the U.S. National Science Foundation, through grant DMR-070167.