
Why sloppiness? The sloppy universality class.

[Figure: Sloppy (and non-sloppy) eigenvalues from many fields, spanning a range of ten million. (a) our systems biology model; (b) quantum Monte Carlo variational wavefunction; (c) radioactive decay for a mixture of twelve common nuclides; (d) fit of 48 exponential decays; (e) random 48x48 matrix (GOE, not sloppy); (f) product of five 48x48 random matrices (not sloppy, but exponentially spread); (g) multiple linear regression fit of a plane to data in 48 dimensions (not sloppy); and (h) a polynomial fit to data (sloppy).]

Is sloppiness some kind of emergent property that naturally arises when you fit a model with lots of parameters to data? Why do these different kinds of systems all share the same weird sloppy features?

In approaching this problem, we were inspired by random matrix theory. The eigenvalues of large matrices arising in many fields (nuclear physics, nanophysics, acoustics, quantum chaos, number theory, quantum gravity, signal processing, ...) all share key universal statistical features. In particular, our eigenvalue plot above looks very similar to plots you would find from random matrices -- except that the random matrix plots would have a linear vertical scale, where ours is on a log scale. So, for example, columns (e) and (g) of our plot are examples of eigenvalue distributions studied in random matrix theory. These random matrix eigenvalues give many stiff directions and only a few sloppy ones, and they are not evenly distributed over many decades: they are clumped mostly in one decade, with a few smaller ones.
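
To see this contrast concretely, here is a minimal numerical sketch (my own illustration, not part of the original analysis): build a GOE-style random symmetric matrix as in column (e) and count how many decades the magnitudes of its eigenvalues span. The matrix size and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 48

# GOE-style construction: fill a 48x48 matrix with Gaussian random numbers,
# then symmetrize (the recipe behind column (e) in the figure above).
M = rng.normal(size=(N, N))
M = (M + M.T) / 2.0

# Magnitudes of the eigenvalues, sorted, on a log scale.
log_eig = np.log10(np.sort(np.abs(np.linalg.eigvalsh(M))))

# Most of the eigenvalues clump within roughly a decade of the largest; only
# a few accidentally-near-zero ones fall further below.  Compare the seven
# decades spanned by the sloppy columns of the figure.
print("largest / smallest, in decades  :", round(log_eig[-1] - log_eig[0], 2))
print("largest / 10th-smallest, decades:", round(log_eig[-1] - log_eig[9], 2))
```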

So, our sloppy model Hessians are different from random matrices, but they also all share common features. Random matrix theory predicts the properties of particular matrices arising in the real world in terms of a universality class of matrices constructed by a particular rule (for example, filling all entries with random numbers chosen from a normal distribution, and then symmetrizing). The particular matrix usually isn't constructed in this way at all, but the behavior of its eigenvalues and many other properties are nicely described by the typical behavior in this class of matrices.

Can we come up with a sloppy universality class of models fit to data? We knew that not all models are sloppy. In particular, sloppiness is characteristic of models whose collective behavior is fit to data. (If for each parameter you design a special experiment to measure it, then your Hessian will be diagonal, with each parameter being an eigendirection with its own non-sloppy eigenvalue.) So, we took as the definition of our universality class: we assume every data point y_i depends on the parameters θ_j in a completely symmetric way, unchanged under permuting them:

y_i(θ_1, θ_2, θ_3) = y_i(θ_2, θ_3, θ_1)

(This property does hold for the model of fitting exponentials, but is not typical of sloppy systems - just like most large matrices aren't generated by Gaussian random variables for each matrix entry.) We also wanted the different parameters to have similar effects (so they can be traded for one another, as in fitting polynomials), so we assumed that all of the parameters are close to one another:

θ_j = θ_0 + ε_j
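
To make these two assumptions concrete before the derivation, here is a toy numerical sketch (my own construction, not from the original work): a sum of exponentials y_i(θ) = Σ_j exp(-θ_j t_i), which is unchanged when the θ_j are permuted, with all θ_j near a common θ_0. Even this innocuous-looking model has a least-squares Hessian whose eigenvalues spread over many decades.

```python
import numpy as np

t = np.linspace(0.5, 5.0, 30)          # "data point" times t_i (arbitrary choice)
theta0 = 1.0                           # common parameter value theta_0
eps = np.linspace(-0.25, 0.25, 6)      # small offsets epsilon_j (arbitrary choice)
theta = theta0 + eps                   # theta_j = theta_0 + epsilon_j

# Permutation-symmetric toy model: y_i(theta) = sum_j exp(-theta_j * t_i).
# Jacobian J_ij = dy_i/dtheta_j = -t_i * exp(-theta_j * t_i).
J = -t[:, None] * np.exp(-np.outer(t, theta))

# Eigenvalues of the least-squares Hessian H = J^T J, computed as squared
# singular values of J (more accurate than eigvalsh for tiny eigenvalues).
eig = np.linalg.svd(J, compute_uv=False) ** 2
print("log10 eigenvalues:", np.round(np.log10(eig), 1))
print("decades spanned  :", round(np.log10(eig[0] / eig[-1]), 1))
```

(The precise number of decades depends on the arbitrary choices of t_i and ε_j above; the point is only that it is large.)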

With these assumptions, we could compute the Hessian describing the sensitivity of model behavior to changes in the parameters. It factored into a product of four matrices

H = V^T A^T A V

(where M^T is the transpose of a matrix M, exchanging rows and columns). Here the matrix A depends on the model, but its properties are not very important. The matrix V, though, is the famous Vandermonde matrix:

V =
    | 1      1      1      ...  1          1        |
    | ε_1    ε_2    ε_3    ...  ε_(N-1)    ε_N      |
    | ε_1^2  ε_2^2  ε_3^2  ...  ε_(N-1)^2  ε_N^2    |
    | ε_1^3  ε_2^3  ε_3^3  ...  ε_(N-1)^3  ε_N^3    |
    | ...    ...    ...    ...  ...        ...      |
    | ε_1^d  ε_2^d  ε_3^d  ...  ε_(N-1)^d  ε_N^d    |

with each row a successively higher power of the small variables ε_j. (Here N is the number of parameters, and d is the number of data points.)
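
For the toy sum-of-exponentials model sketched above, this factorization can be checked numerically: Taylor-expanding the exact Jacobian in the ε_j gives J ≈ A V, with A built from derivatives of exp(-θ t_i) at θ_0 and V the matrix of powers of ε_j. A sketch, with the Taylor series truncated at an arbitrary order K:

```python
import numpy as np
from math import factorial

t = np.linspace(0.5, 5.0, 30)          # data-point times (same toy model as above)
theta0 = 1.0
eps = np.linspace(-0.25, 0.25, 6)      # small parameter offsets epsilon_j
K = 12                                 # Taylor-series truncation order (arbitrary)

# Exact Jacobian of y_i = sum_j exp(-theta_j t_i), with theta_j = theta0 + eps_j.
J = -t[:, None] * np.exp(-np.outer(t, theta0 + eps))

# Taylor expanding in eps_j gives J ~ A V, where
#   A_ik = (-t_i)^(k+1) * exp(-theta0 * t_i) / k!   (model-dependent, "boring")
#   V_kj = eps_j^k                                  (the Vandermonde-style matrix)
A = np.array([[(-ti) ** (k + 1) * np.exp(-theta0 * ti) / factorial(k)
               for k in range(K)] for ti in t])
V = eps[None, :] ** np.arange(K)[:, None]

# The factorization of the Hessian, H = J^T J ~ V^T A^T A V, follows directly.
print("max |J - A V| =", np.abs(J - A @ V).max())
```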

The Vandermonde matrix is famous, I think, because you can guess its determinant (for the square case N=d). That means that everyone learns about it in math class when they learn about determinants. We are taught that the determinant is a big sum of terms, each ± a product of matrix elements, with one element chosen from each row and each column. So our determinant will be a polynomial in the epsilons. We are also taught that the determinant will be zero if any two columns are equal -- here, if ε_i = ε_j for some pair. So the polynomial has a factor (ε_i − ε_j) for every pair i > j. Fiddling around, you can show that the determinant is precisely the product of these factors:

det(V) = ∏_(i>j) (ε_i − ε_j) ∝ ε^(N(N-1)/2)
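
This identity is easy to check numerically (a quick sketch; the particular ε values below are arbitrary):

```python
import numpy as np
from itertools import combinations

eps = np.array([0.01, 0.03, 0.04, 0.07, 0.11])   # arbitrary small epsilons
N = len(eps)

# Square (N = d) Vandermonde matrix: row k holds the k-th powers of the epsilons.
V = eps[None, :] ** np.arange(N)[:, None]

det_direct = np.linalg.det(V)
det_product = np.prod([eps[i] - eps[j] for j, i in combinations(range(N), 2)])

print("det(V) directly       :", det_direct)
print("product of (e_i - e_j):", det_product)
```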

Now, if the ε's are small, the determinant of V is tiny. Since the determinant of the product of four matrices is the product of the determinants, and since the determinant of A is boring, we find that the determinant of our Hessian

det(H) ∝ ε^(N(N-1)),

is very small if ε is small (so our parameters are all close to one another).
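
One can check this scaling with the same toy sum-of-exponentials model (a sketch; the factor of ten, the number of parameters, and the offsets below are arbitrary choices): shrinking all the ε_j by a factor of 10 should shrink det(H) by roughly 10^(N(N-1)).

```python
import numpy as np

t = np.linspace(0.5, 5.0, 30)
theta0, N = 1.0, 4
base_eps = np.linspace(-1.0, 1.0, N)   # fixed "shape" of the offsets

def det_H(scale):
    """det of the least-squares Hessian J^T J when the offsets are scale * base_eps,
    computed from singular values of J for numerical robustness."""
    theta = theta0 + scale * base_eps
    J = -t[:, None] * np.exp(-np.outer(t, theta))
    return np.prod(np.linalg.svd(J, compute_uv=False) ** 2)

# Shrinking every epsilon by 10 should shrink det(H) by about 10^(N(N-1)) = 10^12.
ratio = det_H(0.1) / det_H(0.01)
print("log10 of the ratio:", round(np.log10(ratio), 2), " (expect roughly", N * (N - 1), ")")
```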

This tells us that our model is sloppy. First, remember when we looked at polynomial fits that the determinant of the Hessian told us how skewed the transformation was from the original parameters to the parameter combinations governing model behavior? The more skewed, the sloppier. Second, the determinant is the product of all of the eigenvalues -- so a tiny determinant means that there must be some tiny eigenvalues (sloppy eigendirections in parameter space).

We also argued that the eigenvalue distribution will have a very strong level-repulsion, leading to eigenvalues that are equally spaced on our logarithmic vertical axis when ε gets small. (Our argument relied upon a mathematical conjecture, which was later proven by Ari Turner, then a grad student at Harvard, and also by Bryan Chen, then a grad student at U Penn.)
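
Numerically, that looks like the following (again a toy sketch: the ε values are arbitrary, and I drop the model-dependent matrix A, i.e. take H = V^T V): the gaps between successive log10 eigenvalues are already comparable, and each grows by roughly the same amount as the ε's shrink, so the spacing approaches uniformity on the log axis.

```python
import numpy as np

def log_eigs(eps, d=12):
    """log10 eigenvalues of V^T V for the Vandermonde-style matrix V_kj = eps_j^k."""
    V = eps[None, :] ** np.arange(d)[:, None]
    return np.log10(np.linalg.svd(V, compute_uv=False) ** 2)

# Small, nearly equal parameter offsets (arbitrary values); the matrix A is dropped.
eps = np.array([0.02, 0.06, 0.10, 0.14, 0.18, 0.22])

# Gaps between successive log10 eigenvalues: already comparable in size, and
# each grows by roughly the same 2*log10(3) ~ 0.95 when the epsilons shrink
# threefold -- uniform log-spacing emerges as the epsilons go to zero.
print("gaps for eps  :", np.round(-np.diff(log_eigs(eps)), 2))
print("gaps for eps/3:", np.round(-np.diff(log_eigs(eps / 3)), 2))
```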

So, the key weird features of the eigenvalue spectra of sloppy models can be explained mathematically, at least for this family of models fit to data.
