New paper: Random Search for Hyper-Parameter Optimization

8 March 2012 - Cambridge MA

A new paper on hyper-parameter optimization by random search has appeared in JMLR (pdf). It connects the idea of “low effective dimension”, discovered in quasi-Monte Carlo (QMC) integration, to the remarkable efficiency of random search for hyper-parameter optimization in some problems. I hope this paper convinces you that unless you really know what you’re doing, you should NOT be using grid-search to optimize hyper-parameters.
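To make “low effective dimension” concrete, here is a minimal sketch in Python. It is not code from the paper: the objective function, the 3x3 grid, and all names are hypothetical choices for illustration. The toy objective depends strongly on x and only weakly on y, so what matters is how many distinct values of x a search strategy tries.

```python
import itertools
import random


def objective(x, y):
    # Hypothetical validation score: x matters a lot, y barely matters.
    return -(x - 0.37) ** 2 + 0.001 * y


def grid_search(points_per_axis):
    # Evenly spaced cell centres on [0, 1] along each axis.
    axis = [(i + 0.5) / points_per_axis for i in range(points_per_axis)]
    return list(itertools.product(axis, axis))


def random_search(n_trials, seed=0):
    # Independent uniform draws on the unit square.
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_trials)]


def distinct_x(trials):
    # Number of different values tried along the important dimension.
    return len({x for x, _ in trials})


def best_score(trials):
    return max(objective(x, y) for x, y in trials)


grid_trials = grid_search(3)       # 9 trials, but only 3 distinct x values
random_trials = random_search(9)   # 9 trials, 9 distinct x values

print("grid:   distinct x =", distinct_x(grid_trials), "best =", best_score(grid_trials))
print("random: distinct x =", distinct_x(random_trials), "best =", best_score(random_trials))
```

With the same budget of nine trials, the grid probes the important dimension at only three points, while random search probes it at nine. Whether random search actually finds a better score on a given run depends on the seed, so treat the printed scores as an illustration rather than a benchmark.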

Citation:
J. Bergstra and Y. Bengio (2012).
Random Search for Hyper-Parameter Optimization.
Journal of Machine Learning Research 13:281–305.

Abstract:
Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven datasets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most datasets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different datasets. This phenomenon makes grid search a poor choice for configuring algorithms for new datasets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success – they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
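A back-of-the-envelope calculation helps show why a modest number of random trials can be enough. The sketch below assumes independent, uniformly drawn trials and asks how likely it is that at least one of them lands in the best 5% of the search space. The function names and the 5% and 95% thresholds are illustrative assumptions, not values taken from the abstract above.

```python
import math


def prob_hit(n_trials, top_fraction=0.05):
    # Probability that at least one of n independent uniform trials
    # falls inside a region covering `top_fraction` of the search space.
    return 1.0 - (1.0 - top_fraction) ** n_trials


def trials_needed(confidence=0.95, top_fraction=0.05):
    # Smallest n such that prob_hit(n, top_fraction) >= confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


for n in (10, 30, 60):
    print(n, "trials ->", round(prob_hit(n), 3))
print("trials needed for 95% confidence:", trials_needed())
```

Under these assumptions the answer does not depend on how many hyper-parameters there are, only on the fraction of the space that counts as good. A grid, by contrast, grows exponentially with the number of dimensions.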