Munich Personal RePEc Archive

Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression

Xu, Ning and Hong, Jian and Fisher, Timothy (2016): Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression.


Abstract

In this paper, we study the generalization ability (GA)---the ability of a model to predict outcomes in new samples from the same population---of extremum estimators. By adapting classical concentration inequalities, we propose upper bounds on the empirical out-of-sample prediction error of extremum estimators, expressed as a function of the in-sample error, the severity of heavy tails, the in-sample size, and model complexity. The error bounds not only serve to measure GA, but also illustrate the trade-off between in-sample and out-of-sample fit, which is connected to the traditional bias-variance trade-off. Moreover, the bounds reveal that the hyperparameter $K$, the number of folds in $K$-fold cross-validation, causes a bias-variance trade-off in the cross-validation error, which offers a route to hyperparameter optimization in terms of GA. As a direct application of GA analysis, we implement the new upper bounds in penalized regression estimation for both the $n > p$ and $n < p$ cases. We show that the $L_2$-norm difference between penalized and un-penalized regression estimates can be directly explained by the GA of the regression estimates and the GA of the empirical moment conditions. Finally, we show that all the penalized regression estimates are $L_2$ consistent based on the GA analysis.
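Two of the abstract's claims lend themselves to a quick numerical illustration: the $L_2$ gap between penalized and un-penalized estimates, and the dependence of the cross-validation error on the number of folds $K$. The NumPy sketch below is purely illustrative and is not the paper's method: the data-generating process, the ridge penalty, the penalty levels, and the direction of the $K$ effect (the textbook intuition that small $K$ adds bias and large $K$ adds variance) are all assumptions made for the example; the paper's bounds cover general extremum estimators and both $n > p$ and $n < p$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated n > p design (illustrative values, not from the paper)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ beta + rng.standard_normal(n)

def ridge(X, y, alpha):
    """Closed-form ridge estimate; alpha = 0 gives OLS when X'X is invertible."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def kfold_cv_error(X, y, alpha, K, rng):
    """Average squared prediction error over K random folds."""
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, K)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)            # indices outside the held-out fold
        b = ridge(X[train], y[train], alpha)    # fit on K-1 folds
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))  # score on the held-out fold
    return np.mean(errs)

# L2 distance between penalized and un-penalized estimates grows with the penalty
b_ols = ridge(X, y, 0.0)
for alpha in (0.1, 1.0, 10.0):
    gap = np.linalg.norm(ridge(X, y, alpha) - b_ols)
    print(f"alpha={alpha:5.1f}  ||b_ridge - b_ols||_2 = {gap:.4f}")

# CV error as a function of K: each fit uses n(K-1)/K observations, so the
# choice of K changes both the fitted models and the averaging of fold errors
for K in (2, 5, 10, n):
    print(f"K={K:3d}  CV error = {kfold_cv_error(X, y, 1.0, K, rng):.4f}")
```

Running the sketch shows the $L_2$ gap increasing monotonically in the penalty level and the CV error shifting with $K$, which is the trade-off the paper's bounds formalize.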

