Xu, Ning and Hong, Jian and Fisher, Timothy (2016): Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression.
This is the latest version of this item.

Abstract
In this paper, we study the performance of extremum estimators from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. By adapting the classical concentration inequalities, we derive upper bounds on the empirical out-of-sample prediction errors as a function of the in-sample errors, in-sample data size, heaviness in the tails of the error distribution, and model complexity. We show that the error bounds may be used for tuning key estimation hyperparameters, such as the number of folds K in cross-validation. We also show how K affects the bias-variance trade-off for cross-validation. We demonstrate that the L2-norm difference between penalized and the corresponding unpenalized regression estimates is directly explained by the GA of the estimates and the GA of empirical moment conditions. Lastly, we prove that all penalized regression estimates are L2-consistent for both the n > p and the n < p cases. Simulations are used to demonstrate key results.
Item Type:  MPRA Paper 

Original Title:  Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression 
Language:  English 
Keywords:  generalization ability, upper bound of generalization error, penalized regression, cross-validation, bias-variance trade-off, L2 difference between penalized and unpenalized regression, lasso, high-dimensional data. 
Subjects:  C - Mathematical and Quantitative Methods > C1 - Econometric and Statistical Methods and Methodology: General > C13 - Estimation: General
C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C52 - Model Evaluation, Validation, and Selection
C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C55 - Large Data Sets: Modeling and Analysis
Item ID:  73657 
Depositing User:  Mr Ning Xu 
Date Deposited:  14 Sep 2016 06:00 
Last Modified:  14 Sep 2016 06:01 
URI:  https://mpra.ub.unimuenchen.de/id/eprint/73657 
Available Versions of this Item

Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression. (deposited 12 Sep 2016 11:14)
 Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression. (deposited 14 Sep 2016 06:00) [Currently Displayed]