Xu, Ning and Hong, Jian and Fisher, Timothy (2016): Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression.
This is the latest version of this item.
PDF: MPRA_paper_73657.pdf (3MB)
Abstract
In this paper, we study the performance of extremum estimators from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. By adapting classical concentration inequalities, we derive upper bounds on the empirical out-of-sample prediction error as a function of the in-sample error, the in-sample data size, the heaviness of the tails of the error distribution, and model complexity. We show that these error bounds may be used to tune key estimation hyper-parameters, such as the number of folds K in cross-validation, and we show how K affects the bias-variance trade-off of cross-validation. We demonstrate that the L2-norm difference between penalized regression estimates and the corresponding unpenalized estimates is directly explained by the GA of the estimates and the GA of the empirical moment conditions. Lastly, we prove that all penalized regression estimates are L2-consistent in both the n > p and the n < p cases. Simulations are used to demonstrate the key results.
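As background for the hyper-parameter tuning the abstract mentions, the sketch below shows plain K-fold cross-validation used to select the penalty level of a ridge (L2-penalized) regression on synthetic data. This is only an illustration of the standard procedure, not the paper's bound-based tuning rule; all names and parameter values here are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    """Average out-of-fold squared prediction error for penalty lam."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errs))

# Synthetic sparse design: 3 active coefficients out of 10.
rng = np.random.default_rng(42)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ beta_true + rng.standard_normal(n)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = {lam: kfold_cv_error(X, y, lam) for lam in lams}
best = min(cv, key=cv.get)
print("CV errors:", {k: round(v, 3) for k, v in cv.items()})
print("selected lambda:", best)
```

Larger K lowers the bias of the error estimate (each training fold is closer to the full sample) at the cost of higher variance and more computation — the trade-off the paper analyzes via its generalization-error bounds.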
Item Type: | MPRA Paper |
---|---|
Original Title: | Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression |
Language: | English |
Keywords: | generalization ability, upper bound of generalization error, penalized regression, cross-validation, bias-variance trade-off, L2 difference between penalized and unpenalized regression, lasso, high-dimensional data |
Subjects: | C - Mathematical and Quantitative Methods > C1 - Econometric and Statistical Methods and Methodology: General > C13 - Estimation: General; C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C52 - Model Evaluation, Validation, and Selection; C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C55 - Large Data Sets: Modeling and Analysis |
Item ID: | 73657 |
Depositing User: | Mr Ning Xu |
Date Deposited: | 14 Sep 2016 06:00 |
Last Modified: | 27 Sep 2019 07:13 |
References: | Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Petrov, B. N., Csaki, F. (Eds.), 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR. Budapest: Akadémiai Kiadó, pp. 267–281. Amemiya, T., 1985. Advanced econometrics. Harvard University Press. Belloni, A., Chernozhukov, V., Fernández-Val, I., Hansen, C., 2013. Program evaluation with high-dimensional data. arXiv preprint arXiv:1311.2645. Bickel, P. J., Ritov, Y., Tsybakov, A. B., 2009. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37, 1705–1732. Blundell, R., Dias, M. C., Meghir, C., Reenen, J., 2004. Evaluating the employment impact of a mandatory job search program. Journal of the European Economic Association 2 (4), 569–606. Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics 37 (4), 373–384. Candes, E. J., Tao, T., 2007. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 2313–2351. Caner, M., 2009. Lasso-type GMM estimator. Econometric Theory 25 (1), 270–290. Cesa-Bianchi, N., Conconi, A., Gentile, C., 2004. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory 50 (9), 2050–2057. Chickering, D. M., Heckerman, D., Meek, C., 2004. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research 5, 1287–1330. Dolton, P., 2006. The econometric evaluation of the new deal for lone parents. Ph.D. thesis, Department of Economics, University of Michigan. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. The Annals of Statistics 32 (2), 407–499. Frank, I. E., Friedman, J. H., 1993. A statistical view of some chemometrics regression tools. Technometrics 35 (2), 109–135. Friedman, J., Hastie, T., Tibshirani, R., 2010. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736.
Friedman, N., Geiger, D., Goldszmidt, M., 1997. Bayesian network classifiers. Machine Learning 29 (2-3), 131–163. Friedman, N., Linial, M., Nachman, I., Pe’er, D., 2000. Using Bayesian networks to analyze expression data. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology. RECOMB ’00. ACM, New York, NY, USA, pp. 127–135. Fu, W. J., 1998. Penalized regressions: the bridge versus the Lasso. Journal of Computational and Graphical Statistics 7 (3), 397–416. Gechter, M., 2015. Generalizing the results from social experiments: Theory and evidence from Mexico and India. Manuscript, Pennsylvania State University. Hall, P., Marron, J. S., 1991. Local minima in cross-validation functions. Journal of the Royal Statistical Society, Series B (Methodological), 245–252. Hall, P., Racine, J., Li, Q., 2011. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association. Heckerman, D., Geiger, D., Chickering, D. M., 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20 (3), 197–243. Hoeffding, W., 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (301), 13–30. Hoerl, A. E., Kennard, R. W., 1970. Ridge regression: Applications to nonorthogonal problems. Technometrics 12 (1), 69–82. Hu, T., Zhou, D.-X., 2009. Online learning with samples drawn from non-identical distributions. Journal of Machine Learning Research 10 (Dec), 2873–2898. Huang, J., Horowitz, J. L., Ma, S., 2008. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 587–613. Kakade, S. M., Tewari, A., 2009. On the generalization ability of online strongly convex programming algorithms, 801–808. Knight, K., Fu, W., 2000. Asymptotics for Lasso-type estimators. The Annals of Statistics, 1356–1378. Kohavi, R., et al., 1995.
A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI. Vol. 14. pp. 1137–1145. McDonald, D. J., Shalizi, C. R., Schervish, M., 2011. Generalization error bounds for stationary autoregressive models. arXiv preprint arXiv:1103.0942. Meinshausen, N., Bühlmann, P., 2006. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 1436–1462. Meinshausen, N., Yu, B., 2009. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 246–270. Michalski, A., Yashin, A., 1986. Structural minimization of risk on estimation of heterogeneity distributions. URL http://pure.iiasa.ac.at/2785/1/WP-86-076.pdf Mohri, M., Rostamizadeh, A., 2009. Rademacher complexity bounds for non-iid processes, 1097–1104. Newey, W. K., McFadden, D., 1994. Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111–2245. Noor, M. A., 2008. Differentiable non-convex functions and general variational inequalities. Applied Mathematics and Computation 199 (2), 623–630. Pearl, J., 2015. Detecting latent heterogeneity. Sociological Methods & Research, 0049124115600597. Schwarz, G. E., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464. Shao, J., 1997. Asymptotic theory for model selection. Statistica Sinica 7, 221–242. Skrondal, A., Rabe-Hesketh, S., 2004. Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. CRC Press. Smale, S., Zhou, D.-X., 2009. Online learning with Markov sampling. Analysis and Applications 7 (01), 87–113. Stock, J. H., Watson, M. W., 2012. Generalized shrinkage methods for forecasting using many predictors. Journal of Business & Economic Statistics 30 (4), 481–493. Stone, M., 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological) 36 (2), 111–147. Stone, M., 1977.
An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1), 44–47. Strongin, R. G., Sergeyev, Y. D., 2013. Global optimization with non-convex constraints: Sequential and parallel algorithms. Vol. 45. Springer Science & Business Media. Tang, Y., 2007. A Hoeffding-type inequality for ergodic time series. Journal of Theoretical Probability 20 (2), 167–176. Tibshirani, R., 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58, 267–288. Vapnik, V. N., Chervonenkis, A. Y., 1971a. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (2), 264–280. Vapnik, V. N., Chervonenkis, A. Y., 1971b. Theory of uniform convergence of frequencies of appearance of attributes to their probabilities and problems of defining optimal solution by empiric data. Avtomatika i Telemekhanika (2), 42–53. Vapnik, V. N., Chervonenkis, A. Y., 1974a. The method of ordered risk minimization, I. Avtomatika i Telemekhanika (8), 21–30. Vapnik, V. N., Chervonenkis, A. Y., 1974b. On the method of ordered risk minimization, II. Avtomatika i Telemekhanika (9), 29–39. Wang, L.-W., Feng, J.-F., 2005. Learning Gaussian mixture models by structural risk minimization. In: 2005 International Conference on Machine Learning and Cybernetics. Vol. 8. IEEE, pp. 4858–4863. Wellner, J. A., 1981. A Glivenko-Cantelli theorem for empirical measures of independent but non-identically distributed random variables. Stochastic Processes and their Applications 11 (3), 309–312. Xu, N., Hong, J., Fisher, T. C., 2016. Clustered model selection and clustered model averaging: Simultaneous heterogeneity control and modeling. School of Economics, The University of Sydney. Yan, L., Ma, D., 2001.
Global optimization of non-convex nonlinear programs using line-up competition algorithm. Computers & Chemical Engineering 25 (11), 1601–1610. Yu, B., 1994. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116. Yu, C.-N. J., Joachims, T., 2009. Learning structural SVMs with latent variables. In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM, pp. 1169–1176. Yu, H., 1993. A Glivenko-Cantelli lemma and weak convergence for empirical processes of associated sequences. Probability Theory and Related Fields 95 (3), 357–370. Zhang, C.-H., 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38, 894–942. Zhang, C.-H., Huang, J., 2008. The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics 36, 1567–1594. Zhang, T., 2009. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research 10, 555–568. Zhao, P., Yu, B., 2006. On model selection consistency of Lasso. The Journal of Machine Learning Research 7, 2541–2563. Zou, H., 2006. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101 (476), 1418–1429. |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/73657 |
Available Versions of this Item
- Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression. (deposited 12 Sep 2016 11:14)
- Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression. (deposited 14 Sep 2016 06:00) [Currently Displayed]