Dushyn, Oleksiy and Dushyn, Borys (2024): Извлечение информации из редких событий в регрессионном анализе [Extracting information from rare events in regression analysis].
PDF: MPRA_paper_120235.pdf (429kB)
Abstract
This paper investigates an important practical problem: extracting information from rare events in sparse, high-dimensional data when building a linear regression model. It analyzes the advantages and limitations of the linear regression methods used for high-dimensional problems. The main known methods were selected and tested on a real Tripadvisor.com dataset. The results show the importance of data aggregation based on hierarchical clustering, which allows information to be extracted from rare features by aggregating them according to the clustering tree. A comparative analysis of the main linear regression methods that use clustering aggregation was carried out.
Item Type: | MPRA Paper |
---|---|
Original Title: | Извлечение информации из редких событий в регрессионном анализе |
English Title: | Extracting information from rare events in regression analysis |
Language: | Russian |
Keywords: | rare events, regression analysis, sparse data, high-dimensional data, Lasso, Ridge, ElasticNet, rare methods, text mining, semantic aggregation, hierarchical clustering, vector word representation. |
Subjects: | C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C51 - Model Construction and Estimation C - Mathematical and Quantitative Methods > C6 - Mathematical Methods ; Programming Models ; Mathematical and Simulation Modeling > C63 - Computational Techniques ; Simulation Modeling C - Mathematical and Quantitative Methods > C8 - Data Collection and Data Estimation Methodology ; Computer Programs > C87 - Econometric Software |
Item ID: | 120235 |
Depositing User: | Mr Borys Dushyn |
Date Deposited: | 30 Apr 2024 19:28 |
Last Modified: | 30 Apr 2024 19:28 |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/120235 |