Limosani, Michele and Millemaci, Emanuele and Mustica, Paolo (2023): An efficient Bayes classifier for word classification: an application on the EU Recovery and Resilience Plans.
Abstract
This paper proposes the Prior Adaptive Bayes (PAB) classifier, a new algorithm for assigning the words in a text to their respective topics. It is an adaptation of the Bayes classifier in which the prior probabilities of the classes are replaced by the posterior probabilities associated with the adjacent words. Simulations show an improvement of more than 20% over the standard Bayes classifier. The PAB classifier is applied to the Recovery and Resilience Plans (RRPs) of the 27 European Union member states to evaluate how closely they align with the environmental dimension of the Sustainable Development Goals (SDGs) relative to the socioeconomic one. The results show that the attention countries pay to the pro-environment SDGs increases with the funds per capita assigned, the gap in environmental endowment, and touristic attractiveness. Finally, the environmental dimension appears to be positively associated with the available GDP growth projections for the next few years.
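The idea of reusing a neighbouring word's posterior as the prior can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which adjacent words are used or how the likelihoods are estimated, so the sketch assumes a single left-to-right pass in which each word's posterior becomes the prior for the next word. The topic names, the toy likelihood table `lik`, and the function names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy likelihoods P(word | topic); the values are invented for illustration.
lik: Dict[str, Dict[str, float]] = {
    "environment":   {"emissions": 0.50, "investment": 0.20, "schools": 0.05, "renewable": 0.25},
    "socioeconomic": {"emissions": 0.05, "investment": 0.25, "schools": 0.50, "renewable": 0.20},
}
base_prior: Dict[str, float] = {"environment": 0.5, "socioeconomic": 0.5}
words: List[str] = ["emissions", "investment"]


def posterior(word: str, prior: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule: P(topic | word) is proportional to P(word | topic) * P(topic)."""
    # Unknown words get a small floor likelihood (an assumption of this sketch).
    unnorm = {t: lik[t].get(word, 1e-6) * p for t, p in prior.items()}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


def standard_bayes(tokens: List[str]) -> List[str]:
    """Classify every word independently with the same fixed prior."""
    labels = []
    for w in tokens:
        post = posterior(w, base_prior)
        labels.append(max(post, key=post.get))
    return labels


def adaptive_prior_bayes(tokens: List[str]) -> List[str]:
    """Assumed reading of the abstract: the previous word's posterior is the next word's prior."""
    labels = []
    prior = dict(base_prior)  # the first word has no neighbour yet, so start from the base prior
    for w in tokens:
        post = posterior(w, prior)
        labels.append(max(post, key=post.get))
        prior = post  # carry the posterior forward to the adjacent word
    return labels


print(standard_bayes(words))        # ['environment', 'socioeconomic']
print(adaptive_prior_bayes(words))  # ['environment', 'environment']
```

In this toy case the ambiguous word "investment" leans socioeconomic under a fixed uniform prior, but is assigned to the environmental topic once the strongly environmental neighbour "emissions" shapes its prior. Chaining posteriors over a whole text accumulates evidence from every earlier word, so a practical version would probably restrict or reset the context; how the paper handles this is not stated in the abstract.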
Item Type: | MPRA Paper |
---|---|
Original Title: | An efficient Bayes classifier for word classification: an application on the EU Recovery and Resilience Plans |
Language: | English |
Keywords: | textual analysis; Prior Adaptive Bayes classifier; Recovery and Resilience Plans; Sustainable Development Goals; pro-environment policy |
Subjects: | (1) C - Mathematical and Quantitative Methods > C8 - Data Collection and Data Estimation Methodology; Computer Programs > C82 - Methodology for Collecting, Estimating, and Organizing Macroeconomic Data; Data Access (2) H - Public Economics > H2 - Taxation, Subsidies, and Revenue > H22 - Incidence (3) O - Economic Development, Innovation, Technological Change, and Growth > O4 - Economic Growth and Aggregate Productivity > O44 - Environment and Growth |
Item ID: | 119875 |
Depositing User: | Dr. Paolo Mustica |
Date Deposited: | 21 Jan 2024 12:11 |
Last Modified: | 21 Jan 2024 12:11 |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/119875 |