Montebruno, Piero and Bennett, Robert and Smith, Harry and van Lieshout, Carry (2019): Machine learning classification of entrepreneurs in British historical census data. Published in: Information Processing & Management, Vol. 57, No. 3 (May 2020): p. 102210.
This is the latest version of this item.
MPRA_paper_106931.pdf (9MB)
Abstract
This paper presents a binary classification of entrepreneurs in British historical data, based on the recent availability of big data from the I-CeM dataset. The main task of the paper is to attribute an employment status to individuals who did not fully report entrepreneur status in the earlier censuses (1851-1881). The paper assesses the accuracy of different classifiers and machine learning algorithms, including Deep Learning, for this classification problem. We first adopt a ground-truth dataset from the later censuses to train the computer with a Logistic Regression (the standard in the literature for this kind of binary classification) to recognize entrepreneurs as distinct from non-entrepreneurs (i.e. workers). Our initial accuracy for this baseline method is 0.74. We compare the Logistic Regression with ten optimized machine learning algorithms: Nearest Neighbors, Linear and Radial Support Vector Machine, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis. The best results come from boosting and ensemble methods: AdaBoost achieves an accuracy of 0.95. Deep Learning, as a standalone category of algorithms, further improves accuracy to 0.96 without using the rich text data of the OccString feature, a string of up to 500 characters containing the full occupational statement of each individual collected in the earlier censuses. Finally, now using this OccString feature, we implement both shallow learning (a bag-of-words algorithm) and Deep Learning (a Recurrent Neural Network with a Long Short-Term Memory layer). These methods all achieve accuracies above 0.99, with the Deep Learning Recurrent Neural Network as the best model at an accuracy of 0.9978. The results show that standard algorithms for classification can be outperformed by machine learning algorithms.
This confirms the value of extending the techniques traditionally used in the literature for this type of classification problem.
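The comparison described in the abstract (a Logistic Regression baseline against ensemble methods such as AdaBoost, fitted on a labeled ground-truth set and scored by accuracy) can be sketched with scikit-learn. This is a minimal illustration only: it uses synthetic data in place of the I-CeM census features and labels, which are not reproduced here, so the printed accuracies will not match the paper's figures.

```python
# Sketch of the classifier comparison: Logistic Regression baseline vs.
# AdaBoost, trained on a labeled set and scored on a held-out test set.
# Synthetic data stands in for the I-CeM ground-truth dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the ground truth: a binary entrepreneur/worker label.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "Logistic Regression (baseline)": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```

The same pattern extends to the paper's full list of ten algorithms by adding further scikit-learn estimators to the dictionary.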
Item Type: | MPRA Paper |
---|---|
Original Title: | Machine learning classification of entrepreneurs in British historical census data |
Language: | English |
Keywords: | machine learning; deep learning; logistic regression; classification; big data; census |
Subjects: | M - Business Administration and Business Economics; Marketing; Accounting; Personnel Economics > M1 - Business Administration > M13 - New Firms; Startups ; N - Economic History > N8 - Micro-Business History > N83 - Europe: Pre-1913 |
Item ID: | 106931 |
Depositing User: | Dr Piero Montebruno |
Date Deposited: | 06 Apr 2021 01:43 |
Last Modified: | 06 Apr 2021 01:43 |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/106931 |
Available Versions of this Item
- Machine learning classification of entrepreneurs in British historical census data. (deposited 28 Jun 2020 12:30)
- Machine learning classification of entrepreneurs in British historical census data. (deposited 06 Apr 2021 01:43) [Currently Displayed]