Fantazzini, Dean and Xiao, Yufeng (2023): Detecting Pump-and-Dumps with Crypto-Assets: Dealing with Imbalanced Datasets and Insiders’ Anticipated Purchases. Forthcoming in: Econometrics
Abstract
Detecting pump-and-dump schemes involving cryptoassets with high-frequency data is challenging due to imbalanced datasets and the early occurrence of unusual trading volumes. To address these issues, we propose constructing synthetic balanced datasets using resampling methods and flagging a pump-and-dump from the moment of public announcement up to 60 min beforehand. We validated our proposals using data from Pumpolymp and the CryptoCurrency eXchange Trading Library to identify 351 pump signals relative to the Binance crypto exchange in 2021 and 2022. We found that the most effective approach was using the original imbalanced dataset with pump-and-dumps flagged 60 min in advance, together with a random forest model with data segmented into 30-s chunks and regressors computed with a moving window of 1 h. Our analysis revealed that a better balance between sensitivity and specificity could be achieved by simply selecting an appropriate probability threshold, such as setting the threshold close to the observed prevalence in the original dataset. Resampling methods were useful in some cases, but threshold-independent measures were not affected. Moreover, detecting pump-and-dumps in real-time involves high-dimensional data, and the use of resampling methods to build synthetic datasets can be time-consuming, making them less practical.
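The abstract's point about probability thresholds can be illustrated with a minimal sketch. The labels and scores below are made-up toy values, not the paper's data, and the helper name `sensitivity_specificity` is hypothetical. The sketch only shows how, for a fixed set of predicted probabilities, moving the cutoff from the conventional 0.5 to the observed prevalence changes the balance between sensitivity and specificity.

```python
def sensitivity_specificity(labels, probs, threshold):
    """Return (sensitivity, specificity) when flagging probs >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy, made-up data: 2 flagged chunks out of 20 (an imbalanced sample).
labels = [1, 1] + [0] * 18
# Hypothetical classifier scores; the rare class rarely reaches 0.5.
probs = [0.40, 0.15] + [0.30, 0.12] + [0.02] * 16

# Observed prevalence of the minority class in this toy sample.
prevalence = sum(labels) / len(labels)

default_result = sensitivity_specificity(labels, probs, 0.5)
prevalence_result = sensitivity_specificity(labels, probs, prevalence)

print("threshold 0.5       :", default_result)
print("threshold prevalence:", prevalence_result)
```

In this toy sample the 0.5 cutoff flags nothing, while the prevalence cutoff recovers both positives at the cost of a few false alarms. The scores themselves are unchanged, which is why threshold-independent measures are not affected by the choice of cutoff.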
Item Type: | MPRA Paper |
---|---|
Original Title: | Detecting Pump-and-Dumps with Crypto-Assets: Dealing with Imbalanced Datasets and Insiders’ Anticipated Purchases |
Language: | English |
Keywords: | pump-and-dump; crypto-assets; minority class; class imbalance; machine learning; random forests |
Subjects: | C - Mathematical and Quantitative Methods > C1 - Econometric and Statistical Methods and Methodology: General > C14 - Semiparametric and Nonparametric Methods: General<br>C - Mathematical and Quantitative Methods > C2 - Single Equation Models ; Single Variables > C25 - Discrete Regression and Qualitative Choice Models ; Discrete Regressors ; Proportions ; Probabilities<br>C - Mathematical and Quantitative Methods > C3 - Multiple or Simultaneous Equation Models ; Multiple Variables > C35 - Discrete Regression and Qualitative Choice Models ; Discrete Regressors ; Proportions<br>C - Mathematical and Quantitative Methods > C3 - Multiple or Simultaneous Equation Models ; Multiple Variables > C38 - Classification Methods ; Cluster Analysis ; Principal Components ; Factor Models<br>C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C51 - Model Construction and Estimation<br>C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C53 - Forecasting and Prediction Methods ; Simulation Methods<br>C - Mathematical and Quantitative Methods > C5 - Econometric Modeling > C58 - Financial Econometrics<br>G - Financial Economics > G1 - General Financial Markets > G17 - Financial Forecasting and Simulation<br>G - Financial Economics > G3 - Corporate Finance and Governance > G32 - Financing Policy ; Financial Risk and Risk Management ; Capital and Ownership Structure ; Value of Firms ; Goodwill<br>K - Law and Economics > K4 - Legal Procedure, the Legal System, and Illegal Behavior > K42 - Illegal Behavior and the Enforcement of Law |
Item ID: | 118435 |
Depositing User: | Prof. Dean Fantazzini |
Date Deposited: | 31 Aug 2023 14:10 |
Last Modified: | 31 Aug 2023 14:12 |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/118435 |