Tsagris, Michail and Preston, Simon and T.A. Wood, Andrew (2016): Improved classification for compositional data using the $\alpha$-transformation. Forthcoming in: Journal of Classification (2016)
Abstract
In compositional data analysis an observation is a vector containing non-negative values, only the relative sizes of which are considered to be of interest. Without loss of generality, a compositional vector can be taken to be a vector of proportions that sum to one. Data of this type arise in many areas including geology, archaeology, biology, economics and political science. In this paper we investigate methods for classification of compositional data. Our approach centres on the idea of using the $\alpha$-transformation to transform the data and then to classify the transformed data via regularised discriminant analysis and the k-nearest neighbours algorithm. Using the $\alpha$-transformation generalises two rival approaches in compositional data analysis, one (when $\alpha$ = 1) that treats the data as though they were Euclidean, ignoring the compositional constraint, and another (when $\alpha$ = 0) that employs Aitchison's centred log-ratio transformation. A numerical study with several real datasets shows that whether using $\alpha$ = 1 or $\alpha$ = 0 gives better classification performance depends on the dataset, and moreover that using an intermediate value of $\alpha$ can sometimes give better performance than using either 1 or 0.
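
The two limiting cases described in the abstract can be illustrated with a short sketch. The abstract does not state the formula, so the code below assumes the usual power-transformation form: raise each component to the power $\alpha$, renormalise to sum to one, then centre and scale as $(D u - 1)/\alpha$, where $D$ is the number of components. The function name `alpha_transform` is hypothetical, and any further projection step a published definition may apply (for example onto $D-1$ coordinates) is deliberately left out.

```python
import numpy as np

def alpha_transform(x, alpha):
    """Sketch of a power (alpha) transformation for compositions.

    Assumed form (not taken from the abstract):
        u_i = x_i**alpha / sum_j x_j**alpha
        z   = (D * u - 1) / alpha
    with the centred log-ratio used as the alpha = 0 limit.
    """
    x = np.asarray(x, dtype=float)
    # Close the data so that each composition sums to one.
    x = x / x.sum(axis=-1, keepdims=True)
    D = x.shape[-1]

    if alpha == 0:
        # Limit alpha -> 0: centred log-ratio, log(x_i) - mean(log(x)).
        # Requires strictly positive components.
        logx = np.log(x)
        return logx - logx.mean(axis=-1, keepdims=True)

    # Power-transform and renormalise onto the simplex.
    u = x ** alpha
    u = u / u.sum(axis=-1, keepdims=True)
    # Centre and scale; for alpha = 1 this is the linear map D*x - 1.
    return (D * u - 1.0) / alpha


if __name__ == "__main__":
    comp = np.array([0.2, 0.3, 0.5])
    for a in (1.0, 0.5, 0.0):
        print(a, alpha_transform(comp, a))
```

Under these assumptions, $\alpha$ = 1 is only a shift and rescaling of the raw proportions, so Euclidean methods applied afterwards behave as they would on the untransformed data, while $\alpha$ = 0 reproduces the centred log-ratio. The transformed vectors could then be passed to any standard classifier, such as k-nearest neighbours, with $\alpha$ chosen by cross-validation.
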
Item Type: | MPRA Paper |
---|---|
Original Title: | Improved classification for compositional data using the $\alpha$-transformation |
Language: | English |
Keywords: | compositional data, classification, $\alpha$-transformation, $\alpha$-metric, Jensen-Shannon divergence |
Subjects: | C - Mathematical and Quantitative Methods > C1 - Econometric and Statistical Methods and Methodology: General > C18 - Methodological Issues: General |
Item ID: | 67657 |
Depositing User: | Mr Michail Tsagris |
Date Deposited: | 05 Nov 2015 14:57 |
Last Modified: | 01 Oct 2019 09:03 |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/67657 |