Lagani, Vincenzo and Athineou, Giorgos and Farcomeni, Alessio and Tsagris, Michail and Tsamardinos, Ioannis (2016): Feature Selection with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets. Forthcoming in: Journal of Statistical Software
Abstract
The statistically equivalent signature (SES) algorithm is a feature-selection method inspired by the principles of constraint-based learning of Bayesian networks. Most currently available feature-selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple predictive feature subsets whose performances are statistically equivalent. In that respect, SES subsumes and extends previous feature-selection algorithms, such as the max-min parents and children algorithm. SES is implemented in a function of the same name included in the R package MXM, which stands for mens ex machina, Latin for 'mind from the machine'. The MXM implementation of SES handles several data-analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature-selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real-world data.
Item Type: | MPRA Paper |
---|---|
Original Title: | Feature Selection with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets |
Language: | English |
Keywords: | feature selection, constraint-based algorithms, multiple predictive signatures |
Subjects: | C - Mathematical and Quantitative Methods > C8 - Data Collection and Data Estimation Methodology ; Computer Programs > C88 - Other Computer Software |
Item ID: | 72772 |
Depositing User: | Mr Michail Tsagris |
Date Deposited: | 31 Jul 2016 04:47 |
Last Modified: | 02 Oct 2019 20:37 |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/72772 |