Chen, Song Xi and Qin, Yingli (2010): A Two Sample Test for High Dimensional Data with Applications to Gene-set Testing. Published in: The Annals of Statistics, Vol. 38 (2010): pp. 808-835.

Abstract
We propose a two-sample test for the means of high-dimensional data when the data dimension is much larger than the sample size. The classical Hotelling's $T^2$ test does not work in this "large p, small n" situation. The proposed test does not require explicit conditions on the relationship between the data dimension and the sample size, which offers much flexibility in analyzing high-dimensional data. One application of the proposed test is testing the significance of sets of genes, which we demonstrate in an empirical study on a leukemia data set.
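To make the idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the statistic takes the form we recall from the published paper: the squared distance between the two sample means with the same-observation products removed, so that no covariance matrix has to be inverted. This record itself does not state the formula, so treat that form as an assumption. The calibration is a plain permutation p-value, swapped in here for simplicity; the paper calibrates with an asymptotic normal limit, which this sketch does not attempt. The function names `mean_difference_statistic` and `permutation_pvalue` are hypothetical.

```python
import random


def column_sums(rows):
    """Coordinate-wise sum of a list of equal-length vectors."""
    return [sum(col) for col in zip(*rows)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_difference_statistic(x, y):
    """Assumed form of the statistic: averages of x_i'x_j and y_i'y_j over
    i != j, minus twice the average of x_i'y_j over all pairs.
    x and y are lists of p-dimensional observations, each with >= 2 rows."""
    n1, n2 = len(x), len(y)
    sx, sy = column_sums(x), column_sums(y)
    # Identity: sum_{i != j} x_i'x_j = ||sum_i x_i||^2 - sum_i ||x_i||^2
    off_x = dot(sx, sx) - sum(dot(r, r) for r in x)
    off_y = dot(sy, sy) - sum(dot(r, r) for r in y)
    cross = dot(sx, sy)
    return (off_x / (n1 * (n1 - 1))
            + off_y / (n2 * (n2 - 1))
            - 2.0 * cross / (n1 * n2))


def permutation_pvalue(x, y, n_perm=499, seed=0):
    """Permutation calibration (an assumption of this sketch, not the
    paper's asymptotic normal calibration). Relabelling the pooled sample
    presumes both groups share a distribution under the null."""
    rng = random.Random(seed)
    observed = mean_difference_statistic(x, y)
    pooled = list(x) + list(y)
    n1 = len(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reorders the local copy only
        if mean_difference_statistic(pooled[:n1], pooled[n1:]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Simulated "large p, small n" data: p = 50 coordinates, 8 rows per group.
    rng = random.Random(1)
    x = [[rng.gauss(1.0, 1.0) for _ in range(50)] for _ in range(8)]
    y = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(8)]
    print(mean_difference_statistic(x, y), permutation_pvalue(x, y))
```

Because only inner products of observations enter the computation, the dimension may exceed the sample sizes without any matrix inversion, which is the situation the abstract targets.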
Item Type:  MPRA Paper 

Original Title:  A Two Sample Test for High Dimensional Data with Applications to Gene-set Testing 
English Title:  A Two Sample Test for High Dimensional Data with Applications to Gene-set Testing 
Language:  English 
Keywords:  large p small n; martingale central limit theorem; multiple comparison. 
Subjects:  C Mathematical and Quantitative Methods > C1 Econometric and Statistical Methods and Methodology: General; C Mathematical and Quantitative Methods > C1 Econometric and Statistical Methods and Methodology: General > C12 Hypothesis Testing: General; C Mathematical and Quantitative Methods > C1 Econometric and Statistical Methods and Methodology: General > C14 Semiparametric and Nonparametric Methods: General 
Item ID:  59642 
Depositing User:  Professor Song Xi Chen 
Date Deposited:  04. Nov 2014 05:47 
Last Modified:  04. Nov 2014 06:22 
URI:  https://mpra.ub.uni-muenchen.de/id/eprint/59642 