Prada, Sergio I and Gonzalez, Claudia and Borton, Joshua and Fernandes-Huessy, Johannes and Holden, Craig and Hair, Elizabeth and Mulcahy, Tim (2011): Avoiding disclosure of individually identifiable health information: a literature review. Published in: SAGE Open (14. December 2011): pp. 1-16.

Abstract
Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on deidentification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations of the most common access methods. Although there is abundant theoretical and empirical research, their review reveals a lack of consensus on fundamental questions for empirical practice: how to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.
Item Type:  MPRA Paper 

Original Title:  Avoiding disclosure of individually identifiable health information: a literature review 
Language:  English 
Keywords:  public use files, disclosure avoidance, reidentification, deidentification, data utility 
Subjects:  I - Health, Education, and Welfare > I1 - Health > I18 - Government Policy ; Regulation ; Public Health
C - Mathematical and Quantitative Methods > C4 - Econometric and Statistical Methods: Special Topics > C46 - Specific Distributions ; Specific Statistics
Item ID:  35463 
Depositing User:  Sergio Prada 
Date Deposited:  19. Dec 2011 02:14 
Last Modified:  22. Sep 2015 23:17 
URI:  https://mpra.ub.uni-muenchen.de/id/eprint/35463