Influence of Pattern of Missing Data on Performance of Imputation Methods: An Example from National Data on Drug Injection in Prisons

Document Type: Original Article

Authors

1 Regional Knowledge Hub for HIV/AIDS Surveillance, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran

2 Social Determinant of Health Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran

3 Research Center for Modeling in Health, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran

Abstract

Background
Policy makers need models that can identify groups at high risk of HIV infection. National data sets frequently contain incomplete records and dirty data, and missing data complicate model development. Several studies have suggested that imputation methods perform acceptably when the missing rate is moderate. The role of the pattern of missing data has received less attention and is addressed here.
 
Methods
We used data on 2720 prisoners. Results from fitting a regression model to the complete data served as the gold standard. Missing data were then generated so that 10%, 20%, and 50% of the data were lost. In scenario 1, we generated missing values, at the above rates, in a single variable that was significant in the gold standard model (age). In scenario 2, a small proportion of each independent variable was removed. Four imputation methods were compared, under different events per variable (EPV) values, in terms of selection of important variables and parameter estimation.
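The two missing-data scenarios can be illustrated with a minimal Python sketch. It is not the authors' procedure. The data are synthetic, and the function names, the column names other than age, and the random selection of cells are assumptions made for illustration. For scenario 2 in particular, the sketch assumes that the stated rate is the share of incomplete records and that each incomplete record loses exactly one randomly chosen covariate. The EPV helper uses the usual definition: the number of outcome events divided by the number of candidate predictors.

```python
import numpy as np
import pandas as pd


def drop_in_one_variable(df, column, rate, rng):
    """Scenario 1 (sketch): set a fraction `rate` of a single column to missing."""
    out = df.astype(float)  # astype returns a copy; float columns can hold NaN
    n_drop = int(round(rate * len(out)))
    rows = rng.choice(len(out), size=n_drop, replace=False)
    out.iloc[rows, out.columns.get_loc(column)] = np.nan
    return out


def drop_across_variables(df, columns, rate, rng):
    """Scenario 2 (sketch, assumed reading): a fraction `rate` of records
    each lose one randomly chosen covariate, so every covariate loses only
    a small share of its values."""
    out = df.astype(float)
    n_drop = int(round(rate * len(out)))
    rows = rng.choice(len(out), size=n_drop, replace=False)
    picks = rng.choice(len(columns), size=n_drop)
    for row, pick in zip(rows, picks):
        out.iloc[row, out.columns.get_loc(columns[pick])] = np.nan
    return out


def events_per_variable(outcome, n_predictors):
    """EPV: number of outcome events divided by number of candidate predictors."""
    return float(np.sum(outcome)) / n_predictors


# Synthetic stand-in for the covariates; x1..x3 are hypothetical binary variables.
rng = np.random.default_rng(0)
n = 2720
data = pd.DataFrame({
    "age": rng.normal(35, 10, n),
    "x1": rng.integers(0, 2, n),
    "x2": rng.integers(0, 2, n),
    "x3": rng.integers(0, 2, n),
})

for rate in (0.10, 0.20, 0.50):
    s1 = drop_in_one_variable(data, "age", rate, rng)
    s2 = drop_across_variables(data, list(data.columns), rate, rng)
    # Share of missing age values (scenario 1) and of incomplete records (scenario 2)
    print(rate, s1["age"].isna().mean(), s2.isna().any(axis=1).mean())
```

In this sketch both scenarios leave the same share of records incomplete at a given rate, but scenario 1 concentrates all of the loss in age, while scenario 2 removes only about a quarter of that amount from any single covariate.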
 
Results
In scenario 2, bias in the estimates was low and all methods for handling missing data performed similarly. All methods at all missing rates were able to detect the significance of age. In scenario 1, bias in the estimates increased, particularly at the 50% missing rate. Here, at EPVs of 10 and 5, the imputation methods failed to capture the effect of age.
 
Conclusion
In scenario 2, all imputation methods, at all missing rates, were able to detect age as significant. This was not the case in scenario 1. Our results show that the performance of imputation methods depends on the pattern of missing data.

