Chapter 5 Analysis of imputed data

You must use a computer to do data science; you cannot do it in your head, or with pencil and paper.

— Hadley Wickham

Creating plausible imputations is the most challenging activity in multiple imputation. Once we have the multiply imputed data, we can estimate the parameters of scientific interest from each of the \(m\) imputed datasets without having to deal with missing data, since every dataset is complete. These repeated analyses produce \(m\) results.
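The repeated-analysis idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow of any particular package: the helper `make_toy_imputed_datasets` is hypothetical and merely fakes \(m\) completed datasets, and the complete-data analysis is taken to be a simple regression slope with its squared standard error.

```python
import random
import statistics


def make_toy_imputed_datasets(m, n=50, seed=1):
    """Hypothetical stand-in for step 1 (not a real imputation method).

    Returns m completed datasets, each a list of (x, y) pairs. The observed
    values are shared; only the values treated as missing differ by dataset.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y_obs = [2.0 * xi + rng.gauss(0, 1) for xi in x]
    missing = set(range(0, n, 5))  # pretend every fifth y was missing
    datasets = []
    for _ in range(m):
        y = [
            2.0 * x[i] + rng.gauss(0, 1) if i in missing else y_obs[i]
            for i in range(n)
        ]
        datasets.append(list(zip(x, y)))
    return datasets


def fit_slope(data):
    """Complete-data analysis: least-squares slope and its squared standard error."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    n = len(data)
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(xs, ys))
    variance = rss / (n - 2) / sxx
    return slope, variance


m = 5
# Step 2: run the same analysis on every completed dataset.
results = [fit_slope(d) for d in make_toy_imputed_datasets(m)]

for i, (estimate, variance) in enumerate(results, start=1):
    print(f"dataset {i}: estimate={estimate:.3f}, variance={variance:.4f}")
```

The list `results` holds one estimate and one variance per imputed dataset. The analysis function never has to handle missing values, and the spread among the \(m\) estimates reflects the uncertainty caused by the missing data.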

The \(m\) results feed into step 3 (pooling the results). The pooling step that yields the final statistical inferences is relatively straightforward, but its application in practice is not entirely free of problems. For one, the complete-data analyses are nontrivial. Historically, the imputation literature (including the first edition of this book) has concentrated on step 1 (creating the imputations) and on step 3 (pooling the results), and has worked from the notion that step 2 (estimating the parameters) is well-specified and easy to execute once the data are complete. In practice, step 2 can be quite involved. It often includes model searching, optimization, validation, prediction, and assessment of the quality of model fit. In fact, step 2 may embrace almost any aspect of machine learning and data science. All of these analyses need to be repeated for each of the \(m\) datasets, which may put a considerable burden on the data analyst.

Fortunately, thanks to tremendous recent advances in computational technology, the use of modern data science techniques in step 2 is now becoming feasible. This chapter focuses on step 2. The next chapter addresses issues related to step 3.