12.1 Some dangers, some do’s and some don’ts

Any statistical technique has limitations and pitfalls, and multiple imputation is no exception. This book emphasizes the virtues of being flexible, but flexibility comes at a price. The next sections outline some dangers, do’s and don’ts.

12.1.1 Some dangers

The major danger of the technique is that it may provide nonsensical or even misleading results if applied without appropriate care or insight. Multiple imputation is not a simple technical fix for the missing data. Scientific and statistical judgment comes into play at various stages: during diagnosis of the missing data problem, in the setup of a good imputation model, during validation of the quality of the generated synthetic data and in combining the repeated analyses. While software producers attempt to set defaults that will work in a large variety of cases, we cannot simply hand over our scientific decisions to the software. We need to open the black box, and adjust the process when appropriate.

The MICE algorithm is univariate optimal, but not necessarily multivariate optimal. There is no clear theoretical rationale for convergence of the multivariate algorithm. The main justification of the MICE algorithm rests on simulation studies. The research on this topic is intensifying. Even though the results obtained thus far are reassuring, at this moment it is not possible to outline in advance the precise conditions that would guarantee convergence for some set of conditionally specified models.

Another danger occurs if the imputation model is uncongenial (Meng 1994; Schafer 2003). Uncongeniality can occur if the imputation model is specified as more restrictive than the complete-data model, or if it fails to account for important factors in the missing data mechanism. Both types of omission can lead to biased and possibly inefficient estimates. The other side of the coin is that multiple imputation can be more efficient if the imputer uses information that is not accessible to the analyst. The statistical inferences may then become more precise than those obtained by maximum likelihood, a property known as superefficiency (Rubin 1996).

There are many data-analytic situations for which we do not yet know the appropriate way to generate imputations. For example, it is not yet clear how design factors of a complex sampling design, e.g., a stratified cluster sample, should be incorporated into the imputation model. Also, relatively little is known about how to impute nested and hierarchical data, or autocorrelated data that form time series. These problems are not inherent limitations of multiple imputation, but of course they may impede the practical application of the imputation techniques for certain types of data.

12.1.2 Some do’s

Constructing good imputation models requires analytic skills. The following list of do’s summarizes some of the advice given in this book.

Find out the reasons for the missing data;
Include the outcome variable(s) in the imputation model;
Include factors that govern the missingness in the imputation model;
Impute categorical data by techniques for categorical data;
Remove response indicators from the imputation model;
Aim for a scope broad enough for all analyses;
Set the random seed to make the results reproducible;
Break any direct feedback loops that arise in passive imputation;
Inspect the trace lines for slow convergence;
Inspect the imputed data;
Evaluate whether the imputed data could have been real data if they had not been missing;
Take \(m\) = 5 for model building, and increase afterward if needed;
Specify simple MNAR models for sensitivity analysis;
Impute by proper imputation methods;
Impute by robust hot deck-like models like predictive mean matching;
Reduce a large imputation model into smaller components;
Transform statistics toward approximate normality before pooling;
Assess critical assumptions about the missing data mechanism;
Eliminate badly connected, uninteresting variables (low influx, low outflux) from the imputation model;
Take obvious features like non-negativity, functional relations and skewed distributions in the data into account in the imputation model;
Use more flexible (e.g., nonlinear or nonparametric) imputation models;
Perform the analysis on each imputed dataset separately, then pool the results;
Describe potential departures from MAR;
Report accurately and concisely.
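
Two of the items above, transforming statistics toward normality before pooling and pooling the repeated analyses, can be illustrated with a small sketch. It pools a correlation coefficient by applying Rubin's rules on the Fisher \(z\) scale and then transforming back. The function names `pool_rubin` and `pool_correlation`, the five correlations and the sample size are hypothetical and chosen for illustration only. The interval uses the normal quantile 1.96 as a simplification; a fuller treatment would use a \(t\) reference distribution with the appropriate degrees of freedom.

```python
import math


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                              # total variance
    return q_bar, t


def pool_correlation(rs, n):
    """Pool correlations on the Fisher z scale, then back-transform.

    Assumes each correlation is based on n complete (imputed) rows, so
    that the variance of z is approximately 1 / (n - 3).
    """
    zs = [math.atanh(r) for r in rs]          # Fisher z transformation
    vs = [1 / (n - 3)] * len(rs)
    z_bar, t = pool_rubin(zs, vs)
    se = math.sqrt(t)
    # Simplification: normal quantile instead of a t quantile.
    lower = math.tanh(z_bar - 1.96 * se)
    upper = math.tanh(z_bar + 1.96 * se)
    return math.tanh(z_bar), (lower, upper)


# Hypothetical correlations from m = 5 imputed datasets of n = 100 rows.
rs = [0.52, 0.48, 0.55, 0.50, 0.47]
pooled_r, interval = pool_correlation(rs, n=100)
print(round(pooled_r, 3), [round(v, 3) for v in interval])
```

Pooling on the \(z\) scale keeps the back-transformed interval inside \((-1, 1)\), which averaging the raw correlations and their standard errors does not guarantee.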

12.1.3 Some don’ts

Do not:

Use multiple imputation if simpler methods are valid;
Take predictions as imputations;
Impute blindly;
Put too much faith in the defaults;
Average the multiply imputed data;
Create imputations using a model that is more restrictive than needed;
Uncritically accept imputations that are very different from the observed data.
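
The warning against taking predictions as imputations can be made concrete with a small simulation. The sketch below is an illustration on made-up data: the sample size, the model \(y = x + e\), the 50% missingness rate and the seed are all assumptions. It compares filling the missing \(y\) values with regression predictions against adding a random residual to each prediction. The second variant still treats the estimated regression parameters as known, so it is not a proper imputation method either; it only shows what is lost when the noise is left out.

```python
import math
import random
import statistics

random.seed(123)  # fixed seed so the illustration is reproducible

# Hypothetical data: y depends linearly on x, and about half of y is missing (MCAR).
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.5 for _ in range(n)]

# Least-squares fit of y on x using the observed cases only.
x_obs = [xi for xi, mis in zip(x, missing) if not mis]
y_obs = [yi for yi, mis in zip(y, missing) if not mis]
mx, my = statistics.mean(x_obs), statistics.mean(y_obs)
sxx = sum((xi - mx) ** 2 for xi in x_obs)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x_obs, y_obs)) / sxx
intercept = my - slope * mx
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x_obs, y_obs)]
sigma = math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))

# Variant 1: impute the predicted value (the practice warned against).
y_pred = [intercept + slope * xi if mis else yi
          for xi, yi, mis in zip(x, y, missing)]

# Variant 2: impute the prediction plus a random residual draw.
y_noise = [intercept + slope * xi + random.gauss(0, sigma) if mis else yi
           for xi, yi, mis in zip(x, y, missing)]

print("variance of fully observed y :", round(statistics.variance(y), 2))
print("variance, predictions imputed:", round(statistics.variance(y_pred), 2))
print("variance, noise added        :", round(statistics.variance(y_noise), 2))
```

Imputed predictions all lie exactly on the regression line, so the completed data show too little variability and too strong a relation between \(x\) and \(y\). Adding residual noise restores the variability of the imputed values.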