4.2 Issues in multivariate imputation

Most imputation models for \(Y_j\) use the remaining columns \(Y_{-j}\) as predictors. The rationale is that conditioning on \(Y_{-j}\) preserves the relations among the \(Y_j\) in the imputed data. Various practical problems have been identified that can occur with multivariate missing data:

  • The predictors \(Y_{-j}\) themselves can contain missing values;

  • “Circular” dependence can occur, where \(Y_j^\mathrm{mis}\) depends on \(Y_h^\mathrm{mis}\), and \(Y_h^\mathrm{mis}\) depends on \(Y_j^\mathrm{mis}\) with \(h \neq j\), because in general \(Y_j\) and \(Y_h\) are correlated, even given other variables;

  • Variables are often of different types (e.g., binary, unordered, ordered, continuous), so that theoretically convenient models, such as the multivariate normal, are strictly speaking inappropriate;

  • Especially with large \(p\) and small \(n\), collinearity or empty cells can occur;

  • The ordering of the rows and columns can be meaningful, e.g., as in longitudinal data;

  • The relation between \(Y_j\) and predictors \(Y_{-j}\) can be complex, e.g., nonlinear, or subject to censoring processes;

  • Imputation can create impossible combinations, such as pregnant fathers.
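The first issue in the list can be checked before choosing an imputation strategy. The following is a minimal sketch in Python with NumPy, assuming the data are held in a numeric array `Y` in which `NaN` marks a missing entry. The function name `incomplete_predictor_counts` is hypothetical and introduced only for illustration. For each column \(Y_j\) it counts the rows in which \(Y_j\) is missing and at least one predictor in \(Y_{-j}\) is missing as well.

```python
import numpy as np

def incomplete_predictor_counts(Y):
    """For each column j, count the rows where Y_j is missing and
    at least one of the other columns is missing too."""
    R = np.isnan(Y)                     # True where a value is missing
    n_missing_per_row = R.sum(axis=1)   # number of missing entries per row
    counts = []
    for j in range(Y.shape[1]):
        # Missing entries in the row, not counting column j itself
        others_missing = (n_missing_per_row - R[:, j]) > 0
        counts.append(int(np.sum(R[:, j] & others_missing)))
    return counts

# Small illustrative data set (made up for this sketch)
Y = np.array([
    [1.0,    2.0,    np.nan],
    [np.nan, np.nan, 3.0],
    [4.0,    np.nan, 6.0],
    [7.0,    8.0,    9.0],
])
counts = incomplete_predictor_counts(Y)
print(counts)
```

A nonzero count for column \(j\) means that a univariate imputation model for \(Y_j\) cannot be applied to those rows without first dealing with the missing predictors.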

This list is by no means exhaustive, and other complexities may appear for particular data. The next sections discuss three general strategies for imputing multivariate data:

  1. Monotone data imputation. For monotone missing data patterns, imputations are created by a sequence of univariate methods;

  2. Joint modeling. For general patterns, imputations are drawn from a multivariate model fitted to the data;

  3. Fully conditional specification, also known as chained equations and sequential regressions. For general patterns, a multivariate model is implicitly specified by a set of conditional univariate models. Imputations are created by drawing from iterated conditional models.
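The third strategy can be illustrated with a deliberately simplified sketch in Python with NumPy. It assumes that all variables are continuous, that each conditional model is a normal linear regression of \(Y_j\) on \(Y_{-j}\), and that every column has at least one observed value. The function `fcs_impute` and its parameters `n_iter` and `seed` are hypothetical names chosen for this example. Note that the sketch adds residual noise but does not draw the regression parameters from their posterior distribution, so it ignores parameter uncertainty and should not be read as a complete implementation.

```python
import numpy as np

def fcs_impute(Y, n_iter=10, seed=0):
    """Toy fully conditional specification for continuous data.

    Each incomplete column is regressed on all other columns, and its
    missing entries are replaced by the prediction plus normal noise.
    Simplification: parameter uncertainty is ignored.
    """
    rng = np.random.default_rng(seed)
    Y = np.array(Y, dtype=float)        # work on a copy
    R = np.isnan(Y)                     # missingness indicator
    n, p = Y.shape

    # Starting imputations: random draws from the observed values
    for j in range(p):
        if R[:, j].any():
            observed = Y[~R[:, j], j]
            Y[R[:, j], j] = rng.choice(observed, size=R[:, j].sum())

    # Iterate over the conditional univariate models
    for _ in range(n_iter):
        for j in range(p):
            mis = R[:, j]
            if not mis.any():
                continue
            # Predictors: intercept plus the current values of Y_{-j}
            X = np.column_stack([np.ones(n), np.delete(Y, j, axis=1)])
            # Fit on the rows where Y_j is observed
            beta, *_ = np.linalg.lstsq(X[~mis], Y[~mis, j], rcond=None)
            resid = Y[~mis, j] - X[~mis] @ beta
            dof = max((~mis).sum() - X.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Replace the missing entries by prediction plus noise
            Y[mis, j] = X[mis] @ beta + rng.normal(0.0, sigma, size=mis.sum())
    return Y

# Simulated example data (made up for this sketch)
rng = np.random.default_rng(1)
cov = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
Y = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=200)
Y[rng.random(200) < 0.2, 0] = np.nan
Y[rng.random(200) < 0.2, 1] = np.nan

imputed = fcs_impute(Y, n_iter=10, seed=0)
```

Because the predictors \(Y_{-j}\) are filled in with their current imputations at each step, the sketch sidesteps both the missing-predictor problem and the circular dependence listed above. Calling the function with different values of `seed` gives different completed data sets.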