12.4 Future developments

Multiple imputation is not a finished product or algorithm. New applications call for innovative ways to implement the key ideas. This section identifies some areas where further research could be useful.

12.4.1 Derived variables

Section 6.4 describes techniques to generate imputations for interactions, sum scores, quadratic terms and other derived variables. Many datasets contain derived variables of some form. The relations between the variables need to be maintained if imputations are to be plausible. There are no fail-safe methods for drawing imputations that preserve the proper interactions in the substantive models. One promising area of development is the rise of model-based forms of imputation that essentially take the substantive model as leading (cf. Section 4.5.5). It would be interesting to study how well model-based techniques can cope with derived variables of various sorts.

12.4.2 Algorithms for blocks and batches

In some applications it is useful to generalize the variable-by-variable scheme of the MICE algorithm to blocks. A block can contain just one variable, but also groups of variables. An imputation model is specified for each block, and the algorithm iterates over the blocks. Starting with mice 3.0, the user can specify blocks of variables that are structurally related, such as dummy variables, semi-continuous variables, bracketed responses, compositions, item subsets, and so on. It would be useful to obtain experience with the practical application of this facility, which could stimulate specification of the imputation model on a higher level of abstraction, and allow for mixes of joint and conditional models.
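The block-wise iteration described above can be sketched in a few lines. The following is a toy illustration in Python, not the mice implementation: the function name `impute_by_blocks`, its parameters, and the nearest-donor hot deck used as the per-block imputation model are all assumptions made for this example. It assumes numeric variables and at least one fully observed donor record per block.

```python
import random


def impute_by_blocks(rows, blocks, n_iter=3, k=3, seed=0):
    """Toy blocked imputation by nearest-donor hot deck (illustration only).

    rows   : list of dicts, one per record; None marks a missing value
    blocks : list of lists of variable names; each block is imputed as a unit
    """
    rng = random.Random(seed)
    variables = list(rows[0])
    # Record which cells were missing in the original data.
    missing = [{v for v in variables if row[v] is None} for row in rows]

    # Initialise every missing cell with a randomly drawn observed value.
    data = [dict(row) for row in rows]
    for v in variables:
        observed = [row[v] for row in rows if row[v] is not None]
        for rec in data:
            if rec[v] is None:
                rec[v] = rng.choice(observed)

    for _ in range(n_iter):
        for block in blocks:  # iterate over blocks, not single variables
            block_set = set(block)
            predictors = [v for v in variables if v not in block_set]
            # Donors are records in which the whole block was observed.
            donors = [j for j, m in enumerate(missing) if not (m & block_set)]
            for i, m in enumerate(missing):
                if not (m & block_set):
                    continue

                # Squared distance on the currently completed predictors.
                def distance(j):
                    return sum((data[i][p] - data[j][p]) ** 2 for p in predictors)

                donor = rng.choice(sorted(donors, key=distance)[:k])
                # Copy the missing cells of the block from one donor, so the
                # imputed values stay mutually consistent within the block.
                for v in m & block_set:
                    data[i][v] = rows[donor][v]
    return data


# Hypothetical data: x is continuous, d1 and d2 are dummies for one factor.
rows = [
    {"x": 1.0, "d1": 1, "d2": 0},
    {"x": 1.2, "d1": 1, "d2": 0},
    {"x": 5.0, "d1": 0, "d2": 1},
    {"x": 5.3, "d1": 0, "d2": 1},
    {"x": 9.0, "d1": 0, "d2": 0},
    {"x": 1.1, "d1": None, "d2": None},
    {"x": None, "d1": 0, "d2": 1},
]
completed = impute_by_blocks(rows, blocks=[["x"], ["d1", "d2"]])
```

Because the two dummy variables form one block and are copied from the same donor, an imputed record cannot end up with both dummies equal to one, which could happen if each dummy were imputed separately.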

Likewise, it may be useful to define batches, groups of records that form logical entities. For example, batches could consist of different populations, time points, classes, and so on. Imputation models can be defined per batch, and iteration takes place over the batches. Javaras and Van Dyk (2003) proposed algorithms for blocks using joint modeling. Zhu (2016) discusses alternatives within an FCS context. The incorporation of blocks and batches will allow for tremendous flexibility in the specification of imputation models. Such techniques require a keen database administration strategy to ensure that the predictors needed at any point are completed.

12.4.3 Nested imputation

In some applications it can be useful to generate different numbers of imputations for different variables. Rubin (2003) described an application that used fewer imputations for variables that were expensive to impute. Alternatively, we may want to impute a dataset that has already been multiply imputed, for example, to impute some additional variables while preserving the original imputations. The technique of using different numbers of imputations is known as nested multiple imputation (Shen 2000). Nested multiple imputation also has potential applications for modeling different types of missing data (Harel 2009). Nested multiple imputation has theoretical advantages, but its added value in typical use cases is not yet well understood and deserves further study.
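To make the nested structure concrete, the sketch below pools a scalar estimate from M nests with N imputations each. It follows the combining rule usually given for nested multiple imputation, T = U + (1 + 1/M)B + (1 - 1/N)W, where U is the average complete-data variance, B the between-nest variance and W the within-nest variance. The function name `pool_nested` and the input layout are assumptions of this example, and the rule should be checked against Shen (2000) before use. Degrees of freedom are not computed.

```python
from statistics import mean


def pool_nested(estimates, variances):
    """Pool a scalar estimate under nested multiple imputation (sketch).

    estimates[m][n] : estimate from imputation n within nest m
    variances[m][n] : its complete-data variance
    Assumes M >= 2 nests, each with the same number N >= 2 of imputations.
    """
    M = len(estimates)
    N = len(estimates[0])
    nest_means = [mean(nest) for nest in estimates]
    q_bar = mean(nest_means)  # equals the overall mean because N is constant
    u_bar = mean([mean(nest) for nest in variances])
    # Between-nest variance of the nest means.
    b = sum((qm - q_bar) ** 2 for qm in nest_means) / (M - 1)
    # Within-nest variance, pooled over nests.
    w = sum(
        (q - qm) ** 2
        for nest, qm in zip(estimates, nest_means)
        for q in nest
    ) / (M * (N - 1))
    t = u_bar + (1 + 1 / M) * b + (1 - 1 / N) * w
    return q_bar, t


# Hypothetical results: two nests with two imputations each.
q_nested, t_nested = pool_nested(
    estimates=[[1.0, 3.0], [5.0, 7.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```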

12.4.4 Better trials with dynamic treatment regimes

Adaptive treatment designs follow patients over time, and can re-randomize them to alternate treatments conditional on previous outcomes. The design poses several challenges, and it is possible to address these in a coherent way by multiple imputation (Shortreed et al. 2014). As adaptive designs become more widely used, better methodology with dynamic treatment regimes could result in huge savings, and at the same time, make designs more ethical.

12.4.5 Distribution-free pooling rules

Rubin’s theory is based on a convenient summary of the sampling distribution by the mean and the variance. There seems to be no intrinsic limitation in multiple imputation that would prevent it from working for more elaborate summaries. Suppose that we summarize the distribution of the parameter estimates for each completed dataset by a dense set of quantiles. As before, there will be within- and between-variability as a result of the sampling and missing data mechanisms, respectively. The problem of how to combine these two types of distribution into the appropriate total distribution has not yet been solved. If we were able to construct the total distribution, this would permit precise distribution-free statistical inference from incomplete data.
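For reference, the mean-and-variance summary that a distribution-free rule would have to generalize can be written down directly. The sketch below implements Rubin's rules for a scalar parameter: the pooled estimate is the average of the m completed-data estimates, and the total variance is T = U + (1 + 1/m)B, with U the within-imputation and B the between-imputation variance. The function name `pool_rubin` and the example numbers are made up for illustration. The sketch does not attempt the quantile-based pooling discussed above, which remains an open problem.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates of a scalar with Rubin's rules.

    estimates : list of m point estimates, one per completed dataset
    variances : list of m squared standard errors of those estimates
    Requires m >= 2 so that the between-imputation variance is defined.
    """
    m = len(estimates)
    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # within-imputation variance
    b = variance(estimates)   # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b  # total variance
    return q_bar, u_bar, b, t


# Hypothetical estimates from m = 3 completed datasets.
q_bar, u_bar, b, t = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```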

12.4.6 Improved diagnostic techniques

The key problem in multiple imputation is how to generate good imputations. The ultimate criterion of a good imputation is that it is confidence proper with respect to the scientific parameters of interest. Diagnostic methods are intermediate tools to evaluate the plausibility of a set of imputations. Section 6.6 discussed several techniques, but these may be laborious for datasets involving many variables. It would be useful to have informative summary measures that can signal whether “something is wrong” with the imputed data. Multiple measures are likely to be needed, each of which is sensitive to a particular aspect of the data.

12.4.7 Building block in modular statistics

Multiple imputation requires a well-defined function of the population data, an adequate missing data mechanism, and an idea of the parameters that will be estimated from the imputed data. The technique is an attempt to separate the missing data problem from the complete-data problem, so that both can be addressed independently. This helps in simplifying statistical analyses that are otherwise difficult to optimize or interpret.

The modular nature of multiple imputation helps our understanding. Aided by the vast computational possibilities, statistical models are becoming more and more complex nowadays, up to the point that the models outgrow the capabilities of our minds. The modular approach to statistics starts from a series of smaller models, each dedicated to a particular task. The main intellectual task is to arrange these models in a sensible way, and to link up the steps to provide an overall solution. Compared to the one-big-model-for-everything approach, the modular strategy may sacrifice some optimality. On the other hand, the analytic results are easier to track and thus easier to understand, as we can inspect what happened after each step. And that is what matters in the end.