4.7 MICE extensions
The MICE algorithm listed in box 4.3 can be extended in several ways.
4.7.1 Skipping imputations and overimputation
By default, the MICE algorithm imputes the missing data and leaves the observed data untouched. In some cases it may also be useful to skip imputation of certain cells. For example, we may wish to skip imputation of quality of life for the deceased, or not impute customer satisfaction for people who did not buy the product. The primary difficulty with this option is that it creates missing data in the predictors, so the imputer should either remove the predictor from all imputation models or propagate the missing values through the algorithm. Another use case involves imputing cells with observed data, a technique called overimputation. For example, it may be useful to evaluate whether the observed data points fit the imputation model. If all is well, we expect each observed data point to lie near the center of its multiple imputations. The primary difficulty with this option is to ensure that only the observed data (and not the imputed data) are used as outcomes in the imputation model. Version 3.0 of mice includes the where argument, a matrix of logicals with the same dimensions as the data, that indicates where in the data the imputations should be created. This matrix specifies for each cell whether it should be imputed or not; the default is to impute the missing data.
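To illustrate the role of such a cell-level indicator, consider the following minimal Python sketch (a toy illustration, not mice code). A logical where matrix drives a single hot-deck imputation step; flagging an observed cell yields overimputation, and flagged cells are excluded from the donor pool, the sketch's analogue of keeping imputed values out of the imputation model:

```python
import random

def impute_where(data, where, rng):
    """Toy single imputation step driven by a cell-level `where` matrix.

    Only cells flagged in `where` are (re)imputed, by drawing from the
    column's observed, non-flagged values. Flagged cells are kept out of
    the donor pool, so that (over)imputed cells never act as donors.
    Illustrative only; not mice code."""
    out = [row[:] for row in data]
    ncol = len(data[0])
    for j in range(ncol):
        donors = [row[j] for row, flags in zip(data, where)
                  if row[j] is not None and not flags[j]]
        for i, flags in enumerate(where):
            if flags[j]:
                out[i][j] = rng.choice(donors)
    return out

rng = random.Random(0)
data = [[1.0, 2.0],
        [None, 4.0],
        [5.0, None],
        [7.0, 8.0]]
# Default behavior: impute exactly the missing cells.
where = [[v is None for v in row] for row in data]
# Overimputation: additionally re-impute an observed cell to check its fit.
where[0][0] = True
completed = impute_where(data, where, rng)
```

Comparing the multiple draws for the overimputed cell against its observed value 1.0 is the kind of model check the text describes.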
4.7.2 Blocks of variables, hybrid imputation
An important difference between JM and FCS is that JM imputes all variables at once, whereas FCS imputes each variable separately. JM and FCS are the extreme scenarios of a much wider range of hybrid imputation models. In actual data analysis, sets of variables are often connected in some way. Examples are:
A set of scale items and its total score;
A variable with one or more transformations;
Two variables with one or more interaction terms;
A block of normally distributed \(Z\)-scores;
Compositions that add up to a total;
A set of variables that are collected together.
Instead of specifying the steps for each variable separately, it is more user-friendly to impute these as a block. Version 3.0 of mice includes a new blocks argument that partitions the complete set of variables into blocks. All variables within the same block are jointly imputed, which provides a strategy to specify hybrids of JM and FCS. The joint models must be able to accept external covariates. One possibility is to use predictive mean matching to impute multivariate nonresponse, where the donor values for the variables within the block come from the same donor (Little 1988). The main algorithm in mice 3.0 iterates over the blocks rather than the variables. By default, each variable is its own block, which gives the familiar behavior.
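The same-donor idea can be sketched in a few lines of Python (a toy illustration under simplifying assumptions, not the mice implementation). The loop visits blocks of columns rather than single columns, and all missing entries of a block in a given row are filled from a single donor row:

```python
import random

def impute_blocks(data, blocks, rng):
    """Toy block-wise hot-deck imputation; not the mice implementation.

    The algorithm iterates over blocks of columns rather than single
    columns. All missing entries of a block in a given row are copied
    from one donor row that is fully observed on the block, so the
    imputed values stay jointly plausible (the same-donor idea)."""
    out = [row[:] for row in data]
    for block in blocks:
        donors = [row for row in data
                  if all(row[j] is not None for j in block)]
        for row in out:
            if any(row[j] is None for j in block):
                donor = rng.choice(donors)   # one donor for the whole block
                for j in block:
                    if row[j] is None:
                        row[j] = donor[j]
    return out

rng = random.Random(1)
data = [[1.0, 2.0, 10.0],
        [None, None, 20.0],
        [3.0, 4.0, None],
        [5.0, 6.0, 30.0]]
blocks = [[0, 1], [2]]   # columns 0 and 1 jointly; column 2 on its own
completed = impute_blocks(data, blocks, rng)
```

With blocks = [[0], [1], [2]] the loop reduces to the familiar variable-by-variable behavior.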
4.7.3 Blocks of units, monotone blocks
Another way to partition the data is to define blocks of units. One weakness of the algorithm in box 4.3 is that it may become unstable when many of the predictors are imputed. Zhu (2016) developed a solution called “Block sequential regression multivariate imputation”, where units are partitioned into blocks according to their missing data pattern. The imputation model for a given variable is modified for each block, such that only the data observed within the block can serve as predictors. The method generalizes the monotone block approach of Li et al. (2014).
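The partitioning step can be sketched in Python (an illustration of the bookkeeping only, not Zhu's algorithm). Rows are grouped by their missing-data pattern, and within each block the candidate predictors for a target variable are restricted to the variables observed under that pattern:

```python
from collections import defaultdict

def pattern_blocks(data):
    """Group row indices into blocks by missing-data pattern (a sketch of
    the partitioning step only, not Zhu's full algorithm)."""
    blocks = defaultdict(list)
    for i, row in enumerate(data):
        pattern = tuple(v is None for v in row)
        blocks[pattern].append(i)
    return dict(blocks)

def predictors_for(pattern, target):
    """Within a block, only the variables observed under that block's
    pattern (other than the target itself) may serve as predictors."""
    return [j for j, missing in enumerate(pattern) if not missing and j != target]

data = [[1.0, 2.0, 3.0],
        [None, 2.0, 3.0],
        [None, 2.0, None],
        [4.0, None, 6.0]]
blocks = pattern_blocks(data)
```

For the block with pattern (True, False, True), only variable 1 remains as a predictor for variable 0, which is exactly the per-block restriction that keeps the regressions free of imputed predictors.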
4.7.4 Tile imputation
The block-wise partitioning methods are complementary strategies for multivariate imputation. The methods in Section 4.7.2 partition the columns and apply one model to many outcomes, whereas the methods in Section 4.7.3 partition the rows and apply many models to one outcome. These operations can be freely combined into a whole new class of algorithms based on tiles, i.e., combinations of row and column partitions. This is a vast and as yet unexplored field. I expect that it will be possible to develop imputation algorithms that are user-friendly, stable and automatic. A major new application of such tile algorithms will be in the imputation of combined data. The problem of automatically detecting “optimal tiles” provides both enormous challenges and substantial pay-offs.
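As a minimal illustration of the bookkeeping only (no such algorithm exists yet), tiles can be formed by crossing a partition of the rows with a partition of the columns:

```python
from itertools import product

def make_tiles(row_blocks, col_blocks):
    """Cross a partition of the rows with a partition of the columns;
    each resulting tile could, in principle, receive its own imputation
    model. A sketch of the combinatorial idea only."""
    return [(rows, cols) for rows, cols in product(row_blocks, col_blocks)]

row_blocks = [[0, 1], [2, 3]]   # e.g., units grouped by missing-data pattern
col_blocks = [[0], [1, 2]]      # e.g., one single variable and one scale block
all_tiles = make_tiles(row_blocks, col_blocks)
```

The open problem described in the text is how to choose these partitions automatically, rather than how to enumerate the tiles.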