6.3 Model form and predictors
6.3.1 Model form
The MICE algorithm requires a specification of a univariate imputation method separately for each incomplete variable. Chapter 3 discussed many possible methods. The measurement level largely determines the form of the univariate imputation model. The mice()
function distinguishes numerical, binary, ordered and unordered categorical data, and sets the defaults accordingly.
Method | Description | Scale type |
---|---|---|
pmm | Predictive mean matching | Any\(^*\) |
midastouch | Weighted predictive mean matching | Any |
sample | Random sample from observed values | Any |
cart | Classification and regression trees | Any |
rf | Random forest imputation | Any |
mean | Unconditional mean imputation | Numeric |
norm | Bayesian linear regression | Numeric |
norm.boot | Normal imputation with bootstrap | Numeric |
norm.nob | Normal imputation ignoring model error | Numeric |
norm.predict | Normal imputation, predicted values | Numeric |
quadratic | Imputation of quadratic terms | Numeric |
ri | Random indicator for nonignorable data | Numeric |
logreg | Logistic regression | Binary\(^*\) |
logreg.boot | Logistic regression with bootstrap | Binary |
polr | Proportional odds model | Ordinal\(^*\) |
polyreg | Polytomous logistic regression | Nominal\(^*\) |
lda | Discriminant analysis | Nominal |

\(^*\) Default method for this measurement level.
Table 6.1 lists the built-in univariate imputation methods in the mice
package. The defaults have been chosen to work well in a wide variety of situations, but in particular cases different methods may be better. For example, if it is known that the variable is close to normally distributed, using norm
instead of the default pmm
may be more efficient. For large datasets where sampling variance is not an issue, it could be useful to select norm.nob
, which does not draw regression parameters, and is thus simpler and faster. The norm.boot
method is a fast non-Bayesian alternative for norm
. The norm
methods are an alternative to pmm
in cases where pmm
does not work well, e.g., when insufficient nearby donors can be found.
The mean
method is included for completeness and should generally not be used. For sparse categorical data, it may be better to use method pmm
instead of logreg
, polr
or polyreg
. Method logreg.boot
is a version of logreg
that uses the bootstrap to emulate sampling variance. Method lda
is generally inferior to polyreg
(Brand 1999), and should be used only as a backup when all else fails. Finally, sample
is a quick method for creating starting imputations without the need for covariates.
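For example, a minimal sketch of overriding the defaults on the nhanes data that ships with mice (the method choices for bmi and chl are purely illustrative, and make.method() assumes a recent version of mice):
library(mice)
meth <- make.method(nhanes)   # default method per variable
meth["bmi"] <- "norm"         # bmi assumed close to normally distributed
meth["chl"] <- "norm.nob"     # faster, ignores parameter uncertainty
imp <- mice(nhanes, method = meth, print = FALSE, seed = 1)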
6.3.2 Predictors
A useful feature of the mice()
function is the ability to specify the set of predictors to be used for each incomplete variable. The basic specification is made through the predictorMatrix
argument, which is a square matrix of size ncol(data)
containing 0/1 data. Each row in predictorMatrix
identifies which predictors are to be used for the variable in the row name. If diagnostics = TRUE
(the default), then mice()
returns a mids
object containing a predictorMatrix
entry. For example, type
imp <- mice(nhanes, print = FALSE)
imp$predictorMatrix
age bmi hyp chl
age 0 1 1 1
bmi 1 0 1 1
hyp 1 1 0 1
chl 1 1 1 0
The rows correspond to incomplete target variables, in the order in which they appear in the data. A value of 1
indicates that the column variable is a predictor to impute the target (row) variable, and a 0
means that it is not used. Thus, in the above example, bmi
is predicted from age
, hyp
and chl
. Note that the diagonal is 0
since a variable cannot predict itself. Since age
contains no missing data, mice()
silently sets all values in the row to 0
. The default setting of the predictorMatrix
specifies that every variable predicts all others.
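For instance, a minimal sketch, continuing the example above (dropping chl as a predictor of bmi is purely illustrative):
pred <- imp$predictorMatrix
pred["bmi", "chl"] <- 0   # do not use chl when imputing bmi
imp2 <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 1)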
Conditioning on all other data is often reasonable for small to medium datasets containing up to, say, 20–30 variables, without derived variables, interaction effects and other complexities. As a general rule, using every bit of available information yields multiple imputations that have minimal bias and maximal efficiency (Meng 1994; Collins, Schafer, and Kam 2001), so it is often beneficial to include as many predictors as possible. Including many predictors also tends to make the MAR assumption more plausible, thus reducing the need to make special adjustments for MNAR mechanisms (Schafer 1997).
For datasets containing hundreds or thousands of variables, it is not feasible (because of multicollinearity and computational problems) to include all of them as predictors. It is also not necessary. In my experience, the increase in explained variance in linear regression is typically negligible after the best, say, 15 variables have been included. For imputation purposes, it is expedient to select a suitable subset of data that contains no more than 15 to 25 variables. The following strategy can be used for selecting predictor variables from a large database (a code sketch follows the list):
1. Include all variables that appear in the complete-data model, i.e., the model that will be applied to the data after imputation, including the outcome (Little 1992; Moons et al. 2006). Failure to do so may bias the complete-data model, especially if the complete-data model contains strong predictive relations. Note that this step is somewhat counter-intuitive, as it may seem that imputation would artificially strengthen the relations of the complete-data model, which would be clearly undesirable. If done properly, however, this is not the case. On the contrary, not including the complete-data model variables will tend to bias the results toward zero. Note that interactions of scientific interest also need to be included in the imputation model.
2. In addition, include the variables that are related to the nonresponse. Factors that are known to have influenced the occurrence of missing data (stratification, reasons for nonresponse) are to be included on substantive grounds. Other variables of interest are those for which the distributions differ between the response and nonresponse groups. These can be found by inspecting their correlations with the response indicator of the variable to be imputed. If the magnitude of this correlation exceeds a certain level, then the variable should be included.
3. In addition, include variables that explain a considerable amount of variance. Such predictors help reduce the uncertainty of the imputations. They are identified by their correlation with the target variable.
4. Remove from the variables selected in steps 2 and 3 those variables that have too many missing values within the subgroup of incomplete cases. A simple indicator is the percentage of observed cases within this subgroup, the percentage of usable cases (cf. Section 4.1.2).
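The quickpred() function, discussed further below, can approximate this strategy automatically; the thresholds used here are illustrative rather than recommended values:
pred <- quickpred(nhanes, mincor = 0.3, minpuc = 0.25, include = "age")
imp <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 1)
Here mincor sets the minimum correlation of a predictor with the target or its response indicator (steps 2 and 3), minpuc sets the minimum proportion of usable cases (step 4), and include forces variables into the model regardless of these criteria (step 1).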
Most predictors used for imputation are incomplete themselves. In principle, one could apply the above modeling steps for each incomplete predictor in turn, but this may lead to a cascade of auxiliary imputation problems. In doing so, one runs the risk that every variable needs to be included after all.
In practice, there is often a small set of key variables, for which imputations are needed, which suggests that steps 1 through 4 are to be performed for key variables only. This was the approach taken in Van Buuren and Groothuis-Oudshoorn (1999), but it may miss important predictors of predictors. A safer and more efficient, though more laborious, strategy is to perform the modeling steps also for the predictors of predictors of key variables. This is done in Groothuis-Oudshoorn, Van Buuren, and Van Rijckevorsel (1999). I expect that it is rarely necessary to go beyond predictors of predictors. At the terminal node, we can apply a simple method like sample
that does not need any predictors for itself.
The mice
package contains several tools that aid in automatic predictor selection. The quickpred()
function is a quick way to define the predictor matrix using the strategy outlined above. The flux()
function was introduced in Section 4.1.3. The mice()
function detects multicollinearity and solves the problem by removing one or more predictors from the model. Each removal is noted in the loggedEvents
element of the mids
object. For example,
imp <- mice(cbind(nhanes, chl2 = 2 * nhanes$chl),
print = FALSE, maxit = 1, m = 3, seed = 1)
Warning: Number of logged events: 1
imp$loggedEvents
it im dep meth out
1 0 0 collinear chl2
yields a warning that informs us that at initialization variable chl2
was removed from the imputation model because it is collinear with chl
. As a result, chl
will be imputed, but chl2
will not be. We can override the removal by
imp <- mice(cbind(nhanes, chl2 = 2 * nhanes$chl),
print = FALSE, maxit = 1, m = 3, seed = 1,
remove.collinear = FALSE)
Warning: Number of logged events: 3
imp$loggedEvents
it im dep meth out
1 1 1 chl2 pmm chl
2 1 2 chl2 pmm chl
3 1 3 chl2 pmm chl
Now, the algorithm detects multicollinearity during iterations, and removes chl
from the imputation model for chl2
. Although this imputes both chl
and chl2
, their relation is not maintained. See Figure 6.1.
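A quick check, sketched here only to illustrate the point, is to compare the two columns in a completed dataset:
com <- complete(imp, 1)
all(com$chl2 == 2 * com$chl)   # typically FALSE: imputed chl2 is not tied to imputed chl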
As a general rule, feedback between different versions of the same variable should be prevented. The next section describes a number of techniques that are useful in various situations. Another measure to control the algorithm is the ridge
parameter, denoted by \(\kappa\) in Algorithms 3.1, 3.2 and 3.3. The ridge
parameter is specified as an argument to mice()
. Setting ridge=0.001
or ridge=0.01
makes the algorithm more robust at the expense of bias.
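For instance, a minimal sketch, assuming the ridge value is passed to mice() as described:
imp <- mice(nhanes, ridge = 0.001, print = FALSE, seed = 1)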