## 2.5 How to evaluate imputation methods

### 2.5.1 Simulation designs and performance measures

The advantageous properties of multiple imputation are only guaranteed if the imputation method used to fill in the missing data is proper. Equations (2.21)–(2.23) describe the conditions needed for proper imputation.

Checking the validity of statistical procedures is often done by simulation. There are generally two mechanisms that influence the observed data: the sampling mechanism and the missing data mechanism. Simulation can address the sampling mechanism separately, the missing data mechanism separately, or both mechanisms combined. This leads to three general simulation designs.

1. Sampling mechanism only. The basic simulation steps are: choose $$Q$$, take samples $$Y^{(s)}$$, fit the complete-data model, estimate $$\hat Q^{(s)}$$ and $$U^{(s)}$$ and calculate the outcomes aggregated over $$s$$.

2. Sampling and missing data mechanisms combined. The basic simulation steps are: choose $$Q$$, take samples $$Y^{(s)}$$, generate incomplete data $$Y_{\mathrm{obs}}^{(s,t)}$$, impute, estimate $$\hat Q^{(s,t)}$$ and $$T^{(s,t)}$$ and calculate outcomes aggregated over $$s$$ and $$t$$.

3. Missing data mechanism only. The basic simulation steps are: choose $$(\hat Q, U)$$, generate incomplete data $$Y_{\mathrm{obs}}^{(t)}$$, impute, estimate $$(\bar Q, \bar U)^{(t)}$$ and $$B^{(t)}$$ and calculate outcomes aggregated over $$t$$.
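Design 1 can be sketched in a few lines of R. The sketch below is illustrative only: it assumes that $$Q$$ is the slope $$\beta = 1$$ of the simple linear model used later in Section 2.5.3, with a sample size of 50 and 1000 replications.

```r
# Sketch of design 1 (sampling mechanism only): choose Q, draw samples,
# fit the complete-data model and aggregate over s. No data are deleted.
runs <- 1000
est <- numeric(runs)
for (s in 1:runs) {
  set.seed(s)
  x <- rnorm(50)
  y <- x + rnorm(50)            # alpha = 0, beta = 1, sigma2 = 1
  est[s] <- coef(lm(y ~ x))["x"]
}
mean(est) - 1                   # raw bias of the complete-data estimate
```

Because no missing data are involved, any bias found here would point to a problem with the complete-data model itself; with this model the raw bias is near zero.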

A popular procedure for testing missing-data applications is design 2 with settings $$s=1,\dots,1000$$ and $$t = 1$$. As this design does not separate the two mechanisms, any problems found may result from both the sampling and the missing-data mechanism. Design 1 does not address the missing data, and is primarily of interest to study whether any problems are attributable to the complete-data model. Design 3 addresses the missing-data mechanism only, and thus allows for a more detailed assessment of any problem caused by the imputation step. An advantage of this procedure is that no population model is needed. Brand et al. (2003) describe this procedure in more detail.

If we are primarily interested in determining the quality of imputation methods, we may simplify evaluation by defining the sample equal to the population, and set the within-variance $$\bar U = 0$$ in Equation (2.20). See Vink and Van Buuren (2014) for a short exploration of the idea.

### 2.5.2 Evaluation criteria

The goal of multiple imputation is to obtain statistically valid inferences from incomplete data. The quality of the imputation method should thus be evaluated with respect to this goal. There are several measures that may inform us about the statistical validity of a particular procedure. These are:

1. Raw bias (RB) and percent bias (PB). The raw bias of the estimate $$\bar Q$$ is defined as the difference between the expected value of the estimate and truth: $$\mathrm{RB} = \mathrm{E}(\bar Q) - Q$$. RB should be close to zero. Bias can also be expressed as percent bias: $$\mathrm{PB} = 100 \times \left|(\mathrm{E}(\bar Q) - Q) / Q\right|$$. For acceptable performance we use an upper limit for PB of 5%. (Demirtas, Freels, and Yucel 2008)

2. Coverage rate (CR). The coverage rate (CR) is the proportion of confidence intervals that contain the true value. The actual rate should be equal to or exceed the nominal rate. If CR falls below the nominal rate, the method is too optimistic, leading to false positives. A CR below 90 percent for a nominal 95 percent interval indicates poor quality. A high CR (e.g., 0.99) may indicate that the confidence interval is too wide, so the method is inefficient and leads to inferences that are too conservative. Inferences that are “too conservative” are generally regarded as a lesser sin than those that are “too optimistic”.

3. Average width (AW). The average width of the confidence interval is an indicator of statistical efficiency. The width should be as small as possible, but not so small that the CR will fall below the nominal level.

4. Root mean squared error (RMSE). The $$\mathrm{RMSE} = \sqrt{\mathrm{E}\left((\bar Q - Q)^2\right)}$$ is a compromise between bias and variance, and evaluates $$\bar Q$$ on both accuracy and precision.

If all is well, then RB should be close to zero, and the coverage should be near 0.95. Methods having no bias and proper coverage are called randomization-valid (Rubin 1987b). If two methods are both randomization-valid, the method with the shorter confidence intervals is more efficient. While the RMSE is widely used, we will see in Section 2.6 that it is not a suitable metric to evaluate multiple imputation methods.

### 2.5.3 Example

This section demonstrates how the measures defined in Section 2.5.2 can be calculated using simulation. The process starts by specifying a model that is of scientific interest and that fixes $$Q$$. Pseudo-observations according to the model are generated, and part of these observations is deleted, resulting in an incomplete dataset. The missing values are then filled in by the imputation procedure under study, and Rubin’s rules are applied to calculate the estimates $$\bar Q$$ and $$T$$. The whole process is repeated a large number of times, say 1000 runs, each starting from a different random seed.

For the sake of simplicity, suppose scientific interest focuses on determining $$\beta$$ in the linear model $$y_i = \alpha + x_i \beta + \epsilon_i$$. Here $$\epsilon_i \sim N(0, \sigma^2)$$ are random errors uncorrelated with $$x$$. Suppose that the true values are $$\alpha = 0$$, $$\beta = 1$$ and $$\sigma^2 = 1$$. We have 50% random missing data in $$x$$, and compare two imputation methods: regression imputation (cf. Section 1.3.4) and stochastic regression imputation (cf. Section 1.3.5).

It is convenient to create a series of small R functions. The create.data() function randomly draws artificial data from the specified linear model.

```r
create.data <- function(beta = 1, sigma2 = 1, n = 50,
                        run = 1) {
  set.seed(seed = run)
  x <- rnorm(n)
  y <- beta * x + rnorm(n, sd = sqrt(sigma2))
  cbind(x = x, y = y)
}
```
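As a quick sanity check (illustrative, not part of the simulation; it assumes create.data() above has been defined), the slope recovered from a large generated sample should be close to the true $$\beta = 1$$:

```r
# Fit the complete-data model to a large sample from the generator.
dat <- as.data.frame(create.data(n = 1e5, run = 1))
coef(lm(y ~ x, data = dat))["x"]   # close to the true beta = 1
```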

Next, we remove some data in order to make the data incomplete. Here we use a simple random missing data mechanism (MCAR) to generate approximately 50% missing values.

```r
make.missing <- function(data, p = 0.5) {
  rx <- rbinom(nrow(data), 1, p)
  data[rx == 0, "x"] <- NA
  data
}
```
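A similar check (a sketch, assuming both helper functions above are defined) verifies that the realized proportion of missing values in $$x$$ is close to the intended $$p = 0.5$$:

```r
# Apply the MCAR mechanism to a larger sample and count the NAs.
dat <- make.missing(create.data(n = 1e4))
mean(is.na(dat[, "x"]))   # approximately 0.5 under MCAR with p = 0.5
```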

We then define a small test function that calls mice() and applies Rubin’s rules to the imputed data.

```r
library(mice)   # provides mice(), with() and pool()

test.impute <- function(data, m = 5, method = "norm", ...) {
  imp <- mice(data, method = method, m = m, print = FALSE, ...)
  fit <- with(imp, lm(y ~ x))
  tab <- summary(pool(fit), "all", conf.int = TRUE)
  as.numeric(tab["x", c("estimate", "2.5 %", "97.5 %")])
}
```
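Before launching the full simulation it can be helpful to inspect a single run. This usage sketch assumes that the mice package is installed and that the helper functions above are in the workspace:

```r
library(mice)
# One run: generate, delete, impute and pool.
dat <- make.missing(create.data(run = 1))
est <- test.impute(dat, method = "norm.nob")
est   # estimate, lower and upper 95% bound for beta
```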

The following function puts everything together:

```r
simulate <- function(runs = 10) {
  res <- array(NA, dim = c(2, runs, 3))
  dimnames(res) <- list(c("norm.predict", "norm.nob"),
                        as.character(1:runs),
                        c("estimate", "2.5 %", "97.5 %"))
  for (run in 1:runs) {
    data <- create.data(run = run)
    data <- make.missing(data)
    res[1, run, ] <- test.impute(data, method = "norm.predict",
                                 m = 2)
    res[2, run, ] <- test.impute(data, method = "norm.nob")
  }
  res
}
```

Performing 1000 simulations is now done by calling simulate():

```r
res <- simulate(1000)
```

The means of the estimate and of the lower and upper bounds of the confidence interval per method can be obtained by

```r
apply(res, c(1, 3), mean, na.rm = TRUE)
             estimate 2.5 % 97.5 %
norm.predict    1.343 1.065   1.62
norm.nob        0.995 0.648   1.34
```

The following code calculates the quality statistics defined in Section 2.5.2.

```r
true <- 1
RB <- rowMeans(res[, , "estimate"]) - true
PB <- 100 * abs((rowMeans(res[, , "estimate"]) - true) / true)
CR <- rowMeans(res[, , "2.5 %"] < true & true < res[, , "97.5 %"])
AW <- rowMeans(res[, , "97.5 %"] - res[, , "2.5 %"])
RMSE <- sqrt(rowMeans((res[, , "estimate"] - true)^2))
data.frame(RB, PB, CR, AW, RMSE)
                 RB   PB    CR    AW  RMSE
norm.predict  0.343 34.3 0.364 0.555 0.409
norm.nob     -0.005  0.5 0.925 0.693 0.201
```

The interpretation of the results is as follows. Regression imputation by method norm.predict produces severely biased estimates of $$\beta$$. The true $$\beta$$ is 1, but the average estimate after regression imputation is 1.343. Moreover, the true value is located within the confidence interval in only 36% of the cases, far below the nominal value of 95%. Hence, regression imputation is not randomization-valid for $$\beta$$, even under MCAR. Because of this, the estimates for AW and RMSE are not relevant. This example shows that statistical inference on incomplete data that were imputed by regression imputation can produce the wrong answer.

The story for stochastic regression imputation is different. The norm.nob method is essentially unbiased and has a coverage of 92.5%. The method is not randomization-valid, but it is close. The AW and RMSE serve as useful indicators, though both may be slightly low because the confidence intervals appear somewhat short. Chapter 3 shows how to make the method fully randomization-valid.

Of course, simulation studies do not guarantee fitness for a particular application. However, if simulation studies reveal limitations in simple examples, we may expect that these limitations will also be present in more complex applications.