2.6 Imputation is not prediction

In the world of simulation we have access to both the true and imputed values, so an obvious way to quantify the quality of a method is to see how well it can recreate the true data. The method that best recovers the true data “wins.” An early paper developing this idea is Gleason and Staelin (1975), but the literature is full of examples. The approach is simple and appealing. But will it also select the best imputation method?

The answer is “No”. Suppose we measure the discrepancy by the RMSE of the imputed values:

\[\begin{equation} \mathrm{RMSE} = \sqrt{\frac{1}{n_\mathrm{mis}}\sum_{i=1}^{n_\mathrm{mis}} \left(y_i^\mathrm{mis} - \dot y_i\right)^2} \tag{2.37} \end{equation}\]

where \(y_i^\mathrm{mis}\) denotes the true (removed) data value for unit \(i\) and \(\dot y_i\) the imputed value for unit \(i\). For multiply imputed data we calculate the MSE within each of the \(m\) imputed datasets, average these, and take the square root.
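Written out, the pooled quantity computed below is

\[\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{k=1}^{m} \mathrm{MSE}_k}, \qquad \mathrm{MSE}_k = \frac{1}{n_\mathrm{mis}}\sum_{i=1}^{n_\mathrm{mis}} \left(y_i^\mathrm{mis} - \dot y_i^{(k)}\right)^2,\]

where \(\dot y_i^{(k)}\) is the imputed value for unit \(i\) in the \(k\)-th completed dataset. This matches the rmse() function defined below.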

It is well known that the minimum RMSE is attained when each \(\dot y_i\) is set to the prediction from the linear model with the regression weights at their least squares estimates. According to this reasoning, the “best” method replaces each missing value by its most likely value under the model. However, such a deterministic rule produces the same imputed values over and over, so it is single imputation. It ignores the inherent uncertainty of the missing values (and acts as if they were known after all), resulting in biased estimates and invalid statistical inferences. Hence, the method yielding the lowest RMSE is bad for imputation. More generally, measures based on the similarity between the true and imputed values do not separate valid from invalid imputation methods.
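The loss of uncertainty is easy to see directly. The following sketch (an illustration only, reusing create.data() and make.missing() from the previous section and assuming variables x and y as there) compares the spread of the deterministic imputations with the spread of the true removed values. Since norm.predict places every imputation exactly on the regression line, the imputed values vary too little.

library(mice)
truedata <- create.data(run = 1)
data <- make.missing(truedata)
mx <- is.na(data$x)               # which entries of x were made missing
imp <- mice(data, method = "norm.predict", m = 1, print = FALSE)
sd(mice::complete(imp, 1)$x[mx])  # spread of the imputed values: too small
sd(truedata$x[mx])                # spread of the true removed values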

Let us check this claim with a short simulation. The rmse() function below calculates the RMSE from the true and the multiply imputed data for the missing entries in variable x.

rmse <- function(truedata, imp, v = "x") {
  # logical indicator of the originally missing entries in variable v
  mx <- is.na(mice::complete(imp, 0))[, v]
  mse <- rep(NA, imp$m)
  for (k in seq_len(imp$m)) {
    # imputed and true values for the missing entries in dataset k
    filled <- mice::complete(imp, k)[mx, v]
    true <- truedata[mx, v]
    mse[k] <- mean((filled - true)^2)
  }
  # average the per-dataset MSE over the m imputations, then take the root
  sqrt(mean(mse))
}

The simulate2() function creates the same missing data as before, imputes them by norm.predict and by norm.nob, and collects the RMSE for both methods.

simulate2 <- function(runs = 10) {
  res <- array(NA, dim = c(2, runs, 1))
  dimnames(res) <- list(c("norm.predict", "norm.nob"),
                        as.character(1:runs),
                        "RMSE")
  for (run in seq_len(runs)) {
    truedata <- create.data(run = run)
    data <- make.missing(truedata)
    # deterministic regression imputation (single imputation)
    imp <- mice(data, method = "norm.predict", m = 1,
                print = FALSE)
    res[1, run, ] <- rmse(truedata, imp)
    # stochastic regression imputation (default m = 5)
    imp <- mice(data, method = "norm.nob", print = FALSE)
    res[2, run, ] <- rmse(truedata, imp)
  }
  res
}
res2 <- simulate2(1000)
apply(res2, c(1, 3), mean, na.rm = TRUE)
              RMSE
norm.predict 0.725
norm.nob     1.025

The simulation confirms that regression imputation is better at recreating the missing data. Remember from Section 1.3.4, however, that regression imputation is fundamentally flawed: its estimate of \(\beta\) is biased (even under MCAR) and the accompanying confidence interval is too short.
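A single run suffices to see the flip side. The sketch below (assuming, as in the previous section, that create.data() generates variables x and y with true slope \(\beta = 1\)) fits the regression of interest on data imputed by both methods. Repeating this over many runs, as before, exposes the bias and the too-short intervals of norm.predict despite its lower RMSE.

truedata <- create.data(run = 1)
data <- make.missing(truedata)
imp1 <- mice(data, method = "norm.predict", m = 1, print = FALSE)
coef(lm(y ~ x, data = mice::complete(imp1, 1)))  # single completed dataset
imp2 <- mice(data, method = "norm.nob", print = FALSE)
summary(pool(with(imp2, lm(y ~ x))))             # pooled over m imputations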

The example demonstrates that the RMSE is not informative for evaluating imputation methods. Assessing the discrepancy between the true and the imputed data may seem a simple and attractive way to select the best imputation method, but it is not useful to evaluate methods solely on their ability to recreate the true data. On the contrary, selecting methods in this way may be harmful, as such methods can increase the rate of false positives. Imputation is not prediction.