Multiple Imputation (using the package mice)

For this practical we will use data from the package mice:

library(mice)

The dataset nhanes contains 25 observations on the following 4 variables:

In R the dataset looks as follows:

nhanes
##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

Complete-case analysis

When we would model without taking the missing values into account, we will get the following model:

model <- lm(chl ~ bmi + age, data = nhanes)
summary(model)
## 
## Call:
## lm(formula = chl ~ bmi + age, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.187 -19.517  -0.310   6.915  60.606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -80.194     58.772  -1.364 0.202327    
## bmi            6.884      1.846   3.730 0.003913 ** 
## age           53.069     11.293   4.699 0.000842 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.67 on 10 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.6781 
## F-statistic: 13.64 on 2 and 10 DF,  p-value: 0.001388

Note that almost half of the cases were not used in the analysis.

Missing data

With multiple imputation we want to provide plausible values for the missing values, while taking the uncertainty about these numbers into account. Hence, we will first inspect the missing data pattern:

md.pattern(nhanes)
##    age hyp bmi chl   
## 13   1   1   1   1  0
##  1   1   1   0   1  1
##  3   1   1   1   0  1
##  1   1   0   0   1  2
##  7   1   0   0   0  3
##      0   8   9  10 27

Thus, for 13 subjects we have all variables. Moreover, for none of the subjects the variable age is missing. On the other hand, for 7 subjects we only have the age.

One useful feature of the mice package is the ability to specify which predictors can be used for each incomplete variable.

imp <- mice(nhanes, print = FALSE)
imp$predictorMatrix
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

The rows identify which predictors can be used for the variable in the row name. Hence, to impute the variable bmi we can use the variables age, hyp, and chl. Note, that the diagonal is equal to zero, because a variable cannot predict itself. Moreover, there were no missing values for age, hence we do not need to predict its missing values and its row contains only zeroes.

Multiply impute the data

Now, we can multiply impute the missing values in our dataset. It is useful to plot the parameters against the number of iterations to check for convergence. On convergence, the different streams should be freely intermingled with one another, without showing any definite trends.

imp <- mice(nhanes, print = FALSE, maxit = 10, seed = 24415) #10 iterations
plot(imp) #inspect the trace lines for convergence

Analysis of imputed data

It is important to note that taking the average of the imputed datasets and analyze the averaged data is not the way to proceed. Doing this will yield incorrect standard errors, confidence intervals and p-values because it ignores the between-imputation variability. In other words, it does not take the uncertainty about the imputed variables into account.

The appropriate way to analyze multiply imputed data is to perform complete data analysis on each imputed dataset seperately. In the mice package we can use the with() command for this purpose. For example, we fit a regression model to each dataset and print out the estimate from the first and second completed datasets by:

fit <- with(imp, lm(chl ~ bmi + age))
coef(fit$analyses[[1]])
## (Intercept)         bmi         age 
##  -49.037929    6.656636   36.061794
coef(fit$analyses[[2]])
## (Intercept)         bmi         age 
##  -89.914211    7.318115   49.178204

Note, that the estimates for bmi and age are different from each other in the two completed datasets. This is due to the uncertainty created by the missing data. We can now apply the standard pooling rules by doing the following. In this way we get the final coefficient estimates for the model using imputed data:

est <- pool(fit)
summary(est)
##                   est        se         t       df   Pr(>|t|)        lo 95
## (Intercept) -29.54833 79.471793 -0.371809 6.421464 0.72199780 -220.9564841
## bmi           5.83619  2.421364  2.410290 6.998748 0.04676021    0.1103667
## age          37.34718 12.185849  3.064799 7.325582 0.01721003    8.7898172
##                 hi 95 nmis       fmi    lambda
## (Intercept) 161.85983   NA 0.5961291 0.4872905
## bmi          11.56201    9 0.5649691 0.4561944
## age          65.90454    0 0.5482141 0.4396845

Comparison to complete-case analysis

The estimated model ignoring the missing values (complete-case analysis) was given by:

summary(model)
## 
## Call:
## lm(formula = chl ~ bmi + age, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.187 -19.517  -0.310   6.915  60.606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -80.194     58.772  -1.364 0.202327    
## bmi            6.884      1.846   3.730 0.003913 ** 
## age           53.069     11.293   4.699 0.000842 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.67 on 10 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.6781 
## F-statistic: 13.64 on 2 and 10 DF,  p-value: 0.001388

When we compare this multiply imputed model model with complete-case analysis, we see that the coefficient estimates are quite different. The estimates for bmi and age are significant in both models. The standard errors of the coefficient estimates of complete-analysis are smaller here than the standard errors of the model were the missing values were imputed. This is not always the case. Because the multiply imputed model is based on 25 observations rather than 13, it could also have been the other way around.

In this case we assumed that the parameter estimates are normally distributed around the population value. Many types of estimates are approximately normally distributed: e.g., means, standard deviations, regression coefficients, proportions and linear predictors.