For this practical we will use data from the package mice:
library(mice)
The dataset nhanes contains 25 observations on the following 4 variables: age (age group: 1 = 20-39, 2 = 40-59, 3 = 60+ years), bmi (body mass index, kg/m2), hyp (hypertensive: 1 = no, 2 = yes), and chl (total serum cholesterol, mg/dL).
In R, the dataset looks as follows:
nhanes
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
If we fit a model without taking the missing values into account, we obtain the following result:
model <- lm(chl ~ bmi + age, data = nhanes)
summary(model)
##
## Call:
## lm(formula = chl ~ bmi + age, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.187 -19.517 -0.310 6.915 60.606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.194 58.772 -1.364 0.202327
## bmi 6.884 1.846 3.730 0.003913 **
## age 53.069 11.293 4.699 0.000842 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.67 on 10 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.6781
## F-statistic: 13.64 on 2 and 10 DF, p-value: 0.001388
Note that almost half of the cases (12 of 25) were not used in the analysis.
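As a quick sanity check, we can count the incomplete rows directly with base R; this should match the 12 observations that lm() reported as deleted:
sum(!complete.cases(nhanes)) # number of rows with at least one missing value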
With multiple imputation we want to provide plausible values for the missing data, while taking the uncertainty about these values into account. Hence, we first inspect the missing data pattern:
md.pattern(nhanes)
## age hyp bmi chl
## 13 1 1 1 1 0
## 1 1 1 0 1 1
## 3 1 1 1 0 1
## 1 1 0 0 1 2
## 7 1 0 0 0 3
## 0 8 9 10 27
Thus, for 13 subjects all variables are observed. Moreover, the variable age is not missing for any subject. On the other hand, for 7 subjects only age is available.
One useful feature of the mice package is the ability to specify which predictors can be used for each incomplete variable.
imp <- mice(nhanes, print = FALSE)
imp$predictorMatrix
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Each row identifies which predictors can be used to impute the variable in the row name. Hence, to impute the variable bmi we can use the variables age, hyp, and chl. Note that the diagonal is equal to zero, because a variable cannot predict itself. Moreover, there were no missing values for age, hence we do not need to predict its missing values and its row contains only zeroes.
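We can also adapt this matrix before imputing. As a minimal sketch (purely illustrative; the default matrix is usually a sensible starting point), suppose we did not want to use chl when imputing bmi:
pred <- imp$predictorMatrix
pred["bmi", "chl"] <- 0 # do not use chl as a predictor for bmi
imp2 <- mice(nhanes, predictorMatrix = pred, print = FALSE)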
Now, we can multiply impute the missing values in our dataset. It is useful to plot the parameters against the number of iterations to check for convergence. On convergence, the different streams should be freely intermingled with one another, without showing any definite trends.
imp <- mice(nhanes, print = FALSE, maxit = 10, seed = 24415) #10 iterations
plot(imp) #inspect the trace lines for convergence
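If the trace lines were still to show a trend, we would not need to start over: mice.mids() continues the existing chains for extra iterations. A minimal sketch (imp_more is a hypothetical name):
imp_more <- mice.mids(imp, maxit = 10, print = FALSE) # run 10 further iterations
plot(imp_more) # re-inspect the extended trace lines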
It is important to note that taking the average of the imputed datasets and analyzing the averaged data is not the way to proceed. Doing so yields incorrect standard errors, confidence intervals, and p-values, because it ignores the between-imputation variability. In other words, it does not take the uncertainty about the imputed values into account.
The appropriate way to analyze multiply imputed data is to perform the complete-data analysis on each imputed dataset separately. In the mice package we can use the with() command for this purpose. For example, we fit a regression model to each dataset and print the estimates from the first and second completed datasets:
fit <- with(imp, lm(chl ~ bmi + age))
coef(fit$analyses[[1]])
## (Intercept) bmi age
## -49.037929 6.656636 36.061794
coef(fit$analyses[[2]])
## (Intercept) bmi age
## -89.914211 7.318115 49.178204
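Each of these analyses is simply the model fitted to one completed dataset; we can retrieve such a dataset ourselves with complete() and verify this:
coef(lm(chl ~ bmi + age, data = complete(imp, 1))) # same as fit$analyses[[1]]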
Note that the estimates for bmi and age differ between the two completed datasets. This reflects the uncertainty created by the missing data. We can now combine the results using the standard pooling rules (Rubin's rules) to obtain the final coefficient estimates for the model based on the imputed data:
est <- pool(fit)
summary(est)
## est se t df Pr(>|t|) lo 95
## (Intercept) -29.54833 79.471793 -0.371809 6.421464 0.72199780 -220.9564841
## bmi 5.83619 2.421364 2.410290 6.998748 0.04676021 0.1103667
## age 37.34718 12.185849 3.064799 7.325582 0.01721003 8.7898172
## hi 95 nmis fmi lambda
## (Intercept) 161.85983 NA 0.5961291 0.4872905
## bmi 11.56201 9 0.5649691 0.4561944
## age 65.90454 0 0.5482141 0.4396845
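Besides the coefficients, other quantities can be pooled as well; for example, mice offers pool.r.squared() to combine the R-squared values across the repeated analyses:
pool.r.squared(fit) # pooled R-squared of the multiply imputed model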
For comparison, the estimated model that ignored the missing values (complete-case analysis) was:
summary(model)
##
## Call:
## lm(formula = chl ~ bmi + age, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.187 -19.517 -0.310 6.915 60.606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.194 58.772 -1.364 0.202327
## bmi 6.884 1.846 3.730 0.003913 **
## age 53.069 11.293 4.699 0.000842 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.67 on 10 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.6781
## F-statistic: 13.64 on 2 and 10 DF, p-value: 0.001388
When we compare the multiply imputed model with the complete-case analysis, we see that the coefficient estimates are quite different. The estimates for bmi and age are significant in both models. Here, the standard errors of the coefficient estimates from the complete-case analysis are smaller than those from the model where the missing values were imputed, but this is not always the case. Because the multiply imputed model is based on 25 observations rather than 13, it could also have been the other way around.
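To see this at a glance, we can put the standard errors side by side (a small sketch, assuming the matrix-style summary() output shown above, which has an se column for the pooled results):
cbind(complete_case = coef(summary(model))[, "Std. Error"],
      pooled        = summary(est)[, "se"])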
In this case we assumed that the parameter estimates are normally distributed around the population value. Many types of estimates are approximately normally distributed, e.g., means, standard deviations, regression coefficients, proportions, and linear predictors.