mice
mice
)For this practical we will use data from the package mice
:
library(mice)
The dataset nhanes
contains 25 observations on the following 4 variables:
In R
the dataset looks as follows:
nhanes
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
When we would model without taking the missing values into account, we will get the following model:
model <- lm(chl ~ bmi + age, data = nhanes)
summary(model)
##
## Call:
## lm(formula = chl ~ bmi + age, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.187 -19.517 -0.310 6.915 60.606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.194 58.772 -1.364 0.202327
## bmi 6.884 1.846 3.730 0.003913 **
## age 53.069 11.293 4.699 0.000842 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.67 on 10 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.6781
## F-statistic: 13.64 on 2 and 10 DF, p-value: 0.001388
Note that almost half of the cases were not used in the analysis.
With multiple imputation we want to provide plausible values for the missing values, while taking the uncertainty about these numbers into account. Hence, we will first inspect the missing data pattern:
md.pattern(nhanes)
## age hyp bmi chl
## 13 1 1 1 1 0
## 1 1 1 0 1 1
## 3 1 1 1 0 1
## 1 1 0 0 1 2
## 7 1 0 0 0 3
## 0 8 9 10 27
Thus, for 13 subjects we have all variables. Moreover, for none of the subjects the variable age is missing. On the other hand, for 7 subjects we only have the age.
One useful feature of the mice
package is the ability to specify which predictors can be used for each incomplete variable.
imp <- mice(nhanes, print = FALSE)
imp$predictorMatrix
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The rows identify which predictors can be used for the variable in the row name. Hence, to impute the variable bmi
we can use the variables age
, hyp
, and chl
. Note, that the diagonal is equal to zero, because a variable cannot predict itself. Moreover, there were no missing values for age
, hence we do not need to predict its missing values and its row contains only zeroes.
Now, we can multiply impute the missing values in our dataset. It is useful to plot the parameters against the number of iterations to check for convergence. On convergence, the different streams should be freely intermingled with one another, without showing any definite trends.
imp <- mice(nhanes, print = FALSE, maxit = 10, seed = 24415) #10 iterations
plot(imp) #inspect the trace lines for convergence