# Multiple Imputation (using the package `mice`)

For this practical we will use data from the package `mice`:

``library(mice)``

The dataset `nhanes` contains 25 observations on the following 4 variables:

• age: Age group (1 = 20-39, 2 = 40-59, 3 = 60+)
• bmi: Body mass index (kg/m^2)
• hyp: Hypertensive (1 = no, 2 = yes)
• chl: Total serum cholesterol (mg/dL)

In `R` the dataset looks as follows:

``nhanes``
``````##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186``````

## Complete-case analysis

If we fit a model without taking the missing values into account, `lm()` drops every case with a missing value on one of the model variables, and we obtain the following fit:

``````model <- lm(chl ~ bmi + age, data = nhanes)
summary(model)``````
``````##
## Call:
## lm(formula = chl ~ bmi + age, data = nhanes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -31.187 -19.517  -0.310   6.915  60.606
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -80.194     58.772  -1.364 0.202327
## bmi            6.884      1.846   3.730 0.003913 **
## age           53.069     11.293   4.699 0.000842 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.67 on 10 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.6781
## F-statistic: 13.64 on 2 and 10 DF,  p-value: 0.001388``````

Note that 12 of the 25 cases, almost half, were not used in the analysis.
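The number of dropped cases can also be checked directly. The following is a small sketch using base R's `complete.cases()` and `nobs()`; the object names `used_vars`, `n_complete` and `n_dropped` are only illustrative.

```r
library(mice)  # provides the nhanes data

# Variables that enter the model chl ~ bmi + age
used_vars <- c("chl", "bmi", "age")

# Rows with no missing value on any of the model variables
n_complete <- sum(complete.cases(nhanes[, used_vars]))
n_dropped  <- nrow(nhanes) - n_complete

# Refit the model and compare with the number of cases lm() actually used
model <- lm(chl ~ bmi + age, data = nhanes)
c(complete = n_complete, dropped = n_dropped, used_by_lm = nobs(model))
```

The number of dropped cases should agree with the "observations deleted due to missingness" line in the model summary.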

## Missing data

With multiple imputation we want to provide plausible values for the missing values, while taking the uncertainty about these values into account. Before imputing, we first inspect the missing data pattern:

``md.pattern(nhanes)``
``````##    age hyp bmi chl
## 13   1   1   1   1  0
##  1   1   1   0   1  1
##  3   1   1   1   0  1
##  1   1   0   0   1  2
##  7   1   0   0   0  3
##      0   8   9  10 27``````

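Each row is a missing data pattern (1 = observed, 0 = missing). The first column gives the number of subjects with that pattern, the last column gives the number of missing variables in the pattern, and the bottom row counts the missing values per variable. Thus, all variables are observed for 13 subjects. The variable `age` is never missing, whereas for 7 subjects `age` is the only observed variable.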
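These counts can be cross-checked with base R. The lines below are a minimal sketch; `n_missing` and `only_age` are illustrative names, and the second check relies on `age` being complete in these data.

```r
library(mice)  # provides the nhanes data

# Number of missing values per variable (bottom row of the md.pattern output)
n_missing <- colSums(is.na(nhanes))
n_missing

# Subjects with every variable except one missing; since age is never
# missing, these are the subjects for whom only age is observed
only_age <- sum(rowSums(is.na(nhanes)) == ncol(nhanes) - 1)
only_age
```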

One useful feature of the `mice` package is the ability to specify which predictors can be used for each incomplete variable. The choice made by `mice()` is stored in the predictor matrix of the imputation object:

``````imp <- mice(nhanes, print = FALSE)
imp$predictorMatrix``````
``````##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0``````

Each row shows which predictors (marked with a 1) are used to impute the variable named in that row. Hence, to impute the variable `bmi` we can use the variables `age`, `hyp`, and `chl`. Note that the diagonal is equal to zero, because a variable cannot predict itself. Moreover, there are no missing values for `age`, so it does not need to be imputed and its row contains only zeros.
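To restrict the predictors for a given variable, the matrix can be edited and passed back to `mice()` through its `predictorMatrix` argument. The sketch below drops `hyp` as a predictor of `bmi`; this particular restriction is made up for illustration and is not a recommendation for these data. The names `ini`, `pred` and `imp_custom` are likewise only illustrative.

```r
library(mice)

# Dry run (maxit = 0) just to obtain the default predictor matrix
ini  <- mice(nhanes, maxit = 0, print = FALSE)
pred <- ini$predictorMatrix

# Hypothetical restriction: do not use hyp when imputing bmi
pred["bmi", "hyp"] <- 0

# Impute with the customised predictor matrix
imp_custom <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 24415)
imp_custom$predictorMatrix["bmi", ]
```

Editing a copy of the matrix, rather than building one from scratch, keeps the row and column names aligned with the data.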

## Multiply impute the data

Now we can multiply impute the missing values in our dataset. It is useful to plot the mean and standard deviation of the imputed values against the iteration number to check for convergence. Once the algorithm has converged, the different streams should intermingle freely with one another, without showing any definite trend.

``````imp <- mice(nhanes, print = FALSE, maxit = 10, seed = 24415) #10 iterations
plot(imp) #inspect the trace lines for convergence``````
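The trace lines can also be inspected numerically. As a rough sketch, the `chainMean` component of the imputation object holds the mean of the imputed values per variable, iteration and imputation stream. The name `bmi_means` is only illustrative, and the default of `m = 5` imputations is assumed.

```r
library(mice)

imp <- mice(nhanes, print = FALSE, maxit = 10, seed = 24415)

# Mean of the imputed bmi values: one row per iteration, one column per stream
bmi_means <- imp$chainMean["bmi", , ]
dim(bmi_means)  # iterations x imputation streams
round(bmi_means, 2)

# Lowest and highest stream mean at each iteration; after convergence
# these should fluctuate without a systematic upward or downward trend
apply(bmi_means, 1, range)
```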