This document is based on Section 7.4 of the book ‘Flexible Imputation of Missing Data’ by Stef van Buuren.
This practical needs the mice package:
library(mice)
YA: Are you able to walk outdoors on flat ground?
YB: Can you, fully independently, walk outdoors (if necessary with a cane)?
We have two studies, A and B. YA has been measured in Study A, and YB has been measured in Study B.
Would it be a good idea just to equate the four categories?
The equating assumption implicitly assumes that only combinations (0, 0), (1, 1), (2, 2) and (3, 3) can occur. Is that realistic?
Let YA be the item of Study A, and let YB be the item of Study B. The comparability problem is a missing data problem, where YA is missing for population B and YB is missing for population A. This formulation may help in using multiple imputation to solve the problem.
First, we create a small dataset with responses as follows:
fA <- c(242, 43, 15, 0, 6) # frequencies of population A
fB <- c(145, 110, 29, 8) # frequencies of population B
YA <- rep(ordered(c(0:3, NA)), fA) # outcome item A population A
YB <- rep(ordered(c(0:3)), fB) # outcome item B population B
Combine both datasets with missing values for item YB for population A, and missing values for item YA for population B. The dataframe Y contains 598 rows and 2 columns: YA and YB.
Y <- rbind(data.frame(YA, YB = ordered(NA)),
data.frame(YB, YA = ordered(NA)))
dim(Y)
## [1] 598 2
head(Y)
## YA YB
## 1 0 <NA>
## 2 0 <NA>
## 3 0 <NA>
## 4 0 <NA>
## 5 0 <NA>
## 6 0 <NA>
tail(Y)
## YA YB
## 593 <NA> 3
## 594 <NA> 3
## 595 <NA> 3
## 596 <NA> 3
## 597 <NA> 3
## 598 <NA> 3
md.pattern(Y)
## YA YB
## 292 0 1 1
## 300 1 0 1
## 6 0 0 2
## 298 306 604
There are no observations that link YA to YB, and so the missing data pattern is unconnected. Moreover, there are 6 records that contain no item data at all.
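The unconnected pattern can also be checked by counting rows directly. The following is a minimal base-R sketch, assuming the same frequencies as above. Y is rebuilt in a slightly different way so that the block is self-contained, and the names both and neither are introduced here for illustration only.

```r
# Frequencies as given above (the last entry of fA counts missing responses)
fA <- c(242, 43, 15, 0, 6)
fB <- c(145, 110, 29, 8)

# Rebuild the stacked data: YB is missing in population A, YA in population B
Y <- data.frame(
  YA = ordered(c(rep(c(0:3, NA), fA), rep(NA, sum(fB))), levels = 0:3),
  YB = ordered(c(rep(NA, sum(fA)), rep(0:3, fB)), levels = 0:3)
)

# Rows with both items observed, and rows with neither item observed
both    <- sum(!is.na(Y$YA) & !is.na(Y$YB))
neither <- sum(is.na(Y$YA) & is.na(Y$YB))

c(rows = nrow(Y), both = both, neither = neither)
```

A count of zero for both means that no row carries information on the joint distribution of YA and YB.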
The following chunk is a bit of specialty code that defines two functions. The function micemill() calculates Kendall’s \(\tau\) (rank order correlation) between the imputed versions of YA and YB at each iteration. The function ra() is a small helper function that extracts the per-imputation results from the repeated analyses and puts them in proper shape.
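# Run n additional iterations, one at a time, and record Kendall's tau
# between the imputed YA and YB in each imputed dataset.
# Note: updates the global objects `imp` and `tau` via `<<-`.
micemill <- function(n) {
  for (i in seq_len(n)) {
    imp <<- mice.mids(imp)
    cors <- with(imp, cor(as.numeric(YA),
                          as.numeric(YB), method = "kendall"))
    tau <<- rbind(tau, ra(cors, simplify = TRUE))
  }
}
# Extract the list of analysis results from a mira object;
# with simplify = TRUE the list is flattened into a vector.
ra <- function(x, simplify = FALSE) {
  if (!is.mira(x)) return(NULL)
  ra <- x$analyses
  if (simplify) ra <- unlist(ra)
  return(ra)
}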
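Note that micemill() passes as.numeric(YA) and as.numeric(YB) to cor(). For a factor, as.numeric() returns the internal level codes (1 to 4) rather than the labels (0 to 3). Because Kendall's \(\tau\) depends only on the ordering of the values, this should make no difference, as the small sketch below illustrates. The vectors ya and yb are made-up toy data, not taken from either study.

```r
# Toy data (hypothetical, for illustration only)
ya <- ordered(c(0, 0, 1, 2, 3, 1), levels = 0:3)
yb <- ordered(c(0, 1, 1, 3, 2, 0), levels = 0:3)

# Kendall's tau on the internal level codes (1..4), as in micemill()
tau_codes <- cor(as.numeric(ya), as.numeric(yb), method = "kendall")

# Kendall's tau on the original labels (0..3)
tau_labels <- cor(as.numeric(as.character(ya)),
                  as.numeric(as.character(yb)), method = "kendall")

# Both routes give the same rank correlation
all.equal(tau_codes, tau_labels)
```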
The following code imputes the missing data in Y under the (dubious) assumption that YA and YB are mutually independent.
tau <- NULL
imp <- mice(Y, maxit = 0, m = 10, printFlag = FALSE, seed = 32662)
micemill(25)
# define a function to plot tracelines of Kendall's tau
plotit <- function() matplot(x = 1:nrow(tau),
y = tau, ylab = expression(paste("Kendall's ", tau)),
xlab = "Iteration", type = "l", lwd = 1,
lty = 1:10, col = "black")
plotit()
The plot shows 25 iterations: the traces start near zero, but then wander freely over a substantial range of the correlation. The MICE algorithm does not know where to go, and wanders pointlessly through the parameter space. This occurs because the data contain no information about the relation between YA and YB, so \(\tau\) can be anything.
Suppose that we have a third, external study E in which both YA and YB are measured.
##        YB
## YA       0   1   2   3  Total
## 0      128  45   3   2    178
## 1       13  45  10   0     68
## 2        3  20  14   5     42
## 3        0   0   1   1      2
## NA       1   0   1   0      2
## Total  145 110  29   8    292
The contingency table shows that there is a strong relation between YA and YB. However, it is far from perfect, so simply equating the four categories between YA and YB will distort their relationship. Note that the table is not symmetric, indicating that YA is more difficult than YB.
Simple equating assumes 100% concordance of the pairs. The contingency table clearly shows that this is not the case in study E. On the surface, the four response categories of YA and YB may look similar, but the information from sample E suggests that the items work differently in a systematic way.
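The degree of concordance can be quantified from the table. The sketch below re-enters the 290 complete pairs of study E by hand and computes the share of pairs on the diagonal, together with the counts on either side of it. The object names E, concordance, ya_lower and ya_higher are introduced here for illustration only.

```r
# Complete pairs from study E: rows are YA (0-3), columns are YB (0-3)
E <- matrix(c(128, 45,  3, 2,
               13, 45, 10, 0,
                3, 20, 14, 5,
                0,  0,  1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(YA = as.character(0:3), YB = as.character(0:3)))

# Share of pairs with identical categories on both items
concordance <- sum(diag(E)) / sum(E)

# Pairs off the diagonal, split by direction
ya_lower  <- sum(E[upper.tri(E)])  # YA score below YB score
ya_higher <- sum(E[lower.tri(E)])  # YA score above YB score

round(concordance, 2)
c(ya_lower = ya_lower, ya_higher = ya_higher)
```

About 65% of the pairs are concordant (188 of 290), and 65 pairs score lower on YA than on YB against 37 the other way round, which puts a number on the asymmetry.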
Is there a way to incorporate the relationship between YA and YB so that they become comparable?
The answer is yes. We can redo the imputation, but now with sample E added to the data. In this way study E acts as a bridge study.
The relevant data are built into the mice package under the name walking.
head(walking)
## sex age YA YB src
## 1 Male 61 1 <NA> A
## 2 Female 69 1 <NA> A
## 3 Male 74 0 <NA> A
## 4 Male 66 0 <NA> A
## 5 Female 72 2 <NA> A
## 6 Male 67 0 <NA> A
table(walking$src)
##
## A B E
## 306 292 292
with(walking, table(YA, YB, src, useNA = "always"))
## , , src = A
##
## YB
## YA 0 1 2 3 <NA>
## 0 0 0 0 0 242
## 1 0 0 0 0 43
## 2 0 0 0 0 15
## 3 0 0 0 0 0
## <NA> 0 0 0 0 6
##
## , , src = B
##
## YB
## YA 0 1 2 3 <NA>
## 0 0 0 0 0 0
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## <NA> 145 110 29 8 0
##
## , , src = E
##
## YB
## YA 0 1 2 3 <NA>
## 0 128 45 3 2 0
## 1 13 45 10 0 0
## 2 3 20 14 5 0
## 3 0 0 1 1 0
## <NA> 1 0 1 0 0
##
## , , src = NA
##
## YB
## YA 0 1 2 3 <NA>
## 0 0 0 0 0 0
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## <NA> 0 0 0 0 0
The missing data pattern of the combined dataset of populations A, B and E:
md.pattern(walking)
## sex age src YA YB
## 290 1 1 1 1 1 0
## 294 1 1 1 0 1 1
## 300 1 1 1 1 0 1
## 6 1 1 1 0 0 2
## 0 0 0 300 306 606
Now, for 290 subjects we have scores on both YA and YB (from bridge study E).
Multiple imputation on the dataset walking can now be done as follows:
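tau <- NULL
# dry run (maxit = 0) to obtain the default predictor matrix
imp <- mice(walking, maxit = 0, m = 10, seed = 92786)
pred <- imp$predictorMatrix
# do not use src, age and sex as predictors
pred[, c("src", "age", "sex")] <- 0
imp <- mice(walking, maxit = 0, m = 10, seed = 92786, predictorMatrix = pred)
micemill(20)
plotit()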
After five iterations the procedure seems to converge. The speed of convergence depends on the size of the bridge study (here about 1/3 of the total dataset). If the relative size of the bridge study were smaller, it might have taken more iterations to reach convergence.
We have made three different assumptions on the relation between YA and YB. Does the assumption matter for the conclusion we draw from the data?
Assumption | Mean (Study A) | Mean (Study B) | Perc(0) (Study A) | Perc(0) (Study B) |
---|---|---|---|---|
Equate | 0.24 | 0.66 | 81 | 50 |
Independence | 0.24 | 0.25 | 50 | 50 |
Bridge | 0.24 | 0.45 | 58 | 50 |
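The Equate row requires no imputation: under equating, each study's own item is taken at face value, so both statistics follow from the observed frequencies. The following is a minimal sketch, assuming the frequencies given at the start of this practical and treating the categories as scores 0 to 3. The helper name stats is hypothetical.

```r
# Observed frequencies of categories 0-3 (missing responses in A dropped)
fA <- c(242, 43, 15, 0)   # item YA, population A
fB <- c(145, 110, 29, 8)  # item YB, population B

# Mean score and percentage in category 0 from a frequency vector
stats <- function(f, scores = 0:3) {
  c(mean  = sum(scores * f) / sum(f),
    perc0 = 100 * f[1] / sum(f))
}

round(rbind(A = stats(fA), B = stats(fB)), 2)
```

After rounding, this gives means of 0.24 and 0.66 and zero-category percentages of 81 and 50, which matches the Equate row.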
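We calculate two statistics of interest: the mean score and the percentage of persons in category 0 (Perc(0)), each for Study A and Study B.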
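From the table we see that, under equating, persons from study A are healthier than persons from study B, and by a considerable margin (e.g. 81 versus 50 percent in the zero category). Under independence, persons from studies A and B are about equally healthy. Thus, different assumptions may lead to radically different conclusions. We find that equating overstates the relation between YA and YB, whereas independence understates the relation between YA and YB. Neither equating nor independence is OK. The more reasonable assumption here is the bridge.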