This document is based on Section 7.4 of the book ‘Flexible Imputation of Missing Data’ by Stef van Buuren.

This practical needs the mice package:

library(mice)

Item YA

Are you able to walk outdoors on flat ground?

  1. Without any difficulty
  2. With some difficulty
  3. With much difficulty
  4. Unable to do

Item YB

Can you, fully independently, walk outdoors (if necessary with a cane)?

  1. Yes, no difficulty
  2. Yes, with some difficulty
  3. Yes, with much difficulty
  4. No, only with help from others

Equating categories

We have two studies, A and B. YA has been measured in Study A, and YB has been measured in Study B.

Would it be a good idea just to equate the four categories?

With the four categories coded 0 to 3, equating implicitly assumes that only the combinations (0, 0), (1, 1), (2, 2) and (3, 3) can occur. Is that realistic?

Imputation under independence

Let YA be the item of Study A, and let YB be the item of Study B. The comparability problem is a missing data problem, where YA is missing for population B and YB is missing for population A. This formulation may help in using multiple imputation to solve the problem.

First, we create a small dataset with responses as follows:

fA <- c(242, 43, 15, 0, 6)         # frequencies of population A
fB <- c(145, 110, 29, 8)           # frequencies of population B
YA <- rep(ordered(c(0:3, NA)), fA) # outcome item A population A
YB <- rep(ordered(c(0:3)), fB)     # outcome item B population B

Combine both datasets, with item YB missing for population A and item YA missing for population B. The data frame Y contains 598 rows and 2 columns: YA and YB.

Y <- rbind(data.frame(YA, YB = ordered(NA)), 
           data.frame(YB, YA = ordered(NA)))
dim(Y)
## [1] 598   2
head(Y)
##   YA   YB
## 1  0 <NA>
## 2  0 <NA>
## 3  0 <NA>
## 4  0 <NA>
## 5  0 <NA>
## 6  0 <NA>
tail(Y)
##       YA YB
## 593 <NA>  3
## 594 <NA>  3
## 595 <NA>  3
## 596 <NA>  3
## 597 <NA>  3
## 598 <NA>  3
md.pattern(Y)
##      YA  YB    
## 292   0   1   1
## 300   1   0   1
##   6   0   0   2
##     298 306 604

There are no observations that link YA to YB, so the missing data pattern is unconnected. Moreover, 6 records contain no item data at all, and the pattern shows 604 missing values in total.
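The unconnected pattern can also be checked without mice. The sketch below rebuilds the two item columns from the frequencies given above, as plain numeric vectors rather than ordered factors (a simplification made here for illustration), and counts the rows in which both items are observed and the rows in which neither is. The names Y_num, n_linked and n_empty are introduced only for this example.

```r
# Frequencies from the text: categories 0-3 plus NA for A, categories 0-3 for B
fA <- c(242, 43, 15, 0, 6)
fB <- c(145, 110, 29, 8)

# Simplified rebuild of Y with numeric columns
# (assumption: the ordering of the categories is not needed for this check)
Y_num <- data.frame(
  YA = c(rep(c(0:3, NA), times = fA), rep(NA, sum(fB))),
  YB = c(rep(NA, sum(fA)), rep(0:3, times = fB))
)

n_linked <- sum(complete.cases(Y_num))                 # rows with both items observed
n_empty  <- sum(rowSums(is.na(Y_num)) == ncol(Y_num))  # rows with no item data
c(rows = nrow(Y_num), linked = n_linked, empty = n_empty)
```

This gives 598 rows, 0 linked rows and 6 empty rows, in line with the md.pattern() output.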

The following chunk is a bit of specialty code that defines two functions. The function micemill() advances the imputation by one iteration at a time and, after each iteration, calculates Kendall’s \(\tau\) (rank order correlation) between the imputed versions of YA and YB. The function ra() is a small helper that extracts the results of the repeated analyses and, optionally, flattens them into a vector.

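micemill <- function(n){
  for (i in seq_len(n)){
    # advance the global mids object 'imp' by one iteration
    imp <<- mice.mids(imp)
    # Kendall's tau between YA and YB in each completed dataset
    cors <- with(imp, cor(as.numeric(YA),
                          as.numeric(YB), method = 'kendall'))
    # append one row (one value per imputation) to the global 'tau'
    tau <<- rbind(tau, ra(cors, simplify = TRUE))
  }
}
# extract the analyses from a mira object; flatten them if simplify = TRUE
ra <- function(x, simplify = FALSE) {
  if (!is.mira(x)) return(NULL)
  ra <- x$analyses
  if (simplify) ra <- unlist(ra)
  return(ra)
}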
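One detail of micemill() is worth a quick check: as.numeric() applied to an ordered factor returns the internal codes 1 to 4, not the labels 0 to 3. Because Kendall's \(\tau\) depends only on the ordering of the values, this makes no difference. The sketch below illustrates the point with a small made-up pair of vectors (hypothetical values, not taken from the studies).

```r
# Hypothetical toy responses on the 0-3 scale
ya <- ordered(c(0, 0, 1, 2, 3, 1), levels = 0:3)
yb <- ordered(c(0, 1, 1, 3, 2, 0), levels = 0:3)

codes_a  <- as.numeric(ya)                 # factor codes: 1 1 2 3 4 2
labels_a <- as.numeric(as.character(ya))   # labels:       0 0 1 2 3 1
codes_b  <- as.numeric(yb)
labels_b <- as.numeric(as.character(yb))

# Kendall's tau is rank-based, so both versions agree
tau_codes  <- cor(codes_a, codes_b, method = "kendall")
tau_labels <- cor(labels_a, labels_b, method = "kendall")
all.equal(tau_codes, tau_labels)
```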

The following code imputes the missing data in Y under the (dubious) assumption that YA and YB are mutually independent.

tau <- NULL
imp <- mice(Y, maxit = 0, m = 10, printFlag = FALSE, seed = 32662)
micemill(25)

# define a function to plot tracelines of Kendall's tau
plotit <- function() matplot(x = 1:nrow(tau),
                             y = tau, ylab = expression(paste("Kendall's ", tau)),
                             xlab = "Iteration", type = "l", lwd = 1,
                             lty = 1:10, col = "black")
plotit()

The plot shows 25 iterations: the traces start near zero, but then wander freely over a substantial range of the correlation. The MICE algorithm does not know where to go and wanders pointlessly through the parameter space. This occurs because the data contain no information about the relation between YA and YB, so \(\tau\) can be anything.

Why we cannot simply equate categories

Suppose that we have a third, external study E in which both YA and YB are measured. The contingency table of YA (rows) by YB (columns) in study E, with marginal totals, is:

##      0   1  2 3    
## 0  128  45  3 2 178
## 1   13  45 10 0  68
## 2    3  20 14 5  42
## 3    0   0  1 1   2
## NA   1   0  1 0   2
##    145 110 29 8 292

The contingency table shows that there is a strong relation between YA and YB. However, it is far from perfect, so simply equating the four categories between YA and YB will distort their relationship. Note that the table is not symmetric, indicating that YA is more difficult than YB.

Simple equating assumes 100% concordance of the pairs. The contingency table clearly shows that this is not the case in study E. On the surface, the four response categories of YA and YB may look similar, but the information from sample E suggests that the items work differently in a systematic way.
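The degree of concordance can be read off the table directly. The following sketch copies the counts for the complete pairs in study E (leaving out the two records with missing YA) and computes the share of pairs on the diagonal, together with the number of pairs on either side of it. The object names are introduced here for illustration.

```r
# Counts of YA (rows) by YB (columns) in study E, complete pairs only,
# copied from the contingency table in the text
tabE <- matrix(c(128, 45,  3, 2,
                  13, 45, 10, 0,
                   3, 20, 14, 5,
                   0,  0,  1, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(YA = as.character(0:3),
                               YB = as.character(0:3)))

n_pairs     <- sum(tabE)                    # number of complete pairs
concordance <- sum(diag(tabE)) / n_pairs    # same category on both items
yb_higher   <- sum(tabE[upper.tri(tabE)])   # pairs with YB score above YA score
ya_higher   <- sum(tabE[lower.tri(tabE)])   # pairs with YA score above YB score

round(100 * concordance)
c(yb_higher = yb_higher, ya_higher = ya_higher)
```

Of the 290 complete pairs, 188 (about 65%) fall in the same category on both items. The off-diagonal pairs are unevenly split, 65 against 37, which is the asymmetry noted above.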

Imputation using a bridge study

Is there a way to incorporate the relationship between YA and YB so that they become comparable?

The answer is yes. We can redo the imputation, but now with sample E added to the data. In this way study E acts as a bridge study.

The relevant data are built into the mice package under the name walking.

head(walking)
##      sex age YA   YB src
## 1   Male  61  1 <NA>   A
## 2 Female  69  1 <NA>   A
## 3   Male  74  0 <NA>   A
## 4   Male  66  0 <NA>   A
## 5 Female  72  2 <NA>   A
## 6   Male  67  0 <NA>   A
table(walking$src)
## 
##   A   B   E 
## 306 292 292
with(walking, table(YA, YB, src, useNA = "always"))
## , , src = A
## 
##       YB
## YA       0   1   2   3 <NA>
##   0      0   0   0   0  242
##   1      0   0   0   0   43
##   2      0   0   0   0   15
##   3      0   0   0   0    0
##   <NA>   0   0   0   0    6
## 
## , , src = B
## 
##       YB
## YA       0   1   2   3 <NA>
##   0      0   0   0   0    0
##   1      0   0   0   0    0
##   2      0   0   0   0    0
##   3      0   0   0   0    0
##   <NA> 145 110  29   8    0
## 
## , , src = E
## 
##       YB
## YA       0   1   2   3 <NA>
##   0    128  45   3   2    0
##   1     13  45  10   0    0
##   2      3  20  14   5    0
##   3      0   0   1   1    0
##   <NA>   1   0   1   0    0
## 
## , , src = NA
## 
##       YB
## YA       0   1   2   3 <NA>
##   0      0   0   0   0    0
##   1      0   0   0   0    0
##   2      0   0   0   0    0
##   3      0   0   0   0    0
##   <NA>   0   0   0   0    0

The missing data pattern of the combined dataset of populations A, B and E:

md.pattern(walking)
##     sex age src  YA  YB    
## 290   1   1   1   1   1   0
## 294   1   1   1   0   1   1
## 300   1   1   1   1   0   1
##   6   1   1   1   0   0   2
##       0   0   0 300 306 606

Now, for 290 subjects we have scores on both YA and YB (from bridge study E).

Multiple imputation on the dataset walking can now be done as

tau <- NULL
imp <- mice(walking, maxit = 0, m = 10, seed = 92786)
pred <- imp$predictorMatrix
pred[, c("src", "age", "sex")] <- 0   # do not use src, age and sex as predictors
imp <- mice(walking, maxit = 0, m = 10, seed = 92786, predictorMatrix = pred)
micemill(20)
plotit()

After five iterations the procedure seems to converge. The speed of convergence depends on the size of the bridge study (now 1/3 of the total dataset). If the relative size of the bridge study were smaller, it might take more iterations to reach convergence.
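Once the traces have settled, the imputed values of YA in population B can be summarised. The following is a minimal sketch, assuming the mice package is installed. It reruns the bridge imputation in a single call with maxit = 20 instead of stepping through micemill(), so its random draws need not match the run above. The names imp_bridge, long, in_B and ya_B are introduced here for illustration only, and no output is shown because the exact numbers depend on the seed and the mice version.

```r
library(mice)

# Let YA and YB predict each other only, as in the text
pred <- make.predictorMatrix(walking)
pred[, c("src", "age", "sex")] <- 0

# Assumption: 20 iterations in one call stand in for micemill(20)
imp_bridge <- mice(walking, m = 10, maxit = 20, predictorMatrix = pred,
                   printFlag = FALSE, seed = 92786)

# Stack the 10 completed datasets and keep the rows of population B
long <- complete(imp_bridge, action = "long")
in_B <- long$src == "B"
ya_B <- as.numeric(as.character(long$YA[in_B]))   # labels 0-3, not factor codes

mean_ya_B  <- mean(ya_B)              # mean of imputed YA in B, over all imputations
perc0_ya_B <- 100 * mean(ya_B == 0)   # percentage of zeroes
c(mean = mean_ya_B, perc0 = perc0_ya_B)
```

Averaging over the stacked data gives the same point estimate as averaging the ten per-imputation means, because every completed dataset contains the same 292 rows for population B.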

Does the assumption matter?

We have made three different assumptions on the relation between YA and YB. Does the assumption matter for the conclusion we draw from the data?

Assumption     Mean      Mean      Perc(0)   Perc(0)
               Study A   Study B   Study A   Study B
Equate         0.24      0.66      81        50
Independence   0.24      0.25      50        50
Bridge         0.24      0.45      58        50

We calculate two statistics of interest:

  1. Mean: mean of the distribution; a lower value indicates a healthier population
  2. Perc(0): percentage of zeroes in the distribution; a higher value indicates a healthier population
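Both statistics are easy to compute from category counts. As a sketch, the code below applies them to the observed frequencies of YA in study A and of YB in study B, which is what simple equating amounts to. The helper item_stats() is a hypothetical name introduced for this example.

```r
scores <- 0:3
fA <- c(242, 43, 15, 0)    # observed YA in study A (the 6 missing records are left out)
fB <- c(145, 110, 29, 8)   # observed YB in study B

# Hypothetical helper: mean score and percentage of zeroes from category counts
item_stats <- function(f) {
  c(mean  = sum(scores * f) / sum(f),
    perc0 = 100 * f[1] / sum(f))
}

stats <- rbind(A = item_stats(fA), B = item_stats(fB))
round(stats, 2)
```

After rounding, this yields a mean of 0.24 and 81% zeroes for Study A, and a mean of 0.66 and 50% zeroes for Study B, which matches the Equate row of the table.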

From the table we see that different assumptions may lead to radically different conclusions. Neither equating nor independence is satisfactory here; the bridge study provides the more reasonable assumption.