2.3 Why and when multiple imputation works

2.3.1 Goal of multiple imputation

A scientific estimand \(Q\) is a quantity of scientific interest that we could calculate if we observed the entire population. For example, we could be interested in the mean income of the population. In general, \(Q\) can be expressed as a known function of the population data. If we are interested in more than one quantity, \(Q\) will be a vector. Note that \(Q\) is a property of the population, so it does not depend on any design characteristics. Examples of scientific estimands include the population mean, the population (co)variance or correlation, and population factor loadings and regression coefficients, as well as these quantities calculated within known strata of the population. Examples of quantities that are not scientific estimands are sample means, standard errors and test statistics, since these depend on the design of the study rather than on the population alone.

We can only calculate \(Q\) if the population data are fully known, but this is almost never the case. The goal of multiple imputation is to find an estimate \(\hat Q\) that is unbiased and confidence valid (Rubin 1996). We explain these concepts below.

Unbiasedness means that the average \(\hat Q\) over all possible samples \(Y\) from the population is equal to \(Q\). The formula is

\[\begin{equation} E(\hat Q|Y) = Q \tag{2.12} \end{equation}\]

The explanation of confidence validity requires some additional symbols. Let \(U\) be the estimated variance-covariance matrix of \(\hat Q\). This estimate is confidence valid if the average of \(U\) over all possible samples is equal to or larger than the variance of \(\hat Q\). The formula is

\[\begin{equation} E(U|Y) \geq V(\hat Q|Y) \tag{2.13} \end{equation}\]

where \(V(\hat Q|Y)\) denotes the variance of \(\hat Q\) caused by the sampling process. A procedure is said to be confidence valid if this inequality holds. In that case, a statistical test with a stated nominal rejection rate of 5% rejects the null hypothesis in at most 5% of the cases when the null hypothesis is in fact true.

In summary, the goal of multiple imputation is to obtain estimates of the scientific estimand in the population. This estimate should on average be equal to the value of the population parameter. Moreover, the associated confidence intervals and hypothesis tests should achieve at least the stated nominal value.

2.3.2 Three sources of variation\(^\spadesuit\)

The actual value of \(Q\) is unknown if some of the population data are unknown. Suppose we make an estimate \(\hat Q\) of \(Q\). The amount of uncertainty in \(\hat Q\) about the true population value \(Q\) depends on what we know about \(Y_\mathrm{mis}\). If we could re-create \(Y_\mathrm{mis}\) perfectly, we could calculate \(Q\) with certainty. However, such perfect re-creation is almost never achievable. In all other cases, we need to summarize the distribution of \(Q\) under varying \(Y_\mathrm{mis}\). The possible values of \(Q\) given our knowledge of the data \(Y_\mathrm{obs}\) are captured by the posterior distribution \(P(Q|Y_\mathrm{obs})\). In itself, \(P(Q|Y_\mathrm{obs})\) is often intractable, but it can be decomposed into two parts that are easier to solve, as follows:

\[\begin{equation} P(Q|{\mbox{$Y_\mathrm{obs}$}}) = \int{P(Q|{\mbox{$Y_\mathrm{obs}$}},{\mbox{$Y_\mathrm{mis}$}})P({\mbox{$Y_\mathrm{mis}$}}|{\mbox{$Y_\mathrm{obs}$}})} d{\mbox{$Y_\mathrm{mis}$}}\tag{2.14} \end{equation}\]

Here, \(P(Q|Y_\mathrm{obs})\) is the posterior distribution of \(Q\) given the observed data \(Y_\mathrm{obs}\). This is the distribution that we would like to know. \(P(Q|Y_\mathrm{obs},Y_\mathrm{mis})\) is the posterior distribution of \(Q\) in the hypothetically complete data, and \(P(Y_\mathrm{mis}|Y_\mathrm{obs})\) is the posterior distribution of the missing data given the observed data.

The interpretation of Equation (2.14) is most conveniently done from right to left. Suppose that we use \(P(Y_\mathrm{mis}|Y_\mathrm{obs})\) to draw imputations for \(Y_\mathrm{mis}\), denoted as \(\dot Y_\mathrm{mis}\). We can then use \(P(Q|Y_\mathrm{obs},\dot Y_\mathrm{mis})\) to calculate the quantity of interest \(Q\) from the imputed data (\(Y_\mathrm{obs}\),\(\dot Y_\mathrm{mis}\)). We repeat these two steps with new draws \(\dot Y_\mathrm{mis}\), and so on. Equation (2.14) says that the actual posterior distribution of \(Q\) is equal to the average over the repeated draws of \(Q\). This result is important since it expresses \(P(Q|Y_\mathrm{obs})\), which is generally difficult, as a combination of two simpler posteriors from which draws can be made.
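The logic of Equation (2.14) fits in a few lines of code. The sketch below, which uses the mice package and its built-in nhanes data (also used in Section 2.3.7), takes the mean of bmi as \(Q\): each completed dataset corresponds to one draw \(\dot Y_\mathrm{mis}\), and the resulting estimates are repeated draws from an approximation to \(P(Q|Y_\mathrm{obs})\). The choice of \(m = 20\) draws is illustrative.

# Sketch of the two-step logic in Equation (2.14): draw Ydot_mis,
# then calculate Q from the completed data; repeat m times.
library(mice)
imp <- mice(nhanes, m = 20, print = FALSE, seed = 1)
Q_draws <- sapply(1:20, function(l) mean(complete(imp, l)$bmi))
Q_draws  # the spread of these draws reflects P(Q | Y_obs)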

It can be shown that the posterior mean of \(P(Q|Y_\mathrm{obs})\) is equal to

\[\begin{equation} E(Q|{\mbox{$Y_\mathrm{obs}$}}) = E(E[Q|{\mbox{$Y_\mathrm{obs}$}},{\mbox{$Y_\mathrm{mis}$}}]|{\mbox{$Y_\mathrm{obs}$}}) \tag{2.15} \end{equation}\]

the average of the posterior means of \(Q\) over the repeatedly imputed data. This equation suggests the following procedure for combining the results of repeated imputations. Suppose that \(\hat Q_\ell\) is the estimate from the \(\ell^\mathrm{th}\) repeated imputation; then the combined estimate is equal to

\[\begin{equation} \bar Q = \frac{1}{m}\sum_{\ell=1}^m \hat Q_\ell \tag{2.16} \end{equation}\]

where \(\hat Q_\ell\) contains \(k\) parameters and is represented as a \(k \times 1\) column vector.

The posterior variance of \(P(Q|Y_\mathrm{obs})\) is the sum of two variance components:

\[\begin{equation} V(Q|{\mbox{$Y_\mathrm{obs}$}})=E[V(Q|{\mbox{$Y_\mathrm{obs}$}},{\mbox{$Y_\mathrm{mis}$}})|{\mbox{$Y_\mathrm{obs}$}}] + V[E(Q|{\mbox{$Y_\mathrm{obs}$}},{\mbox{$Y_\mathrm{mis}$}})|{\mbox{$Y_\mathrm{obs}$}}]\tag{2.17} \end{equation}\]

This equation is well known in statistics, but it can be difficult to grasp at first. The first component is the average of the repeated complete-data posterior variances of \(Q\), called the within-variance. The second component is the variance between the complete-data posterior means of \(Q\), called the between-variance. Let \(\bar U_\infty\) and \(B_\infty\) denote the estimated within- and between-variance for an infinitely large number of imputations \(m=\infty\). Then \(T_\infty = \bar U_\infty + B_\infty\) is the posterior variance of \(Q\).

Equation (2.17) suggests the following procedure to estimate \(T_\infty\) for finite \(m\). We calculate the average of the complete-data variances as

\[\begin{equation} \bar U = \frac{1}{m}\sum_{\ell=1}^m \bar U_\ell \tag{2.18} \end{equation}\]

where the term \(\bar U_\ell\) is the variance-covariance matrix of \(\hat Q_\ell\) obtained for the \(\ell^\mathrm{th}\) imputation. The standard unbiased estimate of the variance between the \(m\) complete-data estimates is given by

\[\begin{equation} B = \frac{1}{m-1}\sum_{\ell=1}^m (\hat Q_\ell-\bar Q)(\hat Q_\ell-\bar Q)' \tag{2.19} \end{equation}\]

where \(\bar Q\) is calculated by Equation (2.16).

It is tempting to conclude that the total variance \(T\) is equal to the sum of \(\bar U\) and \(B\), but that would be incorrect. We need to incorporate the fact that \(\bar Q\) itself is estimated from a finite number of imputations \(m\), and thus only approximates \(\bar Q_\infty\). Rubin (1987b, eq. 3.3.5) shows that the contribution of this factor to the variance is systematic and equal to \(B_\infty/m\). Since \(B\) approximates \(B_\infty\), we may write

\[\begin{align} T &= \bar U + B + B/m \\ &= \bar U + \left(1+\frac{1}{m}\right)B \tag{2.20} \\ \end{align}\]

for the total variance of \(\bar Q\), and hence of \((Q-\bar Q)\) if \(\bar Q\) is unbiased. The procedure to combine the repeated-imputation results by Equations (2.16) and (2.20) is referred to as Rubin’s rules.
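For a scalar estimand, Rubin's rules fit in a few lines of R. The minimal sketch below defines a hypothetical helper pool_scalar() (not the mice implementation) that combines \(m\) complete-data estimates and variances according to Equations (2.16) and (2.18)-(2.20).

# Rubin's rules for a scalar estimand: qhat holds the m complete-data
# estimates, u the corresponding complete-data variances.
pool_scalar <- function(qhat, u) {
  m    <- length(qhat)
  qbar <- mean(qhat)               # Equation (2.16): pooled estimate
  ubar <- mean(u)                  # Equation (2.18): within-variance
  b    <- var(qhat)                # Equation (2.19): between-variance
  t    <- ubar + (1 + 1 / m) * b   # Equation (2.20): total variance
  c(qbar = qbar, ubar = ubar, b = b, t = t)
}

Feeding it the \(m\) estimates of a regression coefficient and their squared standard errors yields the same quantities that pool() in mice reports as estimate, ubar, b and t in Section 2.3.7.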

In summary, the total variance \(T\) stems from three sources:

  1. \(\bar U\), the variance caused by the fact that we are taking a sample rather than observing the entire population. This is the conventional statistical measure of variability;

  2. \(B\), the extra variance caused by the fact that there are missing values in the sample;

  3. \(B/m\), the extra simulation variance caused by the fact that \(\bar Q\) itself is estimated for finite \(m\).

The addition of the latter term is critical to make multiple imputation work at low values of \(m\). Not including it would result in \(p\)-values that are too low, or confidence intervals that are too short. Traditional choices for \(m\) are \(m=3\), \(m=5\) and \(m=10\). The current advice is to set \(m\) higher, e.g., \(m=50\) (cf. Section 2.8). The larger \(m\), the smaller the effect of simulation error on the total variance.

Steel, Wang, and Raftery (2010) investigated alternatives for obtaining estimates of \(T\) using mixtures of normals. Under multivariate normality and for low \(m\), these methods yield slightly more efficient estimates of \(T\). The behavior of these methods when normality is violated is not known. Since the procedure is more complex to apply than Rubin's rules, it is rarely used in practice.

2.3.3 Proper imputation

In order to yield valid statistical inferences, the imputed values should possess certain characteristics. Procedures that yield such imputations are called proper (Rubin 1987b, 118–28). Section 2.3.1 described two conditions needed for a valid estimate of \(Q\). These requirements apply simultaneously to both the sampling and the nonresponse model. An analogous set of requirements exists if we zoom in on procedures that deal exclusively with the response model. The important theoretical result is: If the imputation method is proper and if the complete-data model is valid in the sense of Section 2.3.1, the whole procedure is valid (Rubin 1987b, 119).

Table 2.2: Role of symbols at three analytic levels and the relations between them. The relation \(\Longrightarrow\) means “is an estimate of.” The relation \(\doteq\) means “is asymptotically equal to.”
| Incomplete Sample | | Complete Sample | | Population |
|:---|:---:|:---|:---:|:---|
| \(Y_\mathrm{obs}\) | | \(Y=(Y_\mathrm{obs},Y_\mathrm{mis})\) | | |
| \(\bar Q\) | \(\Longrightarrow\) | \(\hat Q\) | \(\Longrightarrow\) | \(Q\) |
| \(\bar U\) | \(\Longrightarrow\) | \(U\doteq V(\hat Q)\) | | |
| \(B\doteq V(\bar Q)\) | | | | |

Recall from Section 2.3.1 that the goal of multiple imputation is to find an estimate \(\hat Q\) of \(Q\) with correct statistical properties. At the level of the sample, there is uncertainty about \(Q\). This uncertainty is captured by \(U\), the estimated variance-covariance of \(\hat Q\) in the sample. If we have no missing data in the sample, the pair \((\hat Q, U)\) contains everything we know about \(Q\).

If we have incomplete data, we can distinguish three analytic levels: the population, the sample and the incomplete sample. Estimating \(Q\) in the population by \(\hat Q\) from the sample is a traditional statistical problem. The key idea of the solution is to accompany \(\hat Q\) by an estimate \(U\) of its variability under repeated sampling, according to the sampling model.

Now suppose that we want to go from the incomplete sample to the complete sample. At the sample level we can distinguish two estimands instead of one: \(\hat Q\) and \(U\). Thus, the role of the single estimand \(Q\) at the population level is taken over by the estimand pair \((\hat Q, U)\) at the sample level. Table 2.2 provides an overview of the three analytic levels involved, the quantities defined at each level and their relations. Note that \(\hat Q\) plays a double role: it is an estimate (of \(Q\)) as well as an estimand (targeted by \(\bar Q\)). Likewise, \(U\) has two roles.

Imputation is the act of converting an incomplete sample into a complete sample. Imputation of data should, at the very least, lead to adequate estimates of both \(\hat Q\) and \(U\). Three conditions define whether an imputation procedure is considered proper. We use the slightly simplified version given by Brand (1999, 89) combined with Rubin (1987b). An imputation procedure is said to be confidence proper for complete-data statistics \((\hat Q,U)\) if at large \(m\) all of the following conditions hold approximately:

\[\begin{align} E(\bar Q|Y) &= \hat Q \tag{2.21} \\ E(\bar U|Y) &= U \tag{2.22} \\ \left(1+\frac{1}{m}\right) E(B|Y) &\geq V(\bar Q) \tag{2.23} \\ \end{align}\]

The hypothetically complete sample data \(Y\) is now held fixed, and the response indicator \(R\) varies according to a specified model.

The first requirement is that \(\bar Q\) is an unbiased estimate of \(\hat Q\). This means that, when averaged over the response indicators \(R\) sampled under the assumed response model, the multiple imputation estimate \(\bar Q\) is equal to \(\hat Q\), the estimate calculated from the hypothetically complete data in the realized sample.

The second requirement is that \(\bar U\) is an unbiased estimate of \(U\). This means that, when averaged over the response indicator \(R\) sampled under the assumed response model, the estimate \(\bar U\) of the sampling variance of \(\hat Q\) is equal to \(U\), the sampling variance estimate calculated from the hypothetically complete data in the realized sample.

The third requirement is that \(B\) is a confidence valid estimate of the variance due to missing data. Equation (2.23) implies that the extra inferential uncertainty about \(\hat Q\) due to missing data is correctly reflected. On average, the estimate \(B\) of the variance due to missing data should be equal to \(V(\bar Q)\), the variance observed in the multiple imputation estimator \(\bar Q\) over different realizations of the response mechanism. This requirement is analogous to Equation (2.13) for confidence valid estimates of \(U\).

If we replace \(\geq\) in Equation (2.23) by \(=\), the procedure is said to be proper, a stricter requirement. In practice, being confidence proper is enough to obtain valid inferences.

Note that a procedure may be proper for the estimand pair \((\hat Q, U)\), while being improper for another pair \((\hat Q', U')\). Also, a procedure may be proper with respect to one response mechanism \(P(R)\), but improper with respect to an alternative mechanism \(P(R')\).

It is not always easy to check whether a certain procedure is proper. Section 2.5 describes simulation-based tools for checking the adequacy of imputations for valid statistical inference. Chapter 3 provides examples of proper and improper procedures.
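As a preview of such simulation-based checks, the rough sketch below holds a simulated complete sample fixed, repeatedly draws an MCAR response indicator \(R\), imputes and pools, and compares the averages of \(\bar Q\) and \(\bar U\) with the complete-data values \(\hat Q\) and \(U\). All settings here (the sample size, the 30% MCAR mechanism, the auxiliary variable x, the number of replications) are illustrative choices rather than a prescribed procedure, and the comparisons are approximate because of simulation error.

# Simulation sketch for conditions (2.21)-(2.23): hold Y fixed, vary R.
library(mice)
set.seed(123)
n <- 100
y <- rnorm(n, mean = 10, sd = 2)            # fixed complete sample Y
x <- rnorm(n)                               # complete auxiliary variable
qhat <- mean(y)                             # complete-data estimate
u    <- var(y) / n                          # complete-data variance U
res <- replicate(200, {
  ymis <- y
  ymis[runif(n) < 0.3] <- NA                # MCAR response indicator R
  imp <- mice(data.frame(y = ymis, x = x), method = "norm",
              m = 5, print = FALSE)
  est <- pool(with(imp, lm(y ~ 1)))$pooled
  c(qbar = est$estimate, ubar = est$ubar, b = est$b)
})
rowMeans(res)[c("qbar", "ubar")]            # compare with qhat and u
c(qhat = qhat, u = u)
(1 + 1 / 5) * mean(res["b", ])              # left-hand side of (2.23)
var(res["qbar", ])                          # V(Qbar) over response draws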

2.3.4 Scope of the imputation model

Imputation models vary in their scope. Models with a narrow scope are proper with respect to a specific estimand pair \((\hat Q, U)\) and a particular response mechanism, e.g., a particular proportion of nonresponse. Models with a broad scope are proper with respect to a wide range of estimates \(\hat Q\), e.g., subgroup means, correlations, ratios and so on, and under a large variety of response mechanisms.

The scope is related to the setting in which the data are collected. The following list distinguishes three typical situations:

  • Broad. Create one set of imputations to be used for all projects and analyses. A broad scope is appropriate for publicly released data, cohort data and registers, where different people use the data for different purposes.

  • Intermediate. Create one set of imputations per project and use this set for all analyses. An intermediate scope is appropriate for analyses that estimate relatively similar quantities. The imputer and analyst can be different persons.

  • Narrow. A separate imputed dataset is created for each analysis. The imputer and analyst are typically the same person. A narrow scope is appropriate if the imputed data are used only to estimate the same quantity. Different analyses require different imputations.

In general, imputations created under a broad scope can be applied more widely, and are preferable for that reason. On the other hand, if we have a strong scientific model for the data, or if the parameters of interest have high-stakes consequences, then a narrow scope is better because the imputation model can be informed by the complete-data model, thus making sure that all interactions, non-linearities and distributional details are adequately represented. In practice the correct model is often unknown, so the techniques discussed in this book emphasize imputations with a broader scope. Whatever is chosen, it is the responsibility of the imputer to indicate the scope of the generated imputations.

2.3.5 Variance ratios\(^\spadesuit\)

For scalar \(Q\), the ratio

\[\begin{equation} \lambda = \frac{B+B/m}{T} \tag{2.24} \end{equation}\]

can be interpreted as the proportion of the variation attributable to the missing data. It is equal to zero if the missing data do not add extra variation to the sampling variance, an exceptional situation that occurs only if we can perfectly re-create the missing data. The maximum value is 1, which occurs only if all variation is caused by the missing data; this is equally unlikely in practice, since it would mean that the observed data carry no information about \(Q\) at all. If \(\lambda\) is high, say \(\lambda>0.5\), the influence of the imputation model on the final result is larger than that of the complete-data model.

The ratio

\[\begin{equation} r = \frac{B+B/m}{\bar U} \tag{2.25} \end{equation}\]

is called the relative increase in variance due to nonresponse (Rubin 1987b, eq. 3.1.7). The quantity is related to \(\lambda\) by \(r = \lambda/(1-\lambda)\).

Another related measure is the fraction of information about \(Q\) missing due to nonresponse (Rubin 1987b, eq. 3.1.10). This measure is defined by

\[\begin{equation} \gamma = \frac{r+2/(\nu+3)}{1+r} \tag{2.26} \end{equation}\]

This measure needs an estimate of the degrees of freedom \(\nu\), and will be discussed in Section 2.3.6. The interpretations of \(\gamma\) and \(\lambda\) are similar, but \(\gamma\) is adjusted for the finite number of imputations. Both statistics are related by

\[\begin{equation} \gamma = \frac{\nu+1}{\nu+3}\lambda+\frac{2}{\nu+3}\tag{2.27} \end{equation}\]

The literature often confuses \(\gamma\) and \(\lambda\), and erroneously labels \(\lambda\) as the fraction of missing information. The values of \(\lambda\) and \(\gamma\) are almost identical for large \(\nu\), but they can differ notably for small \(\nu\).

If \(Q\) is a vector, it is sometimes useful to calculate a compromise \(\lambda\) over all elements in \(\bar Q\) as

\[\begin{equation} \bar\lambda = \left(1+\frac{1}{m}\right)\mathrm{tr}(BT^{-1})/k \tag{2.28} \end{equation}\]

where \(k\) is the dimension of \(\bar Q\), and where \(B\) and \(T\) are now \(k \times k\) matrices. The compromise expression for \(r\) is equal to

\[\begin{equation} \bar r = \left(1+\frac{1}{m}\right)\mathrm{tr}(B\bar U^{-1})/k \tag{2.29} \end{equation}\]

the average relative increase in variance.

The quantities \(\lambda\), \(r\) and \(\gamma\) as well as their multivariate analogues \(\bar\lambda\) and \(\bar r\) are indicators of the severity of the missing data problem. Fractions of missing information up to 0.2 can be interpreted as “modest,” 0.3 as “moderately large” and 0.5 as “high” (Li, Raghunathan, and Rubin 1991). High values indicate a difficult problem in which the final statistical inferences are highly dependent on the way in which the missing data were handled. Note that estimates of \(\lambda\), \(r\) and \(\gamma\) may be quite variable for low \(m\) (cf. Section 2.8).
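The scalar ratios are easy to compute by hand. The sketch below plugs in the values of \(\bar U\), \(B\), \(m\) and \(\nu\) for the age coefficient from the numerical example in Section 2.3.7 (taking \(\nu\) as given for now), and reproduces the riv, lambda and fmi columns shown there.

# Variance ratios of Equations (2.24)-(2.26) for a scalar estimand.
ubar <- 0.906; b <- 0.238; m <- 10       # age coefficient, Section 2.3.7
t      <- ubar + (1 + 1 / m) * b         # total variance, Equation (2.20)
lambda <- (1 + 1 / m) * b / t            # Equation (2.24)
r      <- (1 + 1 / m) * b / ubar         # Equation (2.25)
nu     <- 15.1                           # Barnard-Rubin df, Section 2.3.7
gamma  <- (r + 2 / (nu + 3)) / (1 + r)   # Equation (2.26)
round(c(lambda = lambda, r = r, gamma = gamma), 3)  # 0.224 0.289 0.310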

2.3.6 Degrees of freedom\(^\spadesuit\)

The degrees of freedom is the number of observations after accounting for the number of parameters in the model. With incomplete data, the calculation cannot be the same as for the complete data because part of the data is missing. The “old” formula (Rubin 1987b, eq. 3.1.6) for the degrees of freedom can be written concisely as

\[\begin{align} \nu_\mathrm{old} &= (m-1)\left(1+\frac{1}{r}\right)^2 \nonumber\\ &= \frac{m-1}{\lambda^2} \tag{2.30} \\ \end{align}\]

with \(r\) and \(\lambda\) defined as in Section 2.3.5. The lowest possible value is \(\nu_\mathrm{old} = m-1\), which occurs if essentially all variation is attributable to the nonresponse. The highest value \(\nu_\mathrm{old}=\infty\) indicates that all variation is sampling variation, either because there were no missing data, or because we could re-create them perfectly.

Barnard and Rubin (1999) noted that Equation (2.30) can produce values that are larger than the sample size in the complete data, a situation that is “clearly inappropriate.” They developed an adapted version for small samples that is free of the problem. Let \(\nu_\mathrm{com}\) be the degrees of freedom of \(\bar Q\) in the hypothetically complete data. In models that fit \(k\) parameters on data with a sample size of \(n\) we may set \(\nu_\mathrm{com}=n-k\). The estimated observed data degrees of freedom that accounts for the missing information is

\[\begin{equation} \nu_\mathrm{obs} = \frac{\nu_\mathrm{com}+1}{\nu_\mathrm{com}+3}\nu_\mathrm{com}(1-\lambda) \tag{2.31} \end{equation}\]

The adjusted degrees of freedom to be used for testing in multiple imputation can be written concisely as

\[\begin{equation} \nu = \frac{\nu_\mathrm{old} \nu_\mathrm{obs}} {\nu_\mathrm{old}+\nu_\mathrm{obs}} \tag{2.32} \end{equation}\]

The quantity \(\nu\) is always less than or equal to \(\nu_\mathrm{com}\). If \(\nu_\mathrm{com}=\infty\), then Equation (2.32) reduces to (2.30). If \(\lambda=0\) then \(\nu=\nu_\mathrm{com}\), and if \(\lambda=1\) we find \(\nu=0\). Distributions with zero degrees of freedom are nonsensical, so for \(\nu<1\) we should refrain from any testing due to lack of information.
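The three formulas combine into a short function. The minimal sketch below defines a hypothetical helper barnard_rubin() (mice applies its own internal version of this correction) and reproduces the df value for age from the numerical example in Section 2.3.7.

# Barnard-Rubin degrees of freedom, Equations (2.30)-(2.32).
barnard_rubin <- function(m, lambda, dfcom) {
  nu_old <- (m - 1) / lambda^2                      # Equation (2.30)
  nu_obs <- (dfcom + 1) / (dfcom + 3) *
    dfcom * (1 - lambda)                            # Equation (2.31)
  nu_old * nu_obs / (nu_old + nu_obs)               # Equation (2.32)
}
barnard_rubin(m = 10, lambda = 0.224, dfcom = 23)   # about 15.1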

Alternative corrections were proposed by Reiter (2007) and Lipsitz, Parzen, and Zhao (2002). Wagstaff and Harel (2011) compared the four methods, and concluded that the small-sample methods by Barnard-Rubin and Reiter performed satisfactorily.

2.3.7 Numerical example

Many quantities introduced in the previous sections can be obtained by the pool() function in mice. The following code imputes the nhanes dataset, fits a simple linear model and pools the results:

library(mice)
imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
fit <- with(imp, lm(bmi ~ age))
est <- pool(fit)
est
Class: mipo    m = 10 
            estimate  ubar     b    t dfcom   df   riv lambda
(Intercept)    30.50 3.408 1.454 5.01    23 12.4 0.469  0.319
age            -2.13 0.906 0.238 1.17    23 15.1 0.289  0.224
              fmi
(Intercept) 0.408
age         0.310

The column estimate is the value of \(\bar Q\) as defined in Equation (2.16). Columns ubar, b and t are the variance estimates from Equations (2.18), (2.19) and (2.20), respectively. Column dfcom is the degrees of freedom \(\nu_\mathrm{com}\) in the hypothetically complete data, and df is the degrees of freedom \(\nu\) after the Barnard-Rubin correction. The last three columns are the relative increase in variance \(r\), the proportion of variance due to nonresponse \(\lambda\) and the fraction of missing information \(\gamma\) per parameter.
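To work further with these numbers, the pooled quantities can be extracted from the mipo object, for example:

est$pooled                      # data frame holding the columns shown above
summary(est, conf.int = TRUE)   # estimates with t-based confidence intervals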