## 2.8 How many imputations?

One of the distinct advantages of multiple imputation is that it can produce unbiased estimates with correct confidence intervals from a low number of imputed datasets, even as low as $$m=2$$. Multiple imputation works with low $$m$$ because the total variance $$T=\bar U+(1+m^{-1})B$$ inflates the between-imputation variance $$B$$ by the factor $$1+m^{-1}$$, which compensates for the simulation error introduced by using a finite number of imputations.

The classic advice is to use a low number of imputations, somewhere between 3 and 5, for moderate amounts of missing information. Several authors have investigated the influence of $$m$$ on various aspects of the results. The picture emerging from this work is that it is often beneficial to set $$m$$ higher, somewhere in the range of 20–100 imputations. This section reviews the relevant work in the area.

The advice for low $$m$$ rests on the following argument. Multiple imputation is a simulation technique, and hence $$\bar Q$$ and its variance estimate $$T$$ are subject to simulation error. Setting $$m=\infty$$ causes all simulation error to disappear, so $$T_{\infty}<T_m$$ for any finite $$m$$. The question is when $$T_m$$ is close enough to $$T_{\infty}$$. Rubin (1987b, 114) showed that the two variances are related by

$\begin{equation} T_m=\left(1+\frac{\gamma_0}{m}\right)T_{\infty} \tag{2.38} \end{equation}$

where $$\gamma_0$$ is the (true) population fraction of missing information. This quantity is equal to the expected fraction of observations missing if $$Y$$ is a single variable without covariates, and commonly less than this if there are covariates that predict $$Y$$. For example, for $$\gamma_0=0.3$$ (e.g., a single variable with 30% missing) and $$m=5$$ we find that the calculated variance $$T_m$$ is $$1+0.3/5 = 1.06$$ times (i.e., 6%) larger than the ideal variance $$T_{\infty}$$. The corresponding confidence interval would thus be $$\sqrt{1.06}=1.03$$ (i.e., 3%) longer than the ideal confidence interval based on $$m=\infty$$. Increasing $$m$$ to 10 or 20 would bring the extra length down to 1.5% and 0.7%, respectively. The argument is that “the additional resources that would be required to create and store more than a few imputations would not be well spent” (Schafer 1997, 107), and “in most situations there is simply little advantage to producing and analyzing more than a few imputed datasets” (Schafer and Olsen 1998, 549).
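To make the argument concrete, the short Python sketch below (the helper names are mine, not from the text) reproduces these factors from Equation (2.38):

```python
def rel_variance(gamma0, m):
    """Variance ratio T_m / T_inf = 1 + gamma0 / m  (Equation 2.38)."""
    return 1 + gamma0 / m

def rel_ci_length(gamma0, m):
    """Relative confidence-interval length: sqrt(T_m / T_inf)."""
    return rel_variance(gamma0, m) ** 0.5

# Single variable with 30% missing values.
gamma0 = 0.3
for m in (5, 10, 20):
    extra = 100 * (rel_ci_length(gamma0, m) - 1)
    print(f"m={m:2d}: variance x{rel_variance(gamma0, m):.3f}, "
          f"CI {extra:.1f}% longer")
# m= 5: variance x1.060, CI 3.0% longer
# m=10: variance x1.030, CI 1.5% longer
# m=20: variance x1.015, CI 0.7% longer
```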

Royston (2004) observed that the length of the confidence interval also depends on $$\nu$$, and thus on $$m$$ (cf. Equation (2.30)). He suggested basing the criterion for $$m$$ on the confidence coefficient $$t_{\nu}\sqrt{T}$$, and proposed that the coefficient of variation of $$\ln(t_{\nu}\sqrt{T})$$ should be smaller than 0.05. This effectively constrains the range of uncertainty about the confidence interval to roughly within 10%. This rule requires $$m$$ to be “at least 20 and possibly more.”

Graham, Olchowski, and Gilreath (2007) investigated the effect of $$m$$ on the statistical power of a test for detecting a small effect size $$(< 0.1)$$. Their advice is to set $$m$$ high in applications where high statistical power is needed. For example, for $$\gamma_0=0.3$$ and $$m=5$$ the statistical power obtained is 73.1% instead of the theoretical value of 78.4%. We need $$m=20$$ to increase the power to 78.2%. To attain power within 1% of the theoretical value for fractions of missing information $$\gamma$$ = (0.1, 0.3, 0.5, 0.7, 0.9), we need to set $$m$$ = (20, 20, 40, 100, $$> 100$$), respectively.

Bodner (2008) explored the variability of three quantities under various $$m$$: the width of the 95% confidence interval, the $$p$$-value, and $$\gamma_0$$. Bodner selected $$m$$ such that the width of the 95% confidence interval is within 10% of its true value 95% of the time. For $$\gamma_0$$ = (0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9), he recommends $$m$$ = (3, 6, 12, 24, 59, 114, 258), respectively. Since the true $$\gamma_0$$ is unknown, Bodner suggested the proportion of complete cases as a conservative estimate of $$\gamma_0$$. Von Hippel (2018) showed that the relation between $$m$$ and $$\gamma_0$$ is better explained by a quadratic rule

$\begin{equation} m = 1 + \frac{1}{2}\left(\frac{\gamma_0}{\mathrm{SD}(\sqrt{U_\ell})/\mathrm{E}(\sqrt{U_\ell})}\right)^2 \tag{2.39} \end{equation}$

where $$\mathrm{E}(\sqrt{U_\ell})$$ and $$\mathrm{SD}(\sqrt{U_\ell})$$ are the mean and standard deviation of the standard errors calculated from the imputed datasets, so their ratio $$\mathrm{SD}(\sqrt{U_\ell})/\mathrm{E}(\sqrt{U_\ell})$$ is the coefficient of variation of these standard errors. The rule is used in a two-step procedure, where the first step estimates $$\gamma_0$$ and its 95% confidence interval. The upper limit of the confidence interval is then plugged into Equation (2.39). Compared to Bodner's recommendations, the rule suggests somewhat lower $$m$$ if $$\gamma_0 < 0.5$$ and substantially higher $$m$$ if $$\gamma_0 > 0.5$$.
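As a numerical sketch of the two-step procedure, the Python helper below (the function name is hypothetical, and reading the denominator of Equation (2.39) as the coefficient of variation of the pilot standard errors is my assumption) plugs the upper limit of the estimated $$\gamma_0$$ into the quadratic rule:

```python
import math
import statistics

def quadratic_rule_m(gamma_upper, pilot_ses):
    """Sketch of Von Hippel's quadratic rule (Equation 2.39).

    gamma_upper: upper 95% limit of the estimated fraction of missing
                 information from a small pilot run of imputations.
    pilot_ses:   standard error of the estimate in each pilot imputed
                 dataset; their coefficient of variation SD/E is taken
                 as the denominator of the rule (assumption).
    """
    cv = statistics.stdev(pilot_ses) / statistics.mean(pilot_ses)
    return math.ceil(1 + 0.5 * (gamma_upper / cv) ** 2)

# Pilot standard errors 0.9, 1.0, 1.1 (coefficient of variation 0.1)
# and an upper limit of 0.5 for the fraction of missing information:
print(quadratic_rule_m(0.5, [0.9, 1.0, 1.1]))  # -> 14
```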

The starting point of White, Royston, and Wood (2011) is that all essential quantities in the analysis should be reproducible within some limit, including confidence intervals, $$p$$-values and estimates of the fraction of missing information. They adopt a rule of thumb from Von Hippel (2009): the number of imputations should be similar to the percentage of cases that are incomplete. This rule applies to fractions of missing information up to 0.5. If $$m \approx 100\lambda$$, where $$\lambda$$ is the fraction of missing information, the following properties will hold for a parameter $$\beta$$:

1. The Monte Carlo error of $$\hat\beta$$ is approximately 10% of its standard error;

2. The Monte Carlo error of the test statistic $$\hat\beta/\mathrm{se}(\hat\beta)$$ is approximately 0.1;

3. The Monte Carlo error of the $$p$$-value is approximately 0.01 when the true $$p$$-value is 0.05.

White, Royston, and Wood (2011) suggest these criteria provide an adequate level of reproducibility in practice. The idea of reproducibility is sensible and the rule is simple to apply, so there is much to commend it. The rule has now become the de facto standard, especially in medical applications. One potential difficulty is that the percentage of complete cases is sensitive to the number of variables in the data: if we extend the active dataset by adding more variables, the percentage of complete cases can only drop. An alternative would be to use the average missing data rate as a less conservative estimate.
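The rule of thumb is easy to automate. A minimal Python sketch (the function name is mine; imputation software such as mice works in R):

```python
import math

def choose_m(n_incomplete, n_total):
    """Rule of thumb of White, Royston and Wood (2011): set m roughly
    equal to the percentage of incomplete cases.  Sensible for
    fractions of missing information up to about 0.5."""
    pct_incomplete = 100 * n_incomplete / n_total
    return max(1, math.ceil(pct_incomplete))

print(choose_m(23, 100))  # 23% incomplete cases -> m = 23
```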

Theoretically it is always better to use higher $$m$$, but this involves more computation and storage. Setting $$m$$ very high (say $$m = 200$$) may be useful for low-level estimands that are very uncertain, and for which we want to approximate the full distribution, or for parameters that are notoriously difficult to estimate, like variance components. On the other hand, setting $$m$$ high may not be worth the extra wait if the primary interest is in the point estimates (and not in standard errors, $$p$$-values, and so on). In that case, using $$m = 5$$–$$20$$ will be enough under moderate missingness.

Imputing a dataset in practice often involves trial and error to adapt and refine the imputation model. Such initial explorations do not require large $$m$$. It is convenient to set $$m=5$$ during model building, and increase $$m$$ only after being satisfied with the model for the “final” round of imputation. So if calculation is not prohibitive, we may set $$m$$ to the average percentage of missing data. The substantive conclusions are unlikely to change as a result of raising $$m$$ beyond $$m=5$$.
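For the final round, the average missing data rate can be computed directly from the missingness pattern. A small illustrative Python sketch (the names and the toy pattern are mine; in practice the mask would come from the response indicators of the actual data):

```python
def average_missing_rate(mask):
    """Proportion of missing cells; mask holds True where a value is missing."""
    cells = [is_missing for row in mask for is_missing in row]
    return sum(cells) / len(cells)

# Toy 4 x 3 missingness pattern: 4 of 12 cells are missing.
mask = [
    [False, True,  False],
    [False, False, False],
    [True,  True,  False],
    [False, False, True],
]
rate = average_missing_rate(mask)
m_final = max(5, round(100 * rate))  # never below the model-building m = 5
print(m_final)  # -> 33
```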