5.3 Multi-parameter inference

There are many situations where we need to test whether a set of parameters is significantly different from zero. For example, if a categorical variable enters the analysis through a set of dummy variables, all parameters related to this set should be tested simultaneously. More generally, ANOVA type of designs can be formulated and tested as a multi-parameter inference regression problem. Van Ginkel and Kroonenberg (2014) provide many practical examples with missing data.

Schafer (1997) distinguished three types of statistics in multi-parameter tests: \(D_1\) (multivariate Wald test), \(D_2\) (combining test statistics) and \(D_3\) (likelihood ratio test). The next sections outline the idea of each approach, and demonstrate how these tests can be performed as part of the repeated data analyses .

5.3.1 \(D_1\) Multivariate Wald test

The multivariate Wald test is an extension of the procedure for scalar quantities as described in Section 2.4.2. The procedure tests whether \(Q = Q_0\), where \(Q_0\) is a \(k\)-vector of values under the null hypothesis (typically all zeros). The multivariate Wald test requires an estimate of the variance-covariance matrix \(U\) of \(\bar Q\). We could use \(T\) from Equation (2.20), but this estimate can be unreliable. The problem is that for small \(m\), the estimate of the between-imputation variance \(B\) is unstable, and if \(m\leq k\), it is not even full rank. Thus \(T\) can be unreliable if \(B\) is a substantial component of \(T\).

Li, Raghunathan, and Rubin (1991) proposed an estimate of \(T\) in which \(B\) and \(\bar U\) are assumed to be proportional to each other. A more stable estimate of the total variance is then \(\tilde T = (1+r_1)\bar U\), where \(r_1 = \bar r\) (from Equation (2.29)) is the average fraction of missing information.

Under the assumption that \((Q_0 - \bar Q)\) is sufficiently close to a normal distribution, the \(D_1\)-statistic

\[ D_1 = (\bar Q-Q_0)'\tilde T^{-1}(\bar Q-Q_0)/k \tag{5.3} \]

follows an \(F\)-distribution \(F_{k,\nu_w}\) with \(k\) and \(\nu_1\) degrees of freedom, where

\[ \nu_1 = \left\{\begin{array} {l@{\quad}l} 4 + (t-4)[1+(1-2t^{-1})r_1^{-1}]^2 & \mathrm{if} \quad t = k(m-1) > 4\\ t(1+k^{-1})(1+r_1^{-1})^2/2 & \mathrm{otherwise}\tag{5.4} \end{array} \right. \]

The \(p\)-value for \(D_1\) is

\[ P_1 = \Pr[F_{k,\nu_1}>D_1] \tag{5.5} \]

The assumption that the fraction of missing information is the same across all variables and statistics is unlikely to hold in practice. However, Li, Raghunathan, and Rubin (1991) provide encouraging simulation results for situations where this assumption is violated. Except for some extreme cases, the level of the procedure was close to the nominal level, while the loss of power from such violations was modest.

The work of Li, Raghunathan, and Rubin (1991) is based on large samples. Reiter (2007) developed a small-sample version for the degrees of freedom using ideas similar to Barnard and Rubin (1999). Reiter’s \(\nu_f\) spans several lines of text, and is not given here. A simulation study conducted by Reiter showed marked improvements over the earlier formulation, especially in smaller samples. Simulation work by Grund, Lüdtke, and Robitzsch (2016 b) and Liu and Enders (2017) confirmed that for small samples (say \(n < 50\)) \(\nu_f\) is more conservative than \(\nu_1\), and produced type I errors rates closer to their nomimal value. Raghunathan (2015) recently provided an elegant alternative based on Equation (2.32) with \(\nu_\mathrm{obs}\) substituted as \(\nu_\mathrm{obs} = (\nu_\mathrm{com}+1)\nu_\mathrm{com}/(\nu_\mathrm{com}+3)(1+\bar r)\). It is not yet known how this correction compares to \(\nu_1\) and \(\nu_f\).

The mice package implements the multivariate Wald test as the D1() function. Let us impute the nhanes2 data, and fit the linear regression of chl on age and bmi.

imp <- mice(nhanes2, m = 10, print = FALSE, seed = 71242)
m2 <- with(imp, lm(chl ~ age + bmi))
pool(m2)

Class: mipo    m = 10 
            estimate    ubar       b       t dfcom    df   riv
(Intercept)    -2.98 2896.71 1266.44 4289.79    21 11.28 0.481
age40-59       45.73  296.37  152.46  464.07    21 10.43 0.566
age60-99       65.62  342.71  229.04  594.66    21  9.08 0.735
bmi             6.40    3.26    1.41    4.81    21 11.32 0.477
            lambda   fmi
(Intercept)  0.325 0.419
age40-59     0.361 0.456
age60-99     0.424 0.519
bmi          0.323 0.418

We want to simplify the model by testing for age. Since age is a categorical variable with three categories, removing it involves deleting two columns at the same time, hence the univariate Wald test does not apply. The solution is to fit the model without age, and run the multivariate Wald statistic to test whether the model estimates are different.

m1 <- with(imp, lm(chl ~ bmi))
summary(D1(m2, m1))


Models:
 model         formula
     1 chl ~ age + bmi
     2       chl ~ bmi

Comparisons:
   test statistic df1  df2 df.com p.value  riv
 1 ~~ 2      4.23   2 14.4     21   0.036 0.63

Number of imputations:  10   Method D1

Since the Wald test is significant, removing age from the model reduces its predictive power.

5.3.2 \(D_2\) Combining test statistics\(^\spadesuit\)

The multivariate Wald test may become cumbersome when \(k\) is large, or when many tests are to be done. Some analytic models may not produce an estimate of \(Q\), or of its variance-covariance matrix. For example, nonparametric tests like the sign test or Wilcoxon-Mann-Whitney produce a \(p\)-value, and no estimate of \(Q\). In cases like these, we can still calculate a combined significance test using the \(m\) test statistics (e.g., \(p\)-values, \(Z\)-values, \(\chi^2\)-values, \(t\)-values) as input.

Rubin (1987 b, 87) and Li et al. (1991) describe a procedure for pooling the values of the test statistics. Suppose that \(d_\ell\) is the test statistic obtained from the analysis of the \(\ell^\mathrm{th}\) imputed dataset \(Y_\ell\), \(\ell=1,\dots,m\). Let \(\bar d = m^{-1}\sum_\ell d_\ell\) be the average test statistic. The statistic for the combined test is

\[ D_2 = \frac{\bar dk^{-1}-(m+1)(m-1)^{-1}r_2}{1+r_2} \tag{5.6} \]

where the relative increase of the variance is calculated as

\[ r_2 = \left(1+\frac{1}{m}\right)\frac{1}{m-1}\sum_{\ell=1}^m\left(\sqrt{d_\ell}-\overline{\sqrt{d}}\right)^2 \tag{5.7} \]

with \(\overline{\sqrt{d}} = m^{-1}\sum_\ell \sqrt{d_\ell}\), so that \(r_2\) equals the sample variance of \(\sqrt{d_1}\), \(\sqrt{d_2}\), …, \(\sqrt{d_m}\) multiplied by \((1 + 1/m)\). The \(p\)-value for testing the null hypothesis is

\[ P_2 = \Pr[F_{k,\nu_2} > D_2] \tag{5.8} \]

where

\[ \nu_2 = k^{-3/m}(m-1)(1+r_2^{-1})^2 \tag{5.9} \]

The procedure assumes that the test statistic is approximately normally distributed. This is clearly not the case for \(p\)-values, which follow a uniform distribution under the null. One may transform the \(p\)-values to approximate normality, combine and back-transform afterwards. Based on this idea, Licht (2010, 40–43) proposed a method for obtaining significance levels from repeated \(p\)-values similar to Equation (5.6) with custom \(r_2\) and \(\nu_2\).

In context of significance testing for logistic regression, Eekhout, Wiel, and Heymans (2017) suggest taking the median of the \(m\) \(p\)-values as the combined \(p\)-value, an exceedingly simple method. It nevertheless appears to outperform more sophisticated techniques if the variable to be tested is categorical with more than two categories. It would be useful to explore whether this median \(P\) rule has wider validity.

Let us continue with the previous example. Suppose that our software cannot export the variance-covariance matrix in each repeated analysis, but it does provide a table with the Wald statistics for testing age. The D2 function calculates the \(D_2\)-statistic and its degrees of freedom as

D2(m2, m1)

   test statistic df1  df2 df.com p.value  riv
 1 ~~ 2      2.26   2 17.6     NA   0.134 1.81

In contrast to the previous analysis, observe that the \(D_2\)-statistic is not significant at an \(\alpha\)-level of 0.05. The reason is that the \(D_2\) test is less informed by the data, and hence less powerful than the \(D_1\) test.

5.3.3 \(D_3\) Likelihood ratio test\(^\spadesuit\)

The likelihood ratio test (Meng and Rubin 1992) is designed to handle situations where one cannot obtain the covariance matrices of the complete-data estimates. This could be the case if the dimensionality of \(Q\) is high, which can occur with partially classified contingency tables. For large \(n\) the procedure is equivalent to the method of Section 5.3.1. The likelihood ratio test is the preferred method for testing random effects (Singer and Willett 2003), and connects to global model fit statistics in structural equation models (Enders and Mansolf 2018).

Let the vector \(Q\) contain the parameters of interest. We wish to test the hypothesis \(Q=Q_0\) for some given \(Q_0\). The usual scenario is that we compare two models, one where \(Q\) can vary freely and one more restrictive model that constrains \(Q=Q_0\).

The procedure for calculating the likelihood ratio test is as follows. First, estimate \(\bar Q\) (for the full model) and \(\bar Q_0\) (for the restricted model) from the \(m\) datasets by Rubin’s rules. Calculate the value of the log-likelihood functions \(l(\hat Q_\ell)\) (for the full model) and \(l(\hat Q_{0,\ell})\) (for the restricted model), and determine the average of the likelihood ratio tests across the \(m\) datasets, i.e.,

\[ \hat d = m^{-1} \sum_\ell -2(l(\hat Q_{0,\ell}) - l(\hat Q_\ell))\tag{5.10} \]

Then re-estimate the full and restricted models, with their model parameters fixed to \(\bar Q\) and \(\bar Q_0\), respectively, and average the corresponding likelihood ratio tests as

\[ \bar d = m^{-1} \sum_\ell -2(l(\bar Q_{0,\ell}) - l(\bar Q_\ell))\tag{5.11} \]

The test statistic proposed by Meng and Rubin (1992) is

\[ D_3 = \frac{\bar d}{k(1+r_3)} \tag{5.12} \]

where

\[ r_3 = \frac{m+1}{k(m-1)}(\hat d-\bar d)\tag{5.13} \]

estimates the average relative increase in variance due to nonresponse. The quantity \(r_3\) is asymptotically equivalent to \(\bar r\) from Equation (2.29). The \(p\)-value for \(D_3\) is equal to

\[ P_3 = \Pr[F_{k,\nu_3} > D_3]\tag{5.14} \]

where \(\nu_3=\nu_1\), or equal Reiter’s correction for small samples.

The likelihood ratio test does not require normality. For complete data, the likelihood ratio test is invariant to scale changes, which is the reason that many prefer the likelihood ratio scale over the Wald test. However, Schafer (1997, 118) observed that the invariance property is lost in multiple imputation because the averaging operations in Equations (5.10) and (5.11) may yield somewhat different results under nonlinear transformations of \(l(Q)\). He advised that the best results will be obtained if the distribution of \(Q\) is approximately normal. One may transform the parameters to achieve normality, provided that appropriate care is taken to infer that the result is still within the allowable parameter space.

Liu and Enders (2017) found in their simulations that \(D_3\) can become negative, a nonsensical value, in some scenarios. They suggest that a value of \(r_3 > 10\) or a 1000% increase in sampling variance due to missing data may act as warning signals for this anomaly.

Routine use of the likelihood ratio statistic has long been hampered by difficulties in calculating the likelihood ratio tests for the models with fixed parameters \(\bar Q\) and \(\bar Q_0\). With the advent of the broom package (Robinson 2017), the calculations have become feasible for a wide class of models. The D3() function in mice can be used to calculate the likelihood ratio test. We apply it to the data from previous examples by

D3(m2, m1)

   test statistic df1 df2 df.com p.value riv
 1 ~~ 2      9.57   2  18    Inf 0.00147 766

The \(D_3\)-statistic strongly indicates that age is a significant predictor. Note however the extremely large value for \(r_3\) (column riv), so the result must be taken with a grain of salt. The likely cause for this anomaly could well be the lack of a small-sample correction for this test.

5.3.4 \(D_1\), \(D_2\) or \(D_3\)?

If the estimates are approximately normal and if the software can produce the required variance-covariance matrices, we recommend using \(D_1\) with an adjustment for small samples if \(n < 100\). \(D_1\) is a direct extension of Rubin’s rules to multi-parameter problems, theoretically convincing, mature and widely applied. \(D_1\) is insensitive to the assumption of equal fractions of missing information, is well calibrated, works well with small \(m\) (unless the fractions of information are large and variable) and suffers only modest loss of power. The relevant literature (Rubin 1987 b; Li, Raghunathan, and Rubin 1991; Reiter 2007; Grund, Lüdtke, and Robitzsch 2016 b; Liu and Enders 2017) is quite consistent.

If only the test statistics are available for pooling, then the \(D_2\)-statistic is a good option, provided that the number of imputations \(m > 20\). The test is easy to calculate and applies to different test statistics. For \(m < 20\), the power may be low. \(D_2\) tends to become optimistic for high fractions of missing information \((> 0.3)\), and this effect unfortunately increases with sample size (Grund, Lüdtke, and Robitzsch 2016 b). Thus, careless application of \(D_2\) to large datasets with many missing values may yield high rates of false positives.

The likelihood ratio statistic \(D_3\) is theoretically sound. Calculation of \(D_3\) requires refitting the repeated analysis models with the estimates constrained to their pooled values. This was once an issue, but probably less so in the future. \(D_3\) is asymptotically equivalent to \(D_1\), and may be preferred for theoretical reasons: it does not require normality in the complete-data model, it is often more powerful and it may be more stable than if \(k\) is large (as \(\bar U\) need not be inverted). Grund, Lüdtke, and Robitzsch (2016 b), Liu and Enders (2017) and Eekhout, Wiel, and Heymans (2017) found that \(D_3\) produces Type 1 error rates that were comparable to \(D_1\). \(D_3\) tends to be somewhat conservative in smaller samples, especially with high fractions of missing information and with high \(k\). Also, \(D_3\) has lower statistical power in some of the extreme scenarios. For small samples, \(D_1\) has a slight edge over \(D_3\), so given the current available evidence \(D_1\) is the better option for \(n < 200\). In larger samples (\(n \geq 200\)) \(D_1\) and \(D_3\) appear equally good, so the choice between them is mostly a matter of convenience.