## 4.4 Joint modeling

### 4.4.1 Overview

Joint modeling (JM) starts from the assumption that the data can be described by a multivariate distribution. Assuming ignorability, imputations are created as draws from the fitted distribution. The model can be based on any multivariate distribution. The multivariate normal distribution is most widely applied.

The general idea is as follows. For a general missing data pattern, missing data can occur anywhere in $$Y$$, so in practice the distribution from which imputations are to be drawn varies from row to row. For example, if the missingness pattern of row $$i$$ is $$r_{[i]} = (0, 0, 1, 1)$$, then we need to draw imputations from the bivariate distribution $$P_i(Y_1^\mathrm{mis},Y_2^\mathrm{mis}|Y_3, Y_4, \phi_{1,2})$$, whereas if $$r_{[i']} = (0, 1, 1, 1)$$ we need draws from the univariate distribution $$P_{i'}(Y_1^\mathrm{mis}|Y_2, Y_3, Y_4, \phi_1)$$.
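Under a multivariate normal model, such pattern-specific draws are conditional normal distributions. The following is a minimal Python sketch, assuming the parameters $$\theta = (\mu, \Sigma)$$ are known; the variable names and the example parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (illustrative) parameters theta = (mu, Sigma) of a 4-variate normal
mu = np.zeros(4)
Sigma = np.eye(4) + 0.5 * (np.ones((4, 4)) - np.eye(4))

def draw_conditional(y_obs, mis, obs, mu, Sigma, rng):
    """Draw Y_mis | Y_obs = y_obs under Y ~ N(mu, Sigma)."""
    S_mm = Sigma[np.ix_(mis, mis)]
    S_mo = Sigma[np.ix_(mis, obs)]
    S_oo = Sigma[np.ix_(obs, obs)]
    # Regression of the missing part on the observed part
    beta = S_mo @ np.linalg.inv(S_oo)
    cond_mu = mu[mis] + beta @ (y_obs - mu[obs])
    cond_Sigma = S_mm - beta @ S_mo.T
    return rng.multivariate_normal(cond_mu, cond_Sigma)

# Pattern r = (0, 0, 1, 1): impute (Y1, Y2) given observed (Y3, Y4)
imp = draw_conditional(np.array([1.0, -1.0]), [0, 1], [2, 3], mu, Sigma, rng)
```

Each missingness pattern selects a different partition of $$(\mu, \Sigma)$$, so the imputation model indeed varies from row to row.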

### 4.4.2 Continuous data

Under the assumption of multivariate normality $$Y\sim N(\mu,\Sigma)$$, the $$\phi$$-parameters of these imputation models are functions of $$\theta=(\mu, \Sigma)$$ (Schafer 1997, 157). The sweep operator transforms $$\theta$$ into $$\phi$$ by converting outcome variables into predictors, while the reverse sweep operator allows for the inverse operation (Beaton 1964). The sweep operators allow rapid calculation of the $$\phi$$ parameters for imputation models that pertain to different missing data patterns. For reasons of efficiency, rows can be grouped by missing data pattern. See Little and Rubin (2002, 148–56) and Schafer (1997, 157–63) for computational details.
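To make the sweep operator concrete, here is a small Python sketch of the standard sweep of Beaton (1964) applied to a covariance matrix of centered variables; sweeping on position $$k$$ turns variable $$k$$ into a predictor, so the swept matrix contains regression weights and residual variances.

```python
import numpy as np

def sweep(G, k):
    """Sweep the symmetric matrix G on position k (Beaton 1964)."""
    G = np.asarray(G, dtype=float).copy()
    d = G[k, k]
    row = G[k, :].copy()
    col = G[:, k].copy()
    G -= np.outer(col, row) / d   # adjust all entries
    G[k, :] = row / d             # pivot row becomes coefficients
    G[:, k] = col / d             # pivot column becomes coefficients
    G[k, k] = -1.0 / d            # pivot entry holds -1/variance
    return G

# Covariance of two centered variables with covariance 0.5
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
G = sweep(S, 0)
# G[0, 1] is the slope of Y2 on Y1 (0.5), G[1, 1] the residual
# variance (1 - 0.5^2 = 0.75)
```

Applying the reverse sweep operator on the same position undoes this step, which is what makes moving between $$\theta$$ and the pattern-specific $$\phi$$ parameters cheap.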

The $$\theta$$-parameters are usually unknown, and for non-monotone missing data it is generally difficult to estimate them from $$Y_\mathrm{obs}$$ directly. The solution is to iterate imputation and parameter estimation using a general algorithm known as data augmentation (Tanner and Wong 1987). At step $$t$$, the algorithm draws $$Y_\mathrm{mis}$$ and $$\theta$$ by alternating the following steps:

\begin{align} \dot Y_\mathrm{mis}^t &\sim P(Y_\mathrm{mis}|Y_\mathrm{obs}, \dot\theta^{t-1})\\ \dot\theta^t &\sim P(\theta|Y_\mathrm{obs}, \dot Y_\mathrm{mis}^t) \end{align}

where imputations from $$P(Y_\mathrm{mis}|Y_\mathrm{obs}, \dot\theta^{t-1})$$ are drawn by the method described in the previous section, and draws from the parameter distribution $$P(\theta|Y_\mathrm{obs}, \dot Y_\mathrm{mis}^t)$$ are generated according to the method of Schafer (1997, 184).

Algorithm 4.2 (Imputation of missing data by a joint model for multivariate normal data.$$^\spadesuit$$)

1. Sort the rows of $$Y$$ into $$S$$ missing data patterns $$Y_{[s]}, s=1,\dots,S$$.

2. Initialize $$\theta^0 = (\mu^0, \Sigma^0)$$ by a reasonable starting value.

3. Repeat for $$t = 1,\dots,M$$.

4. Repeat for $$s=1,\dots,S$$.

5. Calculate parameters $$\dot\phi_s = \mathrm{SWP}(\dot\theta^{t-1}, s)$$ by sweeping the predictors of pattern $$s$$ out of $$\dot\theta^{t-1}$$.

6. Calculate $$p_s$$ as the number of missing data in pattern $$s$$. Calculate $$o_s=p-p_s$$.

7. Calculate the Cholesky decomposition $$C_s$$ of the $$p_s \times p_s$$ submatrix of $$\dot\phi_s$$ corresponding to the missing data in pattern $$s$$.

8. Draw a random vector $$z$$ of $$p_s$$ independent $$N(0,1)$$ variates.

9. Take $$\dot\beta_s$$ as the $$o_s \times p_s$$ submatrix of $$\dot\phi_s$$ containing the regression weights.

10. Calculate imputations $$\dot Y_{[s]}^t = Y_{[s]}^\mathrm{obs}\dot\beta_s + C_s^\prime z$$, where $$Y_{[s]}^\mathrm{obs}$$ is the observed data in pattern $$s$$.

11. End repeat $$s$$.

12. Draw $$\dot\theta^t=(\dot\mu, \dot\Sigma)$$ from the normal-inverse-Wishart distribution according to Schafer (1997, 184).

13. End repeat $$t$$.
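The steps above can be sketched in Python for the simplest case of a single missing data pattern (bivariate data with $$Y_2$$ missing in some rows). This is a toy illustration, not the full algorithm: step 12 is simplified to a plug-in update of $$\theta$$ rather than a draw from the normal-inverse-Wishart posterior, and all data and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bivariate normal data; Y2 is missing in about 30% of rows (one pattern)
n = 200
Y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
mis = rng.random(n) < 0.3
Y[mis, 1] = np.nan

# Step 2: initialize theta from the complete rows
cc = ~mis
mu = Y[cc].mean(axis=0)
Sigma = np.cov(Y[cc].T)

for t in range(20):                                   # step 3
    # Steps 5-10 for the single pattern: regress Y2 on Y1
    beta = Sigma[0, 1] / Sigma[0, 0]
    b0 = mu[1] - beta * mu[0]
    res_var = Sigma[1, 1] - beta * Sigma[0, 1]        # residual variance
    z = rng.standard_normal(mis.sum())                # step 8
    Y[mis, 1] = b0 + beta * Y[mis, 0] + np.sqrt(res_var) * z  # step 10
    # Step 12 simplified: plug-in estimate of theta instead of a draw
    # from the normal-inverse-Wishart posterior (Schafer 1997, 184)
    mu = Y.mean(axis=0)
    Sigma = np.cov(Y.T)
```

In the full algorithm, sweeping $$\dot\theta^{t-1}$$ replaces the explicit regression formulas, and the proper posterior draw of $$\theta$$ is what makes the imputations reflect parameter uncertainty.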

Algorithm 4.2 lists the major steps needed to impute multivariate missing data under the normal model. Additional background can be found in Li (1988), Rubin and Schafer (1990) and Schafer (1997). Song and Belin (2004) generated multiple imputations under the common factor model. The performance of the method was found to be similar to that of the multivariate normal distribution, the main pitfall being the danger of setting the numbers of factors too low. Audigier, Husson, and Josse (2016) proposed an imputation method based on Bayesian principal components analysis, and suggested it as an alternative to regularize data with more columns than rows.

Schafer (1997, 211–18) reported simulations that showed that imputations generated under the multivariate normal model are robust to non-normal data. Demirtas, Freels, and Yucel (2008) confirmed this claim in a more extensive simulation study. The authors conclude that “imputation under the assumption of normality is a fairly reasonable tool, even when the assumption of normality is clearly violated; the fraction of missing information is high, especially when the sample size is relatively large.” It is often beneficial to transform the data before imputation toward normality, especially if the scientifically interesting parameters are difficult to estimate, like quantiles or variances. For example, we could apply a logarithmic transformation before imputation to remove skewness, and apply an exponential transformation after imputation to revert to the original scale. See Von Hippel (2013) for a cautionary note.
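The log/exponential round trip mentioned above is easy to illustrate. The sketch below, with invented data, shows how a log transformation removes the skewness of a right-skewed variable before imputation, and how the exponential maps imputed values back to the original scale.

```python
import numpy as np

rng = np.random.default_rng(2)

def skew(x):
    """Sample skewness."""
    x = np.asarray(x)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# A right-skewed variable (e.g., income-like)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

z = np.log(y)        # transform toward normality before imputation
# ... impute z under a normal model here ...
y_back = np.exp(z)   # back-transform imputed values to the original scale
```

The transformed variable has much lower skewness than the raw one, so imputing on the log scale is far closer to the normality assumption of the joint model.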

Some work on automatic transformation methods for joint models is available. An iterative transformation-imputation algorithm has been developed that finds optimal transformations of the variables toward multivariate normality. The algorithm is iterative because the multiply imputed values help define the transformation, and vice versa. Transformations toward normality have also been incorporated into the transcan() and aregImpute() functions of the Hmisc package in R (Harrell 2001).

If a joint model is specified, it is nearly always the multivariate normal model. Alternatives like the $$t$$-distribution (Liu 1995) have seen little further development or application. Recent research in the area has focused on models for multilevel data. These developments are covered in Chapter 7.

### 4.4.3 Categorical data

The multivariate normal model is often applied to categorical data. Schafer (1997) suggested rounding off continuous imputed values for categorical data to the nearest category “to preserve the distributional properties as fully as possible and to make them intelligible to the analyst.” This advice was questioned by Horton, Lipsitz, and Parzen (2003), who showed that simple rounding may introduce bias in the estimates of interest, in particular for binary variables. Allison (2005) found that it is usually better not to round the data, and preferred methods specifically designed for categorical data, like logistic regression imputation or discriminant analysis imputation. Bernaards, Belin, and Schafer (2007) confirmed the results of Horton, Lipsitz, and Parzen (2003) for simple rounding, and proposed two improvements to simple rounding: coin flip and adaptive rounding. Their simulations showed that “adaptive rounding seemed to provide the best performance, although its advantage over simple rounding was sometimes slight.” Further work has been done by Yucel, He, and Zaslavsky (2008), who proposed rounding such that the marginal distribution in the imputations is similar to that of the observed data. Alternatively, Demirtas (2009) proposed two rounding methods based on logistic regression and an additional drawing step that makes rounding dependent on other variables in the imputation model. Another proposal is to model the indicators of the categorical variables (Lee et al. 2012). A single best rounding rule for categorical data has yet to be identified. Demirtas (2010) encourages researchers to avoid rounding altogether, and to apply methods specifically designed for categorical data.
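The three rounding strategies for a binary variable can be sketched as follows. This is an illustrative Python sketch with invented data; the adaptive threshold follows the normal-approximation idea of Bernaards, Belin, and Schafer (2007), choosing the cutoff so that the expected proportion of ones matches the mean of the continuous imputations.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)

# Continuous imputations for a binary variable, as produced by a normal model
y_imp = rng.normal(loc=0.3, scale=0.2, size=1000)

# Simple rounding: fixed threshold at 0.5
simple = (y_imp >= 0.5).astype(int)

# Coin flip: treat the clipped imputed value as a Bernoulli probability
p = np.clip(y_imp, 0.0, 1.0)
coin = (rng.random(y_imp.size) < p).astype(int)

# Adaptive rounding: data-dependent threshold c, chosen so that
# P(y >= c) = w under a normal approximation with mean w
w = p.mean()
c = w - NormalDist().inv_cdf(w) * np.sqrt(w * (1.0 - w))
adaptive = (y_imp >= c).astype(int)
```

Simple rounding ignores the distribution of the imputations, whereas coin flip and adaptive rounding both aim to keep the marginal proportion of ones close to that implied by the continuous imputed values.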

Several joint models for categorical variables have been proposed that do not rely on rounding. Schafer (1997) proposed several techniques to impute categorical data and mixed continuous-categorical data. Missing data in contingency tables can be imputed under the log-linear model. The model preserves higher-order interactions, and works best if the number of variables is small, say, up to six. Mixed continuous-categorical data can be imputed under the general location model originally developed by Olkin and Tate (1961). This model combines the log-linear and multivariate normal models by fitting a restricted normal model to each cell of the contingency table. Further extensions have been suggested by Liu and Rubin (1998) and Peng, Little, and Raghunathan (2004). Belin et al. (1999) pointed out some limitations of the general location model for a larger dataset with 16 binary and 18 continuous variables. Their study found substantial differences between the imputed and follow-up data, especially for the binary data.

Alternative imputation methods based on joint models have been developed. One approach maximized internal consistency through the $$k$$-means clustering algorithm, with outlined extensions for generating multiple imputations; as a single imputation method, it artificially strengthens the relations in the data. The MIMCA imputation technique (Audigier, Husson, and Josse 2017) uses a similar underlying model, and derives variability in imputations by taking bootstrap samples under a chosen number of dimensions.

Van Ginkel, Van der Ark, and Sijtsma (2007) proposed two-way imputation, a technique for imputing incomplete categorical data by conditioning on the row and column sum scores of the multivariate data. This method has applications for imputing missing test item responses. Chen, Xie, and Qian (2011) proposed a class of models that specifies the conditional density by an odds ratio representation relative to the center of the distribution. This allows for separate models of the odds ratio function and the conditional density at the center.

Vermunt et al. (2008) pioneered the use of latent class analysis for imputing categorical data. The latent class (or finite mixture) model describes the joint distribution as a product of locally independent categorical variables. When the number of classes is large, the model can be used as a generic density estimation tool that captures the relations between the variables through a highly parameterized model. The relevant associations in the data need not be specified a priori, and the main modeling effort consists of setting the number of latent classes. Unlike the saturated log-linear models advocated by Schafer (1997), latent class models can handle a large number of variables. Several different implementations of the latent class model for imputation have been developed, both frequentist (Vermunt et al. 2008; Van der Palm, Van der Ark, and Vermunt 2016b) and Bayesian (Si and Reiter 2013), which differ in how the number of classes is selected. An extension to longitudinal data has also been proposed.

Joint models for nested (multilevel) data have been intensively studied. Section 7.4 discusses these developments in more detail.