3.6 Categorical data
3.6.1 Generalized linear model
Imputation of missing categorical data is possible under the broad class of generalized linear models (McCullagh and Nelder 1989). For incomplete binary variables we use logistic regression, where the outcome probability is modeled as
\[\begin{equation} \Pr(y_i=1|X_i, \beta) = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)} \tag{3.4} \end{equation}\]

A categorical variable with \(K\) unordered categories is imputed under the multinomial logit model
\[ \Pr(y_i=k|X_i, \beta) = \frac{\exp(X_i\beta_k)}{\sum_{\ell=1}^K \exp(X_i\beta_\ell)} \tag{3.5} \]
for \(k=1,\dots,K\), where \(\beta_k\) varies over the categories and where \(\beta_1=0\) to identify the model. A categorical variable with \(K\) ordered categories is imputed by the ordered logit model, or proportional odds model
\[ \Pr(y_i\leq k|X_i, \beta, \tau_k) = \frac{\exp(\tau_k - X_i\beta)} {1 + \exp(\tau_k - X_i\beta)} \tag{3.6} \]
where the slope \(\beta\) is identical across categories, but the intercepts \(\tau_k\) differ. For identification, we set \(\tau_1=0\). The probability of observing category \(k\) is written as
\[ \Pr(y_i = k|X_i) = \Pr(y_i\leq k|X_i) - \Pr(y_i\leq k-1|X_i) \tag{3.7} \]
where the model parameters \(\beta\), \(\tau_k\) and \(\tau_{k-1}\) are suppressed for clarity. Scott Long (1997) is a very readable introduction to these methods. The practical application of these techniques in R is treated in Aitkin et al. (2009). The general idea is to estimate the probability model on the subset of the observed data, and draw synthetic data according to the fitted probabilities to impute the missing data. The parameters are typically estimated by iteratively reweighted least squares. As before, the variability of the model parameters \(\beta\) and \(\tau_2,\dots,\tau_K\) introduces additional uncertainty that needs to be incorporated into the imputations.
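To make the link between the formulas and the imputation step concrete, the following R sketch evaluates the category probabilities in Equations (3.5)–(3.7) for a single case. The parameter values are purely illustrative and the function names are ours, not part of mice.

```r
# Multinomial logit (3.5): softmax of the K linear predictors, with beta_1 = 0
multinom.probs <- function(eta) exp(eta) / sum(exp(eta))
multinom.probs(eta = c(0, 0.8, -0.4))        # K = 3 categories

# Proportional odds (3.6)-(3.7): difference the cumulative probabilities
polr.probs <- function(eta, tau) {           # eta = X_i beta, tau_1 = 0
  cum <- c(plogis(tau - eta), 1)             # Pr(y <= k) for k = 1, ..., K
  diff(c(0, cum))                            # Pr(y = k), Equation (3.7)
}
polr.probs(eta = 0.5, tau = c(0, 1.2, 2.5))  # K = 4 ordered categories
```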
Algorithm 3.5 (Imputation of a binary variable by means of Bayesian logistic regression.)
1. Estimate regression weights \(\hat\beta\) from \((y_\mathrm{obs},X_\mathrm{obs})\) by iteratively reweighted least squares.
2. Obtain \(V\), the unscaled estimated covariance matrix of \(\hat\beta\).
3. Draw \(q\) independent \(N(0,1)\) variates in vector \(\dot z_1\).
4. Calculate \(V^{1/2}\) by Cholesky decomposition.
5. Calculate \(\dot\beta = \hat\beta + \dot z_1 V^{1/2}\).
6. Calculate \(n_0\) predicted probabilities \(\dot p = 1 / (1+ \exp(-X_\mathrm{mis}\dot\beta))\).
7. Draw \(n_0\) random variates from the uniform distribution \(U(0,1)\) in the vector \(u\).
8. Calculate imputations \(\dot y_j = 1\) if \(u_j\leq\dot p_j\), and \(\dot y_j = 0\) otherwise, where \(j=1,\dots,n_0\).
Algorithm 3.5 provides the steps for an approximate Bayesian imputation method using logistic regression. The method assumes that the parameter vector \(\beta\) follows a multivariate normal distribution. Although this is true in large samples, the distribution can in fact be far from normal for modest \(n_1\), for large \(q\) or for predicted probabilities close to 0 or 1. The procedure is also approximate in the sense that it does not draw the estimated covariance matrix \(V\). It is possible to define an explicit Bayesian sampling method for drawing \(\beta\) and \(V\) from their exact posteriors. This approach is theoretically preferable, but as it requires more elaborate modeling, it does not easily extend to other regression situations. In mice the algorithm is implemented as the method logreg.
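The following is a minimal R sketch of Algorithm 3.5 in base R. The function and argument names are ours, and the sketch omits the safeguards against perfect prediction that the logreg method in mice adds (see Section 3.6.2).

```r
# Approximate Bayesian logistic regression imputation (Algorithm 3.5).
# y.obs: observed 0/1 outcomes; X.obs, X.mis: predictor matrices without
# an intercept column (one is added internally). Names are illustrative.
impute.logreg.sketch <- function(y.obs, X.obs, X.mis) {
  fit <- glm(y.obs ~ X.obs, family = binomial())   # Step 1: IRLS fit
  beta.hat <- coef(fit)
  V <- vcov(fit)                                   # Step 2: covariance of beta.hat
  z1 <- rnorm(length(beta.hat))                    # Step 3: standard normal draws
  beta.dot <- beta.hat + drop(z1 %*% chol(V))      # Steps 4-5: beta.hat + z1 V^{1/2}
  p.dot <- plogis(cbind(1, X.mis) %*% beta.dot)    # Step 6: predicted probabilities
  u <- runif(nrow(X.mis))                          # Step 7: uniform draws
  as.numeric(u <= p.dot)                           # Step 8: imputed 0/1 values
}
```

One call produces a single set of imputations; repeating the call \(m\) times, each time with fresh draws, yields multiple imputations.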
It is easy to construct a bootstrap version that avoids some of the difficulties in Algorithm 3.5. Prior to estimating \(\hat\beta\), we include a step that draws a bootstrap sample from \(y_\mathrm{obs}\) and \(X_\mathrm{obs}\). Steps 2–5 can then be replaced by equating \(\dot\beta=\hat\beta\).
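A sketch of this bootstrap modification, under the same illustrative conventions as the sketch above: steps 2–5 disappear because the parameter uncertainty now comes from refitting on a bootstrap sample.

```r
# Bootstrap version: resample the observed cases, refit, and use the
# resulting estimate directly as beta.dot.
impute.logreg.boot.sketch <- function(y.obs, X.obs, X.mis) {
  idx <- sample(length(y.obs), replace = TRUE)     # bootstrap sample
  fit <- glm(y.obs[idx] ~ X.obs[idx, , drop = FALSE], family = binomial())
  beta.dot <- coef(fit)                            # replaces Steps 2-5
  p.dot <- plogis(cbind(1, X.mis) %*% beta.dot)
  as.numeric(runif(nrow(X.mis)) <= p.dot)
}
```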
The algorithms for imputation of variables with more than two categories follow the same structure. In mice the multinomial logit model in method polyreg is estimated by the nnet::multinom() function in the nnet package. The ordered logit model in method polr is estimated by the polr() function of the MASS package. Even though the ordered model uses fewer parameters, it is often more difficult to estimate. In cases where MASS::polr() fails to converge, nnet::multinom() will take over its duties. See Venables and Ripley (2002) for more details on both functions.
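As an illustration of how these methods are requested in mice, the call below sets the imputation method per column for a hypothetical data frame dat; the column names are placeholders, and for factors mice already selects logreg, polyreg or polr by default.

```r
library(mice)
ini <- mice(dat, maxit = 0)        # dry run: default setup without iterating
meth <- ini$method
meth["smokes"]  <- "logreg"        # binary factor (hypothetical column)
meth["region"]  <- "polyreg"       # unordered factor with K > 2 categories
meth["educatn"] <- "polr"          # ordered factor
imp <- mice(dat, method = meth, m = 5, seed = 123, printFlag = FALSE)
```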
3.6.2 Perfect prediction\(^\spadesuit\)
There is a long-standing technical problem in models with categorical outcomes, known as separation or perfect prediction (Albert and Anderson 1984; Lesaffre and Albert 1989). The standard work by Hosmer and Lemeshow (2000, 138–41) discussed the problem, but provided no solution. The problem occurs, for example, when predicting the presence of a disease from a set of symptoms. If one of the symptoms (or a combination of symptoms) always leads to the disease, then we can perfectly predict the disease for any patient who has the symptom(s).
| Disease | Symptom Yes | Symptom No |
|---|---|---|
| Yes | 100 | 100 |
| No | 0 | 100 |
| Unknown | 100 | 100 |
Table 3.4 contains an artificial numerical example. Having the symptom always implies the disease, so knowing that the patient has the symptom will allow perfect prediction of the disease status. When such data are analyzed, most software will print out a warning message and produce unusually large standard errors.
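The separation in Table 3.4 is easy to reproduce. The R sketch below codes the observed part of the table and fits the logistic regression; glm() typically warns that fitted probabilities numerically equal to 0 or 1 occurred, and the coefficient for the symptom is huge with an enormous standard error.

```r
# Observed part of Table 3.4: 100 patients with the symptom (all diseased),
# 200 without the symptom (100 diseased, 100 not)
symptom <- rep(c(1, 0), times = c(100, 200))
disease <- rep(c(1, 1, 0), times = c(100, 100, 100))
fit <- glm(disease ~ symptom, family = binomial())
summary(fit)$coefficients    # separation: gigantic estimate and SE for symptom
```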
Now suppose that in a new group of 200 patients (100 in each symptom group) we know only the symptom and need to impute disease status. Under MAR, we should impute all 100 cases with the symptom into the diseased group, and divide the 100 cases without the symptom randomly over the diseased and non-diseased groups. However, this is not what happens in Algorithm 3.5. The estimate of \(V\) will be very large as a result of separation. If we naively use this \(V\), then \(\dot\beta\) in step 5 covers positive and negative values with about equal probability. The result is that either all 100 imputations land (correctly) in the Yes category, or all 100 land (incorrectly) in the No category, thereby biasing the estimate of the disease probability.
The problem has recently gained attention. There are at least six different approaches to perfect prediction:
1. Eliminate the variable that causes perfect prediction.
2. Take \(\hat\beta\) instead of \(\dot\beta\).
3. Use penalized regression with Jeffreys prior in step 2 of Algorithm 3.5 (Firth 1993; Heinze and Schemper 2002).
4. Use the bootstrap, and then apply method 2.
5. Use data augmentation, a method that concatenates pseudo-observations with a small weight to the data, effectively prohibiting infinite estimates (Clogg et al. 1991; White, Daniel, and Royston 2010).
6. Apply explicit Bayesian sampling with a suitable weak prior. Gelman et al. (2008) recommend independent Cauchy distributions on all logistic regression coefficients.
Eliminating the most predictive variable is generally undesirable in the context of imputation, and may in fact bias the relation of interest. Option 2 does not yield proper imputations, and is therefore not recommended. Option 3 provides finite estimates, but has been criticized as not being well interpretable in a regression context (Gelman et al. 2008) and as computationally inefficient (White, Daniel, and Royston 2010). Option 4 corrects method 2, and is simple to implement. Options 5 and 6 have been recommended by White, Daniel, and Royston (2010) and Gelman et al. (2008), respectively.
Methods 4, 5 and 6 all solve a major difficulty in the construction of automatic imputation techniques. It is not yet clear whether one of these methods is superior. The logreg, polr and polyreg methods in mice implement option 5.
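To show the flavor of option 5, the sketch below augments a binary logistic regression with a handful of pseudo-observations that carry a small weight. The particular pseudo-data and weight are illustrative only and do not reproduce the exact augmentation rules used inside mice.

```r
# Data augmentation sketch for a single binary predictor x and outcome y.
fit.augmented <- function(y, x, w.pseudo = 0.01) {
  y.aug <- c(y, 0, 1, 0, 1)                 # pseudo-observations covering all
  x.aug <- c(x, 0, 0, 1, 1)                 #   outcome/predictor combinations
  w <- c(rep(1, length(y)), rep(w.pseudo, 4))
  # the non-integer weights trigger a harmless warning from glm()
  suppressWarnings(glm(y.aug ~ x.aug, family = binomial(), weights = w))
}
# Using the disease/symptom vectors from the Table 3.4 sketch, the augmented
# fit yields a large but finite coefficient, so draws of beta remain stable.
fit.augmented(disease, symptom)
```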
3.6.3 Evaluation
The methods are based on the elegant class of generalized linear models. Simulations presented in Van Buuren et al. (2006) show that these methods performed quite well in the lab. When used in practice, however, the methods may be unstable, slow and exhibit poor performance. Hardt et al. (2013) intentionally pushed the logistic methods to their limits, and observed that most methods break down relatively quickly, i.e., when the proportion of missing values exceeds 0.4. Van der Palm, Van der Ark, and Vermunt (2016a) found that logreg failed to pick up a three-way association in the data, leading to biased estimates. Likewise, Vidotto, Vermunt, and Kaptein (2015) observed that logreg did not recover the structure in the data as well as latent class models. Wu, Jia, and Enders (2015) found poor results for all three methods (i.e., binary, multinomial and proportional odds), and advise against their application. Akande, Li, and Reiter (2017) reported difficulties with fitting multinomial variables having many categories. The performance of the procedures suffered when variables with probabilities nearly equal to one (or zero) were included in the models. Methods based on the generalized linear model were found to be inferior to method cart (cf. Section 3.5) and to latent class models for categorical data (cf. Section 4.4). Audigier, Husson, and Josse (2017) found that logistic regression presented difficulties on datasets with a high number of categories, resulting in undercoverage on several quantities.
Imputation of categorical data is more difficult than imputation of continuous data. As a rule of thumb, in logistic regression we need at least 10 events per predictor in order to get reasonably stable estimates of the regression coefficients (Van Belle 2002, 87). So if we impute with 10 predictors, we need 100 events, and if the events occur with a probability of 0.1, then we need \(n > 1000\) cases. If we impute outcomes with more categories, the numbers rapidly increase, for two reasons. First, we have more possible outcomes, and we need 10 events for each category. Second, when used as a predictor, each nominal variable is expanded into dummy variables, so the number of predictors multiplies by the number of categories minus 1. The defaults logreg, polyreg and polr tend to preserve the main effects well, provided that the parameters are identified and can be reasonably well estimated. In many datasets, especially those with many categories, the ratio of the number of events to the number of fitted parameters easily drops below 10, which may lead to estimation problems. In those cases, the advice is to specify more robust methods, like pmm, cart or rf.
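In practice, switching to these more robust methods only requires editing the method vector; the sketch below does so for a hypothetical data frame dat, replacing all categorical defaults by cart.

```r
library(mice)
meth <- make.method(dat)                                  # default methods
meth[meth %in% c("logreg", "polyreg", "polr")] <- "cart"  # or "pmm", "rf"
imp <- mice(dat, method = meth, m = 5, seed = 123, printFlag = FALSE)
```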