3.7 Other data types
3.7.1 Count data
Examples of count data include the number of children in a family or the number of crimes committed. The minimum value is zero. Imputing incomplete count data should produce non-negative synthetic replacement values. Count data can be imputed in various ways:
- Predictive mean matching (cf. Section 3.4).
- Ordered categorical imputation (cf. Section 3.6).
- (Zero-inflated) Poisson regression (Raghunathan et al. 2001).
- (Zero-inflated) negative binomial regression (Royston 2009).
Poisson regression is a class of models that is widely applied in biostatistics. The Poisson model can be thought of as counting the number of successes in a long series of independent coin flips, each with a small success probability. The negative binomial is a more flexible model that is often applied as an alternative to account for over-dispersion. Zero-inflated versions of both models can be used if the number of zero values is larger than expected. These models are special cases of the generalized linear model, and do not bring new issues compared to, say, logistic regression imputation.
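To make the idea concrete, the following is a minimal sketch of Poisson regression imputation along the Bayesian lines used elsewhere in this chapter: fit the model to the observed counts, perturb the coefficients with a draw from their approximate posterior, and generate non-negative counts for the missing entries. The data and variable names are simulated for illustration and do not correspond to any particular package implementation.

```r
library(MASS)
set.seed(123)

# toy data: complete predictor x, count outcome y with missing entries
n <- 200
x <- rnorm(n)
y <- rpois(n, exp(0.5 + 0.8 * x))
y[sample(n, 50)] <- NA
ry <- !is.na(y)

# fit Poisson regression on the observed part
fit <- glm(y ~ x, family = poisson, subset = ry)

# perturb the coefficients with a draw from their approximate posterior,
# then draw non-negative synthetic counts for the missing entries
beta.star <- MASS::mvrnorm(1, coef(fit), vcov(fit))
lambda <- exp(cbind(1, x[!ry]) %*% beta.star)
y[!ry] <- rpois(sum(!ry), lambda)
```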
Kleinke and Reinecke (2013) developed methods for zero-inflated and over-dispersed data, using both Bayesian and bootstrap approaches. Methods are available in the countimp package for the Poisson model (pois, pois.boot), the quasi-Poisson model (qpois, qpois.boot), the negative binomial model (nb), the zero-inflated Poisson (2l.zip, 2l.zip.boot), and the zero-inflated negative binomial (2l.zinb, 2l.zinb.boot). Note that, despite their naming, these 2l methods are for single-level imputation. An alternative is the ImputeRobust package (Salfran and Spiess 2017), which implements the following mice methods for count data: gamlssPO (Poisson), gamlssZIBI (zero-inflated binomial) and gamlssZIP (zero-inflated Poisson). Kleinke (2017) evaluated the use of predictive mean matching as a multipurpose missing data tool. By and large, the simulations illustrate that the method is robust against violations of its assumptions, and can be recommended for imputation of mildly to moderately skewed variables when sample size is sufficiently large.
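As a usage sketch, a count variable can be imputed in mice with predictive mean matching out of the box; if the countimp package is installed, one of the count-specific methods named above could be requested instead through the method argument. The data frame and variable names below are illustrative assumptions.

```r
library(mice)
set.seed(1)

# toy data: a partially observed count variable and a complete covariate
dat <- data.frame(age = rnorm(100, 40, 10))
dat$crimes <- rpois(100, exp(-2 + 0.05 * dat$age))
dat$crimes[sample(100, 30)] <- NA

# predictive mean matching preserves the discrete, non-negative support
imp <- mice(dat, method = "pmm", m = 5, printFlag = FALSE)

# with countimp installed, a count-specific method could be requested for
# this variable instead (method names as listed above):
# meth <- make.method(dat)
# meth["crimes"] <- "pois"
# imp <- mice(dat, method = meth, m = 5, printFlag = FALSE)
```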
3.7.2 Semi-continuous data
Semi-continuous data have a high mass at one point (often zero) and a continuous distribution over the remaining values. An example is the number of cigarettes smoked per day, which has a high mass at zero because of the non-smokers, and an often highly skewed unimodal distribution for the smokers. The difference with count data is gradual. Semi-continuous data are typically treated as continuous data, whereas count data are generally considered discrete.
Imputation of semi-continuous variables needs to reproduce both the point mass and continuously varying part of the data. One possibility is to apply a general-purpose method that preserves distributional features, like predictive mean matching (cf. Section 3.4).
An alternative is to model the data in two parts. The first step is to determine whether the imputed value is zero or not. The second step is only done for those with a non-zero value, and consists of drawing a value from the continuous part. Olsen and Schafer (2001) developed an imputation technique by combining a logistic model for the discrete part, and a normal model for the continuous part, possibly after a normalizing transformation. A more general two-part model was developed by Javaras and Van Dyk (2003), who extended the standard general location model (Olkin and Tate 1961) to impute partially observed semi-continuous data.
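The two-part idea can be sketched in a few lines. The code below is an illustrative simplification, not the Olsen and Schafer (2001) or Javaras and Van Dyk (2003) implementation: a logistic model decides whether an imputed value is zero, and a log-normal draw supplies the continuous part for the non-zero cases; parameter uncertainty is ignored for brevity.

```r
set.seed(2)

# toy semi-continuous outcome: cigarettes per day, zero for non-smokers
n <- 300
x <- rnorm(n)
smoker <- rbinom(n, 1, plogis(-0.5 + x))
y <- ifelse(smoker == 1, exp(rnorm(n, 2 + 0.3 * x, 0.5)), 0)
y[sample(n, 80)] <- NA
ry <- !is.na(y)

# part 1: zero versus non-zero
fit1 <- glm(I(y > 0) ~ x, family = binomial, subset = ry)
p1 <- predict(fit1, newdata = data.frame(x = x[!ry]), type = "response")
nonzero <- rbinom(sum(!ry), 1, p1)

# part 2: continuous part, modelled on the log scale for the positives
fit2 <- lm(log(y) ~ x, subset = ry & y > 0)
mu <- predict(fit2, newdata = data.frame(x = x[!ry]))
draw <- exp(rnorm(sum(!ry), mu, summary(fit2)$sigma))

# combine the two parts into the imputed values
y[!ry] <- ifelse(nonzero == 1, draw, 0)
```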
Yu, Burton, and Rivero-Arias (2007) evaluated nine different procedures. They found that predictive mean matching performs well, provided that a sufficient number of data points in the neighborhood of the incomplete data are available. Vink et al. (2014) found that generic predictive mean matching is at least as good as three dedicated methods for semi-continuous data: the two-part models as implemented in mi (Su et al. 2011) and irmi (Templ, Kowarik, and Filzmoser 2011), and the blocked general location model by Javaras and Van Dyk (2003). Vroomen et al. (2016) investigated imputation of cost data, and found that predictive mean matching of the log-transformed costs outperformed plain predictive mean matching, a two-step method and complete-case analysis; they therefore recommend the log-transformed method for monetary data.
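A minimal sketch of the log-transform-then-match approach with mice follows; the cost variable and covariate are simulated and the setup is illustrative rather than a reconstruction of Vroomen et al. (2016).

```r
library(mice)
set.seed(3)

# toy right-skewed cost data with a complete covariate
dat <- data.frame(age = rnorm(150, 50, 8))
dat$cost <- exp(rnorm(150, 6 + 0.02 * dat$age, 0.9))
dat$cost[sample(150, 40)] <- NA

# impute on the log scale with predictive mean matching, then back-transform
dat$logcost <- log(dat$cost)
imp <- mice(dat[, c("age", "logcost")], method = "pmm", printFlag = FALSE)
cost.imp <- exp(complete(imp)$logcost)
```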
3.7.3 Censored, truncated and rounded data
An observation \(y_i\) is censored if its value is only partly known. In right-censored data we only know that \(y_i > a_i\) for a censoring point \(a_i\). In left-censored data we only know that \(y_i \leq b_i\) for some known censoring point \(b_i\), and in interval censoring we know \(a_i \leq y_i \leq b_i\). Right-censored data arise when the true value is beyond the maximum scale value, for example, when body weight exceeds the scale maximum, say 150kg. When \(y_i\) is interpreted as time taken to some event (e.g., death), right-censored data occur when the observation period ends before the event has taken place. Left and right censoring may cause floor and ceiling effects. Rounding data to fewer decimal places results in interval-censored data.
Truncation is related to censoring, but differs from it in the sense that a value below (left truncation) or above (right truncation) the truncation point is not recorded at all. For example, if persons with a weight in excess of 150kg are removed from the sample, we speak of truncation. The fact that observations are entirely missing turns the truncation problem into a missing data problem. Truncated data are less informative than censored data, and consequently truncation has a larger potential to distort the inferences of interest.
The usual approach for dealing with missing values in censored and truncated data is to delete the incomplete records, i.e., complete-case analysis. In the event that time is the censored variable, consider the following two problems:
Censored event times. What would have been the uncensored event time if no censoring had taken place?
Missing event times. What would have been the event time and the censoring status if these had been observed?
The problem of censored event times has been studied extensively. There are many statistical methods that can analyze left- or right-censored data directly, collectively known as survival analysis. Kleinbaum and Klein (2005), Hosmer, Lemeshow, and May (2008) and Allison (2010) provide useful introductions to the field. Survival analysis is the method of choice if censoring is restricted to a single outcome. The approach is, however, less suited for censored predictors or for multiple interdependent censored outcomes. Van Wouwe et al. (2009) discuss an empirical example of such a problem. The authors are interested in the time interval between resuming contraception and the cessation of lactation in young mothers who gave birth in the last 6 months. As the sample was cross-sectional, both contraception and lactation were subject to censoring. Imputation could be used to fill in the hypothetically uncensored event times for both durations, which allowed a study of the association between them.
The problem of missing event times is relevant if the event time is unobserved. The censoring status is typically also unknown if the event time is missing. Missing event times may be due to happenstance, for example, resulting from a technical failure of the instrument that measures event times. Alternatively, the missing data could have been caused by truncation, where all event times beyond the truncation point are set to missing. It will be clear that the optimal way to deal with missing event times depends on the reasons for the missingness. Analysis of the complete cases will systematically distort the analysis of the event times if the data are truncated.
Imputation of right-censored data has received most attention to date. In general, the method aims to find new (longer) event times that would have been observed had the data not been censored. Let \(n_1\) denote the number of observed failure times, let \(n_0=n-n_1\) denote the number of censored event times and let \(\{t_1,\dots,t_n\}\) be the ordered set of failure and censored times. For some time point \(t\), the risk set \(R(t) = \{t_i : t_i > t,\ i=1,\dots,n\}\) is the set of event and censored times that are longer than \(t\). Taylor et al. (2002) proposed two imputation strategies for right-censored data:
Risk set imputation. For a given censored value \(t\) construct the risk set \(R(t)\), and randomly draw one case from this set. Both the failure time and censoring status from the selected case are used to impute the data.
Kaplan–Meier imputation. For a given censored value \(t\) construct the risk set \(R(t)\) and estimate the Kaplan–Meier curve from this set. A randomly drawn failure time from the Kaplan–Meier curve is used for imputation.
Both methods are asymptotically equivalent to the Kaplan–Meier estimator after multiple imputation with large \(m\). The adequacy of imputation procedures will depend on the availability of possible donor observations, which diminishes in the tails of the survival distribution. The Kaplan–Meier method has the advantage that nearly all censored observations are replaced by imputed failure times. In principle, both Bayesian sampling and bootstrap methods can be used to incorporate model uncertainty, but in practice only the bootstrap has been used.
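A bare-bones sketch of Kaplan–Meier imputation for right-censored observations is given below, using the survival package; the data are simulated, there are no covariates, and the bootstrap step that would account for parameter uncertainty is omitted.

```r
library(survival)
set.seed(4)

# toy right-censored data: time and censoring status (1 = failure observed)
time   <- rexp(100)
status <- rbinom(100, 1, 0.7)

km_impute <- function(t0, time, status) {
  risk <- time > t0                      # risk set: cases still at risk at t0
  if (sum(risk) == 0) return(t0)         # no donors beyond t0
  fit <- survfit(Surv(time[risk], status[risk]) ~ 1)
  # inverse-CDF draw from the Kaplan-Meier curve: smallest failure time
  # whose estimated survival falls at or below a uniform draw
  u <- runif(1)
  cand <- fit$time[fit$surv <= u & fit$n.event > 0]
  if (length(cand) == 0) return(max(time[risk]))  # drawn beyond last failure
  min(cand)
}

# impute a failure time for every censored observation
cens <- which(status == 0)
imputed <- sapply(time[cens], km_impute, time = time, status = status)
```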
Hsu et al. (2006) extended both methods to include covariates. The authors fitted a proportional hazards model and calculated a risk score as a linear combination of the covariates. The key adaptation is to restrict the risk set to those cases that have a risk score similar to that of the censored case, an idea similar to predictive mean matching. A donor group size of \(d=10\) was found to perform well, and Kaplan–Meier imputation was superior to risk set imputation across a wide range of situations.
Algorithm 3.6 (Imputation of right-censored data using predictive mean matching, Kaplan–Meier estimation and the bootstrap.)
1. Estimate \(\hat\beta\) by a proportional hazards model of \(y\) given \(X\), where \(y = (t,\phi)\) consists of time \(t\) and censoring indicator \(\phi\) (\(\phi_i=0\) if \(t_i\) is censored).
2. Draw a bootstrap sample \((\dot y,\dot X)\) of size \(n\) from \((y,X)\).
3. Estimate \(\dot\beta\) by a proportional hazards model of \(\dot y\) given \(\dot X\).
4. Calculate \(\dot\eta(i,j)=|X_{[i]}\hat\beta-X_{[j]}\dot\beta|\) with \(i=1,\dots,n\) and \(j=1,\dots,n_0\), where \([j]\) indexes the cases with censored times.
5. Construct \(n_0\) sets \(Z_j\), each containing \(d\) candidate donors such that \(t_i > t_j\) and \(\sum_d\dot\eta(i,j)\) is minimum for each \(j=1,\dots,n_0\). Break ties randomly.
6. For each \(Z_j\), estimate the Kaplan–Meier curve \(\hat S_j(t)\).
7. Draw \(n_0\) uniform random variates \(u_j\), and take \(\dot t_j\) from the estimated cumulative distribution function \(1-\hat S_j(t)\) at \(u_j\) for \(j=1,\dots,n_0\).
8. Set \(\phi_j=0\) if \(\dot t_j = t_n\) and \(\phi_n=0\), else set \(\phi_j=1\).
Algorithm 3.6 is based on the KIMB method proposed by Hsu et al. (2006). The method assumes that censoring status is known, and aims to impute plausible event times for censored observations. Hsu et al. (2006) actually suggested fitting two proportional hazards models, one with survival time as outcome and one with censoring status as outcome, but in order to keep in line with the rest of this chapter, here we only fit the model for survival time. The way in which predictive mean matching is done differs slightly from Hsu et al. (2006).
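For concreteness, the following is a rough sketch of one imputation cycle along the lines of Algorithm 3.6, again using the survival package. The data are simulated, \(d=10\), and refinements of Hsu et al. (2006), notably the second proportional hazards model for the censoring process, are omitted.

```r
library(survival)
set.seed(5)

# toy right-censored data with two covariates
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
time <- rexp(n, exp(0.5 * X[, "x1"] - 0.3 * X[, "x2"]))
phi  <- rbinom(n, 1, 0.7)                 # 1 = failure, 0 = censored
d <- 10

# risk scores from the observed data and from a bootstrap replication
beta.hat <- coef(coxph(Surv(time, phi) ~ X))
b        <- sample(n, replace = TRUE)
beta.dot <- coef(coxph(Surv(time[b], phi[b]) ~ X[b, ]))
eta.obs  <- as.vector(X %*% beta.hat)

time.imp <- time
phi.imp  <- phi
for (j in which(phi == 0)) {
  eta.j <- sum(X[j, ] * beta.dot)
  # candidate donors: longer times with the closest risk scores
  cand <- which(time > time[j])
  if (length(cand) == 0) next             # keep the censored value as is
  donors <- cand[order(abs(eta.obs[cand] - eta.j))][seq_len(min(d, length(cand)))]
  # draw an event time from the donors' Kaplan-Meier curve
  fit  <- survfit(Surv(time[donors], phi[donors]) ~ 1)
  u    <- runif(1)
  draw <- fit$time[fit$surv <= u & fit$n.event > 0]
  if (length(draw) > 0) {
    time.imp[j] <- min(draw)
    phi.imp[j]  <- 1
  }                                       # else the status stays censored
}
```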
The literature on imputation methods for censored and rounded data is rapidly evolving. Alternative methods for right-censored data have also been proposed (Wei and Tanner 1991; Geskus 2001; Lam, Tang, and Fong 2005; Liu, Murray, and Tsodikov 2011). Lyles, Fan, and Chuachoowong (2001), Lynn (2001), Hopke, Liu, and Rubin (2001) and Lee et al. (2018) concentrated on left-censored data. Imputation of interval-censored data (rounded data) has been discussed quite extensively (Heitjan and Rubin 1990; Dorey, Little, and Schenker 1993; James and Tanner 1995; Pan 2000; Bebchuk and Betensky 2000; Glynn and Rosner 2004; Hsu 2007; Royston 2007; Chen and Sun 2010; Hsu, Taylor, and Hu 2015). Imputation of double-censored data, where both the initial and the final times are interval censored, is treated by Pan (2001) and Zhang et al. (2009). Delord and Génin (2016) extended Pan's approach to interval-censored competing risks data, thus allowing estimation of the survival function, the cumulative incidence function, and Cox and Fine–Gray regression coefficients. These methods are available in the MIICD package. Jackson et al. (2014) used multiple imputation to study departures from the independent censoring assumption in the Cox model.
By comparison, very few methods have been developed to deal with truncation. Methods for imputing a missing censoring indicator have been proposed by Subramanian (2009), Subramanian (2011) and Wang and Dinse (2010).