12.3 Other applications

Chapters 9-11 illustrated several applications of multiple imputation. This section briefly reviews some other applications. These underscore the general nature and broad applicability of multiple imputation.

12.3.1 Synthetic datasets for data protection

Many governmental agencies make microdata available to the public. One of the major practical issues is that the identity of anonymous respondents can be disclosed through the data they provide. Rubin (1993) suggested publishing fully synthetic microdata instead of the real data, with the obvious advantage of zero disclosure risk. The released synthetic data should reproduce the essential features of confidential microdata.

Raghunathan, Reiter, and Rubin (2003) and Reiter (2005a) demonstrated the practical application of the idea. Real and synthetic records can be mixed, resulting in partially synthetic data. Recent advances can be found in Reiter (2009), Drechsler and Reiter (2010), Reiter, Wang, and Zhang (2014) and Loong and Rubin (2017). Yu et al. (2017) present an application to protect confidentiality in the Californian Cancer Registry.

12.3.2 Analysis of coarsened data

Many datasets contain data that are partially missing. Some values are known accurately, but others are only known to lie within a certain range. Heitjan and Rubin (1991) proposed a general theory for data coarsening processes that includes rounding, heaping, censoring and missing data as special cases. See also Gill, Van der Laan, and Robins (1997) for a slightly more extended model. Heitjan and Rubin (1990) provided an application where age is misreported, and the amount of misreporting increases with age itself. Such problems with the data can be handled by multiple imputation of true age, given reported age and other personal factors. Heitjan (1993) discussed various other biomedical examples and an application to data from the Stanford Heart Transplant Program. Related work on measurement error is available from several sources (Brownstone and Valletta 1996; Ghosh-Dastidar and Schafer 2003; Yucel and Zaslavsky 2005; Cole, Chu, and Greenland 2006; Glickman et al. 2008). Goldstein and Carpenter (2015) formulated joint models for three types of coarsened data.
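As a minimal sketch of the idea, interval-coarsened values can be replaced by draws from a distribution restricted to the known range. The example below is an illustration, not the method of Heitjan and Rubin. It assumes a normal model for true age with fixed, known mean and standard deviation, so it ignores parameter uncertainty and the other personal factors that a proper imputation model would include. The names `draw_truncated` and `impute_coarsened`, and the example values, are hypothetical.

```python
import random
from statistics import NormalDist

def draw_truncated(dist, lower, upper, rng):
    """Draw from `dist` restricted to [lower, upper] by inverting the CDF."""
    p_lo, p_hi = dist.cdf(lower), dist.cdf(upper)
    return dist.inv_cdf(rng.uniform(p_lo, p_hi))

def impute_coarsened(values, dist, m=5, seed=0):
    """Return m completed copies of `values`.

    Exact values are floats and are kept as they are. Coarsened values are
    (lower, upper) tuples and are replaced by a draw within that range.
    ASSUMPTION: the parameters of `dist` are treated as known and fixed.
    """
    rng = random.Random(seed)
    completed = []
    for _ in range(m):
        completed.append([
            draw_truncated(dist, v[0], v[1], rng) if isinstance(v, tuple) else v
            for v in values
        ])
    return completed

# Hypothetical ages: two reported exactly, two known only to lie in a range.
ages = [34.0, (40, 50), 61.0, (70, 80)]
imputations = impute_coarsened(ages, NormalDist(mu=55, sigma=15), m=3)
```

Each completed dataset can then be analyzed separately and the results pooled in the usual way. Replacing the fixed parameters by draws from their posterior distribution would be needed to reflect the full uncertainty.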

12.3.3 File matching of multiple datasets

Statistical file matching, or data fusion, attempts to integrate two or more datasets with different units observed on common variables. Rubin and Schenker (1986a) considered file matching as a missing data problem, and suggested multiple imputation as a solution. Moriarity and Scheuren (2003) developed modifications that were found to improve the procedure. Further relevant work can be found in the books by Rässler (2002), D’Orazio, Di Zio, and Scanu (2006) and Herzog, Scheuren, and Winkler (2007).

The imputation techniques proposed to date were developed from the multivariate normal model. Application of the MICE algorithm under conditional independence is straightforward. Rässler (2002) compared MICE to several alternatives, and found MICE to work well under normality and conditional independence. If the assumption of conditional independence does not hold, we may bring prior information into MICE by appending a third data file that contains records with data that embody the prior information. Sections 6.5.2 and 9.4.5 put this idea into practice in a different context. This technique can perform file matching for mixed continuous-discrete data under any data-coded prior.
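The structure of the problem can be sketched as follows. Suppose file A contains the variables x and y, and file B contains x and z. Stacking the files yields a dataset in which y and z are never observed together, so the data carry no direct information about their association given x. The example below is a simplified illustration, not the MICE algorithm: it imputes each missing variable from a simple linear regression on x alone, which is what conditional independence of y and z given x amounts to, and it treats the regression estimates as fixed. All variable and function names are hypothetical.

```python
import random

def stack_files(file_a, file_b):
    """Concatenate two files; variables absent from a file become None."""
    rows = file_a + file_b
    columns = sorted({key for row in rows for key in row})
    return [{c: row.get(c) for c in columns} for row in rows]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x, plus the residual SD."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (n - 2)) ** 0.5

def impute_given_x(rows, target, rng):
    """Fill missing `target` values from the regression of `target` on x.

    ASSUMPTION: conditional independence, so x is the only predictor, and
    the estimated coefficients are treated as known.
    """
    donors = [r for r in rows if r[target] is not None]
    a, b, sd = fit_line([r["x"] for r in donors], [r[target] for r in donors])
    completed = []
    for r in rows:
        new = dict(r)
        if new[target] is None:
            new[target] = a + b * new["x"] + rng.gauss(0, sd)
        completed.append(new)
    return completed

rng = random.Random(1)
file_a = [{"x": float(i), "y": 2.0 * i + rng.gauss(0, 1)} for i in range(10)]
file_b = [{"x": float(i), "z": -1.0 * i + rng.gauss(0, 1)} for i in range(10)]
stacked = stack_files(file_a, file_b)
completed = impute_given_x(impute_given_x(stacked, "y", rng), "z", rng)
```

Appending a third file with records that contain both y and z would change the donor sets, which is the mechanism by which prior information about their association could enter.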

12.3.4 Planned missing data for efficient designs

Lengthy questionnaires increase the missing data rate and can make a study expensive. An alternative is to cut up a long questionnaire into separate forms, each of which is considerably shorter than the full version. The split questionnaire design (Raghunathan and Grizzle 1995) poses certain restrictions on the selection of the forms, thus enabling analysis by multiple imputation. Gelman, King, and Liu (1998) provide additional techniques for the related problem of analysis of multiple surveys. The loss of efficiency depends on the strength of the relations between the forms and can be compensated for by a larger initial sample size. Graham et al. (2006) and Graham (2012) are excellent resources for methodology based on planned missing data. Little and Rhemtulla (2013) and Rhemtulla and Hancock (2016) discuss applications in child development and educational research.
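A small sketch may help to show what such a design looks like. The example assumes a three-form layout in which a common block X is asked of everyone and each form omits one of three rotating blocks A, B and C. The block contents, form definitions and function names are hypothetical. The key property is that every pair of blocks is observed jointly on at least one form, so that all pairwise relations can be estimated.

```python
import random
from itertools import combinations

# Hypothetical item blocks: X goes to everyone, A, B and C rotate.
BLOCKS = {
    "X": ["x1", "x2"],
    "A": ["a1", "a2"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2"],
}
# Each form leaves out exactly one rotating block.
FORMS = {1: ["X", "A", "B"], 2: ["X", "A", "C"], 3: ["X", "B", "C"]}

def assign_forms(n_respondents, seed=0):
    """Randomly assign a form number to each respondent."""
    rng = random.Random(seed)
    return [rng.choice(sorted(FORMS)) for _ in range(n_respondents)]

def apply_design(complete_rows, forms):
    """Set an item to None when it is not on the respondent's form."""
    masked = []
    for row, form in zip(complete_rows, forms):
        asked = {item for block in FORMS[form] for item in BLOCKS[block]}
        masked.append({k: (v if k in asked else None) for k, v in row.items()})
    return masked

def all_pairs_covered():
    """Check that every pair of blocks appears together on some form."""
    return all(
        any(b1 in blocks and b2 in blocks for blocks in FORMS.values())
        for b1, b2 in combinations(BLOCKS, 2)
    )

items = [item for block in BLOCKS.values() for item in block]
complete = [{item: 1 for item in items} for _ in range(6)]
forms = assign_forms(len(complete))
observed = apply_design(complete, forms)
```

Because the missing data arise from random assignment to forms, the mechanism is under the control of the investigator, which is what makes the incomplete data well suited to multiple imputation.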

12.3.5 Adjusting for verification bias

Partial verification bias in diagnostic accuracy studies may occur if not all patients are assessed by the reference test (gold standard). Bias occurs if the group of verified patients is selective, e.g., when only those who score positive on a previous test are measured. Multiple imputation has been suggested as a way to correct for this bias (Harel and Zhou 2006; De Groot et al. 2008). The classic Begg-Greenes method may be used only if the missing data mechanism is known and simple. For more complex situations De Groot et al. (2011) and Naaktgeboren et al. (2016) recommend multiple imputation.
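For the simple case, the correction can be sketched in a few lines. The example assumes a binary index test and that the decision to verify depends only on the index test result. Under that assumption, the disease rate among verified patients with a given test result is representative of all patients with that result, and sensitivity and specificity follow from Bayes' rule. The function name and the counts are hypothetical and serve only as an illustration.

```python
def begg_greenes(verified, totals):
    """Corrected sensitivity and specificity under partial verification.

    verified[t][d]: number of verified patients with index test result t
                    and disease status d (1 = positive, 0 = negative).
    totals[t]:      number of all patients with test result t, verified or not.
    ASSUMPTION: verification depends on the index test result only.
    """
    n = totals[0] + totals[1]
    p_test = {t: totals[t] / n for t in (0, 1)}
    # P(D+ | T = t), estimated from the verified patients only.
    p_dis = {
        t: verified[t][1] / (verified[t][0] + verified[t][1]) for t in (0, 1)
    }
    sens = p_dis[1] * p_test[1] / sum(p_dis[t] * p_test[t] for t in (0, 1))
    spec = (1 - p_dis[0]) * p_test[0] / sum(
        (1 - p_dis[t]) * p_test[t] for t in (0, 1)
    )
    return sens, spec

# Hypothetical study: all 100 test positives verified, 100 of 200 negatives.
verified = {1: {1: 80, 0: 20}, 0: {1: 10, 0: 90}}
totals = {1: 100, 0: 200}

# Naive estimates use the verified patients only.
naive_sens = verified[1][1] / (verified[1][1] + verified[0][1])  # 80 / 90
naive_spec = verified[0][0] / (verified[0][0] + verified[1][0])  # 90 / 110
sens, spec = begg_greenes(verified, totals)
```

In this constructed example the naive sensitivity is too high and the naive specificity too low, because test negatives are underrepresented among the verified patients. When verification also depends on other patient characteristics, this simple reweighting no longer applies.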