Flexible Imputation of Missing Data
Foreword
Preface to second edition
Preface to first edition
About the author
Symbol Description
Part I: Basics
1
Introduction
1.1
The problem of missing data
1.1.1
Current practice
1.1.2
Changing perspective on missing data
1.2
Concepts of MCAR, MAR and MNAR
1.3
Ad-hoc solutions
1.3.1
Listwise deletion
1.3.2
Pairwise deletion
1.3.3
Mean imputation
1.3.4
Regression imputation
1.3.5
Stochastic regression imputation
1.3.6
LOCF and BOCF
1.3.7
Indicator method
1.3.8
Summary
1.4
Multiple imputation in a nutshell
1.4.1
Procedure
1.4.2
Reasons to use multiple imputation
1.4.3
Example of multiple imputation
1.5
Goal of the book
1.6
What the book does not cover
1.6.1
Prevention
1.6.2
Weighting procedures
1.6.3
Likelihood-based approaches
1.7
Structure of the book
1.8
Exercises
2
Multiple imputation
2.1
Historic overview
2.1.1
Imputation
2.1.2
Multiple imputation
2.1.3
The expanding literature on multiple imputation
2.2
Concepts in incomplete data
2.2.1
Incomplete-data perspective
2.2.2
Causes of missing data
2.2.3
Notation
2.2.4
MCAR, MAR and MNAR again
2.2.5
Ignorable and nonignorable\(^\spadesuit\)
2.2.6
Implications of ignorability
2.3
Why and when multiple imputation works
2.3.1
Goal of multiple imputation
2.3.2
Three sources of variation\(^\spadesuit\)
2.3.3
Proper imputation
2.3.4
Scope of the imputation model
2.3.5
Variance ratios\(^\spadesuit\)
2.3.6
Degrees of freedom\(^\spadesuit\)
2.3.7
Numerical example
2.4
Statistical intervals and tests
2.4.1
Scalar or multi-parameter inference?
2.4.2
Scalar inference
2.4.3
Numerical example
2.5
How to evaluate imputation methods
2.5.1
Simulation designs and performance measures
2.5.2
Evaluation criteria
2.5.3
Example
2.6
Imputation is not prediction
2.7
When not to use multiple imputation
2.8
How many imputations?
2.9
Exercises
3
Univariate missing data
3.1
How to generate multiple imputations
3.1.1
Predict method
3.1.2
Predict + noise method
3.1.3
Predict + noise + parameter uncertainty
3.1.4
A second predictor
3.1.5
Drawing from the observed data
3.1.6
Conclusion
3.2
Imputation under the normal linear model
3.2.1
Overview
3.2.2
Algorithms\(^\spadesuit\)
3.2.3
Performance
3.2.4
Generating MAR missing data
3.2.5
MAR missing data generation in multivariate data
3.2.6
Conclusion
3.3
Imputation under non-normal distributions
3.3.1
Overview
3.3.2
Imputation from the \(t\)-distribution
3.4
Predictive mean matching
3.4.1
Overview
3.4.2
Computational details\(^\spadesuit\)
3.4.3
Number of donors
3.4.4
Pitfalls
3.4.5
Conclusion
3.5
Classification and regression trees
3.5.1
Overview
3.6
Categorical data
3.6.1
Generalized linear model
3.6.2
Perfect prediction\(^\spadesuit\)
3.6.3
Evaluation
3.7
Other data types
3.7.1
Count data
3.7.2
Semi-continuous data
3.7.3
Censored, truncated and rounded data
3.8
Nonignorable missing data
3.8.1
Overview
3.8.2
Selection model
3.8.3
Pattern-mixture model
3.8.4
Converting selection and pattern-mixture models
3.8.5
Sensitivity analysis
3.8.6
Role of sensitivity analysis
3.8.7
Recent developments
3.9
Exercises
4
Multivariate missing data
4.1
Missing data pattern
4.1.1
Overview
4.1.2
Summary statistics
4.1.3
Influx and outflux
4.2
Issues in multivariate imputation
4.3
Monotone data imputation
4.3.1
Overview
4.3.2
Algorithm
4.4
Joint modeling
4.4.1
Overview
4.4.2
Continuous data
4.4.3
Categorical data
4.5
Fully conditional specification
4.5.1
Overview
4.5.2
The MICE algorithm
4.5.3
Compatibility\(^\spadesuit\)
4.5.4
Congeniality or compatibility?
4.5.5
Model-based and data-based imputation
4.5.6
Number of iterations
4.5.7
Example of slow convergence
4.5.8
Performance
4.6
FCS and JM
4.6.1
Relations between FCS and JM
4.6.2
Comparisons
4.6.3
Illustration
4.7
MICE extensions
4.7.1
Skipping imputations and overimputation
4.7.2
Blocks of variables, hybrid imputation
4.7.3
Blocks of units, monotone blocks
4.7.4
Tile imputation
4.8
Conclusion
4.9
Exercises
5
Analysis of imputed data
5.1
Workflow
5.1.1
Recommended workflows
5.1.2
Not recommended workflow: Averaging the data
5.1.3
Not recommended workflow: Stack imputed data
5.1.4
Repeated analyses
5.2
Parameter pooling
5.2.1
Scalar inference of normal quantities
5.2.2
Scalar inference of non-normal quantities
5.3
Multi-parameter inference
5.3.1
\(D_1\) Multivariate Wald test
5.3.2
\(D_2\) Combining test statistics\(^\spadesuit\)
5.3.3
\(D_3\) Likelihood ratio test\(^\spadesuit\)
5.3.4
\(D_1\), \(D_2\) or \(D_3\)?
5.4
Stepwise model selection
5.4.1
Variable selection techniques
5.4.2
Computation
5.4.3
Model optimism
5.5
Parallel computation
5.6
Conclusion
5.7
Exercises
Part II: Advanced techniques
6
Imputation in practice
6.1
Overview of modeling choices
6.2
Ignorable or nonignorable?
6.3
Model form and predictors
6.3.1
Model form
6.3.2
Predictors
6.4
Derived variables
6.4.1
Ratio of two variables
6.4.2
Interaction terms
6.4.3
Quadratic relations\(^\spadesuit\)
6.4.4
Compositional data\(^\spadesuit\)
6.4.5
Sum scores
6.4.6
Conditional imputation
6.5
Algorithmic options
6.5.1
Visit sequence
6.5.2
Convergence
6.6
Diagnostics
6.6.1
Model fit versus distributional discrepancy
6.6.2
Diagnostic graphs
6.7
Conclusion
6.8
Exercises
7
Multilevel multiple imputation
7.1
Introduction
7.2
Notation for multilevel models
7.3
Missing values in multilevel data
7.3.1
Practical issues in multilevel imputation
7.3.2
Ad-hoc solutions for multilevel data
7.3.3
Likelihood solutions
7.4
Multilevel imputation by joint modeling
7.5
Multilevel imputation by fully conditional specification
7.5.1
Add cluster means of predictors
7.5.2
Model cluster heterogeneity
7.6
Continuous outcome
7.6.1
General principle
7.6.2
Methods
7.6.3
Example
7.7
Discrete outcome
7.7.1
Methods
7.7.2
Example
7.8
Imputation of level-2 variable
7.9
Comparative work
7.10
Guidelines and advice
7.10.1
Intercept-only model, missing outcomes
7.10.2
Random intercepts, missing level-1 predictor
7.10.3
Random intercepts, contextual model
7.10.4
Random intercepts, missing level-2 predictor
7.10.5
Random intercepts, interactions
7.10.6
Random slopes, missing outcomes and predictors
7.10.7
Random slopes, interactions
7.10.8
Recipes
7.11
Future research
8
Individual causal effects
8.1
Need for individual causal effects
8.2
Problem of causal inference
8.3
Framework
8.4
Generating imputations by FCS
8.4.1
Naive FCS
8.4.2
FCS with a prior for \(\rho\)
8.4.3
Extensions
8.5
Bibliographic notes
Part III: Case studies
9
Measurement issues
9.1
Too many columns
9.1.1
Scientific question
9.1.2
Leiden 85+ Cohort
9.1.3
Data exploration
9.1.4
Outflux
9.1.5
Finding problems: loggedEvents
9.1.6
Quick predictor selection: quickpred
9.1.7
Generating the imputations
9.1.8
A further improvement: Survival as predictor variable
9.1.9
Some guidance
9.2
Sensitivity analysis
9.2.1
Causes and consequences of missing data
9.2.2
Scenarios
9.2.3
Generating imputations under the \(\delta\)-adjustment
9.2.4
Complete-data model
9.2.5
Conclusion
9.3
Correct prevalence estimates from self-reported data
9.3.1
Description of the problem
9.3.2
Don’t count on predictions
9.3.3
The main idea
9.3.4
Data
9.3.5
Application
9.3.6
Conclusion
9.4
Enhancing comparability
9.4.1
Description of the problem
9.4.2
Full dependence: Simple equating
9.4.3
Independence: Imputation without a bridge study
9.4.4
Fully dependent or independent?
9.4.5
Imputation using a bridge study
9.4.6
Interpretation
9.4.7
Conclusion
9.5
Exercises
10
Selection issues
10.1
Correcting for selective drop-out
10.1.1
POPS study: 19 years follow-up
10.1.2
Characterization of the drop-out
10.1.3
Imputation model
10.1.4
A solution “that does not look good”
10.1.5
Results
10.1.6
Conclusion
10.2
Correcting for nonresponse
10.2.1
Fifth Dutch Growth Study
10.2.2
Nonresponse
10.2.3
Comparison to known population totals
10.2.4
Augmenting the sample
10.2.5
Imputation model
10.2.6
Influence of nonresponse on final height
10.2.7
Discussion
10.3
Exercises
11
Longitudinal data
11.1
Long and wide format
11.2
SE Fireworks Disaster Study
11.2.1
Intention to treat
11.2.2
Imputation model
11.2.3
Inspecting imputations
11.2.4
Complete-data model
11.2.5
Results from the complete-data model
11.3
Time raster imputation
11.3.1
Change score
11.3.2
Scientific question: Critical periods
11.3.3
Broken stick model\(^\spadesuit\)
11.3.4
Terneuzen Birth Cohort
11.3.5
Shrinkage and the change score\(^\spadesuit\)
11.3.6
Imputation
11.3.7
Complete-data model
11.4
Conclusion
11.5
Exercises
Part IV: Extensions
12
Conclusion
12.1
Some dangers, some do’s and some don’ts
12.1.1
Some dangers
12.1.2
Some do’s
12.1.3
Some don’ts
12.2
Reporting
12.2.1
Reporting guidelines
12.2.2
Template
12.3
Other applications
12.3.1
Synthetic datasets for data protection
12.3.2
Analysis of coarsened data
12.3.3
File matching of multiple datasets
12.3.4
Planned missing data for efficient designs
12.3.5
Adjusting for verification bias
12.4
Future developments
12.4.1
Derived variables
12.4.2
Algorithms for blocks and batches
12.4.3
Nested imputation
12.4.4
Better trials with dynamic treatment regimes
12.4.5
Distribution-free pooling rules
12.4.6
Improved diagnostic techniques
12.4.7
Building block in modular statistics
12.5
Exercises
Appendix
A
Technical information
References