5.4 Stepwise model selection

The standard multiple imputation scheme consists of three phases:

Imputation of the missing data \(m\) times;
Analysis of the \(m\) imputed datasets;
Pooling of the parameters across \(m\) analyses.

This scheme is difficult to apply if stepwise model selection is part of the statistical analysis in phase 2. Application of stepwise variable selection methods may result in sets of variables that differ across the \(m\) datasets. It is not obvious how phase 3 should be done.

5.4.1 Variable selection techniques

Brand (1999, chap. 7) was the first to recognize and treat the variable selection problem. He proposed a solution in two steps. The first step involves performing stepwise model selection separately on each imputed dataset, followed by the construction of a new supermodel that contains all variables that were present in at least half of the initial models. The idea is that this criterion excludes variables that were selected accidentally. Moreover, it is a rough correction for multiple testing. Second, a special procedure for backward elimination is applied to all variables present in the supermodel. Each variable is removed in turn, and the pooled likelihood ratio \(p\)-value (Equation (5.14)) is calculated. If the largest \(p\)-value is larger than 0.05, the corresponding variable is removed, and the procedure is repeated on the smaller model. The procedure stops if all \(p \leq 0.05\). The procedure was found to be a considerable improvement over complete-case analysis.

Yang, Belin, and Boscardin (2005) proposed variable selection techniques using Bayesian model averaging. The authors studied two methods. The first method, called “impute then select,” applies Bayesian variable selection methods on the imputed data. The second method, called “simultaneously impute and select” combines selection and missing data imputation into one Gibbs sampler. Though the latter slightly outperformed the first method, the first method is more broadly applicable. Application of the second method seems to require equivalent imputation and analysis models, thus defeating one of the main advantages of multiple imputation.

Wood, White, and Royston (2008) and Vergouwe et al. (2010) studied several scenarios for variable selection. We distinguish three general approaches:

Majority. A method that selects variables in the final that appear in at least half of the models.
Stack. Stack the imputed datasets into a single dataset, assign a fixed weight to each record and apply the usual variable selection methods.
Wald. Stepwise model selection is based on the Wald statistic calculated from the multiply imputed data.

The majority method is identical to step 1 of Brand (1999), whereas the Wald test method is similar to Brand’s step 2, with the likelihood ratio test replaced by the Wald test. The Wald test method is recommended since it is a well-established approach that follows Rubin’s rules, whereas the majority and stack methods fail to take into account the uncertainty caused by the missing data. Indeed, Wood, White, and Royston (2008) found that the Wald test method is the only procedure that preserved the type I error.

Zhao and Long (2017) review recent work on variable selection on imputed data. These authors favor approaches based on the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996). The MI-LASSO method by Chen and Wang (2013) tests the coefficients across all the stacked datasets, thus ensuring model consistency across different imputations. Marino, Buxton, and Li (2017) proposed an extension to select covariates in multilevel models.

In practice, it may be useful to combine methods. The Wald test method is computationally intensive, but is now easily available in mice as the D1() function. A strong point of the majority method is that it gives insight into the variability between the imputed datasets. An advantage of the stack method is that only one dataset needs to be analyzed. The discussion of Wood, White, and Royston (2008) contains additional simulations of a two-step method, in which a preselection made by the majority and stack methods is followed by the Wald test. This yielded a faster method with better theoretical properties. In practice, a judicious combination of approaches might turn out best.

5.4.2 Computation

The following steps illustrate the main steps involved by implementing a simple majority method to select variables in mice.

data <- boys[boys$age >= 8, -4]
imp <- mice(data, seed = 28382, m = 10, print = FALSE)
scope <- list(upper = ~ age + hgt + wgt + hc + gen + phb + reg,
              lower = ~1)
expr <- expression(f1 <- lm(tv ~ 1),
                   f2 <- step(f1, scope = scope))
fit <- with(imp, expr)

This code imputes the boys data \(m = 10\) times, fits a stepwise linear model to predict tv (testicular volume) separately to each of the imputed dataset. The following code blocks counts how many times each variable was selected.

formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)

votes
age gen  hc hgt phb reg wgt 
 10   9   1   6   9  10   1

The lapply() function is used three times. The first statement extracts the model formulas fitted to the \(m\) imputed datasets. The second lapply() call decomposes the model formulas into pieces, and the third call extracts the names of the variables included in all \(m\) models. The table() function counts the number of times that each variable in the 10 replications. Variables age, gen and reg are always included, whereas hc was selected in only one of the models. Since hgt appears in more than 50% of the models, we can use the Wald test to determine whether it should be in the final model.

fit.without <- with(imp, lm(tv ~ age + gen + reg + phb))
fit.with <- with(imp, lm(tv ~ age + gen + reg + phb + hgt))
D1(fit.with, fit.without)

   test statistic df1  df2 df.com p.value   riv
 1 ~~ 2      2.15   1 19.3    409   0.159 0.978

The \(p\)-value is equal to 0.173, so hgt is not needed in the model. If we go one step further, and remove phb, we obtain

fit.without <- with(imp, lm(tv ~ age + gen + reg))
fit.with <- with(imp, lm(tv ~ age + gen + reg + phb))
D1(fit.with, fit.without)

   test statistic df1  df2 df.com p.value  riv
 1 ~~ 2      2.49   5 97.9    410  0.0362 1.29

The significant difference (\(p=0.029\)) between the models implies that phb should be retained. We obtain similar results for the other three variables, so the final model contains age, gen, reg and phb.

5.4.3 Model optimism

The main danger of data-driven model building strategies is that the model found may depend highly on the sample at hand. For example, Viallefont, Raftery, and Richardson (2001) showed that of the variables declared to be “significant” with \(p\)-values between 0.01 and 0.05 by stepwise variable selection, only 49% actually were true risk factors. Various solutions have been proposed to counter such model optimism. A popular procedure is bootstrapping the model as developed in Sauerbrei and Schumacher (1992) and Harrell (2001). Although Austin (2008) found it ineffective to identify true predictors, this method has often been found to work well for developing predictive models. The method randomly draws multiple samples with replacement from the observed sample, thus mimicking the sampling variation in the population from which the sample was drawn. Stepwise regression analyses are replayed in each bootstrap sample. The proportion of times that each prognostic variable is retained in the stepwise regression model is known as the inclusion frequency (Sauerbrei and Schumacher 1992). This proportion provides information about the strength of the evidence that an indicator is an independent predictor. In addition, each bootstrap model can be fitted to the original sample. The difference between the apparent performance and the bootstrap performance provides the basis for performance measures that correct for model optimism. Steyerberg (2009, 95) provides an easy-to-follow procedure to calculate such optimism-corrected performance measures.

Clearly, the presence of missing data adds uncertainty to the model building process, so optimism can be expected to be more severe with missing data. It is not yet clear what the best way is to estimate optimism from incomplete data. Heymans et al. (2007) explored the combination of multiple imputation and the bootstrap. There appear to be at least four general procedures:

Imputation. Multiple imputation generates 100 imputed datasets. Automatic backward selection is applied to each dataset. Any differences found between the 100 fitted models are due to the missing data.
Bootstrap. 200 bootstrap samples are drawn from one singly imputed completed data. Automatic backward selection is applied to each dataset. Any differences found between the 200 fitted models are due to sampling variation.
Nested bootstrap. The bootstrap method is applied on each of the multiply imputed datasets. Automatic backward selection is applied to each of the \(100 \times 200\) datasets. Differences between the fitted model portray both sampling and missing data uncertainty.
Nested imputation. The imputation method is applied on each of the bootstrapped datasets.

Heymans et al. (2007) observed that the imputation method produced a wider range of inclusion frequencies than the bootstrap method. This is attractive since a better separation of strong and weak predictors may ease model building. The area under the curve is an overall index of predictive strength. Though the type of method had a substantial effect on the apparent \(c\)-index estimate, the optimism-corrected \(c\)-index estimate was quite similar. The optimism-corrected calibration slope estimates tended to be lower in the methods involving imputation, thus necessitating more shrinkage.

A drawback of the method is the use of classic stepwise variable selection techniques, which do not generalize well to high-dimensional data. Musoro et al. (2014) improved the methods of Heymans et al. (2007) through their use of the LASSO.

Long and Johnson (2015) developed a procedure, called bootstrap imputation and stability selection (BI-SS) , that generates bootstrap samples from the original data, imputes each bootstrap sample by single imputation, obtains the randomized LASSO estimate from each sample, and then selects the active set according to majority. The multiple imputation random LASSO (MIRL) method by Liu et al. (2016) first performs multiple imputation, obtains bootstrap samples from each imputed dataset, estimates regression weights under LASSO, and then selects the active set by majority. It is not yet known how BS-SS and MIRL compare to each other.