I am unaware of any formal literature about the validity of using the bootstrap to simplify a multiple regression equation, but examples appear in the scientific literature occasionally. One of the earliest comes from Gail Gong's doctoral thesis. It is recounted in her article with Brad Efron in the American Statistician (Vol 37, Feb 1983, 36-48). The model building technique was not exactly a pure forward selection procedure, but was quite close. From a set of 19 predictors, she

1. ran 19 separate single-predictor logistic regressions, noting which variables achieved significance at the 0.05 level.
2. ran a forward selection multiple logistic regression program with a 0.10 level of significance as the stopping criterion, using the statistically significant predictors from step 1.
3. ran a forward selection stepwise (that is, allowing for removals) logistic regression with a 0.05 level of significance as the entry/removal criterion, using the variables that entered the model developed in step 2.

"Figure 6 illustrates another use of the bootstrap replications. The predictors chosen by the three-step selection procedure, applied to the bootstrap training set X*, are shown for the last 25 of 500 replications. Among all 500 replications, predictor 13 was selected 37 percent of the time, predictor 15 selected 48 percent, predictor 7 selected 35 percent, and predictor 20 selected 59 percent. No other predictor was selected more than 50 percent of the time. No theory exists for interpreting Figure 6, but the results certainly discourage confidence in the causal nature of the predictors 13, 15, 7, 20." (Efron and Gong, p. 48)

Phillip Good, in his text *Common Errors in Statistics (and How to Avoid Them)* (2003, John Wiley & Sons, p. 147), makes this approach central to his model building strategy.

"We strongly urge you to adopt Dr. Gong's bootstrap approach to validating multi-variable models. Retain only those variables which appear consistently in the bootstrap regression models."

Pointing to such examples as Gong's, I've done something similar a few times when investigators were determined to use stepwise regression. I implemented the bootstrap the hard way: generating individual datasets, analyzing them one at a time (in a batch program), and using a text processor to extract the relevant portions of the output.

Recently I decided to automate the procedure by writing a SAS macro
that not only generated and analyzed the bootstrap samples, but also used
the SAS output delivery system to collect the results. That way, the
entire process could be carried out in one step. I analyzed a half-dozen
research datasets. I found *no* cases where the bootstrap
suggested instability in the model produced by stepwise regression
applied to the original dataset. This was not necessarily a problem. It
could have been that the signals were so strong that they weren't
distorted by stepwise regression.

To test this theory, I generated some random datasets so that I could control their structure. Each consisted of 100 cases containing a response and 10 predictors. The variables were jointly normally distributed with the same underlying correlation between any pair of variables. Therefore, all ten predictors predicted the response equally well. Setting the correlation to something other than 0 ensured that some predictors would enter the stepwise regression equation, but which ones entered would be just a matter of chance.
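One way to build data with this structure is sketched below in Python. It is not the author's program. It relies on a standard construction: every variable is a weighted sum of one shared normal component and its own independent normal component, which gives each variable unit variance and every pair the same correlation `rho`. The function names, the seed, and the choice of column 0 as the response are assumptions made for illustration.

```python
import math
import random


def equicorrelated_sample(n_cases=100, n_vars=11, rho=0.5, seed=None):
    """Draw n_cases rows of n_vars standard normal variables.

    Every pair of variables has population correlation rho (0 <= rho < 1).
    Column 0 can serve as the response and the rest as predictors.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("this sketch expects 0 <= rho < 1")
    rng = random.Random(seed)
    shared_weight = math.sqrt(rho)
    own_weight = math.sqrt(1.0 - rho)
    rows = []
    for _ in range(n_cases):
        shared = rng.gauss(0.0, 1.0)  # component common to all variables
        rows.append([shared_weight * shared + own_weight * rng.gauss(0.0, 1.0)
                     for _ in range(n_vars)])
    return rows


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# One dataset shaped like those in the text: 100 cases, a response, 10 predictors.
rows = equicorrelated_sample(n_cases=100, n_vars=11, rho=0.5, seed=1)
columns = list(zip(*rows))
response, predictors = columns[0], columns[1:]
for j, x in enumerate(predictors, start=1):
    print(f"corr(Y, X{j}) = {correlation(response, x):.2f}")
```

With only 100 cases the sample correlations scatter around 0.50, and that scatter is exactly what a selection procedure can mistake for real differences among the predictors.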

When I did this, I found the same thing as in the real data. The bootstrap samples pointed to the same model as the stepwise regression on the full dataset. For example, one dataset with a common underlying correlation of 0.50 (Here's the code if you'd like to try it yourself.) led to a forward selection regression model that included X1, X3, and X5. In the 100 bootstrap samples drawn from this dataset, the 10 predictors entered with the following frequencies.

| Variable entered | Number of bootstrap samples (out of 100) |
|---|---|
| X1 | 57 |
| X2 | 4 |
| X3 | 83 |
| X4 | 8 |
| X5 | 76 |
| X6 | 28 |
| X7 | 4 |
| X8 | 6 |
| X9 | 14 |
| X10 | 9 |

And the winners are... X3, X5, and X1! We can quibble over X1, but the frequency with which X3 and X5 appear is impressive. There is nothing here to suggest that these are random data in which every predictor is equally good.
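For readers who want to try the experiment, here is a rough Python sketch of the whole exercise: simulate one equicorrelated dataset, run forward selection on it, then rerun the selection on bootstrap resamples and tally how often each predictor enters. It is not the SAS macro described above. The entry rule is an assumption (an F-to-enter of 4.0, which is close to the 0.05 level at these sample sizes), as are the function names and seeds, so the counts will not match the table.

```python
import math
import random


def simulate(n_cases, n_predictors, rho, rng):
    """Equicorrelated normal data; returns (y, X) with X as a list of columns."""
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    rows = []
    for _ in range(n_cases):
        shared = rng.gauss(0.0, 1.0)
        rows.append([a * shared + b * rng.gauss(0.0, 1.0)
                     for _ in range(n_predictors + 1)])
    cols = [list(c) for c in zip(*rows)]
    return cols[0], cols[1:]


def centered(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_select(y, X, f_to_enter=4.0):
    """Forward selection; returns predictor indices in order of entry."""
    n = len(y)
    resid_y = centered(y)  # response adjusted for the intercept
    resid_x = {j: centered(col) for j, col in enumerate(X)}
    entered = []
    while resid_x:
        rss = dot(resid_y, resid_y)
        # Drop in residual sum of squares if predictor j were added next.
        gain = {j: dot(x, resid_y) ** 2 / dot(x, x) for j, x in resid_x.items()}
        best = max(gain, key=gain.get)
        df_resid = n - len(entered) - 2  # intercept + entered + candidate
        f_stat = gain[best] / ((rss - gain[best]) / df_resid)
        if f_stat < f_to_enter:
            break
        x_best = resid_x.pop(best)
        norm = dot(x_best, x_best)
        # Sweep the new predictor out of the response and remaining candidates.
        coef = dot(x_best, resid_y) / norm
        resid_y = [r - coef * x for r, x in zip(resid_y, x_best)]
        resid_x = {j: [v - (dot(x_best, x) / norm) * w for v, w in zip(x, x_best)]
                   for j, x in resid_x.items()}
        entered.append(best)
    return entered


def bootstrap_entry_counts(y, X, n_boot=100, f_to_enter=4.0, seed=0):
    """Count how many bootstrap samples select each predictor."""
    rng = random.Random(seed)
    n = len(y)
    counts = [0] * len(X)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        y_b = [y[i] for i in idx]
        X_b = [[col[i] for i in idx] for col in X]
        for j in forward_select(y_b, X_b, f_to_enter):
            counts[j] += 1
    return counts


y, X = simulate(100, 10, 0.5, random.Random(2024))
chosen = forward_select(y, X)
counts = bootstrap_entry_counts(y, X, n_boot=100)
print("full-data model:", ["X%d" % (j + 1) for j in sorted(chosen)])
for j, count in enumerate(counts, start=1):
    print(f"X{j:<3d} entered in {count:3d} of 100 bootstrap samples")
```

Changing the seed passed to `simulate` gives a different "winning" set of predictors each time, which makes it easy to compare the full-data model with the bootstrap tallies across many simulated datasets.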

It appears the problem is the one that has worried me all along about data splitting: whatever peculiarities in the dataset led X1, X3, and X5 to be the chosen ones in the stepwise regression also make them the favorites in the bootstrap samples. In retrospect, it seems obvious that this would happen. Yet even Gong and Efron considered this approach as a possibility. While Good (page 157, step 4) advises limiting attention to one or two of the most significant predictor variables, the examples here show that such advice is not enough to avoid choosing an improper model. Good warns about the importance of checking the validity of the model in another set of data, but this is not easy to do and seems to happen too seldom in practice.

My hope is that statisticians will discover how to modify the bootstrap to study model stability properly. Until then, I'll no longer be using it to evaluate models generated by stepwise regression, though it would have been nice if it had worked.