Google SAS Search

Thursday, September 13, 2007

Bootstrap Resampling

First of all, I should mention here and now at the beginning of this post that I am not a statistician. But I am married to one (Happy Birthday Orla!), and I do understand normal distributions, confidence intervals, standard deviations, and such. Suffice it to say, I generally get the concepts, but my eyes invariably glaze over once the equations are presented.

Now that I've gotten that out of the way, I will attempt to make this post about... statistics! Hopefully everything I write will make sense, but if anything is outrageously stupid, feel free to forgive me and correct me in the comments.

On one of my travels through the internet I came across something I had never heard of before: bootstrap resampling. I will attempt to describe my understanding of it, but please do check out the links at the bottom, because I am sure to over-simplify or exaggerate some parts.

In traditional parametric statistics, the data is generally assumed to follow a particular pattern or distribution, with the "normal" bell curve being the ideal. Statisticians use various tests to determine whether the sample data is normally distributed (a very surprising amount of data is) and then proceed to make statistically sound inferences about the population the data was drawn from (confidence intervals, standard deviations, etc.). If it is true for one randomly drawn sample, then it is true for any randomly drawn sample from the population, assuming the sample fits the normal distribution.
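To make that parametric route concrete, here is a minimal Python sketch (illustrative only, since this blog is about SAS) of the textbook normal-theory 95% confidence interval for a mean, using made-up simulated data:

```python
import math
import random
import statistics

random.seed(0)
# Simulated, roughly normal data standing in for a real sample
data = [random.gauss(50, 10) for _ in range(100)]

# Classic normal-theory 95% interval: mean +/- 1.96 standard errors
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))
ci = (mean - 1.96 * se, mean + 1.96 * se)
```

The 1.96 comes straight from the normal distribution's quantiles, so the whole calculation leans on the bell-curve assumption described above.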

Now, if I understand bootstrap resampling correctly, there is no need to assume the data follows a normal distribution, or any particular statistical distribution. You take a sample from your data and record the mean; then you put your sample back, draw another sample, and record its mean. You repeat that many, many, many times and then use the resulting means to pick your intervals. Here is the original description I read, from a wonderful site called the World Question Center. It is an excerpt from the response of Bart Kosko; if you scroll about halfway down the page you will find it. He is way smarter than I am, so his explanation will surely make more sense than mine:

"The hero of data-based reasoning is the bootstrap resample. The bootstrap has produced a revolution of sorts in statistics since statistician Bradley Efron introduced it in 1979 when personal computers were becoming more available. The bootstrap in effect puts the data set in a bingo hopper and lets the user sample from the data set over and over again just so long as the user puts the data back in the hopper after drawing and recording it. Computers easily let one turn an initial set of 100 data points into tens of thousands of resampled sets of 100 points each. Efron and many others showed that these virtual samples contain further information about the original data set. This gives a statistical free lunch except for the extensive computation involved—but that grows a little less expensive each day. A glance at most multi-edition textbook on statistics will show the growing influence of the bootstrap and related resampling techniques in the later editions.
Consider the model-based baggage that goes into the standard 95% confidence interval for a population mean. Such confidence intervals appear expressly in most medical studies and reports and appear implicitly in media poll results as well as appearing throughout science and engineering. The big assumption is that the data come reasonably close to a bell curve even if it has thick tails. A similar assumption occurs when instructors grade on a "curve" even when the student grades often deviate substantially from a bell curve (such as clusters of good and poor grades). Sometimes one or more statistical tests will justify the bell-curve assumption to varying degrees — and some of the tests themselves make assumptions about the data. The simplest bootstrap confidence interval makes no such assumption. The user computes a sample mean for each of the thousands of virtual data sets. Then the user rank-orders these thousands of computed sample means from smallest to largest and picks the appropriate percentile estimates. Suppose there were 1000 virtual sample sets and thus 1000 computed sample means. The bootstrap interval picks the 25th-largest sample mean for the lower bound of the 95% confidence interval and picks the 975th-largest sample mean for the upper bound. Done.
Bootstrap intervals tend to give similar results as model-based intervals for test cases where the user generates the original data from a normal bell curve or the like. The same holds for bootstrap hypothesis tests. But in the real world we do not know the "true" distribution that generated the observed data. So why not avoid the clear potential for modeler bias and just use the bootstrap estimate in the first place?"
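Kosko's recipe (resample with replacement, rank-order the resample means, read off the 25th- and 975th-largest of 1000) fits in a few lines of Python. This is an illustrative stand-in rather than the SAS approach the post asks about:

```python
import random
import statistics

def bootstrap_ci_mean(data, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean.

    Draw n_resamples samples (with replacement) from the data, compute
    the mean of each, sort the means, and read off the percentile
    bounds: for 1000 resamples at 95%, the 25th- and 975th-largest.
    """
    rng = random.Random(seed)
    n = len(data)
    means = sorted(statistics.mean(rng.choices(data, k=n))
                   for _ in range(n_resamples))
    lo = means[round(n_resamples * alpha / 2) - 1]        # 25th of 1000
    hi = means[round(n_resamples * (1 - alpha / 2)) - 1]  # 975th of 1000
    return lo, hi

random.seed(0)
data = [random.expovariate(1.0) for _ in range(100)]  # skewed, non-normal data
lo, hi = bootstrap_ci_mean(data)
```

Note that the data here is deliberately skewed: no bell-curve assumption is made anywhere in the calculation.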

So my questions to the statisticians: do you use bootstrap resampling? Is this something you do in SAS? Do you feel it helps to simplify statistics and open it up to us non-statisticians?

Really good explanation of bootstrap resampling:

Bootstrapping in SAS:


  1. I've just read your post, and there are some things in it that do not seem totally correct to me.
    The trick about the bootstrap is not the distribution of the data itself, but the distribution of the statistic.
    Even if you are computing a mean of prices, and prices do not follow a normal distribution, their mean is believed to do so. And hopefully it really does, provided you have enough observations to compute means on.
    So you can compute correct confidence intervals for the mean on any "big enough" dataset: just compute the mean and standard deviation, and mix with the usual normal distribution quantiles.
    But what about a median, for example? Does a median follow a normal distribution? That is, if you compute a median on every observation except one in your data (which is called a jackknife), will it turn out to look like the usual bell? It does not.
    So you can draw samples, compute medians on each, and then give values for empirical confidence intervals for your median. The bootstrap magic is that it will work on any given "stable" statistic (except for extreme quantiles such as the minimum or maximum).
    But if you explain the bootstrap using the mean as an example, it is not very relevant, since the theory behind confidence intervals for the mean is quite robust; the bootstrap does not improve anything there. But for correlations, chi-squares, and medians, yes it does.
    And to answer your questions, I do use SAS as a bootstrapping tool, drawing my samples with the SURVEYSELECT procedure (from SAS/STAT) rather than with a DATA step as your Institute example suggests.

    Thank you very much for your blog, and for talking about bootstrap, an often underestimated method for digging up important information.


  2. Bootstrap methods do not assume normality of the population, but they rely on a "representative sample" to re-sample from. This is the assumption you make whenever you are bootstrapping. For a very small sample size, this assumption is also questionable.

  3. Bootstrapping is a very powerful tool when you use unusual estimators, such as the median, quantiles, or even predictions from a given statistical model. The thing is, you can't do statistical hypothesis testing or get confidence intervals if you don't know the distribution of those estimators, and there's not always a closed-form formula for them. The bootstrap allows you to "estimate" the distribution of the estimator, so you can obtain empirical variances and then "plug" them into the usual confidence interval formulas, etc.

    That thing you mentioned about not needing to know the real distribution... is solved (partially) by SMOOTHING techniques (kernel density estimators, splines, generalized additive models...) but that's another story...
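    The point in this comment generalizes the earlier sketch: the same resampling loop works for any plug-in statistic. Here is a hedged Python illustration (again a stand-in, not SAS) using the median, whose sampling distribution has no simple closed form:

    ```python
    import random
    import statistics

    def bootstrap_ci(data, stat, n_resamples=1000, alpha=0.05, seed=1):
        """Percentile bootstrap CI for an arbitrary statistic, e.g. the median."""
        rng = random.Random(seed)
        n = len(data)
        values = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
        return (values[round(n_resamples * alpha / 2) - 1],
                values[round(n_resamples * (1 - alpha / 2)) - 1])

    random.seed(7)
    data = [random.expovariate(1.0) for _ in range(200)]  # skewed data
    lo, hi = bootstrap_ci(data, statistics.median)        # no formula needed
    ```

    Swapping in `statistics.mean`, a correlation, or any other "stable" statistic requires no change to the interval logic.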

  4. If your bootstrap distribution is normal, it doesn't mean the representative sample follows a normal distribution, does it?