Data Steps: September 2007

Wednesday, September 19, 2007

Hooray For VMware!

I am excited about computers again! Every once in a while something comes along that really changes the way you interact with computers. You know the feeling: it stops you in your tracks and makes you say, wow.

I remember when I was a kid and I first played a game called "Beach Head" on my Commodore 64. There was a level where you controlled a machine gun and the little computer guys would run at you from behind walls and throw grenades at you. Every once in a while, when you shot one of the little men, he would yell "Medic!" or "I'm hit!". It was such a strain for that little computer to create the digitized speech that the whole game would slow down for a second or two. But my brother and I were seriously impressed. Wow!

I recently installed VMware Player on my little Dell laptop. If you are not familiar with VMware and their virtualization technology, then stop reading this and go to their web site. It is easily the most impressive software I have used in quite a while.

You see, I am going on vacation for two weeks (woohoo!) and will have some time to work on some coding projects during flights. I have been working on a Perl/web/MySQL project for my website for a while now and am getting close to finishing it. To work on it, I usually log into my remote server using ssh and work away. That works great until you aren't connected to the internet. So I thought, why not create a local server to work on while I am away from the internet?

Usually that would entail downloading a Linux distro, partitioning part of my hard drive, making sure the distro has all the drivers it needs for my laptop, setting up and configuring all the tools I need, etc. Essentially a lot of wasted, unproductive time.

Last night I downloaded VMware Player for free. Then I downloaded an appliance called Grandma's LAMP, also for free. An appliance is a full-blown pre-configured virtual server that is hosted on your machine through the player. Within minutes it was up and running.

All I had to do was go to my web server, tarball all the files for my application and download them to my laptop. Then I just copied them to my virtual Ubuntu server using the pre-configured Samba share and voila! A completely usable local copy of my entire development environment in two hours! I am seriously impressed. And all without doing any reconfiguring on my little Windows XP laptop.

And to top it off, I can take the whole virtual server and the player and copy them to a 2 gig thumb drive. Any computer I stick my USB drive into can host my development server. Wow, indeed.

Thursday, September 13, 2007

Bootstrap Resampling

First of all, I should mention here and now at the beginning of this post that I am not a statistician. But I am married to one (Happy Birthday Orla!), and I do understand normal distributions and confidence intervals and standard deviations and such. Suffice it to say, I generally get the concepts but my eyes invariably glaze over once the equations are presented.

Now that I've gotten that out of the way, I will attempt to make this post about... statistics! Hopefully everything I write will make sense, but if anything is outrageously stupid, feel free to forgive me and correct me in the comments.

On one of my travels through the internet I came across something I had never heard of before: bootstrap resampling. I will attempt to describe my understanding of it, but please do check out the links at the bottom because I am sure to over-simplify or exaggerate some parts.

In traditional parametric statistics the data is generally assumed to follow a particular pattern or distribution, with the "normal" bell curve distribution being the ideal. Statisticians use various tests to determine if the sample data is normally distributed (a surprising amount of data is) and then proceed to make statistically sound inferences about the population the data was drawn from (confidence intervals, standard deviation, etc.). The idea is that what holds for one randomly drawn sample should hold for any randomly drawn sample from the population, assuming the data fits the normal distribution.
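To make the bell-curve approach a little more concrete, here is a minimal Python sketch of the textbook 95% confidence interval for a mean, as I understand it. The function name `normal_ci_95` and the sample numbers are made up for illustration, and the 1.96 multiplier assumes the data is roughly normal and the sample is reasonably large (for small samples a statistician would likely reach for a t-distribution instead).

```python
import math
import statistics

def normal_ci_95(data):
    """Textbook 95% interval for the mean, assuming roughly normal data."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    # 1.96 is the 97.5th percentile of the standard normal distribution
    half_width = 1.96 * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Made-up numbers, purely for illustration
sample = [4.1, 5.0, 5.3, 4.8, 6.2, 5.5, 4.9, 5.1, 5.7, 4.6]
low, high = normal_ci_95(sample)
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")
```

The interval is always symmetric around the sample mean, which is exactly the kind of built-in assumption the bell curve brings with it.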

Now, if I understand bootstrap resampling correctly, there is no need to assume the data follows a normal distribution, or any particular statistical distribution. You draw a sample from your data (the same size as the original, putting each value back after you draw it) and record the mean, then you draw another sample the same way and record its mean. You repeat that many, many, many times and then use the resulting means to pick your intervals. Here is the original description I read from a wonderful site called the World Question Center. It is an excerpt from the response of Bart Kosko. If you scroll about halfway down the page you will find it. He is way smarter than I am so his explanation will surely make more sense than mine:

"The hero of data-based reasoning is the bootstrap resample. The bootstrap has produced a revolution of sorts in statistics since statistician Bradley Efron introduced it in 1979 when personal computers were becoming more available. The bootstrap in effect puts the data set in a bingo hopper and lets the user sample from the data set over and over again just so long as the user puts the data back in the hopper after drawing and recording it. Computers easily let one turn an initial set of 100 data points into tens of thousands of resampled sets of 100 points each. Efron and many others showed that these virtual samples contain further information about the original data set. This gives a statistical free lunch except for the extensive computation involved—but that grows a little less expensive each day. A glance at most multi-edition textbook on statistics will show the growing influence of the bootstrap and related resampling techniques in the later editions.
Consider the model-based baggage that goes into the standard 95% confidence interval for a population mean. Such confidence intervals appear expressly in most medical studies and reports and appear implicitly in media poll results as well as appearing throughout science and engineering. The big assumption is that the data come reasonably close to a bell curve even if it has thick tails. A similar assumption occurs when instructors grade on a "curve" even the student grades often deviate substantially from a bell curve (such as clusters of good and poor grades). Sometimes one or more statistical tests will justify the bell-curve assumption to varying degrees — and some of the tests themselves make assumptions about the data. The simplest bootstrap confidence interval makes no such assumption. The user computes a sample mean for each of the thousands of virtual data sets. Then the user rank-orders these thousands of computed sample means from smallest to largest and picks the appropriate percentile estimates. Suppose there were a 1000 virtual sample sets and thus 1000 computed sample means. The bootstrap interval picks the 25th — largest sample mean for the lower bound of the 95% confidence interval and picks the 975th — largest sample mean for the upper bound. Done.
Bootstrap intervals tend to give similar results as model-based intervals for test cases where the user generates the original data from a normal bell curve or the like. The same holds for bootstrap hypothesis tests. But in the real world we do not know the "true" distribution that generated the observed data. So why not avoid the clear potential for modeler bias and just use the bootstrap estimate in the first place?"
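Here is my attempt at sketching the recipe from the quote in Python: resample with replacement, compute a mean for each virtual data set, rank-order the means, and pick the percentiles. The function name `bootstrap_ci_mean`, its parameters and the sample numbers are my own inventions for illustration, so treat this as a rough sketch of the simplest percentile bootstrap and not as anyone's official method.

```python
import random
import statistics

def bootstrap_ci_mean(data, n_resamples=1000, confidence=0.95, seed=None):
    """Simplest percentile bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Each virtual data set is n values drawn from the data with replacement
    # (the "bingo hopper"); record the mean of each one, then rank-order them.
    means = sorted(
        statistics.mean(rng.choices(data, k=n))
        for _ in range(n_resamples)
    )
    # With 1000 resamples and 95% confidence this picks the 25th and 975th
    # values in the sorted list (list indexes 24 and 974).
    lower_rank = max(round(n_resamples * (1 - confidence) / 2), 1)
    upper_rank = n_resamples - lower_rank
    return means[lower_rank - 1], means[upper_rank - 1]

# Made-up numbers, purely for illustration
observations = [4.1, 5.0, 5.3, 4.8, 6.2, 5.5, 4.9, 5.1, 5.7, 4.6]
boot_low, boot_high = bootstrap_ci_mean(observations, seed=2007)
print(f"Bootstrap 95% CI for the mean: {boot_low:.2f} to {boot_high:.2f}")
```

Passing a `seed` just makes the run repeatable; leave it out and the interval will wobble slightly from run to run, which is expected with resampling.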

So my questions to the statisticians: do you use bootstrap resampling? Is this something you do in SAS? Do you feel it helps to simplify statistics and open it up to us non-statisticians?

Really good explanation of bootstrap resampling:
http://www.uvm.edu/~dhowell/StatPages/Resampling/Bootstrapping.html

Bootstrapping in SAS:
http://support.sas.com/ctx/samples/index.jsp?sid=479

