Data Steps: 2007

Wednesday, November 14, 2007

Using Logical Expressions In SQL

Most of us SAS programmers approach SQL as simply a data extraction and table joining tool. Since most of us have used the data step longer than SQL, we tend to leave the logic programming to the data step with its if/then statements. However, SQL does have a way of assigning values conditionally. With the CASE expression you can test and assign values logically.
The basic syntax is:


CASE value  
   WHEN condition THEN result  
   WHEN condition THEN result  
   ELSE result
END

In the code below I am just assigning a 1 or a 0 to a column/variable named bool_tf.
Using the CASE expression is pretty straightforward and is another great way to use SQL to get more coding done in fewer steps.


data myData;
  input answer $;
datalines;
true
false
true
true
false
false
true
;
proc sql;
  create table a as
  select answer, 
         case substr(answer,1,1)
              when 't' then 1
       when 'f' then 0
  end as bool_tf
  from myData;
quit;

Tuesday, October 16, 2007

Saving Steps With SQL

Often we need to create some simple statistics for a set of data and then associate those stats with each observation of the original set. As a simple example
consider a table with only three rows:
N
3
6
4

We want to get the mean of the variable N and stripe it down all the observations:
N Mean
3 4.333
6 4.333
4 4.333

The first way I learned to do this was with a proc summary and a merge. A better way to do it is with proc sql.

Here is a little test data:
data myData;
input x level $;
cards;
11 a
31 a
51 a
2 b
61 a
8 b
21 a
71 a
91 a
4 b
61 a
21 a
5 b
7 b
5 b
31 a
1 b
61 a
8 a
9 b
3 a
2 b
5 a
7 b
7 a
3 b
;
* in that data set we have two variables X and LEVEL. We can get the stats on X for each level by summarizing and merging...;
proc sort data = myData;
by level;
run;

proc summary data = myData;
by level;
var x;
output out= tempStats(drop=_type_ _freq_) mean=mean max=max min=min;
run;

data sumStats;
merge myData tempStats;
by level;
run;

* or better yet, we can collapse the whole thing into one nifty proc sql step!;
proc sql;
create table stats as select *,
min(x) as min,
max(x) as max,
mean(x) as mean
from myData group by level;
quit;

Wednesday, September 19, 2007

Hooray For Vmware!

I am excited for computers again! Every once in a while something comes along that really changes the way you interact with computers. You know the feeling, it stops you in your tracks and makes you say, wow.

I remember when I was a kid and I first played a game called "Beach Head" on my Commodore 64. There was a level where you controlled a machine gun and the little computer guys would run at you from behind walls and throw grenades at you.
Every once in a while if you would shoot one of the little men he would yell "Medic!" or "I'm hit!". It was such a strain for that little computer to create the digitized speech that the whole game would slow down for a second or two. But my brother and I were seriously impressed. Wow!

I recently installed Vmware's Player on my little Dell laptop. If you are not familiar with Vmware and their virtualition technology then stop reading this and go to their web site. It is easily the most impressive software I have used in quite a while.

You see, I am going on vacation for two weeks (woohoo!) and will have some time to work on some coding projects during flights. I have been working on a perl/web/mySql project for my website for a while now and am getting close to finishing it. To work on it, I usually log into my remote server using ssh and work away. Works great until you aren't connected to the internet. So I thought, why not create a local server to work on while I am away from the internet?

Usually that would entail downloading a linux distro, partitioning part of my hard drive, making sure the distro has all the drivers it needs for my laptop, setting up and configuring all the tools I need, etc etc. Essentially a lot of wasted, unproductive time.

Last night I downloaded Vmware Player for free. Then I downloaded an appliance called Grandma's LAMP for free. An appliance is a full-blown pre-configured virtual server that is hosted on your machine through the player. Within minutes it was up and running.

All I had to do was go to my web server, tarball all the files for my application and download them to my laptop. Then I just copied them to my virtual
Ubuntu server using the pre-configured samba share and Voila! A completely useable local copy of my entire development environment in two hours! I am seriously impressed. And all without doing any reconfiguring on my little windows xp laptop.

And to top it off, I can take the whole virtual server and the player and copy them to a 2 gig thumb drive. Any computer I stick my USB drive into can host my development server. Wow, indeed.

Thursday, September 13, 2007

Bootstrap Resampling

First of all, I should mention here and now at the beginning of this post that I am not a statistician. But I am married to one (Happy Bithday Orla!), and I dounderstand normal distributions and confidence intervals and standard deviations and such. Suffice it to say, I generally get the concepts but my eyes invariably glaze over once the equations are presented.

Now that I've gotten that out of the way, I will attempt to make this post about... statistics! Hopefully everything I write will make sense, but if anything is outrageously stupid, feel free to forgive me and correct me in the comments.

On one of my travels through the internet I came across something I had never heard of before: bootstrap resampling. I will attempt to describe my understandingof it, but please do check out the links at the bottom because I am sure to over-simplify or exaggerate some parts.

In traditional parametric statistics the data is generally assumed to follow a particular pattern or distribution with the "normal" bell curve distribution being the ideal. Statisticians use various tests to determine if the sample data is normally distributed (a very surprising amount of data is) and then proceed to make statistically sound inferences about the population the data was drawn from (confidence intervals, standard deviation, etc). If it is true for the randomly drawn sample then it is true for any randomly drawn sample from the population. Assuming the sample fits the normal distribution.

Now If I understand bootstrap resampling correctly there is no need to assume the data follows a normal distribution; or any particular statistical distribution. You take a sample from your data and record the mean, then you put your sample back and get another sample of data and record the mean. You repeat that many, many, many times and then use the resulting means to pick your intervals. Here is the original description I read from a wonderful site called the World Question Center. It is an excerpt from the response of Bart Kosko. If you scroll about halfway down the page you will find it. He is way smarter than I am so his explaination will surely make more sense than mine:

"The hero of data-based reasoning is the bootstrap resample. The bootstrap has produced a revolution of sorts in statistics since statistician Bradley Efron introduced it in 1979 when personal computers were becoming more available. The bootstrap in effect puts the data set in a bingo hopper and lets the user sample from the data set over and over again just so long as the user puts the data back in the hopper after drawing and recording it. Computers easily let one turn an initial set of 100 data points into tens of thousands of resampled sets of 100 points each. Efron and many others showed that these virtual samples contain further information about the original data set. This gives a statistical free lunch except for the extensive computation involved—but that grows a little less expensive each day. A glance at most multi-edition textbook on statistics will show the growing influence of the bootstrap and related resampling techniques in the later editions.
Consider the model-based baggage that goes into the standard 95% confidence interval for a population mean. Such confidence intervals appear expressly in most medical studies and reports and appear implicitly in media poll results as well as appearing throughout science and engineering. The big assumption is that the data come reasonably close to a bell curve even if it has thick tails. A similar assumption occurs when instructors grade on a "curve" even the student grades often deviate substantially from a bell curve (such as clusters of good and poor grades). Sometimes one or more statistical tests will justify the bell-curve assumption to varying degrees — and some of the tests themselves make assumptions about the data. The simplest bootstrap confidence interval makes no such assumption. The user computes a sample mean for each of the thousands of virtual data sets. Then the user rank-orders these thousands of computed sample means from smallest to largest and picks the appropriate percentile estimates. Suppose there were a 1000 virtual sample sets and thus 1000 computed sample means. The bootstrap interval picks the 25th — largest sample mean for the lower bound of the 95% confidence interval and picks the 975th — largest sample mean for the upper bound. Done.
Bootstrap intervals tend to give similar results as model-based intervals for test cases where the user generates the original data from a normal bell curve or the like. The same holds for bootstrap hypothesis tests. But in the real world we do not know the "true" distribution that generated the observed data. So why not avoid the clear potential for modeler bias and just use the bootstrap estimate in the first place?"

So my questions to the statisticians: do you use bootstrap resampling? Is this something you do in SAS? Do you feel it helps to simplify statistics and open it up to us non-statisticians?

Really good explaination of bootstrap resampling:
http://www.uvm.edu/~dhowell/StatPages/Resampling/Bootstrapping.html

Bootstraping in SAS:
http://support.sas.com/ctx/samples/index.jsp?sid=479

Tuesday, August 21, 2007

Clean Up Clean Up

Clean up.
Clean up.

Everybody everwhere.

Clean up.
Clean up.

Everybody do your share.

This is the song my wife taught our two-year old daughter in the hopes that it would make clean-up fun and encourage more of it. Sometimes it works really well and sometimes not so well. Every now and then it backfires completely and my toddler makes a big mess just so she can run around in circles singing the Clean Up song. Leaving Mommy or Daddy to do the actual cleaning.

As SAS programmers, we are given a lot of freedom to easily create as many data sets as the system will allow in the workspace. I have met many SAS programmers that do not even carry a thought about the conseqences of keeping all those work data sets hanging around. Some of the trickiest bugs to track down can be caused by stale work data sets (especially when running interactive SAS).

I have found it very useful to delete all work data sets if I am working a piece of code repeatedly. That way I make sure previous runs don't taint current runs. A simple proc datasets does the trick:

proc datasets library=work mt=data nodetails nolist KILL;
quit;

So now that you've got the song and the code, you have no excuses for leaving a mess in the work library :)

Clean up! Clean Up! Everybody Everywhere!

Tuesday, August 07, 2007

Multiple By Variables

Here is one little piece of SAS programming that I always have to work out: When using multiple "by variables" in a SAS data step, when does the grouping flip? An example:


data stuff; 
  set otherStuff; 
  by var1 var2 var3; 
  if first.var1 then ...; 
  if first.var3 then ...;
run;

For some reason, I always have to sit and think through how multiple by variables effect each other. So here, once and for all, is the rule for me to remember:

If the group (value) changes in the variable to the left, it changes the group of all the variables on the right regardlessof their values.

It makes sense if you think it through, but sometimes it's just easier to write the rule down and refer to it (here!).

Wednesday, August 01, 2007

Summer Reading

Currently I am reading a book that is so good, I thought I would give it a quick recommendation. Against The Gods: The Remarkable Story of Risk is one of those books that I know-- before even finishing it, I will read again, and again. And I will gain deeper insights into history, humanity, stock markets, statistics and even the decisions that I make in my everyday life.

So if you get the chance, pick up a copy. And if you have any other good reads that you think I or others might be interested in, please share them here.

Wednesday, June 13, 2007

Los Angeles Basin SAS User Group

If you live in the Los Angeles area and have not had the chance to attend a LABSUG you are missing out. Kimberly Lebouton has worked very hard to bring a user group to Los Angeles and her efforts have been very productive. The speakers have been very good and Kimberly has worked diligently to listen and respond to attendee's feedback.

I was hoping to attend this year, but my wife is going out of town leaving me with babysitting duty. Of course, I could bring my toddler-- she would make a very engaging presenter!

I wonder if Kimberly could carve out 2 hours for nap time this year. . . :)

LABSUG
Friday June 22nd
Sheraton Los Angeles Downtown Hotel
http://www.labsug.org

Tuesday, May 29, 2007

Where Did the Observation Come From?

Here is a little snippet of code I created to address the problem of assigning a value to a variable based on what data set an observation came from in a data step. Here is an example:

Suppose I have a whole bunch of data sets each representing a different country. I want to set a lot of them in one data step and create one resulting data set with a variable called language. In order to create the language variable correctly, we need to know which data set the observation is coming from. Typically we would use the IN= option on the data set to create a flag and then check that flag using if/then logic.


data selectedCountries;
  set
    chile(in=chile)
    china(in=china)
    costa_rica(in=costa)
    egypt(in=egypt)
    fiji(in=fiji)
    turkey(in=turkey)
    usa(in=usa)
    saudi_arabia(in=saudi)
  ;

  if chile then language = 'SPANISH';
  else if china then language = 'CHINESE';
  else if costa then language = 'SPANISH';
  etc etc etc...
run;

One of the major problems with this approach is it does not scale well. The more countries you set, the more problematic your if/then logic becomes.

Here is a slightly more elegant solution that uses arrays and variable information functions. You still use the IN= option on the data set, however you want to name the in= variable the same as the value we want to assign. Then you create an array of all those in=variables. Finally, you loop through the array of in= variables and check for their boolean value. If it is true then you assign your new variable the value derived from the vname() function.


data selectedCountries;
  set
    chile(in= SPANISH)
    china(in= CHINESE)
    costa_rica(in= SPANISH)
    egypt(in= ARABIC)
    fiji(in= ENGLISH)
    turkey(in= TURKISH)
    usa(in= ENGLISH)
    saudi_arabia(in= ARABIC)
  ;
  array names[*] SPANISH CHINESE ARABIC ENGLISH TURKISH;
  do i = 1 to dim(names);
    if names[i] eq 1
     then language = vname( names[i] );
  end;
run;

Wednesday, May 23, 2007

Saving Time

When I was a kid my brother, sister and I spent a lot of time in my Father's dental lab. This gave us a unique opportunity to learn how to get things done in a time-sensitive production environment. The more business he got and the more successful his practice became, the more demanding his labwork. He spent a lot of time working in the lab perfecting techniques and efficiency. We kids would hang out in his dental lab looking for things to do and he would hand out miscellaneous tasks to us (sadly he locked away the NO2 from us). As we got older and more profecient working the lathe, drill, sand blaster, oven, etc we would get more critical tasks. Spending time with Dad meant spending time learning how to get things done in a fast-paced hands-on environment.

One thing Dad would always repeat to us is how important it is to get things done "quickly and correctly."

Just getting things done quickly won't cut it. And believe it or not, just getting things done correctly doesn't cut it either. Not if you have other steps in the process or customers waiting on you to complete your task. In order to have time in this life for things other than work, it helps to learn how to get things done both quickly and correctly.

Generally, most people think of working quickly as producing sloppy work. But actually, you can get things done quickly with FEWER mistakes. The trick is to seperate tasks into two categories: things that should be done very quickly, and things that should be done very correctly. When you get good at cutting down the time it takes for you to do the miscellaneous tasks you can spend more time getting the critical tasks done correctly. This type of thinking translates very well to programming. It has probably helped my career more than any other single piece of advice I have received.

So as you spend your day programming, think to yourself, "what are the non-critical tasks that I am having to do and how can I minimize them?" Believe it or not, with just a few small changes you can find yourself getting a lot more done.

Here is an example of a change that I have recently incorporated. If you are like me, you probably have a few folders on your hard drive that you are constantly having to access. Throughout my day I am constantly typing something like "c:\my data\reports\ad hoc\" into Save As and Open dialog boxes, Windows Explorer, etc. In Windows you can create a PATH variable to substitute. So in my example I might create a Windows path variable name R (stands for reports) that has the value "c:\my data\reports\ad hoc\". Now I can just type %R% to navigate to that folder. Saves time and frees my mind to focus on the more critical tasks than navigating Windows Explorer.

I believe I got that tip from http://www.lifehack.org/. It's a great site full of useful tips for minimizing the clutter so you can focus on getting things done quickly and correctly.

Thursday, May 17, 2007

LRECL

Here is a SAS trick that is especially useful for Windows users. By default, Windows creates files with a logical record length of 256. This means if you are creating a flat file with records (lines) longer than 256, the lines are going to wrap. You can tell Windows exactly how long to make the record length on the filename statement in SAS. The option is lrecl= (logical record length) and it looks like this:

filename myFile "c:\some directory\some file.txt" LRECL= 400;

Then you can write lines to that file that are up to 400 characters long without fear of the line wrapping.

Wednesday, March 28, 2007

SAS Programming Google Search

I have added a custom SAS programming search button to the top of this blog. It is done through Google Co-op and should offer better SAS programming search results than just searching the web.

You can use it directly from this blog, or you can add it to your Google homepage. To add it to your Google homepage click on the button "Add To Google".

I have not added/filtered many sites in it yet, but already I can see the results are more specific for SAS programming than just searching the web.

If you see sites that don't belong in the search result or if you know of a site that should have appeared in a search result but for some reason didn't, please comment here. The more I am able to refine the results, the better the SAS programming search will be. Soon, I would also like to start taging the sites in the search results. If you have suggestions for tags, that would be useful too.
Happy Searching!

Thursday, March 22, 2007

Not Equal

A long long time ago (or what seems like a long time ago!), before I could could call myself a professional SAS programmer I made lots of little mistakes in my programs. Now that I have been programming for a long time and have lots of good habits, I generally tend to avoid the little mistakes. Now when I make a mistake it is generally one of the bigger varieties. :)

One of the little mistakes I remember making was using the wrong "not equal" operator. It was terribly embarrassing for me at the time, and for some reason it stuck in my memory more than the other myriad mistakes I made.

When I first got hired as a SAS programmer, I did not have a whole lot of SAS experience. I had coded quite a bit growing up, but had only used SAS in a limited function at Texas A&M Univ on Windows. In my interview I explained my SAS skills honestly, and lo-and-behold they hired me! I was hired as an "intern" and had a few months to prove myself. I was told they needed people with PC experience because most of the programmers came from a mainframe MVS TSO background (which meant nothing to me at the time) and there were going to be more PC SAS contracts coming. Well, the PC SAS contracts never appeared and I suddenly found myself knee-deep in MVS TSO and mainframe SAS. JCL, ISPF, pf8 forward, pf7 back, pf3 end-- all new to me. It was all terribly daunting and every day I came to work, I thought someone was going to ask me to leave. So I did my best to keep my head above water and learn everything as quickly as I could. I thought I was doing a pretty good job masquerading as a true-blue mainframe developer until I wrote one of my first full programs and had another programmer look at it (the first week or two was spent making small changes to other people's programs and going through logs, etc). The reviewer wanted to know what this line meant:

if x <> y then delete;

I answered "if x not equal y then delete the row." The mainframer shrugged and gave me the benefit of the doubt that I knew what I was talking about. That was until someone else (one of the programmers employed by our client-- GULP!) looked at it and pointed out in a friendly email to everyone that <> is "not equal" in BASIC, but means something entirely different in SAS. I could feel everyone looking at me differently and hear their whispers.

"Basic? Basic? Is this kid a joke?" I had been exposed as a commodore 64 hack!

Well, luckily I wasn't fired and ended up learning a tremendous amount from those mainframe SAS programmers at my first real consulting job. I truly owe them my career.

So, what does <> mean in SAS? It is the MAX operator and returns the maximum of the two values on either side of it. Conversely >< is the MIN operator and returns the smaller of the two values.

Oh wait! We're talking about SAS here, right? Then I should say <> is _usually_ the max operator, but in one situation, it can stand in as the "not equals" operator I was intending it to be.

Proc SQL of course!

Friday, March 09, 2007

The =: Operator

Most people who are familiar with programming SAS are familiar with the equal colon operator ( =: ). There are a couple different colon operators in SAS, but in this post I am only talking about the comparison operator. The equal colon operator works much like the substr() function. It is used to compare substrings for equality.

Here is a quick little example:


data _null_;
  x = 'abcdefg';
  if x =: 'abc'
   then put 'The substrings match.';
run;

As you can see if you run the little data step above the substring 'abc' matches in 'abcdefg'. A better way to think of it is that it "starts with" the substring. The same could also be accomplished by this statement:


  if substr(x,1,3) = 'abc'
   then put 'The substrings match.';

There is one big difference between the =: operator and using substr(). With the substr() function you tell it exactly how many characters to look for the substring. In the example above it was three. For the =: operator it has to figure out how many characters to search. It does this by (somewhat counter-intuitively) looking at _both_ sides of the operator to find the shortest length. Here is an example:


data _null_;
  x = 'abcdefg';
  if x =: 'abc'
   then put 'The substrings match.';
run;

If you run the above data step you will see that they match. It looks a little funny because most of us assume that SAS is looking for 'abcdefg' within 'abc', but that's not really what's happening. SAS uses the shortest string to decide what to look for, no matter which side of the equals sign it is on.

Oh yeah, the =: operator also works in list context such as:


  if x in: ('abc', 'xyz', 'def');

That's it for today's post. Happy coding!

Tuesday, January 23, 2007

SAS Unix Process ID

Today a friend called and wanted to know if there was an easy way to use the Unix Process ID as part of the name of the log file when invoking SAS in batch mode. She wanted to make (semi-)unique log files. She is concerned about uniqueness enough to not want to immediatly overwrite another log file, but not so much that she's worried about possible collisions when the system recycles a process ID.

Well, it just so happens the script variable $$ contains the process ID of that script. So you can use that when constructing your log file name. Such as:

nohup /home/sas mySAScode.sas -log "/tmp/mySAScode_$$.log" &

In the above, nohup tells unix to keep the process alive even after we've closed down our terminal and logged off.
/home/sas is the sas executable (or the script that executes SAS).

An example of the log file created by this command would be /tmp/mySAScode_1298656.log

If you wanted to get the process ID that SAS was started with you can use the automatic macro variable &SYSJOBID within SAS.

If you wanted to learn some more about running SAS on Unix you could also bounce over to SASonUnix.blogspot.com. It has some very good tips.

Google SAS Search

Wednesday, November 14, 2007

Tuesday, October 16, 2007

Wednesday, September 19, 2007

Thursday, September 13, 2007

Tuesday, August 21, 2007

Tuesday, August 07, 2007

Wednesday, August 01, 2007

Wednesday, June 13, 2007

Tuesday, May 29, 2007

Wednesday, May 23, 2007

Thursday, May 17, 2007

Wednesday, March 28, 2007

Thursday, March 22, 2007

Friday, March 09, 2007

Tuesday, January 23, 2007

Popular Posts

Total Pageviews

Subscribe Now: Feed Icon

Translate

About Me

Links

Topics

Followers

Blog Archive