Google SAS Search

Add to Google

Wednesday, May 19, 2010

Sorting Pitfalls

SAS Proc sort is probably one of the easiest procs to use. And sorting in other programs/software is generally pretty straightforward. However, there are a few things I always remind myself when sorting.

When you use proc sort you don't know who is doing the sorting. Does this surprise you? Don't worry-- I don't know who's doing the sorting either!

It could be a low level routine called by SAS (a .dll in Windows). On the mainframe it could be an entirely different proc (syncsort). It could be a specialized routine written to take advantage of specific hardware efficiencies. It could be a different data base (SQL pass thru). You get the idea.

This tells me not to assume anything about the sort.

Like what kind of assumptions? Generally anything that's not specified in the sort is open to interpretation by the procedure/dll/db/whatever.

A specific example:
Let's say you have a data set with 3 variables (ID, NAME, Q_DATE) and 5
observations:

1 stephen 01apr2010
2 brian 01apr2010
1 stephen 05apr2010
1 stephen 20apr2010
2 brian 25apr2010


First of all notice this data set is already sorted by Q_DATE.

I am going to sort by ID.
Now my data set looks like this:
1 stephen 01apr2010
1 stephen 05apr2010
1 stephen 20apr2010
2 brian 01apr2010
2 brian 25apr2010


Later on in my code I want to remove the duplicates and get the first occurrence of Q_DATE.
data myData;
set myData;
by id;
if first.id;
run;


What if the sort doesn't maintain the order of observations within the by group? In my example, the Q_DATE is still in sorted order, but that is not guaranteed to always be the case. As you can see, this code has the potential for getting me into some very difficult debugging that may not even show up until years later.

This also applies to the NODUPKEY option. It is not guaranteed that nodupkey will keep the first observation from a group of duplicates.

Unfortunately, sorting of any kind is one of the most expensive low-level routines in computer science. So when you review code looking for efficiency gains, there is a strong temptation to remove as many sorts as possible. In the scenario above, the code is pretty effecient because we are making use of the fact that the data set is already sorted by Q_DATE. However, we need to tell the sort routine to maintain the integrity of the observations within the by group.
To do this we can use the EQUALS option:
proc sort data = myData equals;
by id;
run;


Personally, I would only recommend such chicanery for absolutely, positively, critically important efficient code. I guarantee you'll sleep better at night if you make the code more bulletproof (who's to say your data set will always come in sorted by Q_DATE?) and readable by including it in your by group. You could even leave a handy comment for the next programmer that comes along and feels the need to over-optimize the code:
proc sort data = myData;
by id q_date; ** yes yes i know the ds is supposed to already be sorted
by q_date, but there is no guarantee the sort will maintain the order;
run;


Happy coding!

Tuesday, May 18, 2010

Cloud App

Oh man it has been a long time since I have updated this blog. So what's new?

I have a new son!
The band I am playing in recently played a gig at the Whisky A Go Go in Hollywood. We're playing at The Cat Club in Hollywood this Thur come one come all, promotion, promotion, promotion and all that :)

So between playing bass and changing countless diapers I have found my free time has been severely limited. Right now I am into anything that lets me get things done simply and quickly.

I recently found a new application for storing and sharing files online called CloudApp. I'll let their site explain themselves, but basically it is a simple web front end that lets you quickly upload files to Amazon's S3 cloud storage. And it's free!

Now I can record my band's live shows, practices, etc and share them with all the band members online in seconds. It even automatically creates a shortened URL. And if you use Mac, there is a desktop app that makes uploading and sharing files crazy simple.

I quickly skimmed their TOS and I didn't see anything in there about encryption, privacy, etc so I wouldn't use it to pass data sets that contain social security numbers or any other sensitive information. But it is still dead handy for all the other large data sets you've been working on...