Google SAS Search

Add to Google

Wednesday, May 19, 2010

Sorting Pitfalls

SAS Proc sort is probably one of the easiest procs to use. And sorting in other programs/software is generally pretty straightforward. However, there are a few things I always remind myself when sorting.

When you use proc sort you don't know who is doing the sorting. Does this surprise you? Don't worry-- I don't know who's doing the sorting either!

It could be a low level routine called by SAS (a .dll in Windows). On the mainframe it could be an entirely different proc (syncsort). It could be a specialized routine written to take advantage of specific hardware efficiencies. It could be a different data base (SQL pass thru). You get the idea.

This tells me not to assume anything about the sort.

Like what kind of assumptions? Generally anything that's not specified in the sort is open to interpretation by the procedure/dll/db/whatever.

A specific example:
Let's say you have a data set with 3 variables (ID, NAME, Q_DATE) and 5
observations:

1 stephen 01apr2010
2 brian 01apr2010
1 stephen 05apr2010
1 stephen 20apr2010
2 brian 25apr2010


First of all notice this data set is already sorted by Q_DATE.

I am going to sort by ID.
Now my data set looks like this:
1 stephen 01apr2010
1 stephen 05apr2010
1 stephen 20apr2010
2 brian 01apr2010
2 brian 25apr2010


Later on in my code I want to remove the duplicates and get the first occurrence of Q_DATE.
data myData;
set myData;
by id;
if first.id;
run;


What if the sort doesn't maintain the order of observations within the by group? In my example, the Q_DATE is still in sorted order, but that is not guaranteed to always be the case. As you can see, this code has the potential for getting me into some very difficult debugging that may not even show up until years later.

This also applies to the NODUPKEY option. It is not guaranteed that nodupkey will keep the first observation from a group of duplicates.

Unfortunately, sorting of any kind is one of the most expensive low-level routines in computer science. So when you review code looking for efficiency gains, there is a strong temptation to remove as many sorts as possible. In the scenario above, the code is pretty effecient because we are making use of the fact that the data set is already sorted by Q_DATE. However, we need to tell the sort routine to maintain the integrity of the observations within the by group.
To do this we can use the EQUALS option:
proc sort data = myData equals;
by id;
run;


Personally, I would only recommend such chicanery for absolutely, positively, critically important efficient code. I guarantee you'll sleep better at night if you make the code more bulletproof (who's to say your data set will always come in sorted by Q_DATE?) and readable by including it in your by group. You could even leave a handy comment for the next programmer that comes along and feels the need to over-optimize the code:
proc sort data = myData;
by id q_date; ** yes yes i know the ds is supposed to already be sorted
by q_date, but there is no guarantee the sort will maintain the order;
run;


Happy coding!

No comments:

Post a Comment