Google SAS Search

Add to Google

Wednesday, October 12, 2011

Efficiently Drop/Keep SAS Data Set Columns

What is the most efficient way to drop/keep columns (variables) in sas tables (data sets)?

For the most part, we would correctly say "using a keep= option on the data set as it is being read into the current step." A quick example to illustrate:

  data someData;
set myData(keep= x y z);
run;

proc sort data = myData(keep= x y z) out= someData;
by x y;
run;
In fact, I even wrote a whole paper on this for SUGI a few years back.
Programming with the KEEP, RENAME and DROP Data Set Options

However, what if you didn't have access to modify the code? You need to create a seperate step just to keep/drop variables. What's the most efficient way to do it? You could create another data step and use the keep= option on the set statement:
  data myData;
* just dropping some variables....;
set myData(keep=x y z);
run;
That approach will work, but it's not very efficient. In fact, it's pretty horribly inefficient. The data set is read one observation at a time, which is usually IO intense. And IO is a big efficiency suck. Plus the single data step actually creates a copy of myData as it's being processed. After the step is finished, the temporary copy replaces the original data set.

As a general rule, the most efficient way to move SAS data sets around is to copy them. Copying is usually more efficient than reading them one observation at a time because the copy can use a better copy buffer resulting in less IO. This hints that we should avoid the data step and use a procedure that can copy the data.

Here is what I came up with:
* a little test data set;
data myData;
do i = 1 to 5;
a = 'blah blah blah';
b = 'foo';
c = 'bar';
x = 'some data ';
y = 'lovely data';
z = 4;
output;
end;
run;

proc datasets;
change myData= tempData;
run;
append base=work.myData
data = tempData(keep = x y z)
force;
run;
delete tempData;
quit;
As you can see, we are using proc datasets instead of a data step. First we are renaming the data set to something temporary (the CHANGE statement). This is a super cheap operation since only the data set header (metadata) is being modified. Then we append (copy) the temporary data set onto the non-existent original data set. Along the way we tell SAS exactly which variables to keep/drop. We have to use the FORCE option because the base data set no longer exists after the rename. And then finally we delete the temporary data set. Whew!

I haven't run any benchmarks to see how much time this method would save because I, uh, don't have time. But I'm pretty sure it would be much faster than an extra data step.

If you know of another way to accomplish this that is more efficient than a data step, please share!

3 comments:

  1. Why not just remove the unwanted columns?

    proc sql noprint;
    ALTER TABLE myData DROP COLUMN a ,b ,c;
    quit;

    Jan Hansen

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete
    2. Thanks man. I'v always wondered why every single paper/article/faq on drop/keep didn't even questioned reading/copying the data when the task is only to drop variables.

      Delete