What is the most efficient way to drop/keep columns (variables) in sas tables (data sets)?
For the most part, we would correctly say "using a keep= option on the data set as it is being read into the current step." A quick example to illustrate:
data someData;In fact, I even wrote a whole paper on this for SUGI a few years back.
set myData(keep= x y z);
proc sort data = myData(keep= x y z) out= someData;
by x y;
Programming with the KEEP, RENAME and DROP Data Set Options
However, what if you didn't have access to modify the code? You need to create a seperate step just to keep/drop variables. What's the most efficient way to do it? You could create another data step and use the keep= option on the set statement:
data myData;That approach will work, but it's not very efficient. In fact, it's pretty horribly inefficient. The data set is read one observation at a time, which is usually IO intense. And IO is a big efficiency suck. Plus the single data step actually creates a copy of myData as it's being processed. After the step is finished, the temporary copy replaces the original data set.
* just dropping some variables....;
set myData(keep=x y z);
As a general rule, the most efficient way to move SAS data sets around is to copy them. Copying is usually more efficient than reading them one observation at a time because the copy can use a better copy buffer resulting in less IO. This hints that we should avoid the data step and use a procedure that can copy the data.
Here is what I came up with:
* a little test data set;As you can see, we are using proc datasets instead of a data step. First we are renaming the data set to something temporary (the CHANGE statement). This is a super cheap operation since only the data set header (metadata) is being modified. Then we append (copy) the temporary data set onto the non-existent original data set. Along the way we tell SAS exactly which variables to keep/drop. We have to use the FORCE option because the base data set no longer exists after the rename. And then finally we delete the temporary data set. Whew!
do i = 1 to 5;
a = 'blah blah blah';
b = 'foo';
c = 'bar';
x = 'some data ';
y = 'lovely data';
z = 4;
change myData= tempData;
data = tempData(keep = x y z)
I haven't run any benchmarks to see how much time this method would save because I, uh, don't have time. But I'm pretty sure it would be much faster than an extra data step.
If you know of another way to accomplish this that is more efficient than a data step, please share!