Google SAS Search

Add to Google

Tuesday, March 29, 2011

Data Step Hooks

Here is something to keep in mind when using the END= option on the set statement: There is no guarantee you will hit the end of file.

Simple example to illustrate:


data test;
do i = 1 to 10;
output;
end;
run;

data _null_;
set test(where=(i > 10)) end= eof;
if eof
then put "It set EOF for end of file";
run;

In the SAS documentation this is stated cryptically:
Restriction: END= cannot be used with POINT=. When random access is used, the END= variable is never set to 1.

If only it were true that END= can not be used. It can be used. It just might not work as you assumed it would. Consider the above data step where i > 5. In that case it does set eof to 1 as expected. Try where i < 5. Strangely, it does set eof to 1 even though we have never reached the end of file. SAS just "knows" that is has reached the logical end of the file. Which makes you think that reading 0 obs from the file should set eof to 1. But as we saw, it doesn't.

What actually happens is the data step finishes executing as soon as the set statement fails to read another record. So even if it did set eof to 1, it would never reach the if statement to execute it.

Unfortuately, there is no good way (that I know of) to run a bit of code at the end of file, even if 0 obs are read. You could toss the where clause and use a subsetting if statement. But then you are doing a lot of useless data step IO.

What would be be sweet is if SAS provided hooks into the data step. Two useful ones would be post-compile/pre-execute and post-execute. Maybe use special named labels?

Something like:

data _null_;
set test(where=(i > 10)) end= eof;

PRE_EXEC:
* In a super awesome world, the code in this label
would ALWAYS execute no matter if the set
statement reads anything or not;
* This would eliminate a lot of the IF _N_ = 1 silliness;
return;

POST_EXEC:
* In a doubly super awesome world, the code in this label
would execute at the end of the data step's life;
* No matter how many observations were read.;
return;

run;

2 comments:

  1. Stephen, I believe there is a way to get closer to your "super awesome world."

    Several options on the SET statement set variable values. Some of these options initialize the variable to a value after the DATA step program is compiled and before it executes.

    The most popular option that does this may be NOBS=. Before the step executes, the NOBS= variable is set to the number of observations in the data set. The END= option is also an option that initializes its variable to a value.

    After the DATA step program is compiled and before it executes, the DATA step checks to see if there are any observations to be read by the SET statement. If there aren't any observations to be read, the END= variable is set to 1 before the DATA step executes.

    This behavior enables a programmer to execute code even when no observations are read. Here is a program that gets us closer to a "super awesome world":

    data test;
    do i = 1 to 10;
    output;
    end;
    run;

    data a;
    /* Just a little "if _N_ = 1 silliness" */
    if _N_ = 1 then do;
    link pre_exec;
    if eof then link post_exec;
    end;

    /* This SET statement reads no observations */
    set test(where=(i > 10)) end=eof;

    if eof then link post_exec;

    return;

    pre_exec:
    put "At start of step";
    return;

    post_exec:
    put "It set EOF for end of file";
    return;
    run;

    ReplyDelete
  2. @secoskyj, Yes! Your code is as close to super awesome as it can get. Thanks for sharing it. I totally overlooked checking the eof and linking to the post_exec label before the set stmt fails. Nice!

    However, I still think explicit hooks would be super awesome.

    ReplyDelete