Google SAS Search

Add to Google

Tuesday, August 09, 2011

First One In Gets the Win

Yikes, it's been a while since the last update! So I will try to keep this one short and useful. Most everybody knows there are essentially two ways for tables to be merged in SAS: using the merge statement in the data step and using a join in SQL. Programmers tend to prefer one way over the other, and generally they are interchangeable. However, there are some minor differences that you should keep in mind. One such difference is in how overlapping variables are handled.

Here is a very basic one-to-one merge and its SQL equivalent:


data left;
do i = 1 to 10;
output;
end;
run;

data right;
do i = 1 to 10;
output;
end;
run;

data merged;
merge left(in=a) right(in=b);
by i;
run;

proc sql noprint;
create table joined as
select a.*,b.*
from left a inner join right b
on a.i = b.i;
quit;

Now lets add another variable that is the same on both data sets:

data left;
length overlap $8;
do i = 1 to 10;
overlap = 'left';
output;
end;
run;

data right;
length overlap $8;
do i = 1 to 10;
overlap = 'right';
output;
end;
run;

Now when the two data sets are merged, what value will be in the rows for the overlap variable?

It depends on the order you specify the data sets on the merge statement.

The value comes from the last data set to contribute a record to the merge.

data merged;
merge left(in=a) right(in=b);
by i;
run;

The resulting value for overlap will be 'right' because it is the last one named on the merge statement and each row in left has a match in right.

Would you expect it to work the same way in proc SQL? Of course not! You are a SAS programmer. These types of inconsistencies keep you employed.

proc sql noprint;
create table joined as
select a.*,b.*
from left a inner join right b
on a.i = b.i;
quit;

The value for overlap is 'left' in the joined data set. Opposite of how the data step merge works. I like to remember the SQL rule as: First one in gets the win.