Google SAS Search

Add to Google

Thursday, April 17, 2008

SAS SQL Join

One of the most frequent uses of SQL is to join tables. And since SAS data sets are tables, there is good reason to learn SQL. But a lot of SAS programmers will shy away from learning SQL because they are already familiar with merging in the data step. Here are three additional reasons to use a SQL join instead of a data step merge. But first two little sample data sets to use for the examples:


data sales;
length var $5;
do id_num = 1 to 10;
var = 'left';
output;
end;
run;

data contracts;
length var $5;
input id_no var $;
cards;
3 right
2 right
5 right
6 right
1 right
;


Okay, now here's the three reasons!

1) No need to sort the data beforehand.
This one is pretty self explanatory. If the tables are not sorted by the variables you are joining on, SQL will take care of it.

2) You can join on different variables names.
In a SAS data step merge, you have to merge by a variable or variables that are identical in each data set. With SQL, you can join on variables with different names as long as the values match up. So instead of something like this:

proc sort data = contracts;
by id_no;
run;
data together;
merge sales( in=a )
contracts( in=b rename=(id_no = id_num) );
by id_num;
if a;
run;


You can use sql:

proc sql;
create table together
as select * from sales as a
left join contracts as b
on a.id_num = b.id_no;
quit;


3) You are warned if there are overlapping variables.
In a traditional data step merge, you have to be very mindful of overlapping variables-- or variables that are shared by the data sets but are not part of the by statetement. If there are overlapping variables, the last data set named on the merge statement to contribute to the observation gets to deliver the resulting value. But there is no warning in the log letting you know values have been overwritten. SQL will let you know that the variable already exists, but it uses the first value, not the last.

If you are already pretty familiar with merging in the data step then you may find some of the SQL syntax a little strange. Most merges are of the "if a;" and "if a and b;" variety. Those are the best starting points for getting used to the equivalent SQL syntax.
The left join we used in the above example is equivalent to "if a;" data step syntax. This is an "outer" join since we are asking for values that don't belong to both sets. For the more restrictive "if a and b;" merge, use an inner join in SQL:

proc sql;
create table together
as select * from sales as a
inner join contracts as b
on a.id_num = b.id_no;
quit;


It is called an inner join because you only want values that belong to both sets. For the data step syntax "if not (a and b);" you use a full outer join in SQL. It is called a full outer join because you want everything that is not contained in both sets.

proc sql;
create table together
as select * from sales as a
full join contracts as b
on a.id_num = b.id_no;
quit;



Hopefully that is enough to get you started if you are interested in SQL. If I get some time I will put together and post a little cheat sheet of Venn diagrams to illustrate the inner/outer join concepts.

Update: I just got some Venn diagrams up in a new post. Hope you find it useful!
www.sascoders.com/2010/06/how-to-get-what-you-want-out-of-data.html

6 comments:

  1. Very clear post! Although there are lots of reasons to use (and not use) SQL, your post was simple and with good few points. Congrats for the explanation too.

    ReplyDelete
  2. For me the main reason to use SQL for matching is the ability to use more complex comparissons to match. For example records should match if a date is between the end and startdate in an other file.

    ReplyDelete
  3. Hi!
    I was hoping you could help me with a following problem:

    There are 5 tables:
    1. Year05d: 5 variables + 2 keys: zip naics
    2. Year06d: 5 variables + 2 keys: zip naics
    3. Year05t: 3 variables + key: zip
    4. Year06t 3 variables + key: zip
    5. look-up table: 2 variables: naics sic

    I need one table that is unique at zip+naics level and contains all observations and all the variables from all 5 tables.
    With data step the logic was as following:
    1. merge Year05d Year06d by zip naics
    2. merge Year05t Year06t by zip
    3. merge step1 and step2
    4. merge step3 and look-up table by naics

    how do i do that with sql in one step?

    what is the best reference for learning proc sql?

    Thanks a lot!

    ReplyDelete
  4. Simple, but extremely helpful.

    ReplyDelete