Often we need to create some simple statistics for a set of data and then associate those stats with each observation of the original set. As a simple example

consider a table with only three rows:

N

3

6

4

We want to get the mean of the variable N and stripe it down all the observations:

N Mean

3 4.333

6 4.333

4 4.333

The first way I learned to do this was with a proc summary and a merge. A better way to do it is with proc sql.

Here is a little test data:

data myData;

input x level $;

cards;

11 a

31 a

51 a

2 b

61 a

8 b

21 a

71 a

91 a

4 b

61 a

21 a

5 b

7 b

5 b

31 a

1 b

61 a

8 a

9 b

3 a

2 b

5 a

7 b

7 a

3 b

;

* in that data set we have two variables X and LEVEL. We can get the stats on X for each level by summarizing and merging...;

proc sort data = myData;

by level;

run;

proc summary data = myData;

by level;

var x;

output out= tempStats(drop=_type_ _freq_) mean=mean max=max min=min;

run;

data sumStats;

merge myData tempStats;

by level;

run;

* or better yet, we can collapse the whole thing into one nifty proc sql step!;

proc sql;

create table stats as select *,

min(x) as min,

max(x) as max,

mean(x) as mean

from myData group by level;

quit;

## Tuesday, October 16, 2007

### Saving Steps With SQL

Subscribe to:
Post Comments (Atom)

SQL is by far my favourite way of calculating summary stats like that. The major downside (at the moment, at least) is that SQL doesn't have a "median()" function. Or maybe it does in version 9?

ReplyDeleteHi Chris,

ReplyDeleteIndeed, there is no "real" median() function in SQL-- even in version 9. There is a median function but it works across columns on the same row (ie, median(col1,col2,col3,col4) ) NOT median(row1...rowN). Confusing to be sure!