Data Steps: 2010

Monday, November 15, 2010

!= does not == ne

In general, the more programming languages you work with, the better you are going to become as a programmer. I try to work with a new language about every year, so I can stretch my little brain in lots of different directions.

However, the downside is sometimes you lose track of the syntax that used to be second nature to you. Or maybe it's just old age? Wait, what was I talking about?

Oh yeah, I was working away on some SAS macro code and it just wasn't working. It was simple code. Easy code. Code you can read and write in your sleep. And I still couldn't get it to work. After circling around the problem for waaaay longer than I should have, it finally stuck out at me like a sore thumb. Oh silliness, there is no != in SAS (unlike say, EVERY other language).

Unfortunately the offending != syntax fails silently in SAS macro:


%macro whatIsGoingOn(name);
  %put ** reality check name is &name **;
  %if &name != Stephen
   %then %put You are not Stephen;
  %else %put Hello Stephen;
%mend;

%whatIsGoingOn(Chuck);
** reality check name is Chuck **
Hello Stephen

Thursday, September 16, 2010

Leading Zeroes

Here's a situation that comes up pretty often. You receive a file that contains zip codes. It's an excel file and you need to create a SAS data set out of it, so you can do some nifty market analysis. No problem, you clickety clickety through the SAS import wizard and voila! a data set is created. However, the zip_code variable is numeric and doesn't have leading zeroes.

Even though zip codes are made up of numbers, we generally want to treat them as character data. Luckily for you it is easy to turn those numeric variables character and restore the leading zeroes.

In general you:
1) Rename your original numeric zip code variable.
2) Create a new character zip code variable.
3) Apply the z. format to the numeric value to make it character and
restore the missing leading zeroes.

Here is what it looks like in practice:


 data myData(drop= bad_zip);
   length zip_code $5;
   set someData( rename= ( zip_code = bad_zip ));
   zip_code = put( bad_zip, $z5. );
 run;

Tuesday, August 31, 2010

It's a Math, Math World

Here is a blog that I found. It belongs to Michael O'Brien. I have not had a chance to really read through any of his posts, but it is on my to-do list as soon as I have a little tiny bit of time. Based on my quick cursory view of it, he seems to be writing quite a bit about statistics. And not the crazy insane look how whippety smart I am statistics writing that makes my eyes hurt and my brain feel small. I'm not a stat wizard! I'm just a programmer!

I just glanced through some of his posts, and while there was math, it didn't seem impenetrable. In fact, a quick scan of this post actually made my brain feel a little bigger.

Kudos to Michael of It's a Math, Math World

Monday, July 26, 2010

Thanks LabSug!

Thank you Los Angeles Basin SAS User Group for listening to my talk on SAS MACRO: Beyond The Basics.

The room provided by RAND was excellent. Everything was well organized and the day went without a hitch. The audience was very engaged and had great questions (definitely not the sleepy group I was expecting right after lunch!).

Overall I had a great time, and look forward to getting the chance to speak again!

Monday, July 19, 2010

I have a data set of sales data by day. Unfortunately the names of the columns represent the dates. In order to work with the data, I need to transform the data set so each day represents an observation.

The data set looks something like this:


store _010_05_01 _010_05_02 _010_05_03 ....
1     8          5          6
2     6          9          3
3     7          9          8
....

As you can see, I have a store variable and several variables that are named by the date and contain the number of sales for that day. I need it to look like this:


store date   sales
1     5/1/10 8
1     5/2/10 5
1     5/3/10 6
2     5/1/10 6
2     5/2/10 9
2     5/3/10 3
....

Unfortunately this is a pretty common problem when you receive data from vendors who aren't sure how you are going to work with it. Luckily the transformation is pretty easy:



1) data byDay;
2)   set myData;
3)   array t[*] $ _010: ;
4)   do i = 1 to dim(t);
5)     sales = t[i];
6)     date  = vName( t[i] );
7)     output;
8)   end;
9) run;

This little data step creates an array to hold our date variables. (3)We use a little syntax sugar to keep from having to type out all the variable names _010:. The colon tells the SAS compiler to list out all the variables that have the _010 prefix. (4)Then we loop through all the elements in our t[] array. We use the vName() function to get the name of the variable that t[] is referring to (6). And finally we output once for each iteration of our loop (7).

Obviously if your date columns aren't named uniformly (_010:) then things won't work as nicely. And you would likely want to add a KEEP statement to just keep the variables you are creating along with any others you might be interested in.

If you made it this far, you may have noticed I am not quite finished with my transformation. Line (6) assigns the value of date to something like '_010_05_01'. Not quite what we wanted. We need to turn that into a SAS date. I need to substr() the first two characters off to remove them (_0) and then input() the result using the appropriate informat. So line (6) should really be:



6) date = input( substr( vName( t[i] ), 3), yymmdd8. );

A little more complicated, but hopefully still understandable.

Hopefully you find this useful! Comment with any comments/observations.

Thursday, June 03, 2010

How To Get What You Want Out of a Data Step Merge

This Fall my daughter will be going to kindergarden. So like all other hyper-attentive parents we have started introducing her to the concept of homework. The other night I got out her crayons and introduced her to set theory. After about an hour I finally got her to draw her Venn diagrams with the the corresponding SAS data step code for a merge. I was so proud I decided to post it here.

In her drawing data set A is red and B is blue. The shaded area is what's kept.
Disregard the part where she drew herself building a sand castle on the beach. It has nothing to do with Venn diagrams, or SAS data step merging code.

What? You don't believe my five year old daughter drew it? :)

Just a quick refresher: The A and B business refers to the automatic variables that are created when you use the IN= data set option. Essentially the variable A will be "true" whenever that data set contributes an observation to the merge (or join). Also, I don't know why everyone uses A and B-- you can set these variables to anything (left|right, paid|due, etc); but I've always seen A and B so I will stick with convention.


data merged;
  merge someData(in=A) otherData(in=B);
  by someKey;
  if a and (not b); * just keep the observations in A that do not match anything to B;
run;

Wednesday, May 19, 2010

Sorting Pitfalls

SAS Proc sort is probably one of the easiest procs to use. And sorting in other programs/software is generally pretty straightforward. However, there are a few things I always remind myself when sorting.

When you use proc sort you don't know who is doing the sorting. Does this surprise you? Don't worry-- I don't know who's doing the sorting either!

It could be a low level routine called by SAS (a .dll in Windows). On the mainframe it could be an entirely different proc (syncsort). It could be a specialized routine written to take advantage of specific hardware efficiencies. It could be a different data base (SQL pass thru). You get the idea.

This tells me not to assume anything about the sort.

Like what kind of assumptions? Generally anything that's not specified in the sort is open to interpretation by the procedure/dll/db/whatever.

A specific example:
Let's say you have a data set with 3 variables (ID, NAME, Q_DATE) and 5
observations:

1 stephen 01apr2010
2 brian   01apr2010
1 stephen 05apr2010
1 stephen 20apr2010
2 brian   25apr2010

First of all notice this data set is already sorted by Q_DATE.

I am going to sort by ID.
Now my data set looks like this:

1 stephen 01apr2010
1 stephen 05apr2010
1 stephen 20apr2010
2 brian   01apr2010
2 brian   25apr2010

Later on in my code I want to remove the duplicates and get the first occurrence of Q_DATE.

data myData;
 set myData;
 by id;
 if first.id;
run;

What if the sort doesn't maintain the order of observations within the by group? In my example, the Q_DATE is still in sorted order, but that is not guaranteed to always be the case. As you can see, this code has the potential for getting me into some very difficult debugging that may not even show up until years later.

This also applies to the NODUPKEY option. It is not guaranteed that nodupkey will keep the first observation from a group of duplicates.

Unfortunately, sorting of any kind is one of the most expensive low-level routines in computer science. So when you review code looking for efficiency gains, there is a strong temptation to remove as many sorts as possible. In the scenario above, the code is pretty effecient because we are making use of the fact that the data set is already sorted by Q_DATE. However, we need to tell the sort routine to maintain the integrity of the observations within the by group.
To do this we can use the EQUALS option:

proc sort data = myData equals;
 by id;
run;

Personally, I would only recommend such chicanery for absolutely, positively, critically important efficient code. I guarantee you'll sleep better at night if you make the code more bulletproof (who's to say your data set will always come in sorted by Q_DATE?) and readable by including it in your by group. You could even leave a handy comment for the next programmer that comes along and feels the need to over-optimize the code:

proc sort data = myData;
 by id q_date;  ** yes yes i know the ds is supposed to already be sorted
by q_date, but there is no guarantee the sort will maintain the order;
run;

Happy coding!

Tuesday, May 18, 2010

Cloud App

Oh man it has been a long time since I have updated this blog. So what's new?

I have a new son!
The band I am playing in recently played a gig at the Whisky A Go Go in Hollywood. We're playing at The Cat Club in Hollywood this Thur come one come all, promotion, promotion, promotion and all that :)

So between playing bass and changing countless diapers I have found my free time has been severely limited. Right now I am into anything that lets me get things done simply and quickly.

I recently found a new application for storing and sharing files online called CloudApp. I'll let their site explain themselves, but basically it is a simple web front end that lets you quickly upload files to Amazon's S3 cloud storage. And it's free!

Now I can record my band's live shows, practices, etc and share them with all the band members online in seconds. It even automatically creates a shortened URL. And if you use Mac, there is a desktop app that makes uploading and sharing files crazy simple.

I quickly skimmed their TOS and I didn't see anything in there about encryption, privacy, etc so I wouldn't use it to pass data sets that contain social security numbers or any other sensitive information. But it is still dead handy for all the other large data sets you've been working on...

Monday, January 18, 2010

Finding the Max Value In An Array

The max() function makes it easy to find the maximum value in a SAS array.

Given an array like:


array x[*] x1-x10;

maxValue = max(of x[*]);

Pretty slick, eh? Remember, it doesn't return the position of the max element, just the max value.

This can be pretty useful if you want to find out if at least one element of an array has a value. Like if you have an array of answers and you want to find out if there is at least one answer in the array.


hasAnAnswer = max(of answers[*]);

or in logic:


if max(of answers[*])
 then do;
        * yes, they have answered at least one question;
      end;

*NOTE: I have decided to keep this SAS programming blog SAS focused and move my Ruby on Rails writing to a Rails oriented blog. I am a guest contributor at www.HABTMProjects.com. If you were interested in following along with my RoR projects then please hop on over there and sign-up.*

Data Steps

Google SAS Search

Monday, November 15, 2010

!= does not == ne

Thursday, September 16, 2010

Leading Zeroes

Tuesday, August 31, 2010

It's a Math, Math World

Monday, July 26, 2010

Thanks LabSug!

Monday, July 19, 2010

Thursday, June 03, 2010

How To Get What You Want Out of a Data Step Merge

Wednesday, May 19, 2010

Sorting Pitfalls

Tuesday, May 18, 2010

Cloud App

Monday, January 18, 2010

Finding the Max Value In An Array

Popular Posts

Total Pageviews

Subscribe Now: Feed Icon

Translate

About Me

Links

Topics

Followers

Blog Archive