Data Steps: Some Fun with SAS and Perl Regular Expressions

This post assumes you have a basic understanding of how regular expressions work and, specifically, how SAS implements them. I recently did something like this and thought it would be good to share. Suppose you have a program that searches a big text field for a specific word. That's pretty easy to code, and you can even get away with a simple indexw() function. The problem comes when you look at the text field on your report: your eyes glaze over as you scan for the word to make sure you are capturing the correct output. If only there were some easy way to make the word stand out from its neighbors.

I used the prxchange() function to search for a pattern and then replace it with another pattern. In this case, I am outputting HTML, so I can wrap my search word in <b> tags. First I will give a little example code, then I will break down what the code is doing, and finally I will show some easy improvements. For the sake of clarity and brevity, I am only showing the code that highlights the search word. I am not showing the code that subsets the data based on the search term.
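For completeness, here is a minimal sketch of what the subsetting step might look like. It is not the code from my program: the data set name hits and the use of prxmatch() with \b word boundaries are assumptions made for illustration.

```sas
/* Sketch only: keep the rows that mention the search word.      */
/* prxmatch() returns the position of the first match, or 0      */
/* when the pattern is not found.                                */
data hits;                                  /* hypothetical data set name */
  input text $80.;
  if prxmatch('/\bbattery\b/', text) > 0;   /* subsetting IF */
datalines;
This battery is dead.
The box is empty.
;
```

Only the first row should survive the subsetting IF, because the second line does not contain the word.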

Example 1:


data _null_;
  input text $80.;
  put "The text before matching " text=  ;
  text = prxchange('s/(battery)/<b>$1<\/b>/', -1, text);
  put "The text after matching " text= //;
datalines;
This battery is dead.
Batteries are in the box.
;

Output in the log:
The text before matching text=This battery is dead.
The text after matching text=This <b>battery</b> is dead.


The text before matching text=Batteries are in the box.
The text after matching text=Batteries are in the box.

Looking at the code above, you can see that the only interesting thing happening is the prxchange() function. The prxchange function takes a regular expression as its first argument. The regular expression uses a substitution syntax with a generic look of

s/(something to look for)/numbered capture buffers/.

So in my example above, the word (or pattern, really) I am looking for is battery. I put () around it to make it the first capture buffer: $1. Then I wrap $1 in bold tags. You can see I had to escape the / in the closing tag, because / is also the delimiter that separates the parts of the substitution expression. So my regular expression is:

s/(battery)/<b>$1<\/b>/

and reads as: look for the pattern 'battery', store it in $1, and substitute it with <b>$1</b>.

The second parameter to the prxchange() function is -1 and just tells the function to keep searching the source, finding and replacing every occurrence till you get to the end of source. The third parameter 'text' just tells the function what text source to search.
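To see what the second parameter does, here is a small sketch that runs the same substitution twice on one string, once with a count of 1 and once with -1. The variable names first_only and every_match are made up for this illustration.

```sas
data _null_;
  length text first_only every_match $120;
  text = 'battery one, battery two';

  /* times = 1: replace only the first occurrence */
  first_only  = prxchange('s/(battery)/<b>$1<\/b>/', 1, text);

  /* times = -1: keep replacing until the end of the source */
  every_match = prxchange('s/(battery)/<b>$1<\/b>/', -1, text);

  put first_only=;
  put every_match=;
run;
```

With a count of 1, only the first battery should come back wrapped in bold tags. With -1, both should.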

Make sense?

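Now there are a couple of things that can easily be added to the regular expression to make the code a little more robust and efficient. First, consider how often the regular expression is compiled. When the pattern is a constant string, as it is here, SAS compiles it only once. When the pattern comes from a variable or an expression, it is recompiled on every iteration of the data step unless you add the /o option to the end of the regular expression, which tells SAS to compile it just once. Be aware that with /o, later changes to the variable are ignored. The option is harmless on a constant pattern, so I add it anyway:

s/(battery)/<b>$1<\/b>/o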
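Another way to make the compile-once behavior explicit is to compile the pattern yourself with prxparse() on the first iteration and hand the resulting id to prxchange(). This is a sketch of that alternative, not what my program does, and the variable name rx_id is just an illustration.

```sas
data _null_;
  retain rx_id;                  /* keep the compiled id across iterations */
  if _n_ = 1 then
    rx_id = prxparse('s/(battery)/<b>$1<\/b>/');

  input text $80.;
  /* prxchange() accepts a compiled id in place of the pattern string */
  text = prxchange(rx_id, -1, text);
  put text=;
datalines;
This battery is dead.
Batteries are in the box.
;
```
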

Also, our regular expression is caSe SensiTive. We can tell it to ignore case by adding the ignore case option (/i) to the end of the regular expression:

s/(battery)/<b>$1<\/b>/oi

Now it will match battery, Battery, BATTERY, etc.

But wait! We also want to match Batteries. What to do? We could shorten our regular expression to:

s/(batter)/<b>$1<\/b>/oi

But that would match batter, and batter is a liquid mixture, usually based on one or more flours combined with liquids such as water, milk or beer. That's definitely not what we are looking for. We want to search for batter followed by one or more [a-z] characters:

s/(batter[a-z]+)/<b>$1<\/b>/oi

Now our example code looks like:


data _null_;
  input text $80.;
  put "The text before matching " text=  ;
  text = prxchange('s/(batter[a-z]+)/<b>$1<\/b>/oi', -1, text);
  put "The text after matching " text= //;
datalines;
This battery is dead.
Batteries are in the box.
Do not eat the cookie batter before it is cooked.
;

Output in the log:
The text before matching text=This battery is dead.
The text after matching text=This <b>battery</b> is dead.


The text before matching text=Batteries are in the box.
The text after matching text=<b>Batteries</b> are in the box.


The text before matching text=Do not eat the cookie batter before it is cooked.
The text after matching text=Do not eat the cookie batter before it is cooked.

And finally, you sharp SAS coders probably don't want to hardcode the search term. More likely it would be stored in a variable and then you could construct the regular expression like you would any other text variable:


mySearch = 'batter';
rx = "s/(" ||
      mySearch ||
      "[a-z]+)/<b>$1<\/b>/oi";

Or something like that. Also, you can search for more than one thing. Just enclose each pattern in () and refer to them as $1, $2, etc. Play around with it. Have fun. Thanks for reading!
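Here is a rough sketch that puts both ideas into one data step. The declared lengths, the use of cats() to strip the trailing blanks that a fixed-length variable carries, and the battery charger sample text are my own assumptions for illustration. I left off /o in the first call so that a changed mySearch would be picked up.

```sas
data _null_;
  length mySearch $20 text $120;

  /* 1. Build the pattern from a variable. cats() strips the    */
  /*    blanks that pad mySearch out to its declared length.    */
  mySearch = 'batter';
  text = prxchange(cats('s/(', mySearch, '[a-z]+)/<b>$1<\/b>/i'),
                   -1, 'Batteries are in the box.');
  put text=;

  /* 2. Two capture buffers: $1 is battery, $2 is charger.      */
  text = prxchange('s/(battery) (charger)/<i>$1<\/i> <b>$2<\/b>/i',
                   -1, 'The battery charger is missing.');
  put text=;
run;
```

The first put should show Batteries in bold tags. The second should show battery in italic tags and charger in bold tags.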

1 comment:

Anonymous, June 21, 2011 at 10:38:00 PM PDT
You probably don't want "battering", so

s/(batter(?:y|ies))/<b>$1<\/b>/oi

would be my choice.

ChG

Wednesday, December 09, 2009