parsing a string in SAS 9.2 that contains the '|' character - regex

I have a variable that contains a number of firms separated by the | symbol. I would like to count how many firms there are (i.e., the number of | characters plus one) and, ideally, identify the location of each | symbol in the string. Note that there will never be more than five firms in a single variable. I was trying to use the following approach, but I ran into the fact that SAS treats the | symbol as a special operator.
pattern1 = prxparse('/|/'); /* I can't seem to get SAS to treat this as a text to compare */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
call prxnext(pattern1, start, stop, reassignment2, position, length);
ARRAY Y[5];
do J=1 to 5 while (position > 0);
Y[J]=position;
call prxnext(pattern1, start, stop, reassignment2, position, length);
end;
nfirms=j+1;
run;

I would do it somewhat differently. What you really want is not the number of | characters but the actual firms, right? So search for those. Your code had a few minor issues. First, | is a regular expression metacharacter and must be escaped (\|) if you want to match it literally, unless it is inside [] as I am using it. Second, I call prxmatch first to confirm there is at least one match before using call prxnext. Third, your j+1 is wrong because the loop index is incremented one beyond the last qualifying loop value (I use j-1 because I find one more element than you do).
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
pattern1 = prxparse('/[^|]+/io'); /* Look for non-| characters */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
rc=prxmatch(pattern1,reassignment2);
if rc>0 then do;
ARRAY Y[5];
do J=1 by 1 until (position = 0);
call prxnext(pattern1, start, stop, reassignment2, position, length);
if position > 0 then Y[J]=position; /* skip the terminating 0 so that a fifth firm cannot overflow Y[5] */
end;
nfirms=j-1;
end;
else nfirms=0;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
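If you do want the positions of the | characters themselves, escaping the pipe should be enough. The following is only a minimal sketch, assuming reassignment2 holds a single delimited list; nbars is a name introduced here for illustration and is not part of the original code.

```sas
data _null_;
  reassignment2 = 'Firm1|Firm2|Firm3';
  pattern1 = prxparse('/\|/');   /* the backslash makes | a literal character */
  start = 1;
  stop = length(reassignment2);
  nbars = 0;                     /* hypothetical counter for the delimiters found */
  call prxnext(pattern1, start, stop, reassignment2, position, length);
  do while (position > 0);
    nbars = nbars + 1;
    put 'delimiter found at ' position=;
    /* PRXNEXT advances start past the previous match on its own */
    call prxnext(pattern1, start, stop, reassignment2, position, length);
  end;
  nfirms = nbars + 1;            /* firms = delimiters + 1 for a non-empty list */
  put nfirms=;
run;
```

For the sample value the delimiters sit at positions 6 and 12, so nfirms comes out as 3.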
For completeness' sake, you could also do this easily without regular expressions, using call scan.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
array y[5];
do nfirms=1 by 1 until (position le 0);
call scan(reassignment2,nfirms,position,length,'|');
if position > 0 then y[nfirms]=position; /* skip the terminating 0 so that a fifth firm cannot overflow y[5] */
end;
nfirms=nfirms-1; *loop ends one iteration too late;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;

I agree with @Joe that this could be done more simply without regular expressions, though I would simplify his code a little further to exclude the use of an array.
data test;
infile datalines truncover length = reclen;
input firmlist $varying256. reclen;
i = 0;
do until(scan(firmlist,i,"|") = "");
i + 1;
end;
nfirms = i - 1;
drop i;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;
run;
You said you'd also like to capture the position of the "|" character in the string, but if there are multiple firms per record there will be multiple "|" characters in the string. If you want the position of each one, an array might be a better route, though if you only want one, the index function will get you what you want. You'd use delimpos = index(firmlist,"|");.
I hope that helps!
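A minimal sketch of the array route mentioned above, assuming at most five firms per record (so at most four delimiters). The names delimpos, ndelims, and p are introduced for this illustration only, and countc is used here as one alternative way to count the delimiters.

```sas
data test;
  infile datalines truncover;
  input firmlist $50.;
  array delimpos[4];                 /* positions of up to four "|" characters */
  ndelims = 0;
  p = find(firmlist, '|');           /* position of the first "|", or 0 if none */
  do while (p > 0 and ndelims < dim(delimpos));
    ndelims = ndelims + 1;
    delimpos[ndelims] = p;
    p = find(firmlist, '|', p + 1);  /* continue searching after the last hit */
  end;
  /* number of firms = number of delimiters + 1, unless the list is blank */
  if missing(firmlist) then nfirms = 0;
  else nfirms = countc(firmlist, '|') + 1;
  drop p;
  datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
;
run;
```

Unused elements of delimpos stay missing, which makes it easy to tell how many delimiters a record actually had.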

Related

Using PRXNEXT to capture all instances of a keyword

I'm searching through medical notes to capture all instances of a phrase, in particular 'carbapenemase producing'. At times this phrasing can occur > 1 time in a string. From some research I think PRXNEXT would make the most sense but I'm having difficulty getting it to do what I want to. As an example for this string:
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion presumptive carbapenemase producing cre see
spmi for carba r pcr results not confirmed carbapenemase producing cre
From this comment above, I'd like to extract the phrases
presumptive carbapenemase producing
and
not confirmed carbapenemase producing
I realize I probably can't extract those exact phrases, but some variation of them obtained with a substring would do. I found the code I've been using here. This is what I have so far, but it only captures the first phrase:
carba_cnt = count(as_comments,'carba','i');
if _n_ = 1 then do;
retain reg1 neg1;
reg1 = prxparse("/ca[bepr]\w+ prod/");
end;
start = 1;
stop = length(as_comments);
position = 0;
length = 0;
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the start parameter so that searching */
/* begins again after the last match. */
call prxnext(reg1, start, stop, as_comments, position, length);
lastpos = 0;
do while (position > 0);
if lastpos then do;
length found $200;
found = substr(as_comments,lastpos,position-lastpos);
put found=;
output;
end;
lastpos = position;
call prxnext(reg1, start, stop, as_comments, position, length);
end;
if lastpos then do;
found = substr(as_comments,lastpos);
put found=;
output;
end;
You are correct to use PRXNEXT for locating each occurrence of a regex match in a source string. The regex pattern can be modified to use a capture group that searches for an optional leading "not confirmed". The approach least prone to coding mistakes is to build the loop and the extraction around a single call to PRXNEXT.
This example uses the pattern /((not confirmed\s*)?(ca[bepr]\w+ prod))/ and outputs one row per match.
data have;
id + 1;
length comment $2000;
infile datalines eof=done;
do until (_infile_ = '----');
input;
if _infile_ ne '----' then
comment = catx(' ',comment,_infile_);
end;
done:
if not missing(comment);
datalines4;
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion presumptive carbapenemase producing cre
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion conjectured carbapenems producing cre
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
;;;;
run;
data want;
set have;
prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod))/');
_start_inout = 1;
do hitnum = 1 by 1 until (pos=0);
call prxnext (prx, _start_inout, length(comment), comment, pos, len);
if len then do;
content = substr(comment,pos,len);
output;
end;
end;
keep id hitnum content;
run;
Bonus info: The prxparse call does not need to be inside an if _n_=1 block. From the PRXPARSE documentation:
If perl-regular-expression is a constant or if it uses the /o option, the Perl regular expression is compiled only once. Successive calls to PRXPARSE do not cause a recompile, but returns the regular-expression-id for the regular expression that was already compiled. This behavior simplifies the code because you do not need to use an initialization block (IF _N_ = 1) to initialize Perl regular expressions.

Get string between two specific char positions

I have a long text string in SAS, and within it is a value of variable length that is always preceded by a '#' and followed by a ','.
Is there a way I can extract this and store it as a new variable, please?
E.g.:
word word, word, #12.34, word, word
And I want to get the 12.34.
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only works if there is only one "#" in the string.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional call to the input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex way looks only for values shaped like ##.##, while the index way accepts anything between # and the next comma. As a result, the input function will decipher scientific notation values (such as 1e6) that the example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regular expression; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A small note about the regular expression and functions used below:
(?<=#) is a zero-width positive look-behind assertion that requires a # immediately before the pattern of interest.
(\d+\.?\d+) matches one or more digits, an optional ., and then one or more digits (so at least two digits in total).
(?=,) is a zero-width positive look-ahead assertion that requires a , immediately after the pattern of interest.
call prxsubstr finds the position and length of the match, and substr extracts the required value.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
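If you want to get really lazy, you can just keep the digits and periods and throw everything else away. This only works when the string contains no other digits or periods:
want = compress(have,".","kd");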
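To illustrate where the compress shortcut holds and where it breaks, here is a small sketch. The variable names and the second test string are made up for this example.

```sas
data _null_;
  /* only one number and no stray periods: the shortcut works */
  var1 = 'word word, word, #12.34, word, word';
  var2 = compress(var1, '.', 'kd');   /* k = keep the listed characters, d = add digits to the list */
  put var2=;

  /* a second number and a trailing period get glued onto the result */
  var3 = 'word1 123, hello #97.99, bye.';
  var4 = compress(var3, '.', 'kd');
  put var4=;
run;
```

The first call yields 12.34, while the second yields 12397.99. because every digit and period in the string is kept, not just the ones after the #.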

How to scan the string and convert dynamically in SAS

Suppose I have two strings to convert from a SAS program name to a table number.
My goal is to convert the first, "f-2-2-7-5-vcb", to "2.2.7.5".
This should be done dynamically. For example, "f-2-2-12-1-2-hbd87q"
needs to become "2.2.12.1.2".
How can I accomplish this?
data input;
input str $ 1-20;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
data want;
set input;
Sub=compress(substr(str,3,length(str)),,'kd') ;
run;
It's a bit of a longer way, but this works fine for me.
Use FIND() to find the first '-'.
Use REVERSE() and FIND() to find the last '-'.
Use SUBSTR() and the positions found above to remove the first and last components.
Use TRANSLATE() to convert the '-' characters to periods.
z=find(str, '-');
end=find(strip(reverse(str)), '-');
string = translate(substr(str, z+1, length(str) - z - end), ".", "-");
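For reference, the three statements above can be placed in a complete DATA step along these lines. This is a sketch only: first_dash and last_dash are names chosen here for readability, and the sample rows are the ones from the question.

```sas
data want;
  input str $ 1-20;
  length string $20;
  first_dash = find(str, '-');                  /* position of the first '-' */
  last_dash  = find(strip(reverse(str)), '-');  /* distance of the last '-' from the end */
  /* keep what lies between the first and last '-', then swap '-' for '.' */
  string = translate(substr(str, first_dash + 1,
                            length(str) - first_dash - last_dash), '.', '-');
  drop first_dash last_dash;
  datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
```

The strip() around reverse() matters because str is padded with trailing blanks, which become leading blanks once the value is reversed.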
A regular expression can match the dash delimited digits only sequence. The match, when extracted, can be transformed using translate.
data input;
input str $ 1-20;
rx = prxparse ("/^.*?((\d+)(-\d+)*)/");
if prxmatch(rx,str) then do;
call prxposn (rx,1,s,e);
name = substr(str,s,e);
name = translate(name,'.','-');
end;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
funky2-2-1funky
f-2-hb17
a2bfunky
;
run;
A funky situation occurs if the digits-only token sequence is preceded by a token ending with digits or followed by a token starting with digits (see funky2-2-1funky in the data above).
data input;
input str $ 1-20;
string=translate(prxchange('s/\w+?\-(.*)\-\w+/$1/',-1,strip(str)),'.','-');
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
You can do this in one line. Use substr to keep the text between the second word and the last word:
translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-')
Step 1: find(str,scan(str,2,'-')) finds the starting position of the second word.
Step 2: find(str,scan(str,-1,'-')) finds the starting position of the last word.
Step 3: step 2 - step 1 - 1 gives the length of the text to copy, i.e., up to the end of the second-to-last word.
substr(str,step1,step3) copies the text from the second word through the second-to-last word.
translate replaces each '-' with '.'.
Code:
data want;
set input;
Sub=translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-');
put _all_;
run;
Output:
str=f-2-3-1-5-vcb Sub=2.3.1.5
str=f-2-4-1-6-rtg Sub=2.4.1.6
str=f-2-3-11-1-3-hb17 Sub=2.3.11.1.3

Check specific sequence of alphanumeric string in sas

I have data with an ID field that is structured like this:
XX0000X
7 characters total, with the first 2 and last letters only, and numbers in between.
How can I check that the ID is structured specifically and exactly like this?
I'm not sure of how to approach checking this - one possibility was the CATs function but not sure how to apply that.
You can use a combination of functions to check this, including:
CHAR()
ANYDIGIT()
ANYALPHA()
data have;
input x $10.;
cards;
AB0000X
AO000BF
1234556
ABCDEFG
AB0123Y
AB
ABCDEFGHI
;
run;
data check;
set have;
flag=0;
if lengthn(x) ne 7 then flag=1;
length letter $1;
if flag=0 then do i=1 to 7;
letter = char(x, i);
if ( i in (1,2, 7) and anyalpha(letter) ne 1 )
or i in (3:6) and anydigit(letter) ne 1 then do;
flag=1;
leave;
end;
end;
run;
Regular expressions are obviously more succinct and likely a better approach.
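Here is an approach using a regular expression. ^ anchors the match at the start of the string, [A-Z]{2} matches the first two letters, [0-9]{4} matches the four digits in the middle, [A-Z] matches the last letter, \s*$ allows only trailing blanks before the end of the string, and the i modifier ignores case. Without the anchors, a longer value that merely contains the pattern would also pass. Note that here flag is 1 for a valid ID and 0 otherwise.
data want;
set have;
flag=prxmatch("m/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i",x);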
run;
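To see why anchoring matters when the check must be exact, here is a small sketch comparing an unanchored pattern with an anchored one. The variables loose and strict and the test value ZAB0123YZ are introduced for this illustration.

```sas
data _null_;
  length x $10;
  do x = 'AB0123Y', 'ZAB0123YZ', 'AB0000X';
    /* unanchored: true whenever the pattern occurs anywhere in x */
    loose  = prxmatch('/[A-Z]{2}[0-9]{4}[A-Z]/i', x) > 0;
    /* anchored: the pattern must span the whole value, ignoring trailing blanks */
    strict = prxmatch('/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i', x) > 0;
    put x= loose= strict=;
  end;
run;
```

The nine-character value passes the loose test but fails the strict one, which is the behavior an exact format check needs.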

SAS : How do I find nth instance of a character/group of characters within a string?

I'm trying to find a function that will index the nth instance of a character(s).
For example, if I have the string ABABABBABSSSDDEE and I want to find the 3rd instance of A, how do I do that? What if I want to find the 4th instance of AB?
ABABABBABSSSDDEE
data HAVE;
input STRING : $20.;
datalines;
ABABABBABSSSDDEE
;
RUN;
Here is a much simplified implementation of finding N-th instance of a group of characters in a SAS character string using SAS find() function:
data a;
s='AB bhdf +BA s Ab fs ABC Nfm AB ';
x='AB';
n=3;
/* from left to right */
p = 0;
do i=1 to n until(p=0);
p = find(s, x, p+1);
end;
put p=;
/* from right to left */
p = length(s) + 1;
do i=1 to n until(p=0);
p = find(s, x, -p+1);
end;
put p=;
run;
As you can see it allows for both, left-to-right and right-to-left searches.
You can combine these two into a SAS user-defined function (negative n will indicate search from right to left as it is in find function):
proc fcmp outlib=sasuser.functions.findnth;
function findnth(str $, sub $, n);
p = ifn(n>=0,0,length(str)+1);
do i=1 to abs(n) until(p=0);
p = find(str,sub,sign(n)*p+1);
end;
return (p);
endsub;
run;
Note that the above solutions with FIND() and FINDNTH() functions assume that the searched substring can overlap with its prior instance. For example, if we search for a substring ‘AAA’ within a string ‘ABAAAA’, then the first instance of the ‘AAA’ will be found in position 3, and the second instance – in position 4. That is, the first and second instances are overlapping. For that reason, when we find an instance we increment position p by 1 (p+1) to start the next iteration (instance) of the search.
However, if such overlapping is not a valid case in your searches, and you want to continue the search after the end of the previous substring instance, then we should increment p not by 1 but by the length of the substring x. That will also speed up the search (more so the longer x is), because we skip more characters as we go through the string s. In this case, the searches after the first one should use p+w instead of p+1, where w=length(x); the first search must still start at position 1.
A detailed discussion of this problem is in my recent SAS blog post, Finding n-th instance of a substring within a string. I also found that the find() function works considerably faster than the regular expression functions in SAS.
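The difference between the overlapping and non-overlapping variants can be sketched as follows. This is an illustration under my own assumptions, not code from the blog post: the test string is made up, and p is initialized to 1-w in the second loop so that the first search still starts at position 1.

```sas
data _null_;
  s = 'ABAAAAAA';   /* 'AAA' starts at positions 3, 4, 5 and 6 */
  x = 'AAA';
  n = 2;
  w = length(x);

  /* overlapping: the next search starts one character after the last hit */
  p = 0;
  do i = 1 to n until (p = 0);
    p = find(s, x, p + 1);
  end;
  put 'overlapping: ' p=;

  /* non-overlapping: the next search starts after the end of the last hit */
  p = 1 - w;        /* makes the first start position equal to 1 */
  do i = 1 to n until (p = 0);
    p = find(s, x, p + w);
  end;
  put 'non-overlapping: ' p=;
run;
```

With this string the second overlapping instance is at position 4, while the second non-overlapping instance is at position 6.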
I realize I'm late to the party here, but in the interest of adding to the collection of answers, here's what I've come up with.
DATA test;
input = "ABABABBABSSSDDEE";
A_3 = find(prxchange("s/A/#/", 2, input), "A");
AB_4 = find(prxchange("s/AB/##/", 3, input), "AB");
RUN;
Breaking it down, prxchange() just does a pattern matching replacement, but the great thing about it is that you can tell it how many times to replace that pattern. So, prxchange("s/A/#/", 2, input) replaces the first two A's in input with #. Once you've replaced the first two A's, you can wrap it in a find() function to find the "first A", which is actually the third A of the original string.
One thing to note about this approach is that, ideally, the replacement string should be the same length as the string you're replacing. For instance, notice the difference between
find(prxchange("s/AB/##/", 3, input), "AB") /* gives 8 (correct) */
and
find(prxchange("s/AB/#/", 3, input), "AB") /* gives 5 (incorrect) */
That's because we've replaced a string of length 2 with a string of length 1 three times. In other words:
(length("#") - length("AB")) * 3 = -3
so 8 + (-3) = 5.
Hopefully that helps someone out there!
data _null_;
findThis = 'A'; *** substring to find;
findIn = 'ADABAACABAAE'; **** the string to search;
instanceOf=1; *** and the instance of the substring we want to find;
pos = 0;
len = 0;
startHere = 1;
endAt = length(findIn);
n = 0; *** count occurrences of the pattern;
pattern = '/' || findThis || '/';
rx = prxparse(pattern);
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
if pos le 0 then do;
put 'Could not find ' findThis ' in ' findIn;
end;
else do while (pos gt 0);
n+1;
if n eq instanceOf then leave;
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
end;
if n eq instanceOf then do;
put 'found ' instanceOf 'th instance of ' findThis ' at position ' pos ' in ' findIn;
end;
else do;
put 'No ' instanceOf 'th instance of ' findThis ' found';
end;
run;
Here is a solution using the find() function and a do loop within a datastep. I then take that code, and place it into a proc fcmp procedure to create my own function called find_n(). This should greatly simplify whatever task is using this and allows for code re-use.
Define the data:
data have;
length string $50;
input string $;
datalines;
ABABABBABSSSDDEE
;
run;
Do-loop solution:
data want;
set have;
search_term = 'AB';
nth_time = 4;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; using start+1 would skip a character */
end;
if nth_time eq counter then do;
put "The nth occurrence was found at position " last_find;
end;
else do;
put "Could not find the nth occurrence";
end;
run;
Define the proc fcmp function:
Note: If the nth-occurrence cannot be found return 0.
options cmplib=work.temp; /* CMPLIB= takes libref.dataset; the package name is not included */
proc fcmp outlib=work.temp.temp;
function find_n(string $, search_term $, nth_time) ;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; using start+1 would skip a character */
end;
result = ifn(nth_time eq counter, last_find, 0);
return (result);
endsub;
run;
Example proc fcmp usage:
Note that this calls the function twice. The first example is showing the original request solution. The second example shows what happens when a match cannot be found.
data want;
set have;
nth_position = find_n(string, "AB", 4);
put nth_position =;
nth_position = find_n(string, "AB", 5);
put nth_position =;
run;