I am following a published method to identify matched cases, but I am getting the following error:
ERROR: No matching %MACRO statement for this %MEND statement.
WARNING: Apparent invocation of macro MATCH not resolved.
137 %MEND MATCH;
138
139 %MATCH (g.ps_match,Match4,scase4,scontrol4, abuser, 0.0001);
_
180
ERROR 180-322: Statement is not valid or it is used out of proper order.
How do I correctly call the macro?
I am using SAS University Edition.
The method is from
http://www2.sas.com/proceedings/sugi25/25/po/25p225.pdf
Part 2: Perform the Match
The next part of the macro program performs the match and
outputs the matched pairs. First, the cases data set is
selected. Curob is used to keep track of the current case.
Matchto is used to identify matched pairs of cases and
controls. Start and oldi are initialized to control processing of
the controls data set DO loop.
data &lib..&matched.
(drop=Cmatch randnum aprob cprob start
oldi curctrl matched);
set &lib..&SCase. ;
curob + 1;
matchto = curob;
if curob = 1 then do;
start = 1;
oldi = 1;
end;
Next, the controls data set is selected. Processing starts at
the first unmatched observation. The data set is searched
until a match is found, or it is determined no match can be
made. Error checking is performed to avoid an infinite loop.
Curctrl is used to keep track of current control.
DO i = start to n;
set &lib..&Scontrol. point = i nobs = n;
if i gt n then goto startovr;
if _Error_ = 1 then abort;
curctrl = i;
If the propensity score of the current case (aprob) matches the
propensity score of the current control (cprob), then a match
was found. Update Cmatch to 1=Yes. Output the control.
Update matched to keep track of last matched control. Exit
the DO loop. If the propensity score of the current control is
greater than the propensity score of the current case, then no
match will be found for the current case. Stop the DO loop
processing.
if aprob = cprob then
do;
Cmatch = 1;
output &lib..&matched.;
matched = curctrl;
goto found;
end;
else if cprob gt aprob then
goto nextcase;
startovr: if i gt n then
goto nextcase;
END;
/* end of DO LOOP */
nextcase:
if Cmatch=0 then start = oldi;
found:
if Cmatch = 1 then do;
oldi = matched + 1;
start = matched + 1;
set &lib..&SCase. point = curob;
output &lib..&matched.;
end;
retain oldi start;
if _Error_=1 then _Error_=0;
run;
%MEND MATCH;
MACRO MATCH CALL STATEMENT
The following are call statements to the macro
program MATCH. The first performs a 4-digit match;
the second performs a 3-digit match.
%MATCH(STUDY,Propen,Match4,SCase4,
SContrl4,Interven,.0001);
%MATCH(STUDY,Propen,Match3,SCase3,
SContrl3,Interven,.001);
Presumably, you didn't include the beginning of the macro (i.e., the %MACRO MATCH(... portion, earlier in the paper). This is a macro, and as written it is not intended to be run in pieces. You need to include all of the code from %MACRO MATCH to %MEND, and then the calls.
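As a rough sketch of the required structure (not the paper's actual code): the parameter names dataset, depend and digits below are hypothetical placeholders inferred from the seven-argument calls, and the body is reduced to a %PUT. The real definition should be copied from the paper in full.

```sas
/* Hypothetical skeleton only: lib, matched, SCase and SControl appear in the
   quoted code; dataset, depend and digits are guessed placeholder names. */
%macro match(lib, dataset, matched, SCase, SControl, depend, digits);
  /* Part 1 and Part 2 from the paper belong here, between %MACRO and %MEND. */
  %put NOTE: would match &lib..&SCase against &lib..&SControl into &lib..&matched (digits=&digits);
%mend match;

/* The call only resolves after the definition above has been submitted. */
%match(STUDY, Propen, Match4, SCase4, SContrl4, Interven, .0001);
```

Submitting only the tail of the macro leaves %MEND without a matching %MACRO, which is what the log in the question reports.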
I'm searching through medical notes to capture all instances of a phrase, in particular 'carbapenemase producing'. At times this phrasing can occur > 1 time in a string. From some research I think PRXNEXT would make the most sense but I'm having difficulty getting it to do what I want to. As an example for this string:
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion presumptive carbapenemase producing cre see
spmi for carba r pcr results not confirmed carbapenemase producing cre
From this comment above, I'd like to extract the phrases
presumptive carbapenemase producing
and
not confirmed carbapenemase producing
I realize I probably can't extract those exact phrases, only some variation of them with a substring. I found the code I've been using here. Here's what I have so far, but it only captures the first phrase:
carba_cnt = count(as_comments,'carba','i');
if _n_ = 1 then do;
retain reg1 neg1;
reg1 = prxparse("/ca[bepr]\w+ prod/");
end;
start = 1;
stop = length(as_comments);
position = 0;
length = 0;
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the start parameter so that searching */
/* begins again after the last match. */
call prxnext(reg1, start, stop, as_comments, position, length);
lastpos = 0;
do while (position > 0);
if lastpos then do;
length found $200;
found = substr(as_comments,lastpos,position-lastpos);
put found=;
output;
end;
lastpos = position;
call prxnext(reg1, start, stop, as_comments, position, length);
end;
if lastpos then do;
found = substr(as_comments,lastpos);
put found=;
output;
end;
You are correct to use PRXNEXT for locating each occurrence of a regex match in a source. The regex pattern can be modified to use a capture group that searches for an optional leading "not confirmed". The approach least prone to coding mistakes is to build the loop and the extraction around a single call to PRXNEXT.
This example uses the pattern /((not confirmed\s*)?(ca[bepr]\w+ prod))/ and outputs one row per match.
data have;
id + 1;
length comment $2000;
infile datalines eof=done;
do until (_infile_ = '----');
input;
if _infile_ ne '----' then
comment = catx(' ',comment,_infile_);
end;
done:
if not missing(comment);
datalines4;
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion presumptive carbapenemase producing cre
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion conjectured carbapenems producing cre
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
;;;;
run;
data want;
set have;
prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod))/');
_start_inout = 1;
do hitnum = 1 by 1 until (pos=0);
call prxnext (prx, _start_inout, length(comment), comment, pos, len);
if len then do;
content = substr(comment,pos,len);
output;
end;
end;
keep id hitnum content;
run;
Bonus info: the prxparse call does not need to be inside an if _n_=1 block. From the PRXPARSE docs:
If perl-regular-expression is a constant or if it uses the /o option, the Perl regular expression is compiled only once. Successive calls to PRXPARSE do not cause a recompile, but returns the regular-expression-id for the regular expression that was already compiled. This behavior simplifies the code because you do not need to use an initialization block (IF _N_ = 1) to initialize Perl regular expressions.
I'm trying to find a function that will index the nth instance of a character(s).
For example, if I have the string ABABABBABSSSDDEE and I want to find the 3rd instance of A, how do I do that? What if I want to find the 4th instance of AB?
ABABABBABSSSDDEE
data HAVE;
length STRING $20; /* the default length of 8 would truncate the value */
input STRING $;
datalines;
ABABABBABSSSDDEE
;
RUN;
Here is a much simplified implementation of finding N-th instance of a group of characters in a SAS character string using SAS find() function:
data a;
s='AB bhdf +BA s Ab fs ABC Nfm AB ';
x='AB';
n=3;
/* from left to right */
p = 0;
do i=1 to n until(p=0);
p = find(s, x, p+1);
end;
put p=;
/* from right to left */
p = length(s) + 1;
do i=1 to n until(p=0);
p = find(s, x, -p+1);
end;
put p=;
run;
As you can see, it allows for both left-to-right and right-to-left searches.
You can combine these two into a SAS user-defined function (negative n will indicate search from right to left as it is in find function):
proc fcmp outlib=sasuser.functions.findnth;
function findnth(str $, sub $, n);
p = ifn(n>=0,0,length(str)+1);
do i=1 to abs(n) until(p=0);
p = find(str,sub,sign(n)*p+1);
end;
return (p);
endsub;
run;
Note that the above solutions with FIND() and FINDNTH() functions assume that the searched substring can overlap with its prior instance. For example, if we search for a substring ‘AAA’ within a string ‘ABAAAA’, then the first instance of the ‘AAA’ will be found in position 3, and the second instance – in position 4. That is, the first and second instances are overlapping. For that reason, when we find an instance we increment position p by 1 (p+1) to start the next iteration (instance) of the search.
However, if such overlapping is not valid in your searches, and you want to continue the search after the end of the previous substring instance, then p should be incremented not by 1 but by the length of the substring x. That also speeds up the search (more so the longer x is), because more characters are skipped on each pass through the string s. In this case, replace p+1 with p+w in the search code, where w=length(x).
A detailed discussion of this problem is in my recent SAS blog post Finding n-th instance of a substring within a string. I also found that the find() function works considerably faster than the regular expression functions in SAS.
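A minimal sketch of the non-overlapping, left-to-right variant described above. The sample string and the variable w are illustrative assumptions, not part of the original code.

```sas
data _null_;
  s = 'ABAAAAAA';   /* illustrative string: 'AAA' fits at 3 and, without overlap, at 6 */
  x = 'AAA';
  n = 2;
  w = length(x);    /* step size: skip past the whole previous match */
  p = 1 - w;        /* so that the first search starts at position 1 */
  do i = 1 to n until (p = 0);
    p = find(s, x, p + w);
  end;
  put p=;           /* p=6 here; the overlapping version (p+1) would report 4 */
run;
```

Starting p at 1-w keeps the loop body identical for the first and later iterations, at the cost of a slightly less obvious initial value.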
I realize I'm late to the party here, but in the interest of adding to the collection of answers, here's what I've come up with.
DATA test;
input = "ABABABBABSSSDDEE";
A_3 = find(prxchange("s/A/#/", 2, input), "A");
AB_4 = find(prxchange("s/AB/##/", 3, input), "AB");
RUN;
Breaking it down, prxchange() just does a pattern matching replacement, but the great thing about it is that you can tell it how many times to replace that pattern. So, prxchange("s/A/#/", 2, input) replaces the first two A's in input with #. Once you've replaced the first two A's, you can wrap it in a find() function to find the "first A", which is actually the third A of the original string.
One thing to note about this approach is that, ideally, the replacement string should be the same length as the string you're replacing. For instance, notice the difference between
find(prxchange("s/AB/##/", 3, input), "AB") /* gives 8 (correct) */
and
find(prxchange("s/AB/#/", 3, input), "AB") /* gives 5 (incorrect) */
That's because we've replaced a string of length 2 with a string of length 1 three times. In other words:
(length("#") - length("AB")) * 3 = -3
so 8 + (-3) = 5.
Hopefully that helps someone out there!
data _null_;
findThis = 'A'; *** substring to find;
findIn = 'ADABAACABAAE'; **** the string to search;
instanceOf=1; *** and the instance of the substring we want to find;
pos = 0;
len = 0;
startHere = 1;
endAt = length(findIn);
n = 0; *** count occurrences of the pattern;
pattern = '/' || findThis || '/';
rx = prxparse(pattern);
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
if pos le 0 then do;
put 'Could not find ' findThis ' in ' findIn;
end;
else do while (pos gt 0);
n+1;
if n eq instanceOf then leave;
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
end;
if n eq instanceOf then do;
put 'found ' instanceOf 'th instance of ' findThis ' at position ' pos ' in ' findIn;
end;
else do;
put 'No ' instanceOf 'th instance of ' findThis ' found';
end;
run;
Here is a solution using the find() function and a do loop within a datastep. I then take that code, and place it into a proc fcmp procedure to create my own function called find_n(). This should greatly simplify whatever task is using this and allows for code re-use.
Define the data:
data have;
length string $50;
input string $;
datalines;
ABABABBABSSSDDEE
;
run;
Do-loop solution:
data want;
set have;
search_term = 'AB';
nth_time = 4;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; start+1 would skip adjacent matches */
end;
if nth_time eq counter then do;
put "The nth occurrence was found at position " last_find;
end;
else do;
put "Could not find the nth occurrence";
end;
run;
Define the proc fcmp function:
Note: If the nth-occurrence cannot be found return 0.
options cmplib=work.temp.temp;
proc fcmp outlib=work.temp.temp;
function find_n(string $, search_term $, nth_time) ;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; start+1 would skip adjacent matches */
end;
result = ifn(nth_time eq counter, last_find, 0);
return (result);
endsub;
run;
Example proc fcmp usage:
Note that this calls the function twice. The first example is showing the original request solution. The second example shows what happens when a match cannot be found.
data want;
set have;
nth_position = find_n(string, "AB", 4);
put nth_position =;
nth_position = find_n(string, "AB", 5);
put nth_position =;
run;
When I run prxmatch I keep getting an error saying argument 1 is missing. I've checked the pattern and it processes correctly, but when I try to use it through SAS I get the errors below.
Here is an example WORD_0012_MUK613 which returns N
data test2;
set test;
if prxmatch(prxparse('^WORD_\d{4}_\w{3}\d{3}$'), external_id) then match = 'Y'; else match = 'N';
run;
NOTE: Argument 1 to the function PRXMATCH is missing.
ERROR: Argument 1 to the function PRXMATCH must be a positive integer returned by PRXPARSE for a valid pattern.
ERROR: Closing delimiter "^" not found after regular expression "^WORD_\d{4}_\w{3}\d{3}$".
ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
When I add the delimiter it gets rid of the error, but it still doesn't match:
data test2;
set test;
if prxmatch(prxparse('/^COAF_\d{4}_\w{3}\d{3}$/'), external_id) then match = 'Y'; else match = 'N';
run;
First off, prxparse exists to allow you to separate the compilation of the regex from its use. That's useful for code structure. However, it's not really useful in the way you used it there - nesting it.
data test2;
set test;
rx_word = prxparse('^WORD_\d{4}_\w{3}\d{3}$');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Second, you need delimiters in SAS to wrap around the regex (this will be useful in step 3). Almost any punctuation character is fine: the first character you pass becomes the delimiter, so pick something you won't use anywhere else in the pattern. / is common, but I like to use ~ sometimes, as / can be needed in the regex and would then have to be escaped.
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Third, you may want some options. o compiles the regex only once instead of once per row of your dataset, which would be horribly slow; SAS already does this for a constant pattern like the one here, but it matters when the pattern is built from a variable. i makes the match case insensitive. s lets . match line breaks in the string, if that's relevant. Options go after the ending delimiter, which is why the delimiters are needed (and the delimiters are not optional even if you use no options).
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Fourth, SAS character variables are fixed length and padded with trailing blanks (they are not varchar). If there are blanks after your value, you'll get no match with the $ end-of-string anchor. So trim your strings when matching them, unless you are 100% sure the values fill the variable's full length (or use substr or something else to get the exact length).
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
if prxmatch(rx_word, trim(external_id)) then match = 'Y'; else match = 'N';
run;
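To see the effect of trailing blanks in isolation, here is a small sketch. The variable names padded and trimmed are illustrative, and the pattern is passed to prxmatch directly as a string.

```sas
data _null_;
  length external_id $20;            /* value is 16 characters, so 4 trailing blanks */
  external_id = 'WORD_0012_MUK613';
  /* Without trim, the blanks sit between the last digit and the $ anchor. */
  padded  = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/', external_id);
  /* With trim, the value ends right after the last digit. */
  trimmed = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/', trim(external_id));
  put padded= trimmed=;              /* padded=0 trimmed=1 */
run;
```

An alternative under the same assumptions is to allow the padding in the pattern itself, for example by ending it with \s*$ instead of $.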
Finally, you can improve the last statement by using ifc, since prxmatch returns 0 for no match.
The full example:
data test;
length external_id $20;
input external_id $;
datalines;
WORD_0012_MUK613
WORD_5344_915ABC
;;;;
run;
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
match = ifc(prxmatch(rx_word, trim(external_id)),'Y','N');
put match=;
run;
There is no need for the PRXPARSE function, but you do need delimiters around the expression, for example /exp/:
data _null_;
x = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/','WORD_0012_MUK613 | N');
put _all_;
run;
x=0 _ERROR_=0 _N_=1
To make it clearer, here is what I am trying to do. I just want to find a way to get the substring and put it into a variable:
DECLARE
v_file_type thufitab.file_type%TYPE;
v_filename thufitab.filename%TYPE;
v_status thufitab.status%TYPE;
V_seq_FILENAME NUMBER (4);
CURSOR List_FILENAME_cur
IS
SELECT FILENAME
FROM thufitab
WHERE status = 2 AND ROWNUM <= 100;
BEGIN
FOR List_FILENAME_rec IN List_FILENAME_cur
LOOP
SELECT REGEXP_SUBSTR (FILENAME, '([1-9][0-9]{0,3})')
INTO V_seq_FILENAME
FROM thufitab;
DBMS_OUTPUT.PUT_LINE (V_seq_FILENAME);
END LOOP;
END;
Not sure I understand well, but, is this ok for you?
'CDR-([1-9][0-9]{0,3})_[0-9]{2}_[0-9]{2}_[0-9]{2}_[0-9]{4}_UK1\.FCDR'
^_______________^
group 1
I have a variable that contains a number of firms separated by the | symbol. I would like to count how many firms there are (i.e., the number of | characters plus 1) and, ideally, identify the location of each | symbol in the string. Note that there will not be more than five firms in a single variable. I was trying the following approach, but ran into the fact that SAS treats the | symbol as a special operator.
pattern1 = prxparse('/|/'); /* I can't seem to get SAS to treat this as a text to compare */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
call prxnext(pattern1, start, stop, reassignment2, position, length);
ARRAY Y[5];
do J=1 to 5 while (position > 0);
Y[J]=position;
call prxnext(pattern1, start, stop, reassignment2, position, length);
end;
nfirms=j+1;
run;
I would do it somewhat differently. What you really want is not the number of | characters, but the actual firms, right? So search for those. Your code had a few minor issues. First, I call prxmatch before call prxnext, so that records with no match at all are handled separately. Second, your j+1 is wrong, because the loop iterator increments one beyond the last qualifying loop value (I use j-1 because I find one more element than you do). Third, | is a regular expression metacharacter and must be escaped (\|) if you want to match it literally, unless it is inside [] as in my pattern.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
pattern1 = prxparse('/[^|]+/io'); /* Look for non-| characters */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
rc=prxmatch(pattern1,reassignment2);
if rc>0 then do;
ARRAY Y[5];
do J=1 by 1 until (position = 0);
call prxnext(pattern1, start, stop, reassignment2, position, length);
if position > 0 then Y[J]=position; /* avoids an out-of-range subscript when a record has five firms */
end;
nfirms=j-1;
end;
else nfirms=0;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
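If you do want to stay with your original idea of locating the | characters themselves, a minimal sketch with the escaped metacharacter could look like this. The sample value and the names npipes and len are illustrative assumptions.

```sas
data _null_;
  reassignment2 = 'Firm1|Firm2|Firm3';
  pattern1 = prxparse('/\|/');       /* \| matches a literal pipe character */
  start = 1;
  stop = length(reassignment2);
  npipes = 0;
  call prxnext(pattern1, start, stop, reassignment2, position, len);
  do while (position > 0);
    npipes = npipes + 1;
    call prxnext(pattern1, start, stop, reassignment2, position, len);
  end;
  nfirms = npipes + 1;
  put npipes= nfirms=;               /* npipes=2 nfirms=3 for this value */
run;
```

Counting inside the loop body avoids relying on the final value of a loop index, which is where the original j+1 went wrong.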
For completeness' sake, you could also do this easily without regular expressions, using call scan.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
array y[5];
do nfirms=1 by 1 until (position le 0);
call scan(reassignment2,nfirms,position,length,'|');
if position > 0 then y[nfirms]=position; /* avoids an out-of-range subscript when a record has five firms */
end;
nfirms=nfirms-1; *loop ends one iteration too late;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
I agree with @Joe that this could be done more simply without regular expressions, though I would simplify his code a little further to exclude the use of an array.
data test;
infile datalines truncover length = reclen;
input firmlist $varying256. reclen;
i = 0;
do until(scan(firmlist,i,"|") = "");
i + 1;
end;
nfirms = i - 1;
drop i;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;
run;
You said you'd also like to capture the position of the "|" character in the string, but if there are multiple firms per record there will be multiple "|" characters in the string. If you want the position of each one, an array might be a better route, though if you only want one, the index function will get you what you want. You'd use delimpos = index(firmlist,"|");.
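A short sketch of the array route for the delimiter positions, using find() with a moving start position. The names delimpos, p and k are illustrative, and countw() is used here as an alternative to the scan() loop for the firm count.

```sas
data _null_;
  firmlist = 'Firm1|Firm2|Firm3';
  array delimpos[4];                 /* at most five firms means at most four delimiters */
  nfirms = countw(firmlist, '|');    /* words separated by |, here 3 */
  p = 0;
  do k = 1 to dim(delimpos);
    p = find(firmlist, '|', p + 1);  /* next | after the previous one */
    if p = 0 then leave;             /* no more delimiters in this record */
    delimpos[k] = p;
  end;
  put nfirms= delimpos[*]=;          /* positions 6 and 12 here, the rest stay missing */
run;
```

Unused array elements are left missing, so the number of non-missing positions plus one should agree with nfirms.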
I hope that helps!