Solved. See footnote.
/*check regex*/
go = 1;
i = 1;
do while (go = 1);
set braw.regex point = i;
if (upcase(fname) = upcase("&var.")) then do;
put format1 " one"; /*format1 is a field of braw.regex, properties says character length 30*/
if format1 = '/\d{8}/' then put 'hello world one'; else put 'good bye world one';
%check1(&data, format1, &var)
end;
else i = i+1;
end;
/*check1 passes regex, string, true false to check_format*/
%macro check_format(regex, string, truefalse);
pattern = prxparse(&regex.);
truefalse = prxmatch(pattern, &string);
put &regex " " &string " " &truefalse "post";
%mend;
Sorry about the lack of indentation; Stack Overflow seems to be buggy.
This outputs
/\d{8}/ one
good bye world one
Apparently format1 isn't a string, so it then fails the prxparse, which expects a string input.
Any idea what I should do?
I was thinking I could use a macro variable to put quotes around it, perhaps using:
call symput('mymacrovar', format1);
%let mymacrovar = "&mymacrovar";
but that symput does nothing.
Solved:
It was being read as a string. On the CSV file that the regex dataset was being read from, there were additional spaces between the commas, making the string '_/\d{8}/' (underscore denoting a space) which prxparse doesn't like.
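To make the failure mode concrete, here is a minimal sketch in Python rather than SAS. The helper split_delimited is hypothetical and exists only for this illustration: it mimics the requirement that the pattern text has the /regex/ form, and shows that a stray leading blank breaks that form while trimming the field first repairs it. In the SAS step, the analogous fix would presumably be to apply strip() to format1 before handing it to prxparse.

```python
import re


def split_delimited(pattern_text):
    """Hypothetical helper: accept only text of the form /regex/ and return the regex part."""
    well_formed = (
        len(pattern_text) >= 2
        and pattern_text.startswith("/")
        and pattern_text.endswith("/")
    )
    if not well_formed:
        raise ValueError(f"not a /.../ pattern: {pattern_text!r}")
    return pattern_text[1:-1]


# Value as read from the CSV, with the stray leading space described above.
raw_field = r" /\d{8}/"

try:
    split_delimited(raw_field)
    untrimmed_ok = True
except ValueError:
    untrimmed_ok = False

# Trimming the field first restores the /.../ form, so the pattern compiles and matches.
compiled = re.compile(split_delimited(raw_field.strip()))
trimmed_match = compiled.search("20240131") is not None

print(untrimmed_ok, trimmed_match)
```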
In SAS DI when I connect a user written transformation to an output table, the variable _OUTPUT_connect is assigned. In my case it looks something like this:
%let _OUTPUT_connect = DEFER=YES READBUFF=25000 DBCLIENT_MAX_BYTES=1 DB_LENGTH_SEMANTICS_BYTE=NO PATH=MY_PATH AUTHDOMAIN="MY_AUTH_DOMAIN"
Now I'm trying to extract the PATH and AUTHDOMAIN variables from _OUTPUT_connect. My solution for now is the following:
%let _authdomain = %sysfunc(scan(&_OUTPUT_connect,7," "));
%let _path = %sysfunc(scan(%sysfunc(scan(&_OUTPUT_connect,5," ")),2,"="));
This works but it breaks if the order of the _OUTPUT_connect variables changes.
I thought I'd use regex to match the parameter values: PATH=[match_this] and AUTHDOMAIN="[match_this]", but I have problems parsing the variable _OUTPUT_connect because it contains double quotes. When I manually assign _OUTPUT_connect without the double quotes I can do the following:
data _null_;
re = prxparse('/PATH=(\w)*/');
string = "&_OUTPUT_connect";
position = prxmatch(re, string);
put position=;
matched_pattern=prxposn(re, 0, string);
put matched_pattern=;
run;
Output:
position=75
matched_pattern=PATH=A1091211_SAS_SRV
The problem however is that _OUTPUT_connect contains double quotes, and the regex function fails when the input string contains double quotes. Since _OUTPUT_connect is assigned automatically, I cannot change the format.
I've tried to remove the double quotes from _OUTPUT_connect using this %let unquoted =%sysfunc(translate(%quote(&test),' ','"'));. This does work, but it puts a whitespace in place of the double quotes.
Is there an easy way to retrieve the values of PATH and AUTHDOMAIN from _OUTPUT_connect?
You can extract the name value pairs of the connection string by using SCAN with modifiers.
Example:
data nvps(label='name value pairs' keep=name value);
s = 'name1=value1 name2="value2" name3="value 3"';
do index = 1 to countw(s,' ','q');
nvp = scan(s,index,' ','q');
name = scan(nvp,1,'=','q');
value = scan(nvp,2,'=','q');
output;
end;
run;
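The same quote-aware splitting idea can be cross-checked outside SAS. The sketch below uses Python's shlex.split, which, like SCAN with the q modifier, does not split on blanks inside quoted substrings. The function name name_value_pairs is made up for this example, and the connection string is the sample from the question. One difference to be aware of: shlex removes the surrounding double quotes, whereas the SAS SCAN call, as far as I know, leaves them in the value (DEQUOTE would remove them).

```python
import shlex

# Sample connection string from the question.
connect = (
    'DEFER=YES READBUFF=25000 DBCLIENT_MAX_BYTES=1 '
    'DB_LENGTH_SEMANTICS_BYTE=NO PATH=MY_PATH AUTHDOMAIN="MY_AUTH_DOMAIN"'
)


def name_value_pairs(text):
    """Split on blanks outside quotes, then split each token at its first '='."""
    pairs = {}
    for token in shlex.split(text):
        name, _, value = token.partition("=")
        pairs[name] = value
    return pairs


pairs = name_value_pairs(connect)
print(pairs["PATH"], pairs["AUTHDOMAIN"])
```

Because the lookup is by name, the result does not depend on the order of the options in the string.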
I'm searching through medical notes to capture all instances of a phrase, in particular 'carbapenemase producing'. At times this phrasing can occur > 1 time in a string. From some research I think PRXNEXT would make the most sense but I'm having difficulty getting it to do what I want to. As an example for this string:
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion presumptive carbapenemase producing cre see
spmi for carba r pcr results not confirmed carbapenemase producing cre
From this comment above, I'd like to extract the phrases
presumptive carbapenemase producing
and
not confirmed carbapenemase producing
I realize I probably can't extract those exact phrases, only some variation of them with a substring. The code I've been using is from an example I found. Here's what I have so far, but it only captures the first phrase:
carba_cnt = count(as_comments,'carba','i');
if _n_ = 1 then do;
retain reg1 neg1;
reg1 = prxparse("/ca[bepr]\w+ prod/");
end;
start = 1;
stop = length(as_comments);
position = 0;
length = 0;
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the start parameter so that searching */
/* begins again after the last match. */
call prxnext(reg1, start, stop, as_comments, position, length);
lastpos = 0;
do while (position > 0);
if lastpos then do;
length found $200;
found = substr(as_comments,lastpos,position-lastpos);
put found=;
output;
end;
lastpos = position;
call prxnext(reg1, start, stop, as_comments, position, length);
end;
if lastpos then do;
found = substr(as_comments,lastpos);
put found=;
output;
end;
You are correct to use PRXNEXT for locating each occurrence of a regex match in a source. The regex pattern can be modified to use a capture group that searches for an optional leading "not confirmed". The least error-prone approach is to build the loop and the extraction around a single call to PRXNEXT.
This example uses the pattern /((not confirmed\s*)?(ca[bepr]\w+ prod))/ and outputs one row per match.
data have;
id + 1;
length comment $2000;
infile datalines eof=done;
do until (_infile_ = '----');
input;
if _infile_ ne '----' then
comment = catx(' ',comment,_infile_);
end;
done:
if not missing(comment);
datalines4;
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion presumptive carbapenemase producing cre
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
if amikacin results are needed please notify microbiology lab at ext
for further testing the organism will be held until meropenem result
obtained by disc diffusion conjectured carbapenems producing cre
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
;;;;
run;
data want;
set have;
prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod))/');
_start_inout = 1;
do hitnum = 1 by 1 until (pos=0);
call prxnext (prx, _start_inout, length(comment), comment, pos, len);
if len then do;
content = substr(comment,pos,len);
output;
end;
end;
keep id hitnum content;
run;
Bonus info: the prxparse call does not need to be inside an if _n_=1 block. From the PRXPARSE documentation:
If perl-regular-expression is a constant or if it uses the /o option, the Perl regular expression is compiled only once. Successive calls to PRXPARSE do not cause a recompile, but returns the regular-expression-id for the regular expression that was already compiled. This behavior simplifies the code because you do not need to use an initialization block (IF _N_ = 1) to initialize Perl regular expressions.
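As a quick sanity check of the pattern itself, the one-row-per-match loop can be sketched in Python, where re.finditer plays the role of the repeated PRXNEXT calls. This is only an illustration of the regex, not a replacement for the data step. The comment text is the first sample record, and the reported position is converted to 1-based to line up with SAS conventions.

```python
import re

# First sample comment from the example data, joined into one string.
comment = (
    "if amikacin results are needed please notify microbiology lab at ext "
    "for further testing the organism will be held until meropenem result "
    "obtained by disc diffusion presumptive carbapenemase producing cre "
    "see spmi for carba r pcr results not confirmed carbapenemase producing cre"
)

# Same pattern as in the data step: optional "not confirmed" prefix, then the phrase.
pattern = re.compile(r"((not confirmed\s*)?(ca[bepr]\w+ prod))")

# One tuple per match: hit number, 1-based position, matched text.
hits = [
    (hitnum, m.start() + 1, m.group(1))
    for hitnum, m in enumerate(pattern.finditer(comment), start=1)
]

for hit in hits:
    print(hit)
```

Note that "presumptive" is not part of the first hit, because the pattern only allows "not confirmed" as an optional prefix.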
I have a long text string in SAS, and within it is a value of variable length that is always preceded by a '#' and ends with a ','.
Is there a way I can extract this and store it as a new variable, please?
e.g:
word word, word, #12.34, word, word
And I want to get the 12.34
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only works if there is only one "#" in the string.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional use of the input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex way looks for ##.## variants; the index way looks only for #...,. The input function will then decipher scientific-notation values that the example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regular expression; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A few notes on the regular expression and functions used below:
(?<=#) is a zero-width positive look-behind assertion; it requires a # immediately before the pattern of interest.
(\d+\.?\d+) matches one or more digits, an optional period, then one or more digits, so the value must contain at least two digits.
(?=,) is a zero-width positive look-ahead assertion; it requires a , immediately after the pattern of interest.
call prxsubstr finds the position and length of the match, and substr extracts the required value.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
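The look-behind/look-ahead pattern can be exercised outside SAS as well. The sketch below, in Python, runs the answer's pattern against a few of the sample rows. It also includes a relaxed variant, which is my own assumption and not part of the answer, that swaps the final \d+ for \d* so that a single-digit value such as #5, is also found. The helper name extract and the "#5, single digit" row are made up for this illustration.

```python
import re

original = re.compile(r"(?<=#)(\d+\.?\d+)(?=,)")  # pattern from the answer
relaxed = re.compile(r"(?<=#)(\d+\.?\d*)(?=,)")   # assumption: also allow a single digit


def extract(pattern, text):
    """Return the first value found between '#' and ',' or None if there is none."""
    m = pattern.search(text)
    return m.group(1) if m else None


rows = [
    "word word, word, #12.34, word, word",
    "word1 #970000 hello hi hello hi #970022, hi",
    "word1 123, hello hi hello hi #97.99",
    "#5, single digit",
]

for row in rows:
    print(repr(row), extract(original, row), extract(relaxed, row))
```

The second row shows that a # value without a trailing comma is skipped in favour of the later one, and the third row yields no match at all because neither number has both sentinels.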
If you want to get really lazy, you can just do:
want = compress(have,".","kd");
Suppose I have two strings to convert from SAS program names to table numbers.
My goal is to convert the first, "f-2-2-7-5-vcb", to "2.2.7.5".
This should be done dynamically: for "f-2-2-12-1-2-hbd87q",
the result should be "2.2.12.1.2".
How can I accomplish this?
data input;
input str $ 1-20;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
data want;
set input;
Sub=compress(substr(str,3,length(str)),,'kd') ;
run;
This is a bit of a longer way, but it works fine for me.
Use FIND() to find the first '-'.
Use REVERSE() and FIND() to find the last '-'.
Use SUBSTR() with the two positions above to remove the first and last components.
Use TRANSLATE() to convert the '-' to periods.
z=find(str, '-');
end=find(strip(reverse(str)), '-');
string = translate(substr(str, z+1, length(str) - z - end), ".", "-");
A regular expression can match the dash delimited digits only sequence. The match, when extracted, can be transformed using translate.
data input;
input str $ 1-20;
rx = prxparse ("/^.*?((\d+)(-\d+)*)/");
if prxmatch(rx,str) then do;
call prxposn (rx,1,s,e);
name = substr(str,s,e);
name = translate(name,'.','-');
end;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
funky2-2-1funky
f-2-hb17
a2bfunky
;
run;
A funky situation occurs if the digits only token sequence is preceded by a token ending with digits, or succeeded by a token starting with digits.
data input;
input str $ 1-20;
string=translate(prxchange('s/\w+?\-(.*)\-\w+/$1/',-1,strip(str)),'.','-');
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
You can do this in one line. Use substr to keep the text between the second word and the last word:
translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-')
step1 = find(str,scan(str,2,'-')) : finds the starting position of the second word.
step2 = find(str,scan(str,-1,'-')) : finds the starting position of the last word.
step3 = step2 - step1 - 1 : the length of the text to copy, which ends at the second-to-last word.
substr(str,step1,step3) : copies the text from the second word through the second-to-last word.
translate(...,'.','-') : replaces '-' with '.'.
Code:
data want;
set input;
Sub=translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-');
put _all_;
run;
Output:
str=f-2-3-1-5-vcb Sub=2.3.1.5
str=f-2-4-1-6-rtg Sub=2.4.1.6
str=f-2-3-11-1-3-hb17 Sub=2.3.11.1.3
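The answers above share one idea: drop the first and the last dash-delimited token and join what is left with periods. A minimal sketch of that idea in Python (not SAS) is shown here as a way to check expected results for new program names. The function name table_number is hypothetical, and returning an empty string for names with fewer than three tokens is an assumption of this sketch.

```python
def table_number(program_name):
    """Drop the first and last '-'-delimited tokens and join the rest with '.'."""
    tokens = program_name.strip().split("-")
    if len(tokens) < 3:
        # Assumption: nothing sensible to return when there is no middle part.
        return ""
    return ".".join(tokens[1:-1])


for name in ["f-2-2-7-5-vcb", "f-2-2-12-1-2-hbd87q", "f-2-3-11-1-3-hb17"]:
    print(name, "->", table_number(name))
```

Unlike the digit-matching regex answer, this version does not check that the middle tokens are numeric, so a name like funky2-2-1funky would still return its middle token unchanged.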
I have the following string:
{'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921}
I would like to verify that the word 'variable' is the second word of the string, and I would like to retrieve the text after the last '/' in the third string (in this example, 'D_foo').
How can I verify this and retrieve the string I am looking for?
I tried the following:
regexp(str,'{''\w+'',{''variable'',''([(a-z)|(A-Z)|/|_])+')
without success
REMARK
The string to analyse is not split after the comma; it only appears that way here because of its length.
EDIT
my string is:
'{''output'',{''variable'',''VGRG_Pos_Var1/Parameters/D_foo''},''date'',734704.60904050921}';
and not a cell, as it could be misread. I added the symbol ' at the start and end of the string to indicate that it is a string.
I realise that you mention using regexp in the question, but I'm not sure if this is a requirement? If other solutions are acceptable you could try this:
str='{''output'',{''variable'',''VGRG_Pos_Var1/Parameters/D_foo''},''date'',734704.60904050921}';
parts1=textscan( str, '%s','delimiter',{',','{','}'},'MultipleDelimsAsOne',1);
parts2=textscan( parts1{1}{3}, '%s','delimiter',{'/',''''},'MultipleDelimsAsOne',1);
string=parts2{1}{end}
match=strcmp(parts1{1}{2},'variable')
To answer the first part of your question, you can write this:
str = {'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921};
temp = str(2); %this holds the cell containing the two strings
if strcmp(temp{1}{1}, 'variable')
%do stuff
end
For the second part you can do this:
str = {'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921};
temp = str(2); %like before, this contains the cell
temp = temp{1}(2); %this picks out the second string in the cell
temp = char(temp); %turns the item from a cell to a string
res = strsplit(temp, '/'); %splits the string where '/' are found, res is a cell array of strings
string = res{end}; %takes the last part, however many '/'s there are
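Since the question asked for a regular expression and the EDIT clarifies that the input is a single string rather than a cell, here is a hedged sketch of one pattern that does both jobs: it only matches when 'variable' is the second word, and it captures the text after the last '/'. It is written and checked in Python, not MATLAB. The same expression should carry over to MATLAB's regexp with the 'tokens' option, with each single quote doubled inside the MATLAB string, but that is untested here. The function name last_path_part is made up for this example.

```python
import re

text = "{'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921}"

# ^\{'\w+',          first word, in quotes
# \{'variable','     second word must be exactly 'variable'
# (?:[^'/]*/)*       any number of path segments ending in '/'
# ([^'/]+)'\}        capture the final segment, up to the closing quote and brace
pattern = re.compile(r"^\{'\w+',\{'variable','(?:[^'/]*/)*([^'/]+)'\}")


def last_path_part(s):
    """Return the text after the last '/' if the format matches, else None."""
    m = pattern.match(s)
    return m.group(1) if m else None


print(last_path_part(text))
```

A None result means the format check failed, so the verification and the extraction happen in a single call.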