Reading from _OUTPUT_connect - sas

In SAS DI when I connect a user written transformation to an output table, the variable _OUTPUT_connect is assigned. In my case it looks something like this:
%let _OUTPUT_connect = DEFER=YES READBUFF=25000 DBCLIENT_MAX_BYTES=1 DB_LENGTH_SEMANTICS_BYTE=NO PATH=MY_PATH AUTHDOMAIN="MY_AUTH_DOMAIN"
Now I'm trying to extract the PATH and AUTHDOMAIN variables from _OUTPUT_connect. My solution for now is the following:
%let _authdomain = %sysfunc(scan(&_OUTPUT_connect,7," "));
%let _path = %sysfunc(scan(%sysfunc(scan(&_OUTPUT_connect,5," ")),2,"="));
This works but it breaks if the order of the _OUTPUT_connect variables changes.
I thought I'd use a regex to match the parameter values, PATH=[match_this] and AUTHDOMAIN="[match_this]", but I have problems parsing the variable _OUTPUT_connect because it contains double quotes. When I manually assign _OUTPUT_connect without the double quotes, I can do the following:
data _null_;
re = prxparse('/PATH=(\w)*/');
string = "&_OUTPUT_connect";
position = prxmatch(re, string);
put position=;
matched_pattern=prxposn(re, 0, string);
put matched_pattern=;
run;
Output:
position=75
matched_pattern=PATH=A1091211_SAS_SRV
The problem however is that _OUTPUT_connect contains double quotes, and the regex function fails when the input string contains double quotes. Since _OUTPUT_connect is assigned automatically, I cannot change the format.
I've tried to remove the double quotes from _OUTPUT_connect using this %let unquoted =%sysfunc(translate(%quote(&test),' ','"'));. This does work, but it puts a whitespace in place of the double quotes.
Is there an easy way to retrieve the values of PATH and AUTHDOMAIN from _OUTPUT_connect?

You can extract the name value pairs of the connection string by using SCAN with modifiers.
Example:
data nvps(label='name value pairs' keep=name value);
s = 'name1=value1 name2="value2" name3="value 3"';
do index = 1 to countw(s,' ','q');
nvp = scan(s,index,' ','q');
name = scan(nvp,1,'=','q');
value = scan(nvp,2,'=','q');
output;
end;
run;
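Applied to the question's variable, the same idea might look like the sketch below. It assumes the connect string has the shape shown in the question; the target macro variable names _path and _authdomain are taken from the question's own attempt, and the variable lengths are guesses. Reading the macro variable with symget avoids resolving it inside a quoted literal, and dequote strips the surrounding double quotes from the value.

```sas
%let _OUTPUT_connect = DEFER=YES READBUFF=25000 DBCLIENT_MAX_BYTES=1 DB_LENGTH_SEMANTICS_BYTE=NO PATH=MY_PATH AUTHDOMAIN="MY_AUTH_DOMAIN";

data _null_;
  length s $1000 nvp $200 name $32 value $200;   /* lengths are assumptions */
  s = symget('_OUTPUT_connect');                 /* no quoting issues with embedded " */
  do index = 1 to countw(s, ' ', 'q');           /* q: ignore delimiters inside quotes */
    nvp   = scan(s, index, ' ', 'q');
    name  = upcase(scan(nvp, 1, '=', 'q'));
    value = dequote(scan(nvp, 2, '=', 'q'));     /* drop surrounding quotes, if any */
    if name = 'PATH' then call symputx('_path', value);
    else if name = 'AUTHDOMAIN' then call symputx('_authdomain', value);
  end;
run;

%put _path=&_path _authdomain=&_authdomain;
```

Because each pair is looked up by name, the order of the options in _OUTPUT_connect no longer matters.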

Related

Get string between two specific char positions

I have a long text string in SAS that contains a value of variable length. The value is always preceded by a '#' and followed by a ','.
Is there a way I can extract this and store it as a new variable, please?
e.g:
word word, word, #12.34, word, word
And i want to get the 12.34
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only works if there is only one "#" in the string.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional call to the input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex way looks only for values shaped like ##.## (digits with an optional decimal point), while the index way takes whatever lies between # and the next comma. As a result, the input function will decipher scientific notation values (such as 1e6) that this example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regex with look-around assertions; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A small note about the regular expression and functions used below:
(?<=#) is a zero-width positive look-behind assertion; it requires a # immediately before the pattern of interest.
(\d+\.?\d+) means one or more digits, an optional ., and then one or more digits, so at least two digits are needed for a match.
(?=,) is a zero-width positive look-ahead assertion; it requires a , immediately after the pattern of interest.
call prxsubstr finds the position and length of the pattern, and substr extracts the required value.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
if you want to get really lazy you can just do
want = compress(have,".","kd");
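One caveat with the compress shortcut, sketched here with a sample string borrowed from the data above: the k and d modifiers keep every digit (plus any . listed) anywhere in the string, not only those between # and the comma. The variable names are the ones used in the one-liner.

```sas
data _null_;
  have = 'word1 #12.34, hello hi';
  /* k = keep instead of remove, d = add digits to the list of characters */
  want = compress(have, '.', 'kd');
  put want=;   /* the 1 from word1 is kept too, so this gives 112.34 rather than 12.34 */
run;
```

So the shortcut is only safe when the surrounding text contains no other digits or periods.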

SAS Argument 1 to the function PRXMATCH is missing

When I run prxmatch I keep getting an error saying argument 1 is missing. I've checked the pattern and it processes correctly, but when I try to use it through SAS I get the errors below.
Here is an example WORD_0012_MUK613 which returns N
data test2;
set test;
if prxmatch(prxparse('^WORD_\d{4}_\w{3}\d{3}$'), external_id) then match = 'Y'; else match = 'N';
run;
NOTE: Argument 1 to the function PRXMATCH is missing.
ERROR: Argument 1 to the function PRXMATCH must be a positive integer returned by PRXPARSE for a valid pattern.
ERROR: Closing delimiter "^" not found after regular expression "^WORD_\d{4}_\w{3}\d{3}$".
ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
When I add the delimiter it gets rid of the error, but it still doesn't match:
data test2;
set test;
if prxmatch(prxparse('/^COAF_\d{4}_\w{3}\d{3}$/'), external_id) then match = 'Y'; else match = 'N';
run;
First off, prxparse exists to allow you to separate the compilation of the regex from its use. That's useful for code structure. However, it's not really useful in the way you used it there - nesting it.
data test2;
set test;
rx_word = prxparse('^WORD_\d{4}_\w{3}\d{3}$');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Second, you need delimiters in SAS to wrap around the regex (This will be useful in step 3). Any character is fine - the first character you pass it will become the delimiter, so use something that you won't use anywhere else except as the delimiter. / is common, but I like to use ~ sometimes as / can be needed in the regex and would have to be escaped.
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Third, you need some options. Use o at minimum; that way the regex isn't recompiled for every row of your dataset, which is horribly slow. i makes the match case insensitive. s lets . match line breaks in the string, if that's relevant. Options go after the closing delimiter, which is why the delimiters are needed (and the delimiters are required even if you use no options).
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Fourth, SAS strings are fixed-length (non-varchar) strings, padded with trailing spaces. If there are spaces after your value, you'll get no match. So if you include the $ end-of-string marker and you're not 100% sure your values fill the variable's full length, trim your strings when matching them (or use substr or something else to get the exact length).
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
if prxmatch(rx_word, trim(external_id)) then match = 'Y'; else match = 'N';
run;
Finally, you can improve the last statement by using ifc, since prxmatch returns 0 for no match.
The full example:
data test;
length external_id $20;
input external_id $;
datalines;
WORD_0012_MUK613
WORD_5344_915ABC
;;;;
run;
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
match = ifc(prxmatch(rx_word, trim(external_id)),'Y','N');
put match=;
run;
There is no need for the PRXPARSE function, but you do need delimiters around the expression, for example /exp/.
data _null_;
x = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/','WORD_0012_MUK613 | N');
put _all_;
run;
x=0 _ERROR_=0 _N_=1
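The x=0 above comes from the trailing ' | N' in the source string, which the $ anchor rejects. For comparison, here is a minimal sketch of the same one-call form with only the ID as the source string (the literal is the example value from the question):

```sas
data _null_;
  /* pattern passed directly to prxmatch, with / delimiters */
  x = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/', 'WORD_0012_MUK613');
  put x=;   /* x=1: the match starts at position 1 */
run;
```

When the source is a data set variable rather than a literal, the trailing-blank issue described in the other answer still applies, so wrapping the variable in trim() is likely needed.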

Regular Expression dot in SAS

I'm new to this field and am trying to use prxmatch and rxmatch to match some strings.
The pattern is a., which should match a string of at least 2 characters in which a isn't the last one.
When I run prxmatch('/a./', 'a') and rxmatch('/a./', 'a'), the result should be 0, but the system returns 1.
So how can I get 0 in this case?
If you write a MCVE for this, you do get no match.
data test;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
However, if x is not length 1, it will match!
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
Why?
Because in SAS, strings are not varchar, they are char. They have spaces padding the rest of the string out to its full length. So you would need to do either
data test;
length x $5;
x='a';
rc=prxmatch('~a[^ ]~',x);
put x= rc=;
run;
or, better,
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',trim(x));
put x= rc=;
run;
(Note, I use ~ for my regex delimiter - you're free to use slash, or any other character, for that, it makes no difference.)

Field not being read as a character string

Solved. See footnote.
/*check regex*/
go = 1;
i = 1;
do while (go = 1);
set braw.regex point = i;
if (upcase(fname) = upcase("&var.")) then do;
put format1 " one"; /*format1 is a field of braw.regex, properties says character length 30*/
if format1 = '/\d{8}/' then put 'hello world one'; else put 'good bye world one';
%check1(&data, format1, &var)
end;
else i = i+1;
end;
/*check1 passes regex, string, true false to check_format*/
%macro check_format(regex, string, truefalse);
pattern = prxparse(&regex.);
truefalse = prxmatch(pattern, &string);
put &regex " " &string " " &truefalse "post";
%mend;
This outputs
/\d{8}/ one
good bye world one
apparently format isn't a string. So it then fails the prxparse, as it's looking for a string input.
Any idea what I should do?
I was thinking I could use a macro variable to put quotes around it, perhaps using:
call symput('mymacrovar', format1);
%let mymacrovar = "&mymacrovar";
but that symput does nothing.
Solved:
It was being read as a string. On the CSV file that the regex dataset was being read from, there were additional spaces between the commas, making the string '_/\d{8}/' (underscore denoting a space) which prxparse doesn't like.
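If cleaning the CSV is not an option, one way to guard against this is to strip the pattern before compiling it. This is only a sketch: the literal below stands in for the format1 value read from braw.regex, and the test string is made up.

```sas
data _null_;
  length format1 $30;
  format1 = ' /\d{8}/';                   /* leading blank, as read from the CSV */
  pattern = prxparse(strip(format1));     /* strip removes leading and trailing blanks */
  if pattern then rc = prxmatch(pattern, 'id 12345678');   /* hypothetical test value */
  put pattern= rc=;
run;
```

strip also removes the trailing blanks that pad format1 to its full length of 30.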

How to pad out character fields in SAS?

I am creating a SAS dataset from a database that includes a VARCHAR(5) key field.
This field includes some entries that use all 5 characters and some that use fewer.
When I import this data, I would prefer to pad all the shorter entries out to use all five characters. For this example, I want to pad on the left with 0, the character zero. So, 114 would become 00114, ABCD would become 0ABCD, and EA222 would stay as it is.
I've attempted this with a simple data statement, but of course the following does not work:
data test;
set databaseinput;
format key $5.;
run;
I've tried to do this with a user-defined informat, but I don't think it's possible to specify the ranges correctly on character fields, per this SAS KB answer. Plus, I'm fairly sure proc format won't let me define the result dynamically in terms of the incoming variable.
I'm sure there's an obvious solution here, but I'm just missing it.
Here is an alternative:
data padded_data_dsn;
length key $5;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = translate(right(raw_data),'0',' ');
run;
data raw_data_dsn;
format key $5.;
length key $5 key1 $5;
do key = '4', 'A114', 'A1140';
/* REPEAT('0', n) returns n+1 zeros, so ask for one fewer than the gap */
if length(key) < 5 then key1 = catt(repeat('0', 4-length(key)), key);
/* already 5 characters: nothing to pad */
else key1 = key;
output;
end;
run;
I'm sure someone will have a more elegant solution, but the following code works. Essentially it is padding the variable with five leading zeros, then reversing the order of this text string so that the zeros are to the right, then reversing this text string again and limiting the size to five characters, in the original order but left-padded with zeros.
data raw_data_dsn;
format key $varying5.;
key = '114'; output;
key = 'ABCD'; output;
key = 'EA222'; output;
run;
data padded_data_dsn;
format key $5.;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = put(put('00000' || raw_data ,$revers10.),$revers5.);
run;
Here's what worked for me.
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
It works by counting the length of the existing string, building a string of zeros with repeat, and then concatenating the zeros and the original string. Because repeat('0', n) returns n+1 copies, repeat('0', 4 - length(str)) yields 5 - length(str) zeros.
Notice that it breaks if the original string already has length 5: as the last row of the output shows, aaaaa becomes 0aaaa.
Also, it won't work if the input string has a $5. format on it.
data a; /*input dataset*/
input str $;
datalines;
a
aa
aaa
aaaa
aaaaa
;
run;
input:
a
aa
aaa
aaaa
aaaaa
output:
0000a
000aa
00aaa
0aaaa
0aaaa
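A possible variation that also handles the length-5 case, offered as a sketch rather than as part of the answer above: prepend five zeros and keep only the last five characters. The dataset and variable names (a, str, str2) follow the example; b2 is a hypothetical output name.

```sas
data a;                       /* same input as above */
  input str $;
  datalines;
a
aa
aaa
aaaa
aaaaa
;
run;

data b2 (keep = str str2);    /* b2 is a hypothetical name */
  set a;
  length str2 $5;
  /* '00000' plus str has 5 + length(str) characters; */
  /* starting at length(str) + 1 leaves exactly the last five */
  str2 = substr(cats('00000', str), length(str) + 1);
  put str= str2=;
run;
```

With this input, aaaaa stays aaaaa and the shorter values are padded as before (0000a, 000aa, 00aaa, 0aaaa).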
I use this, but it only works with numeric values. Try other informats in the INPUT call if your data needs them.
data work.prueba;
format xx $5.;
xx='1234';
vv=PUT(INPUT(xx,best5.),z5.);
run;