Regular Expression dot in SAS - regex

I'm new to this field and am trying to use prxmatch and rxmatch to match some strings.
The pattern is a., which should match only strings of at least 2 characters in which a is not the last character.
I run prxmatch('/a./', 'a') and rxmatch('/a./', 'a') and expect the result to be 0, but the system returns 1.
So how can I get 0 in this case?

If you write an MCVE for this, you get no match.
data test;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
However, if x is longer than 1 character, it will match!
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
Why?
Because in SAS, strings are not varchar, they are char. They have spaces padding the rest of the string out to its full length, so the . matches the first padding space. So you would need to do either
data test;
length x $5;
x='a';
rc=prxmatch('~a[^ ]~',x);
put x= rc=;
run;
or, better,
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',trim(x));
put x= rc=;
run;
(Note, I use ~ for my regex delimiter - you're free to use slash, or any other character, for that, it makes no difference.)
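To see the padding directly, here is a minimal sketch that compares the declared length with the trimmed length and runs the same pattern on both forms of the value. The variable names stored, visible, padded, rc_raw and rc_trim are made up for illustration.

```sas
data _null_;
  length x $5;
  x = 'a';
  stored  = vlength(x);            /* declared length of the variable */
  visible = length(x);             /* length without trailing blanks  */
  padded  = '[' || x || ']';       /* brackets make the padding visible */
  rc_raw  = prxmatch('/a./', x);        /* . can match a padding blank */
  rc_trim = prxmatch('/a./', trim(x));  /* nothing left after the a    */
  put stored= visible= padded= rc_raw= rc_trim=;
run;
```

The log should show stored=5 and visible=1, with rc_raw=1 and rc_trim=0.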

Related

sas using % wildcard with substr

I'm trying to subset some data with the following code:
data want;
set have;
array fx(12) fx1-fx12;
do i=1 to 12;
if substr(dx(i),1,4) in ('1115')
or substr(fx(i),1,5) in ('1146%')
then output;
end;
run;
I cross-reference the output against the original dataset using proc freq. The frequency counts for '1115' match as they should. They don't for '1146%'. I thought '%' was a wildcard that I could use?
I also tried '/^1146\d*/'
The % wildcard is recognized only by the LIKE operator in a WHERE clause. For the IF statement you will want to use the string prefix equality (i.e. starts with) operator =: or the prefix in set operator IN:.
Also, since you take a 5-character substring only to compare a 4-character prefix, you could take 4 characters and check = '1146'. Furthermore, since the substring starts at position 1 (the first character), you don't need substr at all when using =: or IN: (see the 3rd example).
In order to use Perl regular expression pattern matching use the PRXMATCH function. Your pattern '/^1146\d*/' does not need \d* (0 or more digits). '/^1146/' will match anything that '/^1146\d*/' does.
Example(s):
if substr(dx(i),1,4) in ('1115') or fx(i) =: '1146' then output;
if substr(dx(i),1,4) in ('1115') or substr(fx(i),1,4) = '1146' then output;
/* expanded example for case of checking two prefix possibilities */
if dx(i) in: ('1115') or fx(i) in: ('1146', '124') then output;
if dx(i) =: '1115' or prxmatch('/^1146/', fx(i)) then output;
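As a rough illustration of how the three prefix tests behave, here is a small self-contained sketch. The dataset name prefix_demo, the variable code, the flag names and the sample values are all made up for this example.

```sas
data prefix_demo;
  input code $10.;
  eq_prefix = (code =: '1146');                  /* starts with 1146        */
  in_prefix = (code in: ('1146', '124'));        /* starts with 1146 or 124 */
  rx_prefix = (prxmatch('/^1146/', code) > 0);   /* regex anchored at start */
  put code= eq_prefix= in_prefix= rx_prefix=;
  datalines;
11460
1146
1246
21146
;
run;
```

The first two values should set all three flags, 1246 should set only in_prefix (through the '124' prefix), and 21146 should set none, because 1146 is not at the start.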

SAS mainframe replace 20..99 to 200099

In my SAS mainframe code, how do I replace . with 0?
data newlic;
INPUT #1 LICNO $10.;
DATALINES;
203....412
...3300421
9955..032.
;
RUN;
PROC PRINT DATA = NEWLIC;
RUN;
DATA MYDATA;
SET NEWLIC;
ARRAY A(*) _NUMERIC_;
DO I=1 TO DIM(A);
IF A(I) = . THEN A(I) = 0;
END;
DROP I;
RUN;
PROC PRINT DATA = MYDATA;
RUN;
My required output:
2030000412
0003300421
9955000320
The requirement is to replace '.' with 0.
Use a regular expression to replace all non-alphanumeric characters with a 0:
s/[^0-9a-zA-Z]/0/
You can implement regex replacements in SAS with prxchange().
data mydata;
set newlic;
licno = prxchange('s/[^0-9a-zA-Z]/0/', -1, licno);
run;
You can use the TRANSLATE() function to replace unwanted characters with '0'. The COMPRESS() function with the d modifier removes the digits, which leaves exactly the non-digit characters that exist in the value and need to be translated.
fixed=translate(licno,repeat('0',255),compress(licno,,'d'));
Results:
Obs    LICNO         fixed
1      1234567890    1234567890
2      ABC      9    0000000009
3      203....412    2030000412
4      ...3300421    0003300421
5      9955..032.    9955000320
6      123           1230000000
You can use the regular expression metacharacter \D to locate non-digit characters and replace them with 0 using PRXCHANGE().
From the complete list in the documentation:
\d matches a digit character that is equivalent to [0-9].
\D matches any character that is not a digit.
Example:
data have;
input licno $char10.;
datalines;
1234567890
ABC      9
203....412
...3300421
9955..032.
123
;
data want;
set have;
fixed = prxchange('s/\D/0/', -1, licno);
run;
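To compare the approaches side by side, here is a minimal sketch that runs them on the question's three values. The names compare, dots_only, by_regex, by_translate and same are made up for illustration. dots_only uses a plain two-character TRANSLATE() as an assumed simplest option for the case where periods are the only unwanted character.

```sas
data compare;
  input licno $char10.;
  length dots_only by_regex by_translate $10;
  /* TRANSLATE(source, to, from): swap only periods for zeros */
  dots_only    = translate(licno, '0', '.');
  /* replace every non-digit with a zero */
  by_regex     = prxchange('s/\D/0/', -1, licno);
  /* COMPRESS(...,'d') lists the non-digits; TRANSLATE maps each to '0' */
  by_translate = translate(licno, repeat('0', 255), compress(licno, , 'd'));
  same = (by_regex = by_translate);
  put licno= dots_only= by_regex= by_translate= same=;
  datalines;
203....412
...3300421
9955..032.
;
run;
```

For these rows all three columns should agree. They would diverge on values that contain letters or blanks, which dots_only leaves untouched.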

Get string between two specific char positions

I have a long text string in SAS that contains a value of variable length, which is always preceded by a '#' and ends with a ','.
Is there a way I can extract this and store it as a new variable, please?
e.g.:
word word, word, #12.34, word, word
And I want to get the 12.34
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only works if there is only one "#" in the string.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional use of the input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex way looks only for ##.## variants, while the index way accepts anything of the form #...,. The input function will therefore decipher scientific notation values (such as 1e6) that the example regex pattern will not 'locate'. The ?? modifier in the input function prevents invalid argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regex; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A small note about the regular expression and functions used below:
(?<=#) is a zero-width positive look-behind assertion that requires a # immediately before the pattern of interest
(\d+\.?\d+) means one or more digits, an optional ., then one or more digits (so at least two digits in total)
(?=,) is a zero-width positive look-ahead assertion that requires a , immediately after the pattern of interest
call prxsubstr finds the position and length of the match, and substr extracts the required value.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
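If you want to get really lazy you can just do
want = compress(have,".","kd");
This keeps every digit and period in the string, so it only works when the value is the only number in the text.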
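Here is a quick check of that shortcut on two of the sample strings from the earlier answer. The variable names text and lazy are made up for this sketch.

```sas
data _null_;
  length text $60 lazy $20;
  do text = 'word word, word, #12.34, word, word',
            'word1 #12.34, hello hi hello hi';
    /* k = keep the listed characters, d = add digits to the list */
    lazy = compress(text, '.', 'kd');
    put lazy=;
  end;
run;
```

The first string should give lazy=12.34. The second should give lazy=112.34, because the digit in word1 is kept as well.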

Check specific sequence of alphanumeric string in sas

I have data with an ID field that is structured like this:
XX00000X
7 characters total, with the first 2 and the last being letters only, and numbers in between.
How can I check that the ID is structured specifically and exactly like this?
I'm not sure how to approach checking this. One possibility was the CATs function, but I'm not sure how to apply that.
You can use a combination of functions to check this, including:
CHAR()
ANYDIGIT()
ANYALPHA()
data have;
input x $10.;
cards;
AB0000X
AO000BF
1234556
ABCDEFG
AB0123Y
AB
ABCDEFGHI
;
run;
data check;
set have;
flag=0;
if lengthn(x) ne 7 then flag=1;
length letter $1;
if flag=0 then do i=1 to 7;
letter = char(x, i);
if ( i in (1,2, 7) and anyalpha(letter) ne 1 )
or i in (3:6) and anydigit(letter) ne 1 then do;
flag=1;
leave;
end;
end;
run;
Regular expressions are obviously more succinct and likely a better approach.
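Here is an approach using a regular expression. ^ anchors the match at the start, [A-Z]{2} matches the first two letters, [0-9]{4} matches the four digits in the middle, [A-Z] matches the last letter, \s*$ allows only trailing blanks (the padding of the SAS variable) before the end, and the i modifier ignores case. Without the anchors, longer values that merely contain the pattern would also match.
data want;
set have;
flag=prxmatch("m/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i",x);
run;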
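Since the question asks for an exact structure, it may help to see what anchoring changes. Here is a minimal sketch with made-up test values and made-up variable names (loose, exact), comparing an unanchored pattern with one that uses ^ and \s*$.

```sas
data _null_;
  length x $10;
  do x = 'AB0123Y', 'XAB0123YZ', 'AB0123';
    /* unanchored: finds the pattern anywhere in the value */
    loose = prxmatch('/[A-Z]{2}[0-9]{4}[A-Z]/i', x);
    /* anchored: whole value must fit, apart from trailing blanks */
    exact = prxmatch('/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i', x);
    put x= loose= exact=;
  end;
run;
```

For XAB0123YZ the unanchored pattern should return 2 (the match starts at the second character) while the anchored one returns 0. AB0123Y should match both, and AB0123 neither.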

SAS Argument 1 to the function PRXMATCH is missing

When I run prxmatch I keep getting an error saying argument 1 is missing. I've checked the pattern elsewhere and it works correctly, but when I try to use it in SAS I get the errors below.
Here is an example, WORD_0012_MUK613, which returns N
data test2;
set test;
if prxmatch(prxparse('^WORD_\d{4}_\w{3}\d{3}$'), external_id) then match = 'Y'; else match = 'N';
run;
NOTE: Argument 1 to the function PRXMATCH is missing.
ERROR: Argument 1 to the function PRXMATCH must be a positive integer returned by PRXPARSE for a valid pattern.
ERROR: Closing delimiter "^" not found after regular expression "^WORD_\d{4}_\w{3}\d{3}$".
ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
When I add the delimiter, it gets rid of the error but still doesn't match
data test2;
set test;
if prxmatch(prxparse('/^COAF_\d{4}_\w{3}\d{3}$/'), external_id) then match = 'Y'; else match = 'N';
run;
First off, prxparse exists to allow you to separate the compilation of the regex from its use. That's useful for code structure. However, it's not really useful in the way you used it there - nesting it.
data test2;
set test;
rx_word = prxparse('^WORD_\d{4}_\w{3}\d{3}$');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Second, you need delimiters in SAS to wrap around the regex (This will be useful in step 3). Any character is fine - the first character you pass it will become the delimiter, so use something that you won't use anywhere else except as the delimiter. / is common, but I like to use ~ sometimes as / can be needed in the regex and would have to be escaped.
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Third, you need some options. o at minimum: it tells SAS to compile the regex only once rather than once per row of your dataset, which is horribly slow (SAS already compiles a constant pattern only once, so o matters most when the pattern is built from an expression). i makes the match case-insensitive. s lets . match linebreaks in the string, if that's relevant. Options go after the ending delimiter, hence the need for delimiters (and the delimiters are not optional even if you're using no options).
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
if prxmatch(rx_word, external_id) then match = 'Y'; else match = 'N';
run;
Fourth, SAS strings are full length (non-varchar) strings. If you have spaces after your string, you'll get no match. So make sure to trim your strings when you're matching them, if you include the $ end of string marker and you're not 100% sure your strings aren't exact length (or use substr or something else to get that exact length).
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
if prxmatch(rx_word, trim(external_id)) then match = 'Y'; else match = 'N';
run;
Finally, you can improve the last statement by using ifc, since prxmatch returns 0 for no match.
The full example:
data test;
length external_id $20;
input external_id $;
datalines;
WORD_0012_MUK613
WORD_5344_915ABC
;;;;
run;
data test2;
set test;
rx_word = prxparse('~^WORD_\d{4}_\w{3}\d{3}$~o');
match = ifc(prxmatch(rx_word, trim(external_id)),'Y','N');
put match=;
run;
There is no need for the PRXPARSE function, but you do need delimiters around the expression, for example /exp/
data _null_;
x = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/','WORD_0012_MUK613 | N');
put _all_;
run;
x=0 _ERROR_=0 _N_=1
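To separate the two effects discussed in this thread (delimiters and trailing blanks), here is a minimal sketch with made-up variable names (id, literal, padded, trimmed, tolerant). It applies the same delimited pattern to a literal, to a padded variable, to the trimmed variable, and to the padded variable with \s* added before $.

```sas
data _null_;
  length id $20;
  id = 'WORD_0012_MUK613';   /* stored padded with blanks to 20 characters */
  literal  = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/', 'WORD_0012_MUK613');
  padded   = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/', id);
  trimmed  = prxmatch('/^WORD_\d{4}_\w{3}\d{3}$/', trim(id));
  tolerant = prxmatch('/^WORD_\d{4}_\w{3}\d{3}\s*$/', id);
  put literal= padded= trimmed= tolerant=;
run;
```

Only padded should come back as 0, since the blanks after the value sit between the last digit and the $ anchor.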