Mask # character in SAS RegEx

The # character (in bold) in the replacement string of my RegEx s/<<[\w|+|#|#]+>>/\s*(<<[\w|\+|#|#]+>>)\s*/ is causing an error. When I replace the RegEx with s/<<[\w|+|#|#]+>>/\s*(<<[\w|\+|#]+>>)\s*/, the error goes away.
How do I mask the # character? %NRSTR does not seem to work.
CODE:
Data _NULL_;
a=prxchange(%NRSTR("s/<<[\w|\+|#|#]+>>/\s*(<<[\\w|\\+|#|#]+>>)\s*/"), -1, "<<A>> <<A+>> <<A#>> <<A#+>> <<A#>> <<A#+>>");
putlog a;
run;
LOG:
ERROR: An array reference was found in replacement text
"s/<<[\w|\+|#|#]+>>/\s*(<<[\\w|\\+|#|#]+>>)\s*/". Array references within replacement text
are not supported.
ERROR: The regular expression passed to the function PRXCHANGE contains a syntax error.
NOTE: Argument 1 to function PRXCHANGE('s/<<[\w|\+|#'[12 of 46 characters shown],-1,'<<WORD>>
<<W'[12 of 60 characters shown]) at line 1656 column 3 is invalid.
a= _ERROR_=1 _N_=1

Is this what you want? I added escape character before the ampersands.
data _null_;
length regex in out $200 ;
regex='s/<<[\w|\+|\#|#]+>>/\s*(<<[\\w|\\+|#|\#]+>>)\s*/';
in = '<<A>> <<A+>> <<A#>> <<A#+>> <<A#>> <<A#+>>';
out=prxchange(regex,-1,in);
putlog (_all_) (//= :$quote.);
run;
Results:
regex="s/<<[\w|\+|\#|#]+>>/\s*(<<[\\w|\\+|#|\#]+>>)\s*/"
in="<<A>> <<A+>> <<A#>> <<A#+>> <<A#>> <<A#+>>"
out="s*(<<[\w|\+|#|#]+>>)s* s*(<<[\w|\+|#|#]+>>)s* s*(<<[\w|\+|#|#]+>>)s* s*(<<[\w|\+|#|#]+>>)s* s*(<<[\w|\
+|#|#]+>>)s* s*(<<[\w|\+|#|#]+>>)s*"

Related

How to replace text in quotes with equal length asterisks?

How can I replace quoted text with an equal number of asterisks in SAS?
I mean, convert:
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
to:
*******
******
**********************
there are ************* not in quotes
There are 7, 6, 22, and 13 asterisks in lines 1, 2, 3, and 4 respectively. Yes, the quotes themselves are included, too.
I tried a program like this:
pat=prxparse('/[''"].*?["'']/');
do until(pos=0);
call prxsubstr(pat,text,pos,len);
if pos then substr(text,pos,len)=repeat('*',len-1);
end;
It works.
My question is: Is there a more efficient way to do this?
First off, your example fails on the third expression, because it doesn't remember what the opening quote was - so it leaves "double" unmatched.
You can solve that with a backreference, which is supported by SAS:
data have;
length text $1024;
infile datalines pad;
input #1 text $80.;
datalines;
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
;;;;
run;
data want;
set have;
pat=prxparse('/([''"]).*?\1/');
do until(pos=0);
call prxsubstr(pat,text,pos,len);
if pos then substr(text,pos,len)=repeat('*',len-1);
end;
run;
Efficiency-wise, this takes about 1.5 seconds on my (fairly fast but not exceptionally so) SAS server to handle 400k records (these 4 rows x 100,000). That seems reasonable unless your text is much bigger or your row count much larger. Also note that this will fail on more complicated nesting, if that's permissible in your data: single-double-single and so on, or double-single inside single, won't be recognized, though it will probably still work fine for your purposes.
However, if you want the most efficient approach, regular expressions are not the answer; basic text functions are faster. It's harder to get exactly right, though, and takes a lot more code, so I don't suggest it if the regex performance is acceptable. Here's one example. You may need to tweak it, you'll need to loop it until it finds nothing left to replace, and you should skip it entirely when there are no quotes at all. It just gives the basic idea of how to use the text functions.
data want;
set have;
length text_sub $80;
_start = findc(text,'"''');
_qchar = char(text,_start); *Save aside which char we matched on;
_end = findc(text,_qchar,_start+1); *now look for that one again anywhere after the first match;
to_convert = substr(text,_start,_end-_start+1);
if _start eq 1 and _end eq length(text) then text_sub = repeat('*',_end-1);
else if _start eq 1 then text_sub = repeat('*',_end-1)||substr(text,_end+1); *mask the leading quoted part, keep the rest;
else if _end eq length(text) then text_sub = substr(text,1,_start-1)||repeat('*',_end-_start);
else text_sub = cat(substr(text,1,_start-1),repeat('*',_end-_start),substr(text,_end+1));
run;
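The looping and the "no quotes at all" guard mentioned for the text-function approach could be sketched roughly as follows. This is a minimal sketch rather than a tuned implementation: the variable _pos and the data set name want_loop are illustrative assumptions, and it assumes quotes are balanced and not nested more than one level deep.

```sas
/* Small sample input so the sketch runs on its own */
data have;
  infile datalines truncover;
  input text $char80.;
datalines;
"12345"
'with "double" quotes'
there are 'other words' not in quotes
no quotes here
;
run;

data want_loop;   /* hypothetical output name */
  set have;
  _pos = 1;                                  /* where to resume searching */
  do while (_pos <= length(text));
    _start = findc(text, '"''', _pos);       /* next opening quote, 0 if none */
    if _start = 0 then leave;                /* no (more) quotes: stop */
    _qchar = char(text, _start);             /* remember which quote opened */
    _end = findc(text, _qchar, _start + 1);  /* matching closing quote */
    if _end = 0 then leave;                  /* unbalanced quote: leave as is */
    /* overwrite the quoted span, quotes included, with asterisks */
    substr(text, _start, _end - _start + 1) = repeat('*', _end - _start);
    _pos = _end + 1;                         /* continue after this span */
  end;
  drop _pos _start _end _qchar;
run;
```

Modifying text in place with SUBSTR on the left of the assignment avoids the four-way branching needed to rebuild the string from pieces.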
I would skip regex and just use CALL SCAN() instead.
So loop through finding the location of the next "word". If the word begins and ends with a quote then replace the word with *'s.
data have;
input string $char80. ;
cards;
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
What's going on?
;
data want;
set have;
position=1;
do count=1 by 1 while(position>0);
call scan(string,count,position,length,' ','q');
if char(string,position) in ('"',"'")
and char(string,position)=char(string,position+length-1)
then substr(string,position,length) = repeat('*',length-1)
;
end;
drop position count length;
run;
Result
Obs string
1 *******
2 ******
3 **********************
4 there are ************* not in quotes
5
6 What's going on?

Find Dot Separated Words in a String

I need to parse a log file to pick out strings that match the following case-insensitive pattern:
libname.data <--- Okay
libname.* <--- Not okay
For those with SAS experience, I'm trying to get SAS dataset names out of a large log.
All strings are space-separated. Some examples of lines:
NOTE: The data set LIBNAME.DATA has 428 observations and 15 variables.
MPRINT(MYMACRO): data libname.data;
MPRINT(MYMACRO): create table libname.data(rename=(var1 = var2)) as select distinct var1, var2 as
MPRINT(MYMACRO): format=date. from libname.data where ^missing(var1) and ^missing(var2) and
What I've tried
This PERL regular expression:
/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
https://regex101.com/r/jYkXn5/1
In SAS code:
data test;
line = 'words and stuff libname.data';
test = prxmatch('/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi', line);
run;
Problem
This will work when the line only contains this exact string, but it will not work if the line contains other strings.
Solution
Thanks, Blindy!
The regex that worked for me to parse SAS datasets from a log is:
/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi
data test;
line = 'NOTE: COMPRESSING DATA SET LIBNAME.DATA DECREASED SIZE BY 46.44 PERCENT';
prxID = prxparse('/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi');
call prxsubstr(prxID, line, position, length);
dataset = substr(line, position, length);
run;
This will still pick up some SQL select statements but that is easily solvable through post-processing.
You anchored your expression at the beginning; simply remove the leading ^ and you're set.
/(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
You can get by just locating the following landmark text in a log file line.
... data set <LIBNAME>.<MEMNAME> ...
If the data set name is in the log you can presume it was correctly formed.
data want;
length line $1000;
infile LOG_FILE lrecl=1000 length=L;
input line $VARYING. L;
* literally "data set <name>" followed by space or period;
rx = prxparse('/data set (.*?)\.(.*?)[. ]/');
if prxmatch(rx,line) then do;
length libname $8 memname $32;
libname = prxposn(rx,1,line);
memname = prxposn(rx,2,line);
line_number = _n_;
output;
end;
keep libname memname line_number;
run;
Some adjustment would be needed if the data set names are name literals of the form '<anything>'N.
There are also a plethora of existing SAS Log file parsers and analyzers out on the web that you can utilize.
The lookahead at the start prevents matching .. but the pattern by itself will not match that, as the character classes are repeated 1 or more times and do not contain a dot.
If you don't want to match ** as well, and the string should not start with *, you can add that to a character class [*.] together with the dot, and take it out of the first character class.
In that case, you could omit the negative lookahead and the anchor:
/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i
As the pattern does not contain any anchors, you could omit the m flag.
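Because the unanchored pattern can match more than once per line, one way to collect every candidate is to walk the line with CALL PRXNEXT. The following is a minimal sketch under stated assumptions: the data set name found, the variable names dataset, pos and len, and the sample log line are all made up for illustration.

```sas
data found;   /* hypothetical output name */
  length line $200 dataset $41;
  /* Sample line for illustration only */
  line = 'MPRINT(MYMACRO): create table libname.data as select * from work.other';
  rx = prxparse('/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i');
  start = 1;
  stop = length(line);
  /* PRXNEXT advances START past each match; POS is 0 when none is left */
  call prxnext(rx, start, stop, line, pos, len);
  do while (pos > 0);
    dataset = substr(line, pos, len);
    output;                      /* one row per match */
    call prxnext(rx, start, stop, line, pos, len);
  end;
  keep dataset;
run;
```

As with the other patterns in this thread, some of the matches may be things other than data set names (formats such as date., for instance), so post-processing is still likely to be needed.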

Tranwrd just one letter in SAS

How can I quote just one letter in SAS?
%sysfunc(tranwrd(%quote(&string),%quote(T),%quote('Test')));
The problem is that when the string contains both 'T' and 'TR', both are changed by TRANWRD to 'Test'.
SAS macro variables are always character. The arguments to macro functions are always character too and generally won't require an extra layer of macro quoting, and definitely won't if the arguments are to be treated as literals.
Did you try this first?
%let string = STACKOVERFLOW;
%let string_tweaked = %sysfunc(tranwrd(&string,T,Test));
%put NOTE: string_tweaked = &string_tweaked;
Do the macro values contain embedded single quotes?
%let string = %str(S%'T%'ACKOVERFLOW);
%let string_tweaked = %qsysfunc(tranwrd(&string,'T','Test'));
%put NOTE: string_tweaked = &string_tweaked;
The second code sample is analogous to the following DATA step code (whose scope is different from that of the macro environment). DATA step string values are explicitly quoted, with either double quotes (") or single quotes (').
data _null_;
string = "S'T'ACKOVERFLOW";
string_tweaked = tranwrd(string,"'T'","'Test'");
put "NOTE: " string_tweaked=;
run;

I want to remove a list of character strings from the original string in SAS

I want to remove the strings "LIMITED", "LTD", "CORPORATION", "GMBH", "AG", "SDN", "BHD", and "INC" from my Customer_Name variable.
I tried the compress function in SAS, like this:
Customer_Name1=compress(Customer_Name, 'LIMITED', 'LTD', 'GMBH');
But I am getting this error:
The COMPRESS function call has too many arguments.
Please suggest a way to solve it.
I would use a regular expression to perform this. Store the words to be removed in a macro variable, then use call prxchange to search within name and remove them. The words are separated by |, which signifies or in regular expression language.
%let vals = LIMITED|LTD|CORPORATION|GMBH|AG|SDN|BHD|INC;
data have;
input name $20.;
datalines;
a ltd
b limited
c corporation
d corp
e gmbh
f test
g ag
i sdn
j bhd
aggregate ag
income inc
;
run;
data want;
set have;
regex = prxparse("s/\b(&vals.)\b//i"); /* \b signifies a word boundary, so it will remove whole words only */
call prxchange(regex,-1,name);
drop regex;
run;
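Removing a word leaves its surrounding blanks behind, so a follow-up cleanup may be wanted. The sketch below is one possible variant, not part of the original answer: it uses the function form of PRXCHANGE and then COMPBL and STRIP to tidy the blanks. The names want_clean and name_clean and the sample rows are assumptions made for illustration.

```sas
%let vals = LIMITED|LTD|CORPORATION|GMBH|AG|SDN|BHD|INC;

/* Small sample input so the sketch runs on its own */
data have;
  infile datalines truncover;
  input name $char30.;
datalines;
acme ltd
big inc holdings
aggregate ag
;
run;

data want_clean;   /* hypothetical output name */
  set have;
  length name_clean $30;
  /* Function form of PRXCHANGE returns the changed string */
  name_clean = prxchange("s/\b(&vals.)\b//i", -1, name);
  /* Collapse runs of blanks left by the removal and trim both ends */
  name_clean = strip(compbl(name_clean));
run;
```

Keeping the result in a separate variable leaves the original name available for checking what was removed.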

How to check whether the first character of a string is a small letter using sas

I have a variable NAME. I want to check whether the first character of this variable is a small letter or not. Name looks like the following:
aBMS
BMS
xMS
zVewS
fPP
NBMS
I extract the first character of my variable using first_letter = first(NAME); Can anyone teach me how to check whether the variable first_letter is a small letter or not? So far I have done it as follows, but I am wondering if I can achieve this without typing out the whole alphabet: if first_letter = 'a' | first_letter = 'b' | first_letter = 'c' ... then dummy = 1.
Using the compress function with kl as the 3rd argument tells SAS to keep only lowercase characters, so the following works correctly for all cases, including non-alphanumeric first characters:
data have;
input NAME $;
cards;
aBMS
BMS
xMS
zVewS
fPP
NBMS
;
run;
data want;
set have;
FLAG = compress(first(NAME),,'lk') ne '';
run;
N.B. The third argument for compress is a feature that was only added to SAS in version 9.1, so this won't work in earlier versions of SAS.
Also, this will work both in a where clause and in a data step if statement - by contrast, the between syntax used in Gordon's answer is only valid in a where clause. A variant of this approach that would work in both cases is:
data want;
set have;
/*Yes, SAS supports character inequalities!*/
FLAG = 'a' <= first(NAME) <= 'z';
run;
Perl Regular Expression can also provide an alternative:
data have;
input NAME $;
cards;
aBMS
BMS
xMS
zVewS
fPP
NBMS
;
run;
data want;
set have;
if prxmatch('/^[[:lower:]]/', name)>0;
run;
This is very straightforward: it literally checks whether the first letter is lowercase. ^ anchors the match at the beginning of the string, and [[:lower:]] matches lowercase characters.
first(string) eq lowcase(first(string))
This will also be true if the first character in the string is not an alphabetic character. You don't mention whether that scenario needs to be considered.
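To make the caveat concrete, the sketch below compares the LOWCASE test with a stricter check based on ANYLOWER, which returns the position of the first lowercase letter in its argument (0 if there is none). The flag names and the 9MS sample row are assumptions added for illustration.

```sas
/* Small sample input, including a name that starts with a digit */
data have;
  input NAME $;
cards;
aBMS
BMS
9MS
;
run;

data want;
  set have;
  /* Also true for digits and punctuation, which LOWCASE leaves unchanged */
  FLAG_LOOSE  = (first(NAME) eq lowcase(first(NAME)));
  /* True only when the first character is a lowercase letter */
  FLAG_STRICT = (anylower(first(NAME)) = 1);
run;
```

For the 9MS row the two flags should disagree (FLAG_LOOSE is 1, FLAG_STRICT is 0), which is the scenario the caveat describes.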
Character comparisons in SAS PROC SQL are case sensitive, so the following should work:
proc sql;
select t.*
from t
where substring(t.name from 1 for 1) between 'a' and 'z';
quit;