How to replace text in quotes with equal length asterisks? - regex

How can I replace quoted text with a run of asterisks of the same length in SAS?
I mean, convert:
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
to:
*******
******
**********************
there are ************* not in quotes
There are 7, 6, 22, and 13 asterisks in lines 1, 2, 3, and 4 respectively. Yes, the quotes themselves are included, too.
I tried a program like this:
pat=prxparse('/[''"].*?["'']/');
do until(pos=0);
call prxsubstr(pat,text,pos,len);
if pos then substr(text,pos,len)=repeat('*',len-1);
end;
It works.
My question is: Is there a more efficient way to do this?

First off, your example fails on the third line, because the pattern doesn't remember which quote character opened the string. It ends the match at the first quote of either kind, so it leaves "double" unmasked.
You can solve that with a backreference, which is supported by SAS:
data have;
length text $1024;
infile datalines pad;
input #1 text $80.;
datalines;
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
;;;;
run;
data want;
set have;
pat=prxparse('/([''"]).*?\1/');
do until(pos=0);
call prxsubstr(pat,text,pos,len);
if pos then substr(text,pos,len)=repeat('*',len-1);
end;
run;
Efficiency-wise, this takes about 1.5 seconds on my (fairly fast but not exceptionally so) SAS server to handle 400k records (these 4 x 100,000). That seems reasonable unless your text is much longer or your row count much larger. Also note that this will fail on more complicated nesting, if that's permissible: single-double-single and so on, or double-single inside single, won't be recognized, though it will probably still work fine for your purposes.
However, if you want the most efficient approach, regular expressions are not the answer; basic text functions are faster. It's harder to get them exactly right, though, and it takes a lot more code, so I don't suggest doing this if the regex performs acceptably. Here's one example. You may need to tweak it, you'll need to loop it until it finds nothing left to replace, and you should not execute it if there are no quotes at all. It just gives the basic idea of how to use the text functions.
data want;
set have;
length text_sub $80;
_start = findc(text,'"''');
_qchar = char(text,_start); *Save aside which char we matched on;
_end = findc(text,_qchar,_start+1); *now look for that one again anywhere after the first match;
to_convert = substr(text,_start,_end-_start+1);
if _start eq 1 and _end eq length(text) then text_sub = repeat('*',_end-1);
else if _start eq 1 then text_sub = repeat('*',_end-1)||substr(text,_end+1); *quoted part first, then keep the rest;
else if _end eq length(text) then text_sub = substr(text,1,_start-1)||repeat('*',_end-_start);
else text_sub = cat(substr(text,1,_start-1),repeat('*',_end-_start),substr(text,_end+1));
run;
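The looping mentioned above could be sketched as follows. This is only a rough illustration, assuming the same have dataset and text variable as in the earlier step; the dataset name want_loop is made up for the example. It overwrites each quoted span in place with the SUBSTR pseudo-variable instead of building text_sub, and it stops at the first unterminated quote.

```sas
data want_loop;
  set have;
  /* Position of the first quote of either kind; 0 if there is none */
  _start = findc(text, '"''');
  do while (_start > 0);
    /* Remember which quote character opened this span */
    _qchar = char(text, _start);
    /* Look for the same character again after the opening quote */
    _end = findc(text, _qchar, _start + 1);
    /* Unterminated quote: leave the rest of the text alone */
    if _end = 0 then leave;
    /* REPEAT('*', n) returns n+1 asterisks, which covers both quotes */
    substr(text, _start, _end - _start + 1) = repeat('*', _end - _start);
    /* Continue searching after the closing quote */
    _start = findc(text, '"''', _end + 1);
  end;
  drop _start _qchar _end;
run;
```

Rows without any quotes never enter the loop, so no separate check is needed for them.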

I would skip regex and just use CALL SCAN() instead.
So loop through finding the location of the next "word". If the word begins and ends with a quote then replace the word with *'s.
data have;
input string $char80. ;
cards;
"12345"
"hi42"
'with "double" quotes'
there are 'other words' not in quotes
What's going on?
;
data want;
set have;
position=1;
do count=1 by 1 while(position>0);
call scan(string,count,position,length,' ','q');
if char(string,position) in ('"',"'")
and char(string,position)=char(string,position+length-1)
then substr(string,position,length) = repeat('*',length-1)
;
end;
drop position count length;
run;
Result
Obs string
1 *******
2 ******
3 **********************
4 there are ************* not in quotes
5
6 What's going on?

Related

Why is the last character getting removed after applying tranwrd function

I want to replace certain values in my json file (in this example null values with empty quotation marks.) My solution is working correctly but, for some mysterious reason, the last character of the json file is deleted. Regardless of the last character, the code always deletes it - I have also tried with a different json file that ends in curly braces.
What is causing this and more importantly how can I prevent this?
data testdata_;
input var1 var2 var3;
format _all_ commax10.1;
datalines;
3.1582 0.3 1.8
21 . .
1.2 4.5 6.4
;
proc json out = 'G:\test.json' pretty fmtnumeric nosastags keys;
export testdata_;
run;
data _null_;
infile 'G:\test.json';
file 'G:\test.json';
input;
_infile_ = tranwrd(_infile_,'null','""');
put _infile_ ;
run;
To see how the contents change, first run the code up to the data _null_ step and check the file content, then run the last step.
Data _null_ has it correct; don't write to the same file. SAS offers this option, but in the modern day it's almost always the wrong answer, due to how SAS supports this and the fact that storage is sufficiently cheap and fast.
In this case, it looks like it's a relatively easy fix, but you probably should do as suggested and write to a new file anyway - there will be other issues.
data testdata_;
input var1 var2 var3;
format _all_ commax10.1;
datalines;
3.1582 0.3 1.8
21 . .
1.2 4.5 6.4
;
proc json out = 'H:\temp\test.json' pretty fmtnumeric nosastags keys;
export testdata_;
run;
data _null_;
infile 'H:\temp\test.json' end=eof;
file 'H:\temp\test.json';
input #;
putlog _infile_;
_infile_ = tranwrd(_infile_,'null','"" ');
len = length(_infile_);
put _infile_ ;
if eof then put _infile_;
run;
There are two changes. One, I use '"" ' instead of '""' in the tranwrd call; otherwise you end up with slightly odd results, with new lines being added. If your JSON parser doesn't like "" , then you may want to use two tranwrd calls instead, one for null followed by a comma and one for a bare null, or something similar (or use a regular expression). What's important is that the number of characters matches between the input and the output. If you can't manage that (for example, because the extra spaces are problematic), then you're left with "write a new file".
Two, I look for the end of the file, then intentionally write out a second line there. That avoids the issue you're having with the bracket, as it avoids having the EOF being written out before the bracket. I'm not 100% sure I know why you need that - but you do.
Another option, which might make more sense, is to write only the lines that contain null.
data _null_;
infile 'H:\temp\test.json' sharebuffers;
file 'H:\temp\test.json';
input #;
putlog _infile_;
if find(_infile_,'null') then do;
_infile_ = tranwrd(_infile_,'null','"" ');
put _infile_;
end;
run;
I added sharebuffers because that should make it run a bit faster. Note that I also removed one space; otherwise, something odd about how SAS handles this seems to remove a space from the following line. No idea why; it's probably something to do with EOL characters.
But again - don't do any of this unless there's no other option. Write a new file.
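A minimal sketch of the write-a-new-file approach, under a few assumptions: the filerefs injson and outjson and the output name test_fixed.json are made up for the example, the input path is the one from the question, and TERMSTR=LF assumes the file uses LF line endings.

```sas
/* Hypothetical filerefs: read the original, write the edited copy elsewhere */
filename injson  'G:\test.json';
filename outjson 'G:\test_fixed.json';

data _null_;
  infile injson termstr=lf;
  file outjson termstr=lf;
  input;
  /* Replacement length no longer matters because nothing is overwritten in place */
  _infile_ = tranwrd(_infile_, ': null', ': ""');
  put _infile_;
run;
```

Once the new file has been checked, it can replace the original outside the DATA step.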
One strange thing is that PROC JSON always writes a text file that uses LF as the end-of-line character.
So you might be able to get overwriting the file to work if you follow these caveats:
Use TERMSTR=LF on the INFILE statement.
Use SHAREBUFFERS on the INFILE statement.
Use the TRANWRD() function to replace the string with one of the same number of bytes, and do not put a space as the last character on the line.
I would also search for ': null' instead of just 'null' to reduce the risk of replacing those characters in some other string in the file.
data _null_;
infile json SHAREBUFFERS termstr=lf ;
file json ;
input ;
_infile_ = tranwrd(_infile_,': null',': ""');
put _infile_;
run;

Find Dot Separated Words in a String

I need to parse a log file to pick out strings that match the following case-insensitive pattern:
libname.data <--- Okay
libname.* <--- Not okay
For those with SAS experience, I'm trying to get SAS dataset names out of a large log.
All strings are space-separated. Some examples of lines:
NOTE: The data set LIBNAME.DATA has 428 observations and 15 variables.
MPRINT(MYMACRO): data libname.data;
MPRINT(MYMACRO): create table libname.data(rename=(var1 = var2)) as select distinct var1, var2 as
MPRINT(MYMACRO): format=date. from libname.data where ^missing(var1) and ^missing(var2) and
What I've tried
This PERL regular expression:
/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
https://regex101.com/r/jYkXn5/1
In SAS code:
data test;
line = 'words and stuff libname.data';
test = prxmatch('/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi', line);
run;
Problem
This will work when the line only contains this exact string, but it will not work if the line contains other strings.
Solution
Thanks, Blindy!
The regex that worked for me to parse SAS datasets from a log is:
/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi
data test;
line = 'NOTE: COMPRESSING DATA SET LIBNAME.DATA DECREASED SIZE BY 46.44 PERCENT';
prxID = prxparse('/(?!.*[.*]{3})[a-z]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi');
call prxsubstr(prxID, line, position, length);
dataset = substr(line, position, length);
run;
This will still pick up some SQL select statements but that is easily solvable through post-processing.
You anchored your expression at the beginning. Simply remove the first ^ and you're set.
/(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
You can get by just locating the following landmark text in a log file line.
... data set <LIBNAME>.<MEMNAME> ...
If the data set name is in the log you can presume it was correctly formed.
data want;
length line $1000;
infile LOG_FILE lrecl=1000 length=L;
input line $VARYING. L;
* literally "data set <name>" followed by space or period;
rx = prxparse('/data set (.*?)\.(.*?)[. ]/');
if prxmatch(rx,line) then do;
length libname $8 memname $32;
libname = prxposn(rx,1,line);
memname = prxposn(rx,2,line);
line_number = _n_;
output;
end;
keep libname memname line_number;
run;
Some adjustment would be needed if the data set names are name literals of the form '<anything>'N
There are also a plethora of existing SAS Log file parsers and analyzers out on the web that you can utilize.
The lookahead at the start prevents matching .. but the pattern by itself will not match that, as the character classes are repeated 1 or more times and do not contain a dot.
If you don't want to match ** as well, and the string should not start with *, you can add that to a character class [*.] together with the dot, and take it out of the first character class.
In that case, you could omit the negative lookahead and the anchor:
/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i
Regex demo
As the pattern does not contain any anchors, you could omit the m flag.
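To pull every dotted name out of a line rather than only the first, one option is to loop with CALL PRXNEXT. The sketch below is an illustration only: the pattern is a simplified assumption (two SAS-style names joined by a single dot), not one of the expressions from the answers above, and the dataset and variable names are made up. The sample line is taken from the question.

```sas
data datasets;
  length line $200 dataset $41;
  line = 'MPRINT(MYMACRO): format=date. from libname.data where ^missing(var1) and';
  /* Assumed pattern: name.name, each part starting with a letter or underscore */
  rx = prxparse('/\b[a-z_][a-z0-9_]*\.[a-z_][a-z0-9_]*\b/i');
  start = 1;
  stop = length(line);
  /* PRXNEXT advances start past each match; position is 0 when none is left */
  call prxnext(rx, start, stop, line, position, len);
  do while (position > 0);
    dataset = substr(line, position, len);
    output;
    call prxnext(rx, start, stop, line, position, len);
  end;
  keep dataset;
run;
```

On this sample line the loop should pick up libname.data and skip date. because nothing name-like follows that dot. As with the accepted pattern, column references in SQL text such as a.var1 would still need post-processing.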

I want to remove a list of character strings from the original string in SAS

I want to remove "LIMITED", "LTD", "CORPORATION", "GMBH", "AG", "SDN", "BHD", "INC" string from my Customer_Name variable.
I tried with compress function in SAS like
Customer_Name1=compress(Customer_Name, 'LIMITED', 'LTD', 'GMBH');
But I am getting this error:
The COMPRESS function call has too many arguments.
Please suggest way to solve it.
I would use a regular expression to perform this. Store the words to be removed in a macro variable, then use call prxchange to search within name and remove them. The words are separated by |, which signifies or in regular expression language.
%let vals = LIMITED|LTD|CORPORATION|GMBH|AG|SDN|BHD|INC;
data have;
input name $20.;
datalines;
a ltd
b limited
c corporation
d corp
e gmbh
f test
g ag
i sdn
j bhd
aggregate ag
income inc
;
run;
data want;
set have;
regex = prxparse("s/\b(&vals.)\b//i"); /* \b signifies a word boundary, so it will remove whole words only */
call prxchange(regex,-1,name);
drop regex;
run;

How to check whether the first character of a string is a small letter using sas

I have a variable NAME. I want to check whether the first character of this variable is a small letter or not. Name looks like the following:
aBMS
BMS
xMS
zVewS
fPP
NBMS
I extract the first character of my variable using first_letter = first(NAME); Can anyone show me how to check whether first_letter is a small letter? For now I do it as follows, but I am wondering whether I can achieve this without typing out the whole alphabet: if first_letter = 'a' | first_letter = 'b' | first_letter = 'c' ... then dummy = 1.
Using the compress function with kl as the 3rd argument tells SAS to keep only lowercase characters, so the following works correctly for all cases, including non-alphanumeric first characters:
data have;
input NAME $;
cards;
aBMS
BMS
xMS
zVewS
fPP
NBMS
;
run;
data want;
set have;
FLAG = compress(first(NAME),,'lk') ne '';
run;
N.B. The third argument for compress is a feature that was only added to SAS in version 9.1, so this won't work in earlier versions of SAS.
Also, this will work both in a where clause and in a data step if statement - by contrast, the between syntax used in Gordon's answer is only valid in a where clause. A variant of this approach that would work in both cases is:
data want;
set have;
/*Yes, SAS supports character inequalities!*/
FLAG = 'a' <= first(NAME) <= 'z';
run;
Perl Regular Expression can also provide an alternative:
data have;
input NAME $;
cards;
aBMS
BMS
xMS
zVewS
fPP
NBMS
;
run;
data want;
set have;
if prxmatch('/^[[:lower:]]/', name)>0;
run;
This is very straightforward, literally checking if the first letter is the lower case. ^ to define the beginning of the string, [[:lower:]] is to match the lower case characters.
first(string) eq lowcase(first(string))
This will also be true if the first character in the string is not an alphabetic character. You don't mention whether that scenario needs to be considered.
String comparisons in SAS proc sql are case sensitive, so the following should work:
proc sql;
select t.*
from t
where substring(t.name from 1 for 1) between 'a' and 'z';
quit;
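Another possibility, offered here as a hedged sketch rather than as part of any answer above, is the ANYLOWER function, which returns the position of the first lowercase letter in a string (0 if there is none). Comparing that position to 1 flags names that start with a lowercase letter. The dataset name want_anylower is made up for the example, and which characters count as lowercase can depend on the session locale.

```sas
data want_anylower;
  input NAME $;
  /* 1 when the first lowercase letter is at position 1, otherwise 0 */
  FLAG = (anylower(NAME) = 1);
  cards;
aBMS
BMS
xMS
;
run;
```

Unlike the lowcase() comparison, this does not flag names whose first character is a digit or punctuation.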

SAS: How to delete word between two specific position?

data:
Hell_TRIAL21_o World
Good Mor_Trial9_ning
How do I remove the _TRIAL21_ and _TRIAL9_?
What I did was find the positions of the first _ and the second _. Then I want to remove everything from the first _ to the second _, but the compress function can't do that. How can I do it?
x = index(string, '_');
if (x>0) then do;
y = x+1;
z = find(string, '_', y);
end;
Text= " Hell_TRIAL21_o World Good Mor_Trial9_ning";
var= catx("",scan(text,1,"_"),"__",scan(text,3,"_"),"_", scan(text,5,"_"));
Note that the length of the variable var may not suit your case. Remember to adjust it accordingly.
PERL regular expressions are a good way of identifying this sort of string. call prxchange is the call routine that will remove the relevant characters. It requires prxparse beforehand to create the search and replace parameters.
I've used modify here to amend the existing dataset; you may want to use set instead to write out to a new dataset and test the results first.
data have;
input string $ 30.;
datalines;
Hell_TRIAL21_o World
Good Mor_Trial9_ning
;
run;
data have;
modify have;
regex = prxparse('s/_.*_//'); /* identify and remove anything between 2 underscores */
call prxchange(regex,-1,string);
run;
Or to create a new variable and dataset, just use prxchange (which doesn't require prxparse).
data want;
set have;
new_string = prxchange('s/_.*_//',-1,string);
run;
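One thing to watch with the pattern above is that .* is greedy, so if a string contains more than two underscores it removes everything from the first underscore to the last one. A hedged sketch of an alternative, using a made-up sample string and made-up variable names, is to match only non-underscore characters between the pair and replace just the first match:

```sas
data demo;
  length string new_greedy new_first $40;
  /* Sample value with a third underscore later in the string */
  string = 'Hell_TRIAL21_o Wor_ld';
  /* Greedy: spans from the first underscore to the last one */
  new_greedy = prxchange('s/_.*_//', -1, string);
  /* [^_]* cannot cross an underscore; times=1 replaces only the first pair */
  new_first = prxchange('s/_[^_]*_//', 1, string);
run;
```

For the two rows in the question both patterns give the same result, since each row has exactly two underscores.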