SAS Scan function separator not working as it should - sas

I ran into a problem with the scan function in sas.
The dataset I have contains one variable that needs to be split into multiple variables.
The variable is structured like this:
4__J04__1__SCH175__BE__compositeur / arrangeur__compositeur /
bewerker__(blank)__1__17__108.03__93.7
I use this code to split this into multiple variables:
data /*ULB.*/work.smart_BCSS_withNISS_&JJ.&K.;
set work.smart_BCSS_withNISS_&JJ.&K.;
/* Maand splitsen in variablen */
mois=scan(smart,1,"__");
jours=scan(smart,2,"__");
nbjours=scan(smart,3,"__");
refClient=scan(smart,4,"__");
paysPrestation=scan(smart,5,"__");
wordingFR=scan(smart,6,"__");
wordingNL=scan(smart,7,"__");
fonction=scan(smart,8,"__");
ARTISTIQUE2=scan(smart,9,"__");
Art_At_LEAST=scan(smart,10,"__");
totalBrut=scan(smart,11,"__");
totalImposable=scan(smart,12,"__");
run;
Most of the time this works perfectly. However sometimes the 4th variable 'refClient' contains one single underscore like this:
4__J04__1__LE_46__BE__compositeur / arrangeur__compositeur /
bewerker__(blank)__1__17__108.03__93.7
Somehow the scan function also detects this single underscore as a separator even though the separator is a double underscore.
Any idea on how to avoid this behavior?

Aurieli's code works, but their answer doesn't explain why. Your understanding of how scan works is incorrect.
If there is more than 1 character in the delimiter specified for scan, each character is treated as a delimiter. You've specified _ twice. If you had specified ab then a and b would both have been treated as delimiters, rather than ab being the delimiter.
scan by default treats multiple consecutive delimiters as a single delimiter, which was why your code treated both __ and _ as delimiters. So if you specified ab as the delimiter string then ba, abba etc. would also be counted as a single delimiter by default.

You can use regexp to change single '_' (for example, change to '-') and then scan what you want:
data /*ULB.*/work.test;
smart="4__J04__1__LE_18__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7";
smartcr=prxchange("s/(?<=[^_])(_{1})(?=[^_])/-/",-1,smart);
/* Maand splitsen in variablen */
mois=scan(smartcr,1,"__");
jours=scan(smartcr,2,"__");
nbjours=scan(smartcr,3,"__");
refClient=tranwrd(scan(smartcr,4,"__"),'-','_');
paysPrestation=scan(smartcr,5,"__");
wordingFR=scan(smartcr,6,"__");
wordingNL=scan(smartcr,7,"__");
fonction=scan(smartcr,8,"__");
ARTISTIQUE2=scan(smartcr,9,"__");
Art_At_LEAST=scan(smartcr,10,"__");
totalBrut=scan(smartcr,11,"__");
totalImposable=scan(smartcr,12,"__");
run;

Mildly interesting, the INFILE statement supports a delimiter string.
data test;
infile cards dlmstr='__';
input (mois
jours
nbjours
refClient
paysPrestation
wordingFR
wordingNL
fonction
ARTISTIQUE2
Art_At_LEAST
totalBrut
totalImposable) (:$32.);
cards;
4__J04__1__SCH175__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
4__J04__1__LE_46__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
;;;;
run;
proc print;
run;

Related

Find Dot Separated Words in a String

I need to parse a log file to pick out strings that match the following case-insensitive pattern:
libname.data <--- Okay
libname.* <--- Not okay
For those with SAS experience, I'm trying to get SAS dataset names out of a large log.
All strings are space-separated. Some examples of lines:
NOTE: The data set LIBNAME.DATA has 428 observations and 15 variables.
MPRINT(MYMACRO): data libname.data;
MPRINT(MYMACRO): create table libname.data(rename=(var1 = var2)) as select distinct var1, var2 as
MPRINT(MYMACRO): format=date. from libname.data where ^missing(var1) and ^missing(var2) and
What I've tried
This PERL regular expression:
/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
https://regex101.com/r/jYkXn5/1
In SAS code:
data test;
line = 'words and stuff libname.data';
test = prxmatch('/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi', line);
run;
Problem
This will work when the line only contains this exact string, but it will not work if the line contains other strings.
Solution
Thanks, Blindy!
The regex that worked for me to parse SAS datasets from a log is:
/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi
data test;
line = 'NOTE: COMPRESSING DATA SET LIBNAME.DATA DECREASED SIZE BY 46.44 PERCENT';
prxID = prxparse('/(?!.*[.*]{3})[a-z]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi');
call prxsubstr(prxID, line, position, length);
dataset = substr(line, position, length);
run;
This will still pick up some SQL select statements but that is easily solvable through post-processing.
You anchored your expression at the beginning, simply remove the first ^ and you're set.
/(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
You can get by just locating the following landmark text in a log file line.
... data set <LIBNAME>.<MEMNAME> ...
If the data set name is in the log you can presume it was correctly formed.
data want;
length line $1000;
infile LOG_FILE lrecl=1000 length=L;
input line $VARYING. L;
* literally "data set <name>" followed by space or period;
rx = prxparse('/data set (.*?)\.(.*?)[. ]/');
if prxmatch(rx,line) then do;
length libname $8 memname $32;
libname = prxposn(rx,1,line);
memname = prxposn(rx,2,line);
line_number = _n_;
output;
end;
keep libname memname line_number;
run;
Some adjustment would be needed if the data set names are name literals of the form '<anything>'N
There are also a plethora of existing SAS Log file parsers and analyzers out on the web that you can utilize.
The lookahead at the start prevents matching .. but the pattern by itself will not match that, as the character classes are repeated 1 or more times and do not contain a dot.
If you don't want to match ** as well, and the string should not start with *, you can add that to a character class [*.] together with the dot, and take it out of the first character class.
In that case, you could omit the positive lookahead and the anchor:
/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i
Regex demo
As the pattern does not contain any anchors, you could omit the m flag.

How to remove special ASCII characters?

I am trying to remove special character from the string.
"Mumbai rains live updates: IMD predicts heavy rainfall for next 24 hours �"
data demo1 (keep=headline2 headline3 headline4 headline5);
set kk.newspaper_append_freq_daily1;
headline2=trim(headline);
headline3=tranwrd(headline2,"�"," ");
headline5=compress(headline2,"�");
headline4=index(headline2,"�");
run;
You can use kpropdata function.
From doc:
Removes or converts unprintable characters.
Code example:
%let in=kk.newspaper_append_freq_daily1;
%let out=demo1;
data &out;
set &in;
array cc (*) _character_;
do i=1 to dim(cc);
cc(_N_)=kpropdata(cc(i),"TRUNC", 'utf-8');
end;
run;
In code I've used array statement to iterate over all character columns in table.
compress should also handle this if you keep a whitelist of characters rather than trying to exclude a blacklist - e.g.
clean_text = compress(dirty_text,'','kw');
The k modifier keeps characters instead of removing them, and w adds all printable characters to the list.

sas, remove the comma and period, regex

Do you guys know how to replace remove the comma and period in something like this:
'18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,. '
I had to concatenate to get list of claim numbers (with leading zeros). Now, I have that string but I want to delete all the stuff at the end. I tried this but it didn't work
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=PRXCHANGE('s/,.//',1,CONCAT);
run;
By the way, I am using SAS and regex, something like prxchange.
This also worked for me
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=TRANWRD(CONCAT, ',.', '');
run;
The second argument to the PRXCHANGE function specifies the number of times the search and replace should be done. Replacing your 1 by -1 will run the replacement until the end of the string, rather than only once.
Also, the pair ',.' will replace a comma followed by any character ('.' is a wildcard). You want to catch either a comma (',') or a period ('.'), the last of which is a metacharacter you need to escape from, using '\':
CONCAT1=PRXCHANGE('s/[,\.]//',-1,CONCAT);
If you only want to remove the comma-period pairs, then remove the square brackets:
CONCAT1=PRXCHANGE('s/,\.//',-1,CONCAT);
No need for regex unless you have something more complicated than actually shown.
Just use the scan() function and tell it to use . and , as delimiters:
data claims;
length claim $50;
list = '18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.';
cnt=1;
claim=scan(list,cnt,'.,');
do while (claim ne '');
output;
cnt=cnt+1;
claim=scan(list,cnt,'.,');
end;
keep claim;
run;

SAS search and replace the last word using regex [duplicate]

Match strings ending in certain character
I am trying to get create a new variable which indicates if a string ends with a certain character.
Below is what I have tried, but when this code is run, the variable ending_in_e is all zeros. I would expect that names like "Alice" and "Jane" would be matched by the code below, but they are not:
proc sql;
select *,
case
when prxmatch("/e$/",name) then 1
else 0
end as ending_in_e
from sashelp.class
;quit;
You should account for the fact that, in SAS, strings are of char type and spaces are added up to the string end if the actual value is shorter than the buffer.
Either trim the string:
prxmatch("/e$/",trim(name))
Or add a whitespace pattern:
prxmatch("/e\s*$/",name)
^^^
to match 0 or more whitespaces.
SAS character variables are fixed length. So you either need to trim the trailing spaces or include them in your regular expression.
Regular expressions are powerful, but they might be confusing to some. For such a simple pattern it might be clearer to use simpler functions.
proc print data=sashelp.class ;
where char(name,length(name))='e';
run;

SAS - replacing a character with a space?

Had a quick question - I need to remove punctuation and replace characters with a space (i.e.: if I have a field that contains a * I need to replace it with a white space).
I can't seem to get it right - I was originally doing this to just remove it, but I've found that in some cases my string is being squished together.
Thoughts?
STRING2 = compress(STRING, ":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...");
The COMPRESS() function will remove the characters. If you want to replace them with spaces then use the TRANSLATE() function. If you want to reduce multiple blanks to a single blank use the COMPBL() function.
STRING2 = compbl(translate(STRING,' ',":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…..."));
Rather than listing the characters that need to be converted to spaces you could use COMPRESS() to turn the problem around to listing the characters that should be kept.
So this example will use the modifiers ad on the COMPRESS() function call to pass the characters in STRING that are not alphanumeric characters to the TRANSLATE() function call so they will be replaced by spaces.
STRING2 = compbl(translate(STRING,' ',compress(STRING,' ','ad')));
Try using the translate function and see if it fits your needs:
data want;
STRING = "!';AAAAÄAA$";
STRING2 = translate(STRING,' ',':;,*~''’°-!()®#""#$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...');
run;
Output:
STRING STRING2
!';AAAAÄAA$ AAAA AA
Try the TRANSLATE() function.
TRANSLATE(SOURCE,TO,FROM);
data test;
string = "1:,*2~’°-ÍÓ3Äö‘—È…...4";
string2 = translate(string,
" ",
":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...");
put string2=;
run;
I get
string2=1 2 3 4
While translate function could get you there, you could also use REGEX in SAS. It is more elegant, but you need to escape the characters in the actual regex pattern.
data want;
input string $60.;
length new_string $60.;
new_string = prxchange('s/([\:\,\*\~\’\°\-\!\'||"\'"||';\(\)\®\"\"\#\#\$\%\^\&\©\+\=\\\/\|\[\}\{\]\{\\\?\>\<\É\Ñ\Ë\Á\’\Í\Ó\Ä\ö\‘\—\È\…\.\.\.\]])/ /',-1,string);
datalines;
Cats, dogs, and anyone else!
;
Try it with the help of regular expressions.
data have;
old = "AM;'IGH}|GH";
new = prxchange("s/[^A-Z]/ /",-1,old);
run;
proc print data=have nobs;
run;
OUTPUT-
old new
AM;'IGH}|GH AM IGH GH