How to remove special ASCII characters?

I am trying to remove a special character from a string:
"Mumbai rains live updates: IMD predicts heavy rainfall for next 24 hours �"
data demo1 (keep=headline2 headline3 headline4 headline5);
set kk.newspaper_append_freq_daily1;
headline2=trim(headline);
headline3=tranwrd(headline2,"�"," ");
headline5=compress(headline2,"�");
headline4=index(headline2,"�");
run;

You can use the KPROPDATA function.
From the documentation:
Removes or converts unprintable characters.
Code example:
%let in=kk.newspaper_append_freq_daily1;
%let out=demo1;
data &out;
set &in;
array cc (*) _character_;
do i=1 to dim(cc);
cc(i)=kpropdata(cc(i),"TRUNC", 'utf-8');
end;
run;
In the code I've used an ARRAY statement to iterate over all character columns in the table.

COMPRESS can also handle this if you specify the characters to keep rather than the characters to remove, e.g.
clean_text = compress(dirty_text,'','kw');
The k modifier keeps characters instead of removing them, and w adds all printable characters to the list.
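A minimal sketch of that call, assuming a single-byte session encoding. The sample value and the embedded '07'x (bell) control character are made up for illustration:

```sas
data _null_;
  /* hypothetical sample: a non-printable bell character inside the text */
  dirty_text = 'rain' || '07'x || 'fall';
  /* k = keep the listed characters, w = add printable characters to the list */
  clean_text = compress(dirty_text, '', 'kw');
  put clean_text=;   /* expected: clean_text=rainfall */
run;
```

In a UTF-8 session a multi-byte character may be handled differently, so this is worth testing on the real data.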

Related

Find Dot Separated Words in a String

I need to parse a log file to pick out strings that match the following case-insensitive pattern:
libname.data <--- Okay
libname.* <--- Not okay
For those with SAS experience, I'm trying to get SAS dataset names out of a large log.
All strings are space-separated. Some examples of lines:
NOTE: The data set LIBNAME.DATA has 428 observations and 15 variables.
MPRINT(MYMACRO): data libname.data;
MPRINT(MYMACRO): create table libname.data(rename=(var1 = var2)) as select distinct var1, var2 as
MPRINT(MYMACRO): format=date. from libname.data where ^missing(var1) and ^missing(var2) and
What I've tried
This PERL regular expression:
/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
https://regex101.com/r/jYkXn5/1
In SAS code:
data test;
line = 'words and stuff libname.data';
test = prxmatch('/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi', line);
run;
Problem
This will work when the line only contains this exact string, but it will not work if the line contains other strings.
Solution
Thanks, Blindy!
The regex that worked for me to parse SAS datasets from a log is:
/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi
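data test;
line = 'NOTE: COMPRESSING DATA SET LIBNAME.DATA DECREASED SIZE BY 46.44 PERCENT';
/* same pattern as above; the first character class includes _ */
prxID = prxparse('/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi');
call prxsubstr(prxID, line, position, length);
/* position is 0 when nothing matches, so guard the substr call */
if position > 0 then dataset = substr(line, position, length);
run;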
This still picks up fragments of some SQL select statements, but those are easy to filter out in post-processing.

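You anchored your expression at the beginning of the line. Remove the first ^ and it will match a name at the end of a longer line (the trailing $ still requires the name to end the line):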
/(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi

You can get by just locating the following landmark text in a log file line.
... data set <LIBNAME>.<MEMNAME> ...
If the data set name is in the log you can presume it was correctly formed.
data want;
length line $1000;
infile LOG_FILE lrecl=1000 length=L;
input line $VARYING. L;
* literally "data set <name>" followed by space or period;
rx = prxparse('/data set (.*?)\.(.*?)[. ]/');
if prxmatch(rx,line) then do;
length libname $8 memname $32;
libname = prxposn(rx,1,line);
memname = prxposn(rx,2,line);
line_number = _n_;
output;
end;
keep libname memname line_number;
run;
Some adjustment would be needed if the data set names are name literals of the form '<anything>'N.
There is also a plethora of existing SAS log parsers and analyzers on the web that you can use.

The lookahead at the start prevents matching .. but the pattern by itself will not match that, as the character classes are repeated 1 or more times and do not contain a dot.
If you don't want to match ** as well, and the string should not start with *, you can add that to a character class [*.] together with the dot, and take it out of the first character class.
In that case, you could omit the negative lookahead and the anchors:
/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i
As the pattern does not contain any anchors, you could omit the m flag.
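As a rough sketch of how the unanchored pattern could be used in SAS to pull every candidate name out of one log line, the step below loops with CALL PRXNEXT. The variable names (rx, pos, len, dataset) and the output table name are illustrative assumptions, and the sample line is shortened from the question:

```sas
data names;
  length line $200 dataset $41;
  line = 'MPRINT(MYMACRO): format=date. from libname.data where ^missing(var1)';
  rx = prxparse('/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i');
  start = 1;
  stop = length(line);
  /* find the first match, then keep searching from where the last one ended */
  call prxnext(rx, start, stop, line, pos, len);
  do while (pos > 0);
    dataset = substr(line, pos, len);
    output;
    call prxnext(rx, start, stop, line, pos, len);
  end;
  keep dataset;
run;
```

On this line the only match should be libname.data, because "date." is not followed by another name character.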

SAS treatment of blanks interferes with regex rules

I have a dataset which I need to clean using regex rules. These rules come from a file regex_rules.csv with columns string_pattern and string_replace and are applied using a combination of prxparse and prxchange as follows:
array a_rules{1:&NOBS} $200. _temporary_;
array a_rules_parsed{1:&num_rules} _temporary_;
if _n_ = 1 then
do i = 1 to &num_rules;
a_rules{i} = cat("'s/",string_pattern,"/",string_replace,"/'");
a_rules_parsed{i} = prxparse(cats('s/',string_pattern,'/',string_replace,'/','i'));
end;
set work.dirty_strings;
clean_string = dirty_string;
do i = 1 to &num_rules;
debug_string = cats("Executing prxchange(",a_rules{i},",",-1,",","'",clean_string,"'",")");
put debug_string;
clean_string = PRXCHANGE(a_rules_parsed{i},-1,clean_string);
end;
Some rules specify replacing certain patterns with a single blank space, so the corresponding string_replace value in the file is a single blank space.
The issue I'm facing is that SAS never respects the single space, and instead replaces the matched string_pattern for these records with an empty string (the other rules are applied as expected).
To troubleshoot I executed the following:
proc sql;
create table work.single_blanks as
select
string_pattern,
string_replace
from work.regex_rules
where string_replace = " ";
quit;
which yielded the expected records. I was confused to find that changing the where clause to
where string_replace = "" or
where string_replace = " " gave identical results! (I've been using SAS for a while, but this behavior had gone unnoticed until now.) Consequently, I could not determine whether SAS is failing to read the file properly and retain the single blank, or whether one of the prx functions is failing to handle the single blank.
I can think of "hacky" work-arounds, but I'd rather understand what I'm doing wrong here and what the correct solution should be.
EDIT 1:
Here is a rule from the file and how I'd expect it to act on an example input value:
string_pattern, string_replace
"(#|,|/|')", " "
running the code above on the input string dirty_string = "10,120 DIRTY DRIVE"; does not produce the expected output of "10 120 DIRTY DRIVE" but rather "10120 DIRTY DRIVE".
EDIT 2
In addition to not respecting single spaces, leading and trailing spaces do not seem to be respected. For example, for a file with the rules
string_pattern, string_replace
"\\bDR(\\.|\\b)", "DRIVE "
"\\bS(\\.|\\b)?W(\\.|\\b)", " SOUTH WEST"
running the code above on the input string dirty_string = "10120 DIRTY DR.SW."; does not produce the expected output of "10120 DIRTY DRIVE SOUTH WEST" but rather "10120 DIRTY DRIVESW.". This is because the space at the end of the first string_replace value gets lost, meaning there is no word boundary at the beginning of the second string_pattern to be matched.

SAS stores character variables as fixed-length strings padded with spaces. As a consequence, string comparisons ignore trailing spaces, so x=' ' (one space) and x='   ' (several spaces) are the same test.
CATS() removes all leading and trailing spaces from its arguments, so blank strings contribute nothing at all. It sounds like you want to treat a blank string as a single space. The TRIM() function returns a single space for a blank string. So perhaps you just want to change this:
cats('s/',string_pattern,'/',string_replace,'/','i')
into
cat('s/',trim(string_pattern),'/',trim(string_replace),'/','i')
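To make the difference concrete, here is a small sketch that builds the same rule both ways. The variable names rx_cats, rx_trim, clean_cats and clean_trim are made up for illustration, and the pattern is reduced to a single comma:

```sas
data _null_;
  length string_pattern string_replace $200;
  string_pattern = ',';
  string_replace = ' ';   /* a rule that should insert one space */

  /* CATS strips the blank replacement entirely */
  rx_cats = cats('s/', string_pattern, '/', string_replace, '/');
  /* TRIM returns one space for a blank value, and CAT does not strip it */
  rx_trim = cat('s/', trim(string_pattern), '/', trim(string_replace), '/');

  len_cats = length(rx_cats);   /* 5: s/,//  */
  len_trim = length(rx_trim);   /* 6: s/,/ / */

  clean_cats = prxchange(trim(rx_cats), -1, '10,120 DIRTY DRIVE');  /* 10120 DIRTY DRIVE  */
  clean_trim = prxchange(trim(rx_trim), -1, '10,120 DIRTY DRIVE');  /* 10 120 DIRTY DRIVE */
  put len_cats= len_trim= / clean_cats= / clean_trim=;
run;
```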
Here is working code (with a fixed string_pattern) for your example data:
data test;
length string_pattern string_replace dirty_string expect
clean_string regex $200
;
infile cards dsd truncover;
input string_pattern string_replace dirty_string expect;
regex= cat('s/',trim(string_pattern),'/',trim(string_replace),'/i') ;
regex_id = prxparse(trim(regex));
clean_string = prxchange(regex_id,-1,trim(dirty_string));
if clean_string=expect then put 'GOOD'; else put 'BAD';
*put (_character_) (=$quote./);
cards4;
"(#|,|\/|')", " ","10,120 DIRTY DRIVE","10 120 DIRTY DRIVE"
;;;;
If any of your values have significant trailing spaces then you will need to store the data differently. You could for example quote the values:
string_replace = "'DRIVE '";
...
cat('s/',dequote(string_pattern),'/',dequote(string_replace),'/','i')
If you only add quotes around values that need them then you will need to include the TRIM() function calls.
cat('s/',dequote(trim(string_pattern)),'/',dequote(trim(string_replace)),'/','i')
Or store the string lengths into separate numeric fields.
cat('s/',substrn(string_pattern,1,len1),'/',substrn(string_replace,1,len2),'/','i')
And note that if any of your original character strings had either significant leading or trailing spaces they would have been eliminated by reading the data from a CSV file.

SAS Scan function separator not working as it should

I ran into a problem with the SCAN function in SAS.
The dataset I have contains one variable that needs to be split into multiple variables.
The variable is structured like this:
4__J04__1__SCH175__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
I use this code to split this into multiple variables:
data /*ULB.*/work.smart_BCSS_withNISS_&JJ.&K.;
set work.smart_BCSS_withNISS_&JJ.&K.;
/* Maand splitsen in variablen */
mois=scan(smart,1,"__");
jours=scan(smart,2,"__");
nbjours=scan(smart,3,"__");
refClient=scan(smart,4,"__");
paysPrestation=scan(smart,5,"__");
wordingFR=scan(smart,6,"__");
wordingNL=scan(smart,7,"__");
fonction=scan(smart,8,"__");
ARTISTIQUE2=scan(smart,9,"__");
Art_At_LEAST=scan(smart,10,"__");
totalBrut=scan(smart,11,"__");
totalImposable=scan(smart,12,"__");
run;
Most of the time this works perfectly. However sometimes the 4th variable 'refClient' contains one single underscore like this:
4__J04__1__LE_46__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
Somehow the scan function also detects this single underscore as a separator even though the separator is a double underscore.
Any idea on how to avoid this behavior?

Aurieli's code works, but their answer doesn't explain why. Your understanding of how scan works is incorrect.
If there is more than 1 character in the delimiter specified for scan, each character is treated as a delimiter. You've specified _ twice. If you had specified ab then a and b would both have been treated as delimiters, rather than ab being the delimiter.
scan by default treats multiple consecutive delimiters as a single delimiter, which was why your code treated both __ and _ as delimiters. So if you specified ab as the delimiter string then ba, abba etc. would also be counted as a single delimiter by default.
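A short sketch of that behavior, using a shortened, made-up piece of the value from the question:

```sas
data _null_;
  s = 'LE_46__BE';
  /* '__' is read as the single delimiter character '_' (listed twice), */
  /* and runs of consecutive delimiters count as one                    */
  w1 = scan(s, 1, '__');   /* LE */
  w2 = scan(s, 2, '__');   /* 46 */
  w3 = scan(s, 3, '__');   /* BE */
  put w1= w2= w3=;
run;
```

So the single underscore inside LE_46 splits the value just as the double underscore does.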

You can use a regular expression to change each single '_' to another character (for example '-') and then scan as before:
data /*ULB.*/work.test;
smart="4__J04__1__LE_18__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7";
smartcr=prxchange("s/(?<=[^_])(_{1})(?=[^_])/-/",-1,smart);
/* Maand splitsen in variablen */
mois=scan(smartcr,1,"__");
jours=scan(smartcr,2,"__");
nbjours=scan(smartcr,3,"__");
refClient=tranwrd(scan(smartcr,4,"__"),'-','_');
paysPrestation=scan(smartcr,5,"__");
wordingFR=scan(smartcr,6,"__");
wordingNL=scan(smartcr,7,"__");
fonction=scan(smartcr,8,"__");
ARTISTIQUE2=scan(smartcr,9,"__");
Art_At_LEAST=scan(smartcr,10,"__");
totalBrut=scan(smartcr,11,"__");
totalImposable=scan(smartcr,12,"__");
run;

Mildly interesting: the INFILE statement supports a delimiter string via the DLMSTR= option.
data test;
infile cards dlmstr='__';
input (mois
jours
nbjours
refClient
paysPrestation
wordingFR
wordingNL
fonction
ARTISTIQUE2
Art_At_LEAST
totalBrut
totalImposable) (:$32.);
cards;
4__J04__1__SCH175__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
4__J04__1__LE_46__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
;;;;
run;
proc print;
run;

sas, remove the comma and period, regex

Do you guys know how to remove the comma and period in something like this:
'18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,. '
I had to concatenate to get a list of claim numbers (with leading zeros). Now I have that string, but I want to delete all the stuff at the end. I tried this, but it didn't work:
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=PRXCHANGE('s/,.//',1,CONCAT);
run;
By the way, I am using SAS and regex, something like prxchange.

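This also worked for me. Note that TRANWRD substitutes a single blank when the replacement is an empty string, so each ',.' pair becomes a space rather than being deleted: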
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=TRANWRD(CONCAT, ',.', '');
run;

The second argument to the PRXCHANGE function specifies the number of times the search and replace should be done. Replacing your 1 with -1 will repeat the replacement until the end of the string, rather than doing it only once.
Also, the pattern ',.' matches a comma followed by any character ('.' is a wildcard). You want to match either a comma (',') or a period ('.'); the latter is a metacharacter that you need to escape with '\':
CONCAT1=PRXCHANGE('s/[,\.]//',-1,CONCAT);
If you only want to remove the comma-period pairs, then remove the square brackets:
CONCAT1=PRXCHANGE('s/,\.//',-1,CONCAT);
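A small sketch comparing the variants on a shortened, made-up value (the variable names once, every and wildcard are illustrative):

```sas
data _null_;
  concat = 'A,B,.,.,.';
  once     = prxchange('s/,\.//',  1, concat);   /* A,B,.,.  only the first ",." is removed */
  every    = prxchange('s/,\.//', -1, concat);   /* A,B      all ",." pairs are removed     */
  wildcard = prxchange('s/,.//',  -1, concat);   /* A        unescaped dot also eats ",B"   */
  put once= every= wildcard=;
run;
```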

No need for regex unless you have something more complicated than actually shown.
Just use the scan() function and tell it to use . and , as delimiters:
data claims;
length claim $50;
list = '18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.';
cnt=1;
claim=scan(list,cnt,'.,');
do while (claim ne '');
output;
cnt=cnt+1;
claim=scan(list,cnt,'.,');
end;
keep claim;
run;

SAS - replacing a character with a space?

Had a quick question - I need to remove punctuation and replace characters with a space (i.e.: if I have a field that contains a * I need to replace it with a white space).
I can't seem to get it right - I was originally doing this to just remove it, but I've found that in some cases my string is being squished together.
Thoughts?
STRING2 = compress(STRING, ":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...");

The COMPRESS() function will remove the characters. If you want to replace them with spaces then use the TRANSLATE() function. If you want to reduce multiple blanks to a single blank use the COMPBL() function.
STRING2 = compbl(translate(STRING,' ',":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…..."));
Rather than listing the characters that need to be converted to spaces you could use COMPRESS() to turn the problem around to listing the characters that should be kept.
So this example will use the modifiers ad on the COMPRESS() function call to pass the characters in STRING that are not alphanumeric characters to the TRANSLATE() function call so they will be replaced by spaces.
STRING2 = compbl(translate(STRING,' ',compress(STRING,' ','ad')));
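A minimal sketch of that nested call on a made-up value, with the intermediate result pulled into a helper variable (specials is a name introduced here for illustration):

```sas
data _null_;
  string = 'A*B--C';
  /* COMPRESS with 'ad' deletes blanks, letters and digits, leaving *-- */
  specials = compress(string, ' ', 'ad');
  /* each character found in specials is turned into a blank, then runs of blanks collapse */
  string2 = compbl(translate(string, ' ', specials));
  put specials= string2=;   /* string2=A B C */
run;
```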

Try using the translate function and see if it fits your needs:
data want;
STRING = "!';AAAAÄAA$";
STRING2 = translate(STRING,' ',':;,*~''’°-!()®#""#$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...');
run;
Output:
STRING STRING2
!';AAAAÄAA$ AAAA AA

Try the TRANSLATE() function.
TRANSLATE(SOURCE,TO,FROM);
data test;
string = "1:,*2~’°-ÍÓ3Äö‘—È…...4";
string2 = translate(string,
" ",
":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...");
put string2=;
run;
I get
string2=1 2 3 4

While the TRANSLATE function could get you there, you could also use a regular expression in SAS. It is arguably more elegant, but you need to escape the characters in the actual regex pattern.
data want;
input string $60.;
length new_string $60.;
new_string = prxchange('s/([\:\,\*\~\’\°\-\!\'||"\'"||';\(\)\®\"\"\#\#\$\%\^\&\©\+\=\\\/\|\[\}\{\]\{\\\?\>\<\É\Ñ\Ë\Á\’\Í\Ó\Ä\ö\‘\—\È\…\.\.\.\]])/ /',-1,string);
datalines;
Cats, dogs, and anyone else!
;

Try it with the help of regular expressions.
data have;
old = "AM;'IGH}|GH";
new = prxchange("s/[^A-Z]/ /",-1,old);
run;
proc print data=have noobs;
run;
Output:
old new
AM;'IGH}|GH AM IGH GH
