Double replacement occurring for accented characters when using PRXCHANGE

I am facing a problem when using PRXCHANGE to replace accented characters such as ö with an underscore. More precisely, when I perform the replacement, ö is replaced with two underscores, __, rather than a single underscore, _. This is not an isolated instance for ö; it is occurring for several other accented characters.
Here is some dummy code to replicate my problem:
option validvarname = any;
data dummy_data;
input "ö"n "aü"n;
datalines;
1 1
2 2
;
run;
data badvarnames (keep = name validname);
set sashelp.vcolumn;
where libname = "WORK" and memname = "DUMMY_DATA";
validname = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
name = nliteral(name);
run;
proc sql;
select cats("rename", name, "=", validname, ";") into : renamelist
separated by " " from badvarnames;
quit;
data output_tab;
set dummy_data;
&renamelist.;
run;

The regex function is treating the multi-byte characters as individual bytes to be replaced. So if the character uses two bytes in UTF-8 then you get two underscores.
Here are two choices.
Use KTRANSLATE() to handle the multi-byte characters. You can use KCOMPRESS() to find the set of invalid characters in any given name.
valid = ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
test2 = ktranslate(name,repeat('_',255),kcompress(name,valid));
Or replace adjacent invalid characters with a single underscore by adding + to your regex.
test3 = prxchange("s/[^a-zA-Z0-9]+/_/", -1, trim(name));
Note this will also eliminate the multiple adjacent underscores generated by replacing multiple single byte characters. So it has the added advantage of making those generated names easier to deal with also.
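To see the byte-versus-character difference directly, a minimal sketch like the following may help. It assumes a UTF-8 session and a program file saved as UTF-8; the variable names n_bytes, n_chars, per_byte and per_run are made up for illustration.

```sas
data _null_;
  name = "aü";
  n_bytes = length(name);   /* LENGTH counts bytes */
  n_chars = klength(name);  /* KLENGTH counts characters */
  /* one underscore per non-matching byte */
  per_byte = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
  /* one underscore per run of non-matching bytes */
  per_run  = prxchange("s/[^a-zA-Z0-9]+/_/", -1, trim(name));
  put n_bytes= n_chars= per_byte= per_run=;
run;
```

Under those assumptions I would expect 3 bytes against 2 characters, with per_byte showing two underscores after the a and per_run only one.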

Related

Use of PRXCHANGE to rename variables causes excessive replacement to happen at the end of the variable name

I am new to SAS and I am currently trying to create a macro that will automatically replace any special characters in a variable's name with an underscore. I am currently using PRXCHANGE to perform the replacement, yet I notice that when the variable gets renamed, extra underscores are placed at the end of the new variable name.
Suppose we have two variables, "dummy?" and "te!st". When I perform the replacement, the new variables are "dummy___________________________" and "te_st___________________________", when the replacements should just be "dummy_" and "te_st", respectively.
In the sample code below, I know that if I were to add "TRIM(name)" in the PRXCHANGE function then there would not be any extra replacements occurring. The issue with doing this is that if I were to have a variable named "example! ", with a space as the final character, then I would want the variable to be renamed to "example__", with two underscores at the end. Yet by using TRIM(name), I would get "example_", with a single underscore.
N.B. I know if I change the SAS variable name policy to V7, then this would not be a problem. I am solely doing this to improve upon my SAS skills.
/* Generate dummy data */
option validvarname = any;
data dummy_data;
input "dummy?"n "te!st"n;
datalines;
1 1
2 2
3 3
;
run;
/* Generate variables with the old and new variable names as entries */
data test (keep = name new_name);
set sashelp.vcolumn;
where libname = "WORK" and memname = "DUMMY_DATA";
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, name);
run;

Your issue is how SAS pads string variable length. While most languages have variable length strings, SAS is more akin to SQL char type, without the accompanying varchar type. This gives SAS very good performance in some ways, due to predictable row sizes, but has some consequences. Note that you can actually get effectively variable length strings on datasets using options compress, but during a data step the dataset is uncompressed.
In SAS, a string of length 10 that is assigned "A" will actually have value "A         ": A, plus 9 spaces. Not null characters, actual space characters. That usually doesn't matter, as SAS is written in many ways to ignore those trailing spaces (so "A" = "A " = "A  "), but in this particular case it does matter (since you're transforming the space character).
You can use the trim function to remove the spaces during execution, though it will still be stored with the spaces afterwards of course.
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
Note that trim cannot return a null value; it will always return at least a single space. If that's a possibility, you should wrap this in a check for missing (a string variable with only spaces = missing).
if not missing(name) then do;
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
end;
else new_name = ' ';
There is a trimn function that can return a length 0 string, but there's no reason to do the prxchange if it's missing - this will save time.
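As a small illustration of the padding effect described above, here is a sketch using a hypothetical 10-byte variable (the names padded and trimmed are made up for the example):

```sas
data _null_;
  length name $10;
  name = "te!st";   /* stored as te!st followed by 5 spaces */
  /* the 5 trailing spaces are each replaced as well */
  padded  = prxchange("s/[^a-zA-Z0-9]/_/", -1, name);
  /* TRIM drops the padding before the regex sees the value */
  trimmed = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
  put padded= trimmed=;
run;
```

The log should show padded=te_st_____ and trimmed=te_st.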

Your concern about trailing spaces on variable names is not valid. Trailing spaces on variable names are not significant. This data step creates only one variable.
376 options validvarname=any;
377 data test;
378 'xxx'n = 1;
379 'xxx 'n= 2;
380 run;
NOTE: The data set WORK.TEST has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds

SAS treatment of blanks interferes with regex rules

I have a dataset which I need to clean using regex rules. These rules come from a file regex_rules.csv with columns string_pattern and string_replace and are applied using a combination of prxparse and prxchange as follows:
array a_rules{1:&NOBS} $200. _temporary_;
array a_rules_parsed{1:&num_rules} _temporary_;
if _n_ = 1 then
do i = 1 to &num_rules;
a_rules{i} = cat("'s/",string_pattern,"/",string_replace,"/'");
a_rules_parsed{i} = prxparse(cats('s/',string_pattern,'/',string_replace,'/','i'));
end;
set work.dirty_strings;
clean_string = dirty_string;
do i = 1 to &num_rules;
debug_string = cats("Executing prxchange(",a_rules{i},",",-1,",","'",clean_string,"'",")");
put debug_string;
clean_string = PRXCHANGE(a_rules_parsed{i},-1,clean_string);
end;
Some rules specify replacing certain patterns with a single blank space, so the corresponding string_replace value in the file is a single blank space.
The issue I'm facing is that SAS never respects the single space, and instead replaces the matched string_pattern for these records with an empty string (the other rules are applied as expected).
To troubleshoot I executed the following:
proc sql;
create table work.single_blanks as
select
string_pattern,
string_replace
from work.regex_rules
where string_replace = " ";
quit;
which yielded the expected records. I was confused to find that changing the where clause to
where string_replace = "" or
where string_replace = " " gave identical results! (I've been using SAS for a while, but I guess this behavior had gone unnoticed until now.) Consequently, I could not determine whether SAS is failing to read in the file properly and retain the single blank, or whether one of the prx functions is failing to handle the single blank properly.
I can think of "hacky" work-arounds, but I'd rather understand what I'm doing wrong here and what the correct solution should be.
EDIT 1:
Here is a rule from the file and how I'd expect it to act on an example input value:
string_pattern, string_replace
"(#|,|/|')", " "
running the code above on the input string dirty_string = "10,120 DIRTY DRIVE"; does not produce the expected output of "10 120 DIRTY DRIVE" but rather "10120 DIRTY DRIVE".
EDIT 2
In addition to not respecting single spaces, leading and trailing spaces do not seem to be respected. For example, for a file with the rules
string_pattern, string_replace
"\\bDR(\\.|\\b)", "DRIVE "
"\\bS(\\.|\\b)?W(\\.|\\b)", " SOUTH WEST"
running the code above on the input string dirty_string = "10120 DIRTY DR.SW."; does not produce the expected output of "10120 DIRTY DRIVE SOUTH WEST" but rather "10120 DIRTY DRIVESW.". This is because the space at the end of the first string_replace value gets lost, meaning there is no word boundary at the beginning of the second string_pattern to be matched.

SAS stores character variables as fixed length strings that are padded with spaces. As a consequence, string comparisons ignore trailing spaces. So x=' ' and x='   ' are the same test.
The CATS() function removes all of the leading and trailing spaces from each argument, so empty strings will generate nothing at all. It sounds like you want to treat an empty string as a single space. The TRIM() function will return a single space for an empty string. So perhaps you just want to change this:
cats('s/',string_pattern,'/',string_replace,'/','i')
into
cat('s/',trim(string_pattern),'/',trim(string_replace),'/','i')
Here is working code (with a fixed string_pattern) for your example data:
data test;
length string_pattern string_replace dirty_string expect
clean_string regex $200
;
infile cards dsd truncover;
input string_pattern string_replace dirty_string expect;
regex= cat('s/',trim(string_pattern),'/',trim(string_replace),'/i') ;
regex_id = prxparse(trim(regex));
clean_string = prxchange(regex_id,-1,trim(dirty_string));
if clean_string=expect then put 'GOOD'; else put 'BAD';
*put (_character_) (=$quote./);
cards4;
"(#|,|\/|')", " ","10,120 DIRTY DRIVE","10 120 DIRTY DRIVE"
;;;;
If any of your values have significant trailing spaces then you will need to store the data differently. You could for example quote the values:
string_replace = "'DRIVE '";
...
cat('s/',dequote(string_pattern),'/',dequote(string_replace),'/','i')
If you only add quotes around values that need them then you will need to include the TRIM() function calls.
cat('s/',dequote(trim(string_pattern)),'/',dequote(trim(string_replace)),'/','i')
Or store the string lengths into separate numeric fields.
cat('s/',substrn(string_pattern,1,len1),'/',substrn(string_replace,1,len2),'/','i')
And note that if any of your original character strings had either significant leading or trailing spaces they would have been eliminated by reading the data from a CSV file.
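The difference between CATS() and CAT() with TRIM() for a blank replacement can be checked with a short sketch like this one. The pattern "," and the variable names with_cats and with_trim are made up for illustration.

```sas
data _null_;
  length string_pattern string_replace $200;
  string_pattern = ",";
  string_replace = " ";
  /* CATS strips the blank value entirely */
  with_cats = cats('s/', string_pattern, '/', string_replace, '/');
  /* TRIM returns one space for a blank value */
  with_trim = cat('s/', trim(string_pattern), '/', trim(string_replace), '/');
  put with_cats=;
  put with_trim=;
run;
```

The log should show with_cats=s/,// (empty replacement) and with_trim=s/,/ / (single-space replacement).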

How to remove special ASCII characters?

I am trying to remove a special character from the string.
"Mumbai rains live updates: IMD predicts heavy rainfall for next 24 hours �"
data demo1 (keep=headline2 headline3 headline4 headline5);
set kk.newspaper_append_freq_daily1;
headline2=trim(headline);
headline3=tranwrd(headline2,"�"," ");
headline5=compress(headline2,"�");
headline4=index(headline2,"�");
run;

You can use the KPROPDATA function.
From doc:
Removes or converts unprintable characters.
Code example:
%let in=kk.newspaper_append_freq_daily1;
%let out=demo1;
data &out;
set &in;
array cc (*) _character_;
do i=1 to dim(cc);
cc(i)=kpropdata(cc(i),"TRUNC", 'utf-8');
end;
run;
In the code I've used an ARRAY statement to iterate over all character columns in the table.

compress should also handle this if you keep a whitelist of characters rather than trying to exclude a blacklist - e.g.
clean_text = compress(dirty_text,'','kw');
The k modifier keeps characters instead of removing them, and w adds all printable characters to the list.
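A minimal sketch of that whitelist approach, using a control character (hex 07) as a stand-in for the unwanted byte. Whether it also strips the character in the question depends on which bytes are actually stored and on the session encoding, so treat this as an assumption to verify on real data.

```sas
data _null_;
  /* '07'x is a non-printable control character */
  dirty_text = "rain" || '07'x || "fall";
  /* k = keep listed characters, w = add printable characters to the list */
  clean_text = compress(dirty_text, '', 'kw');
  put clean_text=;
run;
```

Here the control character is dropped and the log should show clean_text=rainfall.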

SAS Scan function separator not working as it should

I ran into a problem with the scan function in SAS.
The dataset I have contains one variable that needs to be split into multiple variables.
The variable is structured like this:
4__J04__1__SCH175__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
I use this code to split this into multiple variables:
data /*ULB.*/work.smart_BCSS_withNISS_&JJ.&K.;
set work.smart_BCSS_withNISS_&JJ.&K.;
/* Maand splitsen in variablen */
mois=scan(smart,1,"__");
jours=scan(smart,2,"__");
nbjours=scan(smart,3,"__");
refClient=scan(smart,4,"__");
paysPrestation=scan(smart,5,"__");
wordingFR=scan(smart,6,"__");
wordingNL=scan(smart,7,"__");
fonction=scan(smart,8,"__");
ARTISTIQUE2=scan(smart,9,"__");
Art_At_LEAST=scan(smart,10,"__");
totalBrut=scan(smart,11,"__");
totalImposable=scan(smart,12,"__");
run;
Most of the time this works perfectly. However sometimes the 4th variable 'refClient' contains one single underscore like this:
4__J04__1__LE_46__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
Somehow the scan function also detects this single underscore as a separator even though the separator is a double underscore.
Any idea on how to avoid this behavior?

Aurieli's code works, but their answer doesn't explain why. Your understanding of how scan works is incorrect.
If there is more than 1 character in the delimiter specified for scan, each character is treated as a delimiter. You've specified _ twice. If you had specified ab then a and b would both have been treated as delimiters, rather than ab being the delimiter.
scan by default treats multiple consecutive delimiters as a single delimiter, which was why your code treated both __ and _ as delimiters. So if you specified ab as the delimiter string then ba, abba etc. would also be counted as a single delimiter by default.
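Both points can be reproduced with a short sketch. The input strings and variable names here are made up for illustration.

```sas
data _null_;
  s = "LE_46__BE";
  /* "__" is read as the single delimiter character _ listed twice */
  first  = scan(s, 1, "__");
  second = scan(s, 2, "__");
  third  = scan(s, 3, "__");
  put first= second= third=;
  /* a and b are separate delimiters; the run abba counts as one */
  t = "xabbay";
  t1 = scan(t, 1, "ab");
  t2 = scan(t, 2, "ab");
  put t1= t2=;
run;
```

The log should show first=LE second=46 third=BE, followed by t1=x t2=y.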

You can use regexp to change single '_' (for example, change to '-') and then scan what you want:
data /*ULB.*/work.test;
smart="4__J04__1__LE_18__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7";
smartcr=prxchange("s/(?<=[^_])(_{1})(?=[^_])/-/",-1,smart);
/* Maand splitsen in variablen */
mois=scan(smartcr,1,"__");
jours=scan(smartcr,2,"__");
nbjours=scan(smartcr,3,"__");
refClient=tranwrd(scan(smartcr,4,"__"),'-','_');
paysPrestation=scan(smartcr,5,"__");
wordingFR=scan(smartcr,6,"__");
wordingNL=scan(smartcr,7,"__");
fonction=scan(smartcr,8,"__");
ARTISTIQUE2=scan(smartcr,9,"__");
Art_At_LEAST=scan(smartcr,10,"__");
totalBrut=scan(smartcr,11,"__");
totalImposable=scan(smartcr,12,"__");
run;

Mildly interesting, the INFILE statement supports a delimiter string.
data test;
infile cards dlmstr='__';
input (mois
jours
nbjours
refClient
paysPrestation
wordingFR
wordingNL
fonction
ARTISTIQUE2
Art_At_LEAST
totalBrut
totalImposable) (:$32.);
cards;
4__J04__1__SCH175__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
4__J04__1__LE_46__BE__compositeur / arrangeur__compositeur / bewerker__(blank)__1__17__108.03__93.7
;;;;
run;
proc print;
run;

SAS - replacing a character with a space?

Had a quick question - I need to remove punctuation and replace characters with a space (i.e.: if I have a field that contains a * I need to replace it with a white space).
I can't seem to get it right - I was originally doing this to just remove it, but I've found that in some cases my string is being squished together.
Thoughts?
STRING2 = compress(STRING, ":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...");

The COMPRESS() function will remove the characters. If you want to replace them with spaces then use the TRANSLATE() function. If you want to reduce multiple blanks to a single blank use the COMPBL() function.
STRING2 = compbl(translate(STRING,' ',":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…..."));
Rather than listing the characters that need to be converted to spaces you could use COMPRESS() to turn the problem around to listing the characters that should be kept.
So this example will use the modifiers ad on the COMPRESS() function call to pass the characters in STRING that are not alphanumeric characters to the TRANSLATE() function call so they will be replaced by spaces.
STRING2 = compbl(translate(STRING,' ',compress(STRING,' ','ad')));

Try using the translate function and see if it fits your needs:
data want;
STRING = "!';AAAAÄAA$";
STRING2 = translate(STRING,' ',':;,*~''’°-!()®#""#$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...');
run;
Output:
STRING STRING2
!';AAAAÄAA$ AAAA AA

Try the TRANSLATE() function.
TRANSLATE(SOURCE,TO,FROM);
data test;
string = "1:,*2~’°-ÍÓ3Äö‘—È…...4";
string2 = translate(string,
" ",
":,*~’°-!';()®""##$%^&©+=\/|[]}{]{?><ÉÑËÁ’ÍÓÄö‘—È…...");
put string2=;
run;
I get
string2=1 2 3 4

While translate function could get you there, you could also use REGEX in SAS. It is more elegant, but you need to escape the characters in the actual regex pattern.
data want;
input string $60.;
length new_string $60.;
new_string = prxchange('s/([\:\,\*\~\’\°\-\!\'||"\'"||';\(\)\®\"\"\#\#\$\%\^\&\©\+\=\\\/\|\[\}\{\]\{\\\?\>\<\É\Ñ\Ë\Á\’\Í\Ó\Ä\ö\‘\—\È\…\.\.\.\]])/ /',-1,string);
datalines;
Cats, dogs, and anyone else!
;

Try it with the help of regular expressions.
data have;
old = "AM;'IGH}|GH";
new = prxchange("s/[^A-Z]/ /",-1,old);
run;
proc print data=have nobs;
run;
OUTPUT-
old new
AM;'IGH}|GH AM IGH GH
