I have a dataset which I need to clean using regex rules. These rules come from a file regex_rules.csv with columns string_pattern and string_replace and are applied using a combination of prxparse and prxchange as follows:
array a_rules{1:&NOBS} $200. _temporary_;
array a_rules_parsed{1:&num_rules} _temporary_;
if _n_ = 1 then
do i = 1 to &num_rules;
a_rules{i} = cat("'s/",string_pattern,"/",string_replace,"/'");
a_rules_parsed{i} = prxparse(cats('s/',string_pattern,'/',string_replace,'/','i'));
end;
set work.dirty_strings;
clean_string = dirty_string;
do i = 1 to &num_rules;
debug_string = cats("Executing prxchange(",a_rules{i},",",-1,",","'",clean_string,"'",")");
put debug_string;
clean_string = PRXCHANGE(a_rules_parsed{i},-1,clean_string);
end;
Some rules specify replacing certain patterns with a single blank space, so the corresponding string_replace value in the file is a single blank space.
The issue I'm facing is that SAS never respects the single space, and instead replaces the matched string_pattern for these records with an empty string (the other rules are applied as expected).
To troubleshoot I executed the following:
proc sql;
create table work.single_blanks as
select
string_pattern,
string_replace
from work.regex_rules
where string_replace = " ";
quit;
which yielded the expected records. I was confused to find that changing the where clause to
where string_replace = "" or
where string_replace = " " gave identical results! (I've been using SAS for a while, but I guess this behavior had gone unnoticed until now.) Consequently, I could not determine whether SAS is failing to read in the file properly and retain the single blank, or whether one of the prx functions is failing to handle the single blank properly.
I can think of "hacky" work-arounds, but I'd rather understand what I'm doing wrong here and what the correct solution should be.
EDIT 1:
Here is a rule from the file and how I'd expect it to act on an example input value:
string_pattern, string_replace
"(#|,|/|')", " "
running the code above on the input string dirty_string = "10,120 DIRTY DRIVE"; does not produce the expected output of "10 120 DIRTY DRIVE" but rather "10120 DIRTY DRIVE".
EDIT 2
In addition to not respecting single spaces, leading and trailing spaces do not seem to be respected. For example, for a file with the rules
string_pattern, string_replace
"\\bDR(\\.|\\b)", "DRIVE "
"\\bS(\\.|\\b)?W(\\.|\\b)", " SOUTH WEST"
running the code above on the input string dirty_string = "10120 DIRTY DR.SW."; does not produce the expected output of "10120 DIRTY DRIVE SOUTH WEST" but rather "10120 DIRTY DRIVESW.". This is because the space at the end of the first string_replace value gets lost, meaning there is no word boundary at the beginning of the second string_pattern to be matched.
SAS stores character variables as fixed-length strings that are padded with spaces. As a consequence, string comparisons ignore trailing spaces. So x='' and x='   ' are the same test.
CATS() removes all leading and trailing spaces from each argument, so a blank value contributes nothing at all to the result. It sounds like you want to treat an empty string as a single space. The TRIM() function returns a single space for an empty string. So perhaps you just want to change this:
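The padding and comparison behavior described above can be checked with a short DATA step. This is a minimal sketch; the variable names are made up for illustration.

```sas
data _null_;
  length x $5;
  x = 'A';                        /* stored as 'A' followed by four padding spaces */
  same_short  = (x = 'A');        /* 1: the shorter operand is padded before comparing */
  same_long   = (x = 'A     ');   /* 1: trailing spaces are not significant */
  blank_eq    = ('' = '   ');     /* 1: an empty literal equals a run of spaces */
  stored_len  = lengthc(x);       /* 5: storage length, padding included */
  visible_len = length(x);        /* 1: position of the last non-blank character */
  put same_short= same_long= blank_eq= stored_len= visible_len=;
run;
```

LENGTHC() reports the storage length, while LENGTH() ignores the padding, which is why the two differ here.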
cats('s/',string_pattern,'/',string_replace,'/','i')
into
cat('s/',trim(string_pattern),'/',trim(string_replace),'/','i')
Here is working code (with a fixed string_pattern) for your example data:
data test;
length string_pattern string_replace dirty_string expect
clean_string regex $200
;
infile cards dsd truncover;
input string_pattern string_replace dirty_string expect;
regex= cat('s/',trim(string_pattern),'/',trim(string_replace),'/i') ;
regex_id = prxparse(trim(regex));
clean_string = prxchange(regex_id,-1,trim(dirty_string));
if clean_string=expect then put 'GOOD'; else put 'BAD';
*put (_character_) (=$quote./);
cards4;
"(#|,|\/|')", " ","10,120 DIRTY DRIVE","10 120 DIRTY DRIVE"
;;;;
If any of your values have significant trailing spaces then you will need to store the data differently. You could for example quote the values:
string_replace = "'DRIVE '";
...
cat('s/',dequote(string_pattern),'/',dequote(string_replace),'/','i')
If you only add quotes around values that need them then you will need to include the TRIM() function calls.
cat('s/',dequote(trim(string_pattern)),'/',dequote(trim(string_replace)),'/','i')
Or store the string lengths into separate numeric fields.
cat('s/',substrn(string_pattern,1,len1),'/',substrn(string_replace,1,len2),'/','i')
And note that if any of your original character strings had either significant leading or trailing spaces they would have been eliminated by reading the data from a CSV file.
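As a rough illustration of the stored-length option, the sketch below applies the DR rule from the question with its trailing space intact. The names len1 and len2 come from the SUBSTRN() line above; the values 11 and 6 are hard-coded here as an assumption about what such length columns would contain.

```sas
data _null_;
  length string_pattern string_replace dirty_string clean_string $200 regex $500;
  string_pattern = '\bDR(\.|\b)';
  string_replace = 'DRIVE ';      /* the trailing space is significant */
  len1 = 11;                      /* assumed stored length of string_pattern */
  len2 = 6;                       /* assumed stored length of string_replace, space included */
  dirty_string = '10120 DIRTY DR.SW.';
  /* SUBSTRN returns exactly len1 / len2 characters, so the trailing space survives */
  regex = cat('s/', substrn(string_pattern, 1, len1), '/',
                    substrn(string_replace, 1, len2), '/i');
  clean_string = prxchange(trim(regex), -1, trim(dirty_string));
  put clean_string=;              /* clean_string=10120 DIRTY DRIVE SW. */
run;
```

Because the replacement now ends in a space, "SW." starts at a word boundary and a later SW rule has something to match.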
Related
I am new to SAS and I am currently trying to create a macro which will automatically replace any special characters in a variable's name with an underscore. I am currently using PRXCHANGE to perform the replacement, yet I notice that when the variable gets renamed, extra underscores are placed at the end of the new variable name.
Suppose we have two variables, "dummy?" and "te!st". When I perform the replacement, the new variables are "dummy___________________________" and "te_st___________________________", when they should just be "dummy_" and "te_st", respectively.
In the sample code below, I know that if I were to add "TRIM(name)" in the PRXCHANGE function then there would not be any extra replacements occurring. The issue with doing this is that if I were to have a variable named "example! ", with a space as the final character, then I would want the variable to be renamed to "example__", with two underscores at the end. Yet by using TRIM(name), I would get "example_", with a single underscore.
N.B. I know if I change the SAS variable name policy to V7, then this would not be a problem. I am solely doing this to improve upon my SAS skills.
/* Generate dummy data */
option validvarname = any;
data dummy_data;
input "dummy?"n "te!st"n;
datalines;
1 1
2 2
3 3
;
run;
/* Generate variables with the old and new variable names as entries */
data test (keep = name new_name);
set sashelp.vcolumn;
where libname = "WORK" and memname = "DUMMY_DATA";
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, name);
run;
Your issue is how SAS pads string variable length. While most languages have variable length strings, SAS is more akin to SQL char type, without the accompanying varchar type. This gives SAS very good performance in some ways, due to predictable row sizes, but has some consequences. Note that you can actually get effectively variable length strings on datasets using options compress, but during a data step the dataset is uncompressed.
In SAS, a string of length 10 that is assigned "A" will actually have value "A ". A, plus 9 spaces. Not null characters, actual space characters. That usually doesn't matter, as SAS is written in many ways to ignore those trailing spaces (so "A" = "A " = "A "), but in this particular case it does matter (since you're transforming the space character).
You can use the trim function to remove the spaces during execution, though it will still be stored with the spaces afterwards of course.
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
Note that trim cannot return a null value; for a blank input it always returns a single space. If that's a possibility, you should wrap this in a check for missing (a string variable containing only spaces is missing).
if not missing(name) then do;
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
end;
else new_name = ' ';
There is a trimn function that can return a length 0 string, but there's no reason to do the prxchange if it's missing - this will save time.
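The difference between trim and trimn on a blank value can be seen with a small sketch (the variable names are illustrative only):

```sas
data _null_;
  length name $10 with_trim with_trimn $12;
  name = ' ';
  with_trim  = '[' || trim(name)  || ']';   /* [ ] : TRIM keeps one space */
  with_trimn = '[' || trimn(name) || ']';   /* []  : TRIMN returns a zero-length string */
  /* With TRIM, the single remaining space is still a non-alphanumeric character */
  blank_result = prxchange('s/[^a-zA-Z0-9]/_/', -1, trim(name));
  put with_trim= with_trimn= blank_result=;
run;
```

Here blank_result comes out as a single underscore, which is why the missing() check above is worth having.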
Your concern about trailing spaces on variable names is not valid. Trailing spaces on variable names are not significant. This data step creates only one variable.
options validvarname=any;
data test;
'xxx'n = 1;
'xxx 'n= 2;
run;
NOTE: The data set WORK.TEST has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
Why does this code not need two trim statements, one for first and one for last name? Does the length statement remove blanks?
data work.maillist; set cert.maillist;
length FullName $ 40;
fullname=trim(firstname)||' '||lastname;
run;
length is a declarative statement and introduces a variable to the Program Data Vector (PDV) with the specific length you specify. When an undeclared variable is used in a formula SAS will assign it a default length depending on the formula or usage context.
Character variables in SAS have a fixed length and are padded with spaces on the right. That is why trim(firstname) is needed when the || lastname concatenation occurs. Without it, the right padding of firstname would be part of the value in the concatenation, and the result could exceed the length of the variable receiving it.
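A small sketch of the effect, using made-up names and lengths rather than the cert.maillist data:

```sas
data _null_;
  length firstname lastname $20 fullname no_trim $40;
  firstname = 'Ada';
  lastname  = 'Lovelace';
  /* Padding of firstname is removed before the pieces are joined */
  fullname = trim(firstname) || ' ' || lastname;
  /* Padding of firstname is kept, pushing lastname out to column 22 */
  no_trim  = firstname || ' ' || lastname;
  pos_trim   = index(fullname, 'Lovelace');   /* 5 */
  pos_notrim = index(no_trim, 'Lovelace');    /* 22 */
  put fullname= / no_trim= / pos_trim= pos_notrim=;
run;
```

The padding on lastname does no harm because it sits at the end of the result, where SAS would pad anyway.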
There are concatenation functions that can simplify string operations
CAT same as using <var>|| operator
CATT same as using trim(<var>)||
CATS same as using trim(left(<var>))||
CATX same as using CATS with a delimiter.
STRIP same as trim(left(<var>))
Your expression could be re-coded as:
fullname = catx(' ', firstname, lastname);
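To compare the functions listed above side by side, here is a minimal sketch with illustrative values (not taken from the original data):

```sas
data _null_;
  length firstname lastname $10 r_cat r_catt r_cats r_catx $30;
  firstname = 'Ada';
  lastname  = 'Lovelace';
  r_cat  = cat(firstname, lastname);        /* 'Ada', seven padding spaces, then 'Lovelace' */
  r_catt = catt(firstname, lastname);       /* AdaLovelace: trailing spaces trimmed */
  r_cats = cats(firstname, lastname);       /* AdaLovelace: leading and trailing spaces stripped */
  r_catx = catx(' ', firstname, lastname);  /* Ada Lovelace: stripped, joined by the delimiter */
  put (r_:) (=/);
run;
```

CATT and CATS give the same result here only because neither value has leading spaces.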
Is there a reason you think it should? Can you see trailing spaces in the surname, have you tried a length() function?
I could be wrong here but sometimes when you apply a function (put especially) or import data you can inadvertently store leading or trailing spaces. Trailing spaces are a mystery because you don't realise they are there until you try to do something else with the data.
A length statement should allow you to store exactly the data you give it providing you use a number/character variable correctly with truncation only occurring if the length value is too short.
I've found the compress() function to be the most convenient for dealing with white space and punctuation, particularly if you are concatenating variables.
https://www.geeksforgeeks.org/sas-compress-function-with-examples/
All the best,
Phil
Because SAS will truncate the value when it is too long to fit into FULLNAME. And when it is too short it will fill in the rest of FULLNAME with spaces anyway so there is no need to remove them.
It would only be an issue if the length of FULLNAME is smaller than the sum of the lengths of FIRSTNAME and LASTNAME plus one. Otherwise the result cannot be too long to fit into FULLNAME, even if there are no trailing spaces in either FIRSTNAME or LASTNAME.
Try it yourself with non-blank values so it is easier to see what is happening.
data test;
length one $1 two $2 three $3 ;
one = 'ABCD';
two = 'ABCD';
three='ABCD';
put (_all_) (=);
run;
one=A two=AB three=ABC
NOTE: The data set WORK.TEST has 1 observations and 3 variables.
I have a decode phrase (AE_SER_D) 'Is a significant medical event in the Investigator's judgment” that I need to change to ‘Is a significant medical event in the Investigators judgment’, as the apostrophe between r and s is causing the program to error out. I can't change the decode (AE_SER_C), but I wanted to program a line of code using a scan function to check whether ae_ser_d is ne '' and contains this phrase. I only want to search for a partial segment of the phrase, because searching for the whole phrase will still cause the program to error out because of the apostrophe. Is SCAN the best option here?
Working with Reeza's idea: Remove all punctuation marks with the compress() function and the 'p' option. Assuming you want a single quote around the whole phrase, enclose the result with single quotations using cats().
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D,,'p'), "'");
run;
If you only need to remove quotations and need to keep other punctuation marks, specify them directly in compress():
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D, "'"""), "'");
run;
Source: KevinQin
I would like to remove dashes from 3 to 9 digit numbers. A certain percentage of those numbers have leading zeros in them. I tried using the Compress function, but this stripped the zeros as well. What would be the best function to use?
I understand your "numbers" are actually codes with digits and dashes and you want to keep only the digits, so what you need is string processing.
The compress function in SAS has a second (optional) parameter. If you don't specify it, the function removes all blanks. If you do, it removes the characters specified. So try
no_dash = compress(with_dash, '-');
Alternatively you could remove all non digit characters, using a third (also optional) parameter
no_dash = compress(with_dash, '0123456789', 'k');
The k specifies to keep instead of remove the characters specified. You can shorten this by adding the d to the third parameter, telling SAS to add all digits to the second:
no_dash = compress(with_dash, '', 'dk');
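The three calls behave differently once the value contains anything other than digits and dashes. A minimal sketch with a made-up input value:

```sas
data _null_;
  with_dash   = 'ID 007-45-0123';
  remove_dash = compress(with_dash, '-');               /* ID 007450123: only dashes removed */
  keep_listed = compress(with_dash, '0123456789', 'k'); /* 007450123: only listed characters kept */
  keep_digits = compress(with_dash, , 'kd');            /* 007450123: 'd' adds all digits to the list */
  put remove_dash= keep_listed= keep_digits=;
run;
```

In all three cases the result is a character value, so the leading zeros are preserved.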
If you have stored the compressed result (with implicit conversion) in a numeric variable, that variable may need a format to get the result you want.
data _null_;
my_dashed_text = '000-90-123';
my_compressed_text = compress(my_dashed_text, '-');
attrib my_num_var
length = 8
format = z9.
;
my_num_var = compress(my_dashed_text, '-');
put (_all_) (=/);
run;
------ LOG -----
NOTE: Character values have been converted to numeric values at the places given by:
(Line):(Column).
36:16
my_dashed_text=000-90-123
my_compressed_text=00090123
my_num_var=000090123
The Z numeric format tells SAS to add leading zeros that fill out to the specified width when displaying the number. The format is a fixed width, so a my_num_var from both "123-456" and "0-1-2-3-45-6" will display a Z9 formatted value of 000123456. SAS formatting can't make a numeric value look like 123456 or 0123456 when rendered through a single format specification (such as Z9).
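The fixed-width point can be checked directly. In this sketch the conversion is done explicitly with INPUT() rather than relying on the implicit conversion shown in the log above; the variable names are illustrative.

```sas
data _null_;
  /* Both dashed codes reduce to the same number, 123456 */
  a = input(compress('123-456', '-'), 9.);
  b = input(compress('0-1-2-3-45-6', '-'), 9.);
  /* PUT() with Z9. pads each to nine digits with leading zeros */
  a_text = put(a, z9.);   /* 000123456 */
  b_text = put(b, z9.);   /* 000123456 */
  put a= b= a_text= b_text=;
run;
```

Once the values are numeric, the original number of leading zeros is gone; keeping the compressed result as a character variable is the way to retain it.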
Match strings ending in certain character
I am trying to create a new variable which indicates if a string ends with a certain character.
Below is what I have tried, but when this code is run, the variable ending_in_e is all zeros. I would expect that names like "Alice" and "Jane" would be matched by the code below, but they are not:
proc sql;
select *,
case
when prxmatch("/e$/",name) then 1
else 0
end as ending_in_e
from sashelp.class
;quit;
You should account for the fact that, in SAS, character values are fixed length and are padded with trailing spaces when the actual value is shorter than the variable's length.
Either trim the string:
prxmatch("/e$/",trim(name))
Or add a whitespace pattern:
prxmatch("/e\s*$/",name)
where \s* matches 0 or more whitespace characters.
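Both fixes can be compared against the original pattern in a DATA step. This is a minimal sketch with a hard-coded name; the $8 length mirrors the padding a fixed-length variable would carry.

```sas
data _null_;
  length name $8;
  name = 'Alice';                               /* stored as 'Alice' plus three spaces */
  padded  = prxmatch('/e$/', name);             /* 0: the value ends in spaces, not 'e' */
  trimmed = prxmatch('/e$/', trim(name));       /* 5: padding removed before matching */
  pattern = prxmatch('/e\s*$/', name);          /* 5: padding absorbed by \s* */
  put padded= trimmed= pattern=;
run;
```

PRXMATCH returns the position where the match starts, so any nonzero value counts as true in the CASE expression.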
SAS character variables are fixed length. So you either need to trim the trailing spaces or include them in your regular expression.
Regular expressions are powerful, but they might be confusing to some. For such a simple pattern it might be clearer to use simpler functions.
proc print data=sashelp.class ;
where char(name,length(name))='e';
run;