I don't remember how SAS deal with these special characters. Any built-in functions?
E.g
a = New Year's Day, should I use something like index(a, 'New Year's Day') > 0?
The key to this question is the masking of the apostrophe in quotes. If you wish to look for an occurrence of a single apostrophe, you can mask it with double apostrophes:
Looking for single apostrophes
data _NULL_;
a="New Year's Day";
b=index(a,"'");
put b=;
run;
The single apostrophe is passed as a second argument to the index function, using double quotes.
Looking for double quotes
data _NULL_;
a='They said, "Happy New Year!"';
b=index(a,'"');
put b=;
run;
This time around, the double quote is set inside single quotes when passed to the index function
mjsqu and NeoMental covered the basic case well, but in the special case where you do not have the option of using " (for example, you need to prevent macro variable resolution), you can double the apostrophe:
data _null_;
a='MerryXmas&HappyNewYear''s'; *here need single quotes or a macro quoting function;
b=find(a,"'"); *here do not need to mask ampersand resolution;
run;
Of course you could also use %nrstr to avoid resolution, but there are real life cases where this is occasionally needed. This works with "" similarly (two "" become one character ").
Use "find" command like below to find out what are you looking for is there in the string or not. If the returned value is greater than > 0 then apostrophe or whatever you are looking for is there, otherwise not.
Teststring - where you want to look
Next to Teststring is "'" - In quotes what are you looking for, in
your case apostrophe
data _null_;
TestString="New year's day";
IsItThere=find(TestString,"'");
put IsItThere=;
run;
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002267763.htm
Related
I have a dataset which I need to clean using regex rules. These rules come from a file regex_rules.csv with columns string_pattern and string_replace and are applied using a combination of prxparse and prxchange as follows:
array a_rules{1:&NOBS} $200. _temporary_;
array a_rules_parsed{1:&num_rules} _temporary_;
if _n_ = 1 then
do i = 1 to &num_rules;
a_rules{i} = cat("'s/",string_pattern,"/",string_replace,"/'");
a_rules_parsed{i} = prxparse(cats('s/',string_pattern,'/',string_replace,'/','i'));
end
set work.dirty_strings;
clean_string = dirty_string;
do i = 1 to &num_rules;
debug_string = cats("Executing prxchange(",a_rules{i},",",-1,",","'",clean_string,"'",")");
put debug_string;
clean_string = PRXCHANGE(a_rules_parsed{i},-1,clean_string);
end
Some rules specify replacing certain patterns with a single blank space, so the corresponding string_replace value in the file is a single blank space.
The issue I'm facing is that SAS never respects the single space, and instead replaces the matched string_pattern for these records with an empty string (the other rules are applied as expected).
To troubleshoot I executed the following:
proc sql;
create table work.single_blanks as
select
string_pattern,
string_replace,
from work.regex_rules
where string_replace = " ";
quit;
which yielded the expected records. I was confused to find that changing the where clause to
where string_replace = "" or
where string_replace = " " gave identical results! (I've been using sas for a while but I guess this behavior has gone unnoticed until now). Consequently, I could not determine whether SAS is neglecting to properly read in the file and retain the single blank, or whether one of the prx functions is failing to properly handle the single blank.
I can think of "hacky" work-arounds, but I'd rather understand what I'm doing wrong here and what the correct solution should be.
EDIT 1:
Here is a rule from the file and how I'd expect it to act on an example input value:
string_pattern, string_replace
"(#|,|/|')", " "
running the code above on the input string dirty_string = "10,120 DIRTY DRIVE"; does not produce the expected output of "10 120 DIRTY DRIVE" but rather "10120 DIRTY DRIVE".
EDIT 2
In addition to not respecting single spaces, leading and trailing spaces do not seem to be respected. For example, for a file with the rules
string_pattern, string_replace
"\\bDR(\\.|\\b)", "DRIVE "
"\\bS(\\.|\\b)?W(\\.|\\b)", " SOUTH WEST"
running the code above on the input string dirty_string = "10120 DIRTY DR.SW."; does not produce the expected output of "10120 DIRTY DRIVE SOUTH WEST" but rather "10120 DIRTY DRIVESW.". This is because the space at the end of the first string_replace value gets lost, meaning there is no word boundary at the beginning of the second string_pattern to be matched.
SAS stores character variables as fixed length strings that are padded with spaces. As a consequence string comparisons ignore trailing spaces. So x=' ' and x=' ' are the same test.
The CATS() will remove all of the leading and trailing spaces, so empty strings will generate nothing at all. It sounds like you want to treat an empty string as a single space. The TRIM() function will return a single space for an empty string. So perhaps you just want to change this:
cats('s/',string_pattern,'/',string_replace,'/','i')
into
cat('s/',trim(string_pattern),'/',trim(string_replace),'/','i')
Here is a working code (with a fixed string_pattern) of your example data:
data test;
length string_pattern string_replace dirty_string expect
clean_string regex $200
;
infile cards dsd truncover;
input string_pattern string_replace dirty_string expect;
regex= cat('s/',trim(string_pattern),'/',trim(string_replace),'/i') ;
regex_id = prxparse(trim(regex));
clean_string = prxchange(regex_id,-1,trim(dirty_string));
if clean_string=expect then put 'GOOD'; else put 'BAD';
*put (_character_) (=$quote./);
cards4;
"(#|,|\/|')", " ","10,120 DIRTY DRIVE","10 120 DIRTY DRIVE"
;;;;
If any of your values have significant trailing spaces then you will need to store the data differently. You could for example quote the values:
string_replace = "'DRIVE '";
...
cat('s/',dequote(string_pattern),'/',dequote(string_replace),'/','i')
If you only add quotes around values that need them then you will need to include the TRIM() function calls.
cat('s/',dequote(trim(string_pattern)),'/',dequote(trim(string_replace)),'/','i')
Or store the string lengths into separate numeric fields.
cat('s/',substrn(string_pattern,1,len1),'/',substrn(string_replace,1,len2),'/','i')
And note that if any of your original character strings had either significant leading or trailing spaces they would have been eliminated by reading the data from a CSV file.
I have a decode phrase (AE_SER_D) 'Is a significant medical event in the Investigator's judgment” that I need to change to ‘Is a significant medical event in the Investigators judgment’ as the apostrophe between r and s is causing the program to error out. I can't change the decode (AE_SER_C) but wanted to program a line of code using a scan function to search if ae_ser_d is ne '' and contains this phrase but only want to search for a partial segment of the phrase as If I search for the whole phrase it will cause the program to still error out because of the apostrophe. Is SCAN the best option here?.
Working with Reeza's idea: Remove all punctuation marks with the compress() function and the 'p' option. Assuming you want a single quote around the whole phrase, enclose the result with single quotations using cats().
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D,,'p'), "'");
run;
If you only need to remove quotations and need to keep other punctuation marks, specify them directly in compress():
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D, "'"""), "'");
run;
Source: KevinQin
Do you guys know how to replace remove the comma and period in something like this:
'18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,. '
I had to concatenate to get list of claim numbers (with leading zeros). Now, I have that string but I want to delete all the stuff at the end. I tried this but it didn't work
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=PRXCHANGE('s/,.//',1,CONCAT);
run;
By the way, I am using SAS and regex, something like prxchange.
This also worked for me
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=TRANWRD(CONCAT, ',.', '');
run;
The second argument to the PRXCHANGE function specifies the number of times the search and replace should be done. Replacing your 1 by -1 will run the replacement until the end of the string, rather than only once.
Also, the pair ',.' will replace a comma followed by any character ('.' is a wildcard). You want to catch either a comma (',') or a period ('.'), the last of which is a metacharacter you need to escape from, using '\':
CONCAT1=PRXCHANGE('s/[,\.]//',-1,CONCAT);
If you only want to remove the comma-period pairs, then remove the square brackets:
CONCAT1=PRXCHANGE('s/,\.//',-1,CONCAT);
No need for regex unless you have something more complicated than actually shown.
Just use the scan() function and tell it to use . and , as delimiters:
data claims;
length claim $50;
list = '18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.';
cnt=1;
claim=scan(list,cnt,'.,');
do while (claim ne '');
output;
cnt=cnt+1;
claim=scan(list,cnt,'.,');
end;
keep claim;
run;
Match strings ending in certain character
I am trying to get create a new variable which indicates if a string ends with a certain character.
Below is what I have tried, but when this code is run, the variable ending_in_e is all zeros. I would expect that names like "Alice" and "Jane" would be matched by the code below, but they are not:
proc sql;
select *,
case
when prxmatch("/e$/",name) then 1
else 0
end as ending_in_e
from sashelp.class
;quit;
You should account for the fact that, in SAS, strings are of char type and spaces are added up to the string end if the actual value is shorter than the buffer.
Either trim the string:
prxmatch("/e$/",trim(name))
Or add a whitespace pattern:
prxmatch("/e\s*$/",name)
^^^
to match 0 or more whitespaces.
SAS character variables are fixed length. So you either need to trim the trailing spaces or include them in your regular expression.
Regular expressions are powerful, but they might be confusing to some. For such a simple pattern it might be clearer to use simpler functions.
proc print data=sashelp.class ;
where char(name,length(name))='e';
run;
I want to replace one combination of text with another. For example
data test;
a='raja\ram{work}italic';
if index(a,'\') then b=tranwrd(a,'\','\\');
if index(a,'{') then b=tranwrd(a,'{','\{');
if index(a,'}') then b=tranwrd(a,'}','\}');
if index(upcase(a),'ITALIC') then b=tranwrd(a,substr(a,index(upcase(a),'ITALIC'),length('ITALIC')),'\i');
run;
Required Result: b=raja\\ram\{work\}\i;
These kind of combination I wanted to replace. I'm not interested to use a macro or FCMP or if else condition.
Is there any function to do all at once? I tried to use a Perl expression that also working for one at a time b= prxchange('s/\\/\\\\/', -1, a)
Your regular expression is on the right track. You have a set of characters, right, you want to always prepend a \ to? So search for (one of that set of characters), which you do with [...], and then add a \ to it, using a capturing group. That's the escape character, so you have to add two any time you want to use one (\\ escapes itself to \).
data test;
a='Hello\Goodbye{stuff}';
b= prxchange('s/([\\{}])/\\$1/',-1,a);
put b=;
run;
You should do the italic bit in a second expression (or just use tranwrd). That's a totally different replacement and while theoretically possible to put in one, would make it too messy.
This question is almost identical to the other question: Multiple search and replace within a string through regular expression in SAS
Is that a coincidence?
Here is the code that worked for the other question.
%let text = abc\pqr{work};
data _null_;
var=prxchange("s/\\/\\\\/",-1,"&text");
var=prxchange("s/\{/\\\{/",-1,var);
var=prxchange("s/\}/\\\}/",-1,var);
put var;
run;
Result: abc\\pqr\{work\};
%let text = BOLD\ITALIC\ITALICBOLD\BOLDITALIC\B\I\IB\BI;
data _null_;
var=prxchange("s/BOLD/b/",-1,"&text");
var=prxchange("s/ITALIC/i/",-1,var);
var=lowcase(var);
put var;
run;
RESULT: b\i\ib\bi\b\i\ib\bi