SCAN function to change a phrase - sas

I have a decode phrase (AE_SER_D) 'Is a significant medical event in the Investigator's judgment” that I need to change to ‘Is a significant medical event in the Investigators judgment’ as the apostrophe between r and s is causing the program to error out. I can't change the decode (AE_SER_C) but wanted to program a line of code using a scan function to search if ae_ser_d is ne '' and contains this phrase but only want to search for a partial segment of the phrase as If I search for the whole phrase it will cause the program to still error out because of the apostrophe. Is SCAN the best option here?.

Working with Reeza's idea: Remove all punctuation marks with the compress() function and the 'p' option. Assuming you want a single quote around the whole phrase, enclose the result with single quotations using cats().
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D,,'p'), "'");
run;
If you only need to remove quotations and need to keep other punctuation marks, specify them directly in compress():
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D, "'"""), "'");
run;
Source: KevinQin

Related

BigQuery remove <0x00> hidden characters from a column

I have a table with unwanted hidden characters such as my_table:
id
fruits
1
STuff1 stuff_2 ����������������������
2
Blahblah-blahblah �������������
3
nothing
How do I remove ���������������������� when selecting this column?
Current query:
SELECT fruits, TRIM(REGEXP_REPLACE(fruits, r'[^a-zA-Z,0-9,-]', ' ')) AS new_fruits
FROM `project-id.MYDATASET.my_table`
This query is too flaw because I'm worried if I accidentally exclude/replace important data. I only want to be specific on this weird characters.
Upon opening the data as csv, the weird characters shows as <0x00>. How do I solve this?
First you have to identify which is this character, because as it is a non printable this sign is just a random representation. For replace it without remove any other important information, do the following:
identify the hexadecimal of the character. Copy from csv and past on this site:
Use the replace function in bigquery replacing the char of this hex, as following:
SELECT trim(replace(string_field_1,chr(0xfffd)," ")) FROM `<project>.<dataset>.<table>`;
if your character result is different than fffd, put you value on the chr() function

SAS treatment of blanks interferes with regex rules

I have a dataset which I need to clean using regex rules. These rules come from a file regex_rules.csv with columns string_pattern and string_replace and are applied using a combination of prxparse and prxchange as follows:
array a_rules{1:&NOBS} $200. _temporary_;
array a_rules_parsed{1:&num_rules} _temporary_;
if _n_ = 1 then
do i = 1 to &num_rules;
a_rules{i} = cat("'s/",string_pattern,"/",string_replace,"/'");
a_rules_parsed{i} = prxparse(cats('s/',string_pattern,'/',string_replace,'/','i'));
end
set work.dirty_strings;
clean_string = dirty_string;
do i = 1 to &num_rules;
debug_string = cats("Executing prxchange(",a_rules{i},",",-1,",","'",clean_string,"'",")");
put debug_string;
clean_string = PRXCHANGE(a_rules_parsed{i},-1,clean_string);
end
Some rules specify replacing certain patterns with a single blank space, so the corresponding string_replace value in the file is a single blank space.
The issue I'm facing is that SAS never respects the single space, and instead replaces the matched string_pattern for these records with an empty string (the other rules are applied as expected).
To troubleshoot I executed the following:
proc sql;
create table work.single_blanks as
select
string_pattern,
string_replace,
from work.regex_rules
where string_replace = " ";
quit;
which yielded the expected records. I was confused to find that changing the where clause to
where string_replace = "" or
where string_replace = " " gave identical results! (I've been using sas for a while but I guess this behavior has gone unnoticed until now). Consequently, I could not determine whether SAS is neglecting to properly read in the file and retain the single blank, or whether one of the prx functions is failing to properly handle the single blank.
I can think of "hacky" work-arounds, but I'd rather understand what I'm doing wrong here and what the correct solution should be.
EDIT 1:
Here is a rule from the file and how I'd expect it to act on an example input value:
string_pattern, string_replace
"(#|,|/|')", " "
running the code above on the input string dirty_string = "10,120 DIRTY DRIVE"; does not produce the expected output of "10 120 DIRTY DRIVE" but rather "10120 DIRTY DRIVE".
EDIT 2
In addition to not respecting single spaces, leading and trailing spaces do not seem to be respected. For example, for a file with the rules
string_pattern, string_replace
"\\bDR(\\.|\\b)", "DRIVE "
"\\bS(\\.|\\b)?W(\\.|\\b)", " SOUTH WEST"
running the code above on the input string dirty_string = "10120 DIRTY DR.SW."; does not produce the expected output of "10120 DIRTY DRIVE SOUTH WEST" but rather "10120 DIRTY DRIVESW.". This is because the space at the end of the first string_replace value gets lost, meaning there is no word boundary at the beginning of the second string_pattern to be matched.
SAS stores character variables as fixed length strings that are padded with spaces. As a consequence string comparisons ignore trailing spaces. So x=' ' and x=' ' are the same test.
The CATS() will remove all of the leading and trailing spaces, so empty strings will generate nothing at all. It sounds like you want to treat an empty string as a single space. The TRIM() function will return a single space for an empty string. So perhaps you just want to change this:
cats('s/',string_pattern,'/',string_replace,'/','i')
into
cat('s/',trim(string_pattern),'/',trim(string_replace),'/','i')
Here is a working code (with a fixed string_pattern) of your example data:
data test;
length string_pattern string_replace dirty_string expect
clean_string regex $200
;
infile cards dsd truncover;
input string_pattern string_replace dirty_string expect;
regex= cat('s/',trim(string_pattern),'/',trim(string_replace),'/i') ;
regex_id = prxparse(trim(regex));
clean_string = prxchange(regex_id,-1,trim(dirty_string));
if clean_string=expect then put 'GOOD'; else put 'BAD';
*put (_character_) (=$quote./);
cards4;
"(#|,|\/|')", " ","10,120 DIRTY DRIVE","10 120 DIRTY DRIVE"
;;;;
If any of your values have significant trailing spaces then you will need to store the data differently. You could for example quote the values:
string_replace = "'DRIVE '";
...
cat('s/',dequote(string_pattern),'/',dequote(string_replace),'/','i')
If you only add quotes around values that need them then you will need to include the TRIM() function calls.
cat('s/',dequote(trim(string_pattern)),'/',dequote(trim(string_replace)),'/','i')
Or store the string lengths into separate numeric fields.
cat('s/',substrn(string_pattern,1,len1),'/',substrn(string_replace,1,len2),'/','i')
And note that if any of your original character strings had either significant leading or trailing spaces they would have been eliminated by reading the data from a CSV file.

sas, remove the comma and period, regex

Do you guys know how to replace remove the comma and period in something like this:
'18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,. '
I had to concatenate to get list of claim numbers (with leading zeros). Now, I have that string but I want to delete all the stuff at the end. I tried this but it didn't work
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=PRXCHANGE('s/,.//',1,CONCAT);
run;
By the way, I am using SAS and regex, something like prxchange.
This also worked for me
data OUT.REQ_1_4_25 ;
set OUT.REQ_1_4_24;
CONCAT1=TRANWRD(CONCAT, ',.', '');
run;
The second argument to the PRXCHANGE function specifies the number of times the search and replace should be done. Replacing your 1 by -1 will run the replacement until the end of the string, rather than only once.
Also, the pair ',.' will replace a comma followed by any character ('.' is a wildcard). You want to catch either a comma (',') or a period ('.'), the last of which is a metacharacter you need to escape from, using '\':
CONCAT1=PRXCHANGE('s/[,\.]//',-1,CONCAT);
If you only want to remove the comma-period pairs, then remove the square brackets:
CONCAT1=PRXCHANGE('s/,\.//',-1,CONCAT);
No need for regex unless you have something more complicated than actually shown.
Just use the scan() function and tell it to use . and , as delimiters:
data claims;
length claim $50;
list = '18430109646000104331929350001,064380958490001,974317618110001,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.';
cnt=1;
claim=scan(list,cnt,'.,');
do while (claim ne '');
output;
cnt=cnt+1;
claim=scan(list,cnt,'.,');
end;
keep claim;
run;

How to check if a string contains apostrophe

I don't remember how SAS deal with these special characters. Any built-in functions?
E.g
a = New Year's Day, should I use something like index(a, 'New Year's Day') > 0?
The key to this question is the masking of the apostrophe in quotes. If you wish to look for an occurrence of a single apostrophe, you can mask it with double apostrophes:
Looking for single apostrophes
data _NULL_;
a="New Year's Day";
b=index(a,"'");
put b=;
run;
The single apostrophe is passed as a second argument to the index function, using double quotes.
Looking for double quotes
data _NULL_;
a='They said, "Happy New Year!"';
b=index(a,'"');
put b=;
run;
This time around, the double quote is set inside single quotes when passed to the index function
mjsqu and NeoMental covered the basic case well, but in the special case where you do not have the option of using " (for example, you need to prevent macro variable resolution), you can double the apostrophe:
data _null_;
a='MerryXmas&HappyNewYear''s'; *here need single quotes or a macro quoting function;
b=find(a,"'"); *here do not need to mask ampersand resolution;
run;
Of course you could also use %nrstr to avoid resolution, but there are real life cases where this is occasionally needed. This works with "" similarly (two "" become one character ").
Use "find" command like below to find out what are you looking for is there in the string or not. If the returned value is greater than > 0 then apostrophe or whatever you are looking for is there, otherwise not.
Teststring - where you want to look
Next to Teststring is "'" - In quotes what are you looking for, in
your case apostrophe
data _null_;
TestString="New year's day";
IsItThere=find(TestString,"'");
put IsItThere=;
run;
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002267763.htm

Is there an efficient way to scrape substrings from column values in Postgres?

I have a column called user_response, on which I want to do variety of operations like take out words contained in quotes, and take out the part of the string after colon (:)
One such operation is this:
Let's say for a record
user_response = "My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts"
Now, I want to scrape the part of the string after last colon(:). Hence, my output should be RealMadridTShirts
I could achieve this somehow with the following hack:
SELECT reverse(split_part(reverse(user_response), ' :', 1))
However, this is grossly inefficient, specially when I am having to do this over 500,000 rows. It's not an operation that I will doing throughout the day. This operation is for a once-a-day load but even then the load is becoming very expensive.
Coming from Oracle, I know I could have used INSTR and SUBSTR functions to achieve it in a more elegant fashion (without having to reverse the string and all.
Also, what if I had to scrape the text after the second last colon?
Find the string after the last colon, right?
My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts
It's trivial with a regular expression:
regress=> SELECT (regexp_matches(
'My company: ''XYZ Co.'' has allowed to use:: the following \n \n kind of product: RealMadridTShirts',
'.*:(.*?)$')
)[1];
regexp_matches
--------------------
RealMadridTShirts
(1 row)
The apparent lack of a function to request the position of a string counting from a particular starting point makes it harder to do without using a regexp, but as a regexp is sure to be the fastest way to solve this I doubt that's an issue.
Your bigger problem is likely to be that you're scanning so much data. That's never going to be fast.