Selective case sensitivity/insensitivity with PRXPARSE - regex

I have the following regex which I am using to scan fields within a dataset for a variety of markers that may indicate that the record belongs to a US resident:
prx_1 = (prxparse("/(?i)^USA$(?-i)|
(?i)^United[\s+]States[\s+]of[\s+]America$(?-i)|
(?i)^US$(?-i)|
(?i)^U[\s+]S[\s+]A$(?-i)|
(?i)^United[\s+]States$(?-i)|
(?i)^America$(?-i)|
(?i)^U[\.+]S[\.+]A$(?-i)|
(?i)^U[\.+]S[\.+]A[\.+]$(?-i)|
(?-i)^AL$|(?-i)^AK$|(?-i)^AZ$|(?-i)^AR$|
(?-i)^CA$|(?-i)^CO$|(?-i)^CT$|(?-i)^DE$|
(?-i)^DC$|(?-i)^FL$|(?-i)^GA$|(?-i)^HI$|
(?-i)^ID$|(?-i)^IL$|(?-i)^IN$|(?-i)^IA$|
(?-i)^KS$|(?-i)^KY$|(?-i)^LA$|(?-i)^ME$|
(?-i)^MD$|(?-i)^MA$|(?-i)^MI$|(?-i)^MN$|
(?-i)^MS$|(?-i)^MO$|(?-i)^MT$|(?-i)^NE$|
(?-i)^NV$|(?-i)^NH$|(?-i)^NJ$|(?-i)^NM$|
(?-i)^NY$|(?-i)^NC$|(?-i)^ND$|(?-i)^OH$|
(?-i)^OK$|(?-i)^OR$|(?-i)^PA$|(?-i)^RI$|
(?-i)^SC$|(?-i)^SD$|(?-i)^TN$|(?-i)^TX$|
(?-i)^UT$|(?-i)^VT$|(?-i)^VA$|(?-i)^WA$|
(?-i)^WV$|(?-i)^WI$|(?-i)^WY$|(?-i)^AS$|
(?-i)^GU$|(?-i)^MP$|(?-i)^PR$|(?-i)^VI$|
(?-i)^UM$|(?-i)^FM$|(?-i)^MH$|(?-i)^PW$|
(?-i)^AA$|(?-i)^AE$|(?-i)^AP$|(?-i)^CM$|
(?-i)^CZ$|(?-i)^NB$|(?-i)^PI$|(?-i)^TT$|
(?i)^Alabama$(?-i)|(?i)^Alaska$(?-i)|(?i)^Arizona$(?-i)|(?i)^Arkansas$(?-i)|
(?i)^California$(?-i)|(?i)^Colorado$(?-i)|(?i)^Connecticut$(?-i)|(?i)^Delaware$(?-i)|
(?i)^District[\s+]of[\s+]Columbia$(?-i)|(?i)^Florida$(?-i)|(?i)^Georgia$(?-i)|(?i)^Hawaii$(?-i)|
(?i)^Idaho$(?-i)|(?i)^Illinois$(?-i)|(?i)^Indiana$(?-i)|(?i)^Iowa$(?-i)|(?i)^Kansas$(?-i)|
(?i)^Kentucky$(?-i)|(?i)^Louisiana$(?-i)|(?i)^Maine$(?-i)|(?i)^Maryland$(?-i)|
(?i)^Massachusetts$(?-i)|(?i)^Michigan$(?-i)|(?i)^Minnesota$(?-i)|(?i)^Mississippi$(?-i)|
(?i)^Missouri$(?-i)|(?i)^Montana$(?-i)|(?i)^Nebraska$(?-i)|(?i)^Nevada$(?-i)|
(?i)^New[\s+]Hampshire$(?-i)|(?i)^New[\s+]Jersey$(?-i)|(?i)^New[\s+]Mexico$(?-i)|
(?i)^New[\s+]York$(?-i)|(?i)^North[\s+]Carolina$(?-i)|(?i)^North[\s+]Dakota$(?-i)|
(?i)^Ohio$(?-i)|(?i)^Oklahoma$(?-i)|(?i)^Oregon$(?-i)|(?i)^Pennsylvania$(?-i)|
(?i)^Rhode[\s+]Island$(?-i)|(?i)^South[\s+]Carolina$(?-i)|(?i)^South[\s+]Dakota$(?-i)|
(?i)^Tennessee$(?-i)|(?i)^Texas$(?-i)|(?i)^Utah$(?-i)|(?i)^Vermont$(?-i)|(?i)^Virginia$(?-i)|
(?i)^Washington$(?-i)|(?i)^West[\s+]Virginia$(?-i)|(?i)^Wisconsin$(?-i)|(?i)^Wyoming$(?-i)|
(?i)^American[\s+]Samoa$(?-i)|(?i)^Guam$(?-i)|(?i)^Northern[\s+]Mariana[\s+]Islands$(?-i)|
(?i)^Puerto[\s+]Rico$(?-i)|(?i)^Virgin[\s+]Islands$(?-i)|
(?i)^U[\.*]S[\.*][\s+]Minor[\s+]Outlying[\s+]Islands$(?-i)|
(?i)^Federated[\s+]States[\s+]of[\s+]Micronesia$(?-i)|(?i)^Marshall[\s+]Islands$(?-i)|
(?i)^Palau$(?-i)/"
));
This is a series of small regexes concatenated with the | marker. My understanding of regexes was that if I wanted to switch case sensitivity on and off, I should use (?i) to turn case-insensitive matching on and (?-i) to turn it off. However, this code is not returning matches where, for example, the state name is written in upper case.
Have I misinterpreted something here?
Thanks

If the regex flavour supports (?i), it should also support (?i:pattern). You should rewrite your regex and place the patterns that should be case-insensitive inside the non-capturing group (?i:pattern).
An example for the part of the pattern which you need to make case-insensitive:
^(?i:USA|United\s+States\s+of\s+America|United\s+States)$
An example for the part of the pattern which you need to make case-sensitive:
^(?:AL|AK|AZ|AR)$
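As a quick cross-check of the scoped-modifier idea outside SAS, here is a minimal sketch in Python, whose re module also accepts (?i:...) (Python 3.6+). The shortened pattern and the is_us_marker helper are assumptions made for illustration only; PRXPARSE itself is not exercised here.

```python
import re

# Case-insensitive alternatives live inside (?i:...); the two-letter
# codes stay outside it and therefore remain case-sensitive.
US_MARKER = re.compile(
    r"^(?:(?i:USA|United\s+States(?:\s+of\s+America)?)|AL|AK|AZ|AR)$"
)

def is_us_marker(value: str) -> bool:
    # True when the whole (trimmed) value is one of the markers.
    return US_MARKER.search(value.strip()) is not None

for sample in ["usa", "UNITED  STATES", "United States of America", "AZ", "az"]:
    print(f"{sample!r}: {is_us_marker(sample)}")
```

Only the last sample, the lower-case az, is rejected, because the state codes sit outside the case-insensitive group.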

This works for me. For details, see the section titled "Comments and Inline Modifiers" lower down on this page.
data have;
input state $;
datalines;
AZ
az
Az
ARIZONA
Arizona
ArIzOnA
;;;;
run;
data want;
set have;
_rx = prxparse('~(?i)AZ|(?-i)Arizona~o');
_rc = prxmatch(_rx,state);
put _rc=;
run;
Your regex is too complex right now to really help you troubleshoot. If you want troubleshooting, I would limit it to just one state (or something like that) and figure it out from there.

Related

SCAN function to change a phrase

I have a decode phrase (AE_SER_D), "Is a significant medical event in the Investigator's judgment", that I need to change to "Is a significant medical event in the Investigators judgment", because the apostrophe between the r and the s is causing the program to error out. I can't change the decode (AE_SER_C), but I wanted to write a line of code using a scan function that checks whether ae_ser_d is ne '' and contains this phrase. I only want to search for a partial segment of the phrase, because searching for the whole phrase would still make the program error out on the apostrophe. Is SCAN the best option here?
Working with Reeza's idea: Remove all punctuation marks with the compress() function and the 'p' option. Assuming you want a single quote around the whole phrase, enclose the result in single quotation marks using cats().
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D,,'p'), "'");
run;
If you only need to remove quotation marks and want to keep other punctuation marks, specify them directly in compress():
data want;
AE_SER_D = cat("'Is a significant medical event in the Investigator's ", 'judgment"');
AE_SER_D_Fixed = cats("'", compress(AE_SER_D, "'"""), "'");
run;
Source: KevinQin
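For readers who want to compare the two compress() variants without a SAS session, the sketch below mimics them in Python. It assumes that "punctuation" means the ASCII set in string.punctuation, which may not match the SAS 'p' modifier exactly, and the helper names are made up for this example.

```python
import string

phrase = "'Is a significant medical event in the Investigator's judgment\""

def strip_punctuation(text: str) -> str:
    # Drop every ASCII punctuation character (rough analogue of the 'p' option).
    return text.translate(str.maketrans("", "", string.punctuation))

def strip_quotes(text: str) -> str:
    # Drop only single and double quotes, keeping other punctuation.
    return text.translate(str.maketrans("", "", "'\""))

# Re-wrap the cleaned phrase in single quotes, as cats() does above.
print("'" + strip_punctuation(phrase) + "'")
print("'" + strip_quotes(phrase) + "'")
```

For this particular phrase both helpers give the same text, since quotes are the only punctuation in it; they differ as soon as the value contains commas or periods.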

Can I make my Alteryx RegEx parse conditional?

I receive messages with the fields below. I want to group and extract the user inputs. The majority of submissions contain all fields, and the regex works great. The problem comes when someone removes the remaining lines because, say, they only need to fill in down to Amount 1.
Name:
Number:
Amount:
Old Code:
Code 1:
Amount 1:
Code 2:
Amount 2:
Code 3:
Amount 3:
Code 4:
Amount 4:
I'm using Alteryx to parse the message contents and have had success with my current regex, but I want to be ready for unavoidable inconsistency in user submissions:
Name:(.+)\sNumber:(.+)\sAmount:(.+)\sOld Code:(.+)\sCode 1:(.+)\sAmount 1:(.+)\sCode 2:(.*?)\sAmount 2:(.*?)\sCode 3:(.*?)\sAmount 3:(.*?)\sCode 4:(.*?)\sAmount 4:(.*?[^-]*)
Is it possible to have Alteryx return parsed results from a message even if a listed field is deleted?
Alteryx issue with new cascading regex
You can always wrap the lines in cascading, nested optional groups so that the regex matches whatever is valid up to a point.
This expects the form lines to be in order. If they are not, a different type of regex is needed: an out-of-order regex (see the bottom regex).
Both of these regexes are written for Perl 5.10.
(?-ms)Name:(.*)(?:\s+Number:(.*)(?:\s+Amount:(.*)(?:\s+Old[ ]+Code:(.*)(?:\s+Code[ ]+1:(.*)(?:\s+Amount[ ]+1:(.*)(?:\s+Code[ ]+2:(.*)(?:\s+Amount[ ]+2:(.*)(?:\s+Code[ ]+3:(.*)(?:\s+Amount[ ]+3:(.*)(?:\s+Code[ ]+4:(.*)(?:\s+Amount[ ]+4:(.*?[^-]*))?)?)?)?)?)?)?)?)?)?)?
https://regex101.com/r/9oKXEE/1
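To make the cascading idea easier to follow, here is a trimmed-down sketch with only three of the fields, written for Python's re module rather than Alteryx. The sample messages and the parse helper are invented for illustration; groups that belong to missing trailing fields come back as None.

```python
import re

# Each later field is wrapped, together with everything after it,
# in an optional non-capturing group, so the match can stop early.
FORM = re.compile(r"Name:(.*)(?:\s+Number:(.*)(?:\s+Amount:(.*))?)?")

def parse(message: str):
    # Return (name, number, amount); missing trailing fields are None.
    match = FORM.search(message)
    if match is None:
        return None
    return tuple(g.strip() if g is not None else None for g in match.groups())

print(parse("Name: Ann\nNumber: 42\nAmount: 10.50"))
print(parse("Name: Ann\nNumber: 42"))
print(parse("Name: Ann"))
```

The full twelve-field pattern above follows the same shape, just nested deeper.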
For out-of-order matching, use this
(?m-s)\A(?:[\S\s]*?(?:(?(1)(?!))^\h*Name\h*:\h*(.*)|(?(2)(?!))^\h*Number\h*:\h*(.*)|(?(3)(?!))^\h*Amount\h*:\h*(.*)|(?(4)(?!))^\h*Old\h*Code\h*:\h*(.*)|(?(5)(?!))^\h*Code\h*1\h*:\h*(.*)|(?(6)(?!))^\h*Amount\h*1\h*:\h*(.*)|(?(7)(?!))^\h*Code\h*2\h*:\h*(.*)|(?(8)(?!))^\h*Amount\h*2\h*:\h*(.*)|(?(9)(?!))^\h*Code\h*3\h*:\h*(.*)|(?(10)(?!))^\h*Amount\h*3\h*:\h*(.*)|(?(11)(?!))^\h*Code\h*4\h*:\h*(.*)|(?(12)(?!))^\h*Amount\h*4\h*:\h*(.*?))){1,12}
https://regex101.com/r/f2rG1v/1
In this situation, you don't need to use Regex straight off the bat and given the inconsistent data it could take a while to perfect one regex term...
You can do it this way instead:
- RecordID first,
- Then you can use a Text to Columns with a new-line (\n) delimiter. Configure this to "Split to Rows".
- You can then use a second Text to Columns to split on the delimiter ":".
That will handle additional rows entered etc. At that stage, you can figure out how to clean up the results (filter to remove null lines, multi-row to tag records, cross-tab to create a table etc...). If you want to flag any unknown rows, you can have a Text Input with the required rows and use Find/Replace or Join to separate the data.
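The same split-then-split logic can be prototyped outside Alteryx. The sketch below is plain Python, not an Alteryx workflow: it splits a message into rows on newlines, then splits each row on the first ":". The REQUIRED list (shortened here) and the sample message are assumptions.

```python
REQUIRED = ["Name", "Number", "Amount", "Old Code", "Code 1", "Amount 1"]

def split_message(message: str) -> dict:
    # Split into rows on newlines, then into label/value on the first ':'.
    fields = {}
    for row in message.splitlines():
        if ":" not in row:
            continue  # skip blank or free-text rows
        label, value = row.split(":", 1)
        fields[label.strip()] = value.strip()
    return fields

def unknown_labels(fields: dict) -> list:
    # Labels that are not in the required list, so they can be flagged.
    return [label for label in fields if label not in REQUIRED]

message = "Name: Ann\nNumber: 42\n\nAmount: 10.50\nNote: rush"
parsed = split_message(message)
print(parsed)
print(unknown_labels(parsed))
```

Because each row is handled on its own, deleted or extra rows do not break the parse; they only change which labels show up.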

Regular Expression in SAS. Match word if at start or with a space before

I have some Text in a table and want to match a specific word, for example "good".
if (prxmatch("/GOOD/",UPCASE(mytablefield)) > 0) then ....
But this should only match when no other letter is before the G.
So I could add a space before the G:
"/ GOOD/"
But it could also be at the start of the text. According to some SAS documentation, and to some answers for other languages on this site, ^ is used for the start of the text, but if I try
"/(( )|(^)) GOOD/" or "/(^GOOD)|( GOOD)/" or even "/^GOOD/"
I get no result for text starting with good. I guess this is a simple problem, maybe related to regex syntax in SAS, but I could not find it by googling. How can I solve this issue?
"Good morning"-> should match
"This is a good idea"->should match
"I have an ungood Feeling"-> should not match
"He is back 4good" -> should not match
You can actually use a word boundary for that:
/\bGOOD/i
If you need to match a whole word, add the trailing \b:
/\bGOOD\b/i
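The two patterns can be tried against the question's sample strings with a short script. This is a sketch in Python rather than SAS, on the assumption that \b behaves the same way in both flavours (a boundary between a word character and a non-word character, with digits counting as word characters); the variable names are illustrative.

```python
import re

STARTS_WORD = re.compile(r"\bgood", re.IGNORECASE)   # a word begins with "good"
WHOLE_WORD = re.compile(r"\bgood\b", re.IGNORECASE)  # the word is exactly "good"

samples = [
    "Good morning",
    "This is a good idea",
    "I have an ungood Feeling",
    "He is back 4good",
    "goodness me",
]
for text in samples:
    print(text, bool(STARTS_WORD.search(text)), bool(WHOLE_WORD.search(text)))
```

"4good" is rejected because the digit 4 is itself a word character, so there is no boundary before the g. "goodness me" passes the first pattern but not the whole-word one.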
You can also use the FINDW function, which returns the position of the word; the 'i' modifier makes it ignore case. If the result is greater than 0, you have a match:
DATA TEST;
INPUT STRING $40.;
DATALINES;
Good morning
This is a good idea
I have an ungood Feeling
He is back 4good
goodness me
Good boy
;
RUN;
DATA TEST1; SET TEST;
POS=FINDW(STRING,'good',' ','i',1);
IF POS>0 THEN MATCH=1;
ELSE MATCH=0;
RUN;

One function to replace different text with other in SAS

I want to replace one combination of text with another. For example
data test;
a='raja\ram{work}italic';
if index(a,'\') then b=tranwrd(a,'\','\\');
if index(a,'{') then b=tranwrd(a,'{','\{');
if index(a,'}') then b=tranwrd(a,'}','\}');
if index(upcase(a),'ITALIC') then b=tranwrd(a,substr(a,index(upcase(a),'ITALIC'),length('ITALIC')),'\i');
run;
Required Result: b=raja\\ram\{work\}\i;
These are the kinds of combinations I want to replace. I'm not interested in using a macro, FCMP, or if/else conditions.
Is there any function to do it all at once? I tried a Perl regular expression, which also works, but only for one replacement at a time: b= prxchange('s/\\/\\\\/', -1, a)
Your regular expression is on the right track. You have a set of characters that you always want to prepend a \ to, right? So search for any one of that set of characters, which you do with [...], capture it in a capturing group, and put a \ in front of it in the replacement. The backslash is the escape character, so you have to write two any time you want one (\\ escapes itself to \).
data test;
a='Hello\Goodbye{stuff}';
b= prxchange('s/([\\{}])/\\$1/',-1,a);
put b=;
run;
You should do the italic bit in a second expression (or just use tranwrd). That's a totally different replacement and while theoretically possible to put in one, would make it too messy.
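The two-step approach (one character-class substitution, then a separate replacement for the italic keyword) can be sketched as follows. This is Python's re.sub standing in for prxchange, so the replacement syntax differs slightly (\1 instead of $1), and escape_markup is a hypothetical helper name.

```python
import re

def escape_markup(text: str) -> str:
    # Step 1: put a backslash in front of every \, { or }.
    escaped = re.sub(r"([\\{}])", r"\\\1", text)
    # Step 2: separate, case-insensitive replacement of the word "italic".
    return re.sub(r"italic", r"\\i", escaped, flags=re.IGNORECASE)

print(escape_markup(r"raja\ram{work}italic"))
```

The order matters: the escaping pass has to run first, otherwise the backslash introduced by the italic replacement would itself be doubled.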
This question is almost identical to the other question: Multiple search and replace within a string through regular expression in SAS
Is that a coincidence?
Here is the code that worked for the other question.
%let text = abc\pqr{work};
data _null_;
var=prxchange("s/\\/\\\\/",-1,"&text");
var=prxchange("s/\{/\\\{/",-1,var);
var=prxchange("s/\}/\\\}/",-1,var);
put var;
run;
Result: abc\\pqr\{work\};
%let text = BOLD\ITALIC\ITALICBOLD\BOLDITALIC\B\I\IB\BI;
data _null_;
var=prxchange("s/BOLD/b/",-1,"&text");
var=prxchange("s/ITALIC/i/",-1,var);
var=lowcase(var);
put var;
run;
RESULT: b\i\ib\bi\b\i\ib\bi

How can I use regexextract function in Google Docs spreadsheets to get "all" occurrences of a string?

My text string is in cell D2:
Decision, ERC Case No. 2009-094 MC, In the Matter of the Application for Authority to Secure Loan from the National Electrification Administration (NEA), with Prayer for Issuance of Provisional Authority, Dinagat Island Electric Cooperative, Inc. (DIELCO) applicant(12/29/2011)
This function:
=regexextract(D2,"\([A-Z]*\)")
will grab (NEA) but not (DIELCO)
I would like it to extract both (NEA) and (DIELCO)
You can use capture groups, which will cause regexextract() to return an array. You can use this as the cell result, in which case you will get a range of results, or you can feed the array to another function to reformat it to your purpose. For example:
regexextract( "abracadabra" ; "(bra).*(bra)" )
will return the array:
{bra,bra}
Another approach would be to use regexreplace(). This has the advantage that the replace is global (like s/pattern/replacement/g), so you do not need to know the number of results in advance. For example:
regexreplace( "aBRAcadaBRA" ; "[a-z]+" ; "..." )
will return the string:
...BRA...BRA
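When checking a pattern outside Sheets, the "all occurrences" behaviour the question asks for corresponds to a find-all call. The sketch below uses Python's re.findall on an excerpt of the text from D2; it is not a Sheets formula, and it uses [A-Z]+ instead of [A-Z]* so that an empty pair of parentheses would not count as a hit.

```python
import re

text = (
    "Administration (NEA), with Prayer for Issuance of Provisional "
    "Authority, Dinagat Island Electric Cooperative, Inc. (DIELCO) "
    "applicant(12/29/2011)"
)

# One capture group: findall returns the group contents for every match.
acronyms = re.findall(r"\(([A-Z]+)\)", text)
print(acronyms)
```

The date in parentheses is skipped because it contains no upper-case letters.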
Here are two solutions, one using the specific terms in the author's example, the other one expanding on the author's sample regex pattern which appears to match all ALLCAPS terms. I'm not sure which is wanted, so I gave both.
(Put the block of text in A1)
Generic solution for all words in ALLCAPS
=regexreplace(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b","|"),"\W+","|"),"^\||\|$","")
Result:
ERC|MC|NEA|DIELCO
NB: The brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
If you want space separation, the formula is a little simpler:
=trim(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b"," "),"\W+"," "))
Result:
ERC MC NEA DIELCO
(One way I like playing with regex in Google spreadsheets is to read the regex pattern from another cell, so I can change it without having to edit or re-paste it into all the cells using that pattern. It looks like this:
Cell A1:
Block of text
Cell B1 (no quote marks):
\b\w[^A-Z]*\b
Formula, in any cell:
=trim(regexreplace(REGEXREPLACE(A1,B$1," "),"\W+"," "))
By anchoring it to B$1, I can fill all my rows at once and the reference won't increment.)
Previous answer:
Specific solution for selected terms (ERC, DIELCO)
=regexreplace(join("|",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")),"(^\||\|$)","")
Result:
ERC|DIELCO
As before, the brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
This formula will find any ERC or DIELCO, or both in the block of text. The initial order doesn't matter, but the output will always be ERC followed by DIELCO (the order of appearance is lost). This fixes the shortcoming with the previous answer using "(bra).*(bra)" in that isolated ERC or DIELCO can still be matched.
This also has a simpler form with space separation:
=trim(join(" ",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")))
Result:
ERC DIELCO
Please try:
=SPLIT(regexreplace(A1 ; "(?s)(.)?\(([A-Z]+)\)|(.)" ; "🧸$2");"🧸")
or
=REGEXEXTRACT(A1;"\Q"&REGEXREPLACE(A1;"\([A-Z]+\)";"\\E(.*)\\Q")&"\E")