Regex lookaround does not work with quantifiers in SAS

I have a table similar to this:
Data have;
text = 'insurance premium'; output;
text = 'insur. premium'; output;
text = 'premium. insur aa'; output;
text = 'premium card'; output;
text = 'sales premium'; output;
Run;
My task is to select all transactions that contain the word premium, but do not contain the word insurance or a form thereof (e.g. insur, ins. etc.). I read up on how to use lookaround expressions in regex and wrote the following expression:
/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i
The expression seems to work on testing websites such as https://regexr.com/, but when I run the code below I get an error in SAS:
Data want;
Set have;
re = prxparse('/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i');
flg = prxmatch(re, text) > 0;
Run;
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!insur[a-z.]*\s)premium(?!.*insur[a-z.]*\s) <<
HERE /.
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s) << HERE /.
ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
NOTE: Argument 1 to function PRXPARSE('/(?<!ins[a-z'[12 of 45 characters shown]) at line 30 column 6 is invalid.
NOTE: Argument 1 to the function PRXMATCH is missing.
As far as I understood there is an issue with the * symbols inside the lookaround functions, because the error does not occur if I remove them. Does SAS implement such expressions differently or does it simply not support such expressions?

You are using flg = prxmatch(re, text) > 0; to test for a match: prxmatch returns the position where the match starts, or 0 if there is none.
Instead of a lookbehind, you can put a negative lookahead at the start of the string to rule out the variations of insurance, and then match the word premium.
^(?!.*\bins[a-z.]*\s).*\bpremium\b
Explanation
^ Start of string
(?! Negative lookahead, assert that on the right is not
.*\bins Match a word starting with ins
[a-z.]*\s Optionally repeat matching chars a-z or . followed by a whitespace char
) Close the lookahead
.*\bpremium\b match the word premium in the line
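As a quick cross-check outside SAS, the pattern above can be run against the question's sample rows. This is only a minimal sketch in Python, whose re module uses Perl-style syntax and likewise rejects a variable-width lookbehind. The names rows, pattern, flags and lookbehind_rejected are made up for the illustration, and SAS itself is not involved.

```python
import re

# Sample rows copied from the question's `have` dataset.
rows = [
    "insurance premium",
    "insur. premium",
    "premium. insur aa",
    "premium card",
    "sales premium",
]

# Like the Perl engine behind PRXPARSE, Python's re refuses a lookbehind
# whose width is not fixed.
try:
    re.compile(r"(?<!ins[a-z.]*\s)premium", re.IGNORECASE)
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True

# The lookahead-at-start pattern from the answer (/i becomes re.IGNORECASE).
pattern = re.compile(r"^(?!.*\bins[a-z.]*\s).*\bpremium\b", re.IGNORECASE)

# 1 when the row mentions premium without an "ins..." word, else 0 (mirrors flg).
flags = {text: int(pattern.search(text) is not None) for text in rows}

for text, flg in flags.items():
    print(f"{text!r}: {flg}")
```

Only the last two rows, premium card and sales premium, are flagged.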

You cannot use a lookbehind with a variable width pattern in a PCRE regex. However, you can match and skip substrings you do not need using (*SKIP)(*FAIL) verbs, so you can revamp the regex you have in the following way:
prxparse('/ins[a-z.]*\spremium(?!.*ins[a-z.]*\s)(*SKIP)(*F)|premium(?!.*ins[a-z.]*\s)/i')
Mind that alternatives are tried from left to right at each position. The first alternative, ins[a-z.]*\spremium(?!.*ins[a-z.]*\s)(*SKIP)(*F), is tried first: if ins[a-z.]*\spremium(?!.*ins[a-z.]*\s) is found, that text is skipped and the match fails there. Otherwise, the second alternative, premium(?!.*ins[a-z.]*\s), comes into play and matches premium in other contexts, provided it is not followed by ins, zero or more letters or dots, and a whitespace.
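Backtracking verbs are not available in every engine; Python's built-in re module, for instance, does not support (*SKIP)(*F). A comparable effect can be sketched there with a different technique: let the first alternative consume the unwanted context and capture only the wanted occurrence in a group. This swaps a capture-group idiom in for the verbs and is not the SAS pattern itself; keep_premium and has_wanted_premium are hypothetical names.

```python
import re

keep_premium = re.compile(
    r"ins[a-z.]*\spremium"             # unwanted context: consumed, not captured
    r"|(premium)(?!.*ins[a-z.]*\s)",   # wanted occurrence: captured in group 1
    re.IGNORECASE,
)

def has_wanted_premium(text: str) -> bool:
    """True if some match took the second alternative (group 1 is set)."""
    return any(m.group(1) is not None for m in keep_premium.finditer(text))

rows = [
    "insurance premium",
    "insur. premium",
    "premium. insur aa",
    "premium card",
    "sales premium",
]
selected = [text for text in rows if has_wanted_premium(text)]
print(selected)
```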

Related

Converting a normal regular RegEx to one in SAS

Despite reading the SAS documentation and various example pages, I am struggling to convert a slightly more complicated regex to SAS syntax. I am using the prxchange function. This is what I have come up with so far to convert a filename string like pre_31DEC2019_299792458.xls to an integer (of length 8), 299792458, inside a SAS data step:
tmp=prxchange('s/pre_([a-zA-Z0-9]{8,9})_([0-9]{1,16})\.xls/\2/g',-1,have);
want=input(tmp,8.);
The error message I have points to somewhere else in the code, but I am rather certain that it is those two lines which cause a problem since leaving out the two quoted lines makes the SAS error message vanish.
References
An unofficial SAS how-to on regex suggests that I could use standard regexes.
Why use regex at all?
want = input(scan(have,-2,'._'),32.);
You can use
tmp = prxchange('s/^pre_[A-Za-z0-9]+_([0-9]+)\.xls$/$1/', -1, have);
Details
s/ - substitution action (we are replacing the match)
^ - start of string
pre_ - a literal prefix
[A-Za-z0-9]+ - one or more alphanumeric ASCII chars (note you may simply use .* here instead if there can be anything)
_ - an underscore
([0-9]+) - Group 1: one or more digits
\.xls$ - .xls at the end of string
$1 - the replacement: the whole match is replaced with the contents of Group 1.
As far as the prxchange function is concerned, note that it replaces all occurrences of the pattern once you pass -1 as the times argument, thus, no g flag is necessary.
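The same substitution can be tried outside SAS. The sketch below assumes Python's re.sub as a rough stand-in for prxchange (it also replaces every occurrence by default) and adds a regex-free variant in the spirit of scan(have, -2, '._'); the variable names are illustrative only.

```python
import re

filename = "pre_31DEC2019_299792458.xls"

# Replace the whole file name with the digits captured in Group 1.
tmp = re.sub(r"^pre_[A-Za-z0-9]+_([0-9]+)\.xls$", r"\1", filename)
want = int(tmp)

# Regex-free variant: treat "." and "_" as delimiters and take the
# second word from the end, similar to scan(have, -2, '._').
want_split = int(filename.replace(".", "_").split("_")[-2])

print(want, want_split)
```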
Many ways you could try:
data _null_;
a="pre_31DEC2019_299792458.xls";
b=input(prxchange('s/.*\_(.*)\..*/$1/',-1,a),12.);
c=input(prxchange('s/.*(\d{9}).*/$1/',-1,a),12.);
d=input(prxchange('s/.*(?<=\_)(\d+).*/$1/',-1,a),12.);
put _all_;
run;
.* matches any character, zero or more times. For b, the number you need sits between the last _ and the last .; for c, it is a run of 9 digits; for d, a lookbehind for "_" locates the digits.
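To see that the three patterns agree, they can be run side by side. This is a small sketch using Python's re.sub in place of prxchange, with the needless backslash before the underscore dropped; the dictionary names are made up for the example.

```python
import re

a = "pre_31DEC2019_299792458.xls"

patterns = {
    "b": r".*_(.*)\..*",      # text between the last "_" and the last "."
    "c": r".*(\d{9}).*",      # a run of exactly nine digits
    "d": r".*(?<=_)(\d+).*",  # digits directly preceded by "_"
}

# The greedy leading .* makes each pattern settle on the last candidate.
results = {name: int(re.sub(p, r"\1", a)) for name, p in patterns.items()}
print(results)
```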

Regex: Separate a string of characters with a non-consistent pattern (Oracle) (POSIX ERE)

EDIT: This question pertains to Oracle implementation of regex (POSIX ERE) which does not support 'lookaheads'
I need to separate a string of characters with a comma, however, the pattern is not consistent and I am not sure if this can be accomplished with Regex.
Corpus: 1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25
The pattern is basically 4 digits, followed by 4 characters, followed by a dot, followed by 1,2, or 3 digits! To make the string above clear, this is how it looks like separated by a space 1710ABCD.13 1711ABCD.43 1711ABCD.4 1711ABCD.404 1711ABCD.25
So the output of a replace operation should look like this:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
I was able to match the pattern using this regex:
(\d{4}\w{4}\.\d{1,3})
It does insert a comma, but after the third digit following the dot (wrong; it should have been after the second digit), and I cannot get it to insert at the right position and globally.
Here is a link to a fiddle
https://regex101.com/r/qQ2dE4/329
All you need is a lookahead at the end of the regular expression, so that the greedy \d{1,3} backtracks until it's followed by 4 digits (indicating the start of the next substring):
(\d{4}\w{4}\.\d{1,3})(?=\d{4})
                     ^^^^^^^^^
https://regex101.com/r/qQ2dE4/330
To expand on @CertainPerformance's answer, if you want to be able to match the last token, you can use an alternative match of $:
(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)
Demo: https://regex101.com/r/qQ2dE4/331
EDIT: Since you now mentioned in the comment that you're using Oracle's implementation, you can simply do:
regexp_replace(corpus, '(\d{1,3})(\d{4})', '\1,\2')
to get your desired output:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
Demo: https://regex101.com/r/qQ2dE4/333
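Both approaches can be compared on the sample corpus. The sketch below uses Python's re.sub purely for illustration (Oracle's regexp_replace is not involved), and the variable names are assumptions.

```python
import re

corpus = "1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25"

# Lookahead variant: the greedy \d{1,3} backtracks until four digits follow.
with_lookahead = re.sub(r"(\d{4}\w{4}\.\d{1,3})(?=\d{4})", r"\1,", corpus)

# Lookahead-free variant: split every run of 5 to 7 digits so that its
# last four digits start the next token.
without_lookahead = re.sub(r"(\d{1,3})(\d{4})", r"\1,\2", corpus)

print(with_lookahead)
print(without_lookahead)
```

The second variant works here because the only place where five or more digits occur in a row is the boundary between two tokens.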
In order to continue finding matches after the first one you must use the global flag /g. The pattern is very tricky but it's feasible if you reverse the string.
var str = `1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25`;
// Reverse String
var rts = str.split("").reverse().join("");
// Do a reverse version of RegEx
/*In order to continue searching after the first match,
use the `g`lobal flag*/
var rgx = /(\d{1,3}\.\w{4}\d{4})/g;
// Replace on the reversed string; the comma goes before each reversed token
var res = rts.replace(rgx, `,$1`);
// Revert the result back to normal direction and drop the trailing comma
var ser = res.split("").reverse().join("").replace(/,$/, "");
console.log(ser);

Postgres Regex Negative Lookahead

Scenario: Match any string that starts with "J01" except the string "J01FA09".
I'm baffled why the following code returns nothing:
SELECT 1
WHERE
'^J01(?!FA09).*' ~ 'J01FA10'
when I can see on regexr.com that it's working (I realize there are different flavors of regex and that could be the reason for the site working).
I have confirmed in the postgres documentation that negative look aheads are supported though.
Table 9-15. Regular Expression Constraints
(?!re) negative lookahead matches at any point where no substring
matching re begins (AREs only). Lookahead constraints cannot contain
back references (see Section 9.7.3.3), and all parentheses within them
are considered non-capturing.
Match any string that starts with "J01" except the string "J01FA09".
You can do without a regex using
WHERE s LIKE 'J01%' AND s != 'J01FA09'
Here, LIKE 'J01%' requires a string to start with J01 and then may have any chars after, and s != 'J01FA09' will filter out the matches.
If you want to achieve the same with a regex, use
WHERE s ~ '^J01(?!FA09$)'
The ^ matches the start of a string, J01 matches the literal J01 substring and (?!FA09$) asserts that J01 is not immediately followed by FA09 and the end of the string. If FA09 appears and the string ends right after it, no match is returned.
See the online demo:
CREATE TABLE table1
(s character varying)
;
INSERT INTO table1
(s)
VALUES
('J01NNN'),
('J01FFF'),
('J01FA09'),
('J02FA09')
;
SELECT * FROM table1 WHERE s ~ '^J01(?!FA09$)';
SELECT * FROM table1 WHERE s LIKE 'J01%' AND s != 'J01FA09';
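The effect of the $ inside the lookahead can be checked with any engine that supports lookaheads. Below is a minimal sketch in Python rather than Postgres; the last two values in codes are extra test cases that are not part of the demo table above.

```python
import re

# The first four values come from the demo table; "J01FA10" and "J01FA09X"
# are added here to show that only the exact string J01FA09 is excluded.
codes = ["J01NNN", "J01FFF", "J01FA09", "J02FA09", "J01FA10", "J01FA09X"]

starts_with_j01 = re.compile(r"^J01(?!FA09$)")

kept = [s for s in codes if starts_with_j01.search(s)]
print(kept)
```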
The regular expression is the right-hand operand of ~, and the string to test goes on the left:
SELECT 1
WHERE 'J01FA10' ~ '^J01(?!FA09)';
?column?
----------
1
(1 row)

How to create "blocks" with Regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
Still gets detected as a correct string, though the order is wrong, it needs to be in the order specified below, and in no other order.
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!
To match the \, followed by x, followed by 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
To force it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()){
res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [0-9A-Fa-f] to [a-f0-9] if we add the case-insensitive modifier (?i) to the start of the pattern, or just use \p{XDigit}, which matches a hexadecimal digit in a Java regex.
The String#matches method always requires the whole string to match, so we do not need the leading ^ and trailing $ anchors when using the pattern inside it.
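For comparison, the same two checks (extracting blocks and validating a whole string) can be sketched in Python against the four sample strings from the question. Here re.fullmatch plays the role of String#matches; the names block, whole, validity and blocks are illustrative.

```python
import re

block = re.compile(r"\\x[0-9A-Fa-f]{2}")        # one \xHH block
whole = re.compile(r"(?:\\x[0-9A-Fa-f]{2})+")   # one or more blocks

# Raw strings keep the backslashes literal.
samples = [r"\xyz\yzx", r"x\12", r"12\x", r"\x12\x13\x14\x00\xff\xff"]

# fullmatch requires the entire string to consist of blocks.
validity = {s: whole.fullmatch(s) is not None for s in samples}

# findall extracts every block found anywhere in the string.
blocks = block.findall(samples[-1])

print(validity)
print(blocks)
```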

sas pattern matching with square brackets evaluation

I have the following SAS code that checks for patterns and flags any error.
I'm sure that it checks for a pattern in field1, but I'm not sure how two square brackets [] are evaluated.
I need to check for invalid values in field1.
sas code:
if prxmatch('/^[a-zA-Z][a-zA-Z0-9_]*$/', strip(&vfiel1)) = 0 then do;
put "Error is field1";
end;
This regular expression checks for a valid-looking SAS name. Specifically, the value must start (^) with a letter ([a-zA-Z]), followed by 0 or more (*) letters, digits, and/or underscores ([a-zA-Z0-9_]) before the end ($).
A better SAS name check would be something along the lines of this:
Libnames: ^[a-zA-Z_][a-zA-Z0-9_]{0,7}$
Dataset & variable names: ^[a-zA-Z_][a-zA-Z0-9_]{0,31}$
Note these allow names to start with an underscore and have max lengths of 8 and 32 characters.
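The two length-limited patterns can be exercised quickly outside SAS. The following is a rough sketch in Python; is_valid_name is a hypothetical helper, and only the character and length rules stated above are checked.

```python
import re

LIBREF = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]{0,7}")   # at most 8 characters
NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]{0,31}")    # at most 32 characters

def is_valid_name(name: str, libref: bool = False) -> bool:
    """Check a candidate name against the libref or dataset/variable pattern."""
    pattern = LIBREF if libref else NAME
    # fullmatch anchors at both ends, so ^ and $ are not needed.
    return pattern.fullmatch(name) is not None

print(is_valid_name("field1"), is_valid_name("1field"), is_valid_name("mylibrary", libref=True))
```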
Here is a page on Names in the SAS Language.