Postgres Regex Negative Lookahead

Scenario: Match any string that starts with "J01" except the string "J01FA09".
I'm baffled why the following code returns nothing:
SELECT 1
WHERE
'^J01(?!FA09).*' ~ 'J01FA10'
when I can see on regexr.com that it's working (I realize there are different flavors of regex and that could be the reason for the site working).
I have confirmed in the Postgres documentation that negative lookaheads are supported, though.
Table 9-15. Regular Expression Constraints
(?!re) negative lookahead matches at any point where no substring
matching re begins (AREs only). Lookahead constraints cannot contain
back references (see Section 9.7.3.3), and all parentheses within them
are considered non-capturing.

Match any string that starts with "J01" except the string "J01FA09".
You can do without a regex using
WHERE s LIKE 'J01%' AND s != 'J01FA09'
Here, LIKE 'J01%' requires the string to start with J01, followed by any chars, and s != 'J01FA09' then filters out the one exact string J01FA09.
If you want to achieve the same with a regex, use
WHERE s ~ '^J01(?!FA09$)'
The ^ matches the start of the string, J01 matches the literal J01 substring, and (?!FA09$) asserts that J01 is not immediately followed by FA09 and then the end of the string. If FA09 appears there and the string ends right after it, no match is returned.
See the online demo:
CREATE TABLE table1
(s character varying)
;
INSERT INTO table1
(s)
VALUES
('J01NNN'),
('J01FFF'),
('J01FA09'),
('J02FA09')
;
SELECT * FROM table1 WHERE s ~ '^J01(?!FA09$)';
SELECT * FROM table1 WHERE s LIKE 'J01%' AND s != 'J01FA09';
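As a quick cross-check outside the database, here is a minimal Python sketch of the same two filters applied to the four demo rows. It assumes that Python's re module treats this particular pattern the same way a PostgreSQL ARE does (both support ^, $ and (?!...)); the function names regex_filter and like_filter are made up for illustration.

```python
import re

rows = ["J01NNN", "J01FFF", "J01FA09", "J02FA09"]

# Same pattern as in the SQL demo: starts with J01, but is not exactly J01FA09.
pattern = re.compile(r"^J01(?!FA09$)")

def regex_filter(values):
    # re.search mirrors the ~ operator: it does not require a full-string match.
    return [v for v in values if pattern.search(v)]

def like_filter(values):
    # Mirrors: s LIKE 'J01%' AND s != 'J01FA09'
    return [v for v in values if v.startswith("J01") and v != "J01FA09"]

print(regex_filter(rows))
print(like_filter(rows))
```

Both filters keep J01NNN and J01FFF. A longer value such as J01FA091 would also pass, because the $ inside the lookahead limits the exclusion to the exact string.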

The regular expression must be the right-hand operand of ~; the string to test goes on the left:
SELECT 1
WHERE 'J01FA10' ~ '^J01(?!FA09)';
?column?
----------
1
(1 row)

Related

Regex lookaround does not work with quantifiers in SAS

I have a table similar to this:
Data have;
text = 'insurance premium'; output;
text = 'insur. premium'; output;
text = 'premium. insur aa'; output;
text = 'premium card'; output;
text = 'sales premium'; output;
Run;
My task is to select all transactions that contain the word premium, but do not contain the word insurance or a form thereof (e.g. insur, ins. etc.). I read up on how to use lookaround expressions in regex and wrote the following expression:
/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i
The expression seems to work on testing websites such as https://regexr.com/, but when I run the code below I get an error in SAS:
Data want;
Set have;
re = prxparse('/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i');
flg = prxmatch(re, text) > 0;
Run;
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!insur[a-z.]*\s)premium(?!.*insur[a-z.]*\s) << HERE /.
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s) << HERE /.
ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
NOTE: Argument 1 to function PRXPARSE('/(?<!ins[a-z'[12 of 45 characters shown]) at line 30 column 6 is invalid.
NOTE: Argument 1 to the function PRXMATCH is missing.
As far as I understood there is an issue with the * symbols inside the lookaround functions, because the error does not occur if I remove them. Does SAS implement such expressions differently or does it simply not support such expressions?
You are using flg = prxmatch(re, text) > 0; to see whether there is a match by checking that the returned position is > 0.
Since you only need a yes/no answer, you can put a negative lookahead at the start of the string to rule out the variations of insurance anywhere in the text, and then match the word premium.
^(?!.*\bins[a-z.]*\s).*\bpremium\b
Explanation
^ Start of string
(?! Negative lookahead: assert that what follows does not match
.*\bins Any characters, then a word starting with ins
[a-z.]*\s Zero or more of the chars a-z or ., followed by a whitespace char
) Close the lookahead
.*\bpremium\b Match the word premium anywhere in the line
Regex demo
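The pattern above can be checked against the five sample rows from the question. This is a minimal sketch in Python rather than SAS, assuming that Python's re and the Perl-style engine behind prxparse agree on this pattern; the /i modifier is expressed as re.IGNORECASE, and the names texts and flags are illustrative.

```python
import re

texts = [
    "insurance premium",
    "insur. premium",
    "premium. insur aa",
    "premium card",
    "sales premium",
]

# The /i flag from the SAS call is expressed as re.IGNORECASE here.
pattern = re.compile(r"^(?!.*\bins[a-z.]*\s).*\bpremium\b", re.IGNORECASE)

# 1 when the row mentions premium without an "ins..." word, else 0
flags = {t: int(pattern.search(t) is not None) for t in texts}

for text, flg in flags.items():
    print(flg, text)
```

Only premium card and sales premium are flagged. Because the lookahead requires a whitespace character after the ins... word, a value that ends in such a word (for example premium insur) would still be flagged, so the \s may need adjusting for real data.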
You cannot use a lookbehind with a variable width pattern in a PCRE regex. However, you can match and skip substrings you do not need using (*SKIP)(*FAIL) verbs, so you can revamp the regex you have in the following way:
prxparse('/ins[a-z.]*\spremium(?!.*ins[a-z.]*\s)(*SKIP)(*F)|premium(?!.*ins[a-z.]*\s)/i')
Mind that alternatives are tried from left to right. The first alternative, ins[a-z.]*\spremium(?!.*ins[a-z.]*\s)(*SKIP)(*F), is tried first: if ins[a-z.]*\spremium(?!.*ins[a-z.]*\s) matches, that text is skipped and the match fails at that position. Otherwise, the second alternative, premium(?!.*ins[a-z.]*\s), comes into play and matches premium in any other context, as long as it is not followed by ins, zero or more letters / dots, and a whitespace.

Regex for last n characters of String in PostgreSQL query

Regex checks wouldn't be a strong point of mine. This is trivial but after playing around with it for 15 minutes already I think it would be quicker posting here. Ultimately I want to filter out any results of a table where a certain text column value ends with S(01 -99), i.e. the letter S followed by 2 digits. Consider the following test query
select x.* from (
select
unnest(array['kjkjkj','jhjs01','kjkj11','kjhkjh','uusus','iiosis99']::text[])
as tests ) x
where RIGHT(x.tests,3) !~ 'S[0-9]{1,2}$'
This returns everything in the unnested array, whereas I'm hoping to return everything excluding the second and last values. Any pointers in the right direction would be much appreciated. I'm using PostgreSQL v11.9
You may actually use SIMILAR TO here since your pattern is not that complex:
SELECT * FROM table
WHERE column_name NOT SIMILAR TO '%S[0-9]{2}'
SIMILAR TO patterns require a full string match, so here, % matches any text from the start of the string, then S matches S and [0-9]{2} matches two digits that must be at the end of the string.
If you were to use a regex, you could use
WHERE column_name !~ 'S[0-9]{2}$'
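Or, 'S[0-9]{1,2}$' if there can be one or two digits. Since a regex search in PostgreSQL does not require a full string match, the pattern simply matches S followed by two (or, with {1,2}, one or two) digits at the end of the string ($). Note that ~ and !~ are case-sensitive and your sample values contain a lowercase s, so use the case-insensitive operator !~* if those rows should be excluded too.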
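To see the effect of case sensitivity on the question's test values, here is a minimal Python sketch. It assumes Python's re behaves like the PostgreSQL operators for this simple pattern, with re.IGNORECASE standing in for !~*; the helper name keep_not_matching is hypothetical.

```python
import re

tests = ["kjkjkj", "jhjs01", "kjkj11", "kjhkjh", "uusus", "iiosis99"]

def keep_not_matching(values, flags=0):
    # Mirrors: WHERE value !~ 'S[0-9]{2}$' (or !~* when re.IGNORECASE is passed)
    pattern = re.compile(r"S[0-9]{2}$", flags)
    return [v for v in values if not pattern.search(v)]

case_sensitive = keep_not_matching(tests)
case_insensitive = keep_not_matching(tests, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

The case-sensitive run keeps all six values, while the case-insensitive run drops jhjs01 and iiosis99.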

Concatenate special characters to the column values based on pattern matching in Postgres

I have a table with following column in postgres
col1
C29[40
D1305_D1306delinsKK
E602C[20
I would like to append a string 'p.' & closing square brackets in row 1 and 3 elements and 'p.' to the row2 element.
The expected output is:
col2
p.C29[40]
p.D1305_D1306delinsKK
p.E602C[20]
I am running following query, which runs without an error but the expected output is missing.
SELECT *,
CASE
WHEN t.c LIKE 'p.?=[%'
THEN 'p.'|| t.c || ']'
ELSE 'p.'|| t.c
END AS col2
FROM table;
You may use two chained REGEXP_REPLACE calls:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('C29[40', '^(.*\[\d+)$', 'p.\1]'), '^(?:p\.)?', 'p.')
See the regex demo #1 and regex demo #2 and the PostgreSQL demo.
Pattern details
^ - start of string
(.*\[\d+) - Group 1 (\1): any 0+ chars as many as possible (.*), then [ and 1+ digits
$ - end of string.
The ^(?:p\.)? pattern matches an optional p. substring at the beginning of the string, and thus either adds p. or replaces p. with p. (thus, keeping it).
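The two chained replacements can be sketched in Python to check all three sample values at once. This assumes re.sub handles these two patterns the same way REGEXP_REPLACE does without the g flag (count=1 replaces only the first match); the function name add_prefix_and_bracket is made up for illustration.

```python
import re

def add_prefix_and_bracket(value):
    # Step 1: if the value ends with "[" plus digits, close the bracket and add "p.".
    step1 = re.sub(r"^(.*\[\d+)$", r"p.\1]", value, count=1)
    # Step 2: add "p." unless step 1 already put it there.
    return re.sub(r"^(?:p\.)?", "p.", step1, count=1)

for col1 in ["C29[40", "D1305_D1306delinsKK", "E602C[20"]:
    print(add_prefix_and_bracket(col1))
```

Rows with an open bracket get both the prefix and the closing bracket, and the remaining row gets only the prefix.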

Why do I get empty response for regexp_matches function while using positive lookahead (?=...)

Why does the following code return just empty strings, {""}? How can I make it return the matching strings?
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=..CAA)','g');
Expected output is:
regexp_matches
----------------
{GCCAA}
{AACAA}
{AACAA}
{GTCAA}
(4 rows)
but instead it returns the following:
regexp_matches
----------------
{""}
{""}
{""}
{""}
(4 rows)
I actually have a bit more complicated query, which requires positive lookahead in order to cover all occurrences of patterns in the string even if they overlap.
Well, it's not pretty, but you can do it without regular expressions or custom functions.
WITH data(d) as (
SELECT * FROM (VALUES ('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT')) v
)
SELECT substr(d, x, 5) AS match
FROM data
JOIN LATERAL (SELECT generate_series(1, length(d))) g(x) ON TRUE
WHERE substr(d, x, 5) LIKE '__CAA'
;
match
-------
GCCAA
AACAA
AACAA
GTCAA
(4 rows)
Basically, get each five letter slice of the string and see if it matches __CAA.
You could change generate_series(1, length(d)) to generate_series(1, length(d)-4) because the last ones will never match, but you would have to remember to update this if the length of your matching string changes.
Using a lookahead has the problem that the lookahead itself is not part of the match, but it does allow overlapping searches.
Without a lookahead, you lose the ability to do overlapping searches.
Using PowerShell, you can loop over the indexes returned by the lookahead matches and use each one as an index into your search string to get the matches:
$string = 'ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT'
$r = [regex]::new('(?=..CAA)')
$r.Matches($string) | % {$string.Substring($_.Index, 5)}
returns
GCCAA
AACAA
AACAA
GTCAA
I don't know how to translate this to PostgreSQL (or if that's even possible)
update:
Apparently it won't capture inside of an assertion, that's ok because
what you really need is the first 2 characters, which can safely be
consumed. It will only give you the first 2 characters per row, but
since you know the last 3, you can easily join the set elements
with the CAA constant.
Try this
..(?=CAA)
and you're done.
If I knew the bizarre sql language, I could show you how to do the join.
Output should now be
match
-------
GC
AA
AA
GT
(4 rows)
This is the regex you need for overlapped matches.
(?=(..CAA))
https://regex101.com/r/eJ36zb/1
I think you just need this sql statement which captures group 1:
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=(..CAA))','g');
Formatted regex
(?=
( . . CAA ) # (1)
)
The reason you got empty strings in your result is that
you didn't give the expression anything to consume and
nothing to capture.
I.e., it matched at the right places, but nothing was consumed or captured.
So, doing it this way allows the overlap and the capture so it
should show up on the output now.
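The difference between consuming and non-consuming matches can be sketched in Python. Note the assumption: Python's re does capture inside a lookahead, whereas the PostgreSQL documentation quoted at the top of this page says that parentheses inside lookahead constraints are non-capturing, so this illustrates the idea in Python and is not evidence that the SQL statement returns the same rows. The last variant follows the ..(?=CAA) suggestion and re-attaches the known CAA suffix.

```python
import re

dna = "ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT"

# Consuming pattern: each match uses up its five characters, so overlaps are lost.
consuming = re.findall(r"..CAA", dna)

# Capture inside a lookahead (Python behavior): nothing is consumed, so the scan
# can match at every position, and group 1 carries the five characters.
overlapping = re.findall(r"(?=(..CAA))", dna)

# Consume only the two leading characters, then append the constant suffix.
rejoined = [prefix + "CAA" for prefix in re.findall(r"..(?=CAA)", dna)]

print(consuming)
print(overlapping)
print(rejoined)
```

On this input the consuming pattern finds three matches, while the other two variants find all four, including the overlapping AACAA pair.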
Lookahead is a zero-width assertion: it checks what follows but does not consume any characters, so the match itself is empty. If you change your regular expression to a regular match/capture, you'll get a result. For matching any two characters that are followed by CAA in your case, lookahead probably isn't necessary.

Regex in PostgreSQL

I'm ultimately trying to use the following regex expression.
SELECT *
into table
FROM table2
Where
(Description ~ '\bD\s*(&|AND|&AMP;|N|AMP|\*|\+)\s*B.*')
However this returns the following errors:
[XX000] ERROR: Invalid preceding regular expression prior to repetition operator. The error occured while parsing the regular expression fragment: 'P;|N|AMP|>>>HERE>>>|+)sB.'. Detail: ----------------------------------------------- error: Invalid preceding regular expression prior to repetition operator. The error occured while parsing the regular expression fragment: 'P;|N|AMP|>>>HERE>>>|+)sB.'. code: 8 ...
Any idea on the fix?
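You should replace \b with \y (or \m) to fix the pattern: in PostgreSQL regular expressions, \y matches at the beginning or end of a word and \m only at the beginning, while \b means a backspace character. You may also put the single chars from the capturing group into a character class, where you do not have to escape them: (&|\*|\+) -> [*+&]. Note that you do not need .* at the end when you only test for a match with ~, since that operator does not require a full string match.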
Use
'\yD\s*(AND|&AMP;|N|AMP|[*+&])\s*B'
See the online demo:
CREATE TABLE tb1
(website character varying)
;
INSERT INTO tb1
(website)
VALUES
('D AND B...'),
('ROCK''N''ROLL'),
('www.google.com'),
('More text here'),
('D N Brother')
;
SELECT * FROM tb1 WHERE website ~ '\yD\s*(AND|&AMP;|N|AMP|[*+&])\s*B';
Output