Postgres regex positive lookahead is not working as expected

I want to capture tokens in a text that follow this pattern:
The first 2 characters are letters and are required. The token optionally ends with [A-Z] or [A-Z][0-9]. Anything can come in between.
Examples:
AA123123A1
AA123123A
AA123123123
I want to match and capture:
the start ([A-Z][A-Z]) in group 1, the ending [A-Z] or [A-Z][0-9] in group 3, and everything between them in group 2.
Example:
AA123123A1 => [AA,123123,A1]
AA123123A => [AA,123123,A]
AA123123123 => [AA,123123123,'']
The following regex works in Python but not in Postgres:
regex='^([A-Za-z]{2})((?:.+)(?=[A-Za-z][0-9]{0,1})|(?:.*))([A-Za-z][0-9]{0,1}){0,1}$'
In PostgreSQL:
select regexp_matches('AA2311121A1',
'^([A-Za-z]{2})((?:.+)(?=[A-Za-z][0-9]{0,1})|(?:.*))(.*)$','x');
Result:
{AA,2311121A1,""}
I am trying to understand why positive lookahead does not behave the same as in Python, and how to make positive lookahead work in Postgres in this case.

You can use
^([A-Za-z]{2})(.*?)([A-Za-z][0-9]?)?$
Details:
^ - start of string
([A-Za-z]{2}) - Group 1: two ASCII letters
(.*?) - Group 2: any zero or more chars as few as possible
([A-Za-z][0-9]?)? - Group 3: an optional sequence of an ASCII letter and then an optional digit
$ - end of string.
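As a quick sanity check of how the three groups split, here is a minimal Python sketch that runs the pattern above on the sample tokens from the question. The helper name split_token is made up for illustration, and Python's re engine resolves mixed greedy and lazy quantifiers differently from Postgres, so the pattern should still be tested in the database itself.

```python
import re

# Pattern from the answer above, compiled with Python's re module.
TOKEN_RE = re.compile(r'^([A-Za-z]{2})(.*?)([A-Za-z][0-9]?)?$')

def split_token(token):
    """Hypothetical helper: return [group 1, group 2, group 3] or None.

    A missing optional group 3 is reported as an empty string.
    """
    m = TOKEN_RE.match(token)
    if m is None:
        return None
    return [g or '' for g in m.groups()]

for sample in ('AA123123A1', 'AA123123A', 'AA123123123'):
    print(sample, '=>', split_token(sample))
```

Because group 2 is lazy and the pattern is anchored with $, group 2 stops as soon as the optional letter-plus-digit tail can consume the rest of the string.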

Related

Why my Regex is only giving me ONE group back?

I'm currently having issues with a regex that I'm creating. The regex has to extract all the groups that say number #### between Hello and Regards. At the moment my regex only extracts one group and I need all the groups inside. In this case I have 2, but there may be more.
I'm using the web page https://regex101.com/
Flavor: PCRE (PHP)
Regex: Hello\s.*(number\s*[\d]*)\s.*Regards
Text:
This is my test text number 25120
Hello my name is testing
I'm 20 years old
Please help me with the regex number 1542
I have been trying to create the regex many times this is my number 5152
Regards
I'm still trying my attempt number 5150
Result:
My result is only the group number 5152, but there is another group inside, number 1542.
You may use
(?si)(?:\G(?!\A)|\bHello\b)(?:(?!\bHello\b).)*?\K\bnumber\s*\d+(?=.*?\bRegards\b)
See the regex demo.
Details
(?si) - s - DOTALL modifier making . match any chars, and i makes the pattern case insensitive
(?:\G(?!\A)|\bHello\b) - either the end of the previous match (\G(?!\A)) or (|) a whole word Hello (\bHello\b)
(?:(?!\bHello\b).)*? - any char, 0 or more times but as few as possible, that does not start a whole word Hello char sequence
\K - match reset operator that discards all text matched so far
\bnumber - a whole word number
\s* - 0+ whitespaces
\d+ - 1+ digits
(?=.*?\bRegards\b) - there must be a whole word Regards somewhere after any 0+ chars (as few as possible).
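The \G and \K operators used above are PCRE features and are not available in Python's standard re module. As a rough equivalent for engines without them, the same result can be sketched in two steps: first isolate the Hello ... Regards block, then collect every number inside it. This is a different technique from the single-regex answer, the function name numbers_between is made up for illustration, and it only looks at the first block.

```python
import re

text = """This is my test text number 25120
Hello my name is testing
I'm 20 years old
Please help me with the regex number 1542
I have been trying to create the regex many times this is my number 5152
Regards
I'm still trying my attempt number 5150"""

def numbers_between(text):
    """Hypothetical helper: 'number ####' items between Hello and Regards."""
    # Step 1: isolate the first Hello ... Regards block.
    # (?s) lets . cross line breaks, (?i) ignores case.
    block = re.search(r'(?si)\bHello\b(.*?)\bRegards\b', text)
    if block is None:
        return []
    # Step 2: collect every "number" followed by digits inside that block.
    return re.findall(r'(?i)\bnumber\s*\d+', block.group(1))

print(numbers_between(text))
```

The numbers before Hello (25120) and after Regards (5150) are left out because they fall outside the isolated block.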

Validating User Input While Typing using RegEx

I am struggling to write the RegEx for the following criteria:
The number can be positive / negative
Optional - at the start
Between 1 and 5 digits before the decimal point
2 decimal places only (optional)
Stop user from typing more than 1 . or -
This is the regex I have tried to implement which does not work for me.
^((-?[0-9]{1,5}(\.?){1,1}[0-9]{0,2})
It should allow the user to type out the following numbers.
-1.12
12345
1
123
12.12
Any help would be appreciated!
You may use
^-?\d{0,5}(?:(?<=\d)\.\d{0,2})?$
See the regex demo.
Details
^ - start of string
-? - an optional -
\d{0,5} - zero to five digits
(?:(?<=\d)\.\d{0,2})? - an optional sequence of
(?<=\d) - there must be a digit immediately to the left of the current location
\. - a dot
\d{0,2} - zero, one or two digits
$ - end of string.
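To see what the lookbehind buys, here is a small Python sketch that feeds partially typed values to the pattern above. The function name is_acceptable_so_far is made up for illustration, and Python's re module is assumed to be close enough to the engine actually used in the input field.

```python
import re

# Pattern from the answer above: a dot is only allowed right after a digit.
PARTIAL_RE = re.compile(r'^-?\d{0,5}(?:(?<=\d)\.\d{0,2})?$')

def is_acceptable_so_far(value):
    """Hypothetical helper: True if the partially typed value is still valid."""
    return PARTIAL_RE.match(value) is not None

for value in ('-', '12.', '-1.12', '.', '-.', '1.123', '123456'):
    print(repr(value), is_acceptable_so_far(value))
```

Values such as '-' and '12.' pass because the user may still be typing, while '.' and '-.' are rejected because no digit precedes the dot.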
If you want to validate while typing, you could make use of optional groups to accept intermediate values and do a final check on the whole pattern when processing the value.
^-?(?:\d{1,5}(?:\.\d{0,2})?)?$
Explanation
^ Start of string
-? Optional hyphen
(?: Non capture group
\d{1,5} Match 1-5 digits
(?: Non capture group
\.\d{0,2} Match a dot and 0-2 digits
)? Close group and make it optional
)? Close group and make it optional
$ End of string
To validate the final pattern, you could match an optional -, 1-5 digits and an optional decimal part:
^-?\d{1,5}(?:\.\d{1,2})?$
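The two-stage idea (a lenient pattern while typing, a strict one on submit) can be sketched in Python as follows. The names TYPING_RE and FINAL_RE are made up for illustration. The patterns are the two given in this answer.

```python
import re

# Lenient pattern: accepts intermediate input such as '-' or '12.'.
TYPING_RE = re.compile(r'^-?(?:\d{1,5}(?:\.\d{0,2})?)?$')
# Strict pattern: requires 1-5 digits and, if a dot is present, 1-2 decimals.
FINAL_RE = re.compile(r'^-?\d{1,5}(?:\.\d{1,2})?$')

for value in ('', '-', '12.', '12.1', '-1.12', '1.123'):
    ok_typing = TYPING_RE.match(value) is not None
    ok_final = FINAL_RE.match(value) is not None
    print(repr(value), 'typing:', ok_typing, 'final:', ok_final)
```

Every value the strict pattern accepts is also accepted by the lenient one, so the user can always type their way to a valid final value.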
The regex ^(-?(\d{1,5}(\.\d{0,2})?)?)$ should work if you want to match strings that end in . such as 123.
Otherwise, change the 0 to a 1 as follows: ^(-?(\d{1,5}(\.\d{1,2})?)?)$. Then it will only match strings that have a digit after the decimal point.
The regex that you posted allows strings with more than 2 digits after the decimal point because it stops matching after the 2 digits, even if the string continues. Adding a $ at the end of the regex stops it from matching strings that continue after the part we want.
This regex ^(-?\d{1,5}(\.\d{0,2})?)$ will validate the input once the user has finished typing, because I assume that you don't want a lone - to be valid at that point.
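The effect of the trailing $ can be illustrated with a short Python sketch. The unanchored pattern below is only a stand-in for a regex without $ (the regex in the question is not reproduced as written), and the names UNANCHORED and ANCHORED are made up for illustration.

```python
import re

# Same pattern with and without the trailing $ anchor.
UNANCHORED = re.compile(r'^(-?\d{1,5}(\.\d{0,2})?)')
ANCHORED = re.compile(r'^(-?\d{1,5}(\.\d{0,2})?)$')

value = '12.345'

# Without $, the match succeeds on a prefix and ignores the rest.
prefix = UNANCHORED.match(value)
print(prefix.group(0))

# With $, the whole string must fit, so the third decimal is rejected.
print(ANCHORED.match(value))
```

Without the anchor the engine is satisfied once a prefix of the input fits, which is why extra decimals slip through.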

Why's this postgres regexp_match giving me null instead of the regex groups?

This:
select regexp_matches('test text user:testuser,anotheruser hashtag:peach,phone,milk site:youtube.com,twitter.com flair:😂bobby😂', '^.*?(?=\s+[^:\s]+:)|([^:\s]+):([^:\s]+)','gi');
gives me only one group match and a row with NULL:
regexp_matches
-----------------
{NULL,NULL}
{flair,😂bobby😂}
It works fine when I test it here:
https://regex101.com/r/AxsatL/3
What am I doing wrong?
You may use
'^(?:(?!\s+[^:\s]+:).)*|[^:\s]+:[^:\s]+'
The point here is to keep all quantifiers greedy and remove all capturing parentheses.
The ^(?:(?!\s+[^:\s]+:).)* part will match - from the start of the string - any char, 0 or more occurrences, that does not start a sequence of the following patterns: 1+ whitespaces, 1+ chars other than : and whitespace and then a :.
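To see which pieces the pattern picks out, here is a minimal Python sketch using re.findall. With no capturing groups, findall returns the whole match each time. This only illustrates the pattern's logic: Python's engine is not the Postgres one, so the query itself should still be run in the database.

```python
import re

text = ('test text user:testuser,anotheruser hashtag:peach,phone,milk '
        'site:youtube.com,twitter.com flair:😂bobby😂')

# Same pattern as above: free text at the start, or a key:value pair.
pattern = r'^(?:(?!\s+[^:\s]+:).)*|[^:\s]+:[^:\s]+'

# No capturing groups, so each list item is a full match.
parts = re.findall(pattern, text)
for part in parts:
    print(part)
```

The first item is the leading free text, which stops right before the whitespace that introduces the first key:value pair.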
Online test:
select regexp_matches(
'test text user:testuser,anotheruser hashtag:peach,phone,milk site:youtube.com,twitter.com flair:😂bobby😂',
'^(?:(?!\s+[^:\s]+:).)*|[^:\s]+:[^:\s]+',
'gi'
);
Result:

REGEXP_REPLACE for exact regex pattern, not working

I'm trying to match an exact pattern to do some data cleanup for ISSNs using the code below:
select case when REGEXP_REPLACE('1234-5678 ÿþT(zlsd?k+j''fh{l}x[a]j).,~!##$%^&*()_+{}|:<>?`"\;''/-', '([0-9]{4}[\-]?[Xx0-9]{4})(.*)', '$1') not similar to '[0-9]{4}[\-]?[Xx0-9]{4}' then 'NOT' else 'YES' end
The pattern I want should match any 8-digit group with an optional dash in the middle and a possible X at the end.
The code above works for most cases, but if capture group 1 is the following example: 123456789 then it also returns positive because it matches the first 8 digits, and I don't want it to.
I tried surrounding capture group 1 with ^...$ but that doesn't work either.
So I would like to match exactly these examples and similar ones:
1234-5678
1234-567X
12345678
1234567X
BUT NOT THESE (and similar):
1234567899
1234567899x
What am I missing?
You may use
^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$
See the regex demo
Details
^ - start of string
([0-9]{4}-?[Xx0-9]{4}) - Capturing group 1 ($1): four digits, an optional -, and then four chars, each of which is x, X or a digit
([^0-9].*)? - an optional Capturing group 2: any char other than a digit and then any 0+ chars as many as possible
$ - end of string.
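A minimal Python sketch of the same check, run on the examples from the question. The helper name extract_issn is made up for illustration, and Python's re module stands in for the database's regex engine here, so behavior should be confirmed there.

```python
import re

# Pattern from the answer: 8 ISSN chars, then either nothing or a non-digit
# followed by anything.
ISSN_RE = re.compile(r'^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$')

def extract_issn(raw):
    """Hypothetical helper: return the ISSN part, or None if it does not fit."""
    m = ISSN_RE.match(raw)
    return m.group(1) if m else None

for raw in ('1234-5678', '1234-567X', '12345678', '1234567X',
            '1234-5678 trailing junk', '1234567899', '1234567899x'):
    print(repr(raw), '->', extract_issn(raw))
```

The last two inputs are rejected because the character right after the eighth ISSN character is a digit, so neither the optional group nor $ can match there.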

Find matches ending with a letter that is not a starting letter of the next match

Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two-to-four numbers then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, never followed by a number. You can write a negative lookahead to capture this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know how R will handle it.
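For what it is worth, the same pattern can be tried in Python, whose re module supports negative lookahead. This is only a sketch on the example string from the question. It says nothing about R.

```python
import re

mystring = 'F328AG560F33'

# Optional trailing word char is only taken when no digit follows it.
codes = re.findall(r'\w\d{2,4}(?:\w(?!\d))?', mystring)
print(codes)
```

Note that \w also matches digits and underscores, so on messier input [A-Za-z] may be the safer choice for the letter positions.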
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
See the regex demo
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"
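The same capture-inside-a-lookahead technique can be sketched in Python for comparison. When the pattern has exactly one capturing group, re.findall returns that group for every position where the lookahead succeeds. The variable names below are made up for illustration.

```python
import re

mystring = 'F328AG560F33'

# The lookahead consumes nothing, so it is attempted at every position.
# Group 1 inside it holds the code that starts at that position.
pattern = r'(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))'

codes = re.findall(pattern, mystring)
print(codes)
```

Positions that do not start with a letter followed by at least two digits simply produce no match, so only the three codes are returned.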