REGEXP_REPLACE for exact regex pattern, not working

I'm trying to match an exact pattern to do some data cleanup for ISSNs using the code below:
select case when REGEXP_REPLACE('1234-5678 ÿþT(zlsd?k+j''fh{l}x[a]j).,~!##$%^&*()_+{}|:<>?`"\;''/-', '([0-9]{4}[\-]?[Xx0-9]{4})(.*)', '$1') not similar to '[0-9]{4}[\-]?[Xx0-9]{4}' then 'NOT' else 'YES' end
The pattern should match any 8-digit group with an optional dash in the middle and an optional X at the end.
The code above works for most cases, but if capture group 1 is the following example: 123456789 then it also returns positive because it matches the first 8 digits, and I don't want it to.
I tried surrounding capture group 1 with ^...$ but that doesn't work either.
So I would like to match exactly these examples and similar ones:
1234-5678
1234-567X
12345678
1234567X
BUT NOT THESE (and similar):
1234567899
1234567899x
What am I missing?

You may use
^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$
Details
^ - start of string
([0-9]{4}-?[Xx0-9]{4}) - Capturing group 1 ($1): four digits, an optional -, and then four x / X or digits
([^0-9].*)? - an optional Capturing group 2: any char other than a digit and then any 0+ chars as many as possible
$ - end of string.
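As a rough check of the pattern outside the database, here is a minimal Python sketch. The helper name extract_issn is hypothetical, and Python's re module only stands in for the SQL engine's regex flavor, which may differ in details such as the replacement syntax.

```python
import re

# Pattern from the answer above: group 1 is the ISSN candidate,
# group 2 is optional trailing text that must start with a non-digit.
ISSN_RE = re.compile(r'^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$')

def extract_issn(text):
    """Return the ISSN-like prefix of text, or None if the pattern does not match."""
    m = ISSN_RE.match(text)
    return m.group(1) if m else None

print(extract_issn("1234-5678 trailing junk"))  # 1234-5678
print(extract_issn("1234567X"))                 # 1234567X
print(extract_issn("1234567899"))               # None (a ninth digit follows)
```

Because group 2 must begin with a non-digit, a ninth digit can neither be absorbed by group 2 nor skipped, so the whole match fails instead of silently keeping the first eight digits.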

Using regex replacement in Sublime 3

I am trying to use replace in Sublime using regular expressions but I'm stuck. I tried various combinations but don't seem to be getting there.
This is the input and my desired output:
Input: N_BBP_c_46137_n
Output : BBP
I tried combinations of:
[^BBP]+\b
\*BBP*+\g
But none of the above (nor the many others I tried) seem to work.
To turn N_BBP_c_46137_n into BBP (according to the comment, the entire long name starting with N_BBP_ should be replaced by only BBP), you might also use a capture group to keep BBP.
\bN_(BBP)_\S*
\bN_ Match N preceded by a word boundary
(BBP) Capture group 1, match BBP (or use [A-Z]+ to match 1+ uppercase chars)
_\S* Match _ followed by 0+ times a non whitespace char
In the replacement use the first capturing group $1
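A minimal Python sketch of the same replacement, assuming Python's re module in place of Sublime's engine; note that Python writes the group reference as \1 rather than $1.

```python
import re

# \bN_(BBP)_\S* consumes the whole name and keeps only the captured BBP.
result = re.sub(r'\bN_(BBP)_\S*', r'\1', 'N_BBP_c_46137_n')
print(result)  # BBP
```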
You may use
(N_)[^_]*(_c_\d+_n)
Replace with ${1}some new value$2.
Details
(N_) - Group 1 ($1 or ${1} if the next char is a digit): N_
[^_]* - any 0 or more chars other than _
(_c_\d+_n) - Group 2 ($2): _c_, 1 or more digits and then _n.
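The two-group replacement can be sketched in Python as follows. The replacement value 42 is an arbitrary placeholder chosen to show why the braced form matters; Python spells ${1} as \g<1>.

```python
import re

name = 'N_BBP_c_46137_n'
# \g<1> stays unambiguous even though the replacement text starts with a digit;
# a bare \1 followed by "42" would be read as a reference to group 142.
renamed = re.sub(r'(N_)[^_]*(_c_\d+_n)', r'\g<1>42\g<2>', name)
print(renamed)  # N_42_c_46137_n
```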

Validating User Input While Typing using RegEx

I am struggling to write the RegEx for the following criteria:
The number can be positive / negative
Optional - at the start
Between 1 and 5 numbers before the decimal point
2 decimal places only (optional)
Stop user from typing more than 1 . or -
This is the regex I have tried to implement which does not work for me.
^((-?[0-9]{1,5}(\.?){1,1}[0-9]{0,2})
It should allow the user to type out the following numbers.
-1.12
12345
1
123
12.12
Any help would be appreciated!
You may use
^-?\d{0,5}(?:(?<=\d)\.\d{0,2})?$
Details
^ - start of string
-? - an optional -
\d{0,5} - zero to five digits
(?:(?<=\d)\.\d{0,2})? - an optional sequence of
(?<=\d) - there must be a digit immediately to the left of the current location
\. - a dot
\d{0,2} - zero, one or two digits
$ - end of string.
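To see which partial inputs this pattern lets through, here is a small Python sketch. The function name accepts is hypothetical, and Python's re module is assumed to behave like the flavor used in the answer for this pattern.

```python
import re

# The lookbehind (?<=\d) only allows the dot when a digit precedes it.
TYPING_RE = re.compile(r'^-?\d{0,5}(?:(?<=\d)\.\d{0,2})?$')

def accepts(text):
    """True if text is a valid (possibly incomplete) input under the pattern."""
    return TYPING_RE.match(text) is not None

for sample in ['', '-', '1.', '-1.12', '.', '-.5', '1.123', '123456']:
    print(repr(sample), accepts(sample))
```

The first four samples are accepted, the last four are rejected: a dot with no digit before it fails the lookbehind, and a third decimal or sixth integer digit fails the quantifiers.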
If you want to validate while typing, you could make use of optional groups to accept intermediate values and do a final check on the whole pattern when processing the value.
^-?(?:\d{1,5}(?:\.\d{0,2})?)?$
Explanation
^ Start of string
-? Optional hyphen
(?: Non capture group
\d{1,5} Match 1-5 digits
(?: Non capture group
\.\d{0,2} Match a dot and 0-2 digits
)? Close group and make it optional
)? Close group and make it optional
$ End of string
To validate the final pattern, you could match an optional -, 1-5 digits and an optional decimal part:
^-?\d{1,5}(?:\.\d{1,2})?$
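The split between a lenient while-typing check and a strict final check could look like this in Python. The function names are hypothetical and the wiring to an actual input field is left out.

```python
import re

# Lenient pattern: accepts intermediate states such as "", "-" or "12."
TYPING = re.compile(r'^-?(?:\d{1,5}(?:\.\d{0,2})?)?$')
# Strict pattern: requires at least one digit, and a digit after any dot
FINAL = re.compile(r'^-?\d{1,5}(?:\.\d{1,2})?$')

def ok_while_typing(text):
    return TYPING.match(text) is not None

def ok_on_submit(text):
    return FINAL.match(text) is not None

for sample in ['-', '12.', '-1.12', '12345', '1..2', '12.123']:
    print(repr(sample), ok_while_typing(sample), ok_on_submit(sample))
```

Inputs like "-" and "12." pass the typing check but fail on submit, which is the intended difference between the two patterns.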
The regex ^(-?(\d{1,5}(\.\d{0,2})?)?)$ should work if you also want to match strings that end in a dot, such as "123.".
Otherwise, change the 0 to a 1 as follows: ^(-?(\d{1,5}(\.\d{1,2})?)?)$. Then it will only match strings that have a digit after the decimal point.
The regex that you posted allows strings with more than 2 digits after the decimal point because it stops matching after the 2 digits, even if the string continues. Adding a $ at the end of the regex stops it from matching strings that continue after the part we want.
This regex ^(-?\d{1,5}(\.\d{0,2})?)$ will validate the input once the user has finished typing, because I assume that you don't want a lone - to be valid at that point.

REGEX Capturing differing sets of repeating groups

This is a two-part question, but I feel the answers will be related.
I have this regex pattern:
(\d+)(aa|bb) which I use to capture this string: 1bb2aa3aa4bb5bb6aa7bb8cc9cc
See demo: example 1
The way it captures the random series of aa and bb (both preceded by a digit) is exactly what I want, and is good as far as it goes.
So we get this match on regex101:
Match 1
Full match 0-3 `1bb`
Group 1. 0-1 `1`
Group 2. 1-3 `bb`
Match 2
Full match 3-6 `2aa`
Group 1. 3-4 `2`
Group 2. 4-6 `aa`
Match 3
Full match 6-9 `3aa`
Group 1. 6-7 `3`
Group 2. 7-9 `aa`
Match 4
Full match 9-12 `4bb`
Group 1. 9-10 `4`
Group 2. 10-12 `bb`
Match 5
Full match 12-15 `5bb`
Group 1. 12-13 `5`
Group 2. 13-15 `bb`
Match 6
Full match 15-18 `6aa`
Group 1. 15-16 `6`
Group 2. 16-18 `aa`
Match 7
Full match 18-21 `7bb`
Group 1. 18-19 `7`
Group 2. 19-21 `bb`
As expected, the 8cc9cc bit at the end is ignored. I would like to capture this as well, in the same way I have captured the first repeating groups, in the same expression. So in the final output, I'd get something like this added to the end of the output. This should work for any number of matches on either side. This text is just one example.
Match 8
Full match 21-24 `8cc`
Group 1. 21-22 `8`
Group 2. 22-24 `cc`
Match 9
Full match 24-27 `9cc`
Group 1. 24-25 `9`
Group 2. 25-27 `cc`
Also, I'd like to do similar but flipping the 'or' group to the end i.e. this:
1cc2cc3cc4cc5cc6cc7ccb8aa9bb
My current regex pattern (\\d+)(cc) only matches the repeating 'cc' groups.
See demo: example 2
I would like a similar full capture, with any amount of permissible entries of each group.
Any thoughts?
You may use
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+))(\d+)(aa|bb|cc)
The regex will only match a string that meets the pattern in the (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) lookahead, and will then consecutively match and capture digits followed by aa, bb or cc. Digits + aa or bb are only matched when no digit + cc immediately precedes them.
Details
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+)) - either of the two alternatives:
\G(?!^) - end of the previous successful match
(?(?=\d+(?:aa|bb))(?<!\dcc)) - conditional (if-then) construct: if there are 1+ digits followed by aa or bb immediately to the right of the current location ((?=\d+(?:aa|bb))), then only continue matching if there is no digit followed by cc immediately to the left of the current location ((?<!\dcc))
| - or
^ - start of string
(?=(?:\d+(?:aa|bb))+(?:\d+cc)+) - a positive lookahead that, immediately to the right of the current location, searches for the following (and returns true if it finds the patterns, or false if it does not):
(?:\d+(?:aa|bb))+ - one or more occurrences of 1+ digits followed with aa or bb
(?:\d+cc)+ - one or more occurrences of 1+ digits followed with cc
(\d+) - Group 1: one or more digits
(aa|bb|cc) - aa, bb or cc.
For the second pattern, replace cc with (?:aa|bb):
(?:\G(?!^)(?(?=\d+cc)(?<!\d(?:aa|bb)))|(?=(?:\d+cc)+(?:\d+(?:aa|bb))+))(\d+)(aa|bb|cc)
I'm no expert with Perl, so I'll give a bit of pseudocode here. Feel free to suggest an edit.
You can start by matching any number of xaa or xbb combos, followed by one or more xcc combos using this pattern: ^(?:\d+(?:aa|bb))+(?:\dcc)+$
Once you have that you can use this pattern to capture the appropriate groups: (\d+)(aa|bb|cc)
Something like:
if(ismatch("^(?:\d+(?:aa|bb))+(?:\dcc)+$", inputString))
{
match = match("(\d+)(aa|bb|cc)", inputString);
}
From here you can extract the information using the groups.
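The pseudocode above might translate to Python roughly as follows. This is a sketch, not the answerer's code: the function name capture_pairs is hypothetical, and the validation pattern uses \d+cc instead of \dcc, on the assumption that multi-digit numbers should be allowed before cc as they are before aa and bb.

```python
import re

# Step 1: validate the overall shape (aa/bb entries first, then cc entries).
# Assumption: \d+cc rather than the \dcc of the pseudocode.
VALID = re.compile(r'^(?:\d+(?:aa|bb))+(?:\d+cc)+$')
# Step 2: pull out every (digits, suffix) pair.
PAIR = re.compile(r'(\d+)(aa|bb|cc)')

def capture_pairs(text):
    """Return a list of (digits, suffix) tuples, or [] if the shape is wrong."""
    if not VALID.match(text):
        return []
    return PAIR.findall(text)

for digits, suffix in capture_pairs('1bb2aa3aa4bb5bb6aa7bb8cc9cc'):
    print(digits, suffix)
```

For the flipped case in the question, the same two steps apply with the aa|bb and cc parts of the validation pattern swapped.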

Find matches ending with a letter that is not a starting letter of the next match

Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two-to-four numbers then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, never followed by a number. You can write a negative lookahead to capture this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know about how R will handle it.
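As a quick check outside R, here is the same pattern in a minimal Python sketch, assuming Python's re module handles this lookahead the way PCRE does.

```python
import re

# The optional trailing letter is kept only when it is not followed by a digit,
# i.e. when it cannot be the first character of the next code.
codes = re.findall(r'\w\d{2,4}(?:\w(?!\d))?', 'F328AG560F33')
print(codes)  # ['F328A', 'G560', 'F33']
```

Note that \w also matches digits and underscores, so replacing it with a letter class such as [A-Z] may be safer on unvalidated data.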
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"

Regex for 5 digit number with optional characters

I am trying to create a regex to validate a field where the user can enter a 5 digit number with the option of adding a / followed by 3 letters. I have tried quite a few variations of the following code:
^(\d{5})+?([/]+[A-Z]{1,3})?
But I just can't seem to get what I want.
For instance, I would like the user to enter either a 5 digit number such as 12345, or that number followed by a forward slash and any 3 letters, such as 12345/WFE.
You probably want:
^\d{5}(?:/[A-Z]{3})?$
You might have to escape that forward slash depending on your regex flavor.
Explanation:
^ - start of string anchor
\d{5} - 5 digits
(?:/[A-Z]{3}) - non-capturing group consisting of a literal / followed by 3 uppercase letters (depending on your needs you could consider making this a capturing group by removing the ?:).
? - 0 or 1 of what precedes (in this case that's the non-capturing group directly above).
$ - end of string anchor
All in all, the regex looks like this:
^\d{5}(?:/[A-Z]{3})?$
Here it is in practice (this is a great site to test your regexes):
http://regexr.com?36h9m
You can use this regex
/^\d{5}(?:\/[a-zA-Z]{3})?$/
^(\d{5})(\/[A-Z]{3})?
Tested in rubular
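To compare the anchored patterns suggested above, here is a minimal Python sketch. The names STRICT, ANY_CASE and is_valid are hypothetical, and Python needs no escaping for the forward slash.

```python
import re

STRICT = re.compile(r'^\d{5}(?:/[A-Z]{3})?$')       # uppercase suffix only
ANY_CASE = re.compile(r'^\d{5}(?:/[a-zA-Z]{3})?$')  # upper or lower case suffix

def is_valid(text, pattern=STRICT):
    """True if text is 5 digits, optionally followed by / and 3 letters."""
    return pattern.match(text) is not None

for sample in ['12345', '12345/WFE', '12345/WF', '123456', '12345/wfe']:
    print(sample, is_valid(sample), is_valid(sample, ANY_CASE))
```

Without the trailing $ anchor, an input such as 123456 would still match on its first five digits, which is why the anchored forms are the safer choice for field validation.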