I have strings of 010xxx, 011xxx, 110xxx, 111xxx, Q10xxx, Q11xxx in a field along with other values that are not similar. They might be XyzABC.
I have two regex patterns that separately give results that are good: [1Q]_[0-9]% and 0_[1-9]%
In words return true if
first letter is 1 or Q and the 3rd letter is a 0-9
OR
the first letter is 0 and the third letter is 1-9
How do I create a search pattern that does the OR either using SIMILAR TO or regex?
One version that works by itself is:
SELECT field FROM db WHERE field SIMILAR TO '[1Q]_[0-9]%'
I'm not wedded to SIMILAR TO or regex. They were just what I could get working until I tried to OR them. Open to other suggestions.
You can use a SIMILAR TO pattern like
WHERE field SIMILAR TO '([1Q]_[0-9]|0_[1-9])%'
The SIMILAR TO pattern requires a full string match, so the pattern means: start with 1 or Q, then any char, then any digit, or start with 0, any char and a non-zero digit, and then there can be any 0 or more chars afterwards.
You can also use a regex like
WHERE field ~ '^(?:[1Q].[0-9]|0.[1-9])'
Details:
^ - start of string
(?: - start of a non-capturing group:
[1Q].[0-9] - 1 or Q, any char and a digit
| - or
0.[1-9] - 0, any char and a non-zero digit
) - end of a non-capturing group.
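If you want to sanity-check the alternation outside the database, a small Python sketch can help. Note the assumptions: it uses Python's `re` module rather than PostgreSQL's regex engine (the two should agree on a pattern this simple, but verify against your server), and the sample values are made up from the question's description.

```python
import re

# Same pattern as in the `~` query above.
PATTERN = re.compile(r'^(?:[1Q].[0-9]|0.[1-9])')

# Hypothetical sample values based on the question's description.
samples = ['010xxx', '011xxx', '110xxx', '111xxx', 'Q10xxx', 'Q11xxx', 'XyzABC']

# Map each value to whether the pattern matches at the start of the string.
results = {value: bool(PATTERN.match(value)) for value in samples}

for value, matched in results.items():
    print(value, matched)
```

With these samples, 010xxx is rejected because its third character is 0, and XyzABC is rejected because it starts with neither 0, 1 nor Q.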
I want to expect some characters only if a prior regex matched. If not, no characters (empty string) is expected.
For instance, if after the first four characters appears a string out of the group (A10, B32, C56, D65) (kind of enumeration) then a "_" followed by a 3-digit number like 123 is expected. If no element of the mentioned group appears, no other string is expected.
My first attempt was this but the ELSE branch does not work:
^XXX_(?<DT>A12|B43|D14)(?(DT)(_\d{1,3})|)\.ZZZ$
XXX_A12_123.ZZZ --> match
XXX_A11.ZZZ --> match
XXX_A12_abc.ZZZ --> no match
XXX_A23_123.ZZZ --> no match
These are examples of filenames. If the filename contains a string of the mentioned group like A12 or C56, then I expect that this element is followed by an underscore followed by 1 to 3 digits. If the filename does not contain a string of that group (no character or a character sequence different from the strings in the group) then I don't want to see the underscore followed by 1 to 3 digits.
For instance, I could extend the regex to
^XXX_(?<DT>A12|B43|D14)_\d{5}(?(DT)(_\d{1,3})|)_someMoreChars\.ZZZ$
...and then I want these filenames to be valid:
XXX_A12_12345_123_wellDone.ZZZ
XXX_Q21_00000_wellDone.ZZZ
XXX_Q21_00000_456_wellDone.ZZZ
...but this is invalid:
XXX_A12_12345_wellDone.ZZZ
How can I make the ELSE branch of the conditional statement work?
In the end I intend to have two groups like
Group A: (A11, B32, D76, R33)
Group B: (A23, C56, H78, T99)
If an element of group A occurs in the filename then I expect to find _\d{1,3} in the filename.
If an element of group B occurs in the filename then the _\d{1,3} shall be optional (it may or may not occur in the filename).
I ended up in this regex:
^XXX_(?:(?<DT>A12|B43|D14))?(?(DT)(_\d{5}_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).*\.ZZZ$
^XXX_(?:(?<DT>A12|B43|D14))?_\d{5}(?(DT)(_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).+\.ZZZ$
Since I have to use this regex in the OpenAPI @Pattern annotation, I have the problem that I get the error:
Conditionals are not supported in this regex dialect.
As @The fourth bird suggested, alternation seems to do the trick:
XXX_((((A12|B43|D14)_\d{5}_\d{1,3}))|((?:(A10|B10|C20)((?:_\d{5}_\d{3})|(?:_\d{3}))))).*\.ZZZ$
The else branch is the part after the |, but if you also want to match the 2nd example, the if clause would not work as you have already matched one of A12|B43|D14
The named capture group is not optional, so the if clause will always be true.
What you can do instead is use an alternation to match either the numeration part followed by an underscore and 3 digits, or match an uppercase char and 2 digits.
^XXX_(?:(?<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
If you want to make use of the if/else clause, you can make the named capture group optional, and then check whether group DT was set.
^XXX_(?<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
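To compare the two variants against the four filenames from the question, here is a minimal Python sketch. It assumes Python's `re` module, which spells named groups as `(?P<DT>...)` and does support `(?(DT)yes|no)` conditionals; other dialects may not.

```python
import re

# Alternation variant: no conditional needed.
ALTERNATION = re.compile(r'^XXX_(?:(?P<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$')

# Conditional variant: DT is optional, and the branch depends on whether it was set.
CONDITIONAL = re.compile(r'^XXX_(?P<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$')

# Filenames taken from the question.
filenames = ['XXX_A12_123.ZZZ', 'XXX_A11.ZZZ', 'XXX_A12_abc.ZZZ', 'XXX_A23_123.ZZZ']

def check(pattern, names):
    """Return a dict of filename -> True/False for the given compiled pattern."""
    return {name: bool(pattern.match(name)) for name in names}

alt_results = check(ALTERNATION, filenames)
cond_results = check(CONDITIONAL, filenames)

for name in filenames:
    print(name, alt_results[name], cond_results[name])
```

Both variants accept the first two filenames and reject the last two, so the alternation is a reasonable fallback where conditionals are unavailable.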
For the updated question:
^XXX_(?<DT>A12|B43|D14)?(?(DT)(?:_\d{5})?_\d{3}(?!\d)|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d))).*\.ZZZ$
The pattern matches:
^ Start of string
XXX_ Match literally
(?<DT>A12|B43|D14)?
(?(DT) If we have group DT
(?:_\d{5})? Optionally match _ and 5 digits
_\d{3}(?!\d) Match _ and 3 digits
| Or
(?! Negative lookahead, assert not to the right
A12|B43|D14| Match one of the alternatives, or
[A-Z]\d{2}_\d{3}(?!\d) Match 1 char A-Z, 2 digits _ 3 digits not followed by a digit
) Close lookahead
) Close if clause
.* Match the rest of the line
\.ZZZ Match . and ZZZ
$ End of string
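The pattern for the updated question can be checked against the filenames listed there with a short Python sketch. Again this assumes Python's `re` dialect, with `(?P<DT>...)` in place of `(?<DT>...)`; the filenames are the ones from the question.

```python
import re

# The pattern above, split over several lines for readability.
PATTERN = re.compile(
    r'^XXX_(?P<DT>A12|B43|D14)?'
    r'(?(DT)(?:_\d{5})?_\d{3}(?!\d)'             # DT set: _ and 3 digits are required
    r'|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d)))'  # DT not set: guard with a lookahead
    r'.*\.ZZZ$'
)

# Filenames taken from the updated question.
filenames = [
    'XXX_A12_12345_123_wellDone.ZZZ',
    'XXX_Q21_00000_wellDone.ZZZ',
    'XXX_Q21_00000_456_wellDone.ZZZ',
    'XXX_A12_12345_wellDone.ZZZ',
]

results = {name: bool(PATTERN.match(name)) for name in filenames}

for name, matched in results.items():
    print(name, matched)
```

The first three filenames match and the last one does not, which is the outcome the question asks for.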
I have the following string:
"Thu Dec 31 22:00:00 UYST 2009"
I want to replace everything except for the hours and minutes so I get the following result:
"22:00"
I am using this regex :
(^([0-9][0-9]:[0-9][0-9]))
But it's not matching anything.
This would be my line of actual code :
println("Thu Dec 31 22:00:00 UYST 2009".replace("(^([0-9][0-9]:[0-9][0-9]))".toRegex(),""))
Can someone help me to correct the regex?
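The reason the one you have isn't working is that you are asserting that the line starts right before the hours and minutes, which isn't the case. This can be fixed by removing the assertion (^). Note that replace(..., "") deletes whatever the regex matches, so to keep only the time you need to either match everything else or extract the match instead of replacing it.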
If you need the assertion to remain, there is another way. In most languages, you wouldn't be able to use a variable-length positive lookbehind here, but lucky for you, it looks like you can in Kotlin.
A positive lookbehind is basically just telling the pattern "this comes before what I'm looking for". It's denoted by a group beginning with ?<=. In this case, you can use something like (?<=^[\w ]+). This will match all word characters or spaces between the beginning of the line and where the pattern that comes after it is able to match. Prepending it to your expression would look something like (?<=^[\w ]+)([0-9][0-9]:[0-9][0-9]) (note that you will have to escape the backslash in \w for it to be valid inside a regular string literal).
Side note, Yogesh_D is correct in saying that \d\d:\d\d is the same as your [0-9][0-9]:[0-9][0-9]. Using this, it would look more like (?<=^[\w ]+)\d\d:\d\d.
You may use various solutions, here are two:
val text = """Thu Dec 31 22:00:00 UYST 2009"""
val match = """\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b""".toRegex().find(text)
println(match?.value)
val match2 = """\b(\d{1,2}:\d{2}):\d{2}\b""".toRegex().find(text)
println(match2?.groupValues?.getOrNull(1))
Both return 22:00.
The regex complexity should be selected based on how messy the input string is.
Details
\b - a word boundary
(?:0?[1-9]|1\d|2[0-3]) - an optional zero and then a non-zero digit, or 1 and any digit, or 2 and a digit from 0 to 3
: - a : char
[0-5]\d - 0, 1, 2, 3, 4 or 5 and then any one digit
\b - a word boundary.
If there is a match with this regex, you get it as a whole match, so you can access it via match?.value.
If you do not have to worry about any pre-validation when matching, you may simply match 3 colon-separated digit pairs and capture the first two, see the second regex:
\b - a word boundary
(\d{1,2}:\d{2}) - Group 1: one or two digits, : and two digits
:\d{2} - a : and two digits (not captured)
\b - a word boundary.
If there is a match, we need Group 1 value, hence match2?.groupValues?.getOrNull(1) is used.
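For comparison, the same two approaches can be sketched in Python. This is only an illustration under the assumption that Python's `re` treats these two patterns the same way Kotlin does; the patterns themselves are unchanged.

```python
import re

text = "Thu Dec 31 22:00:00 UYST 2009"

# Regex #1: restricts the hour and minute ranges; the whole match is the result.
match1 = re.search(r'\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b', text)
hhmm1 = match1.group(0) if match1 else None

# Regex #2: matches hh:mm:ss and captures only hh:mm in group 1.
match2 = re.search(r'\b(\d{1,2}:\d{2}):\d{2}\b', text)
hhmm2 = match2.group(1) if match2 else None

print(hhmm1)
print(hhmm2)
```

Both variables hold 22:00 for this input. The None fallback plays the same role as the safe-call operators in the Kotlin version.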
I am not sure what language you are using but why use negation when you can directly match the first digits in the hh:mm format.
Assuming that the date string format always is in the format with a hh:mm in there.
This regex snippet should have the first group match the hh:mm.
https://regex101.com/r/aHdehZ/1
The regex to use is (\d\d:\d\d)
I have a string looking like this (stored as an Event Action value from Google Analytics)
0+171235652++zu
or
122+115166747++en
I would like (with the use of calculated fields) to create a new field that will show only the number before the 1st '+' character. So in those examples above
0 or 122
What I tried was (below), but it did not help, Any ideas?
REGEXP_REPLACE(Event Action, '(^\\+).*', '')
You may use
REGEXP_EXTRACT(Event Action, '^([^+]+)')
The regex matches:
^ - start of string
([^+]+) - Capturing group 1: any one or more chars other than a + (you may use ([^+]*) if you want to also get empty match when a + is the first char).
If you want a replacement function, you may use
REGEXP_REPLACE(Event Action,"[+].*","")
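To try both patterns outside the reporting tool, here is a small Python sketch. It assumes Python's `re` behaves like the tool's regex engine for these simple patterns; the helper names `before_first_plus` and `strip_from_first_plus` are made up for the illustration.

```python
import re

def before_first_plus(value: str) -> str:
    """Extraction approach: capture everything before the first '+'."""
    m = re.match(r'^([^+]+)', value)
    return m.group(1) if m else ''

def strip_from_first_plus(value: str) -> str:
    """Replacement approach: delete the first '+' and everything after it."""
    return re.sub(r'[+].*', '', value)

# Sample values from the question.
samples = ['0+171235652++zu', '122+115166747++en']

extracted = [before_first_plus(s) for s in samples]
replaced = [strip_from_first_plus(s) for s in samples]

print(extracted)
print(replaced)
```

Both approaches give 0 and 122 for the two sample values.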
The pattern you tried, (^\\+).*, did not work because the part ^\\+ matches a literal plus sign at the very start of the string. Your values start with digits, so the pattern never matches and nothing is replaced.
If what comes before the first plus sign should be digits and the plus sign itself should be present, you could capture the leading digits followed by matching the plus sign followed by the rest of the string.
Use group 1 using \\1 in the replacement.
^(\\d+)\\+.*
In parts
^ Start of string
(\\d+) Capture group 1, match 1 or more digits
\\+.* Match a + char and 0 or more times any char except a newline
Example code
REGEXP_REPLACE(Event Action, '^(\\d+)\\+.*', '\\1')
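A quick way to see the difference between the attempted pattern and this one is a Python sketch like the following. It assumes Python's `re` in place of the tool's engine; in a Python raw string the regex is written with single backslashes.

```python
import re

# Sample values from the question.
samples = ['0+171235652++zu', '122+115166747++en']

# The attempted pattern: requires a '+' at the very start, so it never matches here.
attempted = [re.sub(r'(^\+).*', '', s) for s in samples]

# The suggested pattern: keep group 1 (the leading digits), drop the rest.
suggested = [re.sub(r'^(\d+)\+.*', r'\1', s) for s in samples]

print(attempted)
print(suggested)
```

The attempted pattern leaves both strings unchanged, while the suggested one reduces them to 0 and 122.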
I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using this, but I'm getting only one digit in the last group when there should be two.
/([A-Z]+)([0-9a-z]+)([0-9]+)/
Capture exactly two digits in the last group:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
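The effect of the {2} quantifier can be seen in a short Python sketch. It assumes Python's `re` and a whole-string match via fullmatch; the helper name `middle` is made up for the illustration.

```python
import re

ORIGINAL = re.compile(r'([A-Z]+)([0-9a-z]+)([0-9]+)')
FIXED = re.compile(r'([A-Z]+)([0-9a-z]+)([0-9]{2})')

def middle(pattern, value):
    """Return group 2 (the part between the prefix and the trailing digits)."""
    m = pattern.fullmatch(value)
    return m.group(2) if m else None

# Sample values from the question.
samples = ['YM10a15b5c27', 'YM1b5c17']

original_middle = [middle(ORIGINAL, s) for s in samples]
fixed_middle = [middle(FIXED, s) for s in samples]

print(original_middle)  # greedy group 2 leaves only one digit for group 3
print(fixed_middle)     # group 3 is forced to take exactly two digits
```

With the original pattern the greedy middle group swallows one of the trailing digits (10a15b5c2 and 1b5c1); with {2} it stops at 10a15b5c and 1b5c.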
You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the string. This is really important, because if your code is aaaa1234aaaa, then without the ^, it would also match the aaaa at the end.
(?:([a-z]+)) is a non-capturing group wrapping capturing group 1, which matches one or more letters from 'a' to 'z'
(?=\1) is a positive lookahead: the match must be followed by the same text that group 1 captured at the start.
All you have to do is extract the code by group(2)
Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter, exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).
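A minimal Python sketch of the whole-string variant follows. The input strings are hypothetical (the kind of value the pattern seems to target: a doubled letter, a code, and the same doubled letter again), and the helper name `inner_code` is made up for the illustration.

```python
import re

# Whole-string variant of the pattern above.
WHOLE_STRING = re.compile(r'^(([a-z])\2)([0-9a-z]+)(\1)$')

def inner_code(value):
    """Return Group 3 (the part between the repeated markers), or None."""
    m = WHOLE_STRING.match(value)
    return m.group(3) if m else None

# Hypothetical inputs for illustration.
ok_1 = inner_code('aa1234aa')      # same doubled letter at both ends
ok_2 = inner_code('bb9x7bb')
bad_end = inner_code('aa1234bb')   # end marker differs from start marker
bad_start = inner_code('ab1234ab') # start is not a doubled letter

print(ok_1, ok_2, bad_end, bad_start)
```

The first two inputs yield 1234 and 9x7; the last two do not match at all.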
Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two-to-four numbers then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, never followed by a number. You can write a negative lookahead to capture this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know about how R will handle it.
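As a quick check outside PCRE, here is a Python sketch using the same pattern. It only shows how Python's `re` handles it and says nothing about R's engine.

```python
import re

mystring = 'F328AG560F33'

# Letter, 2-4 digits, then an optional letter that is not followed by a digit.
codes = re.findall(r'\w\d{2,4}(?:\w(?!\d))?', mystring)

print(codes)
```

For the example string this yields F328A, G560 and F33: the F after G560 is followed by a digit, so it is left for the next match.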
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"
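If R is not a hard requirement, the same lookahead-with-capture idea can be sketched in Python. This assumes Python's `re`, where findall returns the content of the single capturing group for every position at which the lookahead succeeds.

```python
import re

mystring = 'F328AG560F33'

# Zero-width lookahead at every position; Group 1 holds the code found there.
pattern = re.compile(r'(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))')

codes = pattern.findall(mystring)

print(codes)
```

This gives the same three codes as the R demo: F328A, G560 and F33.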