Find certain colons in string using Regex - regex

I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
Preceeded or followed by a word e.g A Book: Chapter 1 or A Book :Chapter 1
Do not match if it is part of emoticons i.e :( or ): or :/ or :-) etc
Do not match if it is part of a given time i.e 16:00 etc
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 2 & 3 but still fails on condition 3 as it matches the colon present in the string representation of time. How do I fix this?
edit: it has to be in a single regex statement if possible

You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
See the regex demo. Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and just matches what you do not need. This is called "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.

Word characters \w include numbers [a-zA-Z0-9_]
So just use [a-ZA-Z] instead
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)
Test Here

Related

regex match two words based on a matching substring

there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with above regex is that as soon as the value ends with any of the characters of "QUERY_answer", the regex doesn't match any value for the grouping. For instance, the below 2 strings doesn't match at all as the VALUESTU ends with "U" here :
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.

Trying to match zero outside the word bounderies

I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that matches Nzeros out side of these word boundaries?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z] but I think that [A-Z] is
more specific, as it matches only letters an not e.g. "_".
The next source char is _, so the regex can also include _.
A now the more diificult part: The second group to be captured has
actually 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a bit shorter way. Notice that both alternatives start
from [A-Z]+\d+.
So we can put this fragment at the first place and only the rest
include as a non-capturing group ((?:...)), with 2 alternatives.
All the above should be surrounded with a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digit and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # 5 capitals or digits
Demo

Regex to match a unlimited repeating pattern between two strings

I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using this but is getting one number in the end and should not.
/([A-Z]+)([0-9a-z]+)([0-9]+)/
Capture exactly two characters in the last group:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the sentence. This is really important, because if your code is aaaa1234aaaa, then without the ^, it would also match the aaaa of the end.
(?:([a-z]+)) is a non-capturing group which takes any letter from 'a' to 'z' as group 1
(?=\1) tells the regex to match the text as long as it is followed by the same code at the starting.
All you have to do is extract the code by group(2)
An example is shown here.
Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter, exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).

Extract information through regexp

I have a question about groups in a rule i created to extract dates from text.
Let's consider the following string:
fherfrefercr17hfeuetvbyeituew
The string is composed by everything at the beginning, then there is a number composed by one or two digits and then everything again. I need to extract only the number "17" from the string listed above.
With the following rule i extract only 7 and not 17.
.*(\d{1,2}).*
Can anyone help me with that please?
Overview
Given your pattern:
.*(\d{1,2}).*
This works in the following way:
.* Match any character any number of times
The quantifier here is considered to be greedy because it will match as many characters as possible so long as the pattern matches the string.
\d{1,2} Since your pattern says to match 1 or 2 digits and the previous token is greedy, the regex is just going to match a single digit because this still satisfies the pattern (the previous token stole the first digit).
Code
There are multiple ways you can fix this issue
Method 1
This will simply extract all numbers (1+ digits) from the string. If you want to only match 1 or two digits use \d\d? or \d{1,2} instead.
\d+
\d\d?
\d{1,2}
Method 2
This method turns the greedy quantifier * (in .*) into a lazy quantifier .*?. This will match any character any number of times, but as few as possible. The drawback to this method is that it's expensive because the engine needs to backtrack.
.*?\d{1,2}.*
Method 3
This method matches any non-digit character any number of times, then it matches one or two digits. This is likely the solution you're looking for.
\D*(\d{1,2}).*

Global regex with multiple matches, where the separator should be shared in several matches

First of all, sorry for the unclear title, it's hard to describe (and to find an existing solution for the same reason).
I use this regex in Javascript, to collect numbers in a string :
/(?:^|[^\d])([\d]+)(?:$|[^\d])/g
Executing it on "5358..2145" returns 2 matches, where the submatches are "5358" and "2145"
But if I use it on "5358.2145", I receive only 1 match : "5358"
So, I understand it so :
The first match is found ("5358.") so the point goes in the first match
What I want as second match is not preceded with start of string or the point because this point already belongs to the first match
How can I change my pattern to find all numbers separated with 1 non-number character ?
Use a negative lookahead at the end:
/(?:^|\D)(\d+)(?!\d)/g
See the regex demo
The pattern matches:
(?:^|\D) - either start of string (^) or any non-digit char (\D)
(\d+) - Group 1: one or more digits
(?!\d) - the negative lookahead failing the match if there is a digit immediately to the right of the current location.