how to implement or after group in regex pattern - regex

I want to get the thread-id from my URLs with a single pattern. The pattern should have just one group (on level 1). My test strings are:
https://www.mypage.com/thread-3306-page-32.html
https://www.mypage.com/thread-3306.html
https://www.mypage.com/Thread-String-Thread-Id
So I want a pattern that gives me the number 3306 for lines 1 and 2, and "String-Thread-Id" for the last line.
My current attempt is .*[t|T]hread-(.*)[\-page.*|.html], but it fails at the end, after the id. How can I do this properly? I also solved it with .*Thread-(.*)|.*thread-(\\w+).*, but that uses two groups, which does not work with my Java code.

I don't know whether this fits every situation, but I would try this:
^.*?thread-((?:(?!-page|\.html).)*)
In Java, that could look something like
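import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// subjectString is assumed to hold the input text, one URL per line
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("^.*?thread-((?:(?!-page|\\.html).)*)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group(1)); // group 1 is the thread id
}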
Explanation:
^ # Match start of line
.*? # Match any number of characters, as few as possible
thread- # until "thread-" is matched.
( # Then start a capturing group (number 1) to match:
(?: # (start of non-capturing group)
(?!-page|\.html) # assert that neither "-page" nor ".html" follows
. # then match any character
)* # repeat as often as possible
) # end of capturing group
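As a quick cross-check, the same pattern can be run against the three test URLs from the question. This is only a sketch in Python rather than Java: the helper name `thread_id` is made up for illustration, and `re.IGNORECASE` stands in for Java's `Pattern.CASE_INSENSITIVE`.

```python
import re

# Same pattern as above; IGNORECASE lets "thread-" also match "Thread-".
THREAD_ID = re.compile(r"^.*?thread-((?:(?!-page|\.html).)*)", re.IGNORECASE)


def thread_id(url):
    """Hypothetical helper: return the captured id, or None without a "thread-" part."""
    m = THREAD_ID.search(url)
    return m.group(1) if m else None


urls = [
    "https://www.mypage.com/thread-3306-page-32.html",
    "https://www.mypage.com/thread-3306.html",
    "https://www.mypage.com/Thread-String-Thread-Id",
]
for url in urls:
    print(thread_id(url))
```

Because `.*?` is lazy, the capture starts after the first "thread-" in the URL, which is why the last URL yields "String-Thread-Id" rather than just "Id".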

Related

regex - How to create date format regex without repeating previous group

I want to create a regex for matching date formats entered by a user. The user will enter date formats as a string ("dd/MMM/yyyy") and not actual values.
For example:
dd/MMM/yyyy = ✅
MMM/dd/yyyy = ✅
dd/dd/yyyy = ❌ (previously captured groups cannot be repeated)
MMM/MMM/yyyy = ❌ (same reason as above)
I'm having trouble getting a negative lookahead to work. Any assistance is much appreciated.
I believe you could use the regular expression
\b(?:dd\/(?:mm|MMM)\/yyyy|(?:mm|MMM)\/dd\/yyyy|yyyy\/(?:mm|MMM)\/dd)\b
The regex engine performs the following operations.
\b # match a word break
(?: # begin a non-capture group
dd\/(?:mm|MMM)\/yyyy # match 'dd/' followed by 'mm' or 'MMM'
# followed by '/yyyy'
| # or
(?:mm|MMM)\/dd\/yyyy # match 'mm' or 'MMM' followed by '/dd'
# followed by '/yyyy'
| # or
yyyy\/(?:mm|MMM)\/dd # match 'yyyy/' followed by 'mm' or 'MMM'
# followed by '/dd'
) # end the non-capture group
\b # match a word break
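To see how the alternation behaves on the examples from the question, here is a minimal Python sketch. The helper name `is_valid_format` is an assumption for illustration, and `fullmatch` is used so that the whole input has to be one of the listed layouts; the `/` needs no escaping in Python.

```python
import re

# The alternation from the answer, listing each allowed ordering explicitly.
DATE_FORMAT = re.compile(
    r"\b(?:dd/(?:mm|MMM)/yyyy|(?:mm|MMM)/dd/yyyy|yyyy/(?:mm|MMM)/dd)\b"
)


def is_valid_format(fmt):
    """Hypothetical helper: True if fmt is exactly one of the allowed layouts."""
    return DATE_FORMAT.fullmatch(fmt) is not None


for fmt in ["dd/MMM/yyyy", "MMM/dd/yyyy", "dd/dd/yyyy", "MMM/MMM/yyyy"]:
    print(fmt, is_valid_format(fmt))
```

Since every permitted ordering is spelled out, a repeated component such as dd/dd/yyyy simply matches none of the alternatives, so no lookahead is needed.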

Find Strings that contains a sequence of a specific sub string, with a limited amount of interruptions in between with regex

I'm looking for a regex that finds the part of the string (if it exists) containing the longest sequence of repeating GGG, with an interruption of at most 10 characters between consecutive GGGs.
I tried the following pattern, but it didn't work that well: ((GGG).{0,10}?)*
CAGTTAGGGTTTAGGGTTAGGTTTAGGGTTAGGGTTAGGGTGAGGTGAGGGTGAGGGTTAGGGTGAGGGGTGAGGGGTTGGGGTTAGGGTTAGGGTTAGGAGTTGCAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTACTTTAGGGTTAGGGTTGGGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTACCTGCTTACTTGCTGCAGGGTTAGGGTTAGGGTTAGGGTTAAGTTAGGGTTTAGGGTTGGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGAGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTGGGGTTAGGGGTTAGGGGTTGGGGGGGTTAGGGTTGGGGGTTGGGGGTTAGGGAGGGTTAGGGGTTGGGGGTTGCAGGGGTTAGGGTTAGGGGTTGGGGTTAGGGTTAGGGTTAGGGTTACCTTGGGGGTTGGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTAGGAGTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTGGGGTTAGGGTTAGAGGTTAGGGTTAGGGGTTGGGGTTAGGGGTTGGGGGTTGGGGTTAGGGTTGCAGAAGGGGTTGAGCAGGGTGGGAGTTAGGGATTAGGGATTAGGAGTTAGGGTGAGGGTTAGGGTTAGGGTGGGGTGGGGATTGGGGATTGGGAGTTAGGGTGGGTGGGGATTGGGGAGTTAGGAGTTAGGAGTTAGGAGTTAGGGAGTTAGGTTAGGGAGTTAGGGTTAGGAGTTAGAGGTTAGGGTTAGGGTGGGAGTTAGGGAGTTAGGAGGTGGGGTTGGGGTTAGGGTTAGGAGTTAGGGTTAGGGTTAGGGTTAGGGATTGGGAGTTAGGGTAGGAGTTAGGGTTAGAGGTTAGGAGTTAGGGTTAGGAGTTAGGGATTAGAGGTTAGGGTGGGATTAGGAGTTACTTACTTAGGGAGTTAGGAGTTAGGAGTTAGGGTGGGGTGGGAGTTAGAGGTTAGGAGTTAGGAGTTAGGGTTAGGGTTAGGAGTTAAGGGTTAGGGATTAGGAGTTAGGGTTAGGGTTAGGAGTTAGGGAGTTAGGGTGGGGTGGGAGTTGCAGGGATTGGGTTAGGGTTAGGAGTTGGGAGTTGGGGAGTTGGGAGTTAGGGTTACAGGGTGGGAGTTAGGAGTTAGGGAGTTAGGAGTTAGAGGTTAGGGATTAGGGGT
This pattern will work based on your rules: ((?:GGG.{0,10}?)+GGG)
Explanation:
( start capture group
(?: start non-capturing group
GGG literally
.{0,10}? any character 0-10 times, non-greedy
) end the non-capturing group
+ match the previous group 1 or more times
GGG literally
) end the capture group
Then you can simply use re.findall to find all of those matches, and get the longest one of those with max(key=len).
Python demo:
import re
string = "CAGTTAGGGTTTAGGGTTAGGTTTAGGGTTAGGGTTAGGGTGAGGTGAGGGTGAGGGTTAGGGTGAGGGGTGAGGGGTTGGGGTTAGGGTTAGGGTTAGGAGTTGCAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTACTTTAGGGTTAGGGTTGGGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTACCTGCTTACTTGCTGCAGGGTTAGGGTTAGGGTTAGGGTTAAGTTAGGGTTTAGGGTTGGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGAGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTGGGGTTAGGGGTTAGGGGTTGGGGGGGTTAGGGTTGGGGGTTGGGGGTTAGGGAGGGTTAGGGGTTGGGGGTTGCAGGGGTTAGGGTTAGGGGTTGGGGTTAGGGTTAGGGTTAGGGTTACCTTGGGGGTTGGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTAGGAGTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTGGGGTTAGGGTTAGAGGTTAGGGTTAGGGGTTGGGGTTAGGGGTTGGGGGTTGGGGTTAGGGTTGCAGAAGGGGTTGAGCAGGGTGGGAGTTAGGGATTAGGGATTAGGAGTTAGGGTGAGGGTTAGGGTTAGGGTGGGGTGGGGATTGGGGATTGGGAGTTAGGGTGGGTGGGGATTGGGGAGTTAGGAGTTAGGAGTTAGGAGTTAGGGAGTTAGGTTAGGGAGTTAGGGTTAGGAGTTAGAGGTTAGGGTTAGGGTGGGAGTTAGGGAGTTAGGAGGTGGGGTTGGGGTTAGGGTTAGGAGTTAGGGTTAGGGTTAGGGTTAGGGATTGGGAGTTAGGGTAGGAGTTAGGGTTAGAGGTTAGGAGTTAGGGTTAGGAGTTAGGGATTAGAGGTTAGGGTGGGATTAGGAGTTACTTACTTAGGGAGTTAGGAGTTAGGAGTTAGGGTGGGGTGGGAGTTAGAGGTTAGGAGTTAGGAGTTAGGGTTAGGGTTAGGAGTTAAGGGTTAGGGATTAGGAGTTAGGGTTAGGGTTAGGAGTTAGGGAGTTAGGGTGGGGTGGGAGTTGCAGGGATTGGGTTAGGGTTAGGAGTTGGGAGTTGGGGAGTTGGGAGTTAGGGTTACAGGGTGGGAGTTAGGAGTTAGGGAGTTAGGAGTTAGAGGTTAGGGATTAGGGGT"
pattern = re.compile(r"((?:GGG.{0,10}?)+GGG)")
longest = max(re.findall(pattern, string), key=len)
print(len(longest), longest)
Output:
583 GGGTTAGGGTTAGGGTTAGGGTTAAGTTAGGGTTTAGGGTTGGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGAGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTGGGGTTAGGGGTTAGGGGTTGGGGGGGTTAGGGTTGGGGGTTGGGGGTTAGGGAGGGTTAGGGGTTGGGGGTTGCAGGGGTTAGGGTTAGGGGTTGGGGTTAGGGTTAGGGTTAGGGTTACCTTGGGGGTTGGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTAGGAGTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTGGGGTTAGGGTTAGAGGTTAGGGTTAGGGGTTGGGGTTAGGGGTTGGGGGTTGGGGTTAGGGTTGCAGAAGGGGTTGAGCAGGGTGGGAGTTAGGGATTAGGG
Edit:
If you want to have at least 51 GGGs in the string, you can use the pattern: ((?:GGG.{0,10}?){50,}GGG) to accomplish that.
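Both variants can be folded into one small function. The following is a sketch under assumptions: the function name `longest_run` and its parameters `max_gap` and `min_repeats` are invented for illustration, and `default=""` is added so that `max` does not raise `ValueError` when nothing matches.

```python
import re


def longest_run(seq, max_gap=10, min_repeats=2):
    """Hypothetical helper: longest stretch with at least min_repeats GGGs,
    each separated by at most max_gap characters. Returns "" if none exists."""
    n = min_repeats - 1  # the final GGG is written out separately in the pattern
    pattern = re.compile(rf"(?:GGG.{{0,{max_gap}}}?){{{n},}}GGG")
    # No capture group here, so findall returns the full matches.
    return max(pattern.findall(seq), key=len, default="")


# Three GGGs close together, a 15-character gap, then two more GGGs.
sample = "GGGTTGGGTTGGG" + "A" * 15 + "GGGTGGG"
print(longest_run(sample))
print(longest_run(sample, max_gap=15))
```

With the default `min_repeats=2` the generated pattern is equivalent to `(?:GGG.{0,10}?)+GGG`; passing `min_repeats=51` corresponds to the `{50,}` variant from the edit.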

Regex: match an empty string instead of nothing

I have a Python script in which I'm trying to parse a string of the form:
one[two=three].four
Each word should be in its own capture group. The punctuation should not be captured.
Additionally, each part of the string is optional, and the part delimited by brackets can be repeated. So the above is the most complete example, but all of the following should also be valid matches:
one
.four
one[two=three][five=six]
[two=three]
[two].four
[two][five]
[]
In the case that one of the words is not present, instead of failing to capture, I'd like to capture a string of length 0.
The regex that I'm using is as follows:
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)* # End the two-three group, allowing repeats
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
What I've tried to do during that regex is, instead of allowing 0 or 1 of a group by appending ? to the entire group, I allowed any number of characters to be in the actual match itself by appending * to the character selection. Therefore, the match is forced to exist, but the string itself can have a length of 0.
The problem comes with the bracketed block. The package I'm using allows me to access all captures of a named group using match.captures(groupname). This way, I can access all matches for cap_2 using match.captures("cap_2"):
>>> pattern.match("one[two=three][five=six].four").captures("cap_2")
["two", "five"]
This works fine when the brackets are present. However, when they're not:
>>> pattern.match("one.four").captures("cap_2")
[]
Expected: [""]
I expect there to be at least an empty string present for cap_2 and cap_3. However, there's nothing.
This is because of the * I place after the two+three section of the regex, in order to allow multiple of those groups - this is allowing that part of the regex to be skipped altogether.
Changing that * to + breaks the regex, as now it won't match the above example at all because it's trying to match the brackets. Adding a ? after each bracket means that cap_1 and cap_2 are not delimited and includes what should be in cap_4 in cap_3.
What's the solution here? How can I allow a group containing two capturing groups to be executed multiple times, but match only empty strings when the brackets are not present?
You can solve the problem by replacing the * after the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* repeated group with + and adding an alternative that contains a second occurrence of the groups cap_2 and cap_3 (note that the PyPI regex module supports multiple identically named groups in the same regex):
import regex as re
s = 'one.four'
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?:
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)+ # End the two-three group, allowing repeats
|
(?P<cap_2>)(?P<cap_3>)
)
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
print ( pattern.match("one.four").captures("cap_2") )
# => ['']
The thing is, the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* part always matches, since it can match an empty string, so if you just add the alternative without changing the quantifier, you won't get the expected results. With +, when there are no [...] blocks, the first alternative fails and the second cap_2 and cap_3 groups, which have empty patterns, always match and capture an empty string.
If you want it to match either the empty string or something else, you need the OR operator: |
If you want your regexp to match an empty string, you need something that matches the empty string, e.g. () or (not empty|)
Combined and applied to your case, that would look like this (simplified):
((?:\[stuff inside the brackets\])+|)
The outermost group captures the whole bracket construct (e.g. [two][three]) if it's present or the empty string. Notice that the left part of the | operator now has to match at least once (+).
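The `(...)+|` idea can also be tried with the standard-library `re` module, which has no `captures()` method. The sketch below is an assumption-laden variant, not the approach from the question: it captures the whole bracket construct first and then splits it in a second step. The names `OUTER`, `INNER` and `parse` are invented for illustration.

```python
import re

# Step 1: capture "one", the whole bracket construct (or ""), and "four".
OUTER = re.compile(r"^([a-z]*)((?:\[[a-z]*=?[a-z]*\])+|)\.?([a-z]*)$", re.IGNORECASE)
# Step 2: split the bracket construct into (two, three) pairs.
INNER = re.compile(r"\[([a-z]*)=?([a-z]*)\]", re.IGNORECASE)


def parse(text):
    """Hypothetical helper: return (head, pairs, tail), or None if text is invalid."""
    m = OUTER.match(text)
    if m is None:
        return None
    head, brackets, tail = m.groups()
    # With no brackets at all, fall back to a single pair of empty strings.
    pairs = INNER.findall(brackets) or [("", "")]
    return head, pairs, tail


print(parse("one[two=three][five=six].four"))
print(parse("one.four"))
```

The trade-off is a second pass over the bracket text, in exchange for not depending on repeated-capture support.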

Selecting if no delimiter, and no selecting if it is

I have strings like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What pattern should I use to select "2sg." if the string does not contain "|", and to select nothing if the string does contain "|"?
I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
When this is inserted into a RegEx, it means that if data follows at that position, the RegEx will not match there.
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+sg\.) instead if the number could have more than one digit
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if followed by |
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg\.) in place of (\dsg\.) if the number could have more than one digit
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... Strings ends with
Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes that there is an optional number followed by sg. and that this is not followed by |.
^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
Try:
(\d+sg\.(?!\|))
Depending on your programming environment, the syntax may differ slightly, but this should give you the result.
For more information see Negative Lookahead
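The answers above differ in one detail that is easy to miss: the lookahead only rejects a "|" directly after "2sg.", while the whole-string pattern rejects a "|" anywhere after it. A small Python sketch, with the names `LOOKAHEAD`, `WHOLE_LINE` and `select` invented for illustration, makes the difference visible.

```python
import re

# Rejects only when "|" comes immediately after "sg."
LOOKAHEAD = re.compile(r"(\d+sg\.)(?!\|)")
# Rejects when "|" appears anywhere after "sg." (single digit, as in the answer)
WHOLE_LINE = re.compile(r"^.*(\dsg\.)[^|]*$")


def select(pattern, text):
    """Hypothetical helper: return group 1 of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


for text in ["smth 2sg. smth", "smth 2sg.| smth.", "smth 2sg. smth|"]:
    print(repr(text), select(LOOKAHEAD, text), select(WHOLE_LINE, text))
```

If "select nothing when the string contains |" is meant literally, the whole-string variant is closer to the requirement; the lookahead variant is enough when the "|" can only appear right after the dot.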

regex conditional statement (only parse numbers if string has a certain beginning)

At first I used a string that returned my relay states, so "1.0.0.0.1.1.0.0" would get parsed/grouped with \d+,
and then my eight switches used 'format response', e.g. {1}, to get the state for each switch.
Now I need to get the numbers out of this string: "RELAYS.1.0.0.0.1.1.0.0"
\d+ will still get the numbers, but I only want to get them IF the string starts with "RELAYS".
Can anyone please explain how I could do that?
Thanks a million in advance!
With a .NET engine, you could use the regex (?<=^RELAYS[\d.]*)\d+. But most regex engines don't support indefinite repetition in a lookbehind assertion.
See it live on regexhero.net.
Explanation:
(?<= # Assert that the following can be matched before the current position:
^RELAYS # Start of string, followed by "RELAYS"
[\d.]* # and any number of digits/dots.
) # End of lookbehind assertion
\d+ # Match one or more digits.
With a PCRE engine, you could use (?:^RELAYS\.|\G\.)(\d+) and access group 1 for each match.
See it live on regex101.com.
Explanation:
(?: # Start a non-capturing group that matches...
^RELAYS\. # either the start of the string and "RELAYS."
| # or
\G\. # the position after the previous match, followed by "."
) # End of non-capturing group
(\d+) # Match a number and capture it in group 1
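Python's built-in `re` module supports neither variable-width lookbehind nor `\G`, so if a scripting step is available, one option is to split the work in two: validate the prefix first, then pull out the numbers. This is a sketch under that assumption; the function name `relay_states` is made up for illustration.

```python
import re

# The whole response must be "RELAYS" followed by one or more ".<digits>" parts.
RELAYS = re.compile(r"RELAYS((?:\.\d+)+)")


def relay_states(response):
    """Hypothetical helper: list of relay states, or [] without the RELAYS prefix."""
    m = RELAYS.fullmatch(response)
    if m is None:
        return []
    # Only the part after "RELAYS" is scanned for numbers.
    return re.findall(r"\d+", m.group(1))


print(relay_states("RELAYS.1.0.0.0.1.1.0.0"))
print(relay_states("1.0.0.0.1.1.0.0"))
```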