Why does the optional capture group in my regex not work? - regex

Here is a text example that I will usually get:
CERTIFICATION/repos_1/test_examples_1_01_C.py::test_case[6]
CERTIFICATION/repos_1/test_examples_2_01_C.py::test_case[7]
INTEGRATION/test_example_scan_1.py::test_case
INTEGRATION/test_example_scan_2.py::test_case
Here is the regex I'm using to capture 3 different groups:
^.*\/(.*)\.py.*:{2}(.*(\[.*\])?)
If we take the first line of my examples, I should get:
test_examples_1_01_C - test_case[6] - [6]
And for the last line:
test_example_scan_2 - test_case - None
But if you try this regex, you will find that the first example does not work: I can't get
the [6]. If you remove the "?", lines that do not have "[.*]" at the end no longer match.
So, how can I get all this information? And what am I doing wrong?
Regards

You can use
^.*\/(.*)\.py.*::(.*?(\[.*?\])?)$
See the regex demo
Details:
^ - start of string
.* - any zero or more chars other than line break chars, as many as possible
\/ - a / char
(.*) - Group 1: any zero or more chars other than line break chars, as many as possible
\.py - .py substring
.* - any zero or more chars other than line break chars, as many as possible
:: - a :: string
(.*?(\[.*?\])?) - Group 2: any zero or more chars other than line break chars, as few as possible, and then an optional Group 3 matching [, any zero or more chars other than line break chars, as few as possible, and a ]
$ - end of string.
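The original pattern fails because the greedy .* inside Group 2 already consumes [6], so the optional group after it is satisfied by matching nothing. Here is a minimal Python sketch that compares the two patterns on the sample lines from the question. It assumes Python's re module (the question does not name a regex flavor), and the helper name split_test_id is hypothetical:

```python
import re

# Pattern from the question: the greedy .* in group 2 swallows "[6]",
# so the optional group 3 never participates in the match.
ORIGINAL = re.compile(r"^.*\/(.*)\.py.*:{2}(.*(\[.*\])?)")
# Pattern from the answer: lazy quantifiers plus the $ anchor force the
# engine to hand a trailing bracketed part to group 3 when it exists.
FIXED = re.compile(r"^.*\/(.*)\.py.*::(.*?(\[.*?\])?)$")

SAMPLES = [
    "CERTIFICATION/repos_1/test_examples_1_01_C.py::test_case[6]",
    "CERTIFICATION/repos_1/test_examples_2_01_C.py::test_case[7]",
    "INTEGRATION/test_example_scan_1.py::test_case",
    "INTEGRATION/test_example_scan_2.py::test_case",
]


def split_test_id(pattern, line):
    """Return the three captured groups, or None if the line does not match."""
    match = pattern.match(line)
    return match.groups() if match else None


for line in SAMPLES:
    print("original:", split_test_id(ORIGINAL, line))
    print("fixed:   ", split_test_id(FIXED, line))
```

With the original pattern, group 3 comes back as None even for the lines that end in [6] or [7]; with the fixed pattern it is None only when the brackets are really absent.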

With the help of negated character classes you can get all matches and make this regex a lot more efficient:
^.*/([^.]+)\.py::([^[]+(\[[^]]*]|))$
RegEx Demo
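A quick way to check the negated character class version is the Python sketch below (assuming Python's re module; the helper name parse_test_id is hypothetical). One difference to be aware of: because the third group is written as (\[[^]]*]|), it always participates, so a line without brackets yields an empty string for group 3 rather than None.

```python
import re

# Negated character classes cannot run past their delimiter, so the engine
# has very little backtracking to do.
NEGATED = re.compile(r"^.*/([^.]+)\.py::([^[]+(\[[^]]*]|))$")


def parse_test_id(line):
    """Return (file stem, test id, bracket part) or None if there is no match."""
    match = NEGATED.match(line)
    return match.groups() if match else None


print(parse_test_id("CERTIFICATION/repos_1/test_examples_2_01_C.py::test_case[7]"))
print(parse_test_id("INTEGRATION/test_example_scan_1.py::test_case"))
```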

Related

golang regex get the string including the search character

I am extracting a piece of string from a string (link):
https://arteptweb-vh.akamaihd.net/i/am/ptweb/100000/100000/100095-000-A_0_VO-STE%5BANG%5D_AMM-PTWEB_XQ.1V7rLEYkPH.smil/master.m3u8
The desired output should be 100000/100000/100095-000-A_
I am using the regex ^.*?(/[i,na,fm,d]([,/]?)(/am/ptweb/|.+=.+,))([^_]*).*?$ in the Golang flavor and I can only get group 4, with the following output: 100000/100000/100095-000-A
However I want the underscore after A.
I'm a bit stuck on this; any help is appreciated.
You can use
(/(i|na|fm|d)(/am/ptweb/|.+=.+,))([^_]*_?)
See the regex demo.
Details:
(/(i|na|fm|d)(/am/ptweb/|.+=.+,)) - Group 1:
/ - a / char
(i|na|fm|d) - Group 2: i, na, fm or d
(/am/ptweb/|.+=.+,) - Group 3: /am/ptweb/ or one or more chars as many as possible (other than line break chars), =, one or more chars as many as possible (other than line break chars) and a , char
([^_]*_?) - Group 4: zero or more chars other than _ and then an optional _.
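To check the group numbering, here is a small sketch. It uses Python's re module rather than Go's regexp package (an assumption made only to keep the example short; the pattern uses no syntax specific to either flavor), and the helper name extract_id is hypothetical:

```python
import re

URL = (
    "https://arteptweb-vh.akamaihd.net/i/am/ptweb/100000/100000/"
    "100095-000-A_0_VO-STE%5BANG%5D_AMM-PTWEB_XQ.1V7rLEYkPH.smil/master.m3u8"
)

# Group 4 stops at the first underscore and, thanks to _?, includes it.
PATTERN = re.compile(r"(/(i|na|fm|d)(/am/ptweb/|.+=.+,))([^_]*_?)")


def extract_id(url):
    """Return group 4 of the first match, or None when nothing matches."""
    match = PATTERN.search(url)
    return match.group(4) if match else None


print(extract_id(URL))
```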
You can match the underscore after the A like:
^.*?(/(?:[id]|na|fm)([,/]?)(/am/ptweb/|.+=.+,))([^_]*_).*$
See a regex demo
A few notes about the pattern that you tried:
The notation [i,na,fm,d] is a character class that matches a single char; it should be a grouping such as (?:[id]|na|fm)
In the group ([,/]?) you optionally capture either , or /, so in theory it could match a string that has /i//am/ptweb/
The last part .*?$ does not have to be non-greedy, as it is the last part of the pattern
The negated class [^_]* can also match spaces and newlines
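The first note is easy to verify. The sketch below (Python's re module is an assumption, since the question is about Go, but character classes behave the same way in both; the helper name fully_matches is hypothetical) shows that the bracket notation accepts exactly one character from the set, commas included, while the grouping accepts the intended tokens:

```python
import re

# A character class: exactly ONE char out of i , n a f m d
CHAR_CLASS = re.compile(r"[i,na,fm,d]")
# A non-capturing group with alternatives: i, d, na or fm
ALTERNATION = re.compile(r"(?:[id]|na|fm)")


def fully_matches(pattern, text):
    """True when the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for token in ["i", "na", "fm", ",", "m"]:
    print(repr(token), fully_matches(CHAR_CLASS, token), fully_matches(ALTERNATION, token))
```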

Regex choose based on string format

I have following formats of data:
CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats_1
CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats__1
I am using following regex:
^(.*)_(.*?_.*?)(_\d$|__\d$)
My requirement is to always get CumulativeBinNetworksViews_totalSuccessfulHeartbeats. For the first case it works fine, but for the second case it prints "totalSuccessfulHeartbeats_1". How can I solve this?
You can use
^(.*)_([^_]+_[^_]+)__?\d$
See the regex demo. Details:
^ - start of string
(.*) - Group 1: any zero or more chars other than line break chars as many as possible
_ - an underscore
([^_]+_[^_]+) - Group 2: one or more chars other than _, _ and one or more chars other than _
__? - one or two underscores
\d - a digit
$ - end of string.
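Since [^_]+ can never cross an underscore, Group 2 is pinned to exactly two underscore-free segments, whichever separator precedes the digit. A minimal Python sketch (the regex flavor is not stated in the question, so Python's re is an assumption; the helper name middle_part is hypothetical):

```python
import re

PATTERN = re.compile(r"^(.*)_([^_]+_[^_]+)__?\d$")

NAMES = [
    "CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats_1",
    "CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats__1",
]


def middle_part(name):
    """Return group 2 (the last two segments before the digit) or None."""
    match = PATTERN.match(name)
    return match.group(2) if match else None


for name in NAMES:
    print(middle_part(name))
```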

Ignore Until "Spacebar+I or V or X" - Regex Expression

So... I had a regex which worked just fine (wasn't pretty but worked), until the Roman Numerals reached more than X.
Currently my Regex looks like this:
(.*?)(^(X{1,3})(I[XV]|V?I{0,3})$|^(I[XV]|V?I{1,3})$|^V$)*(.)( EP\. )(\d*)(.*)
The problem I have right now is that if the Roman numeral has a value of 10 or more, it ends up in the 1st group, which drives me nuts.
I need it to work in a way that everything before the Roman numeral is ignored.
Test Text:
PEPA THE PIG XVI EP. 169 - BAD ENDING
Could you please help me fix the regex so it actually does what it is supposed to do?
You should reconsider using anchors in the middle of a regex: ^ requires the start of string and $ requires the end of string.
Besides, (.) before ( EP\. ) consumes the space, so the EP pattern cannot match it.
Consider using
^(.*?)\b(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V)\b(.)\b(EP\.)\s*(\d+)(.*)
See the regex demo. You might still need to check what exactly you want to match with (.).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\b - a word boundary
(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V) - Group 2: one to three Xs followed with IX or IV, or with an optional V and then zero to three Is, or IX, IV, or an optional V followed with one to three Is or V
\b - a word boundary
(.) - Group 3: any one char (other than a newline)
\b - a word boundary
(EP\.) - Group 4: EP.
\s* - zero or more whitespaces
(\d+) - Group 5: one or more digits
(.*) - Group 6: any zero or more chars other than line break chars, as many as possible
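To see what each group ends up holding, here is a short Python sketch run on the test text from the question (Python's re is an assumption, as the question does not name a flavor; the helper name split_title is hypothetical):

```python
import re

PATTERN = re.compile(
    r"^(.*?)\b(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V)\b(.)\b(EP\.)\s*(\d+)(.*)"
)


def split_title(title):
    """Return the six captured groups, or None if the title does not match."""
    match = PATTERN.match(title)
    return match.groups() if match else None


groups = split_title("PEPA THE PIG XVI EP. 169 - BAD ENDING")
# Group 1 is the title, group 2 the numeral, group 3 the single char
# between the numeral and EP., group 5 the episode number.
print(groups)
```

The word boundaries are what keep the I inside PIG from being taken as a numeral: there is no boundary between P and I, so Group 2 can only start at XVI.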

Regexp Substring From URL

I need to retrieve some words from a URL:
WebViewActivity - https://google.com/search/?term=iphone_5s&utm_source=google&utm_campaign=search_bar&utm_content=search_submit
The result I want:
search/iphone_5s
but I'm stuck and do not really understand how to use regexp_substr to get that data.
I'm trying to use this query
regexp_substr(web_url, '\google.com/([^}]+)\/', 1,1,null,1)
which only returns the word 'search', and when I try
regexp_substr(web_url, '\google.com/([^}]+)\&', 1,1,null,1)
it turns out I get everything up to the last '&'
You may use a REGEXP_REPLACE to match the whole string but capture two substrings and replace with two backreferences to the capture group values:
REGEXP_REPLACE(
'WebViewActivity - https://google.com/search/?term=iphone_5s&utm_source=google&utm_campaign=search_bar&utm_content=search_submit',
'.*//google\.com/([^/]+/).*[?&]term=([^&]+).*',
'\1\2')
See the regex demo and the online Oracle demo.
Pattern details
.* - any zero or more chars other than line break chars as many as possible
//google\.com/ - a //google.com/ substring
([^/]+/) - Capturing group 1: one or more chars other than / and then a /
.* - any zero or more chars other than line break chars as many as possible
[?&]term= - ? or & and a term= substring
([^&]+) - Capturing group 2: one or more chars other than &
.* - any zero or more chars other than line break chars as many as possible
NOTE: To use this approach and get an empty result if the match is not found, append |.+ at the end of the regex pattern.
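The capture-and-replace idea can also be tried outside the database. The sketch below mirrors it with Python's re.sub instead of Oracle's REGEXP_REPLACE, so it only illustrates how the pattern and the |.+ fallback behave, not Oracle itself; the function name extract_path_and_term and its empty_on_miss parameter are hypothetical:

```python
import re

PATTERN = r".*//google\.com/([^/]+/).*[?&]term=([^&]+).*"

TEXT = (
    "WebViewActivity - https://google.com/search/?term=iphone_5s"
    "&utm_source=google&utm_campaign=search_bar&utm_content=search_submit"
)


def extract_path_and_term(text, empty_on_miss=False):
    """Replace the whole string with group 1 + group 2.

    Without the fallback, a non-matching string is returned unchanged.
    With |.+ appended, the fallback alternative matches any non-empty
    string and both groups are empty, so the result is an empty string.
    """
    pattern = PATTERN + "|.+" if empty_on_miss else PATTERN
    return re.sub(pattern, r"\1\2", text)


print(extract_path_and_term(TEXT))
print(repr(extract_path_and_term("no url here")))
print(repr(extract_path_and_term("no url here", empty_on_miss=True)))
```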

Get the first occurrence of a string in a variable REGEX

I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I tried the regex .*(\-I).*(\-V), but it doesn't work. Then I tried .*(\-I), but it matches up to the last -IPRMKT string.
So my question is: is there a way to split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT at the first occurrence of -I?
This should do the trick:
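regex = r"(.*?-I[\d]{2})-(.*)"
Here is a test script in Python:
import re
# Raw string, so that \d reaches the regex engine unchanged
regex = r"(.*?-I[\d]{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
    print("yep")
    print(match.group(1))  # everything up to and including -I08
    print(match.group(2))  # everything after the separating hyphen
else:
    print("nope")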
In the regex, I'm grabbing everything up to the first -I followed by two digits. Then I match but don't capture a -, and capture the rest. I can help tweak it if you have more logic that you are trying to handle.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ chars other than line break chars, as few as possible (because *? is a lazy quantifier, the match stops at the first occurrence of what follows)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.
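The two answers differ only in what may follow -I: exactly two digits versus anything up to the next hyphen. A small Python sketch makes the difference visible; the second input with a three-digit segment is a made-up value for illustration, and the helper name split_at_first_i is hypothetical:

```python
import re

TWO_DIGITS = re.compile(r"(.*?-I[\d]{2})-(.*)")   # first answer
ANY_LENGTH = re.compile(r"^(.*?-I[^-]*)-(.*)")    # second answer

SOURCE_VALUE = "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT"
# Hypothetical value, not from the question: three digits after -I
HYPOTHETICAL_VALUE = "PSC-CAMPO-GRANDE-I123-V00-C09-H09-IPRMKT"


def split_at_first_i(pattern, value):
    """Return (left part, right part) or None if the pattern does not match."""
    match = pattern.search(value)
    return match.groups() if match else None


for value in (SOURCE_VALUE, HYPOTHETICAL_VALUE):
    print(split_at_first_i(TWO_DIGITS, value), split_at_first_i(ANY_LENGTH, value))
```

Both patterns split the value from the question the same way; only the [^-]* version still matches when the segment after -I is not exactly two digits.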