regex group not matching

I have the regex
(\d|(IV|I{0,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+
if I use the sentence
'1 has wound' - 1 is matched in group 1 as expected
'IV has wound' - IV is matched in group 1 as expected
but, the sentence
'one has wound' - the word one doesn't get matched in group 1
when i modify the regex as follows
(\bone\b|\btwo\b|\bthree\b|\bfour\b|\d|(IV|I{0,3}))[\w\s]+
the group matches as expected.
So my question is: why does changing the order of the alternatives make it work?
I tried looking up ordering and precedence for regex but couldn't find anything relevant.
Thanks

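I think you made a mistake in your regex; it should be
(\d|(IV|I{1,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+
Notice it's I{1,3}, not I{0,3}.
Alternatives are tried from left to right, and I{0,3} can match zero I characters. For 'one has wound' that alternative therefore succeeds with an empty match before \bone\b is ever tried, which leaves capture group 1 empty. That is also why moving the word alternatives to the front works.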
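A minimal Python sketch of the behaviour described above, assuming Python's re module treats these patterns the same way as the engine used in the question. The helper name group1 is made up for illustration; the three patterns are copied from the question and the answer.

```python
import re
from typing import Optional

# Patterns copied from the question and the answer above.
ORIGINAL = r"(\d|(IV|I{0,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+"
REORDERED = r"(\bone\b|\btwo\b|\bthree\b|\bfour\b|\d|(IV|I{0,3}))[\w\s]+"
FIXED = r"(\d|(IV|I{1,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+"


def group1(pattern: str, text: str) -> Optional[str]:
    """Return the content of capture group 1 for the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# I{0,3} matches the empty string, so it wins before \bone\b is tried.
print(repr(group1(ORIGINAL, "one has wound")))   # ''
# Putting the word alternatives first lets \bone\b win.
print(repr(group1(REORDERED, "one has wound")))  # 'one'
# Requiring at least one I makes that alternative fail, so \bone\b is reached.
print(repr(group1(FIXED, "one has wound")))      # 'one'
```

The empty string in the first case is a successful match of group 1, not a failure, which is why the overall regex still matches the sentence.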

Related

Regex to remove time zone stamp

In Google Sheets, I have time stamps with formats like the following:
5/25/2022 14:13:05
5/25/2022 13:21:07 EDT
5/25/2022 17:07:39 GMT+01:00
I am looking for a regex that will remove everything after the time, so the desired output would be:
5/25/2022 14:13:05
5/25/2022 13:21:07
5/25/2022 17:07:39
I have come up with the following regex after some trial and error, although I am not sure if it is prone to errors: [^0-9:\/' '\n].*
And the function in Google Sheets that I plan to use is REGEXREPLACE().
My goal is to be able to do calculations regardless of one's time zone, however the result will be stamped with the user's local time zone.
Could someone confirm this is correct? Appreciate any feedback I can get!

You can use
=REGEXREPLACE(A1, "^(\S+\s\S+).*", "$1")
=REGEXREPLACE(A1, "^([\d/]+\s[\d:]+).*", "$1")
See the regex demo #1 / regex demo #2.
Details:
^ - start of string
(\S+\s\S+) - Group 1: one or more non-whitespace chars, a single whitespace char and one or more non-whitespace chars
[\d/]+\s[\d:]+ - one or more digits or / chars, a whitespace, one or more digits or colons
.* - any zero or more chars other than line break chars as many as possible.
The $1 is a replacement backreference that refers to the Group 1 value.
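For readers who want to try the two patterns outside Google Sheets, here is a rough Python equivalent. It assumes Python's re module handles these simple patterns the same way as REGEXREPLACE does; the function names are hypothetical, and \1 is Python's spelling of the $1 backreference.

```python
import re


def strip_zone_loose(stamp: str) -> str:
    # Keep "non-whitespace, one whitespace, non-whitespace"; drop the rest.
    return re.sub(r"^(\S+\s\S+).*", r"\1", stamp)


def strip_zone_strict(stamp: str) -> str:
    # Same idea, but only digits and / in the date, digits and : in the time.
    return re.sub(r"^([\d/]+\s[\d:]+).*", r"\1", stamp)


for sample in (
    "5/25/2022 14:13:05",
    "5/25/2022 13:21:07 EDT",
    "5/25/2022 17:07:39 GMT+01:00",
):
    print(strip_zone_loose(sample), "|", strip_zone_strict(sample))
```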

With your shown samples, please try the following regex in REGEXREPLACE. It matches the time stamp specifically and creates only one capturing group, whose value replaces the whole cell value (as per the requirement).
=REGEXREPLACE(A1, "^((?:\d{1,2}\/){2}\d{4}\s+(?:\d{1,2}:){2}\d{1,2}).*", "$1")
Explanation of the regex used above:
^( ##Match from the start of the value and open the one and only capturing group.
(?:\d{1,2}\/){2} ##Non-capturing group: 1 to 2 digits followed by /, repeated 2 times.
\d{4}\s+ ##Match 4 digits followed by 1 or more whitespace characters.
(?:\d{1,2}:){2} ##Non-capturing group: 1 to 2 digits followed by a colon, repeated 2 times.
\d{1,2} ##Match 1 to 2 digits.
) ##Close the capturing group.
.* ##Match everything up to the end of the value; this part is not captured.
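The annotated breakdown above maps naturally onto a verbose-mode regex. The sketch below is a Python rendering, not the Google Sheets formula itself: it assumes re.VERBOSE and \1 in place of $1, and the names TIMESTAMP and keep_timestamp are invented for the example.

```python
import re

TIMESTAMP = re.compile(
    r"""
    ^(                  # start of the value, open the only capturing group
      (?:\d{1,2}/){2}   # month and day: 1-2 digits plus a slash, twice
      \d{4}\s+          # 4-digit year, then one or more whitespace chars
      (?:\d{1,2}:){2}   # hours and minutes: 1-2 digits plus a colon, twice
      \d{1,2}           # seconds: 1-2 digits
    )                   # close the capturing group
    .*                  # the rest of the value, matched but not captured
    """,
    re.VERBOSE,
)


def keep_timestamp(value: str) -> str:
    # Replace the whole value with group 1; non-matching input is unchanged.
    return TIMESTAMP.sub(r"\1", value)


print(keep_timestamp("5/25/2022 17:07:39 GMT+01:00"))
```

Because the pattern spells out the date and time shape, a value that does not look like a time stamp is left untouched instead of being truncated.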

You can do this without regex by splitting on the space and adding the first and second parts (the date and the time).
=ARRAYFORMULA(
IF(ISBLANK(A2:A),,
INDEX(SPLIT(A2:A," "),0,1)+
INDEX(SPLIT(A2:A," "),0,2)))
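The same split-and-combine idea can be sketched in Python. This is only an analogue of the formula above, not a translation of it: it assumes the month/day/year and 24-hour layout shown in the question, and parse_stamp is a made-up name.

```python
from datetime import datetime


def parse_stamp(stamp: str) -> datetime:
    # Keep only the first two space-separated parts: the date and the time.
    date_part, time_part = stamp.split(" ")[:2]
    # Assumed layout, taken from the samples: month/day/year hours:minutes:seconds.
    return datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M:%S")


print(parse_stamp("5/25/2022 17:07:39 GMT+01:00"))
```

As with the formula, the time zone text is discarded rather than applied, so the result is the wall-clock time as written.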

Reusing branch reset group doesn't match all the alternatives

I am trying to validate an IPv4 address using the RegEx below
^((?|([0-9][0-9]?)|(1[0-9][0-9])|(2[0-5][0-5]))\.){3}(?2)$
The regex works fine until the 3rd octet of the IP address in most of the cases. But sometimes in the last octet, it only matches the first alternative in the Branch Reset Group and ignores the other alternating groups altogether. I know that all alternatives in a branch reset group refer to the same capturing group. I tried the suggestion to reuse the capture groups as described in this StackOverflow post. It worked partially.

There is an explanation about this behaviour on this page:
https://www.pcre.org/original/doc/html/pcrepattern.html#SEC15
The documentation states:
a subroutine call to a numbered subpattern always refers to the first
one in the pattern with the given number.
Using the example on that page:
(?|(abc)|(def))(?1)
Inside a (?| group, parentheses are numbered as usual, but the number
is reset at the start of each branch.
The numbers will look like this
(?|(abc)|(def))
   1     1
This will match
abcabc
defabc
But it does not match
defdef
It does not match defdef because the branch reset group matches the first def, but the following (?1) can only match the first subpattern numbered 1, which is (abc).
See a regex demo.
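Python's re module supports neither branch reset groups nor subroutine calls, so the behaviour cannot be reproduced there directly. As a rough illustration only, the sketch below hand-expands (?1) into the first subpattern numbered 1, abc, which is what the quoted documentation says PCRE does. The expansion and the name EXPANDED are assumptions of this example.

```python
import re

# PCRE pattern from the documentation: (?|(abc)|(def))(?1)
# Hand-expanded for Python: (?1) always re-runs the FIRST group numbered 1,
# so it is written out literally as abc.
EXPANDED = re.compile(r"(?:(abc)|(def))abc")

for candidate in ("abcabc", "defabc", "defdef"):
    verdict = "matches" if EXPANDED.fullmatch(candidate) else "does not match"
    print(candidate, verdict)
```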

The reason is that the (?2) regex subroutine recurses the first capturing group pattern with the ID 2, ([0-9][0-9]?). If that fails to match (the $ requires the end of string right after it), backtracking starts and the match eventually fails.
The correct approach to recurse a group of patterns is to avoid using a branch reset group and capture all alternatives into a single capturing group that will be recursed:
^(?:(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?1)$
//  |____________ Group 1 _______________|      \_ Regex subroutine
See the regex demo.
Note the octet pattern is a bit different, it is taken from How to Find or Validate an IP Address. Your octet pattern is wrong because 2[0-5][0-5] does not match numbers between 200 and 255 that end with 6, 7, 8 and 9.
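Since Python's re module has no (?1) subroutine calls, a comparable effect can be had by keeping the octet pattern in a string and reusing it. The sketch below does that with the octet pattern from the answer; OCTET, IPV4, ASKER_OCTET and is_ipv4 are names chosen for the example, and fullmatch stands in for the ^ and $ anchors.

```python
import re

# Octet pattern from the answer above, reused instead of a (?1) subroutine.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

# The asker's octet alternatives, kept only to show the gap in 2[0-5][0-5].
ASKER_OCTET = re.compile(r"(?:[0-9][0-9]?|1[0-9][0-9]|2[0-5][0-5])")


def is_ipv4(text: str) -> bool:
    # fullmatch requires the whole string to match, like ^...$ in the answer.
    return IPV4.fullmatch(text) is not None


print(is_ipv4("192.168.1.209"))             # True
print(is_ipv4("256.1.1.1"))                 # False
print(ASKER_OCTET.fullmatch("209") is None)  # True: 209 is wrongly rejected
```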

Regex to capture words and numbers in separate groups

I need two groups - one to extract words, second - numbers. Example:
['| Sofia | 300']
need to extract:
Group 1 - Sofia; Group 2 - 300
My regex attempt:
([a-zA-Z]+[ ]*[a-zA-Z]+)([0-9]+)
I don't understand as to why this doesn't match. I've been reading for 30 minutes now and maybe I can't phrase my issue correctly, but I can't find solution. My thinking here is that each set of parentheses holds a group. The Regex inside them seems to work fine on its own, but when I try to capture 2 groups - it fails. Obviously I am missing something important about multiple group capturing.

It doesn't match because you're not matching the characters between "Sofia" and "300". This would match "Sofia300", but not "Sofia 300" or "Sofia | 300". Try this:
(\w+ *\w+).*?(\d+)
(I'm using \w instead of [a-zA-Z] and \d instead of [0-9] for brevity.)
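A quick way to check this is sketched below in Python, assuming the input is the plain string from the question. The original pattern only matches when the digits directly follow the letters; the suggested one lets .*? skip the separator.

```python
import re

ORIGINAL = r"([a-zA-Z]+[ ]*[a-zA-Z]+)([0-9]+)"
SUGGESTED = r"(\w+ *\w+).*?(\d+)"

text = "| Sofia | 300"

# Nothing in ORIGINAL can consume " | ", so there is no match at all.
print(re.search(ORIGINAL, text))               # None
# It does match once the separator is gone.
print(re.search(ORIGINAL, "Sofia300").groups())  # ('Sofia', '300')
# The lazy .*? bridges the gap between the word and the number.
print(re.search(SUGGESTED, text).groups())       # ('Sofia', '300')
```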

The following will give you your groups:
/([a-z]+).*\|\s([0-9]+)/i

Capture a substring of a matched group

Scenario
I have to grab a substring from a composed string.
Match condition:
string starts with 'section1:'
captured string may be a blank separated or a dash separated list of alphanumerical values
if the captured string ends with a specific suffix ('-xx'), exclude the suffix from the captured string.
Examples
section1:ypsilon : section 1 matches, grab 'ypsilon'
section1:ypsilon zeta : section 1 matches, grab 'ypsilon zeta'
section1:ypsilon-zeta : section 1 matches, grab 'ypsilon-zeta'
section1:ypsilon-xx : section 1 matches, grab 'ypsilon', exclude '-xx'
section1:ypsilon zeta-xx : section 1 matches, grab 'ypsilon zeta', exclude '-xx'
section1:ypsilon-zeta-xx : section 1 matches, grab 'ypsilon-zeta', exclude '-xx'
section2:ypsilon : section 2 does not match
Solution so far
^section1:([a-zA-Z0-9\- ]+)(\-xx)?$
The idea is to get the group 1, whereas the group 2 is optional.
Question
Unfortunately, the suffix also fits the group 1 definition, as it is an alphabetic string with a dash. So the captured string does not exclude the suffix.
Any clue?

You were close; the main problem you're facing is the greediness of quantifiers.
n+ will match as many n as possible; to make it match as few as possible, suffix it with ?
I ended up with this regex:
^section1:([a-zA-Z0-9\- ]+?)(|-xx)$
The main difference is the ? after the + to make it non-greedy (or reluctant). I also prefer an alternation between empty and the desired suffix, (|-xx), over an optional group: it matches either nothing or -xx before the end of the line.
I have no strong argument for one over the other; it's a matter of taste, I think.
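To see the difference between the greedy and the reluctant quantifier, here is a small Python sketch. It assumes each input line holds only the section1:... part, and the helper grab is a hypothetical name; both patterns are taken verbatim from the question and the answer.

```python
import re
from typing import Optional

GREEDY = re.compile(r"^section1:([a-zA-Z0-9\- ]+)(\-xx)?$")
LAZY = re.compile(r"^section1:([a-zA-Z0-9\- ]+?)(|-xx)$")


def grab(pattern: "re.Pattern[str]", line: str) -> Optional[str]:
    """Return group 1 if the line matches, otherwise None."""
    match = pattern.match(line)
    return match.group(1) if match else None


# Greedy + swallows the suffix, so the optional group 2 stays empty.
print(grab(GREEDY, "section1:ypsilon-zeta-xx"))  # ypsilon-zeta-xx
# Lazy +? grows one char at a time until (|-xx)$ can finish the match.
print(grab(LAZY, "section1:ypsilon-zeta-xx"))    # ypsilon-zeta
print(grab(LAZY, "section1:ypsilon zeta"))       # ypsilon zeta
print(grab(LAZY, "section2:ypsilon"))            # None
```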

Use an alternation with -xx inside a non-capturing group, and use ? to make + non-greedy so that -xx is not sucked up into the match:
(?<=^section1):([a-zA-Z0-9\- ]+?)(?:-xx|:)
If you don't have the second : to use as a bookmark, use $:
(?<=^section1):([a-zA-Z0-9\- ]+?)(?:-xx|\s*$)

How to match a group of value to group 1

I was trying to solve a regex question posted on SO, but got stuck on this.
From this string
Ob=Web technology,OB=Product SPe,OB=Dev profile,OB=Computer Management,oB=Hardware Services,cd=sti,CD=com,cd=ws
The values have to be extracted as below.
Web technology,Product SPe,Dev profile,Computer Management,Hardware Services
I was trying the below regex.
(?=Ob)(?:(\w+=)([\w\s]+,?))+
My assumption was that group 1 would hold all the keys and group 2 all the values. But all key-value pairs except the last one end up only in group 0.
Is there a way of getting all the values into group 2?
And here is what I was working on.

The issue with your regex is that group 1 and group 2 sit inside a repeated non-capturing group. Each repetition overwrites them, so only the last key-value pair is kept while the whole run ends up in group 0. The other thing is that the positive lookahead prevents the regex from matching globally.
The regex below will gather all keys into group 1 and all values into group 2.
(\w+)=([\w\s]+)(?=[,\s]+)
Check it out how it works here.

,?cd=.*?(?:,|$)|ob=
Try this. Replace with an empty string, and do not forget the i flag. See demo.
http://regex101.com/r/lZ5mN8/59
or
cd=.*?(?:,|$)|[^=,]+=(.*?)(?=,|$)
Try this. Replace with $1. See demo.
http://regex101.com/r/lZ5mN8/57
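The first, delete-only variant can be tried in Python roughly as follows. This is a sketch under the assumption that re.IGNORECASE behaves like the i flag on regex101; SOURCE is the sample string from the question and strip_keys is an invented name.

```python
import re

SOURCE = (
    "Ob=Web technology,OB=Product SPe,OB=Dev profile,"
    "OB=Computer Management,oB=Hardware Services,cd=sti,CD=com,cd=ws"
)


def strip_keys(text: str) -> str:
    # Delete every cd=... entry (with its comma) and every ob= key prefix.
    return re.sub(r",?cd=.*?(?:,|$)|ob=", "", text, flags=re.IGNORECASE)


print(strip_keys(SOURCE))
```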

Regex:
(?i)Ob=([^,]+)|(?!.*\bob\b).+
Replacement string:
$1
(?i) Will do a case insensitive match.
Ob=([^,]+) Group index 1 contains all the Ob values.
| OR
(?!.*\bob\b).+ Matches one or more characters, but only from a position after which no standalone ob occurs, so it consumes the trailing non-Ob part.
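Below is a hedged Python sketch of this approach, assuming Python 3.5 or later, where an unmatched group in the replacement expands to an empty string. The function names are hypothetical; \1 replaces $1, and the findall variant is an extra shown only because it reads the values out directly.

```python
import re

SOURCE = (
    "Ob=Web technology,OB=Product SPe,OB=Dev profile,"
    "OB=Computer Management,oB=Hardware Services,cd=sti,CD=com,cd=ws"
)


def keep_ob_values(text: str) -> str:
    # First branch keeps each Ob value via group 1; second branch eats the
    # tail that contains no further standalone "ob" and replaces it with "".
    return re.sub(r"(?i)Ob=([^,]+)|(?!.*\bob\b).+", r"\1", text)


def ob_values(text: str) -> list:
    # Alternative: collect group 1 of every Ob=... match directly.
    return re.findall(r"(?i)Ob=([^,]+)", text)


print(keep_ob_values(SOURCE))
print(",".join(ob_values(SOURCE)))
```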

This regex should work for you:
^(?!Ob=).*(*SKIP)(*F)|(\w+)=(\w+(?=,|$))
You can see that you're getting all keys in group #1 and all values in group #2.
