How to match similar groups that repeat, but aren't the same

How to match similar groups that repeat, but aren't the same - regex

I have a number of strings relating to products. Each of these have reference numbers and I want to create a regex that picks up if different reference numbers are mentioned more than one time. So, given the following example:
"AB01 MyProduct" >>> No match - because there is only one ID
"AB02 MyOtherProduct" >>> No match - because there is only one ID
"AB03 YetAnotherProduct" >>> No match - because there is only one ID
"AnAccessory for AB01, AB02, AB03 or AB101" >>> Matches!!
"AB02 MyOtherProduct, MyOtherProduct called the AB02" >>> No match - because the codes are the same
Can anyone give me a clue?

If your regex engine supports negative lookaheads, this would do the trick:
(AB\d+).*?(?!\1)AB\d+
It matches if there are two sequences matching AB\d+ and the second one is not the same as the first one (ensured by the negative lookahead).
Explained:
( # start capture group 1
AB # match `AB` literally
\d+ # match one or more digits
) # end capture group one
.*? # match any sequence of characters, non-greedy
(?! # start negative lookahead, match this position if it does not match
\1 # whatever was captured in capture group 1
) # end lookahead
AB # match `AB` literally
\d+ # match one or more digits
Tests (JavaScript):
> var pattern = /(AB\d+).*?(?!\1)AB\d+/;
> pattern.test("AB01 MyProduct")
false
> pattern.test("AnAccessory for AB01, AB02, AB03 or AB101")
true
> pattern.test("AB02 MyOtherProduct, MyOtherProduct called the AB02")
false

Related

Or between groups when one group has to be preceeded by a character

I have the following data:
$200 – $4,500
Points – $2,500
I would like to capture the ranges in dollars, or capture the Points string if that is the lower range.
For example, if I ran my regex on each of the entries above I would expect:
Group 1: 200
Group 2: 4,500
and
Group 1: Points
Group 2: 2,500
For the first group, I can't figure out how to capture only the integer value (without the $ sign) while allowing for capturing Points.
Here is what I tried:
(?:\$([0-9,]+)|Points) – \$([0-9,]+)
https://regex101.com/r/mD9JeR/1

Just use an alternation here:
^(?:(Points)|\$(\d{1,3}(?:,\d{3})*)) - \$(\d{1,3}(?:,\d{3})*)$
Demo
The salient points of the above regex pattern are that we use an alternation to match either Points or a dollar amount on the lower end of the range, and we use the following regex for matching a dollar amount with commas:
\$\d{1,3}(?:,\d{3})*

Coming up with a regex that doesn't match the $ is not difficult. Coming up with a regex that doesn't match the $ and consistently puts the two values, whether they are both numeric or one of them is Points, as capture groups 1 and 2 is not straightforward. The difficulties disappear if you use named capture groups. This regex requires the regex module from the PyPi repository since it uses the same named groups multiple times.
import regex
tests = [
'$200 – $4,500',
'Points – $2,500'
]
re = r"""(?x) # verbose mode
^ # start of string
(
\$ # first alternate choice
(?P<G1>[\d,]+) # named group G1
| # or
(?P<G1>Points) # second alternate choice
)
\x20–\x20 # ' – '
\$
(?P<G2>[\d,]+) # named group g2
$ # end of string
"""
# or re = r'^(\$(?P<G1>[\d,]+)|(?P<G1>Points)) – \$(?P<G2>[\d,]+)$'
for test in tests:
m = regex.match(re, test)
print(m.group('G1'), m.group('G2'))
Prints:
200 4,500
Points 2,500
UPDATE
#marianc was on the right track with his comment but did not ensure that there were no extraneous characters in the input. So, with his useful input:
import re
tests = [
'$200 – $4,500',
'Points – $2,500',
'xPoints – $2,500',
]
rex = r'((?<=^\$)\d{1,3}(?:,\d{3})*|(?<=^)Points) – \$(\d{1,3}(?:,\d{3})*)$'
for test in tests:
m = re.search(rex, test)
if m:
print(test, '->', m.groups())
else:
print(test, '->', 'No match')
Prints:
$200 – $4,500 -> ('200', '4,500')
Points – $2,500 -> ('Points', '2,500')
xPoints – $2,500 -> No match
Note that a search rather than a match is done since a lookbehind assertion done at the beginning of the line cannot succeed. But we enforce no extraneous characters at the start of the line by including the ^ anchor in our lookbehind assertion.

For the first capturing group, you could use an alternation matching either Points and assert what is on the left is a non whitespace char, or match the digits with an optional decimal value asserting what is on the left is a dollar sign using a positive lookbehind if that is supported.
For the second capturing group, there is no alternative so you can match the dollar sign and capture the digits with an optional decimal value in group 2.
((?<=\$)\d{1,3}(?:,\d{3})*|(?<!\S)Points) – \$(\d{1,3}(?:,\d{3})*)
Explanation
( Capture group 1
(?<=\$)\d{1,3}(?:,\d{3})* Positive lookbehind, assert a $ to the left and match 1-3 digits and repeat 0+ matching a comma and 3 digits
| Or
(?<!\S)Points Positive lookbehind, assert a non whitespace char to the left and match Points
) Close group 1
– Match literally
\$ Match $
( Capture group 2
\d{1,3}(?:,\d{3})* Match 1-3 digits and 0+ times a comma and 3 digits
) Close group
Regex demo

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is that is it possible to use as few groups as possible to match as long string as possible?

How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and a different character) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.

Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
-$ - asserts the end of the line

Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases:
However, it failed in cases like:
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parenthesis around the regex, like:
(a{2,3}(?!a([^a]|$)))

Selecting if no delimiter, and no selecting if it is

I have string like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What mask should I use for selecting "2sg." if string does not contains"|", and select nothing if string does contains "|"?

I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
When this is inserted into a RegEx, it means if data exists, the RegEx will not match.
More info on the Negative Lookahead can be found here
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if contains |
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if contains |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... Strings ends with

Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes that there is(or there is no)number followed by sg. and not followed by |.

^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex

Try:
(\d+sg.(?!\|))
depending on your programming environment, it can be little bit different but will get your result.
For more information see Negative Lookahead

RegEx: Get every word until last 4 words

I have strings like
wwww-wwww-wwww
wwww-www-ww-ww
Many w separated with -
But it's not regular wwww-wwww, it could be w-w-w-w as well
I try to find a regex that capture every word until the last 4 words.
So the result for example 1 would be the first 8w's (wwww-wwww)
For 2nd example the first 5w's (wwww-w)
Is it possible to do this in regex?
I have something like this right now:
^\w*(?=\w{4}$)
or maybe
[^-]*(?=\w{4}$)
I have 2 problems with my "solutions":
the last 4 words will not be captured for example 2. They are interrupted by the -
the words before the last 4 will not be captured. They are interrupted by the -.

Yes, it's possible with a slightly more sophisticated lookahead assertion:
/\w(?=(?:-*\w){4,}$)/x
Explanation:
/ # Start of regex
\w # Match a "word" character
(?= # only if the following can be matched afterwards:
(?: # (Start of capturing group)
-* # - zero or more separators
\w # - exactly one word character
){4,} # (End of capturing group), repeated 4 or more times.
$ # Then make sure we've reached the end of the string.
) # End of lookahead assertion/x
Test it live on regex101.com.

regex conditional statement (only parse numbers if string has a certain beginning)

first I used a string that returned my relaystates, so “1.0.0.0.1.1.0.0” would get parsed/grouped with \d+,
then my eight switches used ‘format response’, e.g. {1} to get the state for each switch.
now I need to get the numbers out of this string: “RELAYS.1.0.0.0.1.1.0.0”
\d+ will still get the numbers but I only want to get them IF the string starts with “RELAYS"
can anyone please explain how I could do that?
thnx a million in advance!
Edited icebear (today 00:24)

With a .NET engine, you could use the regex (?<=^RELAYS[\d.]*)\d+. But most regex engines don't support indefinite repetition in a negative lookbehind assertion.
See it live on regexhero.net.
Explanation:
(?<= # Assert that the following can be matched before the current position:
^RELAYS # Start of string, followed by "RELAYS"
[\d.]* # and any number of digits/dots.
) # End of lookbehind assertion
\d+ # Match one or more digits.
With a PCRE engine, you could use (?:^RELAYS\.|\G\.)(\d+) and access group 1 for each match.
See it live on regex101.com.
Explanation:
(?: # Start a non-capturing group that matches...
^RELAYS\. # either the start of the string and "RELAYS."
| # or
\G\. # the position after the previous match, followed by "."
) # End of non-capturing group
(\d+) # Match a number and capture it in group 1

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to match similar groups that repeat, but aren't the same - regex

Related

Or between groups when one group has to be preceeded by a character

How to use regular expression to use as few groups as possible to match as long string as possible

Selecting if no delimiter, and no selecting if it is

RegEx: Get every word until last 4 words

regex conditional statement (only parse numbers if string has a certain beginning)

Categories

Resources