regex return conditional group - regex

I spent a lot of time trying to figure out a simple regex that returns its result in a single group (group 1 only).
So the string can be -
"No purchase required" or "Purchase of $50.00 worth groceries is required."
I am trying to write a regex which can parse "No" or "50" based on the given string.
This is what I have written.
(?:(No) monthly maintenance|Purchase of \$([\d\.]+) worth groceries)
This works fine but I want my output as 1st group/group 1 only.

Why not just use /(?:No monthly maintenance|Purchase of \$([0-9.]+) worth groceries)/?
The match will fail if the string is not in one of those formats. Group 1 is empty for the "No monthly maintenance" case and holds the number in the other case.
If you really need to capture the string No or the number, you might need to get a little more complicated and do something like:
/(?:Purchase of \$)?([0-9.]+|No) (?:monthly maintenance|worth groceries)/
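As a rough illustration, here is how the single-group pattern might be used from Python. The helper name first_group is made up for this sketch, and the dollar sign is escaped so that it matches a literal "$".

```python
import re

# Single capturing group: either the literal "No" or the amount.
# The dollar sign is escaped so it is treated as a literal character.
PATTERN = re.compile(
    r"(?:Purchase of \$)?([0-9.]+|No) (?:monthly maintenance|worth groceries)"
)

def first_group(text):
    """Return group 1 ('No' or the amount), or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

print(first_group("No monthly maintenance required"))
print(first_group("Purchase of $50.00 worth groceries is required."))
```

Note that this pattern is looser than the original: it would also accept mixed strings such as "No worth groceries".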

Most languages count the matching groups in order, regardless of whether they are in a non-capturing group ((?:...|...)), so forcing the interesting part into the first capturing group might be more trouble than it's worth.
Depending on the language you are using, you might want to try two different precompiled regular expressions and return the matching group of the first one that matches. That way you can easily fit the interesting part into the first group.
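A minimal sketch of the two-pattern idea, assuming Python. The names PATTERNS and extract are hypothetical; each pattern keeps the interesting part in its own group 1.

```python
import re

# One precompiled pattern per format; group 1 is the interesting part in each.
PATTERNS = [
    re.compile(r"(No) monthly maintenance"),
    re.compile(r"Purchase of \$([\d.]+) worth groceries"),
]

def extract(text):
    """Return group 1 of the first pattern that matches, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None

print(extract("No monthly maintenance required"))
print(extract("Purchase of $50.00 worth groceries is required."))
```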

I'm not sure you can get the result in group number 1, but you can get both results to appear in the same named group. Here's an example in PowerShell:
$input1 = 'Purchase of $50.00 worth groceries is required'
$input2 = 'No monthly maintenance required'
$re = '(?:(?<xyz>No) monthly maintenance|Purchase of \$(?<xyz>[\d\.]+) worth groceries)'
$match = [regex]::Match($input1, $re)
$match.Groups['xyz']
$match = [regex]::Match($input2, $re)
$match.Groups['xyz']
Which results in the following:
Success : True
Captures : {50.00}
Index : 13
Length : 5
Value : 50.00
Success : True
Captures : {No}
Index : 0
Length : 2
Value : No
Not all languages support named groups though. Since PowerShell runs on the .NET Framework, this will work for any .NET language.

Related

Regex to extract an "id" separated by string (or begin)

I need to extract IDs from a string : 1;#ChapitreA;#2;#ChapitreB;
Here, IDs are 1 and 2
I tried (^|;#)([0-9]?)(;) but it returns too many results:
0-2 1;
0-0 null
0-1 1
1-2 ;
12-16 ;#2;
12-14 ;#
14-15 2
15-16 ;
Constraint: I cannot use groups, as this regex will be used in Nintex Workflow, which doesn't support them. I also can't use any programming language to post-process the result. I need the regex to return exactly the result I need (the IDs) and nothing more.
Solution : (?:(?<=^)|(?<=;#))\d+(?=;)
Using groups : (?:^|;#)(\d+);
Without using groups : (?:(?<=^)|(?<=;#))\d+(?=;)
Thx #wiktor-stribiżew
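For comparison, the two variants can be tried out in Python (an assumption for illustration only, since Nintex Workflow itself is not scriptable this way). With lookarounds the whole match is the ID, so no group is needed.

```python
import re

text = "1;#ChapitreA;#2;#ChapitreB;"

# With a capturing group: findall returns the contents of group 1.
with_group = re.findall(r"(?:^|;#)(\d+);", text)

# Lookarounds only: the delimiters are checked but not consumed,
# so each whole match is exactly one ID.
without_group = re.findall(r"(?:(?<=^)|(?<=;#))\d+(?=;)", text)

print(with_group)
print(without_group)
```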

Regex have result in one group when pattern have several options inside

I am parsing a string with bundle configurations.
For simplicity's sake, let's say there are only two layouts:
WEIGHT*SIZE
and
SIZE*WEIGHT
sample data looks like this:
12g*15
13g*20
20pack*2.5kg
40packs*15g
10p*35g
Regex I am using now is basically two regex expressions for each layout divided by '|':
(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)
But for the first two lines the result is in group(2), and for the last three it is in group(1).
For simplicity's sake, can I somehow have the result only in group(1), regardless of which side of the "|" in my regex fired, so that I don't need to iterate through the groups after using re.search?
(I know I could just do ([\d.]{1,3})(?:g|kg), but I need to fetch the weight from exactly these types of layouts; a single weight without a bundle size, like 5kg, should not be taken into account.)
You didn't specify which language/flavour of RegEx you are using, but assuming you are using Python here are a few possible solutions:
Option 1: Select first non-empty capturing group
This is the solution proposed in the comment from Tim Biegeleisen, and probably the quickest. Might look something like this:
import re

# Raw string so the backslashes reach the regex engine unchanged.
pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')
rx = re.compile(pattern)
for e in examples:
    for match in rx.finditer(e):
        # Print whichever group took part in the match.
        for g in match.groups():
            if g:
                print(g)
Output:
12
13
2.5
15
35
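A variation on Option 1 that avoids the inner loop, offered as a sketch: because exactly one of the two groups takes part in any match of this pattern, the match object's lastindex attribute points at it. The function name weight is made up for the example.

```python
import re

PATTERN = re.compile(
    r"(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)"
    r"|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)"
)

def weight(text):
    """Return the captured weight, whichever alternative matched."""
    match = PATTERN.search(text)
    # Only one group participates per match, so lastindex identifies it.
    return match.group(match.lastindex) if match else None

examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(",")
weights = [weight(e) for e in examples]
print(weights)
```

This relies on the pattern having no other capturing groups. If more are added later, lastindex may no longer point at the weight.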
Option 2: Use named capturing groups
The syntax is (?P<name>regex) as per this page. Some regex flavours allow the same name for two capturing groups, so you could modify your RegEx to the following:
(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)
However, Python's built-in re module does not support identically named groups (as per this answer), so you would need to pip install and import the PyPI regex module. It might look like this:
import regex

pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')
rx = regex.compile(pattern)
for e in examples:
    for x in rx.finditer(e):
        print(x.group("weight"))
Output:
12
13
2.5
15
35
Option 3: Rewrite the RegEx so that both layouts share a single capturing group
You could make the parts before and after the weight optional so that you just have a single instance of the weight group:
(?:(?:[0-9]{1,3})(?:packs|pack|p|)\*)?([\d.,]{1,3})(?:g|kg)(?:\*(?:[0-9]{1,3})(?:packs|pack|p|))?
The above RegEx captures the number for the weight only in Group 1 for all of your examples. However it will also capture weights for a string like 40packs*15g*40packs that doesn't match your initial spec. You should be able to rewrite it to be more strict while still keeping only a single capturing group, but it might end up getting quite long.
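To make the trade-off concrete, here is a small check of the Option 3 pattern with Python's re module. The sample list is an assumption chosen for illustration. Because both surrounding parts are optional, the pattern also matches a bare weight such as 5kg, which the question wanted to exclude.

```python
import re

OPTION_3 = re.compile(
    r"(?:(?:[0-9]{1,3})(?:packs|pack|p|)\*)?"
    r"([\d.,]{1,3})(?:g|kg)"
    r"(?:\*(?:[0-9]{1,3})(?:packs|pack|p|))?"
)

def weight_option_3(text):
    """Return group 1 of the Option 3 pattern, or None."""
    match = OPTION_3.search(text)
    return match.group(1) if match else None

for sample in ["12g*15", "20pack*2.5kg", "40packs*15g*40packs", "5kg"]:
    print(sample, "->", weight_option_3(sample))
```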

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be fewer than 10 words.
The only sure thing is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.
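A quick way to try the second pattern, sketched in Python against the first sample from the question. Every word in group 1 is consumed together with the whitespace after it, so the capture carries a trailing space that is stripped here. By the same token, a repetition that ends the string with no trailing whitespace would not be captured in full.

```python
import re

PATTERN = re.compile(r"((\S+\s)+)(\S+\s){0,9}\1")

match = PATTERN.search("The name is is The name MY")
# Group 1 includes the whitespace after its last word, hence the strip().
repeated = match.group(1).strip() if match else None
print(repeated)
```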

Number groups with 0 as delimiter

There's a long natural number that can be grouped to smaller numbers by the 0 (zero) delimiter.
Example: 4201100370880
This would divide to Group1: 42, Group2: 110, Group3: 370880
There are 3 groups; groups never start with 0 and are at least 1 character long. Also, the last group is taken "as is", meaning it is not terminated by a trailing 0.
This is what I came up with, but it only works for certain inputs (like 420110037880):
(\d+)0([1-9][0-9]{1,2})0([1-9]\d+)
This shows I'm attempting to declare the 2nd group's length to min2 max3, but I'm thinking the correct solution should not care about it. If the delimiter was non-numeric I could probably tackle it, but I'm stumped.
All right, factoring in comment information, try splitting on a regex (this may vary based on what language you're using - .split(/.../) in JavaScript, preg_split in PHP, etc.)
The regex you want to split on is: 0(?!0). This translates to "a zero that is not followed by a zero". I believe this will solve your splitting problem.
If your language allows a limit parameter (PHP does), set it to 3. If not, you will need to do something like this (JavaScript):
result = input.split(/0(?!0)/);
result = result.slice(0,2).concat(result.slice(2).join("0"));
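The same idea in Python, as a sketch: re.split takes a maxsplit argument that counts splits rather than parts, so two splits yield the three groups. The anchored three-group pattern is included for comparison.

```python
import re

number = "4201100370880"

# Split on a zero that is not followed by another zero.
# maxsplit=2 means at most two splits, which leaves three parts.
groups = re.split(r"0(?!0)", number, maxsplit=2)
print(groups)

# Equivalent result with one anchored pattern and three capturing groups.
match = re.fullmatch(r"(.*?)0(?!0)(.*?)0(?!0)(.*)", number)
print(match.groups())
```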
The following one should suit your needs:
^(.*?)0(?!0)(.*?)0(?!0)(.*)$
The following regex works:
(\d+?)0(?!0) with the g modifier
Demo: http://regex101.com/r/rS4dE5
For only three matches, you can do:
(\d+?)0(?!0)(\d+?)0(?!0)(.*)

Can I optimize this phone-regex?

Ok, so I have this regex:
( |^|>)(((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8})))( |$|<)
It formats Dutch and Belgian phone numbers (I only want those hence the 31 and 32 as country code).
It's not much fun to decipher, and as you can see it also contains a lot of duplication, but it does handle the numbers very accurately.
All the following European formatted phone numbers are accepted
0031201234567
0031223234567
0031612345678
+31(0)20-1234567
+31(0)223-234567
+31(0)6-12345678
020-1234567
0223-234567
06-12345678
0201234567
0223234567
0612345678
and the following incorrectly formatted ones are not:
06-1234567 (mobile phone number in the Netherlands should have 8 numbers after 06 )
0223-1234567 (area code with home phone)
as opposed to this which is good.
020-1234567 (area code with 3 numbers has 7 numbers for the phone as opposed to a 4 number area code which can only have 6 numbers for phone number)
As you can see it's the '-' character that makes it a little difficult but I need it in there because it's a part of the formatting usually used by people, and I want to be able to parse them all.
Now my question is: do you see a way to simplify this regex (or even improve it if you see a fault in it), while keeping the same rules?
You can test it at regextester.com
(The '( |^|>)' is to check if it is at the start of a word with the possibility it being preceded by either a new line or a '>'. I search for the phone numbers in HTML pages.)
First observation: reading the regex is a nightmare. It cries out for Perl's /x mode.
Second observation: there are lots, and lots, and lots of capturing parentheses in the expression (42 if I count correctly; and 42 is, of course, "The Answer to Life, the Universe, and Everything" -- see Douglas Adams's "The Hitchhiker's Guide to the Galaxy" if you need that explained).
Bill the Lizard notes that you use '(-)?( )?' several times. There's no obvious advantage to that compared with '-? ?' or possibly '[- ]?', unless you are really intent on capturing the actual punctuation separately (but there are so many capturing parentheses working out which '$n' items to use would be hard).
So, let's try editing a copy of your one-liner:
( |^|>)
(
((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7})) |
((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6})) |
((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8}))
)
( |$|<)
OK - now we can see the regular structure of your regular expression.
There's much more analysis possible from here. Yes, there can be vast improvements to the regular expression. The first, obvious, one is to extract the international prefix part, and apply that once (optionally, or require the leading zero) and then apply the national rules.
( |^|>)
(
  (((\+|00)(31|32)( )?(\(0\))?)|0)
  (
    (((([0-9]{2})(-)?( )?)?)([0-9]{7})) |
    (((([0-9]{3})(-)?( )?)?)([0-9]{6})) |
    (((([0-9]{1})(-)?( )?)?)([0-9]{8}))
  )
)
( |$|<)
Then we can simplify the punctuation as noted before, and remove some plausibly redundant parentheses, and improve the country code recognizer:
( |^|>)
(
  (((\+|00)3[12] ?(\(0\))?)|0)
  (
    (((([0-9]{2})-? ?)?)[0-9]{7}) |
    (((([0-9]{3})-? ?)?)[0-9]{6}) |
    (((([0-9]{1})-? ?)?)[0-9]{8})
  )
)
( |$|<)
We can observe that the regex does not enforce the rules on mobile phone codes (so it does not insist that '06' is followed by 8 digits, for example). It also seems to allow the 1, 2 or 3 digit 'exchange' code to be optional, even with an international prefix - probably not what you had in mind, and fixing that removes some more parentheses. We can remove still more parentheses after that, leading to:
( |^|>)
(
  (((\+|00)3[12] ?(\(0\))?)|0) # International prefix or leading zero
  (
    ([0-9]{2}-? ?[0-9]{7}) | # xx-xxxxxxx
    ([0-9]{3}-? ?[0-9]{6}) | # xxx-xxxxxx
    ([0-9]{1}-? ?[0-9]{8}) # x-xxxxxxxx
  )
)
( |$|<)
And you can work out further optimizations from here, I'd hope.
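For anyone who wants to try the simplified layout, here is a sketch using Python's re.VERBOSE as a stand-in for Perl's /x mode. Three things differ from the listing and are assumptions of this sketch. Literal spaces are written as [ ] because bare whitespace is ignored in verbose mode. The three national alternatives are wrapped in their own group so that the prefix applies to all of them. Most groups are non-capturing, so group 1 is the whole number. The names PHONE_RE and find_phone are made up.

```python
import re

# Verbose-mode sketch of the simplified pattern. Literal spaces are written
# as [ ] because bare whitespace is ignored under re.VERBOSE.
PHONE_RE = re.compile(r"""
    (?:[ ]|^|>)                              # start of text, space or '>'
    (
        (?:(?:\+|00)3[12][ ]?(?:\(0\))?|0)   # international prefix or leading zero
        (?:
            [0-9]{2}-?[ ]?[0-9]{7}           # xx-xxxxxxx
          | [0-9]{3}-?[ ]?[0-9]{6}           # xxx-xxxxxx
          | [0-9]{1}-?[ ]?[0-9]{8}           # x-xxxxxxxx
        )
    )
    (?:[ ]|$|<)                              # end of text, space or '<'
""", re.VERBOSE)

def find_phone(text):
    """Return the first phone number found in text, or None."""
    match = PHONE_RE.search(text)
    return match.group(1) if match else None

accepted = [
    "0031201234567", "0031223234567", "0031612345678",
    "+31(0)20-1234567", "+31(0)223-234567", "+31(0)6-12345678",
    "020-1234567", "0223-234567", "06-12345678",
    "0201234567", "0223234567", "0612345678",
]
rejected = ["06-1234567", "0223-1234567"]

for number in accepted + rejected:
    print(number, "->", find_phone(number))
```

Like the listing it is based on, this version only checks digit counts. It does not tie the 8-digit form to the 06 mobile prefix.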
Good Lord Almighty, what a mess! :) If you have high-level semantic or business rules (such as the ones you describe talking about European numbers, numbers in the Netherlands, etc.) you'd probably be better served breaking that single regexp test into several individual regexp tests, one for each of your high level rules.
if number =~ /...../ # Dutch mobiles
  # ...
elsif number =~ /..../ # Belgian landlines
  # ...
# etc.
end
It'll be quite a bit easier to read and maintain and change that way.
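Split it into multiple expressions. For example, in Python:
import re

phone_no_patterns = [
    re.compile(r"[0-9]{13}"),                  # 0031201234567
    re.compile(r"\+(31|32)\(0\)\d{2}-\d{7}"),  # +31(0)20-1234567
    # ..etc..
]

def check_number(num):
    # Return the captured groups of the first pattern that matches the whole number.
    for pattern in phone_no_patterns:
        match = pattern.fullmatch(num)
        if match:
            return match.groups()
    return None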
Then you just loop over each pattern, checking if each one matches..
Splitting the patterns up makes it easy to fix specific numbers that are causing problems (which would be horrible with the single monolithic regex).
(31|32) looks bad. When matching 32, the regex engine will first try to match 31 (2 chars), fail, and backtrack two characters to match 32. It's more efficient to first match 3 (one character), try 1 (fail), backtrack one character and match 2, that is, 3(1|2) or 3[12].
Of course, your regex fails on 0800- numbers; they're not 10 digits.
It's not an optimization, but you use
(-)?( )?
three times in your regex. This will cause you to match on phone numbers like these
+31(0)6-12345678
+31(0)6 12345678
but will also match numbers containing a dash followed by a space, like
+31(0)6- 12345678
You can replace
(-)?( )?
with
(-| )?
to match either a dash or a space.
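The difference is easy to check with a cut-down pattern. The sketch below is Python and keeps only the +31(0)6 mobile prefix for brevity, which is an assumption made for the example, not the full regex.

```python
import re

# Separate optional dash and optional space: allows "- " together.
loose = re.compile(r"\+31\(0\)6(-)?( )?[0-9]{8}")
# Single optional separator: a dash or a space, but not both.
strict = re.compile(r"\+31\(0\)6(-| )?[0-9]{8}")

samples = ["+31(0)6-12345678", "+31(0)6 12345678", "+31(0)6- 12345678"]

loose_results = [bool(loose.fullmatch(s)) for s in samples]
strict_results = [bool(strict.fullmatch(s)) for s in samples]
print(loose_results)
print(strict_results)
```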