Regex have result in one group when pattern have several options inside - regex

I am parsing a string with bundle configurations.
For simplicity's sake, let's say there are only two layouts:
WEIGHT*SIZE
and
SIZE*WEIGHT
Sample data looks like this:
12g*15
13g*20
20pack*2.5kg
40packs*15g
10p*35g
The regex I am using now is basically two expressions, one per layout, separated by '|':
(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)
But for the first two lines it puts the result in group(2), and for the last three the result is in group(1).
For simplicity's sake, can I somehow have the result only in group(1) regardless of which side of the "|" in my regex matched, so that I don't need to iterate through the groups after using re.search?
(I know I could just use ([\d.]{1,3})(?:g|kg), but I need to fetch the weight from exactly these types of layouts; a single weight without a bundle size, like 5kg, should not be taken into account.)

You didn't specify which language/flavour of RegEx you are using, but assuming you are using Python here are a few possible solutions:
Option 1: Select first non-empty capturing group
This is the solution proposed in the comment from Tim Biegeleisen, and probably the quickest. It might look something like this:
import re

# Raw string so that \* and \d reach the regex engine unchanged
pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*([\d.,]{1,3})(?:g|kg)|([\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')
rx = re.compile(pattern)
for e in examples:
    for match in rx.finditer(e):
        # Only one of the two groups takes part in any given match
        for g in match.groups():
            if g:
                print(g)
Output:
12
13
2.5
15
35
Option 2: Use named capturing groups
The syntax is (?P<name>regex) as per this page. Some regex flavours allow the same name to be used for two capturing groups, so you could modify your RegEx to the following:
(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)
However, Python's built-in re module does not support identically named groups (as per this answer), so you would need to pip install and import the PyPI regex module. It might look like this:
import regex

pattern = r'(?:[0-9]{1,3})(?:packs|pack|p|)\*(?P<weight>[\d.,]{1,3})(?:g|kg)|(?P<weight>[\d.,]{1,3})(?:g|kg)\*(?:[0-9]{1,3})(?:packs|pack|p|)'
examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(',')
rx = regex.compile(pattern)
for e in examples:
    for x in rx.finditer(e):
        print(x.group("weight"))
Output:
12
13
2.5
15
35
Option 3: Rewrite the RegEx so that you can put both options inside a single named capturing group
You could make the parts before and after the weight optional so that you just have a single instance of the weight group:
(?:(?:[0-9]{1,3})(?:packs|pack|p|)\*)?([\d.,]{1,3})(?:g|kg)(?:\*(?:[0-9]{1,3})(?:packs|pack|p|))?
The above RegEx captures the number for the weight only in Group 1 for all of your examples. However, because both the part before and the part after the weight are optional, it will also capture weights for strings that don't match your initial spec, such as 40packs*15g*40packs or a bare 5kg. You should be able to rewrite it to be more strict while still keeping only a single capturing group, but it might end up getting quite long.
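One possible way to tighten Option 3 with the built-in re module is a conditional group. This is only a sketch, not part of the original answer: the group names prefix and weight and the helper find_weight are made up for illustration. The idea is that the trailing "*SIZE" part is required only when the leading "SIZE*" part was absent, so a bare weight such as 5kg is rejected while the weight always lands in the same named group.

```python
import re

# Hypothetical stricter pattern: "prefix" is the optional leading SIZE* part.
# The conditional (?(prefix)|...) demands a trailing *SIZE only when the
# prefix did not take part in the match.
PATTERN = re.compile(
    r"(?P<prefix>\d{1,3}(?:packs|pack|p)?\*)?"
    r"(?P<weight>[\d.,]{1,3})(?:kg|g)"
    r"(?(prefix)|\*\d{1,3}(?:packs|pack|p)?)"
)

def find_weight(text):
    """Return the captured weight as a string, or None if the layout does not fit."""
    m = PATTERN.search(text)
    return m.group("weight") if m else None

examples = "12g*15,13g*20,20pack*2.5kg,40packs*15g,10p*35g".split(",")
weights = [find_weight(e) for e in examples]
print(weights)
print(find_weight("5kg"))  # a single weight without a bundle size
```

Like the original alternation, this still finds a weight inside a longer string such as 40packs*15g*40packs when used with search; anchoring the pattern would be needed to exclude that.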

Related

Regular Expression Extracting Text from a group

I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break the name down into groups separated by underscores, which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so good.
Now I need to extract characters from one of the groups; for example, from group 2 I need the first 3 and 8 digits (keep in mind they could be letters too).
So I tried something like this:
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work, but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group, but not the 38. So I’m lost at this point.
Any help would be great.
Try the regex below to match any three letters/digits followed by one digit:
(.?)_([A-Z0-9]{3}[0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
Try the regex below to match any three letters/digits followed by one letter/digit:
(.?)_([A-Z0-9]{3}[A-Z0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
It will match any 3 letters/digits followed by 1 letter/digit.
If the first two letters are a constant like "PH", then try the one below:
(.?)_([PH]+[0-9A-Z]{2})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
I am assuming that you are trying to match a group 2 that starts with numbers. If that is the case, then you have to change the source string to something like:
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't help to show how in your language, but most languages employ a contains or indexOf method that can be used to determine whether or not a substring exists in a string.
Using regex alone, however, this can be accomplished using the following regular expression.
See regex in use here
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
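The split-based approach mentioned above can be sketched in Python, since the question does not name a language. The variable names are illustrative only; the filename is the one from the question, without the sentence-ending period.

```python
# Split the filename on underscores and inspect the second element directly
filename = "0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS"

parts = filename.split("_")
second = parts[1]              # the second underscore-separated field

has_38 = "38" in second        # simple substring test, no regex needed
position = second.find("38")   # index of the first "38", or -1 if absent

print(second, has_38, position)
```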

Complete Regex Pattern- String Exclusion, Optional End Brackets, Multiple Matches

I'm parsing a bunch of line items on an inventory list, and while each line describes something similar, the text format was not standardized. I've been working on a regex pattern for the past few days, but I'm not having much luck getting a pattern that matches all of my test scenarios. I'm hoping that someone with a lot more regex experience might be able to point out a few errors in the pattern.
Pattern To Match the palette number: \([Pp]alette [No\.\s]?#?(.*?)\),
1. Warehouse A, (Palette #91L41)
# Match Result Correct: 91L41
2. Warehouse B Palette No. 214
# Match Result Incorrect: no match
3. Warehouse Lot Storage C (Palette No. 9),
# Match Result Incorrect: o. 9 //I don't quite understand why it matches the o
4. Store Location D of Palette (Palette #1),
# Match Result Correct: 1
5. Store Location E of Palette, Empty, lot #45,
# Match Result Incorrect: no match
I've also tried to make the parentheses optional so that it will match examples 2 and 5, but it's too greedy and includes the previously mentioned word "lot".
Anything in square brackets causes the engine to look for ONE of the provided characters. Your pattern successfully matches, for example, strings like: (Palette Nabcdefg),
To indicate one of several alternatives, you'll need to use parentheses. What you're actually looking for should look something like this: [Pp]alette (No\.?\s?|#)?(\d+?)
Though it seems highly inefficient not to standardize the format. Your last case, for example, could be completely incompatible, since it seems to be capable of containing any kind of input.
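To illustrate the point about character classes, here is a small Python sketch (Python is an assumption; the question does not name a language). It runs the asker's original pattern and a variant with a non-capturing group against sample line 3. The variant pattern is written for this illustration only.

```python
import re

line = "3. Warehouse Lot Storage C (Palette No. 9),"

# Original pattern: [No\.\s]? is a character class, so it consumes at most ONE
# character (here the "N"), and the rest ("o. 9") ends up in the capture.
char_class = re.search(r"\([Pp]alette [No\.\s]?#?(.*?)\),", line)

# Variant: a group with alternation treats "No." as a unit
grouped = re.search(r"\([Pp]alette (?:No\.\s?|#)?(.*?)\),", line)

print(repr(char_class.group(1)))
print(repr(grouped.group(1)))
```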
A little bit of explanation on matching your patterns with regular expressions. You really don't need to look for and match your parentheses ( .. ) in this case.
Let's say we want to just find any string with the word Palette that is followed with whitespace and the # symbol and capture the Palette sequence from it.
You could simply just use the following:
[Pp]alette\s+#([A-Z0-9]+)
This will result in capturing 91L41 and 1 from the matched patterns
1. Warehouse A, (Palette #91L41)
4. Store Location D of Palette (Palette #1)
Now say we want to find any string that has Palette, followed by whitespace and either a # symbol or No.
We can use a Non-capturing group for this. Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything.
So we could do something like:
[Pp]alette\s+(?:No[ .]+|#)([A-Z0-9]+)
Now this results in matching the following strings and capturing 91L41, 214, 9 and 1
1. Warehouse A, (Palette #91L41)
2. Warehouse B Palette No. 214
3. Warehouse Lot Storage C (Palette No. 9)
4. Store Location D of Palette (Palette #1)
And lastly, if you want to match all of the sample strings and capture the Palette sequence, you could use:
[Pp]alette[\w, ]+(?:No[ .]+|#)([A-Z0-9]+)
See working demo and an explanation on this regular expression.
Everyone has a different way of using regular expressions, this is just one of many ways you can simply understand and accomplish this.
This should work for your case:
[Pp]alette.*?(?:No\.?|#)\s*(\w+)
This will search following types of patterns:
[Pp]alette{any_characters}No.{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}No{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}#{optional_spaces}(alphanumeric)
Check it in action here
MATCH 1
1. [26-31] `91L41`
MATCH 2
1. [60-63] `214`
MATCH 3
1. [104-105] `9`
MATCH 4
1. [148-149] `1`
MATCH 5
1. [195-197] `45`
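For anyone who wants to reproduce these matches locally, the following Python sketch applies the same pattern to the five sample lines from the question, one line at a time. Python and the variable names are assumptions; the pattern itself is taken from the answer unchanged.

```python
import re

pattern = re.compile(r"[Pp]alette.*?(?:No\.?|#)\s*(\w+)")

lines = [
    "1. Warehouse A, (Palette #91L41)",
    "2. Warehouse B Palette No. 214",
    "3. Warehouse Lot Storage C (Palette No. 9),",
    "4. Store Location D of Palette (Palette #1),",
    "5. Store Location E of Palette, Empty, lot #45,",
]

results = []
for line in lines:
    m = pattern.search(line)
    # group(1) is the palette number; None marks a line that did not match
    results.append(m.group(1) if m else None)

print(results)
```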

Create shortest possible regex

I want to create a regex that will match any of these values
7-5
6-6 ((0-99) - (0-99))
6-4
6-3
6-2
6-1
6-0
0-6
1-6
2-6
3-6
4-6
the 6-6 example is a special case, here are some examples of values:
6-6 (23-8)
6-6 (4-25)
6-6 (56-34)
Is it possible to make one regex that can do this?
If so, is it possible to further extend that regex for the 6-6 special case such that the difference between the two numbers within the parentheses is equal to 2 or -2?
I could easily write this with procedural code, but I'm really curious if someone can devise a regex for this.
Lastly, if it could be further extended such that the individual digits were in their own match groups I'd be amazed. An example would be for 7-5, I could have a match group that just had the value 7, and another that had the value 5. However, for 6-6 (24-26) I'd like a match group that had the first six, a match group for the second 6, a match group for the 24 and a match group for the 26.
This may be impossible, but some of you can probably get this part of the way there.
Good luck, and thanks for the help.
NO. The answer is "We can't," and the reason is because you're trying to use a hammer to dig a hole.
The problem with writing one long "clever" (this word causes a knee-jerk reaction in many people who are far more anti-regex than I) regex is that, six months from now, you'll have forgotten those clever regex features that you used so heavily, and you'll have written six months worth of code related to something else, and you'll get back to your impressive regex and have to tweak one detail, and you'll say, "WTF?"
This is what (I understand) you want, in Perl:
# data is in $_
if (/7-5|6-[0-4]|[0-4]-6|6-6 \((\d{1,2})-(\d{1,2})\)/) {
    # use defined() rather than truthiness so that a captured "0" still counts
    if (defined $1 and defined $2 and abs($1 - $2) == 2) {
        # we have the right difference
    }
}
Some might say that the given regex is a bit much, but I don't think it's too bad. If the \d{1,2} bit is a little too obscure you could use \d\d? (which is what I used at first, but didn't like the repetition).
You can do it like this:
7-5|6-[0-4]|[0-5]-6|6-6 \(\d\d?-\d\d?\)
Just add parens to get your match groups.
Off the top of my head (there may be some errors but the principle should be good):
\d-\d|6-6 (\d+-\d+)
And like with any regexp, you can surround what you want to extract with parentheses for match groups:
(\d)-(\d)|(6)-(6) ((\d)+-(\d+))
In the 6-6 case, the first two parentheses should get the sixes, and the second two should get the multi-digit values that come afterwards.
Here is one that will match only the numbers you want and let you get each digit by name:
p = r'(?P<a>[0-4]|6|7)-(?P<b>[0-4]|6|5) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?'
To get each digit you could use:
values = re.search(p, string).group('a', 'b', 'c', 'd')
This returns a four-element tuple with the values you are looking for (groups that did not take part in the match are None). Note that re.search itself returns None when nothing matches, so check for that before calling .group().
One problem with this pattern is that it will match the stuff in the parentheses whether or not there was a match to '6-6'. This one will only match the final parentheses if 6-6 is matched:
p = r'(?P<a>[0-4]|(?P<tmp_a>6)|7)-(?P<b>(?(tmp_a)(?P<tmp_b>6)|([0-4]|5)))(?(tmp_b) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?)'
I don't know of any way to look for a difference between the numbers in the parenthesis; regex only knows about strings, not numerical values . . .
(I am assuming python syntax here; the perl syntax is slightly different, though perl supports the python way of doing things.)
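Since a regex cannot compare numbers, one way to combine the ideas above is to let the regex extract the digits and let ordinary code do the checks. The following Python sketch does that; the pattern, the name SCORE and the function parse_score are hypothetical and only illustrate the split of responsibilities.

```python
import re

# Deliberately loose pattern: it only extracts the numbers.
# Which combinations are acceptable is decided in plain Python below.
SCORE = re.compile(
    r"(?P<a>\d)-(?P<b>\d)(?: \((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?"
)

def parse_score(text):
    """Return (a, b, c, d) as ints (c and d are None without a bracketed pair), or None."""
    m = SCORE.fullmatch(text)
    if m is None:
        return None
    a, b = int(m.group("a")), int(m.group("b"))
    if m.group("c") is None:
        # Plain values: 7-5, 6-0..6-4 and 0-6..4-6
        ok = (a, b) == (7, 5) or (a == 6 and b <= 4) or (a <= 4 and b == 6)
        return (a, b, None, None) if ok else None
    c, d = int(m.group("c")), int(m.group("d"))
    # The bracketed pair is only valid after 6-6, with a difference of exactly 2
    if (a, b) == (6, 6) and abs(c - d) == 2:
        return (a, b, c, d)
    return None

print(parse_score("7-5"))
print(parse_score("6-6 (24-26)"))
print(parse_score("6-6 (23-8)"))
```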

Regex expression to back reference more than 9 values in a replace

I have a regex that traverses a string and pulls out 40 values. It looks sort of like the pattern below, but much larger and more complicated:
est(.*)/test>test>(.*)<test><test>(.*)test><test>(.*)/test><test>(.*)/test><test>(.*)/test><test(.*)/test><test>(.*)/test><test>(.*)/test><test>(.*)/test><test>(.*)/test><test>(.*)/test><test>(.*)/test><test>(.*)/test><test>(.*)/test><test>(.*)/test>
My question is how do I use these expressions with the replace command when the number exceeds 9. It seems as if whenever I use \10 it returns the value for \1 and then appends a 0 to the end.
Any help would be much appreciated thanks :)
Also I am using UEStudio, but if a different program does it better then no biggie :)
As pointed out by psycho brm:
Use $10 instead of \10
I am using Notepad++ and it works beautifully.
Most of the simple Regex engines used by editors aren't equipped to handle more than 10 matching groups; it doesn't seem like UltraEdit can. I just tried Notepad++ and it won't even match a regex with 10 groups.
Your best bet, I think, is to write something quick in a scripting language with a decent regex engine, although that wouldn't answer the question as asked.
Here's something in Python:
import re

pattern = re.compile('(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)')
with open('input.txt', 'r') as f:
    for line in f:
        m = pattern.match(line)
        if m:  # skip lines shorter than 15 characters
            print(m.groups())
Note that Python allows backreferences such as \20: in order to have a backreference to group 2 followed by a literal 0, you need to use \g<2>0, which is unambiguous.
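A short sketch of how this looks with re.sub, using a made-up ten-character input so that group 10 exists:

```python
import re

# Ten single-character groups, so that group 10 exists
pattern = re.compile(r"(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)")
text = "abcdefghij"

# In a replacement string, \10 refers to group 10
tenth = pattern.sub(r"\10", text)

# \g<1>0 is group 1 followed by a literal "0"
first_then_zero = pattern.sub(r"\g<1>0", text)

print(tenth, first_then_zero)
```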
Edit:
Most flavors of regex, and editors which include a regex engine, should follow the replace syntax as follows:
input:    abcdefghijklmnop
search:   (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(?<name>.)(.)
group:    1  2  3  4  5  6  7  8  9  10 11 12 13
value:    a  b  c  d  e  f  g  h  i  j  k  l  m

replace   result
\11       a1   i.e. match 1, then the character "1"
${12}     l    most should support this
${name}   l    few support named references, but use them where you can
Named references are usually only possible in very specific flavor of regex libraries, test your tool to know for sure.
Put a $ in front of the double-digit subgroup, e.g. \1\2\3\4\5\6\7\8\9$10. It worked for me.
Try using named groups; so instead of the tenth:
(.*)
use:
(?<group10>.*)
and then use the following replace string:
${group10}
(That's of course in the absence of a better solution using looping, and remember that there might be different regex syntax flavours depending on your environment.)
If you cannot handle more than 9 subgroups why not initially match groups of 9 and then loop and apply regexes to those matches?
i.e. first match (<test.*/test>)+ and then for each subgroup match on <test(.*)/test>.

What is wrong with this Regular Expression?

I am a beginner and have some problems with regexp.
Input text is : something idUser=123654; nick="Tom" something
I need to extract the value of idUser -> 123654
I tried this:
//idUser is already 8 digits number
MatchCollection matchsID = Regex.Matches(pk.html, @"\bidUser=(\w{8})\b");
Text = matchsID[1].Value;
but in the output I get idUser=123654, and I need only the number.
The second problem is with nick="Tom": how can I get only the text Tom from this expression?
You don't show your output code, where you get the group from your match collection.
Hint: you will need group 1 and not group 0 if you want only what is in the parentheses.
.*?idUser=([0-9]+).*?
That regex should work for you :o)
Here's a pattern that should work:
\bidUser=(\d{3,8})\b|\bnick="(\w+)"
Given the input string:
something idUser=123654; nick="Tom" something
This yields 2 matches (as seen on rubular.com):
First match is idUser=123654, group 1 captures 123654
Second match is nick="Tom", group 2 captures Tom
Some variations:
In .NET regex, you can also use named groups for better readability.
If nick always appears after idUser, you can match the two at once instead of using alternation as above.
I've used {3,8} repetition to show how to match at least 3 and at most 8 digits.
API links
Match.Groups property
This is how you get what individual groups captured in a match
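The difference between the whole match (group 0) and the captured groups can be seen in this small sketch. It is written in Python rather than C# purely for illustration; the pattern is the one from this answer and the input is the string from the question.

```python
import re

text = 'something idUser=123654; nick="Tom" something'
pattern = re.compile(r'\bidUser=(\d{3,8})\b|\bnick="(\w+)"')

matches = list(pattern.finditer(text))

# group(0) is the whole match, including the "idUser=" / "nick=" prefix
whole = [m.group(0) for m in matches]

# lastindex is the number of the group that took part in this match (1 or 2)
captured = [m.group(m.lastindex) for m in matches]

print(whole)
print(captured)
```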
Use look-around
(?<=idUser=)\d{1,8}(?=(;|$))
To fix length of digits to 6, use (?<=idUser=)\d{6}(?=($|;))
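With look-around, the whole match is already just the value, so no group lookup is needed. A minimal Python sketch follows; Python is used only for illustration, and the nick pattern is an assumption added by analogy, not part of the answer above.

```python
import re

text = 'something idUser=123654; nick="Tom" something'

# The lookbehind and lookahead assert the surrounding text without consuming it
id_match = re.search(r"(?<=idUser=)\d{1,8}(?=;|$)", text)
user_id = id_match.group(0) if id_match else None

# Same idea for the nick (hypothetical pattern): only the text between the quotes
nick_match = re.search(r'(?<=nick=")\w+(?=")', text)
nick = nick_match.group(0) if nick_match else None

print(user_id, nick)
```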