Regular Expression Match Groups [duplicate] - regex

I have the following regular expression:
(.*(\d*)(-)(\d*).*)
It correctly matches the following string:
Court 19-24
However, the second group is empty - Group 1: Court 19-24, Group 2: [empty], Group 3: -, Group 4: 24
What is wrong with my regular expression that the second group doesn't contain 19?

Maybe you are looking for something like this?
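// \d+ requires at least one digit; the literal space keeps .* from consuming "19".
const re = /.* (\d+)(-)(\d+).*/;
const str = 'Court 19-24';
const match = re.exec(str);
console.log('group 0:', match[0]); // the whole match
console.log('group 1:', match[1]); // "19"
console.log('group 2:', match[2]); // "-"
console.log('group 3:', match[3]); // "24"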

Group indexes annotated
(.(\d)(-)(\d*).*)
1 2   3  4
Not sure what you expected here?
To match 19 and 24 into two separate groups, use something like this:
(\d+)-(\d+)
To also match the court name into a group, a non-greedy star works.
(.*?) (\d+)-(\d+)
I bet your initial regex was more like this:
(.*(\d*)(-)(\d*).*)
This will behave closer to what you describe, because
. also matches digits, i.e. the greedy .* will consume everything up to the -, including the 19, and
* is not required to match anything at all, i.e. \d* will happily match zero digits, causing group 2 to be empty.
Lessons:
Don't use * when you expect at least something to be there. Prefer + in this case.
Be wary of the greedy star, especially when used with the unspecific . it can match things you did not intend.
You don't have to put parentheses around everything in regular expressions. Only add groups around things you want to extract (i.e. "capture"), or around things you want to match/fail/repeat as one (i.e. "make atomic").
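As a rough illustration of the lessons above, this small JavaScript sketch runs the greedy pattern from the question and the lazy, +-based pattern against the same input. The variable names are made up for the example.

```javascript
const input = 'Court 19-24';

// Pattern from the question: the greedy .* swallows "19",
// so \d* is left to match zero digits in front of the hyphen.
const greedy = /(.*(\d*)(-)(\d*).*)/.exec(input);
console.log(JSON.stringify(greedy[2])); // ""
console.log(greedy[4]);                 // 24

// Lazy prefix plus \d+ forces at least one digit on each side of the hyphen.
const fixed = /(.*?) (\d+)-(\d+)/.exec(input);
console.log(fixed[1], fixed[2], fixed[3]); // Court 19 24
```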

Related

RegEx - Match a number that has different beginnings

I'm new to RegEx and I'm trying to match a specific number that has 8 digits, and has 3 start options:
00
15620450000
VS
For Example:
1562045000012345678
VS12345678
0012345678
12345678
I don't want to match the 4th option.
Right now I have managed to match the first and third options, but I'm having problems with the second one. I wrote this expression, trying to capture the 8 digits under 'Project':
156204500|VS|00(?<Project>\d{8})
What should I do?
Thanks
With your shown samples, please try the following regex.
^(?:00|15620450000|VS)(\d{8})$
OR to match it with Project try:
^(?:00|15620450000|VS)(?<Project>\d{8})$
Explanation: Adding detailed explanation for above.
^(?:00|15620450000|VS) ##Anchoring at the start of the value, then matching 00, 15620450000 or VS in a non-capturing group, as per the question.
(?<Project>\d{8} ##Creating a group named Project that matches exactly 8 digits.
)$ ##Closing the capturing group and anchoring at the end of the value.
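To check the anchored pattern against the four sample values from the question, something like the following JavaScript sketch could be used. The names projectPattern, samples and projects are illustrative only.

```javascript
// Anchored pattern with the named group "Project".
const projectPattern = /^(?:00|15620450000|VS)(?<Project>\d{8})$/;

const samples = ['1562045000012345678', 'VS12345678', '0012345678', '12345678'];

// For each sample, keep the captured project number, or null when there is no match.
const projects = samples.map((value) => {
  const m = projectPattern.exec(value);
  return m ? m.groups.Project : null;
});

console.log(projects); // [ '12345678', '12345678', '12345678', null ]
```

The fourth value has no allowed prefix, so it is rejected, which is what the question asks for.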
Let's understand why your solution failed; that will help you get around this kind of problem in the future. Your regex, 156204500|VS|00(\d{8}), is processed as follows:
156204500 OR VS OR 00(\d{8})
In arithmetic,
1 + 2 + 3 (4 + 5) <--- (4 + 5) is multiplied with only 3
is different from
(1 + 2 + 3) (4 + 5) <--- (4 + 5) is multiplied with (1 + 2 + 3)
This rule is applicable to RegEx as well. Obviously, you intended to use the second form.
By now, you must have already figured out the following solution:
(15620450000|VS|00)(\d{8})
Note that a capturing group only makes sense when you actually want to capture something. For pure grouping, regex has another concept called a non-capturing group, which you obtain by putting ?: as the first thing inside the parentheses. With a non-capturing group, the final solution becomes:
(?:15620450000|VS|00)\d{8}
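The precedence issue can be seen directly by running both forms on one of the samples. This is only a sketch: the named group Project is added to the grouped pattern here so the two results can be compared, and the variable names are invented for the example.

```javascript
const sample = 'VS12345678';

// Ungrouped: the alternation splits the whole pattern into three independent branches,
// so \d{8} belongs only to the "00" branch.
const ungrouped = /156204500|VS|00(?<Project>\d{8})/.exec(sample);
console.log(ungrouped[0]);             // VS
console.log(ungrouped.groups.Project); // undefined

// Grouped: the prefixes are alternatives, and \d{8} follows whichever one matched.
const grouped = /(?:15620450000|VS|00)(?<Project>\d{8})/.exec(sample);
console.log(grouped[0]);               // VS12345678
console.log(grouped.groups.Project);   // 12345678
```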

Matching multiple pattern without knowing if they are in the string

I am trying to find a regex in order to match with one pattern and then capture all the year dates following this pattern.
For example, I have the following strings and I am trying to get one regex expression to capture pattern 1 and then the following dates (max 2) if they exist.
"'pattern1' foobar 4 foo 1 bar foo 1900 and 2000"
"'pattern1' foobar 4 foo 1 bar foo 1900"
"'pattern1' foobar 4 foo 1 bar foo"
The following expression matches the first case but not if a date is removed:
('pattern1').*?(\d{4}).*?(\d{4})
Adding ? after the potential date groups only matches the pattern as it satisfies the expression with no match of dates:
('pattern1').*?(\d{4})?.*?(\d{4})?
Hence my issue is not being able to specify that a group may or may not be present in the string, but should be captured if it is.
You could make both parts optional and use word boundaries around the digits to prevent them from being part of a larger word.
If you want to match more years, you would have to add more optional groups. In that case, I would suggest using an approach like in the answer of @Alexander Mashin.
('pattern1')(?:.*?(\b\d{4}\b)(?:.*?(\b\d{4}\b))?)?
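A quick way to see how the nested optional groups behave is to run the pattern over the three strings from the question. This JavaScript sketch uses invented variable names; in JavaScript a group that did not take part in the match comes back as undefined.

```javascript
const yearPattern = /('pattern1')(?:.*?(\b\d{4}\b)(?:.*?(\b\d{4}\b))?)?/;

const lines = [
  "'pattern1' foobar 4 foo 1 bar foo 1900 and 2000",
  "'pattern1' foobar 4 foo 1 bar foo 1900",
  "'pattern1' foobar 4 foo 1 bar foo",
];

// Collect [pattern, first year, second year] for each line.
const captured = lines.map((line) => {
  const m = yearPattern.exec(line);
  return [m[1], m[2], m[3]];
});

console.log(captured[0]); // [ "'pattern1'", '1900', '2000' ]
console.log(captured[1]); // [ "'pattern1'", '1900', undefined ]
console.log(captured[2]); // [ "'pattern1'", undefined, undefined ]
```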
If you must solve your problem with one regular expression, simply use ^'pattern1'|\d{4}. The first match will be 'pattern1' at the beginning of the string; the remaining matches will be the years (post AD 999).
A more correct solution would be to match a line containing 'pattern1' and dates, capturing 'pattern1' and the tail containing the dates (e.g. ^(?<head>'pattern1')(?<tail>(?:.*?\d{4})+.*$)), and then to match the dates in the tail (just \d{4}). But the exact code depends on your environment.
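Since the exact code depends on the environment, here is one possible rendering of both ideas in JavaScript, offered only as a sketch. It assumes an engine with String.prototype.matchAll and named groups; the variable names are made up.

```javascript
const text = "'pattern1' foobar 4 foo 1 bar foo 1900 and 2000";

// One expression, many matches: the first hit is the pattern, the rest are years.
const tokens = [...text.matchAll(/^'pattern1'|\d{4}/g)].map((m) => m[0]);
console.log(tokens); // [ "'pattern1'", '1900', '2000' ]

// Two steps: split the line into head and tail, then pull the years out of the tail.
const line = /^(?<head>'pattern1')(?<tail>(?:.*?\d{4})+.*$)/.exec(text);
const years = line ? line.groups.tail.match(/\d{4}/g) : [];
console.log(years); // [ '1900', '2000' ]
```

Note that the two-step pattern requires at least one year, so a line without any date yields no match at all and years stays empty.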

Regular Expression Extracting Text from a group

I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break down the name into groups which are separated by a underscore. Which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so good.
Now I need to extract characters from one of the groups. For example, in group 2 I need the first 3 and 8 decimal digits (keep in mind they could be characters too).
So I tried something like this:
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group but not the 38 ? So I’m lost at this point.
Any help would be great
Try the below Regex to match any first 3 char/decimal and one decimal
(.?)_([A-Z0-9]{3}[0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
Try the below Regex to match any first 3 char/decimal and one decimal/char
(.?)_([A-Z0-9]{3}[A-Z0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
It will match any 3 letters/digits followed by 1 letter/digit.
If your first two letters are a constant like "PH" then try the below
(.?)_([PH]+[0-9A-Z]{2})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
I am assuming that you are trying to match group 2 starting with numbers. If that is the case then you have to change the source string to something such as
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't show how to do it in yours, but most languages provide a contains or indexOf method that can be used to determine whether a substring exists in a string.
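The question does not name a language, so purely as an illustration, the split-and-test idea might look like this in JavaScript. The trailing period after .PS in the question is assumed to be sentence punctuation and is left out of the file name here.

```javascript
// File name from the question (trailing sentence period omitted, which is an assumption).
const filename = '0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS';

// Split on the underscore and inspect the second element.
const parts = filename.split('_');
const second = parts[1];

console.log(second);                // PH3843C5
console.log(second.includes('38')); // true
console.log(second.indexOf('38'));  // 2
```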
Using regex alone, however, this can be accomplished using the following regular expression.
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
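To see which groups the capturing variant produces, it can be run against the file name from the question. This JavaScript sketch only prints the groups around the second part; the variable names are invented, and the trailing sentence period of the file name is assumed not to belong to it.

```javascript
const filename = '0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS';

// The "capturing the 38" pattern: groups 2, 3 and 4 together cover the second part.
const capture38 =
  /([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)/;

const m = capture38.exec(filename);
console.log(m[1]);  // 0296005
console.log(m[2]);  // PH
console.log(m[3]);  // 38
console.log(m[4]);  // 43C5
console.log(m[12]); // 0000000000000183
```

Because group 2 is greedy, it would stop before the last 38 if the second part contained more than one.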

Complete Regex Pattern- String Exclusion, Optional End Brackets, Multiple Matches

I'm parsing a bunch of line items on an inventory list and while each line describes something similar, the text format was not standardized. I've been working on a regex pattern for the past few days but I'm not having much luck getting a pattern that can match all of my test scenarios. I'm hoping that someone with a lot more regex experience might be able to point out a few errors in the pattern.
Pattern To Match the palette number: \([Pp]alette [No\.\s]?#?(.*?)\),
1. Warehouse A, (Palette #91L41)
# Match Result Correct: 91L41
2. Warehouse B Palette No. 214
# Match Result Incorrect: no match
3. Warehouse Lot Storage C (Palette No. 9),
# Match Result Incorrect: o. 9 //I don't quite understand why it matches the o
4. Store Location D of Palette (Palette #1),
# Match Result Correct: 1
5. Store Location E of Palette, Empty, lot #45,
# Match Result Incorrect: no match
I've also tried to make the parentheses optional so that it will match examples 2 and 5, but it's too greedy and includes the previously mentioned word lot.
Anything in brackets causes the engine to look for ONE of the provided characters. Your pattern successfully matches, for example, strings like: Palette Nabcdefg
To indicate one of several alternatives, you'll need to use parentheses. What you're actually looking for should look something like this: [Pp]alette (No\.?\s?|#)?(\d+?)
Though it seems highly ineffective to not standardize the pattern. Your last case for example could be completely incompatible since it seems to be capable of containing possibly any kind of input.
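The stray o in example 3 follows directly from the character class. The JavaScript sketch below runs the pattern from the question and, for comparison, a variant in which the class is replaced by a non-capturing alternation. That variant is an assumption made for illustration, not the pattern suggested above.

```javascript
const line = '3. Warehouse Lot Storage C (Palette No. 9),';

// [No\.\s]? consumes a single character ("N"), so the capture starts at "o".
const withClass = /\([Pp]alette [No\.\s]?#?(.*?)\),/.exec(line);
console.log(withClass[1]); // o. 9

// Hypothetical variant: the alternation consumes "No. " (or "#") as a unit.
const withGroup = /\([Pp]alette (?:No\.\s?|#)?(.*?)\),/.exec(line);
console.log(withGroup[1]); // 9
```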
A little bit of explanation on matching your patterns with regular expressions. You really don't need to look for and match your parentheses ( .. ) in this case.
Let's say we want to just find any string with the word Palette that is followed with whitespace and the # symbol and capture the Palette sequence from it.
You could simply just use the following:
[Pp]alette\s+#([A-Z0-9]+)
This will result in capturing 91L41 and 1 from the matched patterns
1. Warehouse A, (Palette #91L41)
4. Store Location D of Palette (Palette #1)
Now say we want to find any string that has Palette, followed by whitespace and either a # symbol or No.
We can use a Non-capturing group for this. Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything.
So we could do something like:
[Pp]alette\s+(?:No[ .]+|#)([A-Z0-9]+)
Now this results in matching the following strings and capturing 91L41, 214, 9 and 1
1. Warehouse A, (Palette #91L41)
2. Warehouse B Palette No. 214
3. Warehouse Lot Storage C (Palette No. 9)
4. Store Location D of Palette (Palette #1)
And lastly, if you want to match all five of the sample strings and capture the Palette sequence:
[Pp]alette[\w, ]+(?:No[ .]+|#)([A-Z0-9]+)
See working demo and an explanation on this regular expression.
Everyone has a different way of using regular expressions, this is just one of many ways you can simply understand and accomplish this.
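To compare the narrower and the broader pattern from this answer, both can be run over the five sample lines from the question. This is a JavaScript sketch with invented helper and variable names.

```javascript
const inventoryLines = [
  '1. Warehouse A, (Palette #91L41)',
  '2. Warehouse B Palette No. 214',
  '3. Warehouse Lot Storage C (Palette No. 9),',
  '4. Store Location D of Palette (Palette #1),',
  '5. Store Location E of Palette, Empty, lot #45,',
];

// Return the first capture group for each line, or null when the pattern does not match.
const capture = (pattern) =>
  inventoryLines.map((line) => {
    const m = pattern.exec(line);
    return m ? m[1] : null;
  });

const narrow = capture(/[Pp]alette\s+(?:No[ .]+|#)([A-Z0-9]+)/);
const broad = capture(/[Pp]alette[\w, ]+(?:No[ .]+|#)([A-Z0-9]+)/);

console.log(narrow); // [ '91L41', '214', '9', '1', null ]
console.log(broad);  // [ '91L41', '214', '9', '1', '45' ]
```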
This should work for your case:
[Pp]alette.*?(?:No\.?|#)\s*(\w+)
This will search following types of patterns:
[Pp]alette{any_characters}No.{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}No{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}#{optional_spaces}(alphanumeric)
Check it in action here
MATCH 1
1. [26-31] `91L41`
MATCH 2
1. [60-63] `214`
MATCH 3
1. [104-105] `9`
MATCH 4
1. [148-149] `1`
MATCH 5
1. [195-197] `45`

What is wrong with this Regular Expression?

I am beginner and have some problems with regexp.
Input text is : something idUser=123654; nick="Tom" something
I need to extract the value of idUser -> 123654
I try this:
//idUser is already 8 digits number
MatchCollection matchsID = Regex.Matches(pk.html, @"\bidUser=(\w{8})\b");
Text = matchsID[1].Value;
but as output I get idUser=123654, and I need only the number.
The second problem is with nick="Tom": how can I get only the text Tom from this expression?
You don't show your output code, where you get the group from your match collection.
Hint: you will need group 1 and not group 0 if you want to have only what is in the parentheses.
.*?idUser=([0-9]+).*?
That regex should work for you :o)
Here's a pattern that should work:
\bidUser=(\d{3,8})\b|\bnick="(\w+)"
Given the input string:
something idUser=123654; nick="Tom" something
This yields 2 matches (as seen on rubular.com):
First match is idUser=123654, group 1 captures 123654
Second match is nick="Tom", group 2 captures Tom
Some variations:
In .NET regex, you can also use named groups for better readability.
If nick always appears after idUser, you can match the two at once instead of using alternation as above.
I've used {3,8} repetition to show how to match at least 3 and at most 8 digits.
API links
Match.Groups property
This is how you get what individual groups captured in a match
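The answer targets .NET, where Match.Groups gives access to the captures. Purely as an illustration of the same pattern and two of the variations listed above, here is a JavaScript sketch, where group access looks different. The group names id and nick are chosen for the example.

```javascript
const input = 'something idUser=123654; nick="Tom" something';

// Alternation: two separate matches, each filling only one of the two groups.
const either = /\bidUser=(\d{3,8})\b|\bnick="(\w+)"/g;
const found = [...input.matchAll(either)].map((m) => ({ whole: m[0], id: m[1], nick: m[2] }));
console.log(found[0].whole, found[0].id);   // idUser=123654 123654
console.log(found[1].whole, found[1].nick); // nick="Tom" Tom

// Variation: named groups, matching both values at once when nick follows idUser.
const both = /\bidUser=(?<id>\d{3,8})\b.*?\bnick="(?<nick>\w+)"/.exec(input);
console.log(both.groups.id, both.groups.nick); // 123654 Tom
```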
Use look-around
(?<=idUser=)\d{1,8}(?=(;|$))
To fix length of digits to 6, use (?<=idUser=)\d{6}(?=($|;))
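With look-around, the whole match is already the bare value, so no group index is needed. A JavaScript sketch follows, assuming an engine that supports lookbehind (recent Node.js and modern browsers do). The nick pattern is an extra assumption added to cover the second part of the question.

```javascript
const input = 'something idUser=123654; nick="Tom" something';

// Lookbehind and lookahead assert the context without consuming it.
const idMatch = /(?<=idUser=)\d{1,8}(?=(;|$))/.exec(input);
console.log(idMatch[0]); // 123654

// Hypothetical counterpart for the nick value, using the same idea.
const nickMatch = /(?<=nick=")\w+(?=")/.exec(input);
console.log(nickMatch[0]); // Tom
```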