Matching multiple patterns without knowing if they are in the string - regex

I am trying to find a regex that matches one pattern and then captures all the year dates following that pattern.
For example, I have the following strings and I am trying to write a single regex that captures pattern 1 and then the following dates (max 2) if they exist.
"'pattern1' foobar 4 foo 1 bar foo 1900 and 2000"
"'pattern1' foobar 4 foo 1 bar foo 1900"
"'pattern1' foobar 4 foo 1 bar foo"
The following expression matches the first case but not if a date is removed:
('pattern1').*?(\d{4}).*?(\d{4})
Adding ? after the potential date groups matches only the pattern, because the expression is already satisfied without matching any dates:
('pattern1').*?(\d{4})?.*?(\d{4})?
Hence my issue is that I cannot specify that a group may or may not be present in the string, but should be captured if it is.

You could make both date parts optional and use word boundaries around the digits to prevent them from being part of a larger word.
If you want to match more years, you would have to add more optional groups. In that case, I would suggest using an approach like the one in the answer by @Alexander Mashin.
('pattern1')(?:.*?(\b\d{4}\b)(?:.*?(\b\d{4}\b))?)?
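As a rough illustration, the nested optional groups can be tried in Python (picked only for the demo, since the question does not name a language). The helper name capture_years is made up for this sketch.

```python
import re

# Nested optional groups: the second year can only match if the first one did.
PATTERN = re.compile(r"('pattern1')(?:.*?(\b\d{4}\b)(?:.*?(\b\d{4}\b))?)?")

def capture_years(text):
    """Return (pattern, year1, year2); years that are absent come back as None."""
    m = PATTERN.search(text)
    return m.groups() if m else None

samples = [
    "'pattern1' foobar 4 foo 1 bar foo 1900 and 2000",
    "'pattern1' foobar 4 foo 1 bar foo 1900",
    "'pattern1' foobar 4 foo 1 bar foo",
]
for s in samples:
    print(capture_years(s))
```

Because the whole date part is optional, the third string still matches and simply reports None for both years.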

If you must solve your problem with one regular expression, simply use ^'pattern1'|\d{4}. The first match will be 'pattern1' at the beginning of the string, and the remaining matches will be the years (four digits, so AD 1000 and later).
A more correct solution would be to match a line containing 'pattern1' and dates, capturing 'pattern1' and the tail containing the dates (e.g. ^(?<head>'pattern1')(?<tail>(?:.*?\d{4})+.*)$), and then to match the dates in the tail (just \d{4}). The exact code depends on your environment.
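A minimal sketch of that two-step idea, assuming Python as the environment. The function name head_and_years is hypothetical, and this variant deliberately lets the tail contain no dates at all, so lines without years still return the head.

```python
import re

# Step 1: check that the line starts with the marker and keep the rest as the tail.
# Python spells named groups (?P<name>...) rather than (?<name>...).
LINE_RE = re.compile(r"^(?P<head>'pattern1')(?P<tail>.*)$")
# Step 2: pull every standalone 4-digit number out of the tail.
YEAR_RE = re.compile(r"\b\d{4}\b")

def head_and_years(line):
    """Return (head, [years...]) or None if the line does not start with the marker."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.group("head"), YEAR_RE.findall(m.group("tail"))

print(head_and_years("'pattern1' foobar 4 foo 1 bar foo 1900 and 2000"))
```

Splitting the work this way removes the limit of two dates, since findall returns however many years the tail contains.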

Related

Finding nth occurrence of a pattern within a string in SQL (Presto)

I am writing a query in Presto SQL using the function regexp_extract
I have a string that may look like the following examples:
'1A2B2C3D3E'
'1A1B2C2D3E'
'1A2B1C2D2E'
What I'm trying to do is find for example the second occurrence of 1[A-E].
If I try
regexp_extract(col, '(1[A-E])(1[A-E])', 2)
This will work for the second example (and for the first, which returns nothing since there is no second occurrence). However, it fails for the third example and returns nothing. I know that is because my regex is searching for a 1[A-E] followed directly by another 1[A-E].
So then I tried
regexp_extract(col, '(1[A-E])(.*)(1[A-E])', 3)
But this does not work either. I am not sure how I can account for the fact that I may have 1A1B2C or 1A2B1C to find that second 1. Any help?
Your second pattern does work in the latest version of Trino (formerly known as Presto SQL):
WITH t(col) AS (
VALUES
'1A2B2C3D3E',
'1A1B2C2D3E',
'1A2B1C2D2E')
SELECT regexp_extract(col, '(1[A-E])(.*)(1[A-E])', 3)
FROM t
_col0
-------
NULL
1B
1C
(3 rows)
As others have commented, you don't need the capture groups for the first match or for the .*, and you should use the lazy quantifier to avoid .* eagerly matching all characters between the first and last occurrence:
WITH t(col) AS (
VALUES
'1A2B2C3D3E',
'1A1B2C2D3E',
'1A2B1C2D2E',
'1A2B1C2D1E')
SELECT regexp_extract(col, '1[A-E].*?(1[A-E])', 1)
FROM t
_col0
-------
NULL
1B
1C
1C
(4 rows)
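The difference between the lazy and the greedy quantifier can also be reproduced outside the database. The sketch below uses Python's re module, which is not the engine Trino uses, so treat it only as an illustration of the quantifier behaviour; second_hit is a made-up helper name.

```python
import re

rows = ["1A2B2C3D3E", "1A1B2C2D3E", "1A2B1C2D2E", "1A2B1C2D1E"]

def second_hit(pattern, text):
    """Return capture group 1 of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Lazy .*? stops at the nearest following 1[A-E]: the second occurrence.
lazy = [second_hit(r"1[A-E].*?(1[A-E])", r) for r in rows]
# Greedy .* runs to the end and backtracks: the last occurrence.
greedy = [second_hit(r"1[A-E].*(1[A-E])", r) for r in rows]

print(lazy)    # [None, '1B', '1C', '1C']
print(greedy)  # [None, '1B', '1C', '1E']
```

Only the fourth row, which has three occurrences, shows the difference.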
You don't need to capture the text in between: dropping the (.*) group leaves 2 capture groups in the result, and you can optionally match only the allowed characters between them.
From what I read on this page you might also consider using regexp_extract_all to get all the matches, as regexp_extract returns the first match.
As the example data consists of a digit followed by a char A-E, you could exclude the 1 from the character class used in between to prevent overmatching and backtracking.
(1[A-E])[02-9A-E]*(1[A-E])
If using a single capture group to get the second value is also ok, you can use
1[A-E][02-9A-E]*(1[A-E])
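Both ideas from this answer can be sketched in Python (an assumption for illustration; the question itself is about Presto). re.findall stands in for regexp_extract_all here, and the function names are hypothetical.

```python
import re

def second_via_class(text):
    """Second 1[A-E], skipping only characters that cannot start a new '1x' token."""
    m = re.search(r"1[A-E][02-9A-E]*(1[A-E])", text)
    return m.group(1) if m else None

def nth_occurrence(text, n, token=r"1[A-E]"):
    """Return the n-th (1-based) non-overlapping match of token, or None."""
    hits = re.findall(token, text)
    return hits[n - 1] if len(hits) >= n else None

print(second_via_class("1A2B1C2D2E"))   # 1C
print(nth_occurrence("1A2B1C2D1E", 3))  # 1E
```

Collecting all matches and indexing into the list generalizes to any n without changing the pattern.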

Regular Expression Extracting Text from a group

I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break down the name into groups which are separated by an underscore, which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so good.
Now I need to extract characters from one of the groups. For example, in group 2 I need the first 3 and 8 digits (keep in mind they could be letters too).
So I tried something like this:
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group, but not the 38. So I’m lost at this point.
Any help would be great.
Try the regex below to match any 3 letters/digits followed by one digit:
(.*?)_([A-Z0-9]{3}[0-9]{1})(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)
Try the regex below to match any 3 letters/digits followed by one letter/digit:
(.*?)_([A-Z0-9]{3}[A-Z0-9]{1})(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)
It will match any 3 letters/digits followed by 1 letter/digit.
If your first two letters are a constant like "PH", then try the one below:
(.*?)_([PH]+[0-9A-Z]{2})(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)
I am assuming that you are trying to match a group 2 that starts with numbers. If that is the case, then you have to change the source string to something like
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't help to show how in your language, but most languages employ a contains or indexOf method that can be used to determine whether or not a substring exists in a string.
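For illustration only, here is what the split-and-test approach could look like in Python (the question does not specify a language, so this choice and the helper name second_field_contains are assumptions):

```python
name = "0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS."

def second_field_contains(filename, needle="38"):
    """Split on '_' and test whether the second field contains needle."""
    parts = filename.split("_")
    return len(parts) > 1 and needle in parts[1]

second = name.split("_")[1]
print(second)                       # PH3843C5
print(second_field_contains(name))  # True
print(second.find("38"))            # 2 (index of the substring, -1 if absent)
```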
Using regex alone, however, this can be accomplished using the following regular expression.
See regex in use here
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
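As a quick check of the capturing variant, the sketch below runs it against the sample filename in Python (an assumed environment; the regex itself is unchanged from the line above).

```python
import re

SPLIT_38 = re.compile(
    r"([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)"
)

name = "0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS."
m = SPLIT_38.match(name)
groups = m.groups() if m else ()

# Groups 2-4 are the text before "38", the "38" itself, and the text after it.
print(groups[1], groups[2], groups[3])  # PH 38 43C5
```

Note that splitting the second field into three groups shifts the numbering of every later group by two.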

Complete Regex Pattern- String Exclusion, Optional End Brackets, Multiple Matches

I'm parsing a bunch of line items on an inventory list and while each line describes something similar, the text format was not standardized. I've been working on a regex pattern for the past few days but I'm not having much luck getting a pattern that can match all of my test scenarios. I'm hoping that someone with a lot more regex experience might be able to point out a few errors in the pattern.
Pattern To Match the palette number: \([Pp]alette [No\.\s]?#?(.*?)\),
1. Warehouse A, (Palette #91L41)
# Match Result Correct: 91L41
2. Warehouse B Palette No. 214
# Match Result Incorrect: no match
3. Warehouse Lot Storage C (Palette No. 9),
# Match Result Incorrect: o. 9 //I don't quite understand why it matches the o
4. Store Location D of Palette (Palette #1),
# Match Result Correct: 1
5. Store Location E of Palette, Empty, lot #45,
# Match Result Incorrect: no match
I've also tried to make the parentheses optional so that it will match examples 2 and 5, but then it's too greedy and includes the previously mentioned lot word.
Anything in square brackets causes the engine to look for ONE of the listed characters. So [No\.\s]? matches at most a single N, o, dot, or whitespace character; in example 3 it consumes only the N, which is why the capture starts with o. Your pattern successfully matches, for example, strings like: (Palette Nabcdefg),
To indicate one of several alternatives, you'll need to use parentheses. What you're actually looking for should look something like this: [Pp]alette (No\.?\s?|#)?(\w+)
That said, it seems highly inefficient not to standardize the format. Your last case, for example, could be completely incompatible, since it seems it may contain any kind of input.
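The character-class behaviour is easy to reproduce. The sketch below, in Python (an assumed environment), runs the question's original pattern and a variant where the class is replaced by a non-capturing alternation; the variant keeps the asker's surrounding parentheses and is only one possible fix.

```python
import re

line = "3. Warehouse Lot Storage C (Palette No. 9),"

# Original: [No\.\s]? is a character class, so it consumes only the "N".
original = re.search(r"\([Pp]alette [No\.\s]?#?(.*?)\),", line)
# Variant: (?:No\.?\s?|#)? treats "No." and "#" as whole alternatives.
fixed = re.search(r"\([Pp]alette (?:No\.?\s?|#)?(.*?)\),", line)

print(repr(original.group(1)))  # 'o. 9'
print(repr(fixed.group(1)))     # '9'
```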
A little bit of explanation on matching your patterns with regular expressions. You really don't need to look for and match your parentheses ( .. ) in this case.
Let's say we want to just find any string with the word Palette that is followed with whitespace and the # symbol and capture the Palette sequence from it.
You could simply just use the following:
[Pp]alette\s+#([A-Z0-9]+)
This will result in capturing 91L41 and 1 from the matched patterns
1. Warehouse A, (Palette #91L41)
4. Store Location D of Palette (Palette #1)
Now say we want to find any string that has Palette, followed by whitespace and either a # symbol or No.
We can use a Non-capturing group for this. Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything.
So we could do something like:
[Pp]alette\s+(?:No[ .]+|#)([A-Z0-9]+)
Now this results in matching the following strings and capturing 91L41, 214, 9 and 1
1. Warehouse A, (Palette #91L41)
2. Warehouse B Palette No. 214
3. Warehouse Lot Storage C (Palette No. 9)
4. Store Location D of Palette (Palette #1)
And lastly, if you want to match all five sample strings (including the "lot #45" case) and capture the Palette sequence:
[Pp]alette[\w, ]+(?:No[ .]+|#)([A-Z0-9]+)
See working demo and an explanation on this regular expression.
Everyone has a different way of using regular expressions; this is just one of many ways to understand and accomplish this.
This should work for your case:
[Pp]alette.*?(?:No\.?|#)\s*(\w+)
This will search following types of patterns:
[Pp]alette{any_characters}No.{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}No{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}#{optional_spaces}(alphanumeric)
Check it in action here
MATCH 1
1. [26-31] `91L41`
MATCH 2
1. [60-63] `214`
MATCH 3
1. [104-105] `9`
MATCH 4
1. [148-149] `1`
MATCH 5
1. [195-197] `45`
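The same captures can be reproduced line by line with a short Python script (Python is an assumption here, as the question names no language):

```python
import re

PALETTE_RE = re.compile(r"[Pp]alette.*?(?:No\.?|#)\s*(\w+)")

lines = [
    "1. Warehouse A, (Palette #91L41)",
    "2. Warehouse B Palette No. 214",
    "3. Warehouse Lot Storage C (Palette No. 9),",
    "4. Store Location D of Palette (Palette #1),",
    "5. Store Location E of Palette, Empty, lot #45,",
]

ids = []
for line in lines:
    m = PALETTE_RE.search(line)
    ids.append(m.group(1) if m else None)

print(ids)  # ['91L41', '214', '9', '1', '45']
```

Because .*? accepts any characters, a "No" appearing inside an unrelated word after Palette would also be accepted, so this pattern trades strictness for coverage.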

perl regular expression digit match

I am using the regex match below to verify whether a date is in the YYYY_MM_DD format, but the check prints the error message if I have a value of 2012_07_7. According to the regex pattern, the day and month parts should be exactly 2 digits. I am not sure why it's not working.
if ($cmdParams{RunId} !~ m/^\d{4}_\d{2}_\d{2}$/)
{
    print "Not a valid date in the format YYYY_MM_DD";
}
Your regex specifies exactly 2 digits for the day component. If you want to allow either 1 or 2 digits, you should use {1,2} rather than {2}.
If you look at your data, 2012_07_7, you can see that the day part does not have two digits.
Your pattern dictates that the last numeric chunk must have two digits, whereas you are providing one. So if you want your pattern to match this text, try something like:
if ($cmdParams{RunId} !~ m/^\d{4}_\d{2}_\d\d?$/)
My solution: ^\d{4}_(?:1[0-2]|0?[1-9])_(?:3[01]|[1-2]\d|0?[1-9])$
This pattern matches 2000_12_01, 2001_1_1 and 2001_02_1.
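To compare the three patterns from these answers side by side, here is a small sketch in Python rather than Perl (a choice made only for this illustration). fullmatch stands in for the ^...$ anchors, and is_valid is a hypothetical helper.

```python
import re

STRICT = re.compile(r"\d{4}_\d{2}_\d{2}")    # the question's pattern
LENIENT = re.compile(r"\d{4}_\d{2}_\d\d?")   # day may have 1 or 2 digits
RANGED = re.compile(r"\d{4}_(?:1[0-2]|0?[1-9])_(?:3[01]|[1-2]\d|0?[1-9])")

def is_valid(pattern, run_id):
    # fullmatch requires the whole string to match, like ^...$ in the Perl code
    return pattern.fullmatch(run_id) is not None

for value in ("2012_07_07", "2012_07_7", "2001_13_01"):
    print(value, is_valid(STRICT, value), is_valid(LENIENT, value), is_valid(RANGED, value))
```

Only the ranged pattern rejects month 13, but none of them checks that the day actually exists in the given month.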

What is wrong with this Regular Expression?

I am a beginner and have some problems with regexp.
Input text is: something idUser=123654; nick="Tom" something
I need to extract the value of idUser -> 123654
I tried this:
// idUser is always an 8-digit number
MatchCollection matchsID = Regex.Matches(pk.html, @"\bidUser=(\w{8})\b");
Text = matchsID[1].Value;
but in the output I get idUser=123654; I need only the number.
The second problem is with nick="Tom": how can I get only the text Tom from this expression?
You don't show your output code, where you get the group from your match collection.
Hint: you will need group 1 and not group 0 if you want to have only what is in the parentheses.
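The group 0 versus group 1 distinction works the same way in most regex APIs. A minimal sketch in Python (used here only for illustration; the question's code is C#, where the equivalent is the Groups property of a Match):

```python
import re

text = 'something idUser=123654; nick="Tom" something'

m = re.search(r"\bidUser=(\d+)\b", text)
whole = m.group(0)   # the entire match, including the "idUser=" prefix
number = m.group(1)  # only what the first pair of parentheses captured

print(whole)   # idUser=123654
print(number)  # 123654
```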
.*?idUser=([0-9]+).*?
That regex should work for you :o)
Here's a pattern that should work:
\bidUser=(\d{3,8})\b|\bnick="(\w+)"
Given the input string:
something idUser=123654; nick="Tom" something
This yields 2 matches (as seen on rubular.com):
First match is idUser=123654, group 1 captures 123654
Second match is nick="Tom", group 2 captures Tom
Some variations:
In .NET regex, you can also use named groups for better readability.
If nick always appears after idUser, you can match the two at once instead of using alternation as above.
I've used {3,8} repetition to show how to match at least 3 and at most 8 digits.
API links
Match.Groups property
This is how you get what individual groups captured in a match
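The alternation and the named-group variation can be combined in one short sketch. It is written in Python for illustration, where named groups are spelled (?P<name>...); the group names id and nick are made up for this example.

```python
import re

text = 'something idUser=123654; nick="Tom" something'

PAIR_RE = re.compile(r'\bidUser=(?P<id>\d{3,8})\b|\bnick="(?P<nick>\w+)"')

found = {}
for m in PAIR_RE.finditer(text):
    # m.lastgroup is the name of the group that took part in this match
    found[m.lastgroup] = m.group(m.lastgroup)

print(found)  # {'id': '123654', 'nick': 'Tom'}
```

Each iteration is one match of the alternation, so only one of the two named groups is set at a time.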
Use look-around
(?<=idUser=)\d{1,8}(?=(;|$))
To fix the number of digits at exactly 6, use (?<=idUser=)\d{6}(?=($|;))
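With look-around the whole match is already the bare value, so no group indexing is needed. A small sketch in Python (an assumed environment; the nick pattern is an extra illustration that follows the same idea and is not part of the answer above):

```python
import re

text = 'something idUser=123654; nick="Tom" something'

# Lookbehind asserts the prefix, lookahead asserts ";" or end of input.
id_match = re.search(r"(?<=idUser=)\d{1,8}(?=;|$)", text)
user_id = id_match.group(0) if id_match else None

# Same technique for the quoted nick value.
nick_match = re.search(r'(?<=nick=")\w+(?=")', text)
nick = nick_match.group(0) if nick_match else None

print(user_id, nick)  # 123654 Tom
```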