regex for specific fileName patterns

Hi, I had a situation with file names like frame02_0046.tiff, for which I figured out the following regex: ^(.*)(\d+)([^\d]*)$. However, I have another pattern of names, frame1.03.png to frame5.03.png.
How can I include both name patterns in the regex?
import re
pattern = r'^(.*)(\d+)([^\d]*)$'
patternExpr = re.compile(pattern)

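As posted in the comment, your regex is "bad" for the purpose you have: the greedy (.*) consumes everything up to the last digit, so (\d+) captures only a single digit (for frame02_0046.tiff, group 2 is just 6).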
You can use this regex instead:
^([a-z]*?)(\d+)[_.](\d+)\.[a-z]+
Working demo
Note the flags I used in the above link.
Match information
Match 1
Full match 0-17 `frame02_0046.tiff`
Group 1. 0-5 `frame`
Group 2. 5-7 `02`
Group 3. 8-12 `0046`
Match 2
Full match 18-31 `frame1.03.png`
Group 1. 18-23 `frame`
Group 2. 23-24 `1`
Group 3. 25-27 `03`
Match 3
Full match 32-45 `frame5.03.png`
Group 1. 32-37 `frame`
Group 2. 37-38 `5`
Group 3. 39-41 `03`
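Since the question compiles its pattern with re.compile, here is a minimal Python sketch of the same regex. The helper name split_frame_name is made up for illustration, and each file name is matched on its own, so no multi-line flag is needed here.

```python
import re

# Pattern from the answer above; the raw string keeps the backslashes intact.
frame_re = re.compile(r'^([a-z]*?)(\d+)[_.](\d+)\.[a-z]+')

def split_frame_name(name):
    """Return (prefix, first_number, second_number), or None if the name does not match."""
    m = frame_re.match(name)
    return m.groups() if m else None

names = ['frame02_0046.tiff', 'frame1.03.png', 'frame5.03.png']
results = [split_frame_name(n) for n in names]
for name, parts in zip(names, results):
    print(name, parts)
```

The three groups come back as strings, so convert them with int() if the frame numbers are needed as numbers.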

Related

Reusing branch reset group doesn't match all the alternatives

I am trying to validate an IPv4 address using the RegEx below
^((?|([0-9][0-9]?)|(1[0-9][0-9])|(2[0-5][0-5]))\.){3}(?2)$
The regex works fine until the 3rd octet of the IP address in most of the cases. But sometimes in the last octet, it only matches the first alternative in the Branch Reset Group and ignores the other alternating groups altogether. I know that all alternatives in a branch reset group refer to the same capturing group. I tried the suggestion to reuse the capture groups as described in this StackOverflow post. It worked partially.

There is an explanation about this behaviour on this page:
https://www.pcre.org/original/doc/html/pcrepattern.html#SEC15
The documentation states:
a subroutine call to a numbered subpattern always refers to the first
one in the pattern with the given number.
Using the example on that page:
(?|(abc)|(def))(?1)
Inside a (?| group, parentheses are numbered as usual, but the number
is reset at the start of each branch.
The numbers will look like this:
(?|(abc)|(def))
   1     1
This will match
abcabc
defabc
But it does not match
defdef
It does not match defdef because the pattern will match the first def, but the following (?1) can only match the first subpattern numbered 1, which is (abc).
See a regex demo.

The reason is that the (?2) regex subroutine recurses the first capturing group pattern with the ID 2, ([0-9][0-9]?). If that fails to match (the $ requires the end of string right after it), backtracking starts and the match eventually fails.
The correct approach to recurse a group of patterns is to avoid using a branch reset group and capture all alternatives into a single capturing group that will be recursed:
^(?:(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?1)$
//  |____________ Group 1 _______________|      \_ Regex subroutine
See the regex demo.
Note that the octet pattern is a bit different; it is taken from How to Find or Validate an IP Address. Your octet pattern is wrong because 2[0-5][0-5] does not match numbers between 200 and 255 that end with 6, 7, 8 or 9 (for example 206 or 249).
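Python's built-in re module supports neither branch reset groups nor (?1) subroutine calls, so the regex above will not compile there. A rough equivalent, sketched under that assumption, is to keep the octet pattern in a string and reuse it when building the full pattern. The names OCTET, IPV4_RE and is_ipv4 are illustrative only.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (same alternatives as the answer above).
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Reuse the octet text instead of a (?1) subroutine call.
IPV4_RE = re.compile(r'^(?:' + OCTET + r'\.){3}' + OCTET + r'$')

def is_ipv4(text):
    """Return True if text looks like a dotted-quad IPv4 address."""
    return IPV4_RE.match(text) is not None

for candidate in ['192.168.1.1', '10.0.0.209', '256.1.1.1', '1.2.3']:
    print(candidate, is_ipv4(candidate))
```

This is string reuse rather than a true subroutine call, but for a non-recursive pattern like this one the resulting regex is the same.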

REGEX Capturing differing sets of repeating groups

This is a two-part question, but I feel the answers will be related.
I have this regex pattern:
(\d+)(aa|bb) which I use to capture this string: 1bb2aa3aa4bb5bb6aa7bb8cc9cc
See demo: example 1
The way it captures the random series of aa and bb (both preceded by a digit) is exactly what I want, and is good as far as it goes.
So we get this match on regex101:
Match 1
Full match 0-3 `1bb`
Group 1. 0-1 `1`
Group 2. 1-3 `bb`
Match 2
Full match 3-6 `2aa`
Group 1. 3-4 `2`
Group 2. 4-6 `aa`
Match 3
Full match 6-9 `3aa`
Group 1. 6-7 `3`
Group 2. 7-9 `aa`
Match 4
Full match 9-12 `4bb`
Group 1. 9-10 `4`
Group 2. 10-12 `bb`
Match 5
Full match 12-15 `5bb`
Group 1. 12-13 `5`
Group 2. 13-15 `bb`
Match 6
Full match 15-18 `6aa`
Group 1. 15-16 `6`
Group 2. 16-18 `aa`
Match 7
Full match 18-21 `7bb`
Group 1. 18-19 `7`
Group 2. 19-21 `bb`
As expected, the 8cc9cc bit at the end is ignored. I would like to capture this as well, in the same way I have captured the first repeating groups, in the same expression. So in the final output, I'd get something like this added to the end of the output. This should work for any number of matches on either side. This text is just one example.
Match 8
Full match 21-24 `8cc`
Group 1. 21-22 `8`
Group 2. 22-24 `cc`
Match 9
Full match 24-27 `9cc`
Group 1. 24-25 `9`
Group 2. 25-27 `cc`
Also, I'd like to do something similar but with the 'or' group flipped to the end, i.e. this:
1cc2cc3cc4cc5cc6cc7ccb8aa9bb
My current regex pattern (\d+)(cc) only matches the repeating 'cc' groups.
See demo: example 2
I would like a similar full capture, with any amount of permissible entries of each group.
Any thoughts?

You may use
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+))(\d+)(aa|bb|cc)
See the regex demo
The regex will only match a string that meets the pattern in the (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) lookahead, and will then consecutively match and capture digits followed by aa, bb or cc; digits + aa or bb are only matched if no digit + cc immediately precedes them.
Details
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+)) - either of the two alternatives:
    \G(?!^) - end of the previous successful match
    (?(?=\d+(?:aa|bb))(?<!\dcc)) - if-then-else construct: if there are 1+ digits and aa or bb immediately to the right of the current location ((?=\d+(?:aa|bb))), then only continue matching if there is no digit followed with cc immediately to the left of the current location ((?<!\dcc))
    | - or
    ^ - start of string
    (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) - a positive lookahead that, immediately to the right of the current location, searches for the following (and returns true if it finds the patterns, or false if it does not):
        (?:\d+(?:aa|bb))+ - one or more occurrences of 1+ digits followed with aa or bb
        (?:\d+cc)+ - one or more occurrences of 1+ digits followed with cc
(\d+) - Group 1: one or more digits
(aa|bb|cc) - Group 2: aa, bb or cc.
For the second pattern, swap cc and (?:aa|bb):
(?:\G(?!^)(?(?=\d+cc)(?<!\d(?:aa|bb)))|(?=(?:\d+cc)+(?:\d+(?:aa|bb))+))(\d+)(aa|bb|cc)

I'm no expert with perl, so I'll give a bit of pseudo code here. Feel free to suggest an edit.
You can start by matching one or more xaa or xbb combos, followed by one or more xcc combos, using this pattern: ^(?:\d+(?:aa|bb))+(?:\d+cc)+$
Once you have that, you can use this pattern to capture the appropriate groups: (\d+)(aa|bb|cc)
Demo 1
Demo 2
Something like:
if (ismatch("^(?:\d+(?:aa|bb))+(?:\d+cc)+$", inputString))
{
    match = match("(\d+)(aa|bb|cc)", inputString);
}
From here you can extract the information using the groups.
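The validate-then-extract idea can be sketched in Python roughly as follows. The names VALID_RE, PAIR_RE and extract_pairs are made up for this example, and the validation pattern uses \d+cc so that multi-digit numbers before cc are accepted too.

```python
import re

# Step 1: the whole string must be (digits + aa|bb)+ followed by (digits + cc)+.
VALID_RE = re.compile(r'^(?:\d+(?:aa|bb))+(?:\d+cc)+$')
# Step 2: pull out every (digits, suffix) pair.
PAIR_RE = re.compile(r'(\d+)(aa|bb|cc)')

def extract_pairs(text):
    """Return a list of (digits, suffix) tuples, or [] if the layout is wrong."""
    if not VALID_RE.match(text):
        return []
    return PAIR_RE.findall(text)

pairs = extract_pairs('1bb2aa3aa4bb5bb6aa7bb8cc9cc')
print(pairs)
```

For the flipped layout from the second part of the question, only the validation pattern needs its two halves swapped; the extraction pattern stays the same.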

regex group not matching

I have the regex
(\d|(IV|I{0,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+
If I use the sentence
'1 has wound' - 1 is matched in group 1 as expected
'IV has wound' - IV is matched in group 1 as expected
but with the sentence
'one has wound' - the word one doesn't get matched in group 1
When I modify the regex as follows
(\bone\b|\btwo\b|\bthree\b|\bfour\b|\d|(IV|I{0,3}))[\w\s]+
the group matches as expected.
So my question is: why does changing the order of the alternatives work?
I tried looking up ordering and precedence for regex but couldn't find anything relevant.
Thx

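I think you made a mistake in your regex; it should be
(\d|(IV|I{1,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+
Notice it's I{1,3}, not I{0,3}.
Alternatives are tried from left to right, and I{0,3} can match zero I characters, so for 'one has wound' that alternative succeeds with an empty match before \bone\b is ever tried. That is why capture group 1 is empty, and why moving the word alternatives to the front works.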
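The effect is easy to reproduce. A small Python sketch comparing the I{0,3} and I{1,3} versions on the three sentences from the question (the variable names are just for illustration):

```python
import re

# Version from the question: I{0,3} may match the empty string.
original = re.compile(r'(\d|(IV|I{0,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+')
# Suggested version: I{1,3} needs at least one I.
fixed = re.compile(r'(\d|(IV|I{1,3})|\bone\b|\btwo\b|\bthree\b|\bfour\b)[\w\s]+')

sentences = ['1 has wound', 'IV has wound', 'one has wound']
original_groups = [original.match(s).group(1) for s in sentences]
fixed_groups = [fixed.match(s).group(1) for s in sentences]

print(original_groups)  # group 1 is empty for 'one has wound'
print(fixed_groups)     # group 1 is 'one' for 'one has wound'
```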

Conditional Regexp: return only one group

Two types of URLs I want to match:
(1) www.test.de/type1/12345/this-is-a-title.html
(2) www.test.de/category/another-title-oh-yes.html
In the first type, I want to match "12345".
In the second type I want to match "category/another-title-oh-yes".
Here is what I came up with:
(?:(?:\.de\/type1\/([\d]*)\/)|\.de\/([\S]+)\.html)
This returns the following:
For type (1):
Match group 1: 12345
Match group 2:
For type (2):
Match group 1:
Match group 2: category/another-title-oh-yes
As you can see, it is working pretty well already.
For various reasons I need the regex to return only one match-group, though. Is there a way to achieve that?

Java/PHP/Python
Get the matched value at index 1 in both cases, using a Negative Lookahead and Positive Lookbehinds.
((?<=\.de\/type1\/)\d+|(?<=\.de\/)(?!type1)[^\.]+)
There are two regex patterns that are ORed.
The first pattern looks for 12345.
The second pattern looks for category/another-title-oh-yes.
Note:
Each pattern must match exactly once in each URL.
Wrap the whole pattern in parentheses (...|...) and remove the parentheses from [^\.]+ and \d+, where:
[^\.]+ matches anything until a dot is found
\d+ matches one or more digits
Here is online demo on regex101
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. [18-23] `12345`
MATCH 2
1. [57-86] `category/another-title-oh-yes`
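In Python, the lookbehind version could be used roughly like this. Both lookbehinds are fixed-width, which Python's re module requires. The helper name extract_key is made up for this sketch, and the slashes are left unescaped because Python does not need them escaped.

```python
import re

# Group 1 holds the ID for type1 URLs, or the path (without .html) otherwise.
url_re = re.compile(r'((?<=\.de/type1/)\d+|(?<=\.de/)(?!type1)[^.]+)')

def extract_key(url):
    """Return the single captured value, or None if nothing matches."""
    m = url_re.search(url)
    return m.group(1) if m else None

print(extract_key('www.test.de/type1/12345/this-is-a-title.html'))
print(extract_key('www.test.de/category/another-title-oh-yes.html'))
```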
JavaScript
Try this one; the wanted value is in group 2 for the first URL type and in group 3 for the second.
((?:\.de\/type1\/)(\d+)|(?:\.de\/)(?!type1)([^\.]+))
Here is online demo on regex101.
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. `.de/type1/12345`
2. `12345`
MATCH 2
1. `.de/category/another-title-oh-yes`
3. `category/another-title-oh-yes`

Maybe this:
^www\.test\.de/(type1/(\d+)/.*|(.*))\.html$
Debuggex Demo
Then for example:
var str = "www.test.de/type1/12345/this-is-a-title.html";
// Slashes must be escaped inside a JavaScript regex literal.
var regex = /^www\.test\.de\/(type1\/(\d+)\/.*|(.*))\.html$/;
console.log(str.match(regex));
This will output an array: the first element is the whole string, the second is whatever comes after the website address (without .html), the third is the ID matched for type1, and the fourth is the path matched for any other URL.
You can do something like var matches = str.match(regex); return matches[2] || matches[3];
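The same "take whichever group took part in the match" idea, sketched in Python. The pattern here is an illustrative variant with only two capturing groups, not the exact regex from this answer, and pick_value is a made-up helper name.

```python
import re

# Group 1: numeric ID of a type1 URL; group 2: path of any other URL.
url_re = re.compile(r'^www\.test\.de/(?:type1/(\d+)/.*|(.*))\.html$')

def pick_value(url):
    """Return the ID for type1 URLs, the path for other URLs, or None."""
    m = url_re.match(url)
    if m is None:
        return None
    # Only one of the two groups takes part in a match; the other is None.
    return m.group(1) or m.group(2)

print(pick_value('www.test.de/type1/12345/this-is-a-title.html'))
print(pick_value('www.test.de/category/another-title-oh-yes.html'))
```

This keeps two groups in the regex but hands a single value to the caller, which may be enough when the one-group requirement comes from the calling code and not from the regex engine.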

Regex optional group

I am using this regex:
((?:[a-z][a-z]+))_(\d+)_((?:[a-z][a-z]+)\d+)_(\d{13})
to match strings like this:
SH_6208069141055_BC000388_20110412101855
separating into 4 groups:
SH
6208069141055
BC000388
20110412101855
Question: How do I make the first group optional, so that the resulting group is an empty string?
I want to get 4 groups in every case, when possible.
Input string for this case (without the first group and its underscore):
6208069141055_BC000388_20110412101855

To make a non-capturing group optional (matching zero or one time), append ? to it.
(?: ..... )?
^          ^____ optional
|____ group

You can easily simplify your regex to be this:
(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$
^              ^^
|--------------||
| first group  ||- quantifier for 0 or 1 time (essentially making it optional)
I'm not sure whether the input string without the first group will have the underscore or not, but you can use the above regex if it's the whole string.
regex101 demo
As you can see, group 1 in the second match is empty and the match starts at group 2.
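A short Python sketch of the optional group, assuming the simplified regex above. In Python a group that did not take part in the match is None, not an empty string, so groups(default='') is used to get four strings in every case. The IGNORECASE flag is an assumption added because the sample input is upper case, and split_record is a made-up helper name.

```python
import re

record_re = re.compile(r'(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$', re.IGNORECASE)

def split_record(text):
    """Return a 4-tuple of strings (first may be ''), or None if there is no match."""
    m = record_re.match(text)
    # default='' replaces the None of a non-participating optional group.
    return m.groups(default='') if m else None

with_prefix = split_record('SH_6208069141055_BC000388_20110412101855')
without_prefix = split_record('6208069141055_BC000388_20110412101855')
print(with_prefix)
print(without_prefix)
```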
