How to improve a regex for print range? - regex

I would like to improve a VBA regex for a print range.
Currently I have this:
(\d+(-\d+)*)+(,\d+(-\d+)*)*
But, for an entry 12-25,45,50-53 this is returning the , and - like this:
Match 1: -25
Match 2: ,50-53
Match 3: -53
and is not returning the 45
Ideally I'd like a group returned for each comma delimited entry without any , or - like this:
Match 1: (12-25)
Match 2: (45)
Match 3: (50-53)

The reason 45 is not in a group is that you are repeating the second outer capturing group, (,\d+(-\d+)*). When a capturing group is repeated, it keeps only the value from its last iteration.
So (,\d+(-\d+)*) first captures ,45. The outer * then repeats the whole group, and in that last iteration ,50 is matched by ,\d+ and -53 is captured by (-\d+). The group therefore ends up holding ,50-53, and the earlier ,45 is overwritten.
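The overwriting is easy to confirm. Below is a minimal sketch that uses Python's re module as a stand-in for the VBA RegExp object (an assumption; the variable names are illustrative only). It applies the original pattern to the sample entry and shows what each of its four capturing groups holds afterwards:

```python
import re

# Original pattern from the question; it contains four capturing groups.
pattern = re.compile(r'(\d+(-\d+)*)+(,\d+(-\d+)*)*')

match = pattern.fullmatch('12-25,45,50-53')

# A repeated group keeps only what it captured in its last iteration,
# so ',45' is overwritten by ',50-53' and never shows up in a group.
groups = match.groups()
print(groups)  # ('12-25', '-25', ',50-53', '-53')
```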
What you might do is match 1+ digits and use a single optional group for the hyphen and 1+ digits part to get 3 matches.
Use a positive lookahead (?=,|$) to assert what is directly on the right is a comma or the end of the string.
\d+(?:-\d+)?(?=,|$)
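As a rough illustration, here is the same pattern run through Python's re module instead of VBA (an assumption made only for the demo; split_print_range is a hypothetical helper name). Each comma-delimited entry comes back as its own match, without the separators:

```python
import re

# 1+ digits, an optional "-digits" part, then a comma or the end of the string
# must follow (checked by the lookahead, but not consumed).
RANGE_ITEM = re.compile(r'\d+(?:-\d+)?(?=,|$)')

def split_print_range(entry):
    """Return every page or page range in a print-range entry."""
    return RANGE_ITEM.findall(entry)

items = split_print_range('12-25,45,50-53')
print(items)  # ['12-25', '45', '50-53']
```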
If you want 3 groups, you could use:
(\d+(?:-\d+)?),(\d+(?:-\d+)?),(\d+(?:-\d+)?)

Related

How do I trim leading zeros from a regular expression capture group?

I have a regular expression that splits an 18 digit number into 4 capture groups (saved at regex101.com).
/(\d{5})(\d{2})(\d{2})(\d{9})/mg
Using 000012022000456789 as the test string, my result is:
Group 1: 00001
Group 2: 20
Group 3: 22
Group 4: 000456789
I also need to trim leading zeros from Group 1, so my desired result is:
Group 1: 1
Group 2: 20
Group 3: 22
Group 4: 000456789
Can this all be done using one regular expression? Note that this is a general regular expression question, not specific to an engine or language.
You could add a non-capturing group to absorb up to 4 leading zeros, and adjust your first capturing group to match from 1 to 5 digits:
(?:0{0,4})(\d{1,5})(\d{2})(\d{2})(\d{9})
As long as your input is always an 18-digit number, this will work fine. If, however, the input might not be exactly 18 digits, this could also match something like 01122333333333 or 000001111122333333333.
You can work around this by adding a lookbehind assertion before the second group, which requires the five digits just before it to start at a word boundary, and a lookahead assertion after the last group, which requires a word boundary there (so no further digit may follow):
(?:0{0,4})(\d{1,5})(?<=\b\d{5})(\d{2})(\d{2})(\d{9})(?=\b)
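To see the difference between the two patterns, here is a small sketch in Python (chosen only for illustration, since the question is engine-agnostic; split_number, SIMPLE and STRICT are hypothetical names). It assumes an engine that accepts \b inside a fixed-width lookbehind, which Python's re module does:

```python
import re

# First pattern: trims leading zeros but does not check the total length.
SIMPLE = re.compile(r'(?:0{0,4})(\d{1,5})(\d{2})(\d{2})(\d{9})')
# Second pattern: the lookbehind and lookahead pin the match to 18 digits.
STRICT = re.compile(
    r'(?:0{0,4})(\d{1,5})(?<=\b\d{5})(\d{2})(\d{2})(\d{9})(?=\b)'
)

def split_number(pattern, text):
    """Return the four captured groups, or None when nothing matches."""
    m = pattern.search(text)
    return m.groups() if m else None

print(split_number(SIMPLE, '000012022000456789'))  # ('1', '20', '22', '000456789')
print(split_number(STRICT, '000012022000456789'))  # ('1', '20', '22', '000456789')
print(split_number(SIMPLE, '01122333333333'))      # ('0', '11', '22', '333333333')
print(split_number(STRICT, '01122333333333'))      # None
```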

How to include a substring EXCEPT an exact one in middle of REGEX expression?

Issue
I'm trying to match 3 groups, where one is conditional
String: 12345-12345-1230
Group 1: 12345-12345
Group 2: -123
Group 3: 0
However, I only want to capture Group 2 if that section is NOT "-000". That is, group 2 will be blank if that section is '-000'; otherwise it will be whatever those 4 characters are ('-123', '-001', etc.).
Here is the REGEX with it just accepting anything as group 2:
^(.{5}-.{5})(.{4})([0-9])$
What I've tried
Negative Lookahead:
^(.{5}-.{5})(?!-000)([0-9])$
^(.{5}-.{5})(.{4}(?!.{4}))([0-9])$
OR Operator:
^(.{5}-.{5})(-000)|(.{4})([0-9])$
This is the closest I've come, however I can't get it to work WITH the final condition ([0-9])$. It's also not ideal to have the remove case (-000) as a separate group as the accept case (not -000).
You may try:
^(\d{5}-\d{5})(?:-000|(-\d{3}))(\d)$
See the online demo.
^ - Start of line anchor.
( - Open 1st capture group.
\d{5}-\d{5} - Match 5 digits, a hyphen, and again 5 digits.
) - Close 1st capture group.
(?: - Open non-capturing group.
-000 - Match "-000" literally.
| - Pipe symbol used as an or-operator.
( - Open 2nd capture group.
-\d{3} - Match a hyphen and 3 digits.
) - Close 2nd capture group.
) - Close non-capturing group.
( - Open 3rd capture group.
\d - Match a single digit.
) - Close 3rd capture group.
$ - End of line anchor.
If you want to capture the 2nd group without the hyphen, then try: ^(\d{5}-\d{5})-(?:000|(\d{3}))(\d)$
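A quick way to check the behaviour is the sketch below, written in Python as an assumed test harness (parse_code is a hypothetical helper, not part of the answer). When the middle section is -000, the second group simply does not participate and comes back as None:

```python
import re

# Group 2 sits inside an alternation, so it stays unset when "-000" matches.
CODE = re.compile(r'^(\d{5}-\d{5})(?:-000|(-\d{3}))(\d)$')

def parse_code(text):
    """Return (group 1, group 2, group 3), or None if the string is rejected."""
    m = CODE.match(text)
    return m.groups() if m else None

print(parse_code('12345-12345-1230'))  # ('12345-12345', '-123', '0')
print(parse_code('12345-12345-0000'))  # ('12345-12345', None, '0')
```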
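Try this (note that it does not match at all when the middle section is -000, and it only accepts 0 as the last digit):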
(\d{5}-\d{5})(?!-000)(-\d{3})(0)

REGEXP_REPLACE for exact regex pattern, not working

I'm trying to match an exact pattern to do some data cleanup for ISSNs using the code below:
select case when REGEXP_REPLACE('1234-5678 ÿþT(zlsd?k+j''fh{l}x[a]j).,~!##$%^&*()_+{}|:<>?`"\;''/-', '([0-9]{4}[\-]?[Xx0-9]{4})(.*)', '$1') not similar to '[0-9]{4}[\-]?[Xx0-9]{4}' then 'NOT' else 'YES' end
The pattern I want should match any group of 8 digits with an optional dash in the middle and a possible X in place of the last digit.
The code above works for most cases, but if capture group 1 is the following example: 123456789 then it also returns positive because it matches the first 8 digits, and I don't want it to.
I tried surrounding capture group 1 with ^...$ but that doesn't work either.
So I would like to match exactly these examples and similar ones:
1234-5678
1234-567X
12345678
1234567X
BUT NOT THESE (and similar):
1234567899
1234567899x
What am I missing?
You may use
^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$
See the regex demo
Details
^ - start of string
([0-9]{4}-?[Xx0-9]{4}) - Capturing group 1 ($1): four digits, an optional -, and then four characters that are each x, X or a digit
([^0-9].*)? - an optional Capturing group 2: any char other than a digit and then any 0+ chars as many as possible
$ - end of string.
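The same replace-then-validate logic as the CASE expression can be sketched outside SQL. The version below uses Python's re module purely for illustration (clean_issn is a hypothetical helper, and Python writes the backreference as \1 rather than $1):

```python
import re

# Anchored pattern from the answer: an ISSN, then optional non-digit-led junk.
ISSN_WITH_TAIL = re.compile(r'^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$')
# Plain ISSN shape, used to validate the cleaned value.
ISSN_ONLY = re.compile(r'[0-9]{4}-?[Xx0-9]{4}')

def clean_issn(raw):
    """Strip trailing junk; return the ISSN, or None if the input is not one."""
    cleaned = ISSN_WITH_TAIL.sub(r'\1', raw)
    return cleaned if ISSN_ONLY.fullmatch(cleaned) else None

print(clean_issn('1234-5678 some junk'))  # 1234-5678
print(clean_issn('1234567X'))             # 1234567X
print(clean_issn('1234567899'))           # None
```

Because the pattern is anchored and the tail must start with a non-digit, an input such as 1234567899x is left untouched by the replacement and then fails the validation step.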

REGEX Capturing differing sets of repeating groups

This is a two-part question, but I feel the answers will be related.
I have this regex pattern:
(\d+)(aa|bb) which I use to capture this string: 1bb2aa3aa4bb5bb6aa7bb8cc9cc
See demo: example 1
The way it captures the random series of aa and bb (both preceded by a digit) is exactly what I want, and is good as far as it goes.
So we get this match on regex101:
Match 1
Full match 0-3 `1bb`
Group 1. 0-1 `1`
Group 2. 1-3 `bb`
Match 2
Full match 3-6 `2aa`
Group 1. 3-4 `2`
Group 2. 4-6 `aa`
Match 3
Full match 6-9 `3aa`
Group 1. 6-7 `3`
Group 2. 7-9 `aa`
Match 4
Full match 9-12 `4bb`
Group 1. 9-10 `4`
Group 2. 10-12 `bb`
Match 5
Full match 12-15 `5bb`
Group 1. 12-13 `5`
Group 2. 13-15 `bb`
Match 6
Full match 15-18 `6aa`
Group 1. 15-16 `6`
Group 2. 16-18 `aa`
Match 7
Full match 18-21 `7bb`
Group 1. 18-19 `7`
Group 2. 19-21 `bb`
As expected, the 8cc9cc bit at the end is ignored. I would like to capture this as well, in the same way I captured the first repeating groups and in the same expression. So the final output would have something like this added to the end. This should work for any number of matches on either side; this text is just one example.
Match 8
Full match 21-24 `8cc`
Group 1. 21-22 `8`
Group 2. 22-24 `cc`
Match 9
Full match 24-27 `9cc`
Group 1. 24-25 `9`
Group 2. 25-27 `cc`
Also, I'd like to do similar but flipping the 'or' group to the end i.e. this:
1cc2cc3cc4cc5cc6cc7ccb8aa9bb
My current regex pattern (\d+)(cc) only matches the repeating 'cc' groups.
See demo: example 2
I would like a similar full capture, with any number of permissible entries of each group.
Any thoughts?
You may use
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+))(\d+)(aa|bb|cc)
See the regex demo
The regex will only match a string that meets the pattern in the (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) lookahead, and will then consecutively match and capture digits followed by aa, bb or cc. Digits + aa or bb are only matched when digits + cc does not immediately precede them, so no aa or bb entry is accepted once the cc entries have started.
Details
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+)) - either of the two alternatives:
\G(?!^) - end of the previous successful match
(?(?=\d+(?:aa|bb))(?<!\dcc)) - if-then-else construct: if there are 1+ digits followed by aa or bb immediately to the right of the current location ((?=\d+(?:aa|bb))), then only continue matching if there is no digit followed by cc immediately to the left of the current location ((?<!\dcc))
| - or
^ - start of string
(?=(?:\d+(?:aa|bb))+(?:\d+cc)+) - a positive lookahead that, immediately to the right of the current location, searches for the following (and returns true if it finds the patterns, or false if it does not):
(?:\d+(?:aa|bb))+ - one or more occurrences of 1+ digits followed with aa or bb
(?:\d+cc)+ - one or more occurrences of 1+ digits followed with cc
(\d+) - Group 1: one or more digits
(aa|bb|cc) - aa, bb or cc.
For the second pattern, swap cc and (?:aa|bb) in the lookarounds:
(?:\G(?!^)(?(?=\d+cc)(?<!\d(?:aa|bb)))|(?=(?:\d+cc)+(?:\d+(?:aa|bb))+))(\d+)(aa|bb|cc)
I'm no expert with perl, so I'll give a bit of pseudo code here. Feel free to suggest an edit.
You can start by checking that the whole string consists of one or more xaa or xbb combos followed by one or more xcc combos, using this pattern: ^(?:\d+(?:aa|bb))+(?:\d+cc)+$
Once you have that you can use this pattern to capture the appropriate groups: (\d+)(aa|bb|cc)
Something like:
if (isMatch("^(?:\d+(?:aa|bb))+(?:\d+cc)+$", inputString))
{
    // the whole string has the right shape, so collect every digit/suffix pair
    matches = matchAll("(\d+)(aa|bb|cc)", inputString);
}
From here you can extract the information using the groups.
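The two-step idea (validate the overall shape, then collect the pairs) could look like this in Python. This is only a sketch: capture_pairs, SHAPE and PAIR are hypothetical names, and it assumes the cc part may be preceded by more than one digit (\d+cc), consistent with the capture pattern:

```python
import re

# Step 1: the whole string must be aa/bb entries followed by cc entries.
SHAPE = re.compile(r'(?:\d+(?:aa|bb))+(?:\d+cc)+')
# Step 2: pull out each (digits, suffix) pair.
PAIR = re.compile(r'(\d+)(aa|bb|cc)')

def capture_pairs(text):
    """Return all (digits, suffix) pairs, or [] if the shape is wrong."""
    if SHAPE.fullmatch(text) is None:
        return []
    return PAIR.findall(text)

pairs = capture_pairs('1bb2aa3aa4bb5bb6aa7bb8cc9cc')
print(pairs[0], pairs[-1])  # ('1', 'bb') ('9', 'cc')
```

For the flipped case, the same helper works with the two halves of SHAPE swapped.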

How to use a regular expression to match as long a string as possible with as few groups as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is: is it possible to match as long a string as possible using as few groups as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (as long as it is not followed by exactly one more a, i.e. a single a and then a different character or the end of the string) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
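To check the three strings from the question, here is a short sketch using Python's re module (an assumption for the demo; CHUNK is an illustrative name). With the non-capturing form, findall returns the whole matches directly:

```python
import re

# Prefer three a's, unless that would strand a single trailing 'a';
# in that case fall back to two.
CHUNK = re.compile(r'(?:a{3}(?!a(?:[^a]|$))|a{2})')

results = {text: CHUNK.findall(text) for text in ('aaaa', 'aaaaa', 'aaaaaa')}
for text, chunks in results.items():
    print(text, chunks)
# aaaa ['aa', 'aa']
# aaaaa ['aaa', 'aa']
# aaaaaa ['aaa', 'aaa']
```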
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases, such as aaaaa and aaaaaa.
However, it failed in cases like aaaa, where it matches aaa and leaves a single a unmatched.
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))