How do I trim leading zeros from a regular expression capture group? - regex

I have a regular expression that splits an 18 digit number into 4 capture groups (saved at regex101.com).
/(\d{5})(\d{2})(\d{2})(\d{9})/mg
Using 000012022000456789 as the test string, my result is:
Group 1: 00001
Group 2: 20
Group 3: 22
Group 4: 000456789
I also need to trim leading zeros from Group 1, so my desired result is:
Group 1: 1
Group 2: 20
Group 3: 22
Group 4: 000456789
Can this all be done using one regular expression? Note that this is a general regular expression question, not specific to an engine or language.

You could add a non-capturing group to absorb up to 4 leading zeros, and adjust your first capturing group to match from 1 to 5 digits:
(?:0{0,4})(\d{1,5})(\d{2})(\d{2})(\d{9})
Demo on regex101
As long as your input is always an 18-digit number, this will work fine. If however the input could be other than 18 digits, this might match something like 01122333333333 or 000001111122333333333.
You can work around this by adding a lookbehind assertion before the second group that requires it to be preceded by exactly 5 digits starting at a word boundary, and a lookahead assertion that the number ends at a word boundary (so no further digit can follow):
(?:0{0,4})(\d{1,5})(?<=\b\d{5})(\d{2})(\d{2})(\d{9})(?=\b)
Demo on regex101
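As a rough illustration, here is how the two patterns above behave in Python's re module. The question is engine-agnostic, so the choice of Python and the names LOOSE, STRICT and split_number are assumptions made only for this sketch:

```python
import re

# First pattern: absorbs up to 4 leading zeros, but does not check length.
LOOSE = re.compile(r"(?:0{0,4})(\d{1,5})(\d{2})(\d{2})(\d{9})")

# Second pattern: adds the lookbehind and the trailing word-boundary check.
STRICT = re.compile(
    r"(?:0{0,4})(\d{1,5})(?<=\b\d{5})(\d{2})(\d{2})(\d{9})(?=\b)"
)


def split_number(text, pattern=STRICT):
    """Return the four captured groups as a tuple, or None if nothing matches."""
    m = pattern.search(text)
    return m.groups() if m else None


print(split_number("000012022000456789"))     # the 18-digit test string
print(split_number("01122333333333", LOOSE))  # 14 digits: loose pattern still matches
print(split_number("01122333333333"))         # 14 digits: strict pattern rejects it
```

Note that if the first five digits are all zeros, Group 1 comes back as a single 0, because \d{1,5} must match at least one digit.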

Related

regex in python/ansible

I am a newbie to regex. I have an example string: account-device-v2-2-3-63-21900
and I am using this regular expression: [1-9]-[0-9]-[0-9]*
I am getting 2-2-3 as output,
but my intention is to match/extract the pattern 2-3-63.
That is, I want the hyphen-separated digits after v2 (or v1 etc.); I don't need the last digit part (21900 or any other number).
Any suggestions, please?
You want to get 1 or more digits (excluding 0), a dash, 1 or more digits, a dash, and 1 or more digits from account-device-v2-2-3-63-21900 or account-device-v1-2-3-63-21900?
Use v[12]-([1-9]+?-[0-9]+?-[0-9]+?)- and get first group.
Demo: https://regex101.com/r/hMLGsK/1
The pattern [1-9]-[0-9]-[0-9]* matches 2-2-3 because it does not account for the v-and-digit prefix, so the 2 in v2 is the first position where it can match.
Note that [0-9]* matches zero or more digits, so 2-2- would also be a valid match.
Using a capture group to get the value:
\bv[1-9][0-9]*-([1-9][0-9]*-[0-9]+-[0-9]+)
\bv[1-9][0-9]*- Match v1, or possibly v20 etc., followed by -
( Capture group 1
[1-9][0-9]* Match a number that starts with a digit 1-9
-[0-9]+-[0-9]+ Two parts, each a - followed by 1 or more digits 0-9
) Close group 1
Regex demo
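Since the question mentions Python, here is a minimal sketch comparing the original pattern with the capture-group version. The variable names are illustrative only:

```python
import re

text = "account-device-v2-2-3-63-21900"

# The original pattern matches at the first digit-dash-digit it can find,
# which begins with the 2 of "v2".
original = re.search(r"[1-9]-[0-9]-[0-9]*", text)

# Anchoring on the version prefix and capturing what follows it.
VERSIONED = re.compile(r"\bv[1-9][0-9]*-([1-9][0-9]*-[0-9]+-[0-9]+)")
fixed = VERSIONED.search(text)

print(original.group(0))  # whole match of the original pattern
print(fixed.group(1))     # capture group 1 of the anchored pattern
```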

How to improve a regex for print range?

I would like to improve a VBA regex for a print range.
Currently I have this:
(\d+(-\d+)*)+(,\d+(-\d+)*)*
But, for an entry 12-25,45,50-53 this is returning the , and - like this:
Match 1: -25
Match 2: ,50-53
Match 3: -53
and is not returning the 45
Ideally I'd like a group returned for each comma delimited entry without any , or - like this:
Match 1: (12-25)
Match 2: (45)
Match 3: (50-53)
The reason 45 is not in a group is that you are repeating the capturing group (,\d+(-\d+)*). When a capturing group is repeated, it keeps only the value of its last iteration.
So (,\d+(-\d+)*) first captures ,45. The whole group is then repeated due to the outer *, and within that last iteration ,50 is matched by ,\d+ and -53 is captured by -\d+, which overwrites the earlier capture.
What you might do is match 1+ digits and use a single optional group for the hyphen and 1+ digits part to get 3 matches.
Use a positive lookahead (?=,|$) to assert what is directly on the right is a comma or the end of the string.
\d+(?:-\d+)?(?=,|$)
Regex demo
If you want 3 groups, you could use:
(\d+(?:-\d+)?),(\d+(?:-\d+)?),(\d+(?:-\d+)?)
Regex demo
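The question is about VBA, but the behaviour of repeated capture groups is the same in most engines. The sketch below uses Python's re module (an assumption made purely for illustration) to show what the original pattern keeps in its groups and what the lookahead version returns:

```python
import re

entry = "12-25,45,50-53"

# Original pattern: repeated groups keep only their last iteration.
ORIGINAL = re.compile(r"(\d+(-\d+)*)+(,\d+(-\d+)*)*")

# Suggested pattern: one match per comma-delimited entry.
RANGE = re.compile(r"\d+(?:-\d+)?(?=,|$)")

original_groups = ORIGINAL.match(entry).groups()
entries = RANGE.findall(entry)

print(original_groups)  # ,45 is gone: group 3 was overwritten by ,50-53
print(entries)          # one item per entry, without the commas
```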

REGEXP_REPLACE for exact regex pattern, not working

I'm trying to match an exact pattern to do some data cleanup for ISSNs using the code below:
select case when REGEXP_REPLACE('1234-5678 ÿþT(zlsd?k+j''fh{l}x[a]j).,~!##$%^&*()_+{}|:<>?`"\;''/-', '([0-9]{4}[\-]?[Xx0-9]{4})(.*)', '$1') not similar to '[0-9]{4}[\-]?[Xx0-9]{4}' then 'NOT' else 'YES' end
The pattern I want should match any 8-character group with an optional dash in the middle and a possible X at the end.
The code above works for most cases, but if capture group 1 is the following example: 123456789 then it also returns positive because it matches the first 8 digits, and I don't want it to.
I tried surrounding capture group 1 with ^...$ but that doesn't work either.
So I would like to match exactly these examples and similar ones:
1234-5678
1234-567X
12345678
1234567X
BUT NOT THESE (and similar):
1234567899
1234567899x
What am I missing?
You may use
^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$
See the regex demo
Details
^ - start of string
([0-9]{4}-?[Xx0-9]{4}) - Capturing group 1 ($1): four digits, an optional -, and then four x / X or digits
([^0-9].*)? - an optional Capturing group 2: any char other than a digit and then any 0+ chars as many as possible
$ - end of string.
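The question targets a SQL REGEXP_REPLACE, but the pattern itself can be checked in any engine. Here is a small sketch in Python (the helper name extract_issn is hypothetical) that runs it against the examples from the question:

```python
import re

# Same pattern as above: group 1 is the ISSN, group 2 is optional trailing
# text that must start with a non-digit.
ISSN = re.compile(r"^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$")


def extract_issn(value):
    """Return the ISSN part of value, or None if value is not acceptable."""
    m = ISSN.match(value)
    return m.group(1) if m else None


for sample in ["1234-5678", "1234-567X", "12345678", "1234567X",
               "1234567899", "1234567899x", "1234-5678 trailing junk"]:
    print(sample, "->", extract_issn(sample))
```

Because group 2 must begin with a non-digit, a ninth digit after the ISSN makes the whole match fail instead of being silently dropped.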

REGEX Capturing differing sets of repeating groups

This is a two-part question, but I feel the answers will be related.
I have this regex pattern:
(\d+)(aa|bb) which I use to capture this string: 1bb2aa3aa4bb5bb6aa7bb8cc9cc
See demo: example 1
The way it captures the random series of aa and bb (both preceded by a digit) is exactly what I want, and is good as far as it goes.
So we get this match on regex101:
Match 1
Full match 0-3 `1bb`
Group 1. 0-1 `1`
Group 2. 1-3 `bb`
Match 2
Full match 3-6 `2aa`
Group 1. 3-4 `2`
Group 2. 4-6 `aa`
Match 3
Full match 6-9 `3aa`
Group 1. 6-7 `3`
Group 2. 7-9 `aa`
Match 4
Full match 9-12 `4bb`
Group 1. 9-10 `4`
Group 2. 10-12 `bb`
Match 5
Full match 12-15 `5bb`
Group 1. 12-13 `5`
Group 2. 13-15 `bb`
Match 6
Full match 15-18 `6aa`
Group 1. 15-16 `6`
Group 2. 16-18 `aa`
Match 7
Full match 18-21 `7bb`
Group 1. 18-19 `7`
Group 2. 19-21 `bb`
As expected, the 8cc9cc bit at the end is ignored. I would like to capture this as well, in the same way I have captured the first repeating groups, in the same expression. So in the final output, I'd get something like this added to the end of the output. This should work for any number of matches on either side. This text is just one example.
Match 8
Full match 21-24 `8cc`
Group 1. 21-22 `8`
Group 2. 22-24 `cc`
Match 9
Full match 24-27 `9cc`
Group 1. 24-25 `9`
Group 2. 25-27 `cc`
Also, I'd like to do similar but flipping the 'or' group to the end i.e. this:
1cc2cc3cc4cc5cc6cc7ccb8aa9bb
My current regex pattern (\d+)(cc) only matches the repeating 'cc' groups.
See demo: example 2
I would like a similar full capture, with any amount of permissible entries of each group.
Any thoughts?
You may use
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+))(\d+)(aa|bb|cc)
See the regex demo
The regex will only match a string that meets the pattern in the (?=(?:\d+(?:aa|bb))+(?:\d+cc)+) lookahead, and will then consecutively match and capture digits followed by aa, bb or cc; digits + aa or bb are matched only if no digits + cc have been matched immediately before them.
Details
(?:\G(?!^)(?(?=\d+(?:aa|bb))(?<!\dcc))|(?=(?:\d+(?:aa|bb))+(?:\d+cc)+)) - either of the two alternatives:
\G(?!^) - end of the previous successful match
(?(?=\d+(?:aa|bb))(?<!\dcc)) - if-then-else construct: if there are 1+ digits and aa or bb immediately to the right of the current location ((?=\d+(?:aa|bb))), then only continue matching if there is no digit followed by cc immediately to the left of the current location ((?<!\dcc))
| - or
^ - start of string
(?=(?:\d+(?:aa|bb))+(?:\d+cc)+) - a positive lookahead that, immediately to the right of the current location, searches for the following (and returns true if it finds the patterns, or false if it does not):
(?:\d+(?:aa|bb))+ - one or more occurrences of 1+ digits followed with aa or bb
(?:\d+cc)+ - one or more occurrences of 1+ digits followed with cc
(\d+) - Group 1: one or more digits
(aa|bb|cc) - aa, bb or cc.
For the second pattern, replace cc with (?:aa|bb):
(?:\G(?!^)(?(?=\d+cc)(?<!\d(?:aa|bb)))|(?=(?:\d+cc)+(?:\d+(?:aa|bb))+))(\d+)(aa|bb|cc)
I'm no expert with perl, so I'll give a bit of pseudo code here. Feel free to suggest an edit.
You can start by matching one or more xaa or xbb combos, followed by one or more xcc combos, using this pattern: ^(?:\d+(?:aa|bb))+(?:\d+cc)+$
Once you have that, you can use this pattern to capture the appropriate groups: (\d+)(aa|bb|cc)
Demo 1
Demo 2
Something like:
if (ismatch("^(?:\d+(?:aa|bb))+(?:\d+cc)+$", inputString))
{
    // collect every (digits, suffix) pair, not just the first one
    matches = matchall("(\d+)(aa|bb|cc)", inputString);
}
From here you can extract the information using the groups.
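The two-step idea (validate the whole string first, then collect the pairs) might look like this in Python. This is only a sketch: the function name capture_pairs is made up, and the validation pattern uses \d+cc so that multi-digit numbers are accepted, which is an assumption consistent with the (\d+) capture:

```python
import re

# Step 1: the whole string must be (digits + aa|bb)+ followed by (digits + cc)+.
VALID = re.compile(r"(?:\d+(?:aa|bb))+(?:\d+cc)+")

# Step 2: pull out every (digits, suffix) pair.
PAIR = re.compile(r"(\d+)(aa|bb|cc)")


def capture_pairs(text):
    """Return a list of (digits, suffix) tuples, or [] if text is not valid."""
    if VALID.fullmatch(text) is None:
        return []
    return PAIR.findall(text)


print(capture_pairs("1bb2aa3aa4bb5bb6aa7bb8cc9cc"))
print(capture_pairs("1cc2aa"))  # wrong order, so nothing is captured
```

For the second case in the question (cc first, then aa or bb), swapping the two halves of VALID should be enough; PAIR stays the same.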

Regex is possible to match?

I have files with these filename:
ZATR0008_2018.pdf
ZATR0018_2018.pdf
ZATR0218_2018.pdf
The 4 digits after ZATR are the issue number of the magazine.
With this regex:
([1-9][0-9]*)(?=_\d)
I can extract 8, 18 or 218 but I would like to keep minimum 2 digits and max 3 digits so the result should be 08, 18 and 218.
How is it possible to do that?
You may use
0*(\d{2,3})_\d
and grab Group 1 value. See the regex demo.
Details
0* - zero or more 0 chars
(\d{2,3}) - Group 1: two or three digits
_\d - a _ followed with a digit.
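A minimal Python sketch of the Group 1 approach, run against the three filenames from the question (the helper name issue_number is hypothetical):

```python
import re

# 0* skips leading zeros, but backtracks so that group 1 keeps 2 digits.
ISSUE = re.compile(r"0*(\d{2,3})_\d")


def issue_number(filename):
    """Return the 2- or 3-digit issue number, or None if there is no match."""
    m = ISSUE.search(filename)
    return m.group(1) if m else None


for name in ["ZATR0008_2018.pdf", "ZATR0018_2018.pdf", "ZATR0218_2018.pdf"]:
    print(name, "->", issue_number(name))
```

For ZATR0008_2018.pdf the greedy 0* first takes all three zeros, then gives one back so that \d{2,3} can match 08.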
Here is a PCRE variation that grabs the value you need into a whole match:
0*\K\d{2,3}(?=_\d)
See another regex demo
Here, \K makes the regex engine omit the text matched so far (zeros) and then matches 2 to 3 digits that are followed with _ and a digit.
(?:[1-9][0-9]?)?[0-9]{2}(?=_[0-9])
or perhaps:
(?:[1-9][0-9]+|[0-9]{2})(?=_[0-9])
(https://www.freeformatter.com/regex-tester.html, which you mention in another answer and which claims to use the XRegExp library, doesn't seem to backtrack into the (?:)? in my first suggestion where necessary. That makes it very different from any regex engine I've encountered before: it prefers to match just the 18 of 218, even though that match starts later in the string. It does work with my second suggestion.)
([1-9]\d{1,2})(?=_\d)
{x,y} will match from x to y times the previous pattern, in this case \d
Edit: from your own regex it looked as if you wanted the part of the number that starts with a non-zero digit. However, since your examples include leading 0s, maybe you really wanted:
(\d{2,3})(?=_\d)
Which will give you the last 3 digits before underscore unless there are only 2 digits.
I propose:
^ZATR0*(\d{2,3})_\d+\.pdf$
Demo code here. Result:
Match 1 Full match 0-17 ZATR0008_2018.pdf Group 1. 6-8 08
Match 2 Full match 18-35 ZATR0018_2018.pdf Group 1. 24-26 18
Match 3 Full match 36-53 ZATR0218_2018.pdf Group 1. 41-44 218
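The offsets above can be reproduced with a short Python sketch, assuming the three filenames are tested as one multi-line string so that ^ and $ anchor each line (that layout is an assumption based on the reported positions):

```python
import re

listing = "ZATR0008_2018.pdf\nZATR0018_2018.pdf\nZATR0218_2018.pdf"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
ANCHORED = re.compile(r"^ZATR0*(\d{2,3})_\d+\.pdf$", re.MULTILINE)

# For each match: (full-match span, group 1 text, group 1 span).
results = [(m.span(), m.group(1), m.span(1)) for m in ANCHORED.finditer(listing)]

for full_span, issue, issue_span in results:
    print(full_span, issue, issue_span)
```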