python regex repetition with capture question - regex

Using Python 3's regex capabilities, is it possible to capture a variable number of groups, based on the number of repetitions found? For instance, in the following search strings, I want to capture all the digit strings with the same regex.
Search string 1 (trying to capture: 89, 45):
zzz89zzz45.mp3
Search string 2 (trying to capture: 98, 67, 89, 45):
zzz98zzz67zzz89zzz45.mp3
Search string 3 (trying to capture: 98, 67, 89, 45, 55, 111):
zzz98zzz67zzz89zzz45vdvd55lplp111.mp3
The following regex will match all the repetitions, though not all the values are available for later use (only one digit string is captured):
((\d+)\D*)*\.mp3$
The other two options are writing a different regex for every case, or using findall(). Is there a way to adjust the above regex so that it captures every digit string for later use, for any number of repetitions, using just regex facilities? Or, in Python 3, are you forced to use findall()?

Most or all regular expression engines in common use, including in particular those based on the PCRE syntax (like Python's), label their capturing groups according to the numerical index of the opening parenthesis, as the regex is written. So no, you cannot use capturing groups alone to extract an arbitrary, variable number of subsequences from a string.
The closest you can get (as far as I know) is to manually write out a certain number of capturing groups, something like this:
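import re
s = ...
# each digit group is optional, so strings with fewer than 25 numbers still match
res = re.match(r'\D*' + 25 * r'(?:(\d+)\D+)?', s)
numbers = [r for r in res.groups() if r is not None]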
This will get you up to 25 groups of digits. If you need more, replace 25 with some higher number.
I wouldn't be surprised if this were less efficient than the iterative approach with findall(), although I haven't tested it.
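To illustrate the point about group numbering, here is a minimal sketch (the sample string and variable names are chosen only for illustration). It shows that a repeated capturing group keeps only its final repetition, while findall() returns each one.

```python
import re

sample = "zzz89zzz45"  # illustrative input, not taken from the question

# One pair of parentheses is one group, however often it repeats;
# after the match only the last repetition is retained.
repeated = re.match(r'(?:\D*(\d+))+', sample)
last_only = repeated.group(1)

# findall() walks the string and returns every non-overlapping match.
every_number = re.findall(r'\d+', sample)

print(last_only)
print(every_number)
```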

This will match all the numbers before the dot:
s = "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"
res = re.findall("[0-9]+(?=.*\\.)", s)
print(res)
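As a quick check, the same pattern can be run against all three search strings from the question. This is only a sketch; the names samples, pattern, and results are illustrative.

```python
import re

samples = [
    "zzz89zzz45.mp3",
    "zzz98zzz67zzz89zzz45.mp3",
    "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3",
]

# The lookahead requires a literal dot somewhere later in the string,
# which is why the trailing 3 of "mp3" is not returned.
pattern = re.compile(r'[0-9]+(?=.*\.)')
results = [pattern.findall(s) for s in samples]

for s, found in zip(samples, results):
    print(s, found)
```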

Related

Is it possible to negate a group in a regular expression?

Let's say that we have this text:
2020-09-29
2020-09-30
2020-10-01
2020-10-02
2020-10-12
2020-10-16
2020-11-12
2020-11-23
2020-11-15
2020-12-01
2020-12-11
2020-12-30
I want to do something like this:
\d\d\d\d-(NOT10)-(30)
So I want to get all dates of any year, except those in the 10th month, and the day must be 30.
I tried a lot to do this using negative lookahead assertions, but I did not come up with a working regex.
You can use negative lookaheads:
\d\d\d\d-(?!10)\d\d-30
The part (?!10) ensures that no 10 follows at the point where it is inserted into the regex. Notice that you still need to match the following digits afterwards, hence the \d\d part.
Generally speaking you cannot (to my knowledge) negate a part that then also matches parts of the string. But with negative lookaheads you can simulate this, as I did above. The generalized idea looks something like:
(?!<special-exclusion-pattern>)<general-inclusion-pattern>
Here the special-exclusion-pattern matches a subset of what the general-inclusion-pattern matches. In the above case the general inclusion pattern is \d\d and the special exclusion pattern is 10.
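A minimal Python sketch of the lookahead pattern above. The sample text is a shortened version of the list in the question, with a hypothetical 2020-10-30 entry added to show the exclusion at work.

```python
import re

# Shortened sample; 2020-10-30 is an added, hypothetical entry.
text = "2020-09-29\n2020-09-30\n2020-10-30\n2020-11-12\n2020-12-30"

# (?!10) rejects the position if the month is 10; \d\d then consumes
# the month, and the day must be the literal 30.
pattern = re.compile(r'\d{4}-(?!10)\d\d-30')
matches = pattern.findall(text)

print(matches)
```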
Try:
/20\d{2}-(?:0[1-9]|1[12])-30/
Explanation:
20\d{2} matches 20XX
(?:0[1-9]|1[12]) matches 0X, 11, or 12
30 matches 30
Demo: https://regex101.com/r/O2F1eV/1
It's easiest to simply convert the substring (if present) that matches /^\d{4}-10-30$/ to an empty string, then split the resulting string on one or more newlines.
If your string were
2020-10-16
2020-10-30
2020-11-12
2020-11-23
and was held by the variable str, then in Ruby, for example,
str.sub(/^\d{4}-10-30$/,'')
#=> "2020-10-16\n\n2020-11-12\n2020-11-23\n"
so
str.sub(/^\d{4}-10-30$/,'').split
#=> ["2020-10-16", "2020-11-12", "2020-11-23"]
Whatever language you are using undoubtedly has similar methods.

Regex - Match n occurences of substring within any m-lettered window

I am having trouble forming a regex that matches a given pattern at least n times within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where a 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where a 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts, whereas I want only the first one to be kept, as it lies within a substring of fewer than 20 characters.
I tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string that consist of '1' or '0' and are up to 20 characters long, but when I do that it stops working.
Does anyone have an idea how to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
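import re
# `string` holds the input; the lookahead lets candidate matches overlap
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
# keep only the candidates that fit inside a 20-character window
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]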
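Iterating over the string without any regex could look like the following sketch. The function name windows_with_ones and its parameters n and m are hypothetical, and it is only one possible reading of the requirement: it records the position of every 1 and reports each stretch in which n consecutive 1s span at most m characters.

```python
def windows_with_ones(bits, n=7, m=20):
    """Return every substring that starts and ends on a 1, contains
    exactly n 1s, and is at most m characters long."""
    ones = [i for i, ch in enumerate(bits) if ch == "1"]
    hits = []
    for k in range(len(ones) - n + 1):
        start, end = ones[k], ones[k + n - 1]
        if end - start + 1 <= m:
            hits.append(bits[start:end + 1])
    return hits


# Short illustrative input taken from the start of the question's string.
found = windows_with_ones("0000000110000000111011110000")
print(found)
```

Overlapping windows are all reported; keeping only the first hit per region would need an extra filtering step.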
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
# inline flags must open the pattern (Python 3.11+ rejects them anywhere else)
r = r'''(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Match if something is not preceded by something else

I'm trying to parse a string and extract some numbers from it. Basically, any 2-3 digits should be matched, except the ones that have "TEST" before them. Here are some examples:
TEST2XX_R_00.01.211_TEST => 00, 01, 211
TEST850_F_11.22.333_TEST => 11, 22, 333
TESTXXX_X_12.34.456 => 12, 34, 456
Here are some of the things I've tried:
(?<!TEST)[0-9]{2,3} - ignores only the first digit after TEST
_[0-9]{2,3}|\.[0-9]{2,3} - matches the numbers correctly, but matches the character before them (_ or .) as well.
I know this might be a duplicate of "regex for matching something if it is not preceded by something else", but I could not get my answer there.
Unfortunately, there is no way to use a single pattern to match a string not preceded by some sequence in Lua. You can't even rely on capturing an alternative that you need, since TEST%d+|(%d+) will not work: Lua patterns do not support alternation.
You may remove all substrings that start with TEST + digits after it, and then extract digit chunks:
local s = "TEST2XX_R_00.01.211_TEST"
for x in string.gmatch(s:gsub("TEST%d+",""), "%d+") do
print(x)
end
See the Lua demo
Here, s:gsub("TEST%d+","") removes every TEST followed by one or more digits, and the %d+ pattern used with string.gmatch extracts all digit chunks that remain.
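For comparison, the same two-step idea (remove first, then extract) can be sketched in Python. This is not part of the Lua answer; the function name digits_outside_test is hypothetical.

```python
import re


def digits_outside_test(s):
    # Step 1: drop "TEST" together with the digits that directly follow it.
    cleaned = re.sub(r'TEST\d+', '', s)
    # Step 2: collect the digit runs that remain.
    return re.findall(r'\d+', cleaned)


for line in ("TEST2XX_R_00.01.211_TEST",
             "TEST850_F_11.22.333_TEST",
             "TESTXXX_X_12.34.456"):
    print(line, digits_outside_test(line))
```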

Complex Regular Expression, PEG, or Multiple Passes?

I am trying to extract some data from the following examples:
Name 789, 10-mill 12-27b
Manufacturer XY-2822, 10-mill, 17-25b
Other Manufacturer 16b Part
Another Manufacturer FER M9000, 11-mill, 11-40
18b Part
Maker 11-31, 10-mill
Maker 1x or 2x; max size 1x (34b), 2x (38/24b)
Maker REC6 15/18/26b. Square.
Producer FC-40 11-13-16-19-22-25-27-30-34b
What I'd like my results to be respectively are:
12, 27
17, 25
16
11, 40
18
11-31
34, 38, 24 (optional, it's fine if only the latter two are provided)
15, 18, 26
11, 13, 16, 19, 22, 25, 27, 30, 34
I am happy to do this in multiple passes, or with a parsing expression grammar, though I don't think that will really help.
I'm having trouble using lookaheads and lookbehinds to grab that data and exclude things like "11-mill" and "XY-2822". What I find happening is that I am able to exclude those matches but end up truncating good results for other matches.
What is the best way to go about this?
My current regex is
/(?:(\d+)[b\b\/-])([b\d\b]*)[^a-z]/i
which captures the letter 'b' (which is okay) but does not capture 34b in the final example.
Not sure what are your exact requirements/formats but you can try this:
/(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?![-\/]\D)/
http://rubular.com/r/WJqcCNe2pr
details:
# two possible starts:
(?: # next occurrences
\G # anchor for the position after the previous match
(?!^) # not at the start of the line
[-\/]
| # first occurrence
^
(?:.*[^\d\/-])? # (note the greedy quantifier here,
# to obtain the last result of the line)
)
\K # discards characters matched before from the whole match
\d++ # several digits with a possessive quantifier to forbid backtracking
(?![-\/]\D) # not followed by a hyphen or a slash and a non-digit
You can improve the pattern if you replace (?:.*[^\d\/-])? with [^-\d\/\n]*+(?>[-\d\/]+[^-\d\/\n]+)* (remove the \n if you work line by line.). The goal of this change is to limit the backtracking (that occurs atomic group by atomic group, instead of character by character for the first version).
Perhaps, you can replace the negative lookahead with this kind of positive lookahead: (?=[-\/]\d|b|$)
Another version here.
Perhaps this:
(?<=\d-)\d+|\d+(?=-\d+)|\d+(?=(?:\/\d+)*b)
https://regex101.com/r/nR3eS9/1
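The last pattern uses only fixed-width lookbehind, so it can also be tried with Python's re module. The sketch below runs it over four of the sample lines from the question; the variable names are illustrative.

```python
import re

# Three alternatives: digits after "<digit>-", digits before "-<digits>",
# and digits followed (possibly via "/<digits>" runs) by the letter b.
pattern = re.compile(r'(?<=\d-)\d+|\d+(?=-\d+)|\d+(?=(?:\/\d+)*b)')

lines = [
    "Name 789, 10-mill 12-27b",
    "Manufacturer XY-2822, 10-mill, 17-25b",
    "Other Manufacturer 16b Part",
    "Maker REC6 15/18/26b. Square.",
]
extracted = [pattern.findall(line) for line in lines]

for line, numbers in zip(lines, extracted):
    print(line, numbers)
```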

Number groups with 0 as delimiter

There's a long natural number that can be grouped to smaller numbers by the 0 (zero) delimiter.
Example: 4201100370880
This would divide to Group1: 42, Group2: 110, Group3: 370880
There are 3 groups; groups never start with 0 and are at least 1 char long. Also, the last group is taken "as is", meaning it is not terminated by a trailing 0.
This is what I came up with, but it only works for certain inputs (like 420110037880):
(\d+)0([1-9][0-9]{1,2})0([1-9]\d+)
This shows I'm attempting to declare the 2nd group's length to min2 max3, but I'm thinking the correct solution should not care about it. If the delimiter was non-numeric I could probably tackle it, but I'm stumped.
All right, factoring in comment information, try splitting on a regex (this may vary based on what language you're using - .split(/.../) in JavaScript, preg_split in PHP, etc.)
The regex you want to split on is: 0(?!0). This translates to "a zero that is not followed by a zero". I believe this will solve your splitting problem.
If your language allows a limit parameter (PHP does), set it to 3. If not, you will need to do something like this (JavaScript):
result = input.split(/0(?!0)/);
result = result.slice(0,2).concat(result.slice(2).join("0"));
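In Python, re.split accepts a maxsplit argument that plays the role of the limit parameter mentioned above. A minimal sketch, using the example number from the question:

```python
import re

number = "4201100370880"

# Split on a zero that is not followed by another zero. maxsplit=2 stops
# after two delimiters, so the third group keeps its remaining zeros.
groups = re.split(r'0(?!0)', number, maxsplit=2)

print(groups)
```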
The following one should suit your needs:
^(.*?)0(?!0)(.*?)0(?!0)(.*)$
The following regex works:
(\d+?)0(?!0) with the g modifier
Demo: http://regex101.com/r/rS4dE5
For only three matches, you can do:
(\d+?)0(?!0)(\d+?)0(?!0)(.*)
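The three-group pattern can be checked in Python as follows. This is a sketch only; fullmatch is used here on the assumption that the whole input is a single number.

```python
import re

three_groups = re.compile(r'(\d+?)0(?!0)(\d+?)0(?!0)(.*)')

# The lazy \d+? stops at the first zero that is not followed by a zero.
match = three_groups.fullmatch("4201100370880")
parts = match.groups()

print(parts)
```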