Match if something is not preceded by something else - regex

I'm trying to parse a string and extract some numbers from it. Basically, any 2-3 digits should be matched, except the ones that have "TEST" before them. Here are some examples:
TEST2XX_R_00.01.211_TEST => 00, 01, 211
TEST850_F_11.22.333_TEST => 11, 22, 333
TESTXXX_X_12.34.456 => 12, 34, 456
Here are some of the things I've tried:
(?<!TEST)[0-9]{2,3} - ignores only the first digit after TEST
_[0-9]{2,3}|\.[0-9]{2,3} - matches the numbers correctly, but matches the character before them (_ or .) as well.
I know this might be a duplicate of "regex for matching something if it is not preceded by something else", but I could not find my answer there.

Unfortunately, there is no way to use a single Lua pattern to match a string that is not preceded by some sequence, because Lua patterns have no lookbehind. You cannot rely on capturing the alternative you need either: TEST%d+|(%d+) will not work, since Lua patterns do not support alternation.
Instead, you can remove every substring that consists of TEST followed by digits, and then extract the remaining digit chunks:
local s = "TEST2XX_R_00.01.211_TEST"
for x in string.gmatch(s:gsub("TEST%d+",""), "%d+") do
    print(x)
end
Here, s:gsub("TEST%d+","") removes every occurrence of TEST followed by one or more digits, and the %d+ pattern passed to string.gmatch extracts all the digit chunks that remain.
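For comparison, here is a minimal Python sketch of the same two-step idea, together with the capture-group alternation that Lua patterns cannot express. Python's re module is used purely for illustration, and both function names are hypothetical.

```python
import re

def digits_after_removal(s):
    # Step 1: drop "TEST" together with the digits that directly follow it.
    cleaned = re.sub(r"TEST\d+", "", s)
    # Step 2: collect every remaining run of digits.
    return re.findall(r"\d+", cleaned)

def digits_via_alternation(s):
    # "TEST\d+" consumes the unwanted digits without capturing them;
    # group 1 is set only when the second alternative matched.
    return [m.group(1) for m in re.finditer(r"TEST\d+|(\d+)", s) if m.group(1)]

for sample in ("TEST2XX_R_00.01.211_TEST",
               "TEST850_F_11.22.333_TEST",
               "TESTXXX_X_12.34.456"):
    print(sample, digits_after_removal(sample), digits_via_alternation(sample))
```

Both functions return digit runs of any length; if only 2-3 digit runs are wanted, \d+ could be narrowed to \d{2,3}.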

Is it possible to negate a group in a regular expression?

Let's say that we have this text:
2020-09-29
2020-09-30
2020-10-01
2020-10-02
2020-10-12
2020-10-16
2020-11-12
2020-11-23
2020-11-15
2020-12-01
2020-12-11
2020-12-30
I want to do something like this:
\d\d\d\d-(NOT10)-(30)
So I want to get all dates of any year, but not of the 10th month, and it is important that the day is 30.
I tried a lot to do this using negative lookahead assertions, but I did not come up with any working regexes.
You can use negative lookaheads:
\d\d\d\d-(?!10)\d\d-30
The part (?!10) ensures that no 10 follows at the point where it is inserted into the regex. Notice that you still need to match the following digits afterwards, hence the \d\d part.
Generally speaking, you cannot (to my knowledge) negate a part that then also matches parts of the string. But with negative lookaheads you can simulate this, as I did above. The generalized idea looks something like:
(?!<special-exclusion-pattern>)<general-inclusion-pattern>
Here the special-exclusion-pattern matches a subset of what the general-inclusion-pattern matches. In the above case the general inclusion pattern is \d\d and the special exclusion pattern is 10.
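As a rough check of the lookahead approach, here is a small Python sketch. The sample text is a shortened version of the question's list, and the 2020-10-30 line is an added assumption so that the exclusion is actually exercised.

```python
import re

# Shortened sample; "2020-10-30" is not in the question and is added here
# only to show that October is rejected.
dates = """2020-09-29
2020-09-30
2020-10-30
2020-11-23
2020-12-30"""

# (?!10) rejects month 10 without consuming anything;
# \d\d then matches the month itself.
pattern = re.compile(r"^\d{4}-(?!10)\d\d-30$", re.MULTILINE)
matches = pattern.findall(dates)
print(matches)
```

The ^ and $ anchors with re.MULTILINE are a choice made for this sketch so that each line is tested as a whole date.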
Try:
/20\d{2}-(?:0[1-9]|1[12])-30/
Explanation:
20\d{2} matches 20XX (years 2000 to 2099 only; use \d{4} for any year)
(?:0[1-9]|1[12]) matches 01 to 09, 11 or 12
30 matches 30
Demo: https://regex101.com/r/O2F1eV/1
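To see that the month alternation accepts every month except 10, a quick Python check along these lines may help (the variable names are illustrative only):

```python
import re

# The month part of the pattern above, tested on its own.
MONTH_EXCEPT_10 = re.compile(r"0[1-9]|1[12]")

months = [f"{m:02d}" for m in range(1, 13)]
accepted = [m for m in months if MONTH_EXCEPT_10.fullmatch(m)]
print(accepted)
```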
It's easiest to simply convert the substring (if present) that matches /^\d{4}-10-30$/ to an empty string, then split the resulting string on one or more newlines.
If your string were
2020-10-16
2020-10-30
2020-11-12
2020-11-23
and was held by the variable str, then in Ruby, for example,
str.sub(/^\d{4}-10-30$/,'')
#=> "2020-10-16\n\n2020-11-12\n2020-11-23\n"
so
str.sub(/^\d{4}-10-30$/,'').split
#=> ["2020-10-16", "2020-11-12", "2020-11-23"]
Whatever language you are using undoubtedly has similar methods.

Regex Match Roman Numerals from 0-39 Only

I am trying to write a regex that will match Roman numerals from 0 to 39 only. There are plenty of examples which match much larger Roman numerals, but I cannot figure out how to match this specific subset.
Got it. Try this:
/^(X{1,3})(I[XV]|V?I{0,3})$|^(I[XV]|V?I{1,3})$|^V$/
Update:
Zero doesn't exist in Roman numerals. Therefore feel free to tack on your own implementation for zero.
I'm not sure how to represent 0 using Roman numerals. I assume that it has separate token N (see Wikipedia).
Assuming the regex tries to match the whole string (like in Java) and you have lookahead, you can use this regex:
(?=.)(X{0,3}(IX|IV|V?I{0,3})|N)
Explanation:
(?=.): ensure at least one character
X{0,3}: define the tens (0, 10, 20, 30)
(...): define the final digit
IX: 9
IV: 4
V?I{0,3}: 0-3, 5-8 (a final digit of 0 is the empty string, so it is valid only after at least one X, not as a whole number)
N: 0 (as whole number)
If you represent 0 as empty string, the regex is simpler:
X{0,3}(IX|IV|V?I{0,3})
since the lookahead and the N in the previous regex are only there to prevent matching the empty string.
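A small Python sketch of the variant with N for zero, assuming whole-string matching is done with fullmatch (the helper name is hypothetical):

```python
import re

# Lookahead forbids the empty string; N stands for zero.
ROMAN_0_TO_39 = re.compile(r"(?=.)(X{0,3}(IX|IV|V?I{0,3})|N)")

def is_roman_0_to_39(text):
    # fullmatch stands in for engines that always match the whole string.
    return ROMAN_0_TO_39.fullmatch(text) is not None

for candidate in ("N", "", "XIV", "XXXIX", "XL", "IIII"):
    print(repr(candidate), is_roman_0_to_39(candidate))
```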
Assuming you know you have valid Roman numerals and want to fetch only the ones <= 39, that is easy:
^[XVI]*$
If that is not the case, it's a little bit trickier, but you can still take advantage of the fact that all the numbers that can be represented only with X, V and I are 1..39:
^X{0,3}(?:V?I{0,3}|I[VX])$
X{0,3} covers 10, 20, 30
X{0,3}V?I{0,3} covers all but the ones that end with 4 or 9 (14, 29, etc.)
X{0,3}I[VX] covers exactly the ones ending with 4 or 9
Note: these will also match an empty string, which is my interpretation of a Roman zero. If that is not the case, you can replace the * with + for the first regex and add a positive lookahead at the start of the regex for the second ((?=.)).
Note 2: If they are not on separate lines (or in separate strings), you can replace ^ and $ with word boundaries (\b).
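The claim that the second regex (with the (?=.) lookahead added) accepts exactly 1..39 can be checked mechanically. The sketch below assumes standard subtractive notation; to_roman is a hypothetical helper written for this check, not part of any library.

```python
import re

ROMAN_1_TO_39 = re.compile(r"^(?=.)X{0,3}(?:V?I{0,3}|I[VX])$")

def to_roman(n):
    # Standard subtractive notation, built greedily from the largest symbol down.
    symbols = [(100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
               (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in symbols:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

# Which of the numbers 1..199 does the regex accept?
matched = [n for n in range(1, 200) if ROMAN_1_TO_39.match(to_roman(n))]
print(matched == list(range(1, 40)))
```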

Complex Regular Expression, PEG, or Multiple Passes?

I am trying to extract some data from the following examples:
Name 789, 10-mill 12-27b
Manufacturer XY-2822, 10-mill, 17-25b
Other Manufacturer 16b Part
Another Manufacturer FER M9000, 11-mill, 11-40
18b Part
Maker 11-31, 10-mill
Maker 1x or 2x; max size 1x (34b), 2x (38/24b)
Maker REC6 15/18/26b. Square.
Producer FC-40 11-13-16-19-22-25-27-30-34b
What I'd like my results to be respectively are:
12, 27
17, 25
16
11, 40
18
11-31
34, 38, 24 (optional, it's fine if only the latter two are provided)
15, 18, 26
11, 13, 16, 19, 22, 25, 27, 30, 34
I am happy to do this in multiple passes or with a parsing expression grammar, though I don't think a grammar will really help.
I'm having trouble using lookaheads and lookbehinds to grab that data and exclude things like "11-mill" and "XY-2822". What I find happening is that I am able to exclude those matches but end up truncating good results for other matches.
What is the best way to go about this?
My current regex is
/(?:(\d+)[b\b\/-])([b\d\b]*)[^a-z]/i
which captures the letter 'b' (which is okay) but does not capture 34b in the final example.
Not sure what are your exact requirements/formats but you can try this:
/(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?![-\/]\D)/
http://rubular.com/r/WJqcCNe2pr
Details:
# two possible starts:
(?: # next occurrences
    \G              # anchor for the position after the previous match
    (?!^)           # not at the start of the line
    [-\/]
  | # first occurrence
    ^
    (?:.*[^\d\/-])? # (note the greedy quantifier here,
                    # to obtain the last result of the line)
)
\K          # discards the characters matched so far from the whole match
\d++        # several digits, with a possessive quantifier to forbid backtracking
(?![-\/]\D) # not followed by a hyphen or a slash and then a non-digit
You can improve the pattern if you replace (?:.*[^\d\/-])? with [^-\d\/\n]*+(?>[-\d\/]+[^-\d\/\n]+)* (remove the \n if you work line by line.). The goal of this change is to limit the backtracking (that occurs atomic group by atomic group, instead of character by character for the first version).
Perhaps, you can replace the negative lookahead with this kind of positive lookahead: (?=[-\/]\d|b|$)
Perhaps this:
(?<=\d-)\d+|\d+(?=-\d+)|\d+(?=(?:\/\d+)*b)
https://regex101.com/r/nR3eS9/1
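The three-way alternation above uses only a fixed-width lookbehind and lookaheads, so it can be tried in Python's re module as well. A minimal sketch on a few of the question's sample lines (the variable names are illustrative):

```python
import re

# Same pattern as above: digits after "digit-", digits before "-digits",
# or digits followed by optional "/digits" groups and a "b".
NUMBERS = re.compile(r"(?<=\d-)\d+|\d+(?=-\d+)|\d+(?=(?:/\d+)*b)")

samples = [
    "Name 789, 10-mill 12-27b",
    "Manufacturer XY-2822, 10-mill, 17-25b",
    "Maker 1x or 2x; max size 1x (34b), 2x (38/24b)",
    "Maker REC6 15/18/26b. Square.",
    "Producer FC-40 11-13-16-19-22-25-27-30-34b",
]

for line in samples:
    print(line, "=>", NUMBERS.findall(line))
```

Note that for "Maker 11-31, 10-mill" this pattern yields 11 and 31 as separate matches rather than the single string 11-31.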

best approach for my pattern match

So, I've built a regex which follows this:
4!a2!a2!c[3!c]
which translates to
4 alpha characters followed by
2 alpha characters followed by
2 characters followed by
3 optional characters
This is the standard format for a SWIFT BIC code, e.g. HSBCGB2LXXX.
My regex to pull this out of a string is:
(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})
Now this is targeting a specific tag (32) and works, however, I'm not sure if it's the cleanest, plus if there are any characters before H then it fails.
the string being matched against is:
:32B:HsBfGB4LXXXHELLO
That returns HsBfGB4LXXX, but this:
:32B:2HsBfGB4LXXXHELLO
returns nothing.
EDIT
For clarity: I have a string which contains multiple lines, each starting with a colon, a two-digit number, an optional letter and another colon (e.g. :58A:). I want to specify which line to start matching in and return a BIC from anywhere in that line.
EDIT
Some more example data to help:
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
Ok, so I need to pass in as a variable the tag 53 or 59 and return the BIC HSBCGB2LXXX only!
Your regex can be simplified, and corrected to allow a character before the H, to:
:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)
The changes made were:
Lost the look behind - just make it part of the match
Inserting .? meaning "optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (just easier to read and fewer characters)
Group 1 of the match contains your target
I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.
However, the actual regular expression can be further cleaned:
instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.
All in all
(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})
is shorter and matches in the very same cases.
If you give a better description of the domain, probably further improvements are also possible.
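For the requirement in the question's second edit (pass in a tag such as 53 or 59 and get the BIC back), one possible approach is to split the message into fields first and then search only the chosen field. The Python sketch below is an illustration under assumptions: find_bic, FIELD and BIC are hypothetical names, the BIC pattern is the one from the answers above (a digit in position 7 and a literal XXX suffix), and a field is assumed to run until the next line that starts with a tag.

```python
import re

# A field starts with ":NN:" or ":NNA:" at the beginning of a line and runs
# until the next such tag or the end of the message (assumption).
FIELD = re.compile(r"^:(\d{2})([A-Z]?):(.*?)(?=^:\d{2}[A-Z]?:|\Z)",
                   re.MULTILINE | re.DOTALL)
# BIC pattern taken from the answers above.
BIC = re.compile(r"[A-Za-z]{6}\d[A-Za-z]XXX")

def find_bic(message, tag):
    """Return the first BIC found in a field whose number equals tag, else None."""
    for field in FIELD.finditer(message):
        if field.group(1) == tag:
            hit = BIC.search(field.group(3))
            if hit:
                return hit.group(0)
    return None

message = """:20:ABCDERF Z
:32A:140310AUD2120,
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
"""

print(find_bic(message, "53"), find_bic(message, "59"), find_bic(message, "20"))
```

Searching the whole field body, rather than only the tag line, is what lets the :59: case work, since the BIC sits on a continuation line there.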

python regex repetition with capture question

Using Python 3's regex capabilities, is it possible to capture a variable number of groups, based on the number of repetitions found? For instance, in the following search strings, I want to capture all the digit strings with the same regex.
search string 1(trying to capture: 89, 45):
zzz89zzz45.mp3
search string 2(trying to capture: 98, 67, 89, 45):
zzz98zzz67zzz89zzz45.mp3
search string 3(trying to capture: 98, 67, 89, 45, 55, 111):
zzz98zzz67zzz89zzz45vdvd55lplp111.mp3
The following regex will match all the repetitions, though not all the values are available for later use (only one digit string is captured):
((\d+)\D*)*\.mp3$
The other two options are writing a different regex for every case, or using findall(). Is there a way to adjust the above regex so that it captures every digit string for later use, with any number of repetitions, using just regex facilities? Or are you forced to use findall() to do this in Python 3?
Most or all regular expression engines in common use, including in particular those based on the PCRE syntax (like Python's), label their capturing groups according to the numerical index of the opening parenthesis, as the regex is written. So no, you cannot use capturing groups alone to extract an arbitrary, variable number of subsequences from a string.
The closest you can get (as far as I know) is to manually write out a certain number of capturing groups, something like this:
import re

s = ...
# Each group is optional, so strings with fewer than 25 digit runs still match.
res = re.match(r'\D*' + 25 * r'(?:(\d+)\D+)?', s)
numbers = [r for r in res.groups() if r is not None]
This will get you up to 25 groups of digits. If you need more, replace 25 with some higher number.
I wouldn't be surprised if this were less efficient than the iterative approach with findall(), although I haven't tested it.
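As a rough illustration of both points, the sketch below shows that the number of groups is a property of the compiled pattern rather than of the input, and how findall collects a variable number of digit strings. The helper name digit_runs is hypothetical, and stripping the extension before searching is a choice made for this sketch.

```python
import re

# The question's pattern: two opening parentheses, so two groups, always.
pattern = re.compile(r"((\d+)\D*)*\.mp3$")
group_count = pattern.groups
print(group_count)

def digit_runs(filename):
    # Drop the extension so that the "3" in ".mp3" is not reported,
    # then collect every run of digits that remains.
    stem = filename.rsplit(".", 1)[0]
    return re.findall(r"\d+", stem)

for name in ("zzz89zzz45.mp3",
             "zzz98zzz67zzz89zzz45.mp3",
             "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"):
    print(name, digit_runs(name))
```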
This will match all the numbers before the dot:
import re

s = "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"
# The lookahead requires a dot somewhere later, so the "3" in ".mp3" is skipped.
res = re.findall(r"[0-9]+(?=.*\.)", s)
print(res)