Complex Regular Expression, PEG, or Multiple Passes? - regex

I am trying to extract some data from the following examples:
Name 789, 10-mill 12-27b
Manufacturer XY-2822, 10-mill, 17-25b
Other Manufacturer 16b Part
Another Manufacturer FER M9000, 11-mill, 11-40
18b Part
Maker 11-31, 10-mill
Maker 1x or 2x; max size 1x (34b), 2x (38/24b)
Maker REC6 15/18/26b. Square.
Producer FC-40 11-13-16-19-22-25-27-30-34b
What I'd like my results to be respectively are:
12, 27
17, 25
16
11, 40
18
11-31
34, 38, 24 (optional; it's fine if only the latter two are provided)
15, 18, 26
11, 13, 16, 19, 22, 25, 27, 30, 34
I am happy to do this in multiple passes, using an expression grammar though I don't think that'll really help.
I'm having trouble using lookaheads and lookbehinds to grab that data and exclude things like "11-mill" and "XY-2822". What I find happening is that I am able to exclude those matches, but I end up truncating good results for other matches.
What is the best way to go about this?
My current regex is
/(?:(\d+)[b\b\/-])([b\d\b]*)[^a-z]/i
which is capturing the letter 'b' (which is okay) but not capturing 34b in the final example

Not sure what are your exact requirements/formats but you can try this:
/(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?![-\/]\D)/
http://rubular.com/r/WJqcCNe2pr
details:
# two possible starts:
(?: # next occurrences
\G # anchor for the position after the previous match
(?!^) # not at the start of the line
[-\/]
| # first occurrence
^
(?:.*[^\d\/-])? # (note the greedy quantifier here,
# to obtain the last result of the line)
)
\K # discards characters matched before from the whole match
\d++ # several digits with a possessive quantifier to forbid backtracking
(?![-\/]\D) # not followed by a hyphen or a slash and then a non-digit
You can improve the pattern if you replace (?:.*[^\d\/-])? with [^-\d\/\n]*+(?>[-\d\/]+[^-\d\/\n]+)* (remove the \n if you work line by line.). The goal of this change is to limit the backtracking (that occurs atomic group by atomic group, instead of character by character for the first version).
Perhaps, you can replace the negative lookahead with this kind of positive lookahead: (?=[-\/]\d|b|$)
Another version here.

Perhaps this:
(?<=\d-)\d+|\d+(?=-\d+)|\d+(?=(?:\/\d+)*b)
https://regex101.com/r/nR3eS9/1
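A quick way to try this alternation pattern outside regex101 is Python's re.findall. This is only a sketch: SAMPLES and extract_sizes are names made up for illustration, the sample lines are copied from the question, and the pattern is the one from this answer, unchanged. Note that for "Maker 11-31, 10-mill" it returns 11 and 31 as two separate matches rather than "11-31".

```python
import re

# Pattern from the answer above: a number after "digit-", a number before
# "-digits", or a number followed by optional "/digits" groups and a "b".
PATTERN = re.compile(r"(?<=\d-)\d+|\d+(?=-\d+)|\d+(?=(?:\/\d+)*b)")

# Sample lines copied from the question.
SAMPLES = [
    "Name 789, 10-mill 12-27b",
    "Manufacturer XY-2822, 10-mill, 17-25b",
    "Other Manufacturer 16b Part",
    "Another Manufacturer FER M9000, 11-mill, 11-40",
    "18b Part",
    "Maker 11-31, 10-mill",
    "Maker 1x or 2x; max size 1x (34b), 2x (38/24b)",
    "Maker REC6 15/18/26b. Square.",
    "Producer FC-40 11-13-16-19-22-25-27-30-34b",
]


def extract_sizes(line):
    """Return every number in the line that the pattern accepts, as strings."""
    return PATTERN.findall(line)


for line in SAMPLES:
    print(line, "->", extract_sizes(line))
```

Things like "10-mill" and "XY-2822" are skipped because neither side of the hyphen is a digit on both ends, and no "b" follows the number.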

Regex to enter a decimal number digit by digit

I have a requirement where the user can only enter a value between 0.01 and 100.00 in a textbox. I am using a regex to limit the data entered. However, I cannot type a decimal point (as in 95.83) with this regex. Can someone help me fix the regex below?
(^100([.]0{1,2})?)$|(^\d{1,2}([.]\d{1,2})?)$
If I copy and paste the value, it passes, but I am unable to type a decimal point.
Please advise.
Link to regex tester: https://regex101.com/r/b2BF6A/1
Link to demo: https://stackblitz.com/edit/react-9h2xsy
The regex
You can use the following regex:
See regex in use here
^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$
How it works
^(?:...|...|...)$ this anchors the pattern to ensure it matches the entire string
^ assert position at the start of the line
(?:...|...|...) non-capture group - used to group multiple alternations
$ assert position at the end of the line
(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})? first option
(?:\d?[1-9]|[1-9]0) match either of the following
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
(?:\.\d{0,2})? optionally match the following
\. this character . literally
\d{0,2} match any digit between 0 and 2 times
0{0,2}\.(?:\d?[1-9]|[1-9]0) second option
0{0,2} match 0 between 0 and 2 times
\. match this character . literally
(?:\d?[1-9]|[1-9]0) match either of the following options
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
10{2}(?:\.0{0,2})? third option
10{2} match 100
(?:\.0{0,2})? optionally match ., followed by 0 between 0 and 2 times
How it works (in simpler terms)
With the above descriptions for each alternation, this is what they will match:
Any two-digit number other than 0 or 00, optionally followed by any two-digit decimal.
In terms of a range, it's 1.00-99.99 with:
Optional leading zero: 01.00-99.99
Optional decimal: 01-99, or 01.-99, or 01.0-01.99
Any two-digit decimal other than 0 or 00
In terms of a range, it's .01-.99 with:
Optional leading zeroes: 00.01-00.99 or 0.01-0.99
Literally 100, followed by optional decimals: 100, or 100., or 100.0, or 100.00
The code
RegExp vs /pattern/
In your code, you can use either of the following options (replacing pattern with the pattern above):
new RegExp('pattern')
/pattern/
The first option above uses a string literal. This means that you must escape the backslash characters in the string in order for the pattern to be properly read:
^(?:(?:\\d?[1-9]|[1-9]0)(?:\\.\\d{0,2})?|0{0,2}\\.(?:\\d?[1-9]|[1-9]0)|10{2}(?:\\.0{0,2})?)$
The second option above allows you to avoid this and use the regex as is.
Here's a fork of your code using the second option.
Usability Issues
Please note that you'll run into a couple of usability issues with your current method of tackling this:
The user cannot erase all the digits they've entered. So if the user enters 100, they can only erase 00 and the 1 will remain. One option for resolving this is to make the entire non-capture group (with the alternations) optional by adding a ? after it. Whilst this does solve that issue, you now need to keep two regular expression patterns: one for user input and the other for validation. Alternatively, you could just test whether the input is an empty string and allow it (but not validate the form until the field is filled).
The user cannot enter a number beginning with a decimal point (.). This is because a lone . does not pass your validation step. The same rule applies here as in the previous point. You can allow it, though, either by explicitly checking whether the value is . or by adding a new alternation of |\.
Similarly to my last point, you'll run into the issue for .0 when a user is trying to write something like .01. Again here, you can run the same test.
Similarly again, 0 is not valid input - same applies here.
A change to the regex that covers these states (0, ., .0, 0., 0.0, 00.0 - but not .00 alternatives) is:
^(?:(?:\d?[1-9]?|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]?|[1-9]0)|10{2}(?:\.0{0,2})?)$
Better would be to create logic for these cases to match them with a separate regex:
^0{0,2}\.?0?$
Usability Fixes
With the changes above in mind, your function would become:
See code fork here
handleChange(e) {
  console.log(e.target.value)
  const r1 = /^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$/;
  const r2 = /^0{0,2}\.?0?$/
  if (r1.test(e.target.value)) {
    this.setState({
      [e.target.name]: e.target.value
    });
  } else if (r2.test(e.target.value)) {
    // Value is invalid, but permitted for usability purposes
    this.setState({
      [e.target.name]: e.target.value
    });
  }
}
This now allows the user to input those values, but also allows us to invalidate them if the user tries to submit it.
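The split between "accept while typing" and "valid on submit" can also be checked outside the React component. The following is a minimal Python sketch, not the answer's own code: the two patterns are copied from r1 and r2 above, while the names VALID, TYPING, is_valid and accept_keystroke are hypothetical helpers introduced only for this illustration.

```python
import re

# r1 from the answer: a finished value in the allowed range.
VALID = re.compile(
    r"^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?"
    r"|0{0,2}\.(?:\d?[1-9]|[1-9]0)"
    r"|10{2}(?:\.0{0,2})?)$"
)
# r2 from the answer: intermediate states such as "", "0", ".", ".0", "0.0".
TYPING = re.compile(r"^0{0,2}\.?0?$")


def is_valid(value):
    """True if the value would pass validation on submit."""
    return VALID.fullmatch(value) is not None


def accept_keystroke(value):
    """True if the textbox should accept this content while the user types."""
    return is_valid(value) or TYPING.fullmatch(value) is not None


# Typing "0.01" one character at a time: every step is accepted,
# but only the last one is valid.
for step in ["", "0", "0.", "0.0", "0.01"]:
    print(repr(step), accept_keystroke(step), is_valid(step))
```

Keeping the two checks separate means the submit handler can call only is_valid, so the relaxed typing states never reach the server.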
Using the range 0.01 to 100.00 without padding is this (non-factored):
0\.(?:0[1-9]|[1-9]\d)|[1-9]\d?\.\d{2}|100\.00
Expanded
# 0.01 to 0.99
0 \.
(?:
0 [1-9]
| [1-9] \d
)
|
# 1.00 to 99.99
[1-9] \d? \.
\d{2}
|
# 100.00
100 \.
00
It can be turned into a cascade of optional groups if incremental, partial input
should be allowed.
That partial form is shown here for the range regex at the top:
^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|[1-9]\d?(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
The code line with stringed regex :
const newRegExp = new RegExp("^(?:0(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|[1-9]\\d?(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$");
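To see what "incremental partial form" means in practice, the sketch below feeds every prefix of a value to the partial regex, the way a textbox would see it while the user types. It is written in Python rather than JavaScript purely for illustration; PARTIAL holds the pattern above unchanged, and prefixes_accepted is a hypothetical helper name.

```python
import re

# The partial ("cascade") pattern from above, unchanged.
PARTIAL = re.compile(
    r"^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?"
    r"|[1-9]\d?(?:\.\d{0,2})?"
    r"|1(?:0(?:0(?:\.0{0,2})?)?)?)?$"
)


def prefixes_accepted(text):
    """True if every prefix of text, including the empty string, matches."""
    return all(PARTIAL.fullmatch(text[:i]) for i in range(len(text) + 1))


for value in ["95.83", "100.00", "0.01", "100.01", "0.00"]:
    print(value, prefixes_accepted(value))
```

A value is typeable digit by digit only if this function returns True for it. Values outside the range fail at the first keystroke that leaves the cascade.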
_________________________
The regex 'partial' above requires the input to be blank or to start
with a digit. It also doesn't allow 1-9 with a preceding 0.
If that is all to be allowed, a simple mod is this :
^(?:0{0,2}(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|(?:[1-9]\d?|0[1-9])(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
which allows input like the following:
(It should be noted that doing this requires allowing the dot . as
a valid input but could be converted to 0. on the fly to be put
inside the input box.)
.1
00.01
09.90
01.
01.11
00.1
00
.
Stringed version :
"^(?:0{0,2}(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|(?:[1-9]\\d?|0[1-9])(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$"

Match if something is not preceded by something else

I'm trying to parse a string and extract some numbers from it. Basically, any 2-3 digits should be matched, except the ones that have "TEST" before them. Here are some examples:
TEST2XX_R_00.01.211_TEST => 00, 01, 211
TEST850_F_11.22.333_TEST => 11, 22, 333
TESTXXX_X_12.34.456 => 12, 34, 456
Here are some of the things I've tried:
(?<!TEST)[0-9]{2,3} - ignores only the first digit after TEST
_[0-9]{2,3}|\.[0-9]{2,3} - matches the numbers correctly, but matches the character before them (_ or .) as well.
I know this might be a duplicate to regex for matching something if it is not preceded by something else but I could not get my answer there.
Unfortunately, there is no way to use a single pattern to match a string not preceded with some sequence in Lua (note that you can't even rely on capturing an alternative that you need since TEST%d+|(%d+) will not work in Lua, Lua patterns do not support alternation).
You may remove all substrings that start with TEST + digits after it, and then extract digit chunks:
local s = "TEST2XX_R_00.01.211_TEST"
for x in string.gmatch(s:gsub("TEST%d+",""), "%d+") do
    print(x)
end
See the Lua demo
Here, s:gsub("TEST%d+","") will remove TEST<digits>+ and %d+ pattern used with string.gmatch will extract all digit chunks that remain.
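The same remove-then-extract idea carries over to other languages. Here is a sketch in Python (not Lua) that mirrors the two steps above; numbers_not_after_test is a made-up helper name, and the inputs are the three examples from the question.

```python
import re


def numbers_not_after_test(s):
    """Drop every 'TEST<digits>' chunk, then collect the remaining digit runs."""
    stripped = re.sub(r"TEST\d+", "", s)  # pass 1: remove TEST followed by digits
    return re.findall(r"\d+", stripped)   # pass 2: extract what is left


for s in [
    "TEST2XX_R_00.01.211_TEST",
    "TEST850_F_11.22.333_TEST",
    "TESTXXX_X_12.34.456",
]:
    print(s, "->", numbers_not_after_test(s))
```

Like the Lua version, this extracts digit runs of any length; restricting them to 2-3 digits would need a tighter second pattern.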

Trying to create a regex that allows the following format: yyyy[: -][VW]Week number

My regex currently looks like this
\b(19|20)\d{2}\b[- :][VW][0-5]{1}(?(?=[5])[0-2]{1}|[0-9]{1})
It doesn't quite do what I want as I'm trying to get this part
(?(?=[5])[0-2]{1}|[0-9]{1})
to say "If the previous number was 5 then you may only choose between 0-2, and if it's another number 0-4 then choosing between 0-9 is allowed".
Currently it allows 00-59 with the exclusion of 05, 15, 25, 35, etc.
Essentially I want it to look like this for example 2016-W25.
You need to replace [5] with a positive lookbehind (?<=5) in order to check a char to the left of the current location:
\b(19|20)\d{2}[- :][VW][0-5](?(?=(?<=5))[0-2]|[0-9])
^^^^^
See the regex demo
Also, you may get rid of the conditional pattern altogether by using a mere alternation group:
\b(19|20)\d{2}[- :][VW](?:[0-4][0-9]|5[0-2])
^^^^^^^^^^^^^^^^^^^^^
See this regex demo
The (?:[0-4][0-9]|5[0-2]) matches either a digit from 0 to 4 and then any digit (see [0-4][0-9]), or (see |) a 5 followed with 0, 1 or 2 (see 5[0-2]).
NOTE: Since the number of weeks can amount to 53, the [0-2] at the end might be replaced with [0-3] to also match 53 values.
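The alternation version ports directly to engines without lookaround conditionals. As a sketch in Python (whose re module only supports conditionals on group references, so the conditional variant is not used here), with the [0-3] variant from the note applied and fullmatch used so nothing may trail the week number. WEEK and is_year_week are names chosen for this example.

```python
import re

# Alternation pattern from above, with 5[0-3] so that week 53 is accepted.
WEEK = re.compile(r"\b(19|20)\d{2}[- :][VW](?:[0-4][0-9]|5[0-3])")


def is_year_week(text):
    """True if the whole string has the form yyyy[- :][VW]ww."""
    return WEEK.fullmatch(text) is not None


for candidate in ["2016-W25", "2016:V07", "2016 W53", "2016-W54", "1816-W25"]:
    print(candidate, is_year_week(candidate))
```

Like the original pattern, this still accepts week 00; excluding it would take one more alternative.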

is it possible to solve this with just one regex?

I would like to know if there is a regular expression that given for example this input:
lkjs kjsfjk ijsfj á 13total wer6klje additional lñk jshv kjsdfjk dj d 22total kejk jksfljkakvhjr j 3total fkljbher jr6 hrew7 hwr 41total sfdkj additional iuwefjkwf7 7erfh sf 5total klj kj kjsef87 jhwfe7 89 jhf
could output these 3 matches, which are numbers followed by total, that do not contain the word additional after (and before finding the next number):
22
3
5
So, for example I didn't match 13 because
13total wer6klje additional lñk jshv kjsdfjk dj d 22total
contains the word additional
And I didn't match 41 because
41total sfdkj additional iuwefjkwf7 7erfh sf 5total
contains the word additional
let me explain the input structure used in the example:
randomText 13total randomText additional randomText
22total randomText
3total randomText
41total randomText additional randomText
5total randomText
So basically the input is something like:
randomText X_total randomText_that_contains_or_not_'additional'
X_total randomText_that_contains_or_not_'additional'
....
X_total randomText_that_contains_or_not_'additional'
I know how to solve the problem using some additional code (several patterns and matches, if-else structures...), but the system I'm working with cannot make use of those. It can only be fed a single regular expression (it's a complicated system, not easy to modify).
So, for example, with the regular expression [0-9]+(?=total) I would get these matches: 13, 22, 3, 41, 5
but as I said I just need 22, 3, 5
Can anybody build a more complex regular expression that matches those 3 numbers?
Thanks!
Of course it is possible (given that your regex flavour supports lookahead assertions)
\d+(?=total(?!\D*additional))
See it here on regex101
\d+ matches one or more digits
(?=total(?!\D*additional)) nested lookaround assertions. The digits have to be followed by "total", which in turn must not be followed by "additional" (with only non-digits in between)
A more advanced example based on Bergi's comment:
\d+(?=total(?!(?:.(?!\d+total))*additional))
See it on regex101
Here I keep searching for additional as long as I do not find \d+total
You can use (the total will always be preceded by a digit, right?)
\d+(?=total(?!(?:\D|\d(?!total))*additional))
Explanation
The idea is to forbid any additional before the next <digit>total:
\d+ # digits
(?=total # followed by total
(?! # not followed by...
(?:
\D++ # not a digit (possessive quantifier)
| # OR
\d(?!total) # a digit, but not followed by total
)*+ # any number of times
additional
)
)
The negative look ahead will fail the regex if it finds one, and we're sure not to pass over a <digit>total thanks to (?:\D|\d(?!total)).
See demo here.
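The pattern can be checked against the input from the question with a few lines of Python. This sketch uses the non-possessive form given first (\D and *), so it also runs on Python versions before 3.11, where possessive quantifiers are not available; PATTERN and TEXT are just names chosen here.

```python
import re

# Digits followed by "total", provided no "additional" appears
# before the next "<digit>total".
PATTERN = re.compile(r"\d+(?=total(?!(?:\D|\d(?!total))*additional))")

# Input string copied from the question.
TEXT = (
    "lkjs kjsfjk ijsfj á 13total wer6klje additional lñk jshv kjsdfjk dj d "
    "22total kejk jksfljkakvhjr j 3total fkljbher jr6 hrew7 hwr 41total "
    "sfdkj additional iuwefjkwf7 7erfh sf 5total klj kj kjsef87 jhwfe7 89 jhf"
)

print(PATTERN.findall(TEXT))  # ['22', '3', '5']
```

The stray digits in the random text (6, 7, 87, 89) are never candidates because they are not followed by "total".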

python regex repetition with capture question

using python3's regex capabilities, is it possible to capture variable numbers of capture blocks, based on the number of the repetitions found? for instance, in the following search strings, i want to capture all the digit strings with the same regex.
search string 1(trying to capture: 89, 45):
zzz89zzz45.mp3
search string 2(trying to capture: 98, 67, 89, 45):
zzz98zzz67zzz89zzz45.mp3
search string 3(trying to capture: 98, 67, 89, 45, 55, 111):
zzz98zzz67zzz89zzz45vdvd55lplp111.mp3
the following regex will match all the repetitions, though all the values are not available for later use(only 1 digit string is captured):
((\d+)\D*)*\.mp3$
the other 2 options are writing a different regex for every case, or use findall(). Is there a way to adjust the above regex in order to capture every digit string for later use with various numbers of repetitions using just regex facilities, or to do this in python3, are you forced to use findall()?
Most or all regular expression engines in common use, including in particular those based on the PCRE syntax (like Python's), label their capturing groups according to the numerical index of the opening parenthesis, as the regex is written. So no, you cannot use capturing groups alone to extract an arbitrary, variable number of subsequences from a string.
The closest you can get (as far as I know) is to manually write out a certain number of capturing groups, something like this:
import re
s = ...
# each digit group is optional, so strings with fewer than 25 groups still match
res = re.match(r'\D*' + 25 * r'(?:(\d+)\D+)?', s)
numbers = [r for r in res.groups() if r is not None]
This will get you up to 25 groups of digits. If you need more, replace 25 with some higher number.
I wouldn't be surprised if this were less efficient than the iterative approach with findall(), although I haven't tested it.
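The point about group numbering can be seen directly. In this small sketch (variable names are arbitrary, and the pattern is a non-capturing variant of the one in the question), a group inside a repetition still yields exactly one slot in the match object, holding only the last repetition it matched, whereas findall() returns every digit string.

```python
import re

s = "zzz98zzz67zzz89zzz45.mp3"

# One pair of capturing parentheses means one group, however often it repeats.
m = re.search(r"(?:(\d+)\D*)*\.mp3$", s)
print(len(m.groups()), m.groups())

# findall() applies the pattern repeatedly and collects every match instead.
all_numbers = re.findall(r"\d+(?=.*\.mp3$)", s)
print(all_numbers)  # ['98', '67', '89', '45']
```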
This will match all the numbers before the dot:
import re
s = "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"
res = re.findall("[0-9]+(?=.*\\.)", s)
print(res)