RegEx for matching group in multiline texts - regex

I have this multi-line text, I want to extract the numerical value before the 'Next' text (in this case 13). The numerical values will change, but the location will stay the same, it indicates total # of pages on website. I am having trouble writing the correct regex to return this value:
Previous
1
2
3
...
13
Next
Showing 1 - 100 of 1227 Results[EXTRACT]
pattern =re.compile(r'(\d{1,2})\r\nNext', re.M)
result = pattern.match(text)
The expected return value is 13.

import re
t = """Previous
1
2
3
...
13
Next
Showing 1 - 100 of 1227 Results[EXTRACT]"""
re.search(r"\d+(?=\s+Next)", t).group(0)
Returns: '13'
The regular expression does a lookahead assertion to see if there is any amount (>1) of digits followed by any amount (>1) of whitespace characters followed by the word Next.

Related

Regex to enter a decimal number digit by digit

I have a requirement where user can input only between 0.01 to 100.00 in a textbox. I am using regex to limit the data entered. However, I cannot enter a decimal point, like 95.83 in the regex. Can someone help me fix the below regex?
(^100([.]0{1,2})?)$|(^\d{1,2}([.]\d{1,2})?)$
if I copy paste the value, it passes. But unable to type a decimal point.
Please advice.
Link to regex tester: https://regex101.com/r/b2BF6A/1
Link to demo: https://stackblitz.com/edit/react-9h2xsy
The regex
You can use the following regex:
See regex in use here
^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$
How it works
^(?:...|...|...)$ this anchors the pattern to ensure it matches the entire string
^ assert position at the start of the line
(?:...|...|...) non-capture group - used to group multiple alternations
$ assert position at the end of the line
(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})? first option
(?:\d?[1-9]|[1-9]0) match either of the following
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
(?:\.\d{0,2})? optionally match the following
\. this character . literally
\d{0,2} match any digit between 0 and 2 times
0{0,2}\.(?:\d?[1-9]|[1-9]0) second option
0{0,2} match 0 between 0 and 2 times
\. match this character . literally
(?:\d?[1-9]|[1-9]0) match either of the following options
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
10{2}(?:\.0{0,2})? third option
10{2} match 100
(?:\.0{0,2})? optionally match ., followed by 0 between 0 and 2 times
How it works (in simpler terms)
With the above descriptions for each alternation, this is what they will match:
Any two-digit number other than 0 or 00, optionally followed by any two-digit decimal.
In terms of a range, it's 1.00-99.99 with:
Optional leading zero: 01.00-99.99
Optional decimal: 01-99, or 01.-99, or 01.0-01.99
Any two-digit decimal other than 0 or 00
In terms of a range, it's .01-.99 with:
Optional leading zeroes: 00.01-00.99 or 0.01-0.99
Literally 100, followed by optional decimals: 100, or 100., or 100.0, or 100.00
The code
RegExp vs /pattern/
In your code, you can use either of the following options (replacing pattern with the pattern above):
new RegExp('pattern')
/pattern/
The first option above uses a string literal. This means that you must escape the backslash characters in the string in order for the pattern to be properly read:
^(?:(?:\\d?[1-9]|[1-9]0)(?:\\.\\d{0,2})?|0{0,2}\\.(?:\\d?[1-9]|[1-9]0)|10{2}(?:\\.0{0,2})?)$
The second option above allows you to avoid this and use the regex as is.
Here's a fork of your code using the second option.
Usability Issues
Please note that you'll run into a couple of usability issues with your current method of tackling this:
The user cannot erase all the digits they've entered. So if the user enters 100, they can only erase 00 and the 1 will remain. One option to resolving this is to make the entire non-capture group (with the alternations) optional by adding a ? after it. Whilst this does solve that issue, you now need to keep two regular expression patterns - one for user input and the other for validation. Alternatively, you could just test if the input is an empty string to allow it (but not validate the form until the field is filled.
The user cannot enter a number beginning with .. This is because we don't allow the input of . to go through your validation steps. The same rule applies here as the previous point made. You can allow it though if the value is . explicitly or add a new alternation of |\.
Similarly to my last point, you'll run into the issue for .0 when a user is trying to write something like .01. Again here, you can run the same test.
Similarly again, 0 is not valid input - same applies here.
An change to the regex that covers these states (0, ., .0, 0., 0.0, 00.0 - but not .00 alternatives) is:
^(?:(?:\d?[1-9]?|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]?|[1-9]0)|10{2}(?:\.0{0,2})?)$
Better would be to create logic for these cases to match them with a separate regex:
^0{0,2}\.?0?$
Usability Fixes
With the changes above in mind, your function would become:
See code fork here
handleChange(e) {
console.log(e.target.value)
const r1 = /^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$/;
const r2 = /^0{0,2}\.?0?$/
if (r1.test(e.target.value)) {
this.setState({
[e.target.name]: e.target.value
});
} else if (r2.test(e.target.value)) {
// Value is invalid, but permitted for usability purposes
this.setState({
[e.target.name]: e.target.value
});
}
}
This now allows the user to input those values, but also allows us to invalidate them if the user tries to submit it.
Using the range 0.01 to 100.00 without padding is this (non-factored):
0\.(?:0[1-9]|[1-9]\d)|[1-9]\d?\.\d{2}|100\.00
Expanded
# 0.01 to 0.99
0 \.
(?:
0 [1-9]
| [1-9] \d
)
|
# 1.00 to 99.99
[1-9] \d? \.
\d{2}
|
# 100.00
100 \.
00
It can be made to have an optional cascade if incremental partial form
should be allowed.
That partial is shown here for the top regex range :
^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|[1-9]\d?(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
The code line with stringed regex :
const newRegExp = new RegExp("^(?:0(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|[1-9]\\d?(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$");
_________________________
The regex 'partial' above requires the input to be blank or to start
with a digit. It also doesn't allow 1-9 with a preceding 0.
If that is all to be allowed, a simple mod is this :
^(?:0{0,2}(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|(?:[1-9]\d?|0[1-9])(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
which allows input like the following:
(It should be noted that doing this requires allowing the dot . as
a valid input but could be converted to 0. on the fly to be put
inside the input box.)
.1
00.01
09.90
01.
01.11
00.1
00
.
Stringed version :
"^(?:0{0,2}(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|(?:[1-9]\\d?|0[1-9])(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$"

Regex - Match n occurences of substring within any m-lettered window

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
It tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea gow to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
import re
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''
(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.
I think this is solution for your problem.
str.replace(r"(\d{1,2})\s*-\s*(\d{1,2})", "$1-$2")
How it works:
(\d{1,2}) capture group of 1 or 2 numbers.
\s* find 0 or more whitespace.
- find -.
$1 replace content with content of capture group 1
$2 replace content with content of capture group 2
you can also look at this.

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
Desired results
Desired results should look like that:
indicator period values
1 someindicator 2001 0.2655087
2 someindicator 2011 0.3721239
3 some text 20022008 0.5728534
4 another indicator 2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (counting from first digit with the first digit are in the second column)
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that charter. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
? matches the space character in front of it literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
You could also use unglue::unnest() :
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)

Get which capture group matched a result using Delphi's TRegex

I've written a regex whose job is to return all matches to its three alternative capture groups. My goal is to learn which capture group produced each match. PCRE seems able to produce that information. But I haven't yet been able coerce the TRegEx class in Delphi XE8 to yield meaningful capture group info for matches. I can't claim to be at the head of regex class, and TRegEx is new to me, so who knows what errors I'm making.
The regex (regex101.com workpad) is:
(?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
This test text:
externship L763 clinic 207 B706 b512
gives five matches in test environments. But a simple test program that walks the TGroupCollection of each TMatch in the TMatchCollection shows bizarre results about groups: all matches have more than one group (2, 3 or 4) with each group's Success true, and often the matched text is duplicated in several groups or is empty. So this data structure (below) isn't what I'd expect:
Using TRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
5 matches
'externship' with 2 groups:
length 10 at 1 value 'externship' (Sucess? True)
length 10 at 1 value 'externship' (Sucess? True)
'L763' with 4 groups:
length 4 at 12 value 'L763' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 12 value 'L763' (Sucess? True)
'clinic' with 2 groups:
length 6 at 17 value 'clinic' (Sucess? True)
length 6 at 17 value 'clinic' (Sucess? True)
'207' with 3 groups:
length 3 at 24 value '207' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 3 at 24 value '207' (Sucess? True)
'B706' with 4 groups:
length 4 at 28 value 'B706' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 28 value 'B706' (Sucess? True)
My simple test runner is this:
program regex_tester;
{$APPTYPE CONSOLE}
{$R *.res}
uses
System.SysUtils,
System.RegularExpressions,
System.RegularExpressionsCore;
var
Matched : Boolean;
J : integer;
Group : TGroup;
Match : TMatch;
Matches : TMatchCollection;
RegexText,
TestText : String;
RX : TRegEx;
RXPerl : TPerlRegEx;
begin
try
RegexText:='(?''word''\b[a-zA-Z]{3,}\b)|(?''id''\b\d{1,3}\b)|(?''course''\b[BL]\d{3}\b)';
TestText:='externship L763 clinic 207 B706 b512';
RX:=TRegex.Create(RegexText);
Matches:=RX.Matches(TestText);
Writeln(Format(#10#13#10#13'Using TRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RegexText, TestText]));
Writeln(Format('%d matches', [Matches.Count]));
for Match in Matches do
begin
Writeln(Format(' ''%s'' with %d groups:', [Match.Value,Match.Groups.Count]));
for Group in Match.Groups do
Writeln(Format(#9'length %d at %d value ''%s'' (Sucess? %s)', [Group.Length,Group.Index,Group.Value,BoolToStr(Group.Success, True)]));
end;
RXPerl:=TPerlRegEx.Create;
RXPerl.Subject:=TestText;
RXPerl.RegEx:=RegexText;
Writeln(Format(#10#13#10#13'Using TPerlRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RXPerl.Regex, RXPerl.Subject]));
Matched:=RXPerl.Match;
if Matched then
repeat
begin
Writeln(Format(' ''%s'' with %d groups:', [RXPerl.MatchedText,RXPerl.GroupCount]));
for J:=1 to RXPerl.GroupCount do
Writeln(Format(#9'length %d at %d, value ''%s''',[RXPerl.GroupLengths[J],RXPerl.GroupOffsets[J],RXPerl.Groups[J]]));
Matched:=RXPerl.MatchAgain;
end;
until Matched=false;
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
end.
I'd certainly appreciate a nudge in the right direction. If TRegEx is broken, I can of course use an alternative -- or I can let go of the perceived elegance of the solution, instead using three simpler tests to find the bits of info I need.
Added Information and Interpretation
As #andrei-galatyn notes, TRegEx uses TPerlRegEx for its work. So I added a section to my testing program (output below) where I experiment with that, too. It isn't as convenient to use as TRegEx, but its result is what it should be -- and without the problems of TRegEx's broken TGroup data structures. Whichever class I use, the last group's index (less 1 for TRegEx) is the capturing group I want.
Along the way I was reminded that Pascal arrays are often based on 1 rather than 0.
Using TPerlRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
'externship' with 1 groups:
length 10 at 1, value 'externship'
'L763' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 12, value 'L763'
'clinic' with 1 groups:
length 6 at 17, value 'clinic'
'207' with 2 groups:
length 0 at 1, value ''
length 3 at 24, value '207'
'B706' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 28, value 'B706'
Internally Delphi uses class TPerlRegEx and it has such description for GroupCount property:
Number of matched groups stored in the Groups array. This number is the number of the highest-numbered capturing group in your regular expression that actually participated in the last match. It may be less than the number of capturing groups in your regular expression.
E.g. when the regex "(a)|(b)" matches "a", GroupCount will be 1. When the same regex matches "b", GroupCount will be 2.
TRegEx class always adds one more group (for whole expression i guess).
In your case it should be enough to check every match like this:
case Match.Groups.Count-1 of
1: ; // "word" found
2: ; // "id" found
3: ; // "course" found
end;
It doesn't answer why Groups are filled with strange data, indeed it seems to be enough to answer your question. :)
I realize this is 4 years too late, but for others who may search for this, here's what I found (in Delphi RIO at least, I have not tested earlier versions, however the source URL below says as of XE), however:
From https://www.regular-expressions.info/delphi.html (important bit is bold italic)
The TMatch record provides several properties with details about the
match. Success indicates if a match was found. If this is False, all
other properties and methods are invalid. Value returns the matched
string. Index and Length indicate the position in the input string and
the length of the match. Groups returns a TGroupCollection record that
stores a TGroup record in its default Item[] property for each
capturing group. You can use a numeric index to Item[] for numbered
capturing groups, and a string index for named capturing groups.
So, if you NAME your matching group like this for example:
(?'MatchName'[^\/\?]*)
Which would match a string up until a / or ? character, you can then reference that capture group by name like this:
Matches := TRegEx.Matches(StrToMatch, RegexString, [ roExplicitCapture ] );
for Match in Matches do begin
// ... maybe some logic here to determine if this is the match you want ...
try
whatGotMatched := Match.Groups['MatchName'].Value;
except
whatGotMatched := '';
end;
end;
How exactly you get to your match groups in the Matches is dependent on how you structure your RegEx etc, how many matches you create, etc. I surround the Match.Groups['MatchName'].Value in a try-except block because if the match wasn't found, you will generate an index out of bounds error when accessing .Value since it will still say the match would have been at character X, for 0 characters. You could also check Match.Groups['MatchName'].Length for 0 before trying to access .Value to probably better results...but if you reference it by name, you don't have to decipher specifically what groups matched what, etc ... just ask for the named match you wanted, and if it matched, you will get it.
Edit: they reference to .Length to help verify before throwing an error about index out of bounds. However, if your search is constructed where a named search is not matched, and therefore not returned at all, you will get an error thrown for an invalid match name, so you may still may need to capture for that.