Why doesn't this recursive regex capture the entire code block? - regex

I'm trying to write a recursive regular expression to capture code blocks, but for some reason it seems to not be capturing them properly. I would expect the code below to capture the full body of the function, but instead it only captures the contents of the first if statement.
It's almost like the .+? is somehow gobbling up the first {, but it's supposed to be non-greedy so I don't understand why it would.
What is causing it to act this way?
Script:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
END
# Regular expression to capture balanced "{}" groups
my $regex = qr/
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?R) # Recursively match the entire expression
)* # Match 0 or more times
\} # Match closing brace
/x;
# is ".+?" gobbling up the first "{"?
# What would cause it to do this?
if ($text =~ m/int\s.+?($regex)/s){
print $1;
}
Output:
{
return x;
}
Expected Output:
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
I know that there is a Text::Balanced module for this purpose, but I am attempting to do this by hand in order to learn more about regular expressions.

(?R) recurses into the whole pattern – but what is the whole pattern? When you embed the quoted $regex into /int\s.+?($regex)/, the pattern is recompiled and (?R) refers to the new pattern. That's not what you intended.
I'd recommend you use named captures instead so that you can recurse by name. Change the $regex like
/(?<nestedbrace> ... (?&nestedbrace) ...)/
If you want to avoid extra captures, you can use the (?(DEFINE) ...) syntax to declare named regex patterns that can be called later:
my $define_nestedbrace_re = qr/(?(DEFINE)
(?<nestedbrace> ... (?&nestedbrace) ...)
)/x;
Then: /int\s.+?((?&nestedbrace))$define_nestedbrace_re/
That won't create additional captures. However, it is not generally possible to write encapsulated regex fragments. Techniques like preferring named captures instead of numbered captures can help here.

You can change your recursive pattern to this one:
/int\s+.*? (
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace chars without backtracking
| # OR
(?-1) # Recursively match the previous group
)* # Match 0 or more times
\}
)/sx
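Note the use of (?-1) instead of (?R); (?R) recurses into the whole pattern.
(?-1) is a relative subpattern call, not a back-reference: it recurses into the nearest capturing group opened before it, which here is the enclosing group 1.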

Related

Non-Capturing and Capturing Groups - The right way

I'm trying to match an array of elements preceded by a specific string in a line of text. For example, match all pets in the text below:
fruits:apple,banana;pets:cat,dog,bird;colors:green,blue
/(?:pets:)(\w+[,|;])+/g
Using the given regex, I could only match the last word, "bird".
Can anybody help me to understand the right way of using Non-Capturing and Capturing Groups?
Thanks!
First, let's talk about capturing and non-capturing groups:
(?:...) is the non-capturing version: the text must be matched, but you don't need its value afterwards.
() is the capturing version: the matched text is the value you want to keep.
So:
(?:pets:) means you are searching for "pets:" but don't want to capture it. After that point, you do want to capture (if I've understood correctly).
So try (?:pets:)([a-zA-Z,]+); which searches for "pets:" (without capturing it) and stops at the first ";" (which is not captured either).
Result is :
Match 1 : cat,dog,bird
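As a rough illustration of why the original pattern only yielded "bird", here is a minimal Python sketch (Python's re module is an assumption here, the question itself uses PCRE; the variable names are illustrative). It shows that a repeated capturing group keeps only its last repetition, and that capturing the whole list once and splitting it is one way around that:

```python
import re

text = "fruits:apple,banana;pets:cat,dog,bird;colors:green,blue"

# A capturing group inside a repetition is overwritten on every pass,
# so only the last repetition survives in group 1.
last_only = re.search(r"pets:(?:(\w+)[,;])+", text)
print(last_only.group(1))

# Capture the whole comma-separated list once, then split it.
whole = re.search(r"pets:([a-zA-Z,]+);", text)
pets = whole.group(1).split(",")
print(pets)
```

The split step is plain string handling rather than regex; it is only one possible way to get one value per pet.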
A better solution exists with 1 match == 1 pet.
Since you want each pet in a separate match and you are using PCRE, \G is, as suggested by Wiktor, a decent option:
(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)
Explanation:
1st Alternative (?:pets:) to find the start of the pattern
2nd Alternative \G(?!^)(\w+)(?:[,;]|$)
\G asserts position at the end of the previous match or the start of the string for the first match
Negative Lookahead (?!^) to assert that the Regex does not match at the start of the string
(\w+) matches each pet
Non-capturing group (?:[,;]|$) is used as a delimiter: it matches a single character from the list ,; or, with $, asserts the position at the end of the string
Perl Code Sample:
use strict;
use warnings;
use Data::Dumper;
my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';
my $regex = qr/(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)/mp;
my @result = ();
while ( $str =~ /$regex/g ) {
    # $1 is undefined when only the "pets:" alternative matched
    if (defined $1) {
        #print "$1\n";
        push @result, $1;
    }
}
print Dumper(\@result);

Regex: Capturing first occurrence before lookahead

I'm trying to capture the urls before a particular word. The only trouble is that the word could also be part of the domain.
Examples: (i'm trying to capture everything before dinner)
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
I am able to use:
^(.*://.*/)(?=dinner/?)
The trouble I am having is that the lookahead doesn't appear to be lazy enough.
So the following is failing:
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/
as it captures:
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/
I'm both failing to understand why and how to fix my regex.
Perhaps I'm on the wrong track but how can I capture all my examples?
You can use some laziness:
^(.*?:\/\/).*?/(?=dinner/?)
Because the first .* in your regex is greedy, it consumed everything up to the last "://" in the string, and the match was found from there.
A greedy .* in the middle of a regex is, by the way, risky practice. It can cause heavy backtracking and poor performance on long strings. .*? is usually the better choice here, since it is reluctant rather than greedy.
The lookahead itself is neither lazy nor greedy; it is only a check, and in your case a check for a quasi-fixed string.
What you need to make lazy is obviously the subpattern before the lookahead.
^https?:\/\/(?:[^\/]+\/)*?(?=dinner(?:\/|$))
Note: (?:/|$) is like a boundary that ensures the word "dinner" is followed by a slash or the end of the string.
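To make the greedy-versus-lazy difference concrete, here is a small Python sketch (Python is an assumption, the patterns themselves are the ones quoted above; the variable names are illustrative). It runs the question's greedy pattern and the lazy pattern from this answer against the URL that failed:

```python
import re

# The failing URL from the question.
url = ("https://breakfast.example.com/lunch/dinner/login.html"
       "?returnURL=https://breakfast.example.com/lunch/dinner/")

# Greedy: each .* grabs as much as it can, so the match ends at the LAST "dinner".
greedy = re.match(r"^(.*://.*/)(?=dinner/?)", url)

# Lazy: the group repeats as few times as possible, so it stops at the FIRST "dinner".
lazy = re.match(r"^https?://(?:[^/]+/)*?(?=dinner(?:/|$))", url)

print(greedy.group(1))
print(lazy.group(0))
```

The lazy version also handles a host such as dinnerdemo.example.com, because the lookahead requires "dinner" to be followed by a slash or the end of the string.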
Your primary flaw is using the greedy .* instead of the non-greedy .*?.
The following performs the matching that you desire using perl, but the regex could easily be applied in any language. Note the use of word boundaries around dinner, which might or might not be what you want:
use strict;
use warnings;
while (<DATA>) {
if (m{^(.*?://.*?/.*?)(?=\bdinner\b)}) {
print $1, "\n";
}
}
__DATA__
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
Outputs:
https://breakfast.example.com/lunch/
https://breakfast.example.brunch.com:8080/lunch/
http://dinnerdemo.example.com/
Another way as well.
# Multi-line optional
# ^(?:(?!://).)*://[^?/\r\n]+/(?:(?!dinner)[^?/\r\n]+/)*(?=dinner)
^ # BOL
(?:
(?! :// )
.
)*
://
[^?/\r\n]+ # Domain
/
(?:
(?! dinner ) # Dirs ?
[^?/\r\n]+
/
)*
(?= dinner )
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
https://breakfast.example.com/lunch/dinner/login.html?returnURL=https://breakfast.example.com/lunch/dinner/
Using python 3.7
import re
s = '''
https://breakfast.example.com/lunch/dinner/
https://breakfast.example.brunch.com:8080/lunch/dinner
http://dinnerdemo.example.com/dinner/
'''
# .+ rather than .* so findall does not also report an empty match right before each "dinner"
pat = re.compile(r'.+(?=dinner)', re.M)
mo = pat.findall(s)
for line in mo:
    print(line)
Print Output:
https://breakfast.example.com/lunch/
https://breakfast.example.brunch.com:8080/lunch/
http://dinnerdemo.example.com/

regular expression for replacing string in one column of text using perl

I am trying to write a regular expression to replace a string in the first column of a text file using Perl. I have tried the following:
foreach (@filecontents)
{
$_ =~ s/($usersearch)\t|$usersearch\s\w+\t/$userreplace/gi;
}
This works with the data I have tested, but is there a better way to do it?
You can use the ^ anchor (start-of-string anchor), and you can shorten the pattern a little:
$_ =~ s/^$usersearch(?:\s\w+)??\t/$userreplace/i;
Instead of using a lazy quantifier ?? you can write:
$_ =~ s/^$usersearch(?:[^\S\t]\w+)?\t/$userreplace/i;
The result can be a little faster with this second version.
Descriptions:
(?:..) # is a non-capturing group, it's only used to group elements
# together without capturing
?? # is the lazy version of the ? quantifier (zero or one time)
(?:..)?? # means "match the group only if needed"
# (vs (?:..)? # means "match the group if it is possible")
[^\S\t] # a character class that contains all whitespace characters except the tab
# the ^ at the beginning is used to negate the class, \S is all that
# is not a whitespace character ( \s <=> [^\S] ), you only need to add \t
# to exclude it.
Note: if your variable $usersearch may contain regex special characters, don't forget to use quotemeta before using it in a pattern.
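For readers working outside Perl, the same idea can be sketched in Python, where re.escape plays the role of quotemeta. This is only an illustration under assumptions: the function name replace_first_column and its parameters are hypothetical, and the replacement is expected to carry its own trailing tab because the pattern consumes the tab.

```python
import re

def replace_first_column(line, search, replacement):
    """Replace `search` (plus an optional extra word) at the start of a tab-separated line."""
    # re.escape neutralises regex metacharacters in the user-supplied search text.
    pattern = re.compile(
        r"^" + re.escape(search) + r"(?:[^\S\t]\w+)?\t",
        re.IGNORECASE,
    )
    # A function as replacement avoids backslash processing in `replacement`.
    return pattern.sub(lambda m: replacement, line, count=1)

print(replace_first_column("a.b\tvalue", "a.b", "X\t"))
print(replace_first_column("axb\tvalue", "a.b", "X\t"))
```

Without the escaping step, the "." in "a.b" would match any character and the second line would be rewritten as well.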

Perl regex issue with brackets where content are multiline

I have a string in a file, which is to be read by Perl, and can either be:
previous content ending with a linebreak
keyword: content
next content
or
previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content
In either case, I successfully loaded the contents, from the beginning of previous content, till the end of next, in a string, call it $str.
Now, I want to extract the stuff between the linebreak that ends previous content, and the linebreak before next content.
So I used a regex on $str like this:
if($str =~
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?!\{+) # NO { follows
\s+(?!\{+) # NO { with a heading whitespace
\s* # white space between keyword: and content
(?!\{+) # no { immediately before content
# question : should the last one be a negative lookbehind AFTER the check for content itself?
([^\s]+) # the content, should be in $1;
(?!\{+) # no trailing { immediately after content
\s+ # delimited by a whitespace, ignore what comes afterwards
| # or
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?=\s*{*\s*)*) # any mix of whitespace and {
(?=\{+) # at least one {
(?=\s*{*\s*)*) # again any mix of whitespace and {
([^\{\}]+) # no { or }
(?=\s*}*\s*)*) # any mix of whitespace and }
(?=\}+) # at least one }
(?=\s*}*\s*)*) # again any mix of whitespace and }
) { #do something with $1}
I realize that this one is not really addressing multiline information with nested parenthesis; however, it should capture objects in form keyword: {{ content} }
However, while I am able to capture the content in $1 in case of
keyword: content
form, I am unable to capture
keyword: {multiline with nested
{parenthesis} }
I finally did implement it using a simple counter based parser, instead of regex. I would love to know how can I do this in regex, to capture objects of the second form, with an explanation of the regex command, please.
Also, where did my formulation go wrong that it does not even capture single line content with multiple (but matched) heading and trailing parenthesis?
You can use this:
#!/usr/bin/perl
use strict;
use warnings;
my $str = "previous content ending with a linebreak
keyword: content
next content
previous contnet, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines,c losed by matching parethesis}
next content";
while ($str =~ /\nkeyword:
(?| # branch reset: i.e. the two capture groups have the same number
\s*
({ (?> [^{}]++ | (?1) )*+ }) # recursive pattern
| # OR
\h*
(.*+) # capture all until the end of line
) # close the branch reset group
/xg ) {
print "$1\n";
}
This pattern first tries to match content with nested curly brackets. If curly brackets are not found or are not balanced, the second alternative is tried and matches only the rest of the line (since the dot can't match newlines).
The branch reset feature (?|..|..) is useful to give the same number to the capturing group of each part of the alternation.
recursive pattern details:
( # open the capturing group 1
{ # literal opening curly bracket
(?> # atomic group: possible content between brackets
[^{}]++ # all that is not a curly bracket
| # OR
(?1) # recurse to the capturing group 1 (!here is the recursion!)
)*+ # repeat the atomic group zero or more times
} # literal closing curly bracket
) # close the capturing group 1
In this subpattern I use an atomic group (?>...) and the possessive quantifiers ++ and *+ to avoid backtracking as much as possible.
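Python's built-in re module has no recursion, so the recursive pattern above cannot be carried over directly. As a point of comparison, here is a minimal sketch of the counter-based approach the question mentions, which is a different technique from the recursive regex. The function name balanced_block is hypothetical.

```python
def balanced_block(text, start=0):
    """Return the first balanced {...} block at or after `start`, or None."""
    open_pos = text.find("{", start)
    if open_pos == -1:
        return None  # no opening brace at all
    depth = 0
    for i in range(open_pos, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close for the first open brace found.
                return text[open_pos:i + 1]
    return None  # braces never balanced

sample = "keyword: { content that contains {\nnested } and may span\nlines}\nnext content"
print(balanced_block(sample))
```

Like the recursive pattern, this treats every brace as structural; braces inside strings or comments would need extra handling.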
How about something like this?
if ($str =~ /keyword:\s*{(.*)}/s) {
my $key = $1;
if ($key =~ /([^{}]*)/) {
print "$1\n";
}
else {
print "$key\n";
}
}
elsif ($str =~ /keyword:\s*(.*)/) {
print "$1\n";
}
[^{}]* matches a run of characters that contains no braces. As written, it matches the first such run in $key (the text before any nested brace), and because * also accepts an empty run, the else branch is never reached.
The /s modifier lets . match newlines, so .* can span multiple lines. However, you don't want to look at multiple lines for keywords without braces, so that part is in the elsif statement.
Do you need to have the same number of matching braces? For example, should keyword: {foo{bar{hello}}} output {{{hello}}}? If so, I feel like it would be better to stick with counters.
Edit:
For the input
keyword: {multiline
with nested {parenthesis} }
if you want the output
{multiline with nested {parenthesis} }
I believe that would be
if ($str =~ /keyword:\s*({.*})/s) {
my $match = $1;
$match =~ s/\n//g;
print "$match\n";
}
elsif ($str =~ /keyword:\s*(.*)/) {
print "$1\n";
}

how to match several regular expression patterns sequentially in perl

I want to do matching in the following way for a large multiline text:
I have a few matching patterns:
$text =~ m#finance(.*?)end#s;
$text =~ m#<class>(.*?)</class>#s;
$text =~ m#/data(.*?)<end>#s;
If any one of them matches, print the result (print $1), and then continue with the rest of the text, matching again against the three patterns.
How can I get the printed results in the order they appear in the whole text?
Many thanks for your help!
while ($text =~ m#(?: finance (.*?) end
| <class> (.*?) </class>
| data (.*?) </end>
)
#sgx) {
print $+;
}
ought to do it.
$+ holds the text of the highest-numbered capturing group that took part in the match, which here is the group of whichever alternative succeeded.
The /g modifier is intended specifically for this kind of usage; it turns the regex into an iterator that, when resumed, continues the match where it left off instead of restarting at the beginning of $text.
(And /x lets you use arbitrary whitespace, meaning you can make your regexes readable. Or as readable as they get, at least.)
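The same "one combined pattern, iterate in text order" idea can be sketched in Python, assuming re.finditer as the counterpart of a /g loop and Match.lastindex as a stand-in for $+ (this works here because each alternative has exactly one group). The sample text is made up for illustration.

```python
import re

# Made-up sample text containing one hit for each alternative.
text = "intro finance Q1 report end filler <class>math</class> more data 42 </end>"

pattern = re.compile(
    r"finance(.*?)end|<class>(.*?)</class>|data(.*?)</end>",
    re.S,  # let . match newlines, like /s
)

# finditer yields matches left to right, so results come out in text order.
found = [m.group(m.lastindex) for m in pattern.finditer(text)]
print(found)
```

Because the three patterns are tried together at each position, the results follow the order in which they appear in the text rather than the order of the alternatives.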
If you need to deal with multiple captures, it becomes a bit harder as you can't use $+. You can, however, test for capturing groups being defined:
while ($text =~ m#(?: a (.*?) b (.*?) c
| d (.*?) e (.*?) f
| data (.*?) </end>
)
#sgx) {
if (defined $1) {
# first set matched (don't need to check $2)
}
elsif (defined $3) {
# second set matched
}
else {
# final one matched
}
}
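A comparable check in Python, offered only as a sketch: groups that did not take part in a match are None, so testing the first group of each alternative tells you which branch matched. The function name classify and the sample input are hypothetical.

```python
import re

pattern = re.compile(r"a(.*?)b(.*?)c|d(.*?)e(.*?)f", re.S)

def classify(text):
    """Return (branch, first capture, second capture) for each match, in text order."""
    results = []
    for m in pattern.finditer(text):
        if m.group(1) is not None:
            # First alternative matched; groups 3 and 4 are None.
            results.append(("first", m.group(1), m.group(2)))
        else:
            # Second alternative matched; groups 1 and 2 are None.
            results.append(("second", m.group(3), m.group(4)))
    return results

print(classify("a1b2c d3e4f"))
```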