Perl regex issue with brackets where the content is multiline - regex

I have a string in a file, which is to be read by Perl, and can either be:
previous content ending with a linebreak
keyword: content
next content
or
previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content
In either case, I successfully loaded the contents, from the beginning of previous content, till the end of next, in a string, call it $str.
Now, I want to extract the stuff between the linebreak that ends previous content, and the linebreak before next content.
So I used a regex on $str like this:
if($str =~
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?!\{+) # NO { follows
\s+(?!\{+) # NO { with a heading whitespace
\s* # white space between keyword: and content
(?!\{+) # no { immediately before content
# question : should the last one be a negative lookbehind AFTER the check for content itself?
([^\s]+) # the content, should be in $1;
(?!\{+) # no trailing { immediately after content
\s+ # delimited by a whitespace, ignore what comes afterwards
| # or
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?=\s*{*\s*)*) # any mix of whitespace and {
(?=\{+) # at least one {
(?=\s*{*\s*)*) # again any mix of whitespace and {
([^\{\}]+) # no { or }
(?=\s*}*\s*)*) # any mix of whitespace and }
(?=\}+) # at least one }
(?=\s*}*\s*)*) # again any mix of whitespace and }
) { #do something with $1}
I realize that this one is not really addressing multiline information with nested parenthesis; however, it should capture objects in form keyword: {{ content} }
However, while I am able to capture the content in $1 in case of
keyword: content
form, I am unable to capture
keyword: {multiline with nested
{parenthesis} }
I finally did implement it using a simple counter-based parser instead of a regex. I would love to know how I can do this with a regex, to capture objects of the second form, with an explanation of the regex, please.
Also, where did my formulation go wrong that it does not even capture single line content with multiple (but matched) heading and trailing parenthesis?

You can use this:
#!/usr/bin/perl
use strict;
use warnings;
my $str = "previous content ending with a linebreak
keyword: content
next content
previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content";
while ($str =~ /\nkeyword:
(?| # branch reset: i.e. the two capture groups have the same number
\s*
(\{ (?> [^{}]++ | (?1) )*+ \}) # recursive pattern
| # OR
\h*
(.*+) # capture all until the end of line
) # close the branch reset group
/xg ) {
print "$1\n";
}
This pattern first tries to match content enclosed in nested curly brackets. If no curly brackets are found, or they are not balanced, the second alternative is tried and matches only the rest of the line (since the dot can't match newlines).
The branch reset feature (?|..|..) is useful to give the same number to the capturing group of each part of the alternation.
recursive pattern details:
( # open the capturing group 1
\{ # literal opening curly bracket
(?> # atomic group: possible content between brackets
[^{}]++ # all that is not a curly bracket
| # OR
(?1) # recurse to the capturing group 1 (!here is the recursion!)
)*+ # repeat the atomic group zero or more times
\} # literal closing curly bracket
) # close the capturing group 1
In this subpattern I use an atomic group (?>...) and the possessive quantifiers ++ and *+ to avoid backtracking as much as possible.
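As a quick check of the explanation above, here is a minimal sketch that runs the same branch-reset and recursion idea against a small input and collects the captures. The sample string and the @found array are made up for illustration, and the braces are escaped, which is an assumption about what newer Perl versions prefer rather than something the pattern strictly requires.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input covering both forms from the question.
my $str = join "\n",
    'previous content',
    'keyword: plain',
    'next content',
    'keyword: { outer {',
    'inner } tail}',
    'next content';

my @found;
while ($str =~ /\nkeyword:
        (?|                                      # branch reset: both branches use group 1
            \s* ( \{ (?> [^{}]++ | (?1) )*+ \} ) # balanced braces, recursing into group 1
          |
            \h* (.*+)                            # fallback: the rest of the line
        )
    /xg) {
    push @found, $1;
}

print "[$_]\n" for @found;
```

The first element should be the single-line content and the second the whole brace block, including its embedded newline.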

How about something like this?
if ($str =~ /keyword:\s*\{(.*)\}/s) {
my $key = $1;
if ($key =~ /([^{}]+)/) {
print "$1\n";
}
else {
print "$key\n";
}
}
elsif ($str =~ /keyword:\s*(.*)/) {
print "$1\n";
}
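The character class [^{}] matches any character that is not a brace, so the inner match picks out the first run of characters that has no braces in it. For nested input that is the text before the first inner brace, which is not necessarily the innermost content.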
The s modifier allows you to look at multiple lines even when using .*. However, you don't want to look at multiple lines for keywords without braces, so that part is in the elsif statement.
Do you need to have the same number of matching braces? For example, should keyword: {foo{bar{hello}}} output {{{hello}}}? If so, I feel like it would be better to stick with counters.
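For comparison, the counter approach mentioned above could look roughly like this. It is only a sketch: extract_balanced is a hypothetical helper name, and it assumes the offset it receives points at an opening brace.

```perl
use strict;
use warnings;

# extract_balanced is a hypothetical helper, not taken from the posts above.
# Given a string and the offset of an opening brace, it returns the text up to
# and including the matching closing brace, or undef if the braces never balance.
sub extract_balanced {
    my ($text, $start) = @_;
    my $depth = 0;
    for my $i ($start .. length($text) - 1) {
        my $ch = substr($text, $i, 1);
        if    ($ch eq '{') { $depth++ }
        elsif ($ch eq '}') { $depth-- }
        return substr($text, $start, $i - $start + 1) if $depth == 0;
    }
    return;    # ran out of input before the braces balanced
}

my $str = "keyword: {multiline\nwith nested {parenthesis} }\nnext content";
my $block;
if ($str =~ /keyword:\s*(?=\{)/) {
    # $+[0] is the offset where the match ended, i.e. the opening brace.
    $block = extract_balanced($str, $+[0]);
}
print "$block\n" if defined $block;
```

The regex only locates the keyword; the depth counter does the actual brace matching, so unbalanced input simply yields undef instead of a partial match.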
Edit:
For the input
keyword: {multiline
with nested {parenthesis} }
if you want the output
{multiline with nested {parenthesis} }
I believe that would be
if ($str =~ /keyword:\s*(\{.*\})/s) {
my $match = $1;
$match =~ s/\n/ /g;
print "$match\n";
}
elsif ($str =~ /keyword:\s*(.*)/) {
print "$1\n";
}

Related

Why doesn't this recursive regex capture the entire code block?

I'm trying to write a recursive regular expression to capture code blocks, but for some reason it seems to not be capturing them properly. I would expect the code below to capture the full body of the function, but instead it only captures the contents of the first if statement.
It's almost like the .+? is somehow gobbling up the first {, but it's supposed to be non-greedy so I don't understand why it would.
What is causing it to act this way?
Script:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
END
# Regular expression to capture balanced "{}" groups
my $regex = qr/
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?R) # Recursively match the entire expression
)* # Match 0 or more times
\} # Match closing brace
/x;
# is ".+?" gobbling up the first "{"?
# What would cause it to do this?
if ($text =~ m/int\s.+?($regex)/s){
print $1;
}
Output:
{
return x;
}
Expected Output:
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
I know that there is a Text::Balanced module for this purpose, but I am attempting to do this by hand in order to learn more about regular expressions.
(?R) recurses into the whole pattern – but what is the whole pattern? When you embed the quoted $regex into /int\s.+?($regex)/, the pattern is recompiled and (?R) refers to the new pattern. That's not what you intended.
I'd recommend you use named captures instead so that you can recurse by name. Change the $regex like
/(?<nestedbrace> ... (?&nestedbrace) ...)/
If you want to avoid extra captures, you can use the (?(DEFINE) ...) syntax to declare named regex patterns that can be called later:
my $define_nestedbrace_re = qr/(?(DEFINE)
(?<nestedbrace> ... (?&nestedbrace) ...)
)/x;
Then: /int\s.+?((?&nestedbrace))$define_nestedbrace_re/s
That won't create additional captures. However, it is not generally possible to write encapsulated regex fragments. Techniques like preferring named captures instead of numbered captures can help here.
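Filling in the elided parts with the balanced-brace pattern from the question, a complete version might look like the sketch below. The body of the nestedbrace group is my assumption about what the "..." stands for, and $body is just an illustrative variable.

```perl
use strict;
use warnings;

my $text = <<'END';
int max(int x, int y)
{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}
END

# The DEFINE block is never matched directly; it only declares the named
# pattern so that (?&nestedbrace) can call it, also from inside itself.
my $define_nestedbrace_re = qr/(?(DEFINE)
    (?<nestedbrace>
        \{
        (?: [^{}]++ | (?&nestedbrace) )*
        \}
    )
)/x;

my $body;
if ($text =~ /int\s.+?((?&nestedbrace))$define_nestedbrace_re/s) {
    $body = $1;
    print $body, "\n";
}
```

Because the recursion goes by name, it keeps pointing at the brace pattern no matter what the surrounding regex looks like, which is exactly what (?R) could not guarantee.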
You can change your recursive pattern to this one:
/int\s+.*? (
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace chars without backtracking
| # OR
(?-1) # Recursively match the previous group
)* # Match 0 or more times
\}
)/sx
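Note the use of (?-1) instead of (?R): (?R) recurses into the whole pattern, whereas (?-1) recurses into the previous capturing group, counted relative to where it appears.
(?-1) is a relative recursion into that group, not a back-reference.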

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to do what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at start or end of string. Do note that \b also matches next to non-word characters that may occur inside what you consider a single word, such as a hyphen:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
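To make the difference between the two patterns concrete, here is a small sketch that runs both against the same made-up line (the sample string and array names are only for illustration):

```perl
use strict;
use warnings;

my $line = 'x = aaa-xxx 42';

# \b treats the hyphen as a boundary, so both halves of "aaa-xxx" match.
my @by_boundary = $line =~ /\b(\w+)\b/g;

# The whitespace lookarounds only accept tokens delimited by whitespace
# or by the start or end of the string.
my @by_space = $line =~ /(?<!\S)(\w+)(?!\S)/g;

print "boundary: @by_boundary\n";   # boundary: x aaa xxx 42
print "space:    @by_space\n";      # space:    x 42
```

Which one you want depends on whether something like aaa-xxx should count as two words or as no word at all.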
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions

managing and documenting a multiline substitution in Perl

I have recently been learning about the /x modifier in Perl Best Practices, enabling you to do cool things like multi-line indentation and documentation:
$txt =~ m/^ # anchor at beginning of line
The\ quick\ (\w+)\ fox # fox adjective
\ (\w+)\ over # fox action verb
\ the\ (\w+)\ dog # dog adjective
(?: # whitespace-trimmed comment:
\s* \# \s* # whitespace and comment token
(.*?) # captured comment text; non-greedy!
\s* # any trailing whitespace
)? # this is all optional
$ # end of line anchor
/x; # allow whitespace
However, I was unable to do the equivalent for find/replace string substitutions. Is there some other similar best practice that should be used to manage complex substitutions more effectively?
Edit Take this for an example:
$test =~ s/(src\s*=\s*['"]?)(.*?\.(jpg|gif|png))/${1}something$2/sig;
Is there a similar way that this could be documented using multi-line/whitespace for better readability?
Many thanks
Since you've chosen not to provide an example of something that doesn't work, I'll offer a few guesses at what you might be doing wrong:
Note that the delimiter (in your case /) cannot appear inside any comments inside the regex, because then they'll be indicating the end of the regex. For example, this:
s/foo # this is interesting and/or cool
/bar/x
will not work, because the regex is terminated by the slash between and and or.
Note that /x does not work on the replacement-string, only on the regex itself. For example this:
s/foo/bar # I love the word bar/x
will replace foo with bar # I love the word bar.
If you really want to be able to put comments in the replacement-string, then I suppose you could use a replacement-expression instead, using the /e flag. That would let you use the full syntax of Perl. For example:
s/foo/'bar' # I love the word bar/e
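A short sketch of the two behaviours just described, using throwaway variables ($literal and $evaluated are hypothetical names):

```perl
use strict;
use warnings;

# Without /e the replacement is an ordinary string, so /x does not strip
# the "comment": it ends up in the result.
(my $literal = 'foo') =~ s/foo/bar # kept/x;

# With /e the replacement is Perl code, so a real comment is allowed.
(my $evaluated = 'foo') =~ s/foo/
    'bar'    # this comment is Perl code and is discarded
/e;

print "$literal\n";     # bar # kept
print "$evaluated\n";   # bar
```

So /x only buys you comments on the pattern side; for the replacement side you need /e and then have to write the replacement as an expression.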
Here is an example that does work:
$test =~
s/
# the regex to replace:
(src\s*=\s*['"]?) # src=' or src=" (plus optional whitespace)
(.*?\.(jpg|gif|png)) # the URI of the JPEG or GIF or PNG image
/
# the string to replace it with:
$1 . # src=' or src=" (unchanged)
'something' . # insert 'something' at the start of the URI
$2 # the original URI
/sigex;
If we just add the /x, we can break up the regular expression portion easily, including allowing comments.
my $test = '<img src = "http://www.somewhere.com/im/alright/jack/keep/your/hands/off/of/my/stack.gif" />';
$test =~ s/
( src \s* = \s* ['"]? ) # a src attribute ...
( .*?
\. (jpg|gif|png) # to an image file type, either jpeg, gif or png
)
/$1something$2/sigx # put 'something' in front of it
;
You have to use the evaluation switch (/e) if you want to break up the replacement. But the multi-line for the match portion, works fine.
Notice that I did not have to separate $1, because $1something is not a valid identifier anyway, so my version of Perl, at least, does not get confused.
For most of my evaluated replacements, I prefer the bracket style of substitution delimiter:
$test =~ s{
( src \s* = \s* ['"]? ) # a src attribute ... '
( .*?
\. (jpg|gif|png) # to an image file type, either jpeg, gif or png
)
}{
$1 . 'something' . $2
}sigxe
;
just to make it look more code-like.
Well
$test =~ s/(src\s*=\s*['"]?) # first group
(.*?\.(jpg|gif|png)) # second group
/${1}something$2/sigx;
should and does work indeed. Of course, you can't use this on the right part, unless you use something like:
$test =~ s/(src\s*=\s*['"]?) # first group
(.*?\.(jpg|gif|png)) # second group
/
$1 # Get 1st group
. "something" # Append ...
. $2 # Get 2d group
/sigxe;
s/foo/bar/
could be written as
s/
foo # foo
/
"bar" # bar
/xe
/x to allow whitespace in the pattern
/e to allow code in the replacement expression

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match more than one Number or point, ended by comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
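To see the three variants side by side, here is a small sketch on the string from the question (the array names are just for illustration):

```perl
use strict;
use warnings;

my $text = "241.000,00";

my @with_group    = $text =~ /[0-9.]+(,)/g;     # capture group present: returns the groups
my @without_group = $text =~ /[0-9.]+,/g;       # no group: returns the whole matches
my @lookahead     = $text =~ /[0-9.]+(?=,)/g;   # comma asserted but not consumed

print "with group:    @with_group\n";      # with group:    ,
print "without group: @without_group\n";   # without group: 241.000,
print "lookahead:     @lookahead\n";       # lookahead:     241.000
```

Only the first variant returns the bare comma, which is what the loop in the question was printing.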
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or simply leave it outside the parentheses, since it isn't part of what you want to capture; it won't make a difference in this case. However, the pattern as written puts the comma instead of the number into capture group 1, so in list context the match returns the comma rather than the number.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}

how to match several regular expression patterns sequentially in perl

I want to do matching in the following way for a large multiline text:
I have a few matching patterns:
$text =~ m#finance(.*?)end#s;
$text =~ m#<class>(.*?)</class>#s;
$text =~ m#/data(.*?)<end>#s;
If either one is matched, then print the result print $1, and then continue with the rest of the text to match again for the three patterns.
How can I get the printed results in the order they appear in the whole text?
Many thanks for your help!
while ($text =~ m#(?: finance (.*?) end
| <class> (.*?) </class>
| /data (.*?) <end>
)
#sgx) {
print $+;
}
ought to do it.
$+ is the last capturing group that successfully matched.
The /g modifier is intended specifically for this kind of usage; it turns the regex into an iterator that, when resumed, continues the match where it left off instead of restarting at the beginning of $text.
(And /x lets you use arbitrary whitespace, meaning you can make your regexes readable. Or as readable as they get, at least.)
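Here is a runnable sketch of that loop on a made-up input, to show that the results come out in the order they appear in the text rather than in the order of the alternatives. The sample text and the @hits array are assumptions for illustration; the markers follow the three patterns in the question.

```perl
use strict;
use warnings;

# Hypothetical sample text: the <class> block comes first on purpose.
my $text = "<class>beta</class>\nfinance alpha end\n/data gamma <end>\n";

my @hits;
while ($text =~ m#(?: finance (.*?) end
                    | <class> (.*?) </class>
                    | /data   (.*?) <end>
                  )
                 #sgx) {
    my $hit = $+;                 # copy before running another regex
    $hit =~ s/^\s+|\s+$//g;       # trim the whitespace around the capture
    push @hits, $hit;
}

print "$_\n" for @hits;
```

Note that with # as the delimiter you cannot put # comments inside the pattern, so the explanation has to live outside it.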
If you need to deal with multiple captures, it becomes a bit harder as you can't use $+. You can, however, test for capturing groups being defined:
while ($text =~ m#(?: a (.*?) b (.*?) c
| d (.*?) e (.*?) f
| /data (.*?) <end>
)
#sgx) {
if (defined $1) {
# first set matched (don't need to check $2)
}
elsif (defined $3) {
# second set matched
}
else {
# final one matched
}
}
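A trimmed-down, runnable version of the same idea with two alternatives, each having two captures. The input string and the @sets array are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical input: the second alternative occurs first in the text.
my $text = "d3e4f a1b2c";

my @sets;
while ($text =~ m#(?: a (.*?) b (.*?) c
                    | d (.*?) e (.*?) f
                  )
                 #sgx) {
    if (defined $1) {
        push @sets, "first: $1, $2";     # groups 1 and 2 belong to the first alternative
    }
    else {
        push @sets, "second: $3, $4";    # groups 3 and 4 belong to the second
    }
}

print "$_\n" for @sets;
```

Groups that belong to the alternative that did not match are left undefined, which is what makes the defined test reliable.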