Perl regex for multichar nested bracket - regex

I use a recursive Perl regular expression to scan for nested single-character brackets:
$RE = qr'(?:[\(]((?:(?>[^\(\)]+)|(??{$RE}))*)[\)])';
This lets me scan C function calls, something like:
"func (a(b()))" =~ /$RE/
matching "(a(b()))". Now I'd like to parse Pascal-style nested [if, end if] brackets, i.e.:
if (a) then
if (b) then
blaif := 1;
else
blaend := 2;
end if;
end if;
I tried to rewrite $RE from above to:
$RE_if = qr'(?:(?:if)((?:(?>(?!(?:\bif\b|\bend\s+if))+)|(??{$RE_if}))*)\
(?:\bend\s+if))';
But it doesn't really work. Does somebody have a regex that handles multichar brackets
like ["if","end if"]?
-- Greetings Konrad

Let's look at the original pattern: (Extraneous escapes removed. Needless surrounding (?:) removed.)
[(] # Prefix.
(
(?: (?> [^()] +) # Some characters containing neither prefix nor suffix.
| (??{ $RE }) # Recursion
)*
)
[)] # Suffix.
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so:
\bif\b
(
(?: (?> (?:(?! \b(?:end\s+)?if\b ).)+ )
| (??{ $RE })
)*
)
\bend\s+if\b
By the way, (?>PAT+) can be written PAT++.
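As a rough illustration of the pattern above, here is a minimal sketch that runs it against the Pascal-style sample from the question. It assumes Perl 5.10 or later and uses a named group with (?&...) recursion in place of (??{ }); the group name if_block and the surrounding sample lines are made up for this example.

```perl
use strict;
use warnings;

# Recursive pattern for "if ... end if" pairs. The group name
# "if_block" is an assumption of this sketch, not part of the answer.
my $RE_if = qr{
    (?<if_block>
        \b if \b
        (?: (?: (?! \b (?: end \s+ )? if \b ) . )++   # text with no "if" or "end if"
          | (?&if_block)                              # a nested block
        )*
        \b end \s+ if \b
    )
}xs;

my $src = <<'END_SRC';
x := 0;
if (a) then
  if (b) then
    blaif := 1;
  else
    blaend := 2;
  end if;
end if;
y := 1;
END_SRC

my $block;
if ($src =~ $RE_if) {
    $block = $+{if_block};
    print "matched:\n$block\n";
}
```

The \b anchors keep identifiers such as blaif and blaend from being mistaken for the keywords, and the /s flag lets the dot cross line breaks.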

Related

regular expression with recursion in Perl

I'm trying to use this but can't make it work. I want to check the syntax of expressions like this: (1+2)*(3+4)
I have integers, +, * and brackets. That's it, but it can be nested to any depth.
In BNF syntax the expr can be described like this:
expr    ::= <sum>
sum     ::= <product>{+<product>}
product ::= <atom>{*<atom>}
atom    ::= <number>|(<expr>)
number  ::= <digit>{<digit>}
I tried to translate this to Perl like this:
$number = '\d+';
$atom = "($number|\\((?R)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^$expr$/)
{
print "OK";
}
But it doesn't match! What am I doing wrong?
When you recurse, the ^ at the start of the pattern will fail to match.
Use (?(DEFINE)...) to define the rules instead of using (?R).
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&sum) )
(?<sum> (?&product) (?: \+ (?&product) )*+ )
(?<product> (?&atom) (?: \* (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
which simplifies to
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&binary_op) )
(?<binary_op> (?&atom) (?: [+*] (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
That's assuming you're only trying to check for validity rather than trying to parse the string. If you need to parse the string, you can build a parser using Parse::RecDescent or Marpa::R2.
ikegami's workaround above with the DEFINE stuff is beautiful, but it doesn't answer the question how to do it my way. A minimal change of my code to make it work? ikegami is right, the cause of no match is the ^ in /^$expr$/ . When the parser reenters the regex recursively it again checks for beginning of string, which fails. So I cannot have ^ and $ in the regex it seems. Without them my string matches. But then some invalid strings match too, like A(1+2)*(3+4)B . In the absence of ^ and $ it doesn't necessarily match the whole string. Problem.
ikegami suggested a solution to this in a comment above. I'll just write it out. I have tested it and it works:
$number = '\d+';
$atom = "($number|\\((?1)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^($expr)$/)
{
print "OK";
}
Notice that I now have (?1) instead of (?R) and that I have enclosed $expr in brackets. (?1) refers to the first capture group, which is ($expr). So the recursion reenters this subex instead of the whole regex. ^ is not met again. That solves it.
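To see the effect of anchoring around a capture group, here is a small sketch that runs the (?1) version against a few inputs. The test strings other than the first and the A...B one are made up for illustration, and the inner groups are written as non-capturing, which is a choice of this sketch and not required for (?1) to work.

```perl
use strict;
use warnings;

my $number  = '\d+';
my $atom    = "(?:$number|\\((?1)\\))";      # (?1) recurses into group 1 only
my $product = "$atom(?:\\*$atom)*";
my $sum     = "$product(?:\\+$product)*";
my $expr    = $sum;

# Group 1 wraps the whole expression; ^ and $ stay outside the recursion.
my $re = qr/^($expr)$/;

for my $s ('(1+2)*(3+4)', '((1))', '1+2*3', 'A(1+2)*(3+4)B', '(1+2', '1+') {
    printf "%-16s %s\n", $s, ($s =~ $re ? 'OK' : 'no match');
}
```

The first three strings should be accepted and the last three rejected, since the anchors are checked once, outside the recursive part.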

Regular Expression Inception (a match within a match)

I am trying to create a regular expression that captures a named group and then looks within that named group to check if it contains certain qualities.
For example. I have a regular expression that matches a code block and I can use it to match and capture code blocks:
test.pl:
use strict;
use warnings;
my $text = <<'END_TEXT';
block {
// random stuff
}
block {
dog
}
END_TEXT
my $code_block_rx = qr{(?(DEFINE)
(?<code_block>
block\h\{ (?: [^{}]++ | (?&code_block) )*+ \}
)
)}xms;
while ($text =~ m/(?<match>(?&code_block))$code_block_rx/g) {
print $+{match}."\n";
}
This code will print both code blocks. But what if I only want to capture code blocks that contain the word "dog"?
Is there a way (in a single regular expression) to capture a code block, and then if that is found, look within the code block for the word "dog"?
I've tried modifying the regex to use a look ahead assertion, but it just causes the whole thing to fail: /(?<match>(?=dog)(?&code_block))$code_block_rx/g
What am I missing?
You tried to match dog at the position where the match starts.
Instead, you can check if it's in the block you matched.
while ($text =~ /(
\b block \h*+ ( (?&code_block) )
(?(DEFINE)
(?<code_block> \{ (?&code_block_body) \} )
(?<code_block_body> (?: [^{}]++ | (?&code_block) )*+ )
)
)/xg) {
my $block_stmt = $1;
my $block_stmt_block = $2;
if ($block_stmt_block =~ /\b dog \b/x) {
say $block_stmt;
}
}
It can be done in a single pattern by using (?(?{!( assertion() )})(*FAIL)) to match against something you already captured.
while ($text =~ m{(
\b block \h*+
# A code_block that contains the word 'dog'.
( (?&code_block) ) (?(?{!( "$^N" =~ /\b dog \b/x )})(*FAIL))
(?(DEFINE)
(?<code_block> \{ (?&code_block_body) \} )
(?<code_block_body> (?: [^{}]++ | (?&code_block) )*+ )
)
)}xg) {
say $1;
}

Perl regex - read java file and match entire text of a function in file

I am trying to read a .java file into a perl variable, and I want to match a function, say for instance:
public String example(){
return "hello";
}
What would the regex pattern for this look like?
Current Attempt:
use strict;
use warnings;
open ( FILE, "example.java" ) || die "can't open file!";
my @lines = <FILE>;
close (FILE);
my $line;
foreach $line (@lines) {
if($line =~ /String example(.*)}/s){
print $line;
}
}
**Adopted from this answer
Regex:
^\s*([\w\s]+\(.*\)\s*(\{((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/(\w+)["']?[^;]+\4;$|[^{}<'"/]++|[^{}]++|(?2))*)}))
Breakdown:
^ \s*
( # (1 start)
[\w\s]+ \( .* \) \s* # How it matches a function definition
( # (2 start)
\{ # Opening curly bracket
( # (3 start)
(?> # Atomic grouping (for its non-capturing purpose only)
"(?: [^"\\]*+ | \\ . )*" # Double quoted strings
| '(?: [^'\\]*+ | \\ . )*' # Single quoted strings
| // .* $ # A comment block starting with //
| /\* [\s\S]*? \*/ # A multi-line comment block /*...*/
( \w+ ) # (4) ^
["']? [^;]+ \4 ; $ # ^
| [^{}<'"/]++ # Force engine to backtrack if it encounters special characters (possessive)
| [^{}]++ # Default matching behavior (possessive)
| (?2) # Recurs 2nd capturing group
)* # Zero to many times of atomic group
) # (3 end)
} # Closing curly bracket
) # (2 end)
) # (1 end)
Revo's regex is the Right Way To Do it (as much as a regex ever can be!).
But sometimes you just need something quick, to manipulate a file you have control over. I find, when using regexes, that it's often important to define "Good enough".
So, it may be "good enough" to assume the indentation is correct. In that case, you can just detect the start of the fn, then read until you find the next closing curly with the same indentation:
( # Capture \1.
^([\t ]+) # Match and capture leading whitespace to \2.
(?:\w+\s+)* # Modifiers and return type, if any.
\w+\s*\( # Name and opening round brace: is a function.
.*? # Need Dot-matches-newline, to match fn body.
\n\2} # Curly brace is as indented as start of fn.
) # End capture of \1.
Should work on clean code that you wrote yourself, code you can pass through an auto-formatter first, etc.
Will work with K&R, Horstmann and Allman indent styles.
Will fail with one-line and in-line functions, and indent styles like GNU, Whitesmiths, Pico and Ratliff - things which Revo's answer handles with no problems at all.
Also fails on lambdas, nested functions, and functions which use generics, but even Revo's doesn't recognize those, and they're not that common.
And neither of our regexes capture the comments preceding a function, which is pretty sinful.
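A minimal sketch of the indentation-based approach, applied to a whole source text held in one string (matching line by line, as in the question's attempt, cannot span a multi-line body). The Java sample is made up; the pattern is a variant of the one above that captures the indentation as a whole run, allows any number of leading words for modifiers and return type, and hard-codes the function name example. It assumes consistently indented code.

```perl
use strict;
use warnings;

# Made-up Java source standing in for the contents of example.java.
my $java = <<'END_JAVA';
public class Example {
    private int count;

    public String example() {
        if (count > 0) {
            return "hello";
        }
        return "bye";
    }

    public int other() {
        return 1;
    }
}
END_JAVA

my $fn_re = qr{
    (                      # Capture 1: the whole function.
        ^ ([\t ]+)         # Capture 2: leading indentation.
        (?: \w+ \s+ )*     # Modifiers and return type, if any.
        example \s* \(     # The function name we are looking for.
        .*?                # Body, as short as possible.
        \n \2 \}           # Closing brace at the same indentation.
    )
}xms;

# In list context the match returns the captures; keep only the first.
my ($function) = $java =~ $fn_re;
print "$function\n" if defined $function;
```

With a real file, the string could be filled by reading the handle with $/ set to undef so that the whole file arrives in one read.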

Perl regex: Extract token_key value from string

I just started learning Perl and trying to do regex to break down a token key.
The token itself has multiple "columns" and I only need the KEY section.
token:
{
"token_key":"C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8",
"key_type":"xyz",
"expires":1200
}
but I only need this part
C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8
everything else can be ignored when I output it.
Any suggestions or advice are welcome!
Thank You!
This looks like JSON -- perhaps a proper parser would be better?
use JSON::PP;
my $json = JSON::PP->new->utf8->allow_barekey;
my $token = $json->decode('{' . $str . '}')->{'token'};
print $token->{'token_key'};
In any case, you can extract it (a bit more hackishly) with a regex like so:
$str =~ /['"]token_key['"]:\s*['"]([a-f0-9]+)['"]/i;
print $1;
A slightly less hackish regex would be:
# (["'])\s*token_key\s*\1\s*:\s*(["'])((?:(?!\2)[\S\s])*)\2
( ["'] ) # (1)
\s* token_key \s*
\1
\s* : \s*
( ["'] ) # (2)
( # (3 start)
(?:
(?! \2 )
[\S\s]
)*
) # (3 end)
\2
A simpler pattern works under the following assumptions:
there is no whitespace between "token_key" and its value,
the strings use double quotes, not single quotes,
the file's contents are in the $_ variable.
In that case:
if (/"token_key":"([^"]+)/)
{ print "$1\n" }
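For comparison, here is a small sketch that extracts the key both ways from the same input: once with a regex and once with the core JSON::PP module. The token value is a shortened, made-up stand-in for the one in the question, and the sketch assumes the input starts with the token: label followed by a JSON object.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Shortened, made-up sample in the same shape as the token in the question.
my $str = <<'END_TOKEN';
token:
{
"token_key":"C9B3A703ADFEE7A5",
"key_type":"xyz",
"expires":1200
}
END_TOKEN

# Regex route: capture everything between the quotes after "token_key":.
my ($by_regex) = $str =~ /"token_key"\s*:\s*"([^"]+)"/;

# Parser route: strip the leading "token:" label, then decode the object.
(my $json_text = $str) =~ s/\A\s*token:\s*//;
my $by_parser = decode_json($json_text)->{token_key};

print "$by_regex\n$by_parser\n";
```

The parser route keeps working if the fields are reordered or the value contains escaped characters, which is where the regex route tends to break.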

extract contents of each level of parentheses

I am converting a SMAPI grammar to JSGF. They are pretty similar grammars used in different speech recognition systems. SMAPI uses a question mark the way the rest of the world does, to mean 0 or 1 of the previous thing. JSGF uses square brackets for this. So, I need to convert a string like stuff? to [stuff], and parenthesized strings like ((((stuff)? that)? I)? like)? to [[[[stuff] that] I] like]. I have to leave alone strings like ((((stuff) that) I) hate). As Qtax pointed out, a more complicated example would be (foo ((bar)? (baz))?) being replaced by (foo [[bar] (baz)]).
Because of this, I have to extract every level of a parenthesized expression, see if it ends in a question mark, and replace the parens and question mark with square braces if it does.
I think Eric Strom's answer to this question is almost what I need. The problem is that when I use it, it returns the largest matched grouping, whereas I need to do operations on each individual groupings.
This is what I have so far: s/( \( (?: [^()?]* | (?0) )* \) ) \?/[$1]/xg. When matched with ((((stuff)? that)? I)? like)?, however, it produces only [((((stuff)? that)? I)? like)]. Any ideas on how to do this?
You'll also want to look at ysth's solution to that question, and use a tool that is already available to solve this problem:
use Text::Balanced qw(extract_bracketed);
$text = '((((stuff)? that)? I)? like)?';
for ($i=0; $i<length($text); $i++) {
($match,$remainder) = extract_bracketed( substr($text,$i), '()' );
if ($match && $remainder =~ /^\?/) {
substr($text,$i) =
'[' . substr($match,1,-1) . ']' . substr($remainder,1);
$i=-1; # fixed
}
}
In older Perl versions (pre 5.10), one could have used code assertions and dynamic regex for this:
...
my $s = '((((stuff)? that)? I)? like)?';
# recursive dynamic regex, we need
# to pre-declare lexical variables
my $rg;
# use a dynamically generated regex (??{..})
# and a code assertion (?{..})
$rg = qr{
(?: # start expression
(?> [^)(]+) # (a) we don't see any (..) => atomic!
| # OR
( # (b) start capturing group for level
\( (??{$rg}) \) \? # oops, we found parentheses \(,\) w/sth
) # in between and the \? at the end
(?{ print "[ $^N ]\n" }) # if we got here, print the captured text $^N
)* # done, repeat expression if possible
}xs;
$s =~ /$rg/;
...
during the match, the code assertion prints all matches, which are:
[ (stuff)? ]
[ ((stuff)? that)? ]
[ (((stuff)? that)? I)? ]
[ ((((stuff)? that)? I)? like)? ]
To use this according to your requirements, you could change the code assertion slightly, put the capturing parentheses at the right place, and save the matches in an array:
...
my @result;
my $rg;
$rg = qr{
(?:
(?> [^)(]+)
|
\( ( (??{$rg}) ) \) \? (?{ push @result, $^N })
)*
}xs;
$s =~ /$rg/ && print map "[$_]\n", @result;
...
which says:
[stuff]
[(stuff)? that]
[((stuff)? that)? I]
[(((stuff)? that)? I)? like]
Regards
rbo
You could solve it in a couple of ways, simplest being just executing your expression till there are no more replacements made. E.g:
1 while s/( \( (?: [^()?]* | (?0) )* \) ) \?/[$1]/xg;
But that is highly inefficient (for deeply nested strings).
You could do it in one pass like this instead:
s{
(?(DEFINE)
(?<r> \( (?: [^()]++ | (?&r) )*+ \) )
)
( \( )
(?= (?: [^()]++ | (?&r) )*+ \) \? )
|
\) \?
}{
$2? '[': ']'
}gex;
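As a quick check of the one-pass substitution, the sketch below wraps it in a function and applies it to the three parenthesized examples from the question. The function name smapi_to_jsgf is made up for this example, and the bare stuff? form without parentheses is not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper: converts "( ... )?" groups to "[ ... ]" in one pass.
sub smapi_to_jsgf {
    my ($text) = @_;
    $text =~ s{
        (?(DEFINE)
            (?<r> \( (?: [^()]++ | (?&r) )*+ \) )   # a balanced (...) group
        )
        ( \( )                                      # an opening paren ...
        (?= (?: [^()]++ | (?&r) )*+ \) \? )         # ... whose partner is followed by ?
        |
        \) \?                                       # or a closing paren plus ?
    }{
        $2 ? '[' : ']'
    }gex;
    return $text;
}

for my $in ('((((stuff)? that)? I)? like)?',
            '((((stuff) that) I) hate)',
            '(foo ((bar)? (baz))?)') {
    print "$in => ", smapi_to_jsgf($in), "\n";
}
```

The named group r takes capture number 1, which is why the opening paren ends up in $2.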