Let's consider this piece of code which does not belong to any known language:
foo() {
bar() {
bar();
}
baz() {
// Content baz
qux() {
// Content qux
}
}
}
I would like to iteratively process each function by calling a subroutine which receives: the function name, the arguments, the indentation level and the content.
So far I have written this:
#!/usr/bin/env perl
use 5.010;
$_ = do {local $/; <>};
s/([{}])/$1.($1 eq '{'?++$i:$i--).$1/eg;
parse($_);
sub parse {
local $_ = shift;
while (/(?<name>\w+)\s*\((?<args>.*?)\)\s*\{(\d)\{(?<content>.*?)\}(?<level>\3)\}/gs) {
parse($+{content});
process($+{content}, $+{args}, $+{level}, $+{content});
}
}
sub process {
my ($name, $args, $level, $content) = #_;
#...
}
The tricky idea is to replace in-place each matched brace { with an indentation number. So this:
{
{
}
}
will become this:
{1{
{2{
}2}
}1}
It allows to easily write the parsing regex which simply become:
qr/
\w+ # name
\s* \(.*?\) # arguments
\s* \{(\d)\{ # opening brace
.*? # content
\s* \}\1\} # closing brace
/x
How can I rewrite this without this trick?
Note that the choice of {1{ could be anything else like {(1), {1-, {[1] or even {1✈
You could try using recursive regular expressions. For example:
/(\{(?:[^{}]++|(?1))*\})/
will match a group of balanced braces. For more information, see
Can I use Perl regular expressions to match balanced text? in perlfaq6,
Perl documentation "Extended patterns" in perlre, and
extract_bracketed in Text::Balanced.
You will have a hard time doing this with a recursive regex because you can only get the last value of each capturing group and will lose all of the intervening values. This task is more suited to a parser like Parse::RecDescent:
use strict;
use warnings;
use 5.010;
use Parse::RecDescent;
sub process {
my ($name, $args, $depth) = #_;
say "$depth - $name($args)";
}
my $grammar = q{
{ my $indent = 0; }
startrule : expression(s /;/)
expression : function_call
| function_def[$indent]
function_call : identifier '(' arglist ')' ';'
function_def : identifier '(' arglist ')' '{' expression[ $arg[0]++ ](s?) '}'
{ main::process( $item{identifier}, join(',', #$item{arglist}), $arg[0] ) }
arglist : identifier(s? /,/)
identifier : /\w+/
};
# Tell parser to ignore spaces and C99 one-line comments
my $skip_spaces_and_comments = qr{ (?: \s+ | // .*?$ )* }mxs;
$Parse::RecDescent::skip = $skip_spaces_and_comments;
my $parser = Parse::RecDescent->new($grammar) or die 'Bad grammar';
my $text = do { local $/; <DATA> };
defined $parser->startrule($text) or die 'Failed to parse';
__DATA__
foo() {
bar() {
bar();
}
baz() {
// Content baz
qux() {
// Content qux
}
}
}
Output:
2 - bar()
1 - qux()
2 - baz()
3 - foo()
Note that the depths are inverted (1 is the most nested) and this doesn't return the contents of the function definitions, but it should get you started.
Related
rencently I have met a strange bug when use a dynamic regular expressions in perl for Nesting brackets' match. The origin string is " {...test{...}...} ", I want to grep the pair brace begain with test, "test{...}". actually there are probably many pairs of brace before and end this group , I don't really know the deepth of them.
Following is my match scripts: nesting_parser.pl
#! /usr/bin/env perl
use Getopt::Long;
use Data::Dumper;
my %args = #ARGV;
if(exists$args{'-help'}) {printhelp();}
unless ($args{'-file'}) {printhelp();}
unless ($args{'-regex'}) {printhelp();}
my $OpenParents;
my $counts;
my $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;$counts++; print "\nLeft:".$OpenParents." ;"})
| \} (?(?{$OpenParents ne 0; $counts++}) (?{$OpenParents--;print "Right: ".$OpenParents." ;"})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
my $string = `cat $args{'-file'}`;
my $partten = $args{'-regex'} ;
print "####################################################\n";
print "Grep [$partten\{...\}] from $args{'-file'}\n";
print "####################################################\n";
while ($string =~ /($partten$NestedGuts)/xmgs){
print $1."}\n";
print $2."####\n";
}
print "Regex has seen $counts brackts\n";
sub printhelp{
print "Usage:\n";
print "\t./nesting_parser.pl -file [file] -regex '[regex expression]'\n";
print "\t[file] : file path\n";
print "\t[regex] : regex string\n";
exit;
}
Actually my regex is:
our $OpenParents;
our $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;})
| \} (?(?{$OpenParents ne 0}) (?{$OpenParents--})) (?(?{$OpenParents eq 0} (?!))
)*
)
}x;
I have add brace counts in nesting_parser.pl
I also write a string generator for debug: gen_nesting.pl
#! /usr/bin/env perl
use strict;
my $buffer = "{{{test{";
unless ($ARGV[0]) {print "Please specify the nest pair number!\n"; exit}
for (1..$ARGV[0]){
$buffer.= "\n\{\{\{\{$_\}\}\}\}";
#$buffer.= "\n\{\{\{\{\{\{\{\{\{$_\}\}\}\}\}\}\}\}\}";
}
$buffer .= "\n\}}}}";
open TEXT, ">log_$ARGV[0]";
print TEXT $buffer;
close TEXT;
You can generate a test file by
./gen_nesting.pl 1000
It will create a log file named log_1000, which include 1000 lines brace pairs
Now we test our match scripts:
./nesting_parser.pl -file log_1000 -regex "test" > debug_1000
debug_1000 looks like a great perfect result, matched successfully! But when I gen a 4000 lines test log file and match it again, it seem crashed:
./gen_nesting.pl 4000
./nesting_parser.pl -file log_4000 -regex "test" > debug_4000
The end of debug_4000 shows
{{{{3277}
####
Regex has seen 26213 brackts
I don't know what's wrong with the regex expresions, mostly it works well for paired brackets, untill recently I found it crashed when I try to match a text file more than 600,000 lines.
I'm really confused by this problems,
I really hope to solve this problem.
thank you all!
First for matching nested brackets I normally use Regexp::Common.
Next, I'm guessing that your problem is that Perl's regular expression engine breaks after matching 32767 groups. You can verify this by turning on warnings and looking for a message like Complex regular subexpression recursion limit (32766) exceeded.
If so, you can rewrite your code using /g and \G and pos. The idea being that you match the brackets in a loop like this untested code:
my $start = pos($string);
my $open_brackets = 0;
my $failed;
while (0 < $open_brackets or $start == pos($string)) {
if ($string =~ m/\G[^{}]*(\{|\})/g) {
if ($1 eq '{') {
$open_brackets++;
}
else {
$open_brackets--;
}
}
else {
$failed = 1;
break; # WE FAILED TO MATCH
}
}
if (not $failed and 0 == $open_brackets) {
my $matched = substr($string, $start, pos($string));
}
I'm trying to munge a simple grammar with a perl regex (note this isn’t intended for production use, just a quick analysis for providing editor hints/completions). For instance,
my $GRAMMAR = qr{(?(DEFINE)
(?<expr> \( (?&expr) \) | (?&number) | (?&var) | (?&expr) (?&op) (?&expr) )
(?<number> \d++ )
(?<var> [a-z]++ )
(?<op> [-+*/] )
)}x;
I would like to be able to run this as
$expr =~ /$GRAMMAR(?&expr)/;
and then access all the variable names. However, according to perlre,
Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.
So apparently this is not possible. I could try using a (?{ code }) block to save variable names to a hash, but this doesn't respect backtracking (i.e. the assignment’s side effect persists even if the variable is backtracked past).
Is there any way to get everything captured by a given named capture group, including recursive matches? Or do I need to manually dig through the individual pieces (and thus duplicate all the patterns)?
The necessity of having to add capturing and backtracking machinery is one of the shortcomings that Regexp::Grammars addresses.
However, the grammar in your question is left-recursive, which neither Perl regexes nor a recursive-descent parser will parse.
Adapting your grammar to Regexp::Grammars and factoring out left-recursion produces
my $EXPR = do {
use Regexp::Grammars;
qr{
^ <Expr> $
<rule: Expr> <Term> <ExprTail>
| <Term>
<rule: Term> <Number>
| <Var>
| \( <MATCH=Expr> \)
<rule: ExprTail> <Op> <Expr>
<token: Op> \+ | \- | \* | \/
<token: Number> \d++
<token: Var> [a-z]++
}x;
};
Note that this simple grammar gives all operators equal precedence rather than Please Excuse My Dear Aunt Sally.
You want to extract all variable names, so you could walk the AST as in
sub all_variables {
my($root,$var) = #_;
$var ||= {};
++$var->{ $root->{Var} } if exists $root->{Var};
all_variables($_, $var) for grep ref $_, values %$root;
wantarray ? keys %$var : [ keys %$var ];
}
and print the result with
if ("(a + (b - c))" =~ $EXPR) {
print "[$_]\n" for sort +all_variables \%/;
}
else {
print "no match\n";
}
Another approach is to install an autoaction for the Var rule that records names of variables as they are successfully parsed.
package JustTheVarsMaam;
sub new { bless {}, shift }
sub Var {
my($self,$result) = #_;
++$self->{VARS}{$result};
$result;
}
sub all_variables { keys %{ $_[0]->{VARS} } }
1;
Call this one as in
my $vars = JustTheVarsMaam->new;
if ("(a + (b - c))" =~ $EXPR->with_actions($vars)) {
print "[$_]\n" for sort $vars->all_variables;
}
else {
print "no match\n";
}
Either way, the output is
[a]
[b]
[c]
Recursivity is native with Marpa::R2 using the BNF in the __DATA__ section below:
#!env perl
use strict;
use diagnostics;
use Marpa::R2;
my $input = shift || '(a + (b - c))';
my $grammar_source = do {local $/; <DATA>};
my $recognizer = Marpa::R2::Scanless::R->new
(
{
grammar => Marpa::R2::Scanless::G->new
(
{
source => \$grammar_source,
action_object => __PACKAGE__,
}
)
},
);
my %vars = ();
sub new { return bless {}, shift;}
sub varAction { ++$vars{$_[1]}};
$recognizer->read(\$input);
$recognizer->value() || die "No parse";
print join(', ', sort keys %vars) . "\n";
__DATA__
:start ::= expr
expr ::= NUMBER
| VAR action => varAction
| expr OP expr
| '(' expr ')'
NUMBER ~ [\d]+
VAR ~ [a-z]+
OP ~ [-+*/]
WS ~ [\s]+
:discard ~ WS
The output is:
a, b, c
Your question was adressing only how to get the variable names, so no notion of operator associativity and so on in this answer. Just note that Marpa has no problem with that, if needed.
I'm trying to match text like sp { ...{...}... }, where the curly braces are allowed to nest. This is what I have so far:
my $regex = qr/
( #save $1
sp\s+ #start Soar production
( #save $2
\{ #opening brace
[^{}]* #anything but braces
\} #closing brace
| (?1) #or nested braces
)+ #0 or more
)
/x;
I just cannot get it to match the following text: sp { { word } }. Can anyone see what is wrong with my regex?
There are numerous problems. The recursive bit should be:
(
(?: \{ (?-1) \}
| [^{}]+
)*
)
All together:
my $regex = qr/
sp\s+
\{
(
(?: \{ (?-1) \}
| [^{}]++
)*
)
\}
/x;
print "$1\n" if 'sp { { word } }' =~ /($regex)/;
This is case for the underused Text::Balanced, a very handy core module for this kind of thing. It does rely on the pos of the start of the delimited sequence being found/set first, so I typically invoke it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';
sub get_bracketed {
my $str = shift;
# seek to beginning of bracket
return undef unless $str =~ /(sp\s+)(?={)/gc;
# store the prefix
my $prefix = $1;
# get everything from the start brace to the matching end brace
my ($bracketed) = extract_bracketed( $str, '{}');
# no closing brace found
return undef unless $bracketed;
# return the whole match
return $prefix . $bracketed;
}
my $str = 'sp { { word } }';
print get_bracketed $str;
The regex with the gc modifier tells the string to remember where the end point of the match is, and extract_bracketed uses that information to know where to start.
I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8 and could someone resolve this?
Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
#genes = $l =~ /($bp)/g;
There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;
I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my #out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push #out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)
To match balanced parenthesis or curly brackets, and if you want to take under account backslashed (escaped) ones, the proposed solutions would not work. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;
Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my #arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\#arr);
Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".
Here's a problem I ran into recently. I have attributes strings of the form
"x=1 and y=abc and z=c4g and ..."
Some attributes have numeric values, some have alpha values, some have mixed, some have dates, etc.
Every string is supposed to have "x=someval and y=anotherval" at the beginning, but some don't. I have three things I need to do.
Validate the strings to be certain that they have x and y.
Actually parse the values for x and y.
Get the rest of the string.
Given the example at the top, this would result in the following variables:
$x = 1;
$y = "abc";
$remainder = "z=c4g and ..."
My question is: Is there a (reasonably) simple way to parse these and validate with a single regular expression? i.e.:
if ($str =~ /someexpression/)
{
$x = $1;
$y = $2;
$remainder = $3;
}
Note that the string may consist of only x and y attributes. This is a valid string.
I'll post my solution as an answer, but it doesn't meet my single-regex preference.
Assuming you also want to do something with the other name=value pairs this is how I would do it ( using Perl version 5.10 ):
use 5.10.0;
use strict;
use warnings;
my %hash;
while(
$string =~ m{
(?: ^ | \G ) # start of string or previous match
\s*
(?<key> \w+ ) # word characters
=
(?<value> \S+ ) # non spaces
\s* # get to the start of the next match
(?: and )?
}xgi
){
$hash{$+{key}} = $+{value};
}
# to make sure that x & y exist
die unless exists $hash{x} and exists $hash{y};
On older Perls ( at least Perl 5.6 );
use strict;
use warnings;
my %hash;
while(
$string =~ m{
(?: ^ | \G ) # start of string or previous match
\s*
( \w+ ) = ( \S+ )
\s* # get to the start of the next match
(?: and )?
}xgi
){
$hash{$1} = $2;
}
# to make sure that x & y exist
die unless exists $hash{x} and exists $hash{y};
These have the added benefit of continuing to work if you need to work with more data.
I'm not the best at regular expressions, but this seems pretty close to what you're looking for:
/x=(.+) and y=([^ ]+)( and (.*))?/
Except you use $1, $2, and $4. In use:
my #strs = ("x=1 and y=abc and z=c4g and w=v4l",
"x=yes and y=no",
"z=nox and w=noy");
foreach (#strs) {
if ($_ =~ /x=(.+) and y=([^ ]+)( and (.*))?/) {
$x = $1;
$y = $2;
$remainder = $4;
print "x: $x; y: $y; remainder: $remainder\n";
} else {
print "Failed.\n";
}
}
Output:
x: 1; y: abc; remainder: z=c4g and w=v4l
x: yes; y: no; remainder:
Failed.
This of course leaves out plenty of error checking, and I don't know everything about your inputs, but this seems to work.
As a fairly simple modification to Rudd's version,
/^x=(.+) and y=([^ ]+)(?: and (.*))?/
will allow you to use $1, $2 and $3 (the ?: makes it a noncapturing group), and will ensure that the string starts with "x=" rather than allowing a "not_x=" to match
If you have better knowledge of what the x and y values will be, this should be used to tighten the regex further:
my #strs = ("x=1 and y=abc and z=c4g and w=v4l",
"x=yes and y=no",
"z=nox and w=noy",
"not-x=nox and y=present",
"x=yes and w='there is no and y=something arg here'");
foreach (#strs) {
if ($_ =~ /^x=(.+) and y=([^ ]+)(?: and (.*))?/) {
$x = $1;
$y = $2;
$remainder = $3;
print "x: {$x}; y: {$y}; remainder: {$remainder}\n";
} else {
print "$_ Failed.\n";
}
}
Output:
x: {1}; y: {abc}; remainder: {z=c4g and w=v4l}
x: {yes}; y: {no}; remainder: {}
z=nox and w=noy Failed.
not-x=nox and y=present Failed.
x: {yes and w='there is no}; y: {something}; remainder: {}
Note that the missing part of the last test is due to the current version of the y test requiring no spaces, if the x test had the same restriction that string would have failed.
Rudd and Cebjyre have gotten you most of the way there but they both have certain problems:
Rudd suggested:
/x=(.+) and y=([^ ]+)( and (.*))?/
Cebjyre modified it to:
/^x=(.+) and y=([^ ]+)(?: and (.*))?/
The second version is better because it will not confuse "not_x=foo" with "x=foo" but will accept things such as "x=foo z=bar y=baz" and set $1 = "foo z=bar" which is undesirable.
This is probably what you are looking for:
/^x=(\w+) and y=(\w+)(?: and (.*))?/
This disallows anything between the x= and y= options, places and allows and optional " and..." which will be in $3
Here's basically what I did to solve this:
($x_str, $y_str, $remainder) = split(/ and /, $str, 3);
if ($x_str !~ /x=(.*)/)
{
# error
}
$x = $1;
if ($y_str !~ /y=(.*)/)
{
# error
}
$y = $1;
I've omitted some additional validation and error handling. This technique works, but it's not as concise or pretty as I would have liked. I'm hoping someone will have a better suggestion for me.