I'm trying to munge a simple grammar with a perl regex (note this isn’t intended for production use, just a quick analysis for providing editor hints/completions). For instance,
my $GRAMMAR = qr{(?(DEFINE)
(?<expr> \( (?&expr) \) | (?&number) | (?&var) | (?&expr) (?&op) (?&expr) )
(?<number> \d++ )
(?<var> [a-z]++ )
(?<op> [-+*/] )
)}x;
I would like to be able to run this as
$expr =~ /$GRAMMAR(?&expr)/;
and then access all the variable names. However, according to perlre,
Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.
So apparently this is not possible. I could try using a (?{ code }) block to save variable names to a hash, but this doesn't respect backtracking (i.e. the assignment’s side effect persists even if the variable is backtracked past).
Is there any way to get everything captured by a given named capture group, including recursive matches? Or do I need to manually dig through the individual pieces (and thus duplicate all the patterns)?
The necessity of having to add capturing and backtracking machinery is one of the shortcomings that Regexp::Grammars addresses.
However, the grammar in your question is left-recursive, which neither Perl regexes nor a recursive-descent parser will parse.
Adapting your grammar to Regexp::Grammars and factoring out left-recursion produces
my $EXPR = do {
use Regexp::Grammars;
qr{
^ <Expr> $
<rule: Expr> <Term> <ExprTail>
| <Term>
<rule: Term> <Number>
| <Var>
| \( <MATCH=Expr> \)
<rule: ExprTail> <Op> <Expr>
<token: Op> \+ | \- | \* | \/
<token: Number> \d++
<token: Var> [a-z]++
}x;
};
Note that this simple grammar gives all operators equal precedence rather than Please Excuse My Dear Aunt Sally.
You want to extract all variable names, so you could walk the AST as in
sub all_variables {
my($root,$var) = #_;
$var ||= {};
++$var->{ $root->{Var} } if exists $root->{Var};
all_variables($_, $var) for grep ref $_, values %$root;
wantarray ? keys %$var : [ keys %$var ];
}
and print the result with
if ("(a + (b - c))" =~ $EXPR) {
print "[$_]\n" for sort +all_variables \%/;
}
else {
print "no match\n";
}
Another approach is to install an autoaction for the Var rule that records names of variables as they are successfully parsed.
package JustTheVarsMaam;
sub new { bless {}, shift }
sub Var {
my($self,$result) = #_;
++$self->{VARS}{$result};
$result;
}
sub all_variables { keys %{ $_[0]->{VARS} } }
1;
Call this one as in
my $vars = JustTheVarsMaam->new;
if ("(a + (b - c))" =~ $EXPR->with_actions($vars)) {
print "[$_]\n" for sort $vars->all_variables;
}
else {
print "no match\n";
}
Either way, the output is
[a]
[b]
[c]
Recursivity is native with Marpa::R2 using the BNF in the __DATA__ section below:
#!env perl
use strict;
use diagnostics;
use Marpa::R2;
my $input = shift || '(a + (b - c))';
my $grammar_source = do {local $/; <DATA>};
my $recognizer = Marpa::R2::Scanless::R->new
(
{
grammar => Marpa::R2::Scanless::G->new
(
{
source => \$grammar_source,
action_object => __PACKAGE__,
}
)
},
);
my %vars = ();
sub new { return bless {}, shift;}
sub varAction { ++$vars{$_[1]}};
$recognizer->read(\$input);
$recognizer->value() || die "No parse";
print join(', ', sort keys %vars) . "\n";
__DATA__
:start ::= expr
expr ::= NUMBER
| VAR action => varAction
| expr OP expr
| '(' expr ')'
NUMBER ~ [\d]+
VAR ~ [a-z]+
OP ~ [-+*/]
WS ~ [\s]+
:discard ~ WS
The output is:
a, b, c
Your question was adressing only how to get the variable names, so no notion of operator associativity and so on in this answer. Just note that Marpa has no problem with that, if needed.
Related
rencently I have met a strange bug when use a dynamic regular expressions in perl for Nesting brackets' match. The origin string is " {...test{...}...} ", I want to grep the pair brace begain with test, "test{...}". actually there are probably many pairs of brace before and end this group , I don't really know the deepth of them.
Following is my match scripts: nesting_parser.pl
#! /usr/bin/env perl
use Getopt::Long;
use Data::Dumper;
my %args = #ARGV;
if(exists$args{'-help'}) {printhelp();}
unless ($args{'-file'}) {printhelp();}
unless ($args{'-regex'}) {printhelp();}
my $OpenParents;
my $counts;
my $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;$counts++; print "\nLeft:".$OpenParents." ;"})
| \} (?(?{$OpenParents ne 0; $counts++}) (?{$OpenParents--;print "Right: ".$OpenParents." ;"})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
my $string = `cat $args{'-file'}`;
my $partten = $args{'-regex'} ;
print "####################################################\n";
print "Grep [$partten\{...\}] from $args{'-file'}\n";
print "####################################################\n";
while ($string =~ /($partten$NestedGuts)/xmgs){
print $1."}\n";
print $2."####\n";
}
print "Regex has seen $counts brackts\n";
sub printhelp{
print "Usage:\n";
print "\t./nesting_parser.pl -file [file] -regex '[regex expression]'\n";
print "\t[file] : file path\n";
print "\t[regex] : regex string\n";
exit;
}
Actually my regex is:
our $OpenParents;
our $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;})
| \} (?(?{$OpenParents ne 0}) (?{$OpenParents--})) (?(?{$OpenParents eq 0} (?!))
)*
)
}x;
I have add brace counts in nesting_parser.pl
I also write a string generator for debug: gen_nesting.pl
#! /usr/bin/env perl
use strict;
my $buffer = "{{{test{";
unless ($ARGV[0]) {print "Please specify the nest pair number!\n"; exit}
for (1..$ARGV[0]){
$buffer.= "\n\{\{\{\{$_\}\}\}\}";
#$buffer.= "\n\{\{\{\{\{\{\{\{\{$_\}\}\}\}\}\}\}\}\}";
}
$buffer .= "\n\}}}}";
open TEXT, ">log_$ARGV[0]";
print TEXT $buffer;
close TEXT;
You can generate a test file by
./gen_nesting.pl 1000
It will create a log file named log_1000, which include 1000 lines brace pairs
Now we test our match scripts:
./nesting_parser.pl -file log_1000 -regex "test" > debug_1000
debug_1000 looks like a great perfect result, matched successfully! But when I gen a 4000 lines test log file and match it again, it seem crashed:
./gen_nesting.pl 4000
./nesting_parser.pl -file log_4000 -regex "test" > debug_4000
The end of debug_4000 shows
{{{{3277}
####
Regex has seen 26213 brackts
I don't know what's wrong with the regex expresions, mostly it works well for paired brackets, untill recently I found it crashed when I try to match a text file more than 600,000 lines.
I'm really confused by this problems,
I really hope to solve this problem.
thank you all!
First for matching nested brackets I normally use Regexp::Common.
Next, I'm guessing that your problem is that Perl's regular expression engine breaks after matching 32767 groups. You can verify this by turning on warnings and looking for a message like Complex regular subexpression recursion limit (32766) exceeded.
If so, you can rewrite your code using /g and \G and pos. The idea being that you match the brackets in a loop like this untested code:
my $start = pos($string);
my $open_brackets = 0;
my $failed;
while (0 < $open_brackets or $start == pos($string)) {
if ($string =~ m/\G[^{}]*(\{|\})/g) {
if ($1 eq '{') {
$open_brackets++;
}
else {
$open_brackets--;
}
}
else {
$failed = 1;
break; # WE FAILED TO MATCH
}
}
if (not $failed and 0 == $open_brackets) {
my $matched = substr($string, $start, pos($string));
}
I want to learn how to create an abstract syntax tree for nested tuples using a Perl regexp with embedded code execution. I can easily program that using a Perl 6 grammar and I'm aware that using parsing modules would simplify the task in Perl 5, but I think for such simple tasks I should be able to do it without modules by learning how to mechanically translate from grammar definitions. I couldn't find a way to dereference $^R, so I try to undo the involuntary nesting at the end of the TUPLE rule definition, but the output is incorrect, e.g. some substrings appear twice.
use v5.10;
use Data::Dumper;
while (<DATA>) {
chomp;
/(?&TUPLE)(?{$a = $^R})
(?(DEFINE)
(?<TUPLE>
T \s (?&ELEM) \s (?&ELEM)
(?{ [$^R->[0][0],[$^R->[0][1],$^R[1]]] })
)
(?<ELEM>
(?: (a) (?{ [$^R,$^N] }) | \( (?&TUPLE) \) )
)
)/x;
say Dumper $a;
}
__DATA__
T a a
T (T a a) a
T a (T a a)
T (T a a) (T a a)
T (T (T a a) a) (T a (T a a))
Expected output data structure is a nested list:
['a','a'];
['a',['a','a']];
[['a','a'],'a'];
[['a','a'],['a','a']];
[[['a','a'],'a'],['a',['a','a']]]
For reference I'll also share my working Perl 6 code:
grammar Tuple {
token TOP { 'T ' <elem> ' ' <elem> }
token elem { 'a' | '(' <TOP> ')'}
}
class Actions {
method TOP($/) {make ($<elem>[0].made, $<elem>[1].made)}
method elem($/) {make $<TOP> ?? $<TOP>.made !! 'a'}
}
Trying to figure out how to use (?{ ... }) constructs is almost always not worth the effort. In particular, this can have unexpected behaviour together with backtracking. It is also very difficult to debug such regexes since the control flow tends to be non-obvious.
Instead, it tends to be easier to do write an ad-hoc recursive descent parser with m//gc-style lexing: Each Perl string stores its last match offset. When applying a regex with m/\G ... /gc in scalar context, it can anchor at the last offset and advances the offset iff the match succeeds.
Here:
use strict;
use warnings;
use Test::More;
sub parse {
my ($str) = #_;
pos($str) = 0; # set match position to beginning
return parse_tuple(\$str);
}
sub parse_tuple {
my ($ref) = #_;
$$ref =~ /\G T \s/gcx or die error($ref, "expected tuple start T");
my $car = parse_element($ref);
$$ref =~ /\G \s /gcx or die error($ref, "expected space between tuple elements");
my $cdr = parse_element($ref);
return [$car, $cdr];
}
sub parse_element {
my ($ref) = #_;
return 'a' if $$ref =~ /\G a /gcx;
$$ref =~ /\G \( /gcx or die error($ref, "expected opening paren for nested tuple");
my $tuple = parse_tuple($ref);
$$ref =~ /\G \) /gcx or die error($ref, "expected closing paren after nested tuple");
return $tuple;
}
sub error {
my ($ref, $msg) = #_;
my $snippet = substr $$ref, pos($$ref), 20;
return "$msg just before '$snippet...'";
}
is_deeply parse('T a a'), ['a','a'];
is_deeply parse('T (T a a) a'), [['a','a'],'a'];
is_deeply parse('T a (T a a)'), ['a',['a','a']];
is_deeply parse('T (T a a) (T a a)'), [['a','a'],['a','a']];
is_deeply parse('T (T (T a a) a) (T a (T a a))'), [[['a','a'],'a'],['a',['a','a']]];
done_testing;
I fixed the code in my question. Turns out I accidentally wrote $^R[1] instead of $^R->[1]. So now I understand why amon said that these constructs are hard to debug ;-)
use v5.10;
use Data::Dumper;
while (<DATA>) {
chomp;
/(?&TUPLE)(?{$a = $^R->[1]})
(?(DEFINE)
(?<TUPLE>
T \s (?&ELEM) \s (?&ELEM)
(?{ [$^R->[0][0],[$^R->[0][1],$^R->[1]]] })
)
(?<ELEM>
(?: (a) (?{ [$^R,$^N] }) | \( (?&TUPLE) \) )
)
)/x;
say Dumper $a;
}
__DATA__
T a a
T (T a a) a
T a (T a a)
T (T a a) (T a a)
T (T (T a a) a) (T a (T a a))
I am trying to match different logic expression, such as: "$a and $b" using Perl regex, here is my code:
$input =~ /^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$/ {
$arg1=$1;
$arg2=$3;
$opt=$2;
}
and my purpose is to get:
$arg1="$ARGV[0]=~/\w{4}/"
$arg2="$num_arg==1"
$opt ="and"
I want to get the exact value matched in the or expression. I don't want to do the same thing for all the cases to match one by one, and hardcode the operator.
Does anyone know how to solve the problem?
This code works for me:
$input = '$ARGV[0]=~/\w{4}/ and $num_arg==1';
if ($input=~/^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$/) {
$arg1=$1;
$arg2=$3;
$opt=$2;
print "$arg1\n$arg2\n$opt\n";
}
You need a little parser able to reveal the structure of a logical expression. That is because you may have another expression inside a term. You can use perl to test your grammar using Marpa::R2 package.
As a first attempt I would write:
<expression> ::= <term> | <expression> <binary-op> <term>
<term> ::= <factor> <binary-op> <factor> | <unary-op><factor>
<factor> ::= <id>
<binary-op> ::= (and|or|==|<|>|>=|<=)
<unary-op> ::= (not | ! )
One thing for sure that you can't complete describe the syntax of a logical expression using only regular expressions, it will always lack some valid case.
The Perl Code for validation
use Modern::Perl;
use Marpa::R2;
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Expression ::= Term
| Expression BinaryOP Term
Term ::= Factor BinaryOP Factor
| UnaryOP Factor
Factor ::= ID
ID ~ [\w]+
BinaryOP ~ 'and' | 'or' | '==' | '<' | '>' | '>=' | '<='
UnaryOP ~ 'not' | '!'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL
# your input
my $input = 'a and b or !c';
# your parser
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
# process input
my $recce = Marpa::R2::Scanless::R->new(
{ grammar => $grammar, semantics_package => 'My_Actions' } );
my $length_read = $recce->read( \$input );
die "Read ended after $length_read of ", length $input, " characters"
if $length_read != length $input;
I'd like to have a regular expression to match a separated values with some protected values that can contain the separator character.
For instance:
"A,B,{C,D,E},F"
would give:
"A"
"B"
"{C,D,E}"
"F"
Please note the protected values can be nested, as follows:
"A,B,{C,D,{E,F}},G"
would give:
"A"
"B"
"{C,D,{E,F}}"
"G"
I already coded that feature with a character iteration as follow:
sub Parse
{
my #item;
my $curly;
my $string;
foreach(split //)
{
$_ eq "{" and ++$curly;
$_ eq "}" and --$curly;
if(!$curly && /[,:]/)
{
push #item, $string;
undef $string;
next;
}
$string .= $_;
}
push #item, $string;
return #item;
}
But it would definitively be so much nicer with a regexp.
A regex that supports nesting would look as follows:
my #items;
push #items, $1 while
/
(?: ^ | \G , )
(
(?: [^,{}]+
| (
\{
(?: [^{}]
| (?2)
)*
\}
)
| # Empty
)
)
/xg;
$ perl -E'$_ = shift; ... say for #items;' 'A,B,{C,D,{E,F}},G'
A
B
{C,D,{E,F}}
G
Assumes valid input since it can't extract and validate at the same time. (Well, not without making things really messy.)
Improved from nhahtdh's answer.
$_ = "A,B,{C,D,E},F";
while ( m/(\{.*?\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "[$&]\n";
}
Improved it again. Please look at this one!
$_ = "A,B,{C,D,{E,F}},G";
while ( m/(\{.*\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "$&\n";
}
It will get:
A
B
{C,D,{E,F}}
G
$a = "A,B,{C,D,E},F";
while ($a =~ s/(\{[\{\}\w,]+\}|\w)//) {
push (#res, $1);
}
print "\#res: #res\n"
Result:
#res: A B {C,D,E} F
Explanation : we try to match either the protected block \{[\{\}\w,]+\} or just a single character \w successively in a loop, deleting it from the original string if there is a match. Every time there is a match, we store it (meaning the $1) in the array, et voilà!
Here is a regex in bash:
chronos#localhost / $ echo "A,B,{C,D,E},F" | grep -oE "(\{[^\}]*\}|[A-Z])"
A
B
{C,D,E}
F
Try this regex. Use the regex to match and extract the token.
/(\{.*?\}|(?<=,|^).*?(?=,|$))/
I have not tested this code in Perl.
There is an assumption about on how the regex engine works here (I assume that it will try to match the first part \{.*?\} before the second part). I also assume that there are no nested curly bracket, and badly paired curly brackets.
$s = "A,B,{C,D,E},F";
#t = split /,(?=.*{)|,(?!.*})/, $s;
I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8 and could someone resolve this?
Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
#genes = $l =~ /($bp)/g;
There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;
I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my #out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push #out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)
To match balanced parenthesis or curly brackets, and if you want to take under account backslashed (escaped) ones, the proposed solutions would not work. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;
Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my #arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\#arr);
Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".