Parsing nested tuples using perl regexp with $^R - regex

I want to learn how to create an abstract syntax tree for nested tuples using a Perl regexp with embedded code execution. I can easily program that using a Perl 6 grammar and I'm aware that using parsing modules would simplify the task in Perl 5, but I think for such simple tasks I should be able to do it without modules by learning how to mechanically translate from grammar definitions. I couldn't find a way to dereference $^R, so I try to undo the involuntary nesting at the end of the TUPLE rule definition, but the output is incorrect, e.g. some substrings appear twice.
use v5.10;
use Data::Dumper;
while (<DATA>) {
chomp;
/(?&TUPLE)(?{$a = $^R})
(?(DEFINE)
(?<TUPLE>
T \s (?&ELEM) \s (?&ELEM)
(?{ [$^R->[0][0],[$^R->[0][1],$^R[1]]] })
)
(?<ELEM>
(?: (a) (?{ [$^R,$^N] }) | \( (?&TUPLE) \) )
)
)/x;
say Dumper $a;
}
__DATA__
T a a
T (T a a) a
T a (T a a)
T (T a a) (T a a)
T (T (T a a) a) (T a (T a a))
Expected output data structure is a nested list:
['a','a'];
['a',['a','a']];
[['a','a'],'a'];
[['a','a'],['a','a']];
[[['a','a'],'a'],['a',['a','a']]]
For reference I'll also share my working Perl 6 code:
grammar Tuple {
token TOP { 'T ' <elem> ' ' <elem> }
token elem { 'a' | '(' <TOP> ')'}
}
class Actions {
method TOP($/) {make ($<elem>[0].made, $<elem>[1].made)}
method elem($/) {make $<TOP> ?? $<TOP>.made !! 'a'}
}

Trying to figure out how to use (?{ ... }) constructs is almost always not worth the effort. In particular, this can have unexpected behaviour together with backtracking. It is also very difficult to debug such regexes since the control flow tends to be non-obvious.
Instead, it tends to be easier to do write an ad-hoc recursive descent parser with m//gc-style lexing: Each Perl string stores its last match offset. When applying a regex with m/\G ... /gc in scalar context, it can anchor at the last offset and advances the offset iff the match succeeds.
Here:
use strict;
use warnings;
use Test::More;
sub parse {
my ($str) = #_;
pos($str) = 0; # set match position to beginning
return parse_tuple(\$str);
}
sub parse_tuple {
my ($ref) = #_;
$$ref =~ /\G T \s/gcx or die error($ref, "expected tuple start T");
my $car = parse_element($ref);
$$ref =~ /\G \s /gcx or die error($ref, "expected space between tuple elements");
my $cdr = parse_element($ref);
return [$car, $cdr];
}
sub parse_element {
my ($ref) = #_;
return 'a' if $$ref =~ /\G a /gcx;
$$ref =~ /\G \( /gcx or die error($ref, "expected opening paren for nested tuple");
my $tuple = parse_tuple($ref);
$$ref =~ /\G \) /gcx or die error($ref, "expected closing paren after nested tuple");
return $tuple;
}
sub error {
my ($ref, $msg) = #_;
my $snippet = substr $$ref, pos($$ref), 20;
return "$msg just before '$snippet...'";
}
is_deeply parse('T a a'), ['a','a'];
is_deeply parse('T (T a a) a'), [['a','a'],'a'];
is_deeply parse('T a (T a a)'), ['a',['a','a']];
is_deeply parse('T (T a a) (T a a)'), [['a','a'],['a','a']];
is_deeply parse('T (T (T a a) a) (T a (T a a))'), [[['a','a'],'a'],['a',['a','a']]];
done_testing;

I fixed the code in my question. Turns out I accidentally wrote $^R[1] instead of $^R->[1]. So now I understand why amon said that these constructs are hard to debug ;-)
use v5.10;
use Data::Dumper;
while (<DATA>) {
chomp;
/(?&TUPLE)(?{$a = $^R->[1]})
(?(DEFINE)
(?<TUPLE>
T \s (?&ELEM) \s (?&ELEM)
(?{ [$^R->[0][0],[$^R->[0][1],$^R->[1]]] })
)
(?<ELEM>
(?: (a) (?{ [$^R,$^N] }) | \( (?&TUPLE) \) )
)
)/x;
say Dumper $a;
}
__DATA__
T a a
T (T a a) a
T a (T a a)
T (T a a) (T a a)
T (T (T a a) a) (T a (T a a))

Related

multi replace in postgresql query using perl

I'm cleaning some text directly in my query, and rather than using nested replace functions, I found this bit of code that uses perl to perform multiple replacements at once: multi-replace with perl
CREATE FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text
AS $BODY$
my ($string, $orig, $repl) = #_;
my %subs;
if (#$orig != #$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref #$orig[0] eq 'ARRAY' || ref #$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
#subs{#$orig} = #$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } keys %subs;
$re = qr/($re)/;
$string =~ s/$re/$subs{$1}/g;
return $string;
$BODY$ language plperl strict immutable;
Example query:
select
name as original_name,
multi_replace(name, '{-,&,LLC$}', '{_,and,business}') as cleaned_name
from some_table
The function finds the pattern LLC at the end of the name string but removes it instead of replacing it with "business."
How can I make this work as intended?
When the strings in #$orig are to be matched literally, I'd actually use this:
my ($string, $orig, $repl) = #_;
# Argument checks here.
my %subs; #subs{ #$orig } = #$repl;
my $pat =
join "|",
map quotemeta,
sort { length($b) <=> length($a) }
#$orig;
return $string =~ s/$re/$subs{$&}/gr;
In particular, map quotemeta, was missing.
(By the way, the sort line isn't needed if you ensure that xy comes before x in #$orig when you want to replace both x(?!y) and xy.)
But you want the strings in #$orig to be treated as regex patterns. For that, you can use the following:
# IMPORTANT! Only provide strings from trusted sources in
# `#$orig` as it allows execution of arbitrary Perl code.
my ($string, $orig, $repl) = #_;
# Argument checks here.
my $re =
join "|",
map "(?:$orig->[$_])(?{ $_ })",
0..$#$orig;
{
use re qw( eval );
$re = qr/$re/;
}
return $string =~ s/$re/$repl->[$^R]/gr;
However, in your environment, I have doubts about the availability of use re qw( eval ); and (?{ }), so the above may be an unviable solution for you.
my ($string, $orig, $repl) = #_;
# Argument checks here.
my $re =
join "|",
map "(?<_$_>$orig->[$_])",
0..$#$orig;
$re = qr/$re/;
return
$string =~ s{$re}{
my ( $n ) =
map substr( $_, 1 ),
grep { $-{$_} && defined( $-{$_}[0] ) }
grep { /^_\d+\z/aa }
keys( %- );
$repl->[$n]
}egr;
While the regexp tests for LLC$ with the special meaning of the $, what gets captured into $1 is just the string LLC and so it doesn't find the look-up value to replace.
If the only thing you care about is $, then you could fix it by changing the map-building lines to:
#subs{map {my $t=$_; $t=~s/\$$//; $t} #$orig} = #$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } #$orig;
But it will be very hard to make it work more generally for every possible feature of regex.
The purpose of this plperl function in the linked blog post is to find/replace strings, not regular expressions. LLC being found with LLC$ as a search term does not happen in the original code, as the search terms go through quotemeta before being included into $re (as also sugggested in ikegami's answer)
The effect of removing the quotemeta transformation is that LLC at the end of a string is matched, but since as a key it's not found in $subs (because the key there isLLC$), then it's getting replaced by an empty string.
So how to make this work with regular expressions in the orig parameter?
The solution proposed by #ikegami does not seem usable from plperl, as it complains with this error: Unable to load re.pm into plperl.
I thought of an alternative implementation without the (?{ code }) feature: each match from the main alternation regexp can be rechecked against each regexp in orig, in a code block run with /ge. On the first match, the corresponding string in repl is selected as the replacement.
Code:
CREATE or replace FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text AS
$BODY$
my ($string, $orig, $repl) = #_;
my %subs;
if (#$orig != #$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref #$orig[0] eq 'ARRAY' || ref #$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
#subs{#$orig} = #$repl;
my $re = join "|", keys %subs;
$re = qr/($re)/;
# on each match, recheck the match individually against each regexp
# to find the corresponding replacement string
$string =~ s/$re/{ my $r; foreach (#$orig) { if ($1 =~ $_) {$r=$subs{$_}; last;} } $r;}/ge;
return $string;
$BODY$ language plperl strict immutable;
Test
=> select pg_temp.multi_replace(
'bar foo - bar & LLC',
'{^bar,-,&,LLC$}',
'{baz,_,and,business}'
);
multi_replace
----------------------------
baz foo _ bar and business

Recursive pairwise braces matching in Perl

Let's consider this piece of code which does not belong to any known language:
foo() {
bar() {
bar();
}
baz() {
// Content baz
qux() {
// Content qux
}
}
}
I would like to iteratively process each function by calling a subroutine which receives: the function name, the arguments, the indentation level and the content.
So far I have written this:
#!/usr/bin/env perl
use 5.010;
$_ = do {local $/; <>};
s/([{}])/$1.($1 eq '{'?++$i:$i--).$1/eg;
parse($_);
sub parse {
local $_ = shift;
while (/(?<name>\w+)\s*\((?<args>.*?)\)\s*\{(\d)\{(?<content>.*?)\}(?<level>\3)\}/gs) {
parse($+{content});
process($+{content}, $+{args}, $+{level}, $+{content});
}
}
sub process {
my ($name, $args, $level, $content) = #_;
#...
}
The tricky idea is to replace in-place each matched brace { with an indentation number. So this:
{
{
}
}
will become this:
{1{
{2{
}2}
}1}
It allows to easily write the parsing regex which simply become:
qr/
\w+ # name
\s* \(.*?\) # arguments
\s* \{(\d)\{ # opening brace
.*? # content
\s* \}\1\} # closing brace
/x
How can I rewrite this without this trick?
Note that the choice of {1{ could be anything else like {(1), {1-, {[1] or even {1✈
You could try using recursive regular expressions. For example:
/(\{(?:[^{}]++|(?1))*\})/
will match a group of balanced braces. For more information, see
Can I use Perl regular expressions to match balanced text? in perlfaq6,
Perl documentation "Extended patterns" in perlre, and
extract_bracketed in Text::Balanced.
You will have a hard time doing this with a recursive regex because you can only get the last value of each capturing group and will lose all of the intervening values. This task is more suited to a parser like Parse::RecDescent:
use strict;
use warnings;
use 5.010;
use Parse::RecDescent;
sub process {
my ($name, $args, $depth) = #_;
say "$depth - $name($args)";
}
my $grammar = q{
{ my $indent = 0; }
startrule : expression(s /;/)
expression : function_call
| function_def[$indent]
function_call : identifier '(' arglist ')' ';'
function_def : identifier '(' arglist ')' '{' expression[ $arg[0]++ ](s?) '}'
{ main::process( $item{identifier}, join(',', #$item{arglist}), $arg[0] ) }
arglist : identifier(s? /,/)
identifier : /\w+/
};
# Tell parser to ignore spaces and C99 one-line comments
my $skip_spaces_and_comments = qr{ (?: \s+ | // .*?$ )* }mxs;
$Parse::RecDescent::skip = $skip_spaces_and_comments;
my $parser = Parse::RecDescent->new($grammar) or die 'Bad grammar';
my $text = do { local $/; <DATA> };
defined $parser->startrule($text) or die 'Failed to parse';
__DATA__
foo() {
bar() {
bar();
}
baz() {
// Content baz
qux() {
// Content qux
}
}
}
Output:
2 - bar()
1 - qux()
2 - baz()
3 - foo()
Note that the depths are inverted (1 is the most nested) and this doesn't return the contents of the function definitions, but it should get you started.

Perl all matches of a regexp in a given string

Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.
local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;

How to access groups captured by recursive perl regexes?

I'm trying to munge a simple grammar with a perl regex (note this isn’t intended for production use, just a quick analysis for providing editor hints/completions). For instance,
my $GRAMMAR = qr{(?(DEFINE)
(?<expr> \( (?&expr) \) | (?&number) | (?&var) | (?&expr) (?&op) (?&expr) )
(?<number> \d++ )
(?<var> [a-z]++ )
(?<op> [-+*/] )
)}x;
I would like to be able to run this as
$expr =~ /$GRAMMAR(?&expr)/;
and then access all the variable names. However, according to perlre,
Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.
So apparently this is not possible. I could try using a (?{ code }) block to save variable names to a hash, but this doesn't respect backtracking (i.e. the assignment’s side effect persists even if the variable is backtracked past).
Is there any way to get everything captured by a given named capture group, including recursive matches? Or do I need to manually dig through the individual pieces (and thus duplicate all the patterns)?
The necessity of having to add capturing and backtracking machinery is one of the shortcomings that Regexp::Grammars addresses.
However, the grammar in your question is left-recursive, which neither Perl regexes nor a recursive-descent parser will parse.
Adapting your grammar to Regexp::Grammars and factoring out left-recursion produces
my $EXPR = do {
use Regexp::Grammars;
qr{
^ <Expr> $
<rule: Expr> <Term> <ExprTail>
| <Term>
<rule: Term> <Number>
| <Var>
| \( <MATCH=Expr> \)
<rule: ExprTail> <Op> <Expr>
<token: Op> \+ | \- | \* | \/
<token: Number> \d++
<token: Var> [a-z]++
}x;
};
Note that this simple grammar gives all operators equal precedence rather than Please Excuse My Dear Aunt Sally.
You want to extract all variable names, so you could walk the AST as in
sub all_variables {
my($root,$var) = #_;
$var ||= {};
++$var->{ $root->{Var} } if exists $root->{Var};
all_variables($_, $var) for grep ref $_, values %$root;
wantarray ? keys %$var : [ keys %$var ];
}
and print the result with
if ("(a + (b - c))" =~ $EXPR) {
print "[$_]\n" for sort +all_variables \%/;
}
else {
print "no match\n";
}
Another approach is to install an autoaction for the Var rule that records names of variables as they are successfully parsed.
package JustTheVarsMaam;
sub new { bless {}, shift }
sub Var {
my($self,$result) = #_;
++$self->{VARS}{$result};
$result;
}
sub all_variables { keys %{ $_[0]->{VARS} } }
1;
Call this one as in
my $vars = JustTheVarsMaam->new;
if ("(a + (b - c))" =~ $EXPR->with_actions($vars)) {
print "[$_]\n" for sort $vars->all_variables;
}
else {
print "no match\n";
}
Either way, the output is
[a]
[b]
[c]
Recursivity is native with Marpa::R2 using the BNF in the __DATA__ section below:
#!env perl
use strict;
use diagnostics;
use Marpa::R2;
my $input = shift || '(a + (b - c))';
my $grammar_source = do {local $/; <DATA>};
my $recognizer = Marpa::R2::Scanless::R->new
(
{
grammar => Marpa::R2::Scanless::G->new
(
{
source => \$grammar_source,
action_object => __PACKAGE__,
}
)
},
);
my %vars = ();
sub new { return bless {}, shift;}
sub varAction { ++$vars{$_[1]}};
$recognizer->read(\$input);
$recognizer->value() || die "No parse";
print join(', ', sort keys %vars) . "\n";
__DATA__
:start ::= expr
expr ::= NUMBER
| VAR action => varAction
| expr OP expr
| '(' expr ')'
NUMBER ~ [\d]+
VAR ~ [a-z]+
OP ~ [-+*/]
WS ~ [\s]+
:discard ~ WS
The output is:
a, b, c
Your question was adressing only how to get the variable names, so no notion of operator associativity and so on in this answer. Just note that Marpa has no problem with that, if needed.

Matching balanced parenthesis in Perl regex

I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8 and could someone resolve this?
Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
#genes = $l =~ /($bp)/g;
There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;
I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my #out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push #out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)
To match balanced parenthesis or curly brackets, and if you want to take under account backslashed (escaped) ones, the proposed solutions would not work. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;
Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my #arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\#arr);
Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".