Rewriting a recursive regex for older Perl version - regex

The following piece of code works just fine with Perl (v5.16.2). However, when I run it using Perl v5.8.9, it complains about the following regex. How can I rewrite this regex in a way that works with Perl v5.8.9. (I can't update the version).
REGEX:
use strict;
use warnings;
our %formula_per_k;
INIT {
# List all functions that you want to allow in formulas. All other words will be interpretted as variables.
my #FORMULA_FUNCS = qw(sqrt exp log);
# Load the data via a file.
my $data = do {local $/; <DATA>};
# Parse K blocks
while ($data =~ m{
^K \s+ (\w+) \s* \{
( (?: [^{}]+ | \{(?2)\} )* ) # Matched braces only.
\}
}mgx) {
my ($name, $params) = ($1, $2);
# Parse LOL block
next if $params !~ m{
LOL \s* \{
( (?: [^{}]+ | \{(?1)\} )*? ) # Matched braces only.
\}
}mx;
my $lol = $1;
# Start building anonymous subroutine
my $conditions = '';
# Parse Conditions and Formulas
while ($lol =~ m{
COND \s* \{ (.*?) \} \s*
FORMULA \s* \{ (.*?) \}
}gx) {
my ($cond, $formula) = ($1, $2);
# Remove Excess spacing and translate variable into perl scalar.
for ($cond, $formula) {
s/^\s+|\s+$//g;
s{([a-zA-Z]+)}{
my $var = $1;
$var = "\$hashref->{$var}" if ! grep {$var eq $_} #FORMULA_FUNCS;
$var
}eg;
}
$conditions .= "return $formula if $cond; ";
}
my $code = "sub {my \$hashref = shift; ${conditions} return; }";
my $sub = eval $code;
if ($#) {
die "Invalid formulas in $name: $#";
}
$formula_per_k{$name} = $sub;
}
}
sub formula_per_k {
my ($k, $vars) = #_;
die "Unrecognized K value '$k'" if ! exists $formula_per_k{$k};
return $formula_per_k{$k}($vars);
}
print "'K1', {d => .1} = " . formula_per_k('K1', {d => .1}) . "\n";
print "'K1', {d => .05} = " . formula_per_k('K1', {d => .05}) . "\n";
print "'K3', {d => .02} = " . formula_per_k('K3', {d => .02}) . "\n";
print "'K3', {d => .021} = " . formula_per_k('K3', {d => .021}) . "\n";
__DATA__
... #OTHER STUFFS
K K1 {
LOL {
COND { d < 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d) }
COND { d >= 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d+0.4) }
}
}
... #OTHER STUFFS
K K2 {
LOL {
COND { d < 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d) }
COND { d >= 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d+0.8) }
}
}
... #OTHER STUFFS
K K3 {
LOL {
COND { d < 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d) }
COND { d >= 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d+0.3) }
}
}
... #OTHER STUFF
Outputs:
'K1', {d => .1} = 2.13345237791561
'K1', {d => .05} = 2.01370729772479
'K3', {d => .02} = -4.13029437251523
'K3', {d => .021} = -4.13002941430942
ERROR:
Sequence (?1...) not recognized in regex; marked by <-- HERE in m/
^K \s+ M3 \s* {
( (?: [^{}]+ | {(?2 <-- HERE )} )* ) # Matched braces only.
}
/ at ./code.pl line 215, <RFILE> line 12.
UPDATE:
Code is updated.
This was originally suggested by https://stackoverflow.com/users/1733163/miller

Before the introduction of (?PARNO), we had to use (??{ code }) to create recursive regular expressions. An example can be found in perlre - Extended Patterns.
The following is tested on v5.16.2, v5.20.0, and locally on a v5.8.9 perlbrew:
our $braces_re;
$braces_re = qr{
\{
(?:
(?> [^{}]+ )
|
(??{ $braces_re })
)*
\}
}sx;
# parse FOO block
while (
$data =~ m{
^FOO \s+ (\w+) \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )* ) # Matched braces only.
\}
}mgx
)
{
my $params = $1;
# parse BAR block
next if $params !~ m{
BAR \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )*? ) # Matched braces only.
\}
}mx;
# SOME CODE
}
Note, I intentionally separated out the declaration of the _re variable and its initialization. There are some versions of perl that will let you declare a recursive regular expression in the same statement as the initialization, but v5.8.9 is not one of them.
Also, if you're comfortable altering your original regex more than just dropping in a replacement for (?PARNO) notation, then the above can be reduced to the following. Also confirmed on v5.16.2:
my $braces_re;
$braces_re = qr{
(?:
(?> [^{}]+ )
| # The following is a "postponed" regular subexpression.
\{ (??{ $braces_re }) \} # Deferred execution enables recursive regex
)*
}sx;
# parse FOO block
while ( $data =~ m{^FOO \s+ (\w+) \s* \{ ( $braces_re ) \} }mgx ) {
my $params = $1;
# parse BAR block
next if $params !~ m{BAR \s* \{ ( $braces_re ) \}}mx;
# SOME CODE
}

Related

Substitute: replacement evaluation

Is the first and the second substitution equivalent if the replacement is passed in a variable?
#!/usr/bin/env perl6
use v6;
my $foo = 'switch';
my $t1 = my $t2 = my $t3 = my $t4 = 'this has a $foo in it';
my $replace = prompt( ':' ); # $0
$t1.=subst( / ( \$ \w+ ) /, $replace );
$t2.=subst( / ( \$ \w+ ) /, { $replace } );
$t3.=subst( / ( \$ \w+ ) /, { $replace.EVAL } );
$t4.=subst( / ( \$ \w+ ) /, { ( $replace.EVAL ).EVAL } );
say "T1 : $t1";
say "T2 : $t2";
say "T3 : $t3";
say "T4 : $t4";
# T1 : this has a $0 in it
# T2 : this has a $0 in it
# T3 : this has a $foo in it
# T4 : this has a switch in it
The only difference between $replace and {$replace} is that the second is a block that returns the value of the variable. It's only adding a level of indirection, but the result is the same.
Update: Edited according to #raiph's comments.

Regex parsing of DNS answer

I would like to parse the following lines
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
and get the following output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11
I succeeded in parsing the first two fields with command
sed -Eun 's/[^>]+> ([0-9.]+)\.[0-9]+:.+q: A\? ([a-z0-9.-]+)\.([^:]+).*/\1:\2:\3/pg
`
But I cannot get the resolved IPs (A xx.xx.xx.xx). In fact there may be several.
Would it be possible to get such output using sed or Perl ?
EDIT:
As I added in comments, parsing of a larger input sample, I also require several lines to be discarded in the output. This lines can be characterized by :
the number of A records ("A xx.xx.xx.xx") is non null
or the line must not contains NXDomain\*?-
I succeed in meeting the new first need, but not for the second.
Following the #ikegami reply, here is my attempt:
perl -nle '
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $rest) = /^ \s+ ( [^:]++ ) : \s++ $field_value_re ( .* ) /sx
or next;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my #answers = split(/\s*+,\s*+/, $answers);
my ($afield) = join " ", map { /^\S++\s++A\s++(\S++)/ } #answers;
if ( length($afield) != 0)
{
print join " ", $ip, $query, $afield;
}
' dns.sample
This does as you ask with the sample data
I first build a regex pattern $url_re that matches numeric URLs to make the following code more concise. Then I search for the first URL immediately after >, the named URL right after A?, and all of the following URLs which are preceded by A
They are all stored in array #urls and printed
use strict;
use warnings 'all';
use 5.010;
my $url_re = qr/(?:\d+\.){3}\d+/;
while ( <DATA> ) {
my #urls = ( />\s+($url_re)/, /A\?\s+([-\w.]+\w)/, /(A\s+$url_re)/g );
say "#urls";
}
__DATA__
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254 A 4.26.235.155
125.15.15.9 www.toto.net A 120.33.1.11
Each line appears to be of the form
{"id" with spaces}: {stuff} [ {key}: {stuff} ]*
You appear to be interested in information inside the "id", and inside the field named q. The value of the q field appears to be of the form
A? {word} {word} {ns_return} [, {ns_return} ]*
Here's a robust solution that handles the format described above.
perl -nle'
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $id_val, $rest) = /^ ( [^:]++ ) : \s++ ( $field_value_re ) ( .* ) /sx
or next;
next if $id_val =~ /\bNXDomain\b/;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my #answers =
map { /^\S++\s++A\s++(\S++)/ }
split(/\s*+,\s*+/, $answers);
next if !#answers;
print join " ", $ip, $query, map { "A $_" } #answers;
' log
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11
This prints the desired output by using the map function in a somewhat unorthodox way to ignore any fields after q:
perl -lne 'print join qq/\t/, m/> ([\d\.]+)\./, map {/A\? ([^\s]+)\./, /(A [\d\.]+)/g} / q:([^:]+)/' log.txt

perl stream file for regex token including scanned tokens

I am trying to stream a file in perl and tokenize the lines and include the tokens.
I have:
while( $line =~ /([\/][\d]*[%].*?[%][\d]*[\/]|[^\s]+|[\s]+)/g ) {
my $word = $1;
#...
}
But it doesn't work when there's no spaces in the token.
For example, if my line is:
$line = '/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/.'
I would like to split that line into:
$output =
[
'/15%one (1)(2)%15/',
' ',
'is',
' ',
'a',
'/%good (1)%/',
' ',
'+',
'/%number(2)%/',
'.'
]
What is the best way to do this?
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
)
}sxg;
but that doesn't validate. If you want to validate, you could use
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
| \z (*COMMIT) (*FAIL)
| (?{ die "Syntax error" })
)
}sxg;
The following also validates, but it's a bit more readable and makes it easy to differentiate the token types.:
my #tokens;
for ($line) {
m{\G ( \s+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( ([\/])([0-9]*)% (?: (?! %\3\2 ). )* %\3\2 ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G \z }sxgc
&& last;
die "Syntax error";
}
pos will get you information about where the error occurred.

Regex with recursive expression to match nested braces?

I'm trying to match text like sp { ...{...}... }, where the curly braces are allowed to nest. This is what I have so far:
my $regex = qr/
( #save $1
sp\s+ #start Soar production
( #save $2
\{ #opening brace
[^{}]* #anything but braces
\} #closing brace
| (?1) #or nested braces
)+ #0 or more
)
/x;
I just cannot get it to match the following text: sp { { word } }. Can anyone see what is wrong with my regex?
There are numerous problems. The recursive bit should be:
(
(?: \{ (?-1) \}
| [^{}]+
)*
)
All together:
my $regex = qr/
sp\s+
\{
(
(?: \{ (?-1) \}
| [^{}]++
)*
)
\}
/x;
print "$1\n" if 'sp { { word } }' =~ /($regex)/;
This is case for the underused Text::Balanced, a very handy core module for this kind of thing. It does rely on the pos of the start of the delimited sequence being found/set first, so I typically invoke it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';
sub get_bracketed {
my $str = shift;
# seek to beginning of bracket
return undef unless $str =~ /(sp\s+)(?={)/gc;
# store the prefix
my $prefix = $1;
# get everything from the start brace to the matching end brace
my ($bracketed) = extract_bracketed( $str, '{}');
# no closing brace found
return undef unless $bracketed;
# return the whole match
return $prefix . $bracketed;
}
my $str = 'sp { { word } }';
print get_bracketed $str;
The regex with the gc modifier tells the string to remember where the end point of the match is, and extract_bracketed uses that information to know where to start.

Parsing YAML-like text file into hash structure

I've got the text file:
country = {
tag = ENG
ai = {
flags = { }
combat = { ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD }
continent = { "Oceania" }
area = { "America" "Maine" "Georgia" "Newfoundland" "Cuba" "Bengal" "Carnatic" "Ceylon" "Tanganyika" "The Mascarenes" "The Cape" "Gold" "St Helena" "Guiana" "Falklands" "Bermuda" "Oregon" }
region = { "North America" "Carribean" "India" }
war = 50
ferocity = no
}
date = { year = 0 month = january day = 0 }
}
What I'm trying to do is to parse this text into perl hash structure, so that the output after data dump looks like this:
$VAR1 = {
'country' => {
'ai' => {
'area' => [
'America',
'Maine',
'Georgia',
'Newfoundland',
'Cuba',
'Bengal',
'Carnatic',
'Ceylon',
'Tanganyika',
'The Mascarenes',
'The Cape',
'Gold',
'St Helena',
'Guiana',
'Falklands',
'Bermuda',
'Oregon'
],
'combat' => [
'ROY',
'WLS',
'PUR',
'SCO',
'EIR',
'FRA',
'DEL',
'USA',
'QUE',
'BGL',
'MAH',
'MOG',
'VIJ',
'MYS',
'DLH',
'GUJ',
'ORI',
'JAI',
'ASS',
'MLC',
'MYA',
'ARK',
'PEG',
'TAU',
'HYD'
],
'continent' => [
'Oceania'
],
'ferocity' => 'no',
'flags' => [],
'region' => [
'North America',
'Carribean',
'India'
],
'war' => 50
},
'date' => {
'day' => 0,
'month' => 'january',
'year' => 0
},
'tag' => 'ENG'
}
};
Hardcoded version might look like this:
#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;
my $ret;
$ret->{'country'}->{tag} = 'ENG';
$ret->{'country'}->{ai}->{flags} = [];
my #qw = qw( ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD );
$ret->{'country'}->{ai}->{combat} = \#qw;
$ret->{'country'}->{ai}->{continent} = ["Oceania"];
$ret->{'country'}->{ai}->{area} = ["America", "Maine", "Georgia", "Newfoundland", "Cuba", "Bengal", "Carnatic", "Ceylon", "Tanganyika", "The Mascarenes", "The Cape", "Gold", "St Helena", "Guiana", "Falklands", "Bermuda", "Oregon"];
$ret->{'country'}->{ai}->{region} = ["North America", "Carribean", "India"];
$ret->{'country'}->{ai}->{war} = 50;
$ret->{'country'}->{ai}->{ferocity} = 'no';
$ret->{'country'}->{date}->{year} = 0;
$ret->{'country'}->{date}->{month} = 'january';
$ret->{'country'}->{date}->{day} = 0;
sub hash_sort {
my ($hash) = #_;
return [ (sort keys %$hash) ];
}
$Data::Dumper::Sortkeys = \hash_sort;
print Dumper($ret);
I have to admit I have a huge problem dealing with nested curly brackets.
I've tried to solve it by using greedy and ungreedy matching, but it seems it didn't do the trick. I've also read about extended patterns (like (?PARNO)) but I have absolutely no clue how to use them in my particular problem. Order of data is irrelevant, since I have the hash_sort subroutine.
I'll apprieciate any help.
I broke it down to some simple assumptions:
An entry would consist of an identifier followed by an equals sign
An entry would be one of three basic types: a level or set or a single value
A set has 3 forms: 1) quoted, space-separated list; 2) key-value pairs, 3) qw-like unquoted list
A set of key-value pairs must contain an indentifier for a key and either nonspaces or a quoted
value for a value
See the interspersed comments.
use strict;
use warnings;
my $simple_value_RE
= qr/^ \s* (\p{Alpha}\w*) \s* = \s* ( [^\s{}]+ | "[^"]*" ) \s* $/x
;
my $set_or_level_RE
= qr/^ \s* (\w+) \s* = \s* [{] (?: ([^}]+) [}] )? \s* $/x
;
my $quoted_set_RE
= qr/^ \s* (?: "[^"]+" \s+ )* "[^"]+" \s* $/x
;
my $associative_RE
= qr/^ \s*
(?: \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ ) \s+ )*
\p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ )
\s* $
/x
;
my $pair_RE = qr/ \b ( \p{Alpha}\w* ) \s* = \s* ( "[^"]+" | \S+ )/x;
sub get_level {
my $handle = shift;
my %level;
while ( <$handle> ) {
# if the first character on the line is a close, then we're done
# at this level
last if m/^\s*[}]/;
my ( $key, $value );
# get simple values
if (( $key, $value ) = m/$simple_value_RE/ ) {
# done.
}
elsif (( $key, my $complete_set ) = m/$set_or_level_RE/ ) {
if ( $complete_set ) {
if ( $complete_set =~ m/$quoted_set_RE/ ) {
# Pull all quoted values with global flag
$value = [ $complete_set =~ m/"([^"]+)"/g ];
}
elsif ( $complete_set =~ m/$associative_RE/ ) {
# going to create a hashref. First, with a global flag
# repeatedly pull all qualified pairs
# then split them to key and value by spliting them at
# the first '='
$value
= { map { split /\s*=\s*/, $_, 2 }
( $complete_set =~ m/$pair_RE/g )
};
}
else {
# qw-like
$value = [ split( ' ', $complete_set ) ];
}
}
else {
$value = get_level( $handle );
}
}
$level{ $key } = $value;
}
return wantarray ? %level : \%level;
}
my %base = get_level( \*DATA );
Well, as David suggested, the easiest way would be to get whatever produced the file to use a standard format. JSON, YAML, or XML would be much easier to parse.
But if you really have to parse this format, I'd write a grammar for it using Regexp::Grammars (if you can require Perl 5.10) or Parse::RecDescent (if you can't). This'll be a little tricky, especially because you seem to be using braces for both hashes & arrays, but it should be doable.
The contents look pretty regular. Why not perform some substitutions on the content and convert it to hash syntax, then eval it. That would be a quick and dirty way to convert it.
You can also write a parser, assuming you know the grammar.