I am trying to devise a Perl regex to parse command output from IBM's runmqsc utility.
Each line of output of interest contains one or more attribute/value pairs in the format "ATTRIBUTE(VALUE)". The value for an attribute can be empty, or can itself contain parentheses. Typically, at most two attribute/value pairs appear on a given line, so the regex is written under this assumption.
Example input to Perl RE:
CHANNEL(TO.IPTWX01) CHLTYPE(CLUSRCVR)
DISCINT(6000) SHORTRTY(10)
TRPTYPE(TCP) DESCR( )
LONGTMR(1200) SCYEXIT( )
CONNAME(NODE(1414)) MREXIT( )
MREXIT( ) CONNAME2(SOME(1416))
TPNAME( ) BATCHSZ(50)
MCANAME( ) MODENAME( )
ALTTIME(00.41.56) SSLPEER()
CONTRIVED() ATTR (00-41-56)
CONTRIVED() DOCTORED()
MSGEXIT( )
I have the following Perl code to capture each attribute/value pair.
Perl Code
my $resplit = qr/\s+([^\s]+(?:\([^)]*\))?)\s?/;
while ( <IN2> )
{ s/[\s\r\n]+$//;
if ( m/^\s(?:$resplit)(?:$resplit)?$/ )
{ my ($one,$two) = ($1,$2);
print "one: $one, two: $two\n";
}
}
Here's the output when the above code is applied to sample input:
one: CHANNEL(TO.IPTWX01), two: CHLTYPE(CLUSRCVR)
one: DISCINT(6000), two: SHORTRTY(10)
one: TRPTYPE(TCP), two: DESCR( )
one: LONGTMR(1200), two: SCYEXIT( )
one: CONNAME(NODE(1414)), two: MREXIT( )
one: MREXIT( ), two: CONNAME2(SOME(1416))
one: TPNAME( ), two: BATCHSZ(50)
one: MCANAME( ), two: MODENAME( )
one: ALTTIME(00.41.56), two: SSLPEER()
one: CONTRIVED(), two: ATTR(00-41-56)
one: CONTRIVED(), two: DOCTORED()
one: MSGEXIT(, two: )
This works great with the exception of the last line in the output
above. I'm really struggling to figure out how
to modify the above expression $resplit to capture the last case.
Can anyone offer any ideas/suggestions on how to make this work or
another approach?
The Text::Balanced module is designed to handle this sort of problem. This approach will handle any number of columns as well.
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);
my ($extracted, $remainder, $prefix);
while ( defined($remainder = <DATA>) ){
while ( Get_paren_text() ){
$prefix =~ s/ //g;
print $prefix, $extracted, "\n";
}
}
sub Get_paren_text {
($extracted, $remainder, $prefix)
= extract_bracketed($remainder, '()', '[\w ]+');
return defined $extracted;
}
__DATA__
CHANNEL(TO.IPTWX01) CHLTYPE(CLUSRCVR) FOO( ( BAR) )
DISCINT(6000) SHORTRTY(10) BIZZ((((BUZZ) ) ) ) )
TRPTYPE(TCP) DESCR( )
LONGTMR(1200) SCYEXIT( )
CONNAME(NODE(1414)) MREXIT( )
MREXIT( ) CONNAME2(SOME(1416))
TPNAME( ) BATCHSZ(50)
MCANAME( ) MODENAME( )
ALTTIME(00.41.56) SSLPEER()
CONTRIVED() ATTR (00-41-56)
CONTRIVED() DOCTORED()
MSGEXIT( )
I wanted to try to use Regexp::Grammars.
So here it is:
#! /opt/perl/bin/perl
use strict;
#use warnings;
use 5.10.1;
use Regexp::Grammars;
my $grammar = qr{
<line>
<token: line>
(?: <[pair]> \s* )+
(?{
my $arr = $MATCH{pair};
local $MATCH = {};
for my $pair( @$arr ){
my($key) = keys %$pair;
my($value) = values %$pair;
$MATCH->{$key} = $value;
}
})
<token: pair>
<attrib> \s* \( \s* <value> \s* \)
(?{
$MATCH = {
$MATCH{attrib} => $MATCH{value}
};
})
<token: attrib>
[^()]*?
<token: value>
(?:
<MATCH=pair> |
[^()]*?
)
}x;
use warnings;
my %attr;
while( my $line = <> ){
$line =~ /$grammar/;
for my $key ( keys %{ $/{line} } ){
$attr{$key} = $/{line}{$key};
}
}
use YAML;
say Dump \%attr;
---
ALTTIME: 00.41.56
ATTR: 00-41-56
BATCHSZ: 50
CHANNEL: TO.IPTWX01
CHLTYPE: CLUSRCVR
CONNAME:
  NODE: 1414
CONNAME2:
  SOME: 1416
CONTRIVED: ''
DESCR: ''
DISCINT: 6000
DOCTORED: ''
LONGTMR: 1200
MCANAME: ''
MODENAME: ''
MREXIT: ''
MSGEXIT: ''
SCYEXIT: ''
SHORTRTY: 10
SSLPEER: ''
TPNAME: ''
TRPTYPE: TCP
while ( <IN2> ) {
while ( /([A-Z]+)\s*(\((?:[^()]*+|(?2))*\))/g ) {
print "$1$2\n";
}
}
This works for nested parens e.g.
CONNAME(NODE(1414, SOME(1416) ) ) ATTR (00-41-56)
The (?2) part is recursive, the *+ means "don't backtrack" - only works in Perl 5.10 or later; I got this from http://faq.perl.org/perlfaq6.html#Can_I_use_Perl_regul
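To see what the recursion buys you, here is a small self-contained run of the same pattern against the nested sample line above. The regex is unchanged; the surrounding script and the @pairs array are only my additions for illustration.

```perl
use strict;
use warnings;
use 5.010;    # recursion via (?2) and possessive quantifiers need Perl 5.10+

# Sample line taken from the answer above.
my $line = 'CONNAME(NODE(1414, SOME(1416) ) ) ATTR (00-41-56)';

my @pairs;
# Group 1 is the attribute name; group 2 is one balanced (...) run,
# which re-enters itself through (?2) whenever it meets a nested "(".
while ( $line =~ /([A-Z]+)\s*(\((?:[^()]*+|(?2))*\))/g ) {
    push @pairs, "$1$2";
}

print "$_\n" for @pairs;
```

Note that [A-Z]+ does not match digits, so an attribute name such as CONNAME2 would need [A-Z0-9]+ instead.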
#!/usr/bin/perl
use strict;
use warnings;
my @parsed;
while ( my $line = <DATA> ) {
while ( $line =~ / ([A-Z0-9]+) \s* \( (.*?) \) \s /gx ) {
push @parsed, { $1 => $2 }
}
}
use Data::Dumper;
print Dumper \@parsed;
__DATA__
CHANNEL(TO.IPTWX01) CHLTYPE(CLUSRCVR)
DISCINT(6000) SHORTRTY(10)
TRPTYPE(TCP) DESCR( )
LONGTMR(1200) SCYEXIT( )
CONNAME(NODE(1414)) MREXIT( )
MREXIT( ) CONNAME2(SOME(1416))
TPNAME( ) BATCHSZ(50)
MCANAME( ) MODENAME( )
ALTTIME(00.41.56) SSLPEER()
CONTRIVED() ATTR (00-41-56)
CONTRIVED() DOCTORED()
MSGEXIT( )
Is the first and the second substitution equivalent if the replacement is passed in a variable?
#!/usr/bin/env perl6
use v6;
my $foo = 'switch';
my $t1 = my $t2 = my $t3 = my $t4 = 'this has a $foo in it';
my $replace = prompt( ':' ); # $0
$t1.=subst( / ( \$ \w+ ) /, $replace );
$t2.=subst( / ( \$ \w+ ) /, { $replace } );
$t3.=subst( / ( \$ \w+ ) /, { $replace.EVAL } );
$t4.=subst( / ( \$ \w+ ) /, { ( $replace.EVAL ).EVAL } );
say "T1 : $t1";
say "T2 : $t2";
say "T3 : $t3";
say "T4 : $t4";
# T1 : this has a $0 in it
# T2 : this has a $0 in it
# T3 : this has a $foo in it
# T4 : this has a switch in it
The only difference between $replace and {$replace} is that the second is a block that returns the value of the variable. It's only adding a level of indirection, but the result is the same.
Update: Edited according to @raiph's comments.
I would like to parse the following lines
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
and get the following output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11
I succeeded in parsing the first two fields with command
sed -Eun 's/[^>]+> ([0-9.]+)\.[0-9]+:.+q: A\? ([a-z0-9.-]+)\.([^:]+).*/\1:\2:\3/pg'
But I cannot get the resolved IPs (A xx.xx.xx.xx). In fact there may be several.
Would it be possible to get such output using sed or Perl ?
EDIT:
As I mentioned in the comments, after parsing a larger input sample I also need to discard some lines from the output. A line should be kept only if:
the number of A records ("A xx.xx.xx.xx") is non-zero
and the line does not contain NXDomain\*?-
I managed to meet the first requirement, but not the second.
Following @ikegami's reply, here is my attempt:
perl -nle '
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $rest) = /^ \s+ ( [^:]++ ) : \s++ $field_value_re ( .* ) /sx
or next;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my @answers = split(/\s*+,\s*+/, $answers);
my ($afield) = join " ", map { /^\S++\s++A\s++(\S++)/ } @answers;
if ( length($afield) != 0)
{
print join " ", $ip, $query, $afield;
}
' dns.sample
This does as you ask with the sample data
I first build a regex pattern $url_re that matches numeric URLs to make the following code more concise. Then I search for the first URL immediately after >, the named URL right after A?, and all of the following URLs which are preceded by A
They are all stored in the array @urls and printed
use strict;
use warnings 'all';
use 5.010;
my $url_re = qr/(?:\d+\.){3}\d+/;
while ( <DATA> ) {
my @urls = ( />\s+($url_re)/, /A\?\s+([-\w.]+\w)/, /(A\s+$url_re)/g );
say "@urls";
}
__DATA__
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254 A 4.26.235.155
125.15.15.9 www.toto.net A 120.33.1.11
Each line appears to be of the form
{"id" with spaces}: {stuff} [ {key}: {stuff} ]*
You appear to be interested in information inside the "id", and inside the field named q. The value of the q field appears to be of the form
A? {word} {word} {ns_return} [, {ns_return} ]*
Here's a robust solution that handles the format described above.
perl -nle'
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $id_val, $rest) = /^ ( [^:]++ ) : \s++ ( $field_value_re ) ( .* ) /sx
or next;
next if $id_val =~ /\bNXDomain\b/;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my @answers =
map { /^\S++\s++A\s++(\S++)/ }
split(/\s*+,\s*+/, $answers);
next if !@answers;
print join " ", $ip, $query, map { "A $_" } @answers;
' log
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11
This prints the desired output by using the map function in a somewhat unorthodox way to ignore any fields after q:
perl -lne 'print join qq/\t/, m/> ([\d\.]+)\./, map {/A\? ([^\s]+)\./, /(A [\d\.]+)/g} / q:([^:]+)/' log.txt
I'm parsing a CSV file with embedded commas, and obviously, using split() has a few limitations due to this.
One thing I should note is that the values with embedded commas are surrounded by parentheses, double quotes, or both...
for example:
(Date, Notional),
"Date, Notional",
"(Date, Notional)"
Also, I'm trying to do this without using any modules for certain reasons I don't want to go into right now...
Can anyone help me out with this?
This should do what you need. It works in a very similar way to the code in Text::CSV_PP, but doesn't allow for escaped characters within the field as you say you have none
use strict;
use warnings;
use 5.010;
my $re = qr/(?| "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my @fields = "$line," =~ /$re/g;
say "<$_>" for @fields;
output
<Date, Notional 1>
<Date, Notional 2>
<Date, Notional 3>
Update
Here's a version for older Perls (prior to version 5.10) that don't have the regex branch reset construct. It produces identical output to the above
use strict;
use warnings;
my $re = qr/(?: "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my @fields = grep defined, "$line," =~ /$re/g;
print "<$_>\n" for @fields;
I know you already have a working solution with Borodin's answer, but for the record there is also a simple solution with split (see the results at the bottom of the online demo). This situation sounds very similar to regex match a pattern unless....
#!/usr/bin/perl
$regex = '(?:\([^\)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*';
$subject = '(Date, Notional), "Date, Notional", "(Date, Notional)"';
@splits = split($regex, $subject);
print "\n*** Splits ***\n";
foreach(@splits) { print "$_\n"; }
How it Works
The left side of the alternation | matches complete (parentheses) and (quotes), then deliberately fails. The right side matches commas, and we know they are the right commas because they were not matched by the expression on the left.
Possible Refinements
If desired, the parentheses-matching portion could be made recursive to match (nested(parens))
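A minimal sketch of that recursive refinement, assuming Perl 5.10 or later. The input string is made up for illustration and is not from the question. One catch: recursion needs a capture group, and split returns every capture group of the separator pattern (as undef when the group did not take part in the match), so the sketch filters those out with grep.

```perl
use strict;
use warnings;
use 5.010;    # (*SKIP)(*F), (?1) and possessive quantifiers need Perl 5.10+

# Group 1 matches one balanced (...) run and recurses into itself via (?1).
# Balanced parens and quoted strings are matched and then deliberately failed,
# so only commas outside them are left to act as separators.
my $regex = qr/(?:(\((?:[^()]++|(?1))*\))|"[^"]*")(*SKIP)(*F)|\s*,\s*/;

# Hypothetical input with nested parentheses.
my $subject = '(Date, (Notional, USD)), "Date, Notional", plain';

# Drop the undef entries that split adds for the unused capture group.
my @splits = grep { defined $_ } split $regex, $subject;

print "$_\n" for @splits;
```

Genuinely empty fields come back as empty strings rather than undef, so the grep does not discard them.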
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
I know that this is quite an old question, but for completeness I would like to add the solution from the great book "Mastering Regular Expressions" by Jeffrey Friedl (page 271):
sub parse_csv {
my $text = shift; # record containing comma-separated values
my @fields = ( );
my $field;
chomp($text);
while ($text =~ m{\G(?:^|,)(?:"((?>[^"]*)(?:""[^"]*)*)"|([^",]*))}gx) {
if (defined $2) {
$field = $2;
} else {
$field = $1;
$field =~ s/""/"/g;
}
# print "[$field]";
push @fields, $field;
}
return @fields;
}
Try it against test row:
my $line = q(Ten Thousand,10000, 2710 ,,"10,000",,"It's ""10 Grand"", baby",10K);
my @fields = parse_csv($line);
my $i;
for ($i = 0; $i < @fields; $i++) {
print "$fields[$i],";
}
print "\n";
I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8. Could someone help me solve this?
Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
@genes = $l =~ /($bp)/g;
There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;
I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my @out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push @out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)
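If quoted strings can contain \" after all, one option is to swap the "[^"]*" piece for a pattern that steps over backslash-escaped characters. A minimal sketch, assuming a backslash always escapes the character that follows it; the name $quoted_re and the sample input are hypothetical:

```perl
use strict;
use warnings;

# A double-quoted string in which a backslash escapes the next character,
# so \" does not end the string.
my $quoted_re = qr/"(?:[^"\\]|\\.)*"/;

# Hypothetical input: the quoted value contains an escaped quote and a brace.
my $in = 'aa="b\"}b" rest';

my ($quoted) = $in =~ /($quoted_re)/;
print "$quoted\n";
```

In the parser above, that would mean using $quoted_re in place of "[^"]*" inside the last alternation.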
The proposed solutions will not work if you need to match balanced parentheses or curly brackets while also taking backslash-escaped ones into account. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;
Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my @arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\@arr);
Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".
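A minimal sketch of that character-by-character approach, assuming quotes are never escaped and braces outside quotes are balanced. The sub name split_top_level is my own, and nothing here needs anything newer than Perl 5.8:

```perl
use strict;
use warnings;

# Split on commas that sit outside both double quotes and braces.
sub split_top_level {
    my ($text) = @_;
    my @parts    = ('');
    my $depth    = 0;    # current brace nesting level
    my $in_quote = 0;    # true while inside a "..." string

    for my $char ( split //, $text ) {
        if ( $char eq '"' ) {
            $in_quote = !$in_quote;
        }
        elsif ( !$in_quote ) {
            if    ( $char eq '{' ) { $depth++ }
            elsif ( $char eq '}' ) { $depth-- }
            elsif ( $char eq ',' && $depth == 0 ) {
                push @parts, '';    # top-level comma: start a new part
                next;
            }
        }
        $parts[-1] .= $char;
    }

    # Trim the whitespace that surrounded each separating comma.
    s/^\s+|\s+$//g for @parts;
    return @parts;
}

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, '
       . 'aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"';

print "$_\n" for split_top_level($in);
```

Braces and commas inside quotes are simply copied through, which is what keeps "b}b" and "bbb,ccc" intact.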
I've got the text file:
country = {
tag = ENG
ai = {
flags = { }
combat = { ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD }
continent = { "Oceania" }
area = { "America" "Maine" "Georgia" "Newfoundland" "Cuba" "Bengal" "Carnatic" "Ceylon" "Tanganyika" "The Mascarenes" "The Cape" "Gold" "St Helena" "Guiana" "Falklands" "Bermuda" "Oregon" }
region = { "North America" "Carribean" "India" }
war = 50
ferocity = no
}
date = { year = 0 month = january day = 0 }
}
What I'm trying to do is to parse this text into perl hash structure, so that the output after data dump looks like this:
$VAR1 = {
'country' => {
'ai' => {
'area' => [
'America',
'Maine',
'Georgia',
'Newfoundland',
'Cuba',
'Bengal',
'Carnatic',
'Ceylon',
'Tanganyika',
'The Mascarenes',
'The Cape',
'Gold',
'St Helena',
'Guiana',
'Falklands',
'Bermuda',
'Oregon'
],
'combat' => [
'ROY',
'WLS',
'PUR',
'SCO',
'EIR',
'FRA',
'DEL',
'USA',
'QUE',
'BGL',
'MAH',
'MOG',
'VIJ',
'MYS',
'DLH',
'GUJ',
'ORI',
'JAI',
'ASS',
'MLC',
'MYA',
'ARK',
'PEG',
'TAU',
'HYD'
],
'continent' => [
'Oceania'
],
'ferocity' => 'no',
'flags' => [],
'region' => [
'North America',
'Carribean',
'India'
],
'war' => 50
},
'date' => {
'day' => 0,
'month' => 'january',
'year' => 0
},
'tag' => 'ENG'
}
};
Hardcoded version might look like this:
#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;
my $ret;
$ret->{'country'}->{tag} = 'ENG';
$ret->{'country'}->{ai}->{flags} = [];
my @qw = qw( ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD );
$ret->{'country'}->{ai}->{combat} = \@qw;
$ret->{'country'}->{ai}->{continent} = ["Oceania"];
$ret->{'country'}->{ai}->{area} = ["America", "Maine", "Georgia", "Newfoundland", "Cuba", "Bengal", "Carnatic", "Ceylon", "Tanganyika", "The Mascarenes", "The Cape", "Gold", "St Helena", "Guiana", "Falklands", "Bermuda", "Oregon"];
$ret->{'country'}->{ai}->{region} = ["North America", "Carribean", "India"];
$ret->{'country'}->{ai}->{war} = 50;
$ret->{'country'}->{ai}->{ferocity} = 'no';
$ret->{'country'}->{date}->{year} = 0;
$ret->{'country'}->{date}->{month} = 'january';
$ret->{'country'}->{date}->{day} = 0;
sub hash_sort {
my ($hash) = @_;
return [ (sort keys %$hash) ];
}
$Data::Dumper::Sortkeys = \&hash_sort;
print Dumper($ret);
I have to admit I have a huge problem dealing with nested curly brackets.
I've tried to solve it by using greedy and ungreedy matching, but it seems it didn't do the trick. I've also read about extended patterns (like (?PARNO)) but I have absolutely no clue how to use them in my particular problem. Order of data is irrelevant, since I have the hash_sort subroutine.
I'd appreciate any help.
I broke it down to some simple assumptions:
An entry would consist of an identifier followed by an equals sign
An entry would be one of three basic types: a level or set or a single value
A set has 3 forms: 1) quoted, space-separated list; 2) key-value pairs, 3) qw-like unquoted list
A set of key-value pairs must contain an identifier for a key and either nonspaces or a quoted value for a value
See the interspersed comments.
use strict;
use warnings;
my $simple_value_RE
= qr/^ \s* (\p{Alpha}\w*) \s* = \s* ( [^\s{}]+ | "[^"]*" ) \s* $/x
;
my $set_or_level_RE
= qr/^ \s* (\w+) \s* = \s* [{] (?: ([^}]+) [}] )? \s* $/x
;
my $quoted_set_RE
= qr/^ \s* (?: "[^"]+" \s+ )* "[^"]+" \s* $/x
;
my $associative_RE
= qr/^ \s*
(?: \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ ) \s+ )*
\p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ )
\s* $
/x
;
my $pair_RE = qr/ \b ( \p{Alpha}\w* ) \s* = \s* ( "[^"]+" | \S+ )/x;
sub get_level {
my $handle = shift;
my %level;
while ( <$handle> ) {
# if the first character on the line is a close, then we're done
# at this level
last if m/^\s*[}]/;
my ( $key, $value );
# get simple values
if (( $key, $value ) = m/$simple_value_RE/ ) {
# done.
}
elsif (( $key, my $complete_set ) = m/$set_or_level_RE/ ) {
if ( $complete_set ) {
if ( $complete_set =~ m/$quoted_set_RE/ ) {
# Pull all quoted values with global flag
$value = [ $complete_set =~ m/"([^"]+)"/g ];
}
elsif ( $complete_set =~ m/$associative_RE/ ) {
# going to create a hashref. First, with a global flag
# repeatedly pull all qualified pairs
# then split them to key and value by spliting them at
# the first '='
$value
= { map { split /\s*=\s*/, $_, 2 }
( $complete_set =~ m/$pair_RE/g )
};
}
else {
# qw-like
$value = [ split( ' ', $complete_set ) ];
}
}
else {
$value = get_level( $handle );
}
}
$level{ $key } = $value;
}
return wantarray ? %level : \%level;
}
my %base = get_level( \*DATA );
Well, as David suggested, the easiest way would be to get whatever produced the file to use a standard format. JSON, YAML, or XML would be much easier to parse.
But if you really have to parse this format, I'd write a grammar for it using Regexp::Grammars (if you can require Perl 5.10) or Parse::RecDescent (if you can't). This'll be a little tricky, especially because you seem to be using braces for both hashes & arrays, but it should be doable.
The contents look pretty regular. Why not perform some substitutions on the content and convert it to hash syntax, then eval it. That would be a quick and dirty way to convert it.
You can also write a parser, assuming you know the grammar.
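A hand-written parser for this format can be fairly small. Here is a minimal sketch, assuming the grammar implied by the sample: a file is a series of "key = value" pairs, a value is a bare word, a quoted string or a braced block, and a block holds pairs if its second token is "=" and a plain list otherwise (so an empty block becomes an empty array). The sub names and the shortened input are mine, not part of the question.

```perl
use strict;
use warnings;
use Data::Dumper;

# Break the text into tokens: braces, '=', quoted strings and bare words.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ( $text =~ /\G\s*(?:([{}=])|"([^"]*)"|([^\s{}="]+))/gc ) {
        push @tokens, defined $1 ? [ punct => $1 ]
                    : defined $2 ? [ value => $2 ]
                    :              [ value => $3 ];
    }
    $text =~ /\G\s*\z/gc or die "Unexpected character in input";
    return \@tokens;
}

# True if the token at index $i is the punctuation character $char.
sub peek_is {
    my ( $tokens, $i, $char ) = @_;
    my $tok = $tokens->[$i];
    return defined $tok && $tok->[0] eq 'punct' && $tok->[1] eq $char;
}

# Parse "key = value" pairs until a closing brace or the end of input.
sub parse_pairs {
    my ($tokens) = @_;
    my %hash;
    while ( @$tokens && !peek_is( $tokens, 0, '}' ) ) {
        my $key = shift @$tokens;
        die "Expected a key" unless $key->[0] eq 'value';
        die "Expected '=' after $key->[1]" unless peek_is( $tokens, 0, '=' );
        shift @$tokens;    # discard '='
        $hash{ $key->[1] } = parse_value($tokens);
    }
    return \%hash;
}

# Parse a scalar, or a braced block that is either a hash or a list.
sub parse_value {
    my ($tokens) = @_;
    my $tok = shift @$tokens or die "Unexpected end of input";
    return $tok->[1] if $tok->[0] eq 'value';
    die "Unexpected '$tok->[1]'" unless $tok->[1] eq '{';

    my $result;
    if ( peek_is( $tokens, 1, '=' ) ) {
        $result = parse_pairs($tokens);    # block of key = value pairs
    }
    else {
        $result = [];                      # plain list, possibly empty
        push @$result, parse_value($tokens)
            while @$tokens && !peek_is( $tokens, 0, '}' );
    }
    die "Missing '}'" unless peek_is( $tokens, 0, '}' );
    shift @$tokens;    # discard '}'
    return $result;
}

# Shortened version of the sample from the question.
my $text = <<'END';
country = {
    tag = ENG
    ai = {
        flags = { }
        combat = { ROY WLS PUR }
        region = { "North America" "Carribean" "India" }
        war = 50
    }
    date = { year = 0 month = january day = 0 }
}
END

my $tokens = tokenize($text);
my $data   = parse_pairs($tokens);
die "Unbalanced '}'" if @$tokens;

$Data::Dumper::Sortkeys = 1;
print Dumper($data);
```

Tokenizing first keeps quoted strings apart from punctuation, so a value like "=" or "}" inside quotes cannot confuse the block detection. Numbers are kept as strings; convert them afterwards if that matters.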