Some capture groups seem lost when matching group repeatedly - regex

Trying to parse the output of monitoring plugins I ran into a problem where the match result was unexpected by me:
First consider this debugger session with Perl 5.18.2:
DB<6> x $_
0 'last=0.508798;;;0'
DB<7> x $RE
0 (?^u:^((?^u:\'[^\'=]+\'|[^\'= ]+))=((?^u:\\d+(?:\\.\\d*)?|\\.\\d+))(s|%|[KMT]?B)?(;(?^u:\\d+(?:\\.\\d*)?|\\.\\d+)?){0,4}$)
-> qr/(?^u:^((?^u:'[^'=]+'|[^'= ]+))=((?^u:\d+(?:\.\d*)?|\.\d+))(s|%|[KMT]?B)?(;(?^u:\d+(?:\.\d*)?|\.\d+)?){0,4}$)/
DB<8> #m = /$RE/
DB<9> x #m
0 'last'
1 0.508798
2 undef
3 ';0'
DB<10>
OK, the regex $RE (intended to match "'label'=value[UOM];[warn];[crit];[min];[max]") looks terrifying at a first glance, so let me show the construction of it:
my $RE_label = qr/'[^'=]+'|[^'= ]+/;
my $RE_simple_float = qr/\d+(?:\.\d*)?|\.\d+/;
my $RE_numeric = qr/[-+]?$RE_simple_float(?:[eE][-+]?\d+)?/;
my $RE = qr/^($RE_label)=($RE_simple_float)(s|%|[KMT]?B)?(;$RE_simple_float?){0,4}$/;
The relevant part is (;$RE_simple_float?){0,4}$ intended to match ";[warn];[crit];[min];[max]" (still not perfect), so for ";;;0" I'd expect #m to end with ';', ';', ';0'.
However it seems the matches are lost, except for the last one.
Did I misunderstand something, or is it a Perl bug?

When you use {<number>} (or + or * for that matter) after a capture group, only the last value that is matched by the capture group is stored. This explain why you only end up with ;0 instead of ;;;0 in your fourth capture group: (;$RE_simple_float?){0,4} sets the fourth capture group to the last element it matches.
Top fix that, I would recommend to match the whole end of the string, and split it afterwards:
my $RE = qr/...((?:;$RE_simple_float?){0,4})$/;
my #m = /$RE/;
my #end = split /;/, $m[3]; # use /(?<=;)/ to keep the semicolons
Another solution is to repeat the capture group: replace (;$RE_simple_float?){0,4} with
(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?
The capture groups that do not match will be set to undef. This issue with this approach is that it's a bit verbose, and only works for {}, but not for + or *.

Following demo code utilizes split to obtain data of interest. Investigate if it will fit as a solution for your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
while( <DATA> ) {
chomp;
say;
my $record;
$record->#{qw/label value warn crit min max/} = split(/[=;]/,$_);
say Dumper($record);
}
exit 0;
#'label'=value[UOM];[warn];[crit];[min];[max]
__DATA__
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
'label 2'=10%;7%;18%;2%;28%
'label 3'=0.5us;2.3us
Output
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
$VAR1 = {
'crit' => '1.2s',
'warn' => '0.8s',
'value' => '0.3345s',
'label' => '\'label 1\'',
'max' => '3.2s',
'min' => '0.2s'
};
'label 2'=10%;7%;18%;2%;28%
$VAR1 = {
'min' => '2%',
'max' => '28%',
'label' => '\'label 2\'',
'value' => '10%',
'warn' => '7%',
'crit' => '18%'
};
'label 3'=0.5us;2.3us
$VAR1 = {
'min' => undef,
'max' => undef,
'label' => '\'label 3\'',
'warn' => '2.3us',
'value' => '0.5us',
'crit' => undef
};

Related

How do I grab an unknown number of captures from a pattern?

Lets say I have a pattern:
<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2
As you can see, the pattern could have multiple values (in this case pin has 2 sets of pin names). The amount is unknown.
How would I parse this? Here is what I have so far, but it is not helpful as it does not take into account if the pattern has more than 2 sets of pins.
my $pattern = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
if ( $pattern =~ m#\<cell\> (\w*=\w*) \<pin\> (\w*=\w*) \<type\> (\w*=\w*)#) {
my $cell_name = $1;
my $pin_name = $2;
my $type_name = $3;
}
as you can see, this will only work if there is only one set of pin names. However I want it to be able to adjust to multiple unknown sets of pin names. I think I would have to construct like an array or hash, but I am not really sure what is the best way of grabbing these values taking into account the unknown multiple pin sets.
I would like to be able to store the cell_name,pin_name and type_name as an array or hash with the sets of values.
Your problem is a bit trickier than Why do I get the first capture group only? but some of those ideas may help. The trick is to stop thinking about doing everything in a single pattern.
If that's really your input, I'd be tempted to match groups of things around an =. Matching in list context, such as assigning to a hash, returns the list of matches:
use Data::Dumper;
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my %values = $input =~ m/ (\S+) = (\S+) /gx;
print Dumper( \%values );
The things before the = become keys and the things after become the values:
$VAR1 = {
'pin1' => 'pin2',
'type1' => 'type2',
'cell1' => 'cell2',
'pin3' => 'pin4'
};
But life probably isn't that easy. The example names probably don't really have pin, cell, and so on.
There's another thing I like to do, though, because I miss having all that fun with sscan. You can walk a string by matching part of it at a time, then on the next match, start where you left off. Here's the whole thing first:
use v5.10;
use Data::Dumper;
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my %hash;
while( 1 ) {
state $type;
if( $input =~ /\G < (.*?) > \s* /xgc ) {
$type = $1;
}
elsif( $input =~ /\G (\S+) = (\S+) \s* /xgc ) {
$hash{$type}{$1}{$2}++;
}
else { last }
}
print Dumper( \%hash );
And the data structure, which really doesn't matter and can be anything that you like:
$VAR1 = {
'type' => {
'type1' => {
'type2' => 1
}
},
'pin' => {
'pin1' => {
'pin2' => 1
},
'pin3' => {
'pin4' => 1
}
},
'cell' => {
'cell1' => {
'cell2' => 1
}
}
};
But let's talk about his for a moment. First, all of the matches are in scalar context since they are in the conditional parts of the if-elsif-else branches. That means they only make the next match.
However, I've anchored the start of each pattern with \G. This makes the pattern match at the beginning of the string or the position where the previous successful match left off when I use the /g flag in scalar context.
But, I want to try several patterns, so some of them are going to fail. That's where the /c flag comes in. It doesn't reset the match position on failure. That means the \G anchor won't reset on an unsuccessful match. So, I can try a pattern, and if that doesn't work, start at the same position with the next one.
So, when I encounter something in angle brackets, I remember that type. Until I match another thing in angle brackets, that's the type of thing I'm matching. Now when I match (\S+) = (\S+), I can assign the matches to the right type.
To watch this happen, you can output the remembered string position. Each scalar maintains its own cursor and pos(VAR) returns that position:
use v5.10;
use Data::Dumper;
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my %hash;
while( 1 ) {
state $type;
say "Starting matches at " . ( pos($input) // 0 );
if( $input =~ /\G < (.*?) > \s* /xgc ) {
$type = $1;
say "Matched <$type>, left off at " . pos($input);
}
elsif( $input =~ /\G (\S+) = (\S+) \s* /xgc ) {
$hash{$type}{$1}{$2}++;
say "Matched <$1|$2>, left off at " . pos($input);
}
else {
say "Nothing left to do, left off at " . pos($input);
last;
}
}
print Dumper( \%hash );
Before the Dumper output, you now see the global matches in scalar context walk the string:
Starting matches at 0
Matched <cell>, left off at 7
Starting matches at 7
Matched <cell1|cell2>, left off at 19
Starting matches at 19
Matched <pin>, left off at 25
Starting matches at 25
Matched <pin1|pin2>, left off at 35
Starting matches at 35
Matched <pin3|pin4>, left off at 45
Starting matches at 45
Matched <type>, left off at 52
Starting matches at 52
Matched <type1|type2>, left off at 63
Starting matches at 63
Nothing left to do, left off at 63
Finally, as a bonus, here's a recursive decent grammar that does it. It's certainly overkill for what you've provided, but does better in more tricky situations. I won't explain it other than to say it produces the same data structure:
use v5.10;
use Parse::RecDescent;
use Data::Dumper;
my $grammar = <<~'HERE';
startrule: context_pairlist(s)
context_pairlist: context /\s*/ pair(s)
context: '<' /[^>]+/ '>'
{ $::context = $item[2] }
pair: /[A-Za-z0-9]+/ '=' /[A-Za-z0-9]+/
{ main::build_hash( $::context, #item[1,3] ) }
HERE
my $parser = Parse::RecDescent->new( $grammar );
my %hash;
sub build_hash {
my( $context, $name, $value ) = #_;
$hash{$context}{$name}{$value}++;
}
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
$parser->startrule( $input );
say Dumper( \%hash );
You have space separated tokens. Some tokens indicate a new scope and some tokens indicate values being set in that scope. I find it most straightforward to just run through the list of tokens in this case:
#!/usr/bin/env perl
use feature 'say';
use strict;
use warnings;
my $s = q{<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2};
my (%h, $k);
while ($s =~ /(\S+)/g) {
my ($x, $y)= split /=/, $1;
if (defined $y) {
push $h{$k}->#*, {key => $x, value => $y};
next;
}
$h{$k = $x} = [];
}
use Data::Dumper;
print Dumper \%h;
Note that this method considers everything that is not an assignment to be a scope marker.
The resulting data structure is suitable for feeding into something else. Using {key => $key, 'value' => $value} instead of {$key => $value} allows 1) straightforward handling of assignments in a scope upstream; and 2) actually allows the same identifier to be assigned multiple times in a scope (giving you an opportunity to detect this if so desired):
$VAR1 = {
'<cell>' => [
{
'value' => 'cell2',
'key' => 'cell1'
}
],
'<pin>' => [
{
'value' => 'pin2',
'key' => 'pin1'
},
{
'value' => 'pin4',
'key' => 'pin3'
}
],
'<type>' => [
{
'key' => 'type1',
'value' => 'type2'
}
]
};
Another approach is to split the $pattern into an array where each tag starts a new row. This makes it easier to extract the relevant data as this example shows:
#!/usr/bin/perl
$pattern="<cell> cell1=1234567890 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
%cell=%pin=%type=();
print "Original pattern =$pattern\n";
($pattern_split=$pattern) =~ s/</\n</g;
#array=split(/\n/, $pattern_split);
# Extract relevant data (NOTE: finetune regex here) and store them in appropriate hashes indexed by $cnum (cellphone number)
for $_ (#array) {
/<cell>\s*\w+=(\w+)/ && do { $cnum = $1; $cell{$cnum} = $cnum };
/<pin>\s*(.+?)\s*$/ && do { $pin_list=$1; $pin{$cnum} = $pin_list };
/<type>\s*\w+=(\w+)/ && do { $type{$cnum} = $1 };
}
$cn="1234567890";
print "Result: Cellnumber '$cell{$cn}' has pin_list='$pin{$cn}' and type='$type{$cn}'\n";
Prints:
Original pattern =<cell> cell1=1234567890 <pin> pin1=pin2 pin3=pin4 <type> type1=type2
Result: Cellnumber '1234567890' has pin_list='pin1=pin2 pin3=pin4' and type='type2'

Why do #+ and #{^CAPTURE} differ in length?

I'm trying to understand how the regex variables work, so I can save submatch positions in the payload within embedded code expressions. According to perlvar, the positive indices of the array correspond to $1, $2, $3, etc., but that doesn't seem to be the case?
#!/usr/bin/perl -w
use v5.28;
use Data::Dumper;
"XY" =~ / ( (.*) (.) (?{
say Dumper { match_end => \#+ };
say Dumper { capture => \#{^CAPTURE} }
}) ) (.)/x;
Output:
$VAR1 = {
'match_end' => [
2,
undef,
1,
2,
undef
]
};
$VAR1 = {
'capture' => [
undef,
'X',
'Y'
]
};
$VAR1 = {
'match_end' => [
1,
2,
0,
1,
undef
]
};
$VAR1 = {
'capture' => [
'XY',
'',
'X'
]
};
The #+ array apparently gets allocated, or otherwise prepared, already at compilation
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd #+}) ( (.) )/x'
prints
(0, undef, undef)
(0 for the whole match and an undef for each indicated capture group), while
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd #+}) ( (.) (.) )/x'
prints
(0, undef, undef, undef)
with one more element for one more capture group.
One the other hand, the #{^CAPTURE} is just plain empty until there are actual patterns to capture, as we can see from mob's detailed analysis. This, I'd say, plays well with its name.
After the fact the arrays agree, with that shift of one in indices since #+ also contains (offset for) the whole match, at $+[0].
Another difference is that a trailing failed optional match doesn't get a slot in #{^CAPTURE}
perl -MData::Dump=dd -we'$_=q(abc); /((x)? (.) (x)?)/x; dd #+; dd #{^CAPTURE}'
prints
(1, 1, undef, 1, undef)
("a", undef, "a")
The perlvar docs are unclear about what #{^CAPTURE} look like in the middle of a regexp evaluation, but there is a clear progression that depends where in the regexp you are looking at it.
use 5.026;
use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 0;
sub DEBUG_CAPTURE { say Dumper { a => $_[0], capture => \#{^CAPTURE} }; }
"XY" =~ /
(?{DEBUG_CAPTURE(0)})
(
(?{DEBUG_CAPTURE(1)})
(
(?{DEBUG_CAPTURE(2)})
(.*) (?{DEBUG_CAPTURE(3)})
(.) (?{DEBUG_CAPTURE(4)})
)
(?{DEBUG_CAPTURE(5)}) (.)
(?{DEBUG_CAPTURE(6)})
)
(?{DEBUG_CAPTURE(7)}) /x;
DEBUG_CAPTURE(8);
Output
$VAR1 = {'a' => 0,'capture' => []};
$VAR1 = {'a' => 1,'capture' => []};
$VAR1 = {'a' => 2,'capture' => []};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'XY']};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'X']};
$VAR1 = {'a' => 4,'capture' => [undef,undef,'X','Y']};
$VAR1 = {'a' => 5,'capture' => [undef,'XY','X','Y']};
$VAR1 = {'a' => 3,'capture' => [undef,'XY','','Y']};
$VAR1 = {'a' => 4,'capture' => [undef,'XY','','X']};
$VAR1 = {'a' => 5,'capture' => [undef,'X','','X']};
$VAR1 = {'a' => 6,'capture' => [undef,'X','','X','Y']};
$VAR1 = {'a' => 7,'capture' => ['XY','X','','X','Y']};
$VAR1 = {'a' => 8,'capture' => ['XY','X','','X','Y']};
The docs are correct if you are observing #{^CAPTURE} after a regexp has been completely evaluated. While evaluation is in process, #{^CAPTURE} seems to grow as the number of capture groups encountered increases. But it's not clear how useful it is to look at #{^CAPTURE} at least until you get to the end of the expression.

perl: capturing the replaced-with string

I have code in a loop similar to
for( my $i=0; $a =~ s/<tag>(.*?)<\/tag>/sprintf("&CITE%03d;",$i)/e ; $i++ ){
%cite{ $i } = $1;
}
but instead of just the integer index, I want to make the keys of the hash the actual replaced-with text (placeholder "&CITE001;", etc.) without having to redo the sprintf().
I was almost sure there was a way to do it (variable similar to $& and such, but maybe I was thinking of vim's substitutions and not perl. :)
Thanks!
my $i = 0;
s{<tag>(.*?)</tag>}{
my $entity = sprintf("&CITE%03d;", $i++);
$cite{$entity} = $1;
$entity
}eg;
I did a something of a hacque, but really wanted something a bit more elegant. What I ended up doing (for now) is
my $t;
for( my $i=0; $t = sprintf("&CITE%04d;",$i), $all =~ s/($oct.*?$cct)/$t/s; $i++ ){
$cites{$t} = $1;
}
but I really wanted something even more "self-contained".
Just being able to grab the replacement string would've made things much simpler, though. This is a simple read-modify-write op.
True, adding the 'g' modifier should help shave some microseconds off it. :D
I think any method other than re-starting the search from the start of the target
is always the better choice.
In that vein and, as an alternative, you can move the logic inside the regex
via a Code Construct (?{ code }) and leverage the fact that $^N contains
the last capture content.
Perl
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
my $target = "<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>";
my %cite;
my ($cnt,$key) = (0,'');
$target =~ s/
<tag> (.*?) <\/tag>
(?{
$key = sprintf("&CITE%03d;", $cnt++);
$cite{$key} = $^N;
})
/$key/xg;
print $target, "\n";
print Dumper(\%cite);
Output
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};
Edited/code by #Ikegami
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
sub f {
my $target = "<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>";
my %cite;
my ($cnt,$key) = (0,'');
$target =~ s/
<tag> (.*?) <\/tag>
(?{
$key = sprintf("&CITE%03d;", $cnt++);
$cite{$key} = $^N;
})
/$key/xg;
print $target, "\n";
print Dumper(\%cite);
}
f() for 1..2;
Output
Variable "$key" will not stay shared at (re_eval 1) line 2.
Variable "$cnt" will not stay shared at (re_eval 1) line 2.
Variable "%cite" will not stay shared at (re_eval 1) line 3.
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};
$VAR1 = {};
This issue has been addressed in 5.18.
Perl by #sln
See, now I don't get that issue in version 5.20.
And, I don't believe I got it in 5.12 either.
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
sub wrapper {
my ($targ, $href) = #_;
my ($cnt, $key) = (0,'');
$$targ =~ s/<tag>(.*?)<\/tag>(?{ $key = sprintf("&CITE%03d;", $cnt++); $href->{$key} = $^N; })/$key/g;
}
my ($target,%cite) = ("<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>", ());
wrapper( \$target, \%cite );
print $target, "\n";
print Dumper(\%cite);
($target,%cite) = ("<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>", ());
wrapper( \$target, \%cite );
print $target, "\n";
print Dumper(\%cite);
Output
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};

how to grep required till some condition while file processing in perl scripts

I want to capture and make one line for the following varieties of warnings
here's the code snippet which i am using for gathering different warnings and after observing warnings pattern i have warnings of different lines and i want join next line warning part in one line and assign to it a hash ref.
#!/usr/bin/perl
use strict;
use warnings;
open(FH,"/home/goudar/Desktop/1.log") or die !$;
while(<FH>){
my $line = $_;
if($line =~ m/^Warning: (.*)$/){
my $date1 = `date`;
chomp($date1);
$subStepValues = {
Warning => $1,
Warning_timeStamp => $date1,
};
push #{$subsSteps->{'subStepValues'}}, $subStepValues;
}
}
close(FH);
FILE
Warning: No clock-gating check is inferred for clock clk_12800
at pins i_osc/i_osc_top/i_clk_div/g817/S and i_osc/i_osc_top/i_clk_div/g817/A1 of cell i_osc/i_osc_top/i_clk_div/g817. (PTE-060)
Warning: Virtual clock 'clk_vir' cannot be made propagated. (UITE-316)
Warning: Virtual clock 'clk_ext' cannot be made propagated. (UITE-316)
Warning: Net i_obr/i_obr34/rec2 has been annotated with max resistance
using the set_resistance command. This takes precedence over
values from parasitics. (PARA-047)
Warning: Net i_obr/i_obr34/rec2 has been annotated with min resistance
using the set_resistance command. This takes precedence over
values from parasitics. (PARA-047)
You need very small additions to your code.
The only possibility other than matching /^Warning:/ is the continuation of it on the next line(s), which is exactly what you want. So you can add an else branch and in it append the line to the last previously matched 'Warning'. This handles multiple such lines as well, they all get appended to the last 'Warning', one after another.
So you need to:
1) Remember the position in the array of the last match ('Warning')
2) Add an else in which you append the line to that last 'Warning', using the last value of the counter.
For this to work you now have to chomp the line.
use strict;
use warnings;
open(FH,"/home/goudar/Desktop/1.log") or die !$;
my $subStepValues; # added missing declarations
my $subsSteps = { };
my $cnt = -1; # added
while (<FH>) {
chomp(my $line = $_); # added
if ($line =~ m/^Warning: (.*)$/) {
$cnt++; # added
my $date1 = `date`;
chomp($date1);
$subStepValues = {
Warning => $1,
Warning_timeStamp => $date1,
};
push #{ $subsSteps->{'subStepValues'} }, $subStepValues;
}
else { # added
$subsSteps->{'subStepValues'}->[$cnt]->{'Warning'} .= " $line";
}
}
close(FH);
foreach my $ss (#{$subsSteps->{'subStepValues'}}) { # added, print
foreach my $key (keys %{$ss}) {
print "$key => $ss->{$key}\n";
}
}
This prints all lines belonging to each 'Warning'.
I recommend three-argument open, open my $fh, '<', $infile, and then while ($fh) { }.
Alternatively, you can read the whole file into one string (or read it into an array and join) and remove new lines. Then split that string on Warning: (drop the first element, an empty string), or match the same way as you do, but globally (/g modifier).
This forms an array where each element is a string with all text belonging to its/^Warning:/. Then make another pass over that array, to add that time stamp and put it all into your data structure.
I would suggest that what you really want, is to split your record on something other than linefeed. Then you can use multi-line pattern matches to extract your content.
Looking at it - the 'end of record' appears to be ).
So:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #warnings;
open my $input, '<', '/home/goudar/Desktop/1.log' or die $!;
local $/ = ')';
while (<$input>) {
my $timestamp = time();
my $formatted_time = localtime();
my ($code) = m/\(([\w\d\-]+)\)/;
my ($message) = m/Warning: (.*)\z/ms;
next unless $code;
push( #warnings, { time => $timestamp, formatted_time => $formatted_time,
msg => $message, code => $code } );
}
close ( $input );
print Dumper \#warnings;
This gives (with your content):
$VAR1 = [
{
'msg' => 'No clock-gating check is inferred for clock clk_12800
at pins i_osc/i_osc_top/i_clk_div/g817/S and i_osc/i_osc_top/i_clk_div/g817/A1 of cell i_osc/i_osc_top/i_clk_div/g817. (PTE-060)',
'time' => 1457430459,
'code' => 'PTE-060',
'formatted_time' => 'Tue Mar 8 09:47:39 2016'
},
{
'formatted_time' => 'Tue Mar 8 09:47:39 2016',
'msg' => 'Virtual clock \'clk_vir\' cannot be made propagated. (UITE-316)',
'time' => 1457430459,
'code' => 'UITE-316'
},
{
'formatted_time' => 'Tue Mar 8 09:47:39 2016',
'code' => 'UITE-316',
'msg' => 'Virtual clock \'clk_ext\' cannot be made propagated. (UITE-316)',
'time' => 1457430459
},
{
'formatted_time' => 'Tue Mar 8 09:47:39 2016',
'msg' => 'Net i_obr/i_obr34/rec2 has been annotated with max resistance
using the set_resistance command. This takes precedence over
values from parasitics. (PARA-047)',
'time' => 1457430459,
'code' => 'PARA-047'
},
{
'msg' => 'Net i_obr/i_obr34/rec2 has been annotated with min resistance
using the set_resistance command. This takes precedence over
values from parasitics. (PARA-047)',
'time' => 1457430459,
'code' => 'PARA-047',
'formatted_time' => 'Tue Mar 8 09:47:39 2016'
}
];
Whilst we're at it:
3 argument open with lexical filehandles is much better style.
my $line = $_ is redundant. You can do while ( my $line = <$filehandle> ) {
you don't need it if all you're doing is regex anyway, because the default match target is $_
unclear what you're doing with $substeps so I've used an array in the above for clarity.
It's usually a good idea to keep an unformatted time, because that enables sorting/searching etc.
the above captures linefeeds in your messages. This can be easily remedied by $message =~ tr/\n//d;

Regex on a string

I'm trying to formulate a regular expression to use on text. Using in-memory variables is not giving the same result.
The below regular expression provides $1 and $2 that return what I expect. rw results vary. These positions can vary: I am looking to extract the data irrespective of the position in the string.
\/vol\/(\w+)\?(\w+|\s+).*rw=(.*\w+)
My data:
_DATA_
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
I'm trying to capture the second and third groups (if it is a space it should return a space), and a list of entries in rw, ro and root.
The expression (.*\w+) will match up to the last word character in the line. What you are looking for is most likely this ([0-9a-z:]+)
Guessing from your comment in reply to ikegami, maybe the following will give results you want.
#!/usr/bin/perl
use strict;
use warnings;
my #keys = qw/ rw ro root /;
my $wanted = join "|", #keys;
my %data;
while (<DATA>) {
my ($path, $param) = split;
my ($vol, $q) = (split '/', $path)[2,3];
my %tmp = map {split /=/} grep /^(?:$wanted)/, split /,/, $param;
$data{$vol}{$q // ' '} = \%tmp;
}
use Data::Dumper; print Dumper \%data;
__DATA__
/vol/vol1 -sec=sys,rw=h1:h2,anon=0
/vol/vol1/q1 -sec=sys,rw=h3:h4,anon=0,ro=h1:h2
/vol/vol2/q1 -sec=sys,root=host5,ro=h3:h5,rw=h1:h2,anon=0
The output from Data::Dumper is:
$VAR1 = {
'vol2' => {
'q1' => {
'ro' => 'h3:h5',
'root' => 'host5',
'rw' => 'h1:h2'
}
},
'vol1' => {
' ' => {
'rw' => 'h1:h2'
},
'q1' => {
'ro' => 'h1:h2',
'rw' => 'h3:h4'
}
}
};
Update: can you tell me what does (?:) mean in the grep?
(?: . . .) is a non-capturing group. It is used in this case because the beginning of the regex has ^. Without grouping, the regex would attempt to match ro positioned at the beginning of the string or rw or root anywhere in the string (not just the beginning).
/^ro|rw|root/ rather than /^(?:ro|rw|root)/
The second expression helps the search along because it knows to only attempt a match at the beginning of the string for all 3 patterns and not to try to match anywhere in the string (speeds things up although in your case, there are only 3 alternating matches to attempt - so, wouldn't make a huge difference here). But, still a good practice to follow.
what does (// ' ') stand for?
That is the defined or operator. The expression $q // ' ' says to use $q for the key in the hash if it is defined or a space instead.
You said in your original post I'm trying to capture the second and third groups (if it is a space it should return a space).
$q can be undefined when the split, my ($vol, $q) = (split '/', $path)[2,3]; has only a vol and not a q such as in this data line (/vol/vol1 -sec=sys,rw=h1:h2,anon=0).
No idea what you want, but a regex would not make a good parser here.
while (<DATA>) {
my ($path, $opts) = split;
my %opts =
map { my ($k,$v) = split(/=/, $_, 2); $k=>$v }
split(/,/, $opts);
...
}
(my %opts = split(/[,=]/, $opts); might suffice.)