populating a multi - level hash - regex

I have this data set - I am only concerned with the class id, the process#server:
I am trying to load all three into a multi-level hash.
class_id: 995 (OCAive ) ack_type: NOACK supercede: NO copy: NO bdcst: NO
PID SENDS RECEIVES RETRANSMITS TIMEOUTS MEAN S.D. #
21881 (rocksrvr#ith111 ) 1 1 0 0
24519 (miclogUS#ith110 ) 1 1 0 0
26163 (gkoopsrvr#itb101 ) 1 1 0 0
28069 (sitetic#ith100 ) 23 4 0 0
28144 (srtic#ithb10 ) 33 5 0 0
29931 (loick#ithbsv115 ) 1 1 0 0
87331 (rrrrv_us#ith102 ) 1 1 0 0
---------- ---------- ---------- ---------- ------- ------- ------
61 14 0 0
When i try and populate the hash, it does not get much (data dumper results below)
$VAR1 = '';
$VAR2 = {
'' => undef
};
This is the code:
#!/usr/bin/perl -w
use strict;
my $newcomm_dir = "/home/casper-compaq/work_perl/newcomm";
my $newcomm_file = "newcomm_stat_result.txt";
open NEWCOMM , "$new_comm_dir$newcomm_file";
while (<NEWCOMM>) {
if ($_ =~ /\s+class_id:\s\d+\s+((\w+)\s+)/){
my $message_class = $1;
}
if ($_ =~ /((\w+)\#\w+\s+)/) {
my $process = $1;
}
if ($_ =~ /(\w+\#(\w+)\s+)/) {
my $servers = $1;
}
my $newcomm_stat_hash{$message_class}{$servers}=$process;
use Data::Dumper;
print Dumper (%newcomm_stat_hash);
}

Apart from the declarations problem, there are also multiple problems with your regex expressions. I would suggest making sure you get the expected output into each of the variables prior to trying to insert it into a hash.
For one thing you need to match parenthesis as \(, \) otherwise they are just interpreted as variable containers.

Your my declarations inside the if only have scope until the end of the if block, and your hash assignment shouldn't have a my. Try declaring %newcomm_stat_hash before the while loop and $message_class, $process, and $servers at the top of the while block.
You also probably want to check your open for failure; I suspect you are missing a / there.

I suspect your main error is that your file is not opened, due to the path missing a /, between directory and filename. You can use the autodie pragma to check that the open succeeds, or you can use or die "Can't open file: $!".
You have some scope issues. First off, $message_class will be undefined throughout the loop, since it's scope lasts only inside one iteration. You probably also want to have the hash outside the loop, if you want to be able to use it later on.
I put a next statement in the header line check, as the other checks will be invalid in that particular line. If you want to be more precise, you can put the whole thing outside the loop, and just do a single line check.
You do not need the two variables for process and server, just use them directly, and both at the same time.
Lastly, you probably want to send the reference to the hash to Data::Dumper in the print, else the hash will expand, and the print will be somewhat misleading.
#!/usr/bin/perl -w
use strict;
use autodie;
my $newcomm_dir = "/home/casper-compaq/work_perl/newcomm/";
my $newcomm_file = "newcomm_stat_result.txt";
open my $fh, '<', "$new_comm_dir$newcomm_file";
my $message_class;
my %newcomm_stat_hash;
while (<$fh>) {
if (/^\s+class_id:\s+\d+\s+\((\w+)\)\s+/){
$message_class = $1;
next;
}
if (/(\w+)\#(\w+)/) {
$newcomm_stat_hash{$message_class}{$2}=$1;
}
}
use Data::Dumper;
print Dumper \%newcomm_stat_hash;

Is your file being opened?
I think you need to change:
open NEWCOMM , "$new_comm_dir$newcomm_file";
To:
open NEWCOMM , $new_comm_dir.'/'.$newcomm_file;

Related

Why do I get the first capture group only?

(https://stackoverflow.com/a/2304626/6607497 and https://stackoverflow.com/a/37004214/6607497 did not help me)
Analyzing a problem with /proc/stat in Linux I started to write a small utility, but I can't get the capture groups the way I wanted.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
if (open(my $fh, '<', my $file = '/proc/stat')) {
while (<$fh>) {
if (my ($cpu, #vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
print "$cpu $#vals\n";
}
}
close($fh);
} else {
die "$file: $!\n";
}
For example with these input lines I get the output:
> cat /proc/stat
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
0
0 0
1 0
2 0
3 0
So the match actually works, but I don't get the capture groups into #vals (perls 5.18.2 and 5.26.1).
Only the last of the repeated matches from a single pattern is captured.
Instead, can just split the line and then check on -- and adjust -- the first field
while (<$fh>) {
my ($cpu, #vals) = split;
next if not $cpu =~ s/^cpu//;
print "$cpu $#vals\n";
}
If the first element of the split's return doesn't start with cpu the regex substition fails and so the line is skipped. Otherwise, you get the number following cpu (or an empty string), as in OP.†
Or, can use the particular structure of the line you process
while (<$fh>) {
if (my ($cpu, #vals) = map { split } /^cpu([0-9]*) \s+ (.*)/x) {
print "$cpu $#vals\n";
}
}
The regex returns two items and each is split in the map, except that the first one is just passed as is into $cpu (being either a number or an empty string), while the other yields the numbers.
Both these produce the needed output in my tests.
† Since we always check for ^cpu (and remove it) it makes sense to do that first, and only then split -- when needed. However, that gets a little tricky for the following reason.
That bare split strips the leading (and trailing) whitespaces by its default, so for lines where cpu string has no trailing digits (cpu 2709779...) we would end up having the next number for what should be the cpu designation! A quiet error.
Thus we need to specify for split to use spaces, as it then leaves the leading spaces
while (<$fh>) {
next if not s/^cpu//;
my ($cpu, #vals) = split /\s+/; # now $cpu may be space(s)
print "$cpu $#vals\n";
}
This now works as intended as the cpu without trailing numbers gets space(s), a case to handle but clear. But this is misleading and an unaware maintainer -- or us the proverbial six months later -- may be tempted to remove the seemingly "unneeded" /\s+/, introducing an error.
Going by the example input, following content inside the while loop should work.
if (/^cpu(\d*)/) {
my $cpu = $1;
my (#vals) = /(?:\s+(\d+))+/g;
print "$cpu $#vals\n";
}
In an exercise for Learning Perl, we state a problem that's easy to solve with two simple regexes but hard with one (but then in Mastering Perl I pull out the big guns). We don't tell people this because we want to highlight the natural behavior to try to write everything in a single regex. Some of the contortions in other answers remind me of that, and I wouldn't want to maintain any of them.
First, there's the issue of only processing the interesting lines. Then, once we have that line, grab all the numbers. Translating that problem statement into code is very simple and straightforward. No acrobatics here because assertions and anchors do most of the work:
use v5.10;
while( <DATA> ) {
next unless /\A cpu(\d*) \s /ax;
my $cpu = $1;
my #values = / \b (\d+) \b /agx;
say "$cpu " . #values;
}
__END__
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
Note that the OP still has to decide how to handle the cpu case with no trailing digits. Don't know what you want to do with the empty string.
Perl's regex engine will only remember the last capture group from a repeated expression. If you want to capture each number in a separate capture group, then one option would be to use an explicit regex pattern:
if (open(my $fh, '<', my $file = '/proc/stat')) {
while (<$fh>) {
if (my ($cpu, #vals) = /^cpu(\d*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/) {
print "$cpu $#vals\n";
}
}
close($fh);
} else {
die "$file: $!\n";
}
Replacing
while (<$fh>) {
if (my ($cpu, #vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
with
while (<$fh>) {
my #vals;
if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(#vals, $^N) }))+$/) {
does what I wanted (requires perl 5.8 or newer).
he's my example. I thought I'd add it because I like simple code. It also allows "cpu7" with no trailing digits.
#!/usr/bin/perl
use strict;
use warnings;
my $file = "/proc/stat";
open(my $fh, "<", $file) or die "$file: $!\n";
while (<$fh>)
{
if ( /^cpu(\d+)(\s+)?(.*)$/ )
{
my $cpu = $1;
my $vals = scalar split( /\s+/, $3 ) ;
print "$cpu $vals\n";
}
}
close($fh);
Just adding to Tim's answer:
You can capture multiple values with one group (using the g-modifier), but then you have to split the statement.
if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+))+$/) {
my #vals= /(?:\s+(\d+))/g;
print "$cpu $#vals\n";
}

Storing Numerical Data in a Variable through matching in Perl

I am a beginner at Perl and want to store some data from a file format into a variable. Specifically, each line of the file has a format like the following:
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
I was able to use matching to get the line I wanted (with the code below).
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
}
However, I want to store the three coordinate values as variables or in a hash. Is it possible to use matching to store the coordinate values rather than having to use substring.
In this instance I would simply split each record into an array and verify the identifying fields. The coordinate values can simply be extracted from the array if the line has been found to be relevant.
Like this
use strict;
use warnings;
my $residue_name = 'ASP';
my $residue_number = 72;
while (<DATA>) {
my #fields = split;
next unless $fields[0] eq 'ATOM'
and $fields[2] eq 'CB'
and $fields[3] eq $residue_name
and $fields[5] == $residue_number;
my #coords = #fields[6, 7, 8];
print "#coords\n";
}
__DATA__
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
output
-2.80100 -7.45000 -2.09400
You can get the end of the line which is AFTER the match with $' (see http://perldoc.perl.org/perlvar.html), and split around spaces like in :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
(undef, $x, $y, $z) = split /\s+/;
...
}
(undef is necessary, because $_ will start with some spaces, thus the first variable will be empty)
You can also write something like :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
/\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)/;
($x, $y, $z) = ($1, $2, $3);
}
In fact, as always in Perl, there are lots of ways to do it...

regular expression code

I need to find match between two tab delimited files files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
what I would like to do is to retrive the matching line between the two file. What I would like to match is everyting after the gene ID
So far I have written this code but unfortunately perl keeps giving me the error:
use of "Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where i am doing it wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach my $line (#sample1) {
my #tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (#sample2){
my #fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}
On its surface your code seems to be fine (although I didn't debug it). If you don't have an error I cannot spot, could be that the input data has RE special character, which will confuse the regular expression engine when you put it as is (e.g. if any of the variable has the '$' character). Could also be that instead of tab you have spaces some where, in which case you'll indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little bit more Perl Idiomatic. I like using the implicit $_ which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach (#sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (#sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point to split the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it make sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-seperated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my #sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, #fields1) = split /\t/, $s1;
my #regexens = map qr{\Q$_\E(.*)}, #fields1;
SAMPLE_2:
foreach my $s2 (#sample2) {
my (undef, #fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my #sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (#sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = #ARGV;
my #lines_a = File::Slurp::read_file($ina);
my #lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (#lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
if (exists $table_b->{$rest}) {
push #{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (#lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } #{ $table_b->{$rest} }) . "\n";
}
}

Not able to extract regex matches, return "1" instead of string

I am seeing strange problem..can someplease please help.
I have a log template that looks like this
CPU load: 0
Memory load: 7
User load: 0
Interface Information:
eth0: Up
eth1: Up
Processes Information:
Now, I login to my device and get the logs like
my #output = $ssh->exec("show details");
The output looks similar, as show below but different values for parameters
CPU load: 21
Memory load: 27
User load: 21
Interface Information:
eth0: Down
eth1: Up
Processes Information:
First I am opening the template file, split it into line by line and when I try to
compate it with "show details" output, for the matches, I am getting value 1 as result
and not the matched string. Can someone please help.
Code:
my #output = $ssh->exec("show details");
open (FH, "templates/SHOW.txt") || die "Could not open File: $!\n";
#file_array = <FH>;
#TemplateArray = split(/\n/,#file_array);
#matches = split(/\n/,#output);
foreach $keys (#matches) {
foreach (#TemplateArray) {
$keys =~ m/($_)/;
unshift (#result_array, $1);
}
}
print "\n #result_array\n";
}
I get "1" as result but no string.
When you use split on an array, the array will be in scalar context, and will only return the number of elements in it. In other words:
#TemplateArray = split(/\n/,#file_array);
#matches = split(/\n/,#output);
is equal to:
#TemplateArray = $#file_array;
#matches = $#output;
Which is why you get "1" as a result.
Also, if you are not already doing it:
use strict;
use warnings;
Adding to TLP's answer, the solution is to change
#matches = split(/\n/,#output);
to
#matches = map { split(/\n/, $_) } #output;
so split() operates on strings from #output.
split's expects a string for its second argument, so
#TemplateArray = split(/\n/, #file_array);
ends up being the same as
#TemplateArray = split(/\n/, scalar(#file_array));
Keep in mind that scalar(#file_array) returns the number of elements in the array.
#file_array = <FH>;
will populate #file_array as follows:
#file_array = (
"line1\n",
"line2\n",
"line3\n",
);
In other words, it's already split into lines. If you're trying to remove the trailing newlines, you want to replace
#TemplateArray = split(/\n/,#file_array);
with
chomp( my #TemplateArray = #file_array );
I can't help you fix
#matches = split(/\n/,#output);
because I don't know what $ssh contains, and thus I don't know what #output contains.
Please use
use strict;
use warnings;

How do I access captured substrings after a successful regex match in Perl?

I am searching for a string in Perl and storing it in another scalar variable. I want to print this scalar variable. The code below doesn't seem to work. I am not sure what is going wrong and what is the right way to go about it. Why is it printing '1' when it doesn't exist in the program?
Data it is running on
DATA
13 E 0.496 -> Q 0.724
18 S 0.507 -> R 0.513
19 N 0.485 -> S 0.681
21 N 0.557 -> K 0.482
The following is my code:
#!/usr/bin/perl
use strict;
use warnings;
my $find = '\s{10}[0-9]{2}\s[A-Z]'; #regex. it can also be '\s{10}[0-9]{2}\s[A-Z]'
#both dont seem to work
my #element;
open (FILE, "/user/run/test") || die "can't open file \n";
while (my $line = <FILE>) {
chomp ($line);
print "reached here1 \n"; # to test whether it reading the program properly
my $value = $line=~ /$find/ ;
print "reached here 3 \n"; # to test whether it reading the program properly
print "$value \n";
}
exit;
OUTPUT
reached here1
1
reached here 3
reached here1
1
reached here 3
The regex matching operation returns true (1) for match success, false otherwise. If you want to retrieve the match, you should try one of the following:
use the match variables $1, $2...
match in list context ($m1, $m2) = $string =~ /$regex/
Note that you need to use captures in your regex for these to work. Which you're not doing yet.
You ought to take a look at the complete documentation in perlop, section "Regexp Quote-Like Operators"
JB is correct. Your regular expression would need to use captures (which are defined by parentheses) for the individual parts to be collected. If you want to capture all of the elements in your line, you would want this:
my $find = '\s{10}([0-9]{2})\s([A-Z])';
my $field1;
my $field2;
while (my $line = <FILE>) {
chomp ($line);
if ($line=~ /$find/) {
$field1 = $1;
$field2 = $2;
# Do something with current line's field 1 and field 2
}
}
m// returns the captured matches in list context:
#!/usr/bin/perl
use strict;
use warnings;
my $pattern = qr/^\s{10}([0-9]{2})\s[A-Z]/;
while ( my $line = <DATA> ) {
if ( my ($n) = $line =~ $pattern ) {
print "$n\n";
}
}
__DATA__
13 E 0.496 -> Q 0.724
18 S 0.507 -> R 0.513
19 N 0.485 -> S 0.681
21 N 0.557 -> K 0.482
I'm unable to replicate your results. What I get is:
reached here1
reached here 3
1
reached here1
reached here 3
1
reached here1
reached here 3
1
reached here1
reached here 3
1
Regardless, it's printing 1 because you've told it to: the print statement to do so is inside the while loop, and what it's printing is an indication of whether or not the pattern matched.
You'd benefit from indenting your code properly:
#!/usr/bin/perl
use strict;
use warnings;
my $find = '\s{10}[0-9]{2}\s[A-Z]'; #regex. it can also be '\s{10}[0-9]{2}\s[A-Z]'
#both dont seem to work
my #element;
open (FILE, "foo") || die "can't open file \n";
while (my $line = <FILE>) {
chomp ($line);
print "reached here1 \n"; # to test whether it reading the program properly
my $value = $line=~ /$find/ ;
print "reached here 3 \n"; # to test whether it reading the program properly
print "$value \n";
}
exit;