Why do I get the first capture group only? - regex

(https://stackoverflow.com/a/2304626/6607497 and https://stackoverflow.com/a/37004214/6607497 did not help me)
Analyzing a problem with /proc/stat in Linux I started to write a small utility, but I can't get the capture groups the way I wanted.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}
For example with these input lines I get the output:
> cat /proc/stat
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
0
0 0
1 0
2 0
3 0
So the match actually works, but I don't get the capture groups into @vals (perls 5.18.2 and 5.26.1).

Only the last of the repeated matches from a single pattern is captured.
Instead, you can just split the line and then check (and adjust) the first field:
while (<$fh>) {
    my ($cpu, @vals) = split;
    next if not $cpu =~ s/^cpu//;
    print "$cpu $#vals\n";
}
If the first element returned by split doesn't start with cpu, the regex substitution fails and so the line is skipped. Otherwise, you get the number following cpu (or an empty string), as in the OP.†
Or, you can use the particular structure of the line you process:
while (<$fh>) {
    if (my ($cpu, @vals) = map { length($_) ? split(' ', $_) : $_ } /^cpu([0-9]*) \s+ (.*)/x) {
        print "$cpu $#vals\n";
    }
}
The regex returns two items and each is split in the map. The first one is passed as is into $cpu (being either a number or an empty string; the length check is needed because split returns no fields at all for an empty string), while the other yields the numbers.
Both these produce the needed output in my tests.
† Since we always check for ^cpu (and remove it), it makes sense to do that first and only then split, when needed. However, that gets a little tricky for the following reason.
A bare split skips leading (and trailing) whitespace by default, so for lines where the cpu string has no trailing digits (cpu 2709779...) we would end up with the next number in place of the cpu designation! A quiet error.
Thus we need to give split an explicit /\s+/ pattern, as it then keeps an empty leading field:
while (<$fh>) {
    next if not s/^cpu//;
    my ($cpu, @vals) = split /\s+/; # now $cpu may be an empty string
    print "$cpu $#vals\n";
}
This now works as intended, as the cpu without trailing numbers gets an empty string, a case to handle but clear. But this is misleading, and an unaware maintainer (or us, the proverbial six months later) may be tempted to remove the seemingly "unneeded" /\s+/, introducing an error.
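The difference between the two forms of split can be checked in isolation. This is a minimal sketch, assuming $rest holds what is left of the aggregate cpu line after s/^cpu// has run (the variable names are illustrative and not part of the answer):

```perl
use strict;
use warnings;

# Assumed sample: the aggregate "cpu" line after s/^cpu// removed the prefix.
my $rest = " 2709779 13999 551920";

my @default = split ' ', $rest;    # same as a bare split: leading whitespace is skipped
my @pattern = split /\s+/, $rest;  # leading whitespace produces an empty first field

print scalar(@default), " fields, first is '$default[0]'\n";
print scalar(@pattern), " fields, first is '$pattern[0]'\n";
```

With the /\s+/ form the first field is empty, so the cpu designation stays in its own slot and the values are not shifted by one.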

Going by the example input, the following content inside the while loop should work.
if (/^cpu(\d*)/) {
    my $cpu = $1;
    my (@vals) = /\s+(\d+)/g;
    print "$cpu $#vals\n";
}

In an exercise for Learning Perl, we state a problem that's easy to solve with two simple regexes but hard with one (but then in Mastering Perl I pull out the big guns). We don't tell people this because we want to highlight the natural behavior to try to write everything in a single regex. Some of the contortions in other answers remind me of that, and I wouldn't want to maintain any of them.
First, there's the issue of only processing the interesting lines. Then, once we have that line, grab all the numbers. Translating that problem statement into code is very simple and straightforward. No acrobatics here because assertions and anchors do most of the work:
use v5.10;
while( <DATA> ) {
    next unless /\A cpu(\d*) \s /ax;
    my $cpu = $1;
    my @values = / \b (\d+) \b /agx;
    say "$cpu " . @values;
}
__END__
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
Note that the OP still has to decide how to handle the cpu case with no trailing digits. Don't know what you want to do with the empty string.

Perl's regex engine will only remember the last capture of a repeated group. If you want to capture each number in a separate capture group, then one option would be to use an explicit regex pattern:
if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}

Replacing
while (<$fh>) {
    if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
with
while (<$fh>) {
    my @vals;
    if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(@vals, $^N) }))+$/) {
does what I wanted (requires perl 5.8 or newer).

Here's my example. I thought I'd add it because I like simple code. It also allows "cpu7" with no trailing digits.
#!/usr/bin/perl
use strict;
use warnings;
my $file = "/proc/stat";
open(my $fh, "<", $file) or die "$file: $!\n";
while (<$fh>)
{
    if ( /^cpu(\d+)(\s+)?(.*)$/ )
    {
        my $cpu = $1;
        my $vals = scalar split( /\s+/, $3 ) ;
        print "$cpu $vals\n";
    }
}
close($fh);

Just adding to Tim's answer:
You can capture multiple values with one group (using the g-modifier), but then you have to split the statement.
if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+))+$/) {
    my @vals = /(?:\s+(\d+))/g;
    print "$cpu $#vals\n";
}
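To see the two behaviours side by side, here is a minimal sketch that runs both patterns on one sample line. The helper name parse_cpu_line and the shortened sample line are assumptions made for illustration; they do not come from any of the answers.

```perl
use strict;
use warnings;

# Hypothetical helper: split one /proc/stat-style line into (cpu id, values).
# Returns an empty list for lines that do not start with "cpu".
sub parse_cpu_line {
    my ($line) = @_;
    return unless $line =~ /^cpu(\d*)\s/;
    my $cpu  = $1;
    my @vals = $line =~ /\s+(\d+)/g;   # /g in list context: one capture per match
    return ($cpu, @vals);
}

my $line = "cpu0 677679 3082 124900";  # shortened sample line

# A quantified group keeps only the text of its last iteration.
my ($id, $last) = $line =~ /^cpu(\d*)(?:\s+(\d+))+$/;
print "repeated group: $last\n";

my ($cpu, @vals) = parse_cpu_line($line);
print "global match:   @vals\n";
```

The single pattern validates the whole line but hands back one value; the second, global match is what collects every number.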

Related

Perl: Matching 3 pairs of numbers from 4 consecutive numbers

I am writing some code and I need to do the following:
Given a 4 digit number like "1234" I need to get 3 pairs of numbers (the first 2, the 2 in the middle, and the last 2), in this example I need to get "12" "23" and "34".
I am new to perl and don't know anything about regex. In fact, I am writing a script for personal use and I've started reading about Perl some days ago because I figured it was going to be a better language for the task at hand (need to do some statistics with the numbers and find patterns)
I have the following code but when testing I processed 6 digit numbers, because I "forgot" that the numbers I would be processing are 4 digits, so it failed with the real data, of course
foreach $item (@totaldata)
{
    my $match;
    $match = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
    if ($match)
    {
        ($arr1[$i], $arr2[$i], $arr3[$i]) = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
        $processednums++;
        $i++;
    }
}
Thank you.
You can move the last matching position with pos().
pos directly accesses the location used by the regexp engine to store the offset, so assigning to pos will change that offset.
my $item = 1234;
my @arr;
while ($item =~ /(\d\d)/g) {
    push @arr, $1;
    pos($item)--;
}
print "@arr\n"; # 12 23 34
The simplest way would be to use a global regex pattern search.
It is nearly always best to separate verification of the input data from processing, so the program below first rejects any values that are not four characters long or that contain a non-digit character.
Then the regex pattern finds all points in the string that are followed by two digits, and captures them.
use strict;
use warnings 'all';
for my $val ( qw/ 1234 6572 / ) {
    next if length($val) != 4 or $val =~ /\D/;
    my @pairs = $val =~ /(?=(\d\d))/g;
    print "@pairs\n";
}
output
12 23 34
65 57 72
Here's a pretty loud example demonstrating how you can use substr() to fetch out the portions of the number, while ensuring that what you're dealing with is in fact exactly a four-digit number.
use warnings;
use strict;
my ($one, $two, $three);
while (my $item = <DATA>){
    if ($item =~ /^\d{4}$/){
        $one = substr $item, 0, 2;
        $two = substr $item, 1, 2;
        $three = substr $item, 2, 2;
        print "one: $one, two: $two, three: $three\n";
    }
}
__DATA__
1234
abcd
a1b2c3
4567
891011
Output:
one: 12, two: 23, three: 34
one: 45, two: 56, three: 67
foreach $item (@totaldata) {
    if ( my @match = $item =~ m/(?=(\d\d))/g ) {
        ($heads[$i], $middles[$i], $tails[$i]) = @match;
        $processednums++;
        $i++;
    }
}

Changing time format using regex in perl

I want to read 12h format time from a file and replace it with 24 hour time.
Example:
this is due at 3:15pm -> this is due 15:15
I tried saving variables in the regex and manipulating them later but it didn't work. I also tried using substitution ("s///") but because it is a variable I couldn't figure it out.
Here is my code:
while (<>) {
    my $line = $_;
    print ("this is text before: $line \n");
    if ($line =~ m/\d:\d{2}pm/g){
        print "It is PM! \n";}
    elsif ($line =~ m/(\d):(\d\d)am/g){
        print "this is try: $line \n";
        print "Its AM! \n";}
    $line =~ s/($regexp)/<French>$lexicon{$1}<\/French>/g;
    print "sample after : $line\n";
}
A simple script can do the work for you:
$str="this is due at 3:15pm";
$str=~m/\D+(\d+):(\d+)(.*)$/;
$hour=($3 eq "am")? ( ($1 == 12 )? 0 : $1 ) : ($1 == 12 ) ? $1 :$1+12;
$min=$2;
$str=~s/at.*/$hour:$min/g;
print "$str\n";
It gives this output:
this is due 15:15
What does it do?
$str=~m/\D+(\d+):(\d+)(.*)$/; tries to match the string with the regex.
\D+ matches anything other than digits. Here it matches "this is due at ".
(\d+) matches one or more digits. Here it matches 3. Captured in group 1, $1, which is the hours.
: matches :
(\d+) matches one or more digits. Here it matches 15, which is the minutes. Captured in group 2, $2.
(.*) matches anything that follows, here pm. Captured in group 3, $3.
$ anchors the regex at the end of the string.
$hour=($3 eq "am")? ( ($1 == 12 )? 0 : $1 ) : ($1 == 12 ) ? $1 :$1+12; converts to a 24 hour clock. If $3 is pm it adds 12, unless the hour is 12. If the time is am and the hour is 12, then the hour is 0.
$str=~s/at.*/$hour:$min/g; substitutes everything from at to the end of the string with $hour:$min, which is the time obtained from the ternary operation performed before.
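For lines read from a file, the same conversion can also be done in place with s///e. This is a minimal sketch, assuming times look like 3:15pm or 12:30 PM. The helper names hhmm_24h and to_24h are made up for illustration. Unlike the script above, this keeps the surrounding text (including "at") and zero-pads the hour; both are choices of this sketch rather than requirements from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: turn hour, minute and an a/p marker into 24-hour HH:MM.
sub hhmm_24h {
    my ($hour, $min, $meridiem) = @_;
    $hour %= 12;                           # 12am and 12pm both start from 0
    $hour += 12 if lc($meridiem) eq 'p';   # pm adds twelve hours
    return sprintf '%02d:%02d', $hour, $min;
}

# Hypothetical helper: rewrite every 12-hour time found in a string.
sub to_24h {
    my ($text) = @_;
    $text =~ s/\b(\d{1,2}):(\d{2})\s*([ap])m\b/hhmm_24h($1, $2, $3)/gie;
    return $text;
}

print to_24h("this is due at 3:15pm"), "\n";
```

Inside a while (<>) loop this would be called once per line, for example print to_24h($_).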
#!/usr/bin/env perl
use strict;
use warnings;
my $values = time_12h_to_24h("11:00 PM");
sub time_12h_to_24h
{
    my($t12) = @_;
    my($hh,$mm,$ampm) = $t12 =~ m/^(\d\d?):(\d\d?)\s*([AP]M?)/i;
    $hh = ($hh % 12) + (($ampm =~ m/AM?/i) ? 0 : 12);
    return sprintf("%.2d:%.2d", $hh, $mm);
}
I found this code at the link below. Please check:
Is my pseudo code for changing time format correct?
Try this; it gives what you expect:
my @data = <DATA>;
foreach my $sm (@data) {
    if ($sm =~ /(12)\.\d+(pm)/g) {
        print "$&\n";
    }
    elsif ($sm =~ m/(\d+(\.)?\d+)(pm)/g) {
        print $1+12,"\n";
    }
}
__DATA__
Time 3.15am
Time 3.16pm
Time 5.17pm
Time 1.11am
Time 1.01pm
Time 12.11pm

Storing Numerical Data in a Variable through matching in Perl

I am a beginner at Perl and want to store some data from a file format into a variable. Specifically, each line of the file has a format like the following:
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
I was able to use matching to get the line I wanted (with the code below).
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
}
However, I want to store the three coordinate values as variables or in a hash. Is it possible to use matching to store the coordinate values rather than having to use substring.
In this instance I would simply split each record into an array and verify the identifying fields. The coordinate values can simply be extracted from the array if the line has been found to be relevant.
Like this
use strict;
use warnings;
my $residue_name = 'ASP';
my $residue_number = 72;
while (<DATA>) {
    my @fields = split;
    next unless $fields[0] eq 'ATOM'
        and $fields[2] eq 'CB'
        and $fields[3] eq $residue_name
        and $fields[5] == $residue_number;
    my @coords = @fields[6, 7, 8];
    print "@coords\n";
}
__DATA__
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
output
-2.80100 -7.45000 -2.09400
You can get the end of the line which is AFTER the match with $' (see http://perldoc.perl.org/perlvar.html), and split around spaces like in :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
(undef, $x, $y, $z) = split /\s+/;
...
}
(undef is necessary, because $_ will start with some spaces, thus the first variable will be empty)
You can also write something like :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
/\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)/;
($x, $y, $z) = ($1, $2, $3);
}
In fact, as always in Perl, there are lots of ways to do it...
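One more of those ways is to capture the three coordinates directly in the match and store them in a hash slice. This is a minimal sketch, assuming the field layout of the sample line in the question; the helper name cb_coords and the number pattern in $num are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return (x, y, z) for a matching CB atom line,
# or an empty list when the line does not match.
sub cb_coords {
    my ($line, $residue_name, $residue_number) = @_;
    my $num = qr/-?\d+(?:\.\d+)?/;   # signed decimal such as -2.80100
    return $line =~ /^ATOM\s+\d+\s+CB\s+\Q$residue_name\E\s+\d+\s+\Q$residue_number\E\s+($num)\s+($num)\s+($num)/;
}

my $line = 'ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0';

my %coord;
@coord{qw(x y z)} = cb_coords($line, 'ASP', 72);   # list context returns the captures
print "$coord{x} $coord{y} $coord{z}\n";
```

Because the match is evaluated in list context, the captures come back directly and no substr or $' handling is needed.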

populating a multi - level hash

I have this data set - I am only concerned with the class id, the process@server:
I am trying to load all three into a multi-level hash.
class_id: 995 (OCAive ) ack_type: NOACK supercede: NO copy: NO bdcst: NO
PID SENDS RECEIVES RETRANSMITS TIMEOUTS MEAN S.D. #
21881 (rocksrvr@ith111 ) 1 1 0 0
24519 (miclogUS@ith110 ) 1 1 0 0
26163 (gkoopsrvr@itb101 ) 1 1 0 0
28069 (sitetic@ith100 ) 23 4 0 0
28144 (srtic@ithb10 ) 33 5 0 0
29931 (loick@ithbsv115 ) 1 1 0 0
87331 (rrrrv_us@ith102 ) 1 1 0 0
---------- ---------- ---------- ---------- ------- ------- ------
61 14 0 0
When I try to populate the hash, it does not get much (Data::Dumper results below):
$VAR1 = '';
$VAR2 = {
'' => undef
};
This is the code:
#!/usr/bin/perl -w
use strict;
my $newcomm_dir = "/home/casper-compaq/work_perl/newcomm";
my $newcomm_file = "newcomm_stat_result.txt";
open NEWCOMM , "$new_comm_dir$newcomm_file";
while (<NEWCOMM>) {
    if ($_ =~ /\s+class_id:\s\d+\s+((\w+)\s+)/){
        my $message_class = $1;
    }
    if ($_ =~ /((\w+)\@\w+\s+)/) {
        my $process = $1;
    }
    if ($_ =~ /(\w+\@(\w+)\s+)/) {
        my $servers = $1;
    }
    my $newcomm_stat_hash{$message_class}{$servers}=$process;
    use Data::Dumper;
    print Dumper (%newcomm_stat_hash);
}
Apart from the declarations problem, there are also multiple problems with your regular expressions. I would suggest making sure you get the expected output into each of the variables prior to trying to insert it into a hash.
For one thing, you need to match literal parentheses as \( and \); otherwise they are interpreted as capture groups.
Your my declarations inside the if only have scope until the end of the if block, and your hash assignment shouldn't have a my. Try declaring %newcomm_stat_hash before the while loop and $message_class, $process, and $servers at the top of the while block.
You also probably want to check your open for failure; I suspect you are missing a / there.
I suspect your main error is that your file is not opened, due to the path missing a /, between directory and filename. You can use the autodie pragma to check that the open succeeds, or you can use or die "Can't open file: $!".
You have some scope issues. First off, $message_class will be undefined throughout the loop, since its scope lasts only inside one iteration. You probably also want to have the hash outside the loop, if you want to be able to use it later on.
I put a next statement in the header line check, as the other checks will be invalid in that particular line. If you want to be more precise, you can put the whole thing outside the loop, and just do a single line check.
You do not need the two variables for process and server, just use them directly, and both at the same time.
Lastly, you probably want to send the reference to the hash to Data::Dumper in the print, else the hash will expand, and the print will be somewhat misleading.
#!/usr/bin/perl -w
use strict;
use autodie;
my $newcomm_dir = "/home/casper-compaq/work_perl/newcomm/";
my $newcomm_file = "newcomm_stat_result.txt";
open my $fh, '<', "$newcomm_dir$newcomm_file";
my $message_class;
my %newcomm_stat_hash;
while (<$fh>) {
    if (/^\s*class_id:\s+\d+\s+\((\w+)\s*\)\s+/){
        $message_class = $1;
        next;
    }
    if (/(\w+)\@(\w+)/) {
        $newcomm_stat_hash{$message_class}{$2}=$1;
    }
}
use Data::Dumper;
print Dumper \%newcomm_stat_hash;
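The point about passing a reference to Data::Dumper can be seen with a tiny hash. This is a minimal sketch with made-up sample data; the %stat hash and the two variable names are illustrative only.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order in the output

# Assumed sample: one class, one server, one process.
my %stat = ( OCAive => { ith111 => 'rocksrvr' } );

my $flattened = Dumper(%stat);    # the hash expands to a list: $VAR1 is the key, $VAR2 the value
my $as_ref    = Dumper(\%stat);   # a single $VAR1 holding the whole structure

print $flattened, $as_ref;
```

This also explains the shape of the output in the question: with the hash passed directly, an empty key and its value show up as separate $VAR1 and $VAR2 entries.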
Is your file being opened?
I think you need to change:
open NEWCOMM , "$new_comm_dir$newcomm_file";
To:
open NEWCOMM , $new_comm_dir.'/'.$newcomm_file;

How do I access captured substrings after a successful regex match in Perl?

I am searching for a string in Perl and storing it in another scalar variable. I want to print this scalar variable. The code below doesn't seem to work. I am not sure what is going wrong and what is the right way to go about it. Why is it printing '1' when it doesn't exist in the program?
Data it is running on
DATA
13 E 0.496 -> Q 0.724
18 S 0.507 -> R 0.513
19 N 0.485 -> S 0.681
21 N 0.557 -> K 0.482
The following is my code:
#!/usr/bin/perl
use strict;
use warnings;
my $find = '\s{10}[0-9]{2}\s[A-Z]'; #regex. it can also be '\s{10}[0-9]{2}\s[A-Z]'
#both dont seem to work
my @element;
open (FILE, "/user/run/test") || die "can't open file \n";
while (my $line = <FILE>) {
chomp ($line);
print "reached here1 \n"; # to test whether it reading the program properly
my $value = $line=~ /$find/ ;
print "reached here 3 \n"; # to test whether it reading the program properly
print "$value \n";
}
exit;
OUTPUT
reached here1
1
reached here 3
reached here1
1
reached here 3
The regex matching operation returns true (1) for match success, false otherwise. If you want to retrieve the match, you should try one of the following:
use the match variables $1, $2...
match in list context ($m1, $m2) = $string =~ /$regex/
Note that you need to use captures in your regex for these to work, which you're not doing yet.
You ought to take a look at the complete documentation in perlop, section "Regexp Quote-Like Operators"
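The difference between the two contexts can be shown on one of the data lines. This is a minimal sketch; the pattern is simplified (no leading-whitespace requirement) and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $line = "13 E 0.496 -> Q 0.724";

# Scalar context: the match yields 1 on success, not the matched text.
my $matched = $line =~ /(\d{2})\s([A-Z])/;

# List context: the same match yields the captured substrings.
my ($number, $letter) = $line =~ /(\d{2})\s([A-Z])/;

print "scalar: $matched\n";
print "list:   $number $letter\n";
```

The parentheses around the variables on the left-hand side are what put the match into list context.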
JB is correct. Your regular expression would need to use captures (which are defined by parentheses) for the individual parts to be collected. If you want to capture all of the elements in your line, you would want this:
my $find = '\s{10}([0-9]{2})\s([A-Z])';
my $field1;
my $field2;
while (my $line = <FILE>) {
    chomp ($line);
    if ($line=~ /$find/) {
        $field1 = $1;
        $field2 = $2;
        # Do something with current line's field 1 and field 2
    }
}
m// returns the captured matches in list context:
#!/usr/bin/perl
use strict;
use warnings;
my $pattern = qr/^\s{10}([0-9]{2})\s[A-Z]/;
while ( my $line = <DATA> ) {
    if ( my ($n) = $line =~ $pattern ) {
        print "$n\n";
    }
}
__DATA__
13 E 0.496 -> Q 0.724
18 S 0.507 -> R 0.513
19 N 0.485 -> S 0.681
21 N 0.557 -> K 0.482
I'm unable to replicate your results. What I get is:
reached here1
reached here 3
1
reached here1
reached here 3
1
reached here1
reached here 3
1
reached here1
reached here 3
1
Regardless, it's printing 1 because you've told it to: the print statement to do so is inside the while loop, and what it's printing is an indication of whether or not the pattern matched.
You'd benefit from indenting your code properly:
#!/usr/bin/perl
use strict;
use warnings;
my $find = '\s{10}[0-9]{2}\s[A-Z]'; #regex. it can also be '\s{10}[0-9]{2}\s[A-Z]'
#both dont seem to work
my @element;
open (FILE, "foo") || die "can't open file \n";
while (my $line = <FILE>) {
    chomp ($line);
    print "reached here1 \n"; # to test whether it reading the program properly
    my $value = $line=~ /$find/ ;
    print "reached here 3 \n"; # to test whether it reading the program properly
    print "$value \n";
}
exit;