(https://stackoverflow.com/a/2304626/6607497 and https://stackoverflow.com/a/37004214/6607497 did not help me)
Analyzing a problem with /proc/stat in Linux I started to write a small utility, but I can't get the capture groups the way I wanted.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}
For example with these input lines I get the output:
> cat /proc/stat
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
0
0 0
1 0
2 0
3 0
So the match actually works, but I don't get the capture groups into @vals (Perl 5.18.2 and 5.26.1).
Only the last of the repeated matches from a single pattern is captured.
Instead, you can just split the line and then check, and adjust, the first field:
while (<$fh>) {
    my ($cpu, @vals) = split;
    next if not $cpu =~ s/^cpu//;
    print "$cpu $#vals\n";
}
If the first element of the split's return doesn't start with cpu, the regex substitution fails and so the line is skipped. Otherwise, you get the number following cpu (or an empty string), as in OP.†
Or, you can use the particular structure of the line you process:
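while (<$fh>) {
    if (my ($cpu, $rest) = /^cpu([0-9]*) \s+ (.*)/x) {
        my @vals = split ' ', $rest;
        print "$cpu $#vals\n";
    }
}
The regex returns two items: the first goes straight into $cpu (being either a number or an empty string), while the second is split to yield the numbers. Note that splitting both items inside a map would not work here, because split on an empty string returns an empty list, so for the bare cpu line the first number would silently end up in $cpu.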
Both these produce the needed output in my tests.
† Since we always check for ^cpu (and remove it), it makes sense to do that first and only then split, when needed. However, that gets a little tricky for the following reason.
A bare split strips leading (and trailing) whitespace by default, so for lines where the cpu string has no trailing digits (cpu 2709779...) we would end up with the next number in place of the cpu designation! A quiet error.
Thus we need to give split an explicit /\s+/ pattern, which keeps an empty leading field instead of discarding the leading whitespace:
while (<$fh>) {
    next if not s/^cpu//;
    my ($cpu, @vals) = split /\s+/; # now $cpu may be an empty string
    print "$cpu $#vals\n";
}
This now works as intended: the cpu without trailing digits gets an empty string, a case to handle but a clear one. But this is misleading, and an unaware maintainer (or us, the proverbial six months later) may be tempted to remove the seemingly "unneeded" /\s+/, introducing an error.
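The same pitfall can be reproduced outside Perl. The following is only an illustrative sketch in Python (not the code from this answer), assuming the line has already had its leading "cpu" removed: the default split discards leading whitespace, while an explicit whitespace pattern keeps an empty first field.

```python
import re

# What remains of the aggregate "cpu" line after the prefix is removed
# (shortened sample taken from the question's input).
line = " 2709779 13999 551920"

# Default split: leading whitespace is discarded, so the first number
# would be mistaken for the cpu designation.
default_fields = line.split()

# Explicit pattern: the leading whitespace produces an empty first field,
# which stands for "no cpu number".
explicit_fields = re.split(r"\s+", line)

print(default_fields)
print(explicit_fields)
```

The empty first field is what lets the caller tell the aggregate line apart from the per-CPU lines.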
Going by the example input, the following content inside the while loop should work.
if (/^cpu(\d*)/) {
    my $cpu = $1;
    my (@vals) = /\s+(\d+)/g;
    print "$cpu $#vals\n";
}
In an exercise for Learning Perl, we state a problem that's easy to solve with two simple regexes but hard with one (but then in Mastering Perl I pull out the big guns). We don't tell people this because we want to highlight the natural behavior to try to write everything in a single regex. Some of the contortions in other answers remind me of that, and I wouldn't want to maintain any of them.
First, there's the issue of only processing the interesting lines. Then, once we have that line, grab all the numbers. Translating that problem statement into code is very simple and straightforward. No acrobatics here because assertions and anchors do most of the work:
use v5.10;
while( <DATA> ) {
    next unless /\A cpu(\d*) \s /ax;
    my $cpu = $1;
    my @values = / \b (\d+) \b /agx;
    say "$cpu " . @values;
}
__END__
cpu 2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106 ...
Note that the OP still has to decide how to handle the cpu case with no trailing digits. Don't know what you want to do with the empty string.
Perl's regex engine will only remember the last capture group from a repeated expression. If you want to capture each number in a separate capture group, then one option would be to use an explicit regex pattern:
if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}
Replacing
while (<$fh>) {
    if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
with
while (<$fh>) {
    my @vals;
    if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(@vals, $^N) }))+$/) {
does what I wanted (requires perl 5.8 or newer).
Here's my example. I thought I'd add it because I like simple code. It also allows a line such as "cpu7" with no values after it.
#!/usr/bin/perl
use strict;
use warnings;
my $file = "/proc/stat";
open(my $fh, "<", $file) or die "$file: $!\n";
while (<$fh>)
{
    if ( /^cpu(\d+)(\s+)?(.*)$/ )
    {
        my $cpu = $1;
        my $vals = scalar split( /\s+/, $3 ) ;
        print "$cpu $vals\n";
    }
}
close($fh);
Just adding to Tim's answer:
You can capture multiple values with one group (using the g-modifier), but then you have to split the statement.
if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+))+$/) {
    my @vals = /(?:\s+(\d+))/g;
    print "$cpu $#vals\n";
}
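The "last capture only" behavior is not specific to Perl, which may make the two-step approach easier to accept. Below is a minimal sketch in Python rather than Perl; the function name parse_cpu_line is made up for illustration. It shows a repeated group keeping only its final iteration, and a second, global match collecting every number.

```python
import re

def parse_cpu_line(line):
    """Return (cpu_id, values) for a 'cpuN ...' line, or None for other lines."""
    # First pass: validate the line and isolate the run of numbers.
    head = re.match(r"cpu(\d*)((?:\s+\d+)+)\s*$", line)
    if head is None:
        return None
    # Second pass: findall returns every match, not just the last capture.
    return head.group(1), re.findall(r"\d+", head.group(2))

# A repeated capture group only remembers its final iteration:
last_only = re.match(r"cpu(\d*)(?:\s+(\d+))+$", "cpu0 677679 3082 124900")
print(last_only.groups())

print(parse_cpu_line("cpu0 677679 3082 124900"))
```

Here the first print shows just the cpu number and the last value, while the second shows all three values.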
I want to make a scraper that grabs all the <a id href="">...</a> elements from some website. The format of the elements is:
<a id href="/model.aspx?modelid=886874">Samsung Galaxy Note 4 SM-N910F</a>
The only part that changes is the integer in ?modelid=integer. How do I make a regular expression for this?
try this:
$re = "/<a[^\"]*href=\"([^\"]*)\"[^>]*>([^<]+)<\\/a>/mi";
$str = "<a id href=\"sjdkg\">...</a>\n<a id href=\"sjdkg\">.dg..</a>";
preg_match_all($re, $str, $matches);
$matches[1]; // for href
$matches[2]; // for innertext
var_dump($matches);
output:
array
  0 =>
    array
      0 => string '<a id href="sjdkg">...</a>' (length=26)
      1 => string '<a id href="sjdkg">.dg..</a>' (length=28)
  1 =>
    array
      0 => string 'sjdkg' (length=5)
      1 => string 'sjdkg' (length=5)
  2 =>
    array
      0 => string '...' (length=3)
      1 => string '.dg..' (length=5)
This is the regex you need:
modelid\=\d\d\d\d\d\d\">(.*)</a>
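For comparison, here is a small sketch of the same idea in Python (not PHP), assuming every link has the /model.aspx?modelid=NUMBER form shown in the question. The names LINK_RE and extract_models are hypothetical. Instead of fixing the id at six digits, it accepts any run of digits and captures both the id and the link text.

```python
import re

# Assumes the href always looks like /model.aspx?modelid=<digits>.
LINK_RE = re.compile(
    r'<a\b[^>]*href="/model\.aspx\?modelid=(\d+)"[^>]*>([^<]*)</a>',
    re.IGNORECASE,
)

def extract_models(html):
    """Return a list of (model_id, link_text) pairs found in the HTML."""
    return [(int(model_id), text) for model_id, text in LINK_RE.findall(html)]

sample = '<a id href="/model.aspx?modelid=886874">Samsung Galaxy Note 4 SM-N910F</a>'
print(extract_models(sample))
```

For anything beyond links of this fixed shape, a real HTML parser is usually the more robust choice.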
I have written a small Perl "hack" to replace 1s with letters of the alphabet in a range of columns in a tab-delimited file. The file looks like this:
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
The goal is to take columns 11 through 16 and append a letter (A to Z, one per column) whenever a value of 1 is seen in those columns. My code so far produces empty output, and this is my first time using regular expressions.
cat infile.txt \
| perl -ne '@alphabet=("A".."Z");
    $is_known_intron = 0;
    $is_known_donor = 1;
    $is_known_acceptor = 1;
    chomp;
    $_ =~ s/^\s+//;
    @d = split /\s+/, $_;
    @d_bool=@d[$11-$16];
    $ct=1;
    $known_intron = $d[$10];
    $num_of_overlapping_gene = $d[$9];
    $known_acceptor = $d[$8];
    $known_donor = $d[$7];
    $k="";
    if (($known_intron == $is_known_intron) and ($known_donor == $is_known_donor) and ($known_acceptor == $is_known_acceptor)) {
        for ($i = 0; $i < scalar @d_bool; $i++){
            $k.=$alphabet[$i] if ($d_bool[$i])
        }
        $alphabet_ct{$k}+=$ct;
    }
    END
    {
        foreach $k (sort keys %alphabet_ct){
            print join("\t", $k, $alphabet_ct{$k}), "\n";
        }
    } '\
> Outfile.txt
What should I be doing instead?
Thanks!
* Edit *
Expected Output
ABCD 45
BCD 23
ABCDEF 1215
so on and so forth.
I converted your code into a script for ease of debugging. I've put comments in the code to point out dodgy bits:
use strict;
use warnings;
my %alphabet_ct;
my @alphabet = ( "A" .. "Z" );
my $is_known_intron = 0;
my $is_known_donor = 1;
my $is_known_acceptor = 1;
while (<DATA>) {
    # don't process the first line
    next unless /chr10/;
    chomp;
    # this should remove whitespace at the beginning of the line but is doing nothing as there is none
    $_ =~ s/^\s+//;
    my @d = split /\s+/, $_;
    # the range operator in perl is .. (not "-")
    my @d_bool = @d[ 10 .. 15 ];
    my $known_intron = $d[9];
    my $known_acceptor = $d[7];
    my $known_donor = $d[6];
    my $k = "";
    # this expression is false for all the data in the sample you provided as
    # $is_known_intron is set to 0
    if ( ( $known_intron == $is_known_intron )
        and ( $known_donor == $is_known_donor )
        and ( $known_acceptor == $is_known_acceptor ) )
    {
        for ( my $i = 0; $i < scalar @d_bool; $i++ ) {
            $k .= $alphabet[$i] if $d_bool[$i];
        }
        # it is more idiomatic to write $alphabet_ct{$k}++;
        # $alphabet_ct{$k} += $ct;
        $alphabet_ct{$k}++;
    }
}
foreach my $k ( sort keys %alphabet_ct ) {
    print join( "\t", $k, $alphabet_ct{$k} ) . "\n";
}
__DATA__
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
With $is_known_intron set to 1, the sample data gives the results:
ABCDE 1
ABCE 1
ACD 1
CE 2
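The counting logic can also be written without index arithmetic. This is a rough sketch in Python (not the Perl from this answer), assuming whitespace-separated columns in the order shown in the sample header; the function name tally_tissues and its known_intron parameter are invented for illustration.

```python
from collections import Counter

def tally_tissues(rows, known_intron="1"):
    """Count letter codes built from columns 11-16 (Colon..Stomach)."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        # Skip the header line and anything too short to hold 16 columns.
        if len(fields) < 16 or fields[0] == "Chr":
            continue
        donor, acceptor, gencode = fields[6], fields[7], fields[9]
        if (donor, acceptor, gencode) != ("1", "1", known_intron):
            continue
        # One letter per tissue column, kept only where the flag is 1.
        key = "".join(
            letter for letter, flag in zip("ABCDEF", fields[10:16]) if flag == "1"
        )
        counts[key] += 1
    return counts

sample_rows = [
    "Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach",
    "chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0",
    "chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0",
    "chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0",
    "chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0",
    "chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0",
]
for key, count in sorted(tally_tissues(sample_rows).items()):
    print(f"{key}\t{count}")
```

Pairing the letters with the column values via zip avoids the off-by-one mistakes that come from translating 1-based column numbers into 0-based indices by hand.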
I'm trying to extract the specific lines from a trace file like below:
- 0.118224 0 7 ack 40 ------- 1 2.0 7.0 0 2
r 0.118436 1 2 tcp 40 ------- 2 7.1 2.1 0 1
+ 0.118436 1 2 ack 40 ------- 2 3.1 2.1 0 3
- 0.118436 1 2 ack 40 ------- 2 4.1 2.1 0 3
r 0.120256 0 7 ack 40 ------- 1 2.0 7.0 0 2
I want to extract any line that has the following form:
r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x.
Note: x means any value, and numbers can be anything from 3 to 7.
Here is my attempt, which is not working:
if {[regexp \r+ ([0-9.]+) 1 2.*- ([3-7.]+) 2.*- ([0-9.]+) $line -> time]}
Any suggestions?
Here's another approach: extract the fields you want to use for comparison
while {[gets $f line] != -1} {
    lassign [split $line] a - b c - - - - d e - -
    if {
        $a eq "r" &&
        $b == 1 &&
        $c == 2 &&
        3 <= floor($d) && floor($d) <= 7 &&
        floor($e) == 2
    } {
        puts $line
    }
}
You have to escape the . with a \. It means "any character" in regexp.
So your regexp could look like:
if {[regexp {r \d\.\d{5} 1 2 \d{3} \d{2} ------- \d [3-7]\.\d 2\.\d \d \d} $line -> time ]} {
# ...
}
Now you have to place () around the part you want.
Btw: I used the following transformation on your description of what you want to match:
set input {r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x}
set re [subst [regsub -all {x{2,}} $input {\\\\d{[string length \0]}}]]
set re [string map {. {\.} x {\d} numbers {[3-7]}} $re]
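To illustrate the two points above (escaped dots and parentheses around the wanted part), here is a hedged sketch in Python rather than Tcl. It assumes single spaces between fields, as in the sample trace, and loosens the digit counts so that the sample timestamps match; TRACE_RE and receive_time are hypothetical names.

```python
import re

# Dots are escaped so they match a literal "." only; the parentheses
# capture the timestamp. Digit counts are deliberately not fixed.
TRACE_RE = re.compile(
    r"^r (\d+\.\d+) 1 2 \S+ \d+ -+ \d+ [3-7]\.\d+ 2\.\d+ \d+ \d+$"
)

def receive_time(line):
    """Return the timestamp of a matching 'r' line as a string, else None."""
    m = TRACE_RE.match(line.strip())
    return m.group(1) if m else None

print(receive_time("r 0.118436 1 2 tcp 40 ------- 2 7.1 2.1 0 1"))
print(receive_time("r 0.120256 0 7 ack 40 ------- 1 2.0 7.0 0 2"))
```

Only the first sample line satisfies every field constraint; the second fails on the node numbers and on the address fields.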
I have a line such as this:
andy_1972 * andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270
I want the end result to be the bolded characters, like this:
andy_1972 119075
I am trying to just trim the line down to the word and the 4th number from the end of the line.
How can I do this using regex? I'm using Notepad++
This will match the first word and the fourth-from-last number:
^(\w+).* (\d+) \d+ \d+ \d+$
In perl-compatible (perl or PCRE) that would be
$string = "andy_1972 * andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270";
$string =~ /^(\w+).* (\d+) \d+ \d+ \d+$/;
print "$1 $2\n";
Using cut:
echo andy_1972 \* andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270 |
cut -d' ' -f1,10
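If the number of columns varies, counting from the end is safer than a fixed field number. The following is a small sketch in Python, assuming whitespace-separated fields and at least four of them; the function name is made up for illustration.

```python
def name_and_fourth_from_last(line):
    """Return the first field and the fourth field counted from the end."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least four fields")
    return fields[0], fields[-4]

sample = "andy_1972 * andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270"
print(" ".join(name_and_fourth_from_last(sample)))
```

Unlike cut -f1,10, this keeps working if extra columns appear in the middle of the line.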