Perl Text Extraction

Perl Text Extraction - regex

I just need to extract the numbers in each of these items and store them separately, whats the best way to do this ?
IF the data is something like
p °c 4'i
App data usage stats
E
iii
! 12:12PM
Received data
Sent data
Current usage
598KB
28KB
626KB :18%
Get Current Stat Browser App
J
Battery Level
I tried this, but I get only 18 as an output in this case.
foreach my $line (#lines) {
if ($line =~/ :[ ]*(\d+)[ ]*(KB|%)/) {
$value = $1;
print "the value is $value\n";
push (#array, $1);
}
}
Thanks,

Loop over every line, and using a regular expression
foreach my $line (#lines) {
if ($line =~ /(\d+)/) {
push (#array, $1);
}
}
And you'll have all the numbers in your #array array

Here's one way to do it. Note that it does not care about which kind of numbers it extracts, as per your request.
It splits the line on colons in max two fields, key and value. Then we extract numbers from the values and insert into the hash. This part will effectively skip all lines where values do not contain numbers. This is also where you would insert stricter checks, e.g. if ($value =~ /(\d+)\s*KB/i) would only capture numbers followed by KB (I opted to add case insensitivity).
use strict;
use warnings;
use Data::Dumper;
my %hash;
while (<DATA>) {
my ($key, $value) = split /\s*:\s*/, $_, 2;
if ($value =~ /(\d+)/) {
$hash{$key} = $1;
}
}
print Dumper \%hash;
__DATA__
Received data : 598 KB
Sent data : 28 KB
Current usage : 626 KB
Battery Level : 35 %
Output:
$VAR1 = {
'Sent data' => '28',
'Current usage' => '626',
'Battery Level' => '35',
'Received data' => '598'
};

Related

Preventing "foo" from matching "foo-bar" with grep -w

I am using grep inside my Perl script and I am trying to grep the exact keyword that I am giving. The problem is that "-w" doesn't recognize the "-" symbol as a separator.
example:
Let's say that I have these two records:
A1BG 0.0767377011073753
A1BG-AS1 0.233775553296782
if I give
grep -w "A1BG"
it returns both of them but I want only the exact one.
Any suggestions?
Many thanks in advance.
PS.
Here is my whole code.
The input file is a two-columns tab separated. So, I want to keep a unique value for each gene. In cases that I have more than one record, I calculate the average.
#!/usr/bin/perl
use strict;
use warnings;
#Find the average fc between common genes
sub avg {
my $total;
$total += $_ foreach #_;
return $total / #_;
}
my #mykeys = `cat G13_T.txt| awk '{print \$1}'| sort -u`;
foreach (#mykeys)
{
my #TSS = ();
my $op1 = 0;
my $key = $_;
chomp($key);
#print "$key\n";
my $command = "cat G13_T.txt|grep -E '([[:space:]]|^)$key([[:space:]]|\$)'";
#my $command = "cat Unique_Genes/G13_T.txt|grep -w $key";
my #belongs= `$command`;
chomp(#belongs);
my $count = scalar(#belongs);
if ($count == 1) {
print "$belongs[0]\n";
}
else {
for (my $i = 0; $i < $count; $i++) {
my #token = split('\t', $belongs[$i]);
my $lfc = $token[1];
push (#TSS, $lfc);
}
$op1 = avg(#TSS);
print $key ."\t". $op1. "\n";
}
}

If I got clarifications in comments right, the objective is to find the average of values (second column) for unique names in the first column. Then there is no need for external tools.
Read the file line by line and add up values for each name. The name uniqueness is granted by using a hash, with names being keys. Along with this also track their counts
use warnings;
use strict;
use feature 'say';
my $file = shift // die "Usage: $0 filename\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my %results;
while (<$fh>) {
#my ($name, $value) = split /\t/;
my ($name, $value) = split /\s+/; # used for easier testing
$results{$name}{value} += $value;
++$results{$name}{count};
}
foreach my $name (sort keys %results) {
$results{$name}{value} /= $results{$name}{count}
if $results{$name}{count} > 1;
say "$name => $results{$name}{value}";
}
After the file is processed each accumulated value is divided by its count and overwritten by that, so by its average (/= divides and assigns), if count > 1 (as a small measure of efficiency).
If there is any use in knowing all values that were found for each name, then store them in an arrayref for each key instead of adding them
while (<$fh>) {
#my ($name, $value) = split /\t/;
my ($name, $value) = split /\s+/; # used for easier testing
push #{$results{$name}}, $value;
}
where now we don't need the count as it is given by the number of elements in the array(ref)
use List::Util qw(sum);
foreach my $name (sort keys %results) {
say "$name => ", sum(#{$results{$name}}) / #{$results{$name}};
}
Note that a hash built this way needs memory comparable to the file size (or may even exceed it), since all values are stored.
This was tested using the shown two lines of sample data, repeated and changed in a file. The code does not test the input in any way, but expects the second field to always be a number.
Notice that there is no reason to ever step out of our program and use external commands.

You may use a POSIX ERE regex with grep like this:
grep -E '([[:space:]]|^)A1BG([[:space:]]|$)' file
To return matches (not matching lines) only:
grep -Eo '([[:space:]]|^)A1BG([[:space:]]|$)' file
Details
([[:space:]]|^) - Group 1: a whitespace or start of line
A1BG - a substring
([[:space:]]|$) - Group 2: a whitespace or end of line

Perl: Trying to speed up parsing a delimited file

I have a large flat text file with lines that hold name/value pairs ("varname=value"). These pairs are seperated by a multi-character delimiter. So a single line in this file might look like this:
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
Each line holds about 50 name/value pairs.
I need to iterate through the lines of this file (there are about 100,000 lines) and store the name/value pairs in a hash so that
$field{'var1'} = value1
$field{'var2'} = value2
etc...
What I did was this:
# $line holds a single line from the file
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
Doing this for each line of the entire file takes (on my PC) about 2 seconds. This doesn't seem like a long time, but I really want to speed this up by quite a bit.
Of this 2 seconds, the first split takes about 0.6 seconds, while the foreach loop takes about 1.4 seconds. So I thought I'd get rid of the foreach loop and put it all in a single split:
%hash = split( /\Q|^|\E|=/, $line );
Much to my surprise, parsing the entire file this way took a full second longer! My question isn't really why this takes longer (although it would be a nice bonus to understand why), but my question is if there are any other (faster) ways to get the job done.
Thanks in advance.
------ Edit below this line ------
I just found out that changing this:
%hash = split( /\Q|^|\E|=/, $line );
into this:
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
makes it three times faster! Parsing the entire file this way now takes just over a second...
------ Snippet below this line ------
use strict;
use Time::HiRes qw( time );
my $line = "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
ResetTimer();
my %hash;
for( my $i = 1; $i <= 100000; $i++ ) {
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i <= 100000; $i++ ) {
%hash = split( /\Q|^|\E|=/, $line );
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i<=100000; $i++ ) {
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
}
print Elapsed() . "\n";
################################################################################################################################
BEGIN {
my $startTime;
sub ResetTimer {
$startTime = time();
return $startTime;
}
sub Elapsed {
return time() - $startTime;
}
}

I can't easily answer your performance question, because I'd need a test case. But I'd guess that it's to do with how the regular expression is being processed.
You can see what that's doing with use re 'debug';, and that'll print the regular expression steps.
But for the broader question - I'd probably just tackle it with a global (assuming your data is as simple as the example):
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
my %row = m/(\w+)=(\w+)/g;
print Dumper \%row;
}
__DATA__
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
You can use lookahead/behind to match delimiters if you've got more complicated things in there, but because it's one regex per line, you're invoking the regex engine less often, and that'll probably be faster. (But I can't tell you for sure without a test case).
If your data is more complicated, then perhaps:
my %row = s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
This will 'force' splitting the input into a new line, and then match 'anything' = 'anything'. But that's probably overkill unless your values include whitespace/pipes/metachars.
With editing your test case to use Benchmark:
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw ( cmpthese );
my $line =
"a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
sub double_split {
my %hash;
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ( $name, $value ) = split( /=/, $field );
$hash{$name} = $value;
}
}
sub single_split {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub re_replace_then_split {
$line =~ s/\Q|^|\E/=/g;
my %hash = split( /=/, $line );
}
sub single_regex {
my %hash = $line =~ m/(\w+)=(\w+)/g;
}
sub compound {
my %hash = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
}
cmpthese(
1_000_000,
{ "Double Split" => \&double_split,
"single split with regex" => \&single_split,
"Replace then split" => \&re_replace_then_split,
"Single Regex" => \&single_regex,
"regex to linefeed them match" => \&compound
}
);
Looks like the results come out like:
Rate Double Split single split with regex Single Regex Replace then split regex to linefeed them match
Double Split 18325/s -- -4% -34% -56% -97%
single split with regex 19050/s 4% -- -31% -54% -97%
Single Regex 27607/s 51% 45% -- -34% -96%
Replace then split 41733/s 128% 119% 51% -- -93%
regex to linefeed them match 641026/s 3398% 3265% 2222% 1436% --
... I'm a bit suspicious of that last, because that's absurdly faster. There's probably caching of results happening there.
But looking at it, what's slowing you down is the alternation in the regex:
sub single_split_with_alt {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub single_split {
my %hash = split( /[\|\^\=]+/, $line );
}
(I know that latter might not be quite what you want, but it's for illustrative purposes)
Gives:
Rate alternation single split
alternation 19135/s -- -37%
single split 30239/s 58% --
But there does come a point where this is moot, because your limiting factor is disk IO, not CPU.

Parsing data with perl- capturing a range of text

I'm writing code to parse all the interfaces on my network, looking for certain configurations.. etc.
the data looks like this:
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
Basically starting with "^interface " and ending "!".
my current algorithm is to "record" the data I need
foreach $line (#input) {
if ( $line =~ m/^interface.+\d/ && $line !~ m/interface Embedded-Service-Engine|BRI|TenGigabitEthernet|vlan|Port-channel|ATM|loopback/i) {
$record = 1;
}
#$int ne '' is to handle the rest of the file not in this format
if( $line =~ m/!/ && $int ne '') {
#save data in format 'interface fa2,mode trunk'
#if the interface doesn't have port-security
push(#intlist, join(','split("\r\n",$int))."\n") unless $int =~ m/port-security/;
$record=0;
$int='';
}
if ($record) {
$int.=$line;
}
}
while this works in my case, I'd like a simply way to do it. I've searched and found that you can use the range operator '..' on regex
which turns my code into :
#input # contains the file
#interfaces = grep (/^interface.+\d/ .. /!/, #input);
which gives me all the interface data, the problem is now every line is a single element in the #interfaces array. how can I then split this data up so everything from /^interface.+\d/ .. /!/ is one element in this array without creating more for loops?
The goal is to get it down to one element so I can then scan it for interfaces I don't want to look at interface Embedded-Service-Engine|BRI|TenGigabit as well as interfaces that have the correct configurations.

Have a look at $/ because I think that'll help. It's the record separator - which defaults to \n.
Then you can apply regular expressions to the current 'chunk' to pull out the data you require - by default a regular expression/capture group applies to $_ the implicit variable.
E.g.
#!/usr/bin/perl
use strict;
use warnings;
local $/ = '!';
while ( <DATA> ) {
my ( $interface ) = m/Interface (\w+)/i;
next if $interface =~ m/Embedded-Service-Engine/;
my ( $mode ) = m/mode (\w+)/;
print "$interface $mode\n";
print "---next record---\n";
}
__DATA__
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
If you need to keep the data for other uses (e.g. 'process as you go' isn't suitable) then the tool for the job is a hash.
You can either use something like the above - and populate the hash with particular keys you're interested in - or use the magic of map to do it for you.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $/ = '!';
my %interfaces;
while ( <DATA> ) {
my ( $interface ) = m/Interface (\w+)/i;
next if $interface =~ m/Embedded-Service-Engine/;
my %interface_values = map { my ( $key, $value ) = split; $key, $value || 1 } grep { /\w/ } split ( "\n" );
$interfaces{$interface} = \%interface_values;
}
print Dumper \%interfaces
__DATA__
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
That map line basically:
splits the current record on \n to get each line.
filters 'not word' values (so blank lines and !)
splits each line on whitepace, to get a key and value pair.
If no value is defined, sets it to 1. (so in the example, port-security )
Populates a hash with these key-value pairs.
and then updates %interfaces with the hash for each interface ID.
Giving something like:
$VAR1 = {
'fa1' => {
'port-security' => 1,
'mode' => 'access',
'Interface' => 'fa1',
'mac-address' => 'sticky'
},
'fa2' => {
'mode' => 'trunk',
'interface' => 'fa2'
}
};

A hash or hashref would be a result where you can work with. Furthermore, reading records based on a fixed structure can be read using a matching regex. Like so:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
our %MATCH;
*MATCH = \%+;
# read file into variable
my ( $file, $data ) = ( 'interfaces.txt', undef );
open( my $fh, '<', $file ) or die "cannot open file $file";
{
local $/;
$data = <$fh>;
}
close($fh);
print Dumper $data;
my $regex = qr{
(?sm)
interface [^\w\n]+
(?<interface> (\w[^\n]+))
[^\w]+
mode [^\w]+
(?<mode> (\w[^\n]+))
[^\w]+
((?<portsecurity> port-security)
[^\w]+)? # port-security is optional
(mac-address [^\w]+
(?<macaddress> (\w[^\n]+))
)? # mac-address is optional
[^!]*
!
}x;
my $results = {};
while ( $data =~ m/$regex/g ) {
my $interface = $MATCH{interface};
$results->{$interface} = { mode => $MATCH{mode} ? $MATCH{mode} : '' };
$results->{$interface}->{'port-security'} = 1
if defined $MATCH{portsecurity};
$results->{$interface}->{macaddress} = $MATCH{macaddress}
if defined $MATCH{macaddress};
}
print Dumper $results;
The result from your input is:
$VAR1 = {
'fa1' => {
'macaddress' => 'sticky',
'mode' => 'access',
'port-security' => 1
},
'fa2' => {
'mode' => 'trunk'
}
};
Having a hash with the interface names as key values, gives you the opportunity to use a 'grep' for the interfaces you want.
If your structure is not fixed - there is no ordering in your fields mode, port-security, mac-address - then you would need to read a interface record in one go, and split up the fields using separate regexes for each field.

This is my Final solution. In this particular case I'm searching for all switchports that have a maximum port-security not equal to 1. This is just an example and can be switched for any configuration. I'm also omitting certain interfaces from being caught if that configuration is actually applied to them.
#!/usr/bin/perl
$MDIR='/currentConfig';
#list of interfaces you don't want to see to filter output
#omit =(
'MANAGEMENT.PORT',
'sup.mgmt',
'Internal.EtherSwitch',
'Router',
'ip address \d',
'STRA'
);
#join with '|' to form the regex
$dontwant = join('|',#omit);
#search criteria
$search='switchport port-security maximum [^1]';
opendir(DIR,$MDIR) or die $!;
#dirContents=readdir DIR;close DIR;
foreach $file (#dirContents) {
open(IN,$MDIR.'/'.$file) or die $!;
#record seperator to !
$/='!';
my #inFile=<IN>; close IN;
#since the record seperator has been changed, '^' won't match beginning of line
my #ints = grep (/\ninterface/i,#inFile);
#set record seperator back to normal
$/="\n";
foreach $int (#ints) {
if ( $int =~ m/$search/i && $int !~ m/$dontwant/) {
push(#finalint,$int);
}
}
}
#just list the interfaces found, i'll use this to make it comma seperated
foreach $elem (#finalint) {
print $elem;
}

Perl Regular Expression Pattern

I have some data as such :
TYPE: Travel
ADDRESS
Barcelona
Paris
So, address can be 1 or many (I need to discard ADDRESS and get only those cities). For some reason my parsing fails (only "ADDRESS" is printed) to produce the correct result.Am i missing something ?
elsif (/^ADDRESS/) {
my #address_t = split /[no matter what i put,only ADDRESS is printed]+/, $_;
shift #address_t; #is this how i will discard ADDRESS ?
foreach my $address (#address_t) {
#address_names = ($address);
}
I think the regex is suppose to be split a newline, space ?
This is how i processed TYPE:
elsif (/^TYPE/) {
my #type_t = split '\s', $_;
$type = $type_tmp[1];
print "$type" ; #to test, but i have a hashmap which i load them in and print at the end of the file.
Thanks

use warnings;
use strict;
while(<DATA>) {
if (/^ADDRESS/) { # if line contains ADDRESS then read addresses
while (<DATA>) { # ... in a loop
last if !/^ +/; # until we find a non-indented line
print $_; # here you can push $_ to a list
}
}
if ($_ && /^TYPE/) { # a TYPE after address can be processed now
# stuff
}
}
__DATA__
TYPE: Travel
ADDRESS
Barcelona
Paris
TYPE: Travel
ADDRESS
Barcelona
Paris
Produces:
Barcelona
Paris
Barcelona
Paris

Try something like this:
It will print lines if the previous line matches /^ADDRESS/. Let me know if there's a point at which you want to stop, and I can adjust...
use warnings;
use strict;
my $current_line = "";
my $line_count = 0;
while (<IN>){
chomp;
my $previous_line = $current_line;
$current_line = $_;
if ($previous_line =~ /^ADDRESS/ or $line_count > 0 ){
$line_count++;
print "$current_line\n"
}
}

Storing Numerical Data in a Variable through matching in Perl

I am a beginner at Perl and want to store some data from a file format into a variable. Specifically, each line of the file has a format like the following:
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
I was able to use matching to get the line I wanted (with the code below).
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
}
However, I want to store the three coordinate values as variables or in a hash. Is it possible to use matching to store the coordinate values rather than having to use substring.

In this instance I would simply split each record into an array and verify the identifying fields. The coordinate values can simply be extracted from the array if the line has been found to be relevant.
Like this
use strict;
use warnings;
my $residue_name = 'ASP';
my $residue_number = 72;
while (<DATA>) {
my #fields = split;
next unless $fields[0] eq 'ATOM'
and $fields[2] eq 'CB'
and $fields[3] eq $residue_name
and $fields[5] == $residue_number;
my #coords = #fields[6, 7, 8];
print "#coords\n";
}
__DATA__
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
output
-2.80100 -7.45000 -2.09400

You can get the end of the line which is AFTER the match with $' (see http://perldoc.perl.org/perlvar.html), and split around spaces like in :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
(undef, $x, $y, $z) = split /\s+/;
...
}
(undef is necessary, because $_ will start with some spaces, thus the first variable will be empty)
You can also write something like :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
/\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)/;
($x, $y, $z) = ($1, $2, $3);
}
In fact, as always in Perl, there are lots of ways to do it...

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Perl Text Extraction - regex

Loop over every line, and using a regular expression foreach my $line (#lines) { if ($line =~ /(\d+)/) { push (#array, $1); } } And you'll have all the numbers in your #array array

Related

Preventing "foo" from matching "foo-bar" with grep -w

Perl: Trying to speed up parsing a delimited file

Parsing data with perl- capturing a range of text

Perl Regular Expression Pattern

Storing Numerical Data in a Variable through matching in Perl

Categories

Resources