I am trying to parse through a simple enough file of field and value pairs.
So some fields I am not interested in and I want to skip
So in my "play" code I had a static thing like this:
next if $field =~ m/fieldToIgnore1|fieldToIgnore2/;
... then I extended this an an array and still happy
print "== using ~~ ==\n";
foreach my $field (#fields) {
next if $field ~~ #foni;
print "$field\n";
}
(fnoi == fields not of interest)
But when I carry that over back into my non-play setup it doesn't work.
Now in the play I was just looping over
my #fields = ("field1", "field2");
my #foni = ("fieldToIgnore1", "fieldToIgnore1");
In my proper code I go through each line and take out the lines that are setup like field - value lines and then strip out the field into a scalar... hence why I thought it would the same idea as my play code - but it doesn't seem to be
while ( <$infile> ) {
if ( /^PUBLISH:/ ) {
( $symbol, $record_type ) = ( $1, $2 );
print "symbol is: [$symbol]\n";
} else {
my ( $field, $value ) = split(/\|/);
next unless $value;
print "field is: [$field]\n";
print "value is: [$value]\n";
$field =~ s/^\s+|\s+$//g;
$value =~ s/^\s+|\s+$//g;
print "... field is: [$field]\n";
print "... value is: [$value]\n";
## ADD FIELD SKIPPING LOGIC HERE
You can build a regex pattern from your array, like this
my $re = join '|', #foni;
$re = qr/$re/; # Compile the regex
for my $field (#fields) {
next if $field =~ $re;
...
}
Related
I have a large flat text file with lines that hold name/value pairs ("varname=value"). These pairs are seperated by a multi-character delimiter. So a single line in this file might look like this:
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
Each line holds about 50 name/value pairs.
I need to iterate through the lines of this file (there are about 100,000 lines) and store the name/value pairs in a hash so that
$field{'var1'} = value1
$field{'var2'} = value2
etc...
What I did was this:
# $line holds a single line from the file
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
Doing this for each line of the entire file takes (on my PC) about 2 seconds. This doesn't seem like a long time, but I really want to speed this up by quite a bit.
Of this 2 seconds, the first split takes about 0.6 seconds, while the foreach loop takes about 1.4 seconds. So I thought I'd get rid of the foreach loop and put it all in a single split:
%hash = split( /\Q|^|\E|=/, $line );
Much to my surprise, parsing the entire file this way took a full second longer! My question isn't really why this takes longer (although it would be a nice bonus to understand why), but my question is if there are any other (faster) ways to get the job done.
Thanks in advance.
------ Edit below this line ------
I just found out that changing this:
%hash = split( /\Q|^|\E|=/, $line );
into this:
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
makes it three times faster! Parsing the entire file this way now takes just over a second...
------ Snippet below this line ------
use strict;
use Time::HiRes qw( time );
my $line = "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
ResetTimer();
my %hash;
for( my $i = 1; $i <= 100000; $i++ ) {
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i <= 100000; $i++ ) {
%hash = split( /\Q|^|\E|=/, $line );
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i<=100000; $i++ ) {
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
}
print Elapsed() . "\n";
################################################################################################################################
BEGIN {
my $startTime;
sub ResetTimer {
$startTime = time();
return $startTime;
}
sub Elapsed {
return time() - $startTime;
}
}
I can't easily answer your performance question, because I'd need a test case. But I'd guess that it's to do with how the regular expression is being processed.
You can see what that's doing with use re 'debug';, and that'll print the regular expression steps.
But for the broader question - I'd probably just tackle it with a global (assuming your data is as simple as the example):
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
my %row = m/(\w+)=(\w+)/g;
print Dumper \%row;
}
__DATA__
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
You can use lookahead/behind to match delimiters if you've got more complicated things in there, but because it's one regex per line, you're invoking the regex engine less often, and that'll probably be faster. (But I can't tell you for sure without a test case).
If your data is more complicated, then perhaps:
my %row = s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
This will 'force' splitting the input into a new line, and then match 'anything' = 'anything'. But that's probably overkill unless your values include whitespace/pipes/metachars.
With editing your test case to use Benchmark:
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw ( cmpthese );
my $line =
"a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
sub double_split {
my %hash;
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ( $name, $value ) = split( /=/, $field );
$hash{$name} = $value;
}
}
sub single_split {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub re_replace_then_split {
$line =~ s/\Q|^|\E/=/g;
my %hash = split( /=/, $line );
}
sub single_regex {
my %hash = $line =~ m/(\w+)=(\w+)/g;
}
sub compound {
my %hash = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
}
cmpthese(
1_000_000,
{ "Double Split" => \&double_split,
"single split with regex" => \&single_split,
"Replace then split" => \&re_replace_then_split,
"Single Regex" => \&single_regex,
"regex to linefeed them match" => \&compound
}
);
Looks like the results come out like:
Rate Double Split single split with regex Single Regex Replace then split regex to linefeed them match
Double Split 18325/s -- -4% -34% -56% -97%
single split with regex 19050/s 4% -- -31% -54% -97%
Single Regex 27607/s 51% 45% -- -34% -96%
Replace then split 41733/s 128% 119% 51% -- -93%
regex to linefeed them match 641026/s 3398% 3265% 2222% 1436% --
... I'm a bit suspicious of that last, because that's absurdly faster. There's probably caching of results happening there.
But looking at it, what's slowing you down is the alternation in the regex:
sub single_split_with_alt {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub single_split {
my %hash = split( /[\|\^\=]+/, $line );
}
(I know that latter might not be quite what you want, but it's for illustrative purposes)
Gives:
Rate alternation single split
alternation 19135/s -- -37%
single split 30239/s 58% --
But there does come a point where this is moot, because your limiting factor is disk IO, not CPU.
How can I print and get index from regex here:
my $search1 = "aaaNAMEaaaa";
my $search2 = "bbbbCHECKbbb";
if ( $search1 =~ /na\we/i and $search2 =~ /che\wk/i ) {
print "String found\n";
# This works with one search
# my $matched = $&;
# my $pos = index( $search1, $matched );
}
If both expressions match, $& will only have the last match. i.e. for the example above $& will always have the value CHECK and never have NAME because it was overwritten by the second pattern match.
You can wrap this logic in a function, then call that function as many times as you'd like with different string, pattern combinations:
use strict;
use warnings;
my $search1 = "aaaNAMEaaaa";
my $search2 = "bbbbCHECKbbb";
print index_from_match($search1, qr/na\we/i), "\n"; # 3
print index_from_match($search2, qr/che\wk/i), "\n"; # 4
print index_from_match($search1, qr/che\wk/i), "\n"; # -1
sub index_from_match {
my ($s, $pattern) = #_;
# uses a capture group instead of $&
if ( my ($match) = $s =~ m/($pattern)/ ) {
return index($s, $match);
}
return -1;
}
The core problem is that you're doing two regex comparisons in a single expression, so the values for the first one are lost before they can be processed
It's really hard to see how to help you without understanding the program flow within the conditional statement and how you actually use those values
Other languages use the idea of a match object, and it's easy to simulate that here by writing a subroutine that returns either a [ string, offset ] pair if the pattern matched, or undef if not. It's also less wasteful to use the built-in #- and #+ arrays to provide the values needed instead of repeating the search with index
It would look like this
use strict;
use warnings 'all';
use Carp 'croak';
my $search1 = 'aaaNAMEaaaa';
my $search2 = 'bbbbCHECKbbb';
my $match1 = match($search1, /na\we/i);
my $match2 = match($search2, qr/che\wk/i);
if ( $match1 and $match2 ) {
print "String found\n";
printf qq{"%s" found at offset %d\n}, #$match1;
printf qq{"%s" found at offset %d\n}, #$match2;
}
sub match {
my ($s, $re) = #_;
croak "Compiled regex required" unless ref $re eq 'Regexp';
return unless $s =~ $re;
[ substr($s, $-[0], $+[0]-$-[0]), $-[0] ];
}
output
String found
"NAME" found at offset 3
"CHECK" found at offset 4
I think it would also be neater to write this as
my $match1 = match($search1, qr/na\we/i);
my $match2 = match($search2, qr/che\wk/i);
if ( $match1 and $match2 ) {
print "String found\n";
printf qq{"%s" found at offset %d\n}, #$match1;
printf qq{"%s" found at offset %d\n}, #$match2;
}
Break your compound if into multiple if statements
my $search1 = "aaaNAMEaaaa";
my $search2 = "bbbbCHECKbbb";
my ($matched1, $matched2, $pos1, $pos2);
if ( $search1 =~ /na\we/i) {
$matched1 = $&;
$pos1 = index( $search1, $matched1);
if ($search2 =~ /che\wk/i ) {
print "String found\n";
$matched2 = $&;
$pos2 = index( $search2, $matched2);
#do whatever you need with $pos1 & $pos2
} else {
#reset previously set vars
undef $matched1;
undef $pos1;
}
}
I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11
I need to get values from lines like <fpage>327</fpage> and <lpage>335</lpage> and use their difference to replace NaNin a line with <page-count count="NaN"/>.
sample input file:
...many lines
<volume>74</volume>
<issue>3</issue>
<fpage>327</fpage>
<lpage>335</lpage>
...many lines
<counts><fig-count count="3"/><table-count count="2"/><equation-count count="0"/><ref-count count="37"/><page-count count="0"/></counts>
...many lines
sample output file desired:
...many lines
<volume>74</volume>
<issue>3</issue>
<fpage>327</fpage>
<lpage>335</lpage>
...many lines
<counts><fig-count count="3"/><table-count count="2"/><equation-count count="0"/><ref-count count="37"/><page-count count="8"/></counts>
...many lines
Here is what I am trying but I am getting <page-count count="0"/>:
while ( <$input> ) {
my $fpage = $1 if $fpage =~ m/<fpage>(\d+)/;
my $lpage = $1 if $lpage =~ m/<lpage>(\d+)/;
my $pages = $lpage - $fpage;
$_ =~ s!<page-count count="NaN"/>!<page-count count="${pages}"/>!;
print {$output} $_;
}
What am I doing wrong?
You're not actually testing the input for fpage and lpage. Try something like:
while ( my $in = <$input> ) {
my $fpage = $1 if $in =~ /<fpage>(\d+)/;
my $lpage = $1 if $in =~ /<lpage>(\d+)/;
my $pages = $lpage - $fpage;
$in =~ s!<page-count count="NaN"/>!<page-count count="${pages}"/>!;
print {$output} $in;
}
Note: This will only work if the entire block of text you're matching and replacing is available in each iteration of the while loop.
1) Your variables $fpage and $lpage go out of scope on each iteration. You can extend the scope by moving their decleration outsidde the while loop.
2) =~ does not do what you seem to want it to do. the command $fpage =~ m/<fpage>(\d+)/ is telling the Regex to search inside the variable $fpage. The default variable to search is $_, so your use of $_ =~ is redundant (and bad style)
3) If NaN occurs multiple times, your current code would only catch the first occurence.
my $fpage;
my $lpage;
while ( <$input> ) {
$fpage = $1 if $fpage =~ m/<fpage>(\d+)/;
$lpage = $1 if $lpage =~ m/<lpage>(\d+)/;
my $pages = $lpage - $fpage;
s!<page-count count="NaN"/>!<page-count count="${pages}"/>!;
print {$output} $_;
}
This performs no checks that you actually find fpage and lpage before the page-count.
Normally if you wish to change a variable with regex you do this:
$string =~ s/matchCase/changeCase/;
But is there a way to simply do the replace inline without setting it back to the variable?
I wish to use it in something like this:
my $name="jason";
print "Your name without spaces is: " $name => (/\s+/''/g);
Something like that, kind of like the preg_replace function in PHP.
Revised for Perl 5.14.
Since 5.14, with the /r flag to return the substitution, you can do this:
print "Your name without spaces is: [", do { $name =~ s/\s+//gr; }
, "]\n";
You can use map and a lexical variable.
my $name=" jason ";
print "Your name without spaces is: ["
, ( map { my $a = $_; $a =~ s/\s+//g; $a } ( $name ))
, "]\n";
Now, you have to use a lexical because $_ will alias and thus modify your variable.
The output is
Your name without spaces is: [jason]
# but: $name still ' jason '
Admittedly do will work just as well (and perhaps better)
print "Your name without spaces is: ["
, do { my ( $a = $name ) =~ s/\s+//g; $a }
, "]\n";
But the lexical copying is still there. The assignment within in the my is an abbreviation that some people prefer (not me).
For this idiom, I have developed an operator I call filter:
sub filter (&#) {
my $block = shift;
if ( wantarray ) {
return map { &$block; $_ } #_ ? #_ : $_;
}
else {
local $_ = shift || $_;
$block->( $_ );
return $_;
}
}
And you call it like so:
print "Your name without spaces is: [", ( filter { s/\s+//g } $name )
, "]\n";
print "Your name without spaces is: #{[map { s/\s+//g; $_ } $name]}\n";