Perl: Trying to speed up parsing a delimited file - regex

I have a large flat text file with lines that hold name/value pairs ("varname=value"). These pairs are seperated by a multi-character delimiter. So a single line in this file might look like this:
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
Each line holds about 50 name/value pairs.
I need to iterate through the lines of this file (there are about 100,000 lines) and store the name/value pairs in a hash so that
$field{'var1'} = value1
$field{'var2'} = value2
etc...
What I did was this:
# $line holds a single line from the file
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
Doing this for each line of the entire file takes (on my PC) about 2 seconds. This doesn't seem like a long time, but I really want to speed this up by quite a bit.
Of this 2 seconds, the first split takes about 0.6 seconds, while the foreach loop takes about 1.4 seconds. So I thought I'd get rid of the foreach loop and put it all in a single split:
%hash = split( /\Q|^|\E|=/, $line );
Much to my surprise, parsing the entire file this way took a full second longer! My question isn't really why this takes longer (although it would be a nice bonus to understand why), but my question is if there are any other (faster) ways to get the job done.
Thanks in advance.
------ Edit below this line ------
I just found out that changing this:
%hash = split( /\Q|^|\E|=/, $line );
into this:
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
makes it three times faster! Parsing the entire file this way now takes just over a second...
------ Snippet below this line ------
use strict;
use Time::HiRes qw( time );
my $line = "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
ResetTimer();
my %hash;
for( my $i = 1; $i <= 100000; $i++ ) {
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i <= 100000; $i++ ) {
%hash = split( /\Q|^|\E|=/, $line );
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i<=100000; $i++ ) {
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
}
print Elapsed() . "\n";
################################################################################################################################
BEGIN {
my $startTime;
sub ResetTimer {
$startTime = time();
return $startTime;
}
sub Elapsed {
return time() - $startTime;
}
}

I can't easily answer your performance question, because I'd need a test case. But I'd guess that it's to do with how the regular expression is being processed.
You can see what that's doing with use re 'debug';, and that'll print the regular expression steps.
But for the broader question - I'd probably just tackle it with a global (assuming your data is as simple as the example):
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
my %row = m/(\w+)=(\w+)/g;
print Dumper \%row;
}
__DATA__
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
You can use lookahead/behind to match delimiters if you've got more complicated things in there, but because it's one regex per line, you're invoking the regex engine less often, and that'll probably be faster. (But I can't tell you for sure without a test case).
If your data is more complicated, then perhaps:
my %row = s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
This will 'force' splitting the input into a new line, and then match 'anything' = 'anything'. But that's probably overkill unless your values include whitespace/pipes/metachars.
With editing your test case to use Benchmark:
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw ( cmpthese );
my $line =
"a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
sub double_split {
my %hash;
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ( $name, $value ) = split( /=/, $field );
$hash{$name} = $value;
}
}
sub single_split {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub re_replace_then_split {
$line =~ s/\Q|^|\E/=/g;
my %hash = split( /=/, $line );
}
sub single_regex {
my %hash = $line =~ m/(\w+)=(\w+)/g;
}
sub compound {
my %hash = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
}
cmpthese(
1_000_000,
{ "Double Split" => \&double_split,
"single split with regex" => \&single_split,
"Replace then split" => \&re_replace_then_split,
"Single Regex" => \&single_regex,
"regex to linefeed them match" => \&compound
}
);
Looks like the results come out like:
Rate Double Split single split with regex Single Regex Replace then split regex to linefeed them match
Double Split 18325/s -- -4% -34% -56% -97%
single split with regex 19050/s 4% -- -31% -54% -97%
Single Regex 27607/s 51% 45% -- -34% -96%
Replace then split 41733/s 128% 119% 51% -- -93%
regex to linefeed them match 641026/s 3398% 3265% 2222% 1436% --
... I'm a bit suspicious of that last, because that's absurdly faster. There's probably caching of results happening there.
But looking at it, what's slowing you down is the alternation in the regex:
sub single_split_with_alt {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub single_split {
my %hash = split( /[\|\^\=]+/, $line );
}
(I know that latter might not be quite what you want, but it's for illustrative purposes)
Gives:
Rate alternation single split
alternation 19135/s -- -37%
single split 30239/s 58% --
But there does come a point where this is moot, because your limiting factor is disk IO, not CPU.

Related

Searching Perl array with regex and return single capturing group only

I have a Perl script in which I perform web service calls in a loop. The server returns a multivalued HTTP header that I need to parse after each call with information that I will need to make the next call (if it doesn't return the header, I want to exit the loop).
I only care about one of the values in the header, and I need to get the information out of it with a regular expression. Let's say the header is like this, and I only care about the "foo" value:
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
I can get the header values like this: #values = $response->header( 'X-Header' );. But how do I quickly check if
There is a foo value, and
Parse and save the foo value for the next iteration?
Ideally, I'd like to do something like this:
my $value = 'default';
do {
# (do HTTP request; use $value)
#values = $response->header( 'X-Header' );
} while( $value = first { /(?:test-)([^;]+)(?:; blah=foo)/ } #values );
But grep, first (from List::Util), etc. all return the entire match and not just the single capturing group I want. I want to avoid cluttering up my code by looping over the array and matching/parsing inside the loop body.
Is what I want possible? What would be the most compact way to write it? So far, all I can come up with is using lookarounds and \K to discard the stuff I don't care about, but this isn't super readable and makes the regex engine perform a lot of unnecessary steps.
So it seems that you want to catch the first element with a certain pattern, but acquire only the pattern. And you want it done nicely. Indeed, first and grep only pass the element itself.
However, List::MoreUtils::first_result does support processing of its match
use List::MoreUtils 0.406 qw(first_result);
my #w = qw(a bit c dIT); # get first "it" case-insensitive
my $res = first_result { ( /(it)/i )[0] } #w;
say $res // 'undef'; #--> it
That ( ... )[0] is needed to put the regex in the list context so that it returns the actual capture. Another way would be firstres { my ($r) = /(it)/i; $r }. Pick your choice
For the data in the question
use warnings;
use strict;
use feature 'say';
use List::MoreUtils 0.406 qw(firstres);
my #data = (
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
);
if (my $r = firstres { ( /test-([^;]+);\s+blah=foo/ )[0] } #data) {
say $r
}
Prints abc12345, clarified in a comment to be the sought result.
Module versions prior to 0.406 (of 2015-03-03) didn't have firstres (alias first_result)
first { ... } #values returns one the values (or undef).
You could use either of these:
my ($value) = map { /...(...).../ } #values;
my $value = ( map { /...(...).../ } #values ) ? $1 : undef;
my $value = ( map { /...(...).../ } #values )[0];
Using first, it would look like the following, which is rather silly:
my $value = first { 1 } map { /...(...).../ } #values;
However, assuming the capture can't be an empty string or the string 0, List::MoreUtils's first_result could be used to avoid the unnecessary matches:
my $value = first_result { /...(...).../ ? $1 : undef } #values;
my $value = first_result { ( /...(...).../ )[0] } #values;
If the returned value can be false (e.g. an empty string or a 0) you can use something like
my $value = first_result { /...(...).../ ? \$1 : undef } #values;
$value = $$value if $value;
The first_result approach isn't necessarily faster in practice.
Following code snippet is looking for foo stored in a variable $find, the found values is stored in variable $found.
my $find = 'foo';
my $found;
while( $response->header( 'X-Header' ) ) {
if( /X-Header: .*?blah=($find)/ ) {
$found = $1;
last;
}
}
say $found if $found;
Sample demo code
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $find = 'foo';
my $found;
my #header = <DATA>;
chomp(#header);
for ( #header ) {
$found = $1 if /X-Header: .*?blah=($find)/;
last if $found;
}
say Dumper(\#header);
say "Found: $found" if $found;
__DATA__
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
Output
$VAR1 = [
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
];
Found: foo

In perl, Is there a more compact way to search for a number of patterns, and for each one, substitute with an expression

In perl, I am reading a line and trying to replace a set of strings with corresponding expressions using a sequence of if statements. For example:
my #shiftInstructions=("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)]
$line =~ s/\$ish/$r/;
}
}
I don't like this approach for a number of reasons. First, it is repetitive. I have to first check if the pattern exists, and then if it does, execute a function to generate a replacement value, then substitute. So it is both verbose, and slow (2 regex searches per pattern, perhaps eventually dozens of pattern strings).
I thought of a map where a number of codes are mapped to corresponding code to execute.
I can imagine mapping to a string and then using eval but then I can't check the code except at runtime. Is there any cleaner way of doing this?
I found the execute option in regex. What about writing a set of subroutines to process each regex, then creating a mapping:
my %regexMap = (
"\$fn", &foundFunc,
"\$hw", &hex8,
"\$hb", &hex2,
"\$sh", &rand6,
"\$ish", &shiftInst,
);
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish|)/$regexMap{$1}/e;
print $line;
}
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
is a poor way of writing
$line =~ s/\$sh/ int(rand(6)) /e;
So
my #shiftInstructions=("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)]
$line =~ s/\$ish/$r/;
}
print($line);
}
can be written as
my #shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$sh/ int(rand(6)) /e;
$line =~ s/\$ish/ $shiftInstructions[rand(#shiftInstructions)] /e;
print($line);
}
But that means you are scanning the string over and over again. Let's avoid that.
my #shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$(sh|ish)/
if ( $1 eq "sh" ) { int(rand(6)) }
elsif ( $1 eq "ish" ) { $shiftInstructions[rand(#shiftInstructions)] }
/eg;
print($line);
}
Unfortunately, that reintroduces repetition. We can solve that using a dispatch table.
my #shiftInstructions = qw( lsr lsl rol ror );
my %replacements = (
sh => sub { int(rand(6)) },
ish => sub { $shiftInstructions[rand(#shiftInstructions)] },
);
my $alt = join '|', map quotemeta, keys(%replacements);
my $re = qr/\$($alt)/;
while (my $line = <>) {
print $line =~ s/$re/ $replacements{$1}->() /reg;
}
Now we have an efficient solution that can be extended without slowing down the matching, all while avoiding repetition.
The solution you added to your question was close, but it had two bugs.
&foo calls foo. To get a reference to it, use \&foo.
my %regexMap = (
"\$fn", \&foundFunc,
"\$hw", \&hex8,
"\$hb", \&hex2,
"\$sh", \&rand6,
"\$ish", \&shiftInst,
);
$regexMap{$1} now returns the reference. You want to call the referenced sub, which can be done using $regexMap{$1}->().
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish|)/ $regexMap{$1}->() /e;
print $line;
}
In these cases, I often make some sort of data structure that holds the patterns and their actions:
my #tuples = (
[ qr/.../, sub { ... } ]
[ ... ].
);
Now the meat of the process stays the same no matter how many patterns I want to try:
while( <> ) {
foreach $tuple ( #tuples ) {
$tuple->[1]() if /$tuple[0]/
}
}
Abstract this a little further with a subroutine that takes the data structure. Then you can pass it different tables depending on what you would like to do:
sub some_sub {
my #tuples = #_;
while( <> ) {
foreach $tuple ( #tuples ) {
$tuple->[1]() if /$tuple[0]/
}
}
}
I've written about this sort of thing in Mastering Perl and Effective Perl Programming, and it's the sort of thing that does into my obscure modules like Brick and Data::Constraint.
I've been thinking about this more, and I wonder if regexes are actually part of what you are trying to do. It looks like you are matching literal strings, but using the match operator to do it. You don't give details of the input, so I'm guessing hereā€”it looks like there's an operation (e.g. $fn, and you want to match exactly that operation. The problem is finding that string then mapping it onto code. That looks something like this (and ikegami's answer is another form of this idea). Instead of an alternation, I match anything that might look like the string:
while( <> ) {
# find the string. Need example input to guess better
if( m/(\$[a-z]+)/ ) {
$table{$1}->() if exists $table{$1};
}
}
But again, it's dependent on the input, how many actual substrings you might want to match (so, the number of branches in an alternation), how many lines you want to process, and so on. There was a wonderful talk about processing apache log files with Regex::Trie and the various experiments they tried to make things faster. I've forgotten all the details, but very small adjustments made noticeable differences over tens of millions of lines.
Interesting reading:
Maybe this talk? An exploration of trie regexp matching
http://taint.org/2006/07/07/184022a.html
Matching a long list of phrases
OP's code can be written in following form
use strict;
use warnings;
use feature 'say';
my %regexMap = (
'$fn' => \&foundFunc,
'$hw' => \&hex8,
'$hb' => \&hex2,
'$sh' => \&rand6,
'$ish' => \&shiftInst,
);
my #keys = map { "\\$_" } keys %regexMap;
my $re = join('|', #keys);
while (<DATA>) {
chomp;
next unless /($re)/;
$regexMap{$1}->();
}
sub foundFunc { say 'sub_foundFunc' }
sub hex8 { say 'sub_hex8' }
sub hex2 { say 'sub_hex2' }
sub rand6 { say 'sub_rand6' }
sub shiftInst { say 'sub_shiftInst' }
__DATA__
$fn
$hw
$ac
$hb
$sh
$fn
$mf
$hb
$ish
$hw
Output
sub_foundFunc
sub_hex8
sub_hex2
sub_rand6
sub_foundFunc
sub_hex2
sub_shiftInst
sub_hex8

Dynamically capture regular expression match in Perl

I'm trying to dynamically catch regex matching in Perl. I've known that eval will help me do this but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", #_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are useing eval with a BLOCK of code. That's like a try block. If it dies inside of that eval block, it will catch that error. It will not run your string like it was code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says _take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in #captured by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an # sigil) in #+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
Apologies! I realized both my question and sample code were both vague. But after reading your suggestions I came of with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my #matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < #matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
Evaluing the regexp in list context returns the matches. so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my #a = ($str =~/$pattern/);
print Dumper(\#a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my #patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( #patterns ) {
my #captured = ($str =~ $pattern)
or next;
print "'$_'\n" for #captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my #tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern})
or next;
unless (#captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my #tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern});
ok #captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

perl regex for multiple matches

I have a file with lines similar to following:
abcd1::101:xyz1,user,user1,abcd1,pqrs1,userblah,abcd1
I want to retain strings up to last ":" and remove all occurrences of abcd1
In the end, I need to have below:
abcd1::101:xyz1,xyz2,xyz3,pqrs1,xyz4
I tried code as below, but for some reason, it is not working. So please help
the account name is "abcd1"
sub UpdateEtcGroup {
my $account = shift;
my $file = "/tmp/group";
#ARGV = ($file);
$^I = ".bak";
while (<>){
s#^($account::\d{1,$}:)$account,?#$1#g;
s/,$//; # to remove the last "," if there
print;
}
}
split is the tool for the job, not a regex.
Because split lets you reliably separate out the field you do want to operate on, from the ones that you don't. Like this:
#!/usr/bin/env perl
use strict;
use warnings;
my $username = 'abcd1';
while ( <DATA> ) {
my #fields = split /:/;
my #users = split ( /,/, pop ( #fields ) );
print join ( ":", #fields,
join ( ",", grep { not m/^$username$/ } #users ) ),"\n";
}
__DATA__
abcd1::101:xyz1,user,user1,abcd1,pqrs1,userblah,abcd1
Don't use a regular expression for this.
use strict;
use warnings;
while (<DATA>) {
chomp;
my #parts = split(/:/, $_);
$parts[-1] = join(',', grep { !/^abcd/ } split(/,/, $parts[-1]));
print join(':', #parts) . "\n";
}
__DATA__
abcd1::101:xyz1,user,user1,abcd1,pqrs1,userblah,abcd1
abcd2::102:user1,xyz2,otheruser,abcd2,pqrs1,xyz4,abcd2
Output:
abcd1::101:xyz1,user,user1,pqrs1,userblah
abcd2::102:user1,xyz2,otheruser,pqrs1,xyz4