I have a very large text file whose lines are comma-separated values. Some values are missing. For each line, I'd like to print the index and value of all the non-empty fields.
For example, a line might look like
,,10.3,,,,5.2,3.1,,,,,,,
in which case the output I want is
2,10.3,6,5.2,7,3.1
I know how to accomplish this by first splitting the input into an array, and then going through the array with a for loop, but these are huge files (multi-gigabyte) and I'm wondering if there is a faster way. (e.g. using some advanced regexp)
I haven't benchmarked it yet, but I would assume
my $line = ",,10.3,,,,5.2,3.1,,,,,,,";
my $index = 0;
print join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
is faster than some advanced regexp.
The problem is that as long as you have to know the index, you still have to keep track of those missing entries somehow.
Something like this might not be too slow though:
my ($i, #vars);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
print join ",", #vars;
You could probably leave out the first capturing group and use pos() to work out the index.
Here's a comparison of my two suggestions and sin's with 1M iterations:
Rate flesk1 sin flesk2
flesk1 87336/s -- -8% -27%
sin 94518/s 8% -- -21%
flesk2 120337/s 38% 27% --
Seems like my regex works better than I thought.
You might be able to mix and match regex and code -
$line =~ /(?{($cnt,#ary)=(0,)})^(?:([^,]+)(?{push #ary,$cnt; push #ary,$^N})|,(?{$cnt++}))+/x
and print join( ',', #ary);
expanded -
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
and print join( ',', #ary);
some benchmarks
With a slight tweak of flesk's and sln's (look for fleskNew and slnNew),
the winner is the fleskNew when the substitution operator is removed.
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1 => sub{
$index = 0;
join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
} );
numbers -
Rate flesk1 sln1 flesk2 slnNew fleskNew
flesk1 20325/s -- -51% -52% -56% -60%
sln1 41312/s 103% -- -1% -10% -19%
flesk2 41916/s 106% 1% -- -9% -17%
slnNew 45978/s 126% 11% 10% -- -9%
fleskNew 50792/s 150% 23% 21% 10% --
some benchmarks 2
Adds Birei's in-line replacment and trim (all-in-one) solution.
Abberations:
Flesk1 is modified to remove the final 'join' as it is not included in
the other regex solutions. This gives it a chance to bench better.
Birei deviates in the bench as it modifies the original string to be the final solution.
That aspect can't be taken out. The difference between Birei1 and BireiNew is that the
new one removes the final ','.
Flesk2, Birei1 and BireiNew have the additional overhead of restoring the original string
due to the substitution operator.
The winner still looks like FleskNew ..
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1a => sub{
$index = 0;
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
Birei1 => sub{
$i = -1;
$line =~
s/
(?(?=,+)
( (?: , (?{ ++$i }) )+ )
| (?<no_comma> [^,]+ ,? ) (?{ ++$i })
)
/
defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]
/xge;
$line = $samp;
},
BireiNew => sub{
$i = 0;
$line =~
s/
(?: , (?{++$i}) )*
(?<data> [^,]* )
(?: ,*$ )?
(?= (?<trailing_comma> ,?) )
/
length $+{data} ? "$i,$+{data}$+{trailing_comma}" : ""
/xeg;
$line = $samp;
},
} );
results -
Rate BireiNew Birei1 flesk1a flesk2 sln1 slnNew fleskNew
BireiNew 6030/s -- -18% -74% -85% -86% -87% -88%
Birei1 7389/s 23% -- -68% -82% -82% -84% -85%
flesk1a 22931/s 280% 210% -- -44% -45% -51% -54%
flesk2 40933/s 579% 454% 79% -- -2% -13% -17%
sln1 41752/s 592% 465% 82% 2% -- -11% -16%
slnNew 47088/s 681% 537% 105% 15% 13% -- -5%
fleskNew 49563/s 722% 571% 116% 21% 19% 5% --
Using a regex (although I'm sure it can be simpler):
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
Explanation:
s/PATTERN/REPLACEMENT/ge # g -> Apply to all occurrences
# e -> Evaluate replacement as a expression.
(?
(?=,+) # Check for one or more commas.
((?:,(?{ ++$i }))+) # If (?=,+) was true, increment variable '$i' with each comma found.
|
(?<no_comma>[^,]+,?)(?{ ++$i }) # If (?=,+) was false, get number between comma and increment the $i variable only once.
)
defined $+{no_comma} # If 'no_comma' was set in 'pattern' expression...
$i . qq[,] . $+{no_comma} # insert the position just before it.
qq[] # If wasn't set, it means that pattern matched only commas, so remove then.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
our $i = -1;
chomp;
printf qq[Orig = $_\n];
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
# s/,\Z//;
printf qq[Mod = $_\n\n];
}
__DATA__
,,10.3,,,,5.2,3.1,,,,,,,
10.3,,,,5.2,3.1,,,,,,,
,10.3,,,,5.2,3.1
,,10.3,5.2,3.1,
Run the script like:
perl script.pl
And output:
Orig = ,,10.3,,,,5.2,3.1,,,,,,,
Mod = 2,10.3,6,5.2,7,3.1,
Orig = 10.3,,,,5.2,3.1,,,,,,,
Mod = 0,10.3,4,5.2,5,3.1,
Orig = ,10.3,,,,5.2,3.1
Mod = 1,10.3,5,5.2,6,3.1
Orig = ,,10.3,5.2,3.1,
Mod = 2,10.3,3,5.2,4,3.1,
As you can see, it keeps last comma. I don't know how to remove it without an extra regex, just uncomment s/,\Z//; in previous code.
Related
I have a large flat text file with lines that hold name/value pairs ("varname=value"). These pairs are seperated by a multi-character delimiter. So a single line in this file might look like this:
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
Each line holds about 50 name/value pairs.
I need to iterate through the lines of this file (there are about 100,000 lines) and store the name/value pairs in a hash so that
$field{'var1'} = value1
$field{'var2'} = value2
etc...
What I did was this:
# $line holds a single line from the file
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
Doing this for each line of the entire file takes (on my PC) about 2 seconds. This doesn't seem like a long time, but I really want to speed this up by quite a bit.
Of this 2 seconds, the first split takes about 0.6 seconds, while the foreach loop takes about 1.4 seconds. So I thought I'd get rid of the foreach loop and put it all in a single split:
%hash = split( /\Q|^|\E|=/, $line );
Much to my surprise, parsing the entire file this way took a full second longer! My question isn't really why this takes longer (although it would be a nice bonus to understand why), but my question is if there are any other (faster) ways to get the job done.
Thanks in advance.
------ Edit below this line ------
I just found out that changing this:
%hash = split( /\Q|^|\E|=/, $line );
into this:
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
makes it three times faster! Parsing the entire file this way now takes just over a second...
------ Snippet below this line ------
use strict;
use Time::HiRes qw( time );
my $line = "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
ResetTimer();
my %hash;
for( my $i = 1; $i <= 100000; $i++ ) {
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i <= 100000; $i++ ) {
%hash = split( /\Q|^|\E|=/, $line );
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i<=100000; $i++ ) {
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
}
print Elapsed() . "\n";
################################################################################################################################
BEGIN {
my $startTime;
sub ResetTimer {
$startTime = time();
return $startTime;
}
sub Elapsed {
return time() - $startTime;
}
}
I can't easily answer your performance question, because I'd need a test case. But I'd guess that it's to do with how the regular expression is being processed.
You can see what that's doing with use re 'debug';, and that'll print the regular expression steps.
But for the broader question - I'd probably just tackle it with a global (assuming your data is as simple as the example):
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
my %row = m/(\w+)=(\w+)/g;
print Dumper \%row;
}
__DATA__
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
You can use lookahead/behind to match delimiters if you've got more complicated things in there, but because it's one regex per line, you're invoking the regex engine less often, and that'll probably be faster. (But I can't tell you for sure without a test case).
If your data is more complicated, then perhaps:
my %row = s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
This will 'force' splitting the input into a new line, and then match 'anything' = 'anything'. But that's probably overkill unless your values include whitespace/pipes/metachars.
With editing your test case to use Benchmark:
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw ( cmpthese );
my $line =
"a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
sub double_split {
my %hash;
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ( $name, $value ) = split( /=/, $field );
$hash{$name} = $value;
}
}
sub single_split {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub re_replace_then_split {
$line =~ s/\Q|^|\E/=/g;
my %hash = split( /=/, $line );
}
sub single_regex {
my %hash = $line =~ m/(\w+)=(\w+)/g;
}
sub compound {
my %hash = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
}
cmpthese(
1_000_000,
{ "Double Split" => \&double_split,
"single split with regex" => \&single_split,
"Replace then split" => \&re_replace_then_split,
"Single Regex" => \&single_regex,
"regex to linefeed them match" => \&compound
}
);
Looks like the results come out like:
Rate Double Split single split with regex Single Regex Replace then split regex to linefeed them match
Double Split 18325/s -- -4% -34% -56% -97%
single split with regex 19050/s 4% -- -31% -54% -97%
Single Regex 27607/s 51% 45% -- -34% -96%
Replace then split 41733/s 128% 119% 51% -- -93%
regex to linefeed them match 641026/s 3398% 3265% 2222% 1436% --
... I'm a bit suspicious of that last, because that's absurdly faster. There's probably caching of results happening there.
But looking at it, what's slowing you down is the alternation in the regex:
sub single_split_with_alt {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub single_split {
my %hash = split( /[\|\^\=]+/, $line );
}
(I know that latter might not be quite what you want, but it's for illustrative purposes)
Gives:
Rate alternation single split
alternation 19135/s -- -37%
single split 30239/s 58% --
But there does come a point where this is moot, because your limiting factor is disk IO, not CPU.
I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11
I'm parsing a CSV file with embedded commas, and obviously, using split() has a few limitations due to this.
One thing I should note is that the values with embedded commas are surrounded by parentheses, double quotes, or both...
for example:
(Date, Notional),
"Date, Notional",
"(Date, Notional)"
Also, I'm trying to do this without using any modules for certain reasons I don't want to go into right now...
Can anyone help me out with this?
This should do what you need. It works in a very similar way to the code in Text::CSV_PP, but doesn't allow for escaped characters within the field as you say you have none
use strict;
use warnings;
use 5.010;
my $re = qr/(?| "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my #fields = "$line," =~ /$re/g;
say "<$_>" for #fields;
output
<Date, Notional 1>
<Date, Notional 2>
<Date, Notional 3>
Update
Here's a version for older Perls (prior to version 10) that don't have the regex branch reset construct. It produces identical output to the above
use strict;
use warnings;
use 5.010;
my $re = qr/(?: "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my #fields = grep defined, "$line," =~ /$re/g;
say "<$_>" for #fields;
I know you already have a working solution with Borodin's answer, but for the record there is also a simple solution with split (see the results at the bottom of the online demo). This situation sounds very similar to regex match a pattern unless....
#!/usr/bin/perl
$regex = '(?:\([^\)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*';
$subject = '(Date, Notional), "Date, Notional", "(Date, Notional)"';
#splits = split($regex, $subject);
print "\n*** Splits ***\n";
foreach(#splits) { print "$_\n"; }
How it Works
The left side of the alternation | matches complete (parentheses) and (quotes), then deliberately fails. The right side matches commas, and we know they are the right commas because they were not matched by the expression on the left.
Possible Refinements
If desired, the parenthess-matching portion could be made recursive to match (nested(parens))
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
I know that this is quite old question, but for completeness I would like to add solution from great book "Mastering Regular Expressions" by Jeffrey Friedl (page 271):
sub parse_csv {
my $text = shift; # record containing comma-separated values
my #fields = ( );
my $field;
chomp($text);
while ($text =~ m{\G(?:^|,)(?:"((?>[^"]*)(?:""[^"]*)*)"|([^",]*))}gx) {
if (defined $2) {
$field = $2;
} else {
$field = $1;
$field =~ s/""/"/g;
}
# print "[$field]";
push #fields, $field;
}
return #fields;
}
Try it against test row:
my $line = q(Ten Thousand,10000, 2710 ,,"10,000",,"It's ""10 Grand"", baby",10K);
my #fields = parse_csv($line);
my $i;
for ($i = 0; $i < #fields; $i++) {
print "$fields[$i],";
}
print "\n";
I want to match dates that have the following format:
2010-08-27 02:11:36
i.e. yyyy-mm-dd hh:mm:ss.
Right now I am not very particular about the date being actually feasible, but just that it is in the correct format.
Possible formats that should match are (for this example)
2010
2010-08
2010-08-27
2010-08-27 02
2010-08-27 02:11
2010-08-27 02:11:36
In Perl, what can be a concise regex for this?
I have this so far (which works, btw)
/\d{4}(-\d{2}(-\d{2}( \d{2}(:\d{2}(:\d{2})?)?)?)?)?/
Can this be improved performance-wise?
Based on the lack of a capturing group around the year, I assume you care only whether a date matches.
I tried a few different patterns related to the one from your question, and the one that gave a ten- to fifteen-percent improvement was disabling capturing, i.e.,
/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/
The perlre documentation covers (?:...):
(?:pattern)
(?imsx-imsx:pattern)
This is for clustering, not capturing; it groups subexpressions like (), but doesn't make backreferences as () does. So
#fields = split(/\b(?:a|b|c)\b/)
is like
#fields = split(/\b(a|b|c)\b/)
but doesn't spit out extra fields. It's also cheaper not to capture characters if you don't need to.
Any letters between ? and : act as flags modifiers as with (?imsx-imsx). For example,
/(?s-i:more.*than).*million/i
is equivalent to the more verbose
/(?:(?s-i)more.*than).*million/i
Benchmark output:
Rate U U/NC CH/NC/A CH/NC/A/U CH CH/NC null
U 31811/s -- -32% -58% -59% -61% -66% -93%
U/NC 46849/s 47% -- -38% -39% -42% -50% -90%
CH/NC/A 76119/s 139% 62% -- -1% -6% -18% -84%
CH/NC/A/U 76663/s 141% 64% 1% -- -6% -17% -84%
CH 81147/s 155% 73% 7% 6% -- -13% -83%
CH/NC 92789/s 192% 98% 22% 21% 14% -- -81%
null 481882/s 1415% 929% 533% 529% 494% 419% --
Code:
#! /usr/bin/perl
use warnings;
use strict;
use Benchmark qw/ :all /;
sub option_chain {
local($_) = #_;
/\d{4}(-\d{2}(-\d{2}( \d{2}(:\d{2}(:\d{2})?)?)?)?)?/
}
sub option_chain_nocap {
local($_) = #_;
/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/
}
sub option_chain_nocap_anchored {
local($_) = #_;
/\A\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/
}
sub option_chain_anchored_unrolled {
local($_) = #_;
/\A\d\d\d\d(-\d\d(-\d\d( \d\d(:\d\d(:\d\d)?)?)?)?)?\z/
}
sub simple_split {
local($_) = #_;
split /[ :-]/;
}
sub unrolled {
local($_) = #_;
grep defined($_), /\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d):(\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) \z
|\A (\d\d\d\d)-(\d\d) \z
|\A (\d\d\d\d) \z
/x;
}
sub unrolled_nocap {
local($_) = #_;
grep defined($_), /\A \d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d \z
|\A \d\d\d\d-\d\d-\d\d \d\d:\d\d \z
|\A \d\d\d\d-\d\d-\d\d \d\d \z
|\A \d\d\d\d-\d\d-\d\d \z
|\A \d\d\d\d-\d\d \z
|\A \d\d\d\d \z
/x;
}
sub id { $_[0] }
my #examples = (
"xyz",
"2010",
"2010-08",
"2010-08-27",
"2010-08-27 02",
"2010-08-27 02:11",
"2010-08-27 02:11:36",
);
cmpthese -1 => {
"CH" => sub { option_chain $_ for #examples },
"CH/NC" => sub { option_chain_nocap $_ for #examples },
"CH/NC/A" => sub { option_chain_nocap_anchored $_ for #examples },
"CH/NC/A/U" => sub { option_chain_anchored_unrolled $_ for #examples },
"U" => sub { unrolled $_ for #examples },
"U/NC" => sub { unrolled_nocap $_ for #examples },
"null" => sub { id $_ for #examples },
};
How about something from Regexp::Common::time?
Your regex is just fine except for missing anchors (unless you want to match 2008 in "abc200890"?). Assuming you want to match the whole string:
/^\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/
(?:...) should be used if you don't actually want the captured substrings, which I'd guess to be the case.
I would use the split function :
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #dates = (
'2010',
'2010-08',
'2010-08-27',
'2010-08-27 02',
'2010-08-27 02:11',
'2010-08-27 02:11:36',
);
for (#dates) {
my #list = split /[ :-]/;
print Dumper(\#list);
}
output :
$VAR1 = [
'2010'
];
$VAR1 = [
'2010',
'08'
];
$VAR1 = [
'2010',
'08',
'27'
];
$VAR1 = [
'2010',
'08',
'27',
'02'
];
$VAR1 = [
'2010',
'08',
'27',
'02',
'11'
];
$VAR1 = [
'2010',
'08',
'27',
'02',
'11',
'36'
];
This matches all the above (but also other stuff - see the comment!) and may be slightly easier to read:
/(\d{4})(-\d{2})?(\w{1}\d{2})?(:\d{2})?/
If you want faster, then look away from regex, and look at XS modules: Date::Calc is a good one.
Following up from an earlier question on extracting the n'th regex match, I now need to substitute the match, if found.
I thought that I could define the extraction subroutine and call it in the substitution with the /e modifier. I was obviously wrong (admittedly, I had an XY problem).
use strict;
use warnings;
sub extract_quoted { # à la codaddict
my ($string, $index) = #_;
while($string =~ /'(.*?)'/g) {
$index--;
return $1 if(! $index);
}
return;
}
my $string = "'How can I','use' 'PERL','to process this' 'line'";
extract_quoted ( $string, 3 );
$string =~ s/&extract_quoted($string,2)/'Perl'/e;
print $string; # Prints 'How can I','use' 'PERL','to process this' 'line'
There are, of course, many other issues with this technique:
What if there are identical matches at different positions?
What if the match isn't found?
In light of this situation, I'm wondering in what ways this could be implemented.
EDIT: leonbloy came up with this solution first. If your tempted to upvote it, upvote leonbloy's first.
Somewhat inspired by leonbloy's (earlier) answer:
$line = "'How can I','use' 'PERL' 'to process this';'line'";
$n = 3;
$replacement = "Perl";
print "Old line: $line\n";
$z = 0;
$line =~ s/'(.*?)'/++$z==$n ? "'$replacement'" : "'$1'"/ge;
print "New line: $line\n";
Old line: 'How can I','use' 'PERL' 'to process this';'line'
New line: 'How can I','use' 'Perl' 'to process this';'line'
Or you can do something as this
use strict;
use warnings;
my $string = "'How can I','use' .... 'perl','to process this' 'line'";
my $cont =0;
sub replacen { # auxiliar function: replaces string if incremented counter equals $index
my ($index,$original,$replacement) = #_;
$cont++;
return $cont == $index ? $replacement: $original;
}
#replace the $index n'th match (1-based counting) from $string by $rep
sub replace_quoted {
my ($string, $index,$replacement) = #_;
$cont = 0; # initialize match counter
$string =~ s/'(.*?)'/replacen($index,$1,$replacement)/eg;
return $string;
}
my $result = replace_quoted ( $string, 3 ,"PERL");
print "RESULT: $result\n";
A little ugly the "global" $cont variable, that could be polished, but you get the idea.
Update: a more compact version:
use strict;
my $string = "'How can I','use' .... 'perl','to process this' 'line'";
#replace the $index n'th match (1-based counting) from $string by $replacement
sub replace_quoted {
my ($string, $index,$replacement) = #_;
my $cont = 0; # initialize match counter
$string =~ s/'(.*?)'/$cont++ == $index ? $replacement : $1/eg;
return $string;
}
my $result = replace_quoted ( $string, 3 ,"PERL");
print "RESULT: $result\n";
If the regex isn't too much more complicated than what you have, you could follow a split with an edit and a join:
$line = "'How can I','use' 'PERL','to process this' 'line'";
$n = 3;
$new_text = "'Perl'";
#f = split /('.*?')/, $line;
# odd fields of #f contain regex matches
# even fields contain the text between matches
$f[2*$n-1] = $new_text;
$new_line = join '', #f;
See perldoc perlvar:
use strict; use warnings;
use Test::More tests => 5;
my %src = (
q{'I want to' 'extract the word' 'PERL','from this string'}
=> q{'I want to' 'extract the word' 'Perl','from this string'},
q{'What about', 'getting','PERL','from','here','?'}
=> q{'What about', 'getting','Perl','from','here','?'},
q{'How can I','use' 'PERL','to process this' 'line'}
=> q{'How can I','use' 'Perl','to process this' 'line'},
q{Invalid} => q{Invalid},
q{'Another invalid string'} => q{'Another invalid string'}
);
while ( my ($src, $target) = each %src ) {
ok($target eq subst_n($src, 3, 'Perl'), $src)
}
sub subst_n {
my ($src, $index, $replacement) = #_;
return $src unless $index > 0;
while ( $src =~ /'.*?'/g ) {
-- $index or return join(q{'},
substr($src, 0, $-[0]),
$replacement,
substr($src, $+[0])
);
}
return $src;
}
Output:
C:\Temp> pw
1..5
ok 1 - 'Another invalid string'
ok 2 - 'How can I','use' 'PERL','to process this' 'line'
ok 3 - Invalid
ok 4 - 'What about', 'getting','PERL','from','here','?'
ok 5 - 'I want to' 'extract the word' 'PERL','from this string'
Of course, you need to decide what happens if an invalid $index is passed or if the required match is not found. I just return the original string in the code above.
Reworking an answer to an earlier question, match n-1 times and then replace the next. Memoizing patterns spares poor Perl having to recompile the same patterns over and over.
my $_quoted = qr/'[^']+'/; # ' fix Stack Overflow highlighting
my %_cache;
sub replace_nth_quoted {
my($string,$index,$replace) = #_;
my $pat = $_cache{$index} ||=
qr/ ^
( # $1
(?:.*?$_quoted.*?) # match quoted substrings...
{#{[$index-1]}} # $index-1 times
)
$_quoted # the ${index}th match
/x;
$string =~ s/$pat/$1$replace/;
$string;
}
For example
my $string = "'How can I','use' 'PERL','to process this' 'line'";
print replace_nth_quoted($string, 3, "'Perl'"), "\n";
outputs
'How can I','use' 'Perl','to process this' 'line'