regular expression for matching date in Perl - regex

I want to match dates that have the following format:
2010-08-27 02:11:36
i.e. yyyy-mm-dd hh:mm:ss.
Right now I am not very particular about the date being actually feasible, but just that it is in the correct format.
Possible formats that should match are (for this example)
2010
2010-08
2010-08-27
2010-08-27 02
2010-08-27 02:11
2010-08-27 02:11:36
In Perl, what can be a concise regex for this?
I have this so far (which works, btw)
/\d{4}(-\d{2}(-\d{2}( \d{2}(:\d{2}(:\d{2})?)?)?)?)?/
Can this be improved performance-wise?

Based on the lack of a capturing group around the year, I assume you care only whether a date matches.
I tried a few different patterns related to the one from your question, and the one that gave a ten- to fifteen-percent improvement was disabling capturing, i.e.,
/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/
The perlre documentation covers (?:...):
(?:pattern)
(?imsx-imsx:pattern)
This is for clustering, not capturing; it groups subexpressions like (), but doesn't make backreferences as () does. So
#fields = split(/\b(?:a|b|c)\b/)
is like
#fields = split(/\b(a|b|c)\b/)
but doesn't spit out extra fields. It's also cheaper not to capture characters if you don't need to.
Any letters between ? and : act as flags modifiers as with (?imsx-imsx). For example,
/(?s-i:more.*than).*million/i
is equivalent to the more verbose
/(?:(?s-i)more.*than).*million/i
Benchmark output:
Rate U U/NC CH/NC/A CH/NC/A/U CH CH/NC null
U 31811/s -- -32% -58% -59% -61% -66% -93%
U/NC 46849/s 47% -- -38% -39% -42% -50% -90%
CH/NC/A 76119/s 139% 62% -- -1% -6% -18% -84%
CH/NC/A/U 76663/s 141% 64% 1% -- -6% -17% -84%
CH 81147/s 155% 73% 7% 6% -- -13% -83%
CH/NC 92789/s 192% 98% 22% 21% 14% -- -81%
null 481882/s 1415% 929% 533% 529% 494% 419% --
Code:
#! /usr/bin/perl
use warnings;
use strict;
use Benchmark qw/ :all /;
sub option_chain {
local($_) = #_;
/\d{4}(-\d{2}(-\d{2}( \d{2}(:\d{2}(:\d{2})?)?)?)?)?/
}
sub option_chain_nocap {
local($_) = #_;
/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/
}
sub option_chain_nocap_anchored {
local($_) = #_;
/\A\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/
}
sub option_chain_anchored_unrolled {
local($_) = #_;
/\A\d\d\d\d(-\d\d(-\d\d( \d\d(:\d\d(:\d\d)?)?)?)?)?\z/
}
sub simple_split {
local($_) = #_;
split /[ :-]/;
}
sub unrolled {
local($_) = #_;
grep defined($_), /\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d):(\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) (\d\d) \z
|\A (\d\d\d\d)-(\d\d)-(\d\d) \z
|\A (\d\d\d\d)-(\d\d) \z
|\A (\d\d\d\d) \z
/x;
}
sub unrolled_nocap {
local($_) = #_;
grep defined($_), /\A \d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d \z
|\A \d\d\d\d-\d\d-\d\d \d\d:\d\d \z
|\A \d\d\d\d-\d\d-\d\d \d\d \z
|\A \d\d\d\d-\d\d-\d\d \z
|\A \d\d\d\d-\d\d \z
|\A \d\d\d\d \z
/x;
}
sub id { $_[0] }
my #examples = (
"xyz",
"2010",
"2010-08",
"2010-08-27",
"2010-08-27 02",
"2010-08-27 02:11",
"2010-08-27 02:11:36",
);
cmpthese -1 => {
"CH" => sub { option_chain $_ for #examples },
"CH/NC" => sub { option_chain_nocap $_ for #examples },
"CH/NC/A" => sub { option_chain_nocap_anchored $_ for #examples },
"CH/NC/A/U" => sub { option_chain_anchored_unrolled $_ for #examples },
"U" => sub { unrolled $_ for #examples },
"U/NC" => sub { unrolled_nocap $_ for #examples },
"null" => sub { id $_ for #examples },
};

How about something from Regexp::Common::time?

Your regex is just fine except for missing anchors (unless you want to match 2008 in "abc200890"?). Assuming you want to match the whole string:
/^\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/
(?:...) should be used if you don't actually want the captured substrings, which I'd guess to be the case.

I would use the split function :
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #dates = (
'2010',
'2010-08',
'2010-08-27',
'2010-08-27 02',
'2010-08-27 02:11',
'2010-08-27 02:11:36',
);
for (#dates) {
my #list = split /[ :-]/;
print Dumper(\#list);
}
output :
$VAR1 = [
'2010'
];
$VAR1 = [
'2010',
'08'
];
$VAR1 = [
'2010',
'08',
'27'
];
$VAR1 = [
'2010',
'08',
'27',
'02'
];
$VAR1 = [
'2010',
'08',
'27',
'02',
'11'
];
$VAR1 = [
'2010',
'08',
'27',
'02',
'11',
'36'
];

This matches all the above (but also other stuff - see the comment!) and may be slightly easier to read:
/(\d{4})(-\d{2})?(\w{1}\d{2})?(:\d{2})?/

If you want faster, then look away from regex, and look at XS modules: Date::Calc is a good one.

Related

How to get the position of all capture groups with quantifiers?

I have a small problem. I have a perl regexp with multiple capture groups. Some of them have quantifiers (like '+'). If no quantifier is added, then #- & #+ array are filled nicely with the matched position of the capture groups, but if a quantifier is added only the last match is detected. But I would like to have all of them.
An example:
my $s = 'xx1a2b3cyy';
my $re = qr/^xx(\d\w)+/;
So I'd like to know that matches are '1a', '2b', '3c' at 2, 4, 6.
Simple matching gives:
if ($s =~ $re) {
print "Match #-, #+\n";
for (my $i = 0; $i < #-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
Gives:
Match 0 6, 8 8
i: 0 - 'xx1a2b3c
i: 1 - '3c
So only the last capture group match is remembered.
My next simple try was which is not really what I want as the RE is different:
$re = qr/(\d\w)/;
my #s = ($s =~ /$re/g);
print "RE: '#s'\n";
while ($s =~ /$re/g) {
print "Match #-, #+\n";
for (my $i = 0; $i < #-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
gives:
RE: '1a 2b 3c'
Match 2 2, 4 4
i: 0 - '1a
i: 1 - '1a
Match 4 4, 6 6
i: 0 - '2b
i: 1 - '2b
Match 6 6, 8 8
i: 0 - '3c
i: 1 - '3c
But this not what I want, as it would match a string like 'ZZ1aAA2bBB3cZZ'.
So somehow I have to combine the two. The best what I could get:
$re = '^xx(?:\d\w)*?\G(\d\w)';
pos($s) = 2;
while ($s =~ m($re)g) {
print "Match pos: ", pos($s), ', G: ', $1, ", '#-', '#+'\n"
}
gives:
Match pos: 4, G: 1a, '0 2', '4 4'
Match pos: 6, G: 2b, '0 4', '6 6'
Match pos: 8, G: 3c, '0 6', '8 8'
This is almost nice, but for this I need to know the position of the first possible match. If it is not set properly it will not match anything. I can only determine the first position if I remove the non greedy part:
$re = '^xx(\d\w)';
if ($s =~ m($re)) {
print "Match: '#-', '#+'\n";
}
which gives:
Match: '0 2', '4 4'
So $-[1] gives the first position, but for this I have to modify the RE "manually".
If I add code execution into the pattern I almost get what I need:
use re 'eval';
$re = '^xx(\d\w)+(??{print "Code: <#-> <#+>\n"})';
$s =~ m($re) and print "Match\n";
gives:
Code: <0 6> <8 8>
Code: <0 4> <6 6>
Code: <0 2> <4 4>
For this I need to add the (?{ code }) part.
Does anybody know a simpler method (I mean not need to modify the original RE) to get all the possible matches of a capture group having a quantifier?
Thanks in advance!
There's no general solution; the regex engine simply doesn't store the necessary information. You're asking to use a regex as a parser, and that's a no-go.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my #matches;
push #matches, $1 while /\G (\d\w) /xg;
return #matches;
}
}
or
sub extract {
my ($pairs) = $_[0] =~ /^xx((?:\d\w)+)/
or return ();
return unpack('(a2)*', $pairs);
}
If you just want the positions, it's the same.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my #matches;
push #matches, $-[1] while /\G (\d\w) /xg;
return #matches;
}
}
or
sub extract {
$_[0] =~ /^xx((?:\d\w)+)/
or return ();
return map { $-[1] + ( $_ - 1 )*2 } 1..length($1)/2;
}
Even a non-general purpose solution is extremely hard using regular expressions. Say you had the following pattern:
xx(\d\w)+yy(\d\w)+zz
The correct solution would be:
use Storable qw( dclone );
my $s = "xx1a2byy3c4dZZ...xx5a6byy7c8dzz";
local our $rv;
if (
$s =~ /
(?{ [] })
xx
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[0] }, $^N; $r }) )+
yy
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[1] }, $^N; $r }) )+
zz
(?{ $rv = $^R; })
/x
) {
say "\$1: #{ $rv->[0] }";
say "\$2: #{ $rv->[1] }";
}
Output:
$1: 5a 6b
$2: 7c 8d
And something like
(zz(\d\w)+)+
would need
use Storable qw( dclone );
my $s = "zz1a2bzz3c4d";
local our $rv;
if (
$s =~ /
(?{ [] })
(?:
(?{ my $r = dclone($^R); push #$r, []; $r })
zz
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[-1] }, $^N; $r }) )+
)+
(?{ $rv = $^R; })
/x
) {
say "\$1: #$_" for #$rv;
}
Output:
$1: 1a 2b
$1: 3c 4d
I think I can give some explanation for the behavior you see:
In the first example I can see only one capture group. The quantifier allows it to be used multiple times, but it's one capture group nonetheless. So every new occurence of a matching subpattern would overwrite the value previously captured there. Even if the RE engine is already advanced behind it, but backtracking would occur (for e.g. a more advanced pattern with branching and the likes), it could be that the now again visited capture group would change. And since #- and #+ hold the positions to the capture groups (as opposed to occuring subpattern matches), this would explain why there's only the last occurence of the subpattern contained.
You could even play around with named subpatterns and %+/%- and would experience the same thing. It becomes more obvious with the already used (?{ }), at least for debugging purposes. But use re 'debug' is fine for shorter regexes / strings to match.
So be aware of the effects of backtracking to capture groups while matching is still in progress!
But if you don't have to care about backtracking, I can think of kind of a recipe to handle a capture group with a quantifier:
If your capture group is (bla) and your quantifier {0,3}, transform it into
(?:(bla)(?{ print $-[$#-],$+[$#-]."\n" })){0,3}.
You practically put the subpattern into another (non-capturing) group. If the RE engine is done with it, execute code regarding the last capture group matched so far. The quantifier outside the surrounding group is then responsible for the correct number of execution of the code fragment.
So you example becomes this:
use Data::Dumper;
my $s = 'xx1a2b3cyy';
my #submatches;
sub getem { push #submatches, [$-[$#-],$+[$#-]]; }
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+/;
print Dumper(\#submatches);
This also works with multiple capture groups transformed this way:
my $s = 'xx1a2b3cyy4de5fg6hihhh2';
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+yy(?:(\d\w{2})(?{ getem() }))+hh/;
You have to adjust the index used, if your capture group contains more capture groups. That's why I prefer names capture groups.
Hope this helps.

Perl all matches of a regexp in a given string

Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.
local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;

Finding indexes of non-empty fields in text files

I have a very large text file whose lines are comma-separated values. Some values are missing. For each line, I'd like to print the index and value of all the non-empty fields.
For example, a line might look like
,,10.3,,,,5.2,3.1,,,,,,,
in which case the output I want is
2,10.3,6,5.2,7,3.1
I know how to accomplish this by first splitting the input into an array, and then going through the array with a for loop, but these are huge files (multi-gigabyte) and I'm wondering if there is a faster way. (e.g. using some advanced regexp)
I haven't benchmarked it yet, but I would assume
my $line = ",,10.3,,,,5.2,3.1,,,,,,,";
my $index = 0;
print join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
is faster than some advanced regexp.
The problem is that as long as you have to know the index, you still have to keep track of those missing entries somehow.
Something like this might not be too slow though:
my ($i, #vars);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
print join ",", #vars;
You could probably leave out the first capturing group and use pos() to work out the index.
Here's a comparison of my two suggestions and sin's with 1M iterations:
Rate flesk1 sin flesk2
flesk1 87336/s -- -8% -27%
sin 94518/s 8% -- -21%
flesk2 120337/s 38% 27% --
Seems like my regex works better than I thought.
You might be able to mix and match regex and code -
$line =~ /(?{($cnt,#ary)=(0,)})^(?:([^,]+)(?{push #ary,$cnt; push #ary,$^N})|,(?{$cnt++}))+/x
and print join( ',', #ary);
expanded -
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
and print join( ',', #ary);
some benchmarks
With a slight tweak of flesk's and sln's (look for fleskNew and slnNew),
the winner is the fleskNew when the substitution operator is removed.
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1 => sub{
$index = 0;
join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
} );
numbers -
Rate flesk1 sln1 flesk2 slnNew fleskNew
flesk1 20325/s -- -51% -52% -56% -60%
sln1 41312/s 103% -- -1% -10% -19%
flesk2 41916/s 106% 1% -- -9% -17%
slnNew 45978/s 126% 11% 10% -- -9%
fleskNew 50792/s 150% 23% 21% 10% --
some benchmarks 2
Adds Birei's in-line replacment and trim (all-in-one) solution.
Abberations:
Flesk1 is modified to remove the final 'join' as it is not included in
the other regex solutions. This gives it a chance to bench better.
Birei deviates in the bench as it modifies the original string to be the final solution.
That aspect can't be taken out. The difference between Birei1 and BireiNew is that the
new one removes the final ','.
Flesk2, Birei1 and BireiNew have the additional overhead of restoring the original string
due to the substitution operator.
The winner still looks like FleskNew ..
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1a => sub{
$index = 0;
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
Birei1 => sub{
$i = -1;
$line =~
s/
(?(?=,+)
( (?: , (?{ ++$i }) )+ )
| (?<no_comma> [^,]+ ,? ) (?{ ++$i })
)
/
defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]
/xge;
$line = $samp;
},
BireiNew => sub{
$i = 0;
$line =~
s/
(?: , (?{++$i}) )*
(?<data> [^,]* )
(?: ,*$ )?
(?= (?<trailing_comma> ,?) )
/
length $+{data} ? "$i,$+{data}$+{trailing_comma}" : ""
/xeg;
$line = $samp;
},
} );
results -
Rate BireiNew Birei1 flesk1a flesk2 sln1 slnNew fleskNew
BireiNew 6030/s -- -18% -74% -85% -86% -87% -88%
Birei1 7389/s 23% -- -68% -82% -82% -84% -85%
flesk1a 22931/s 280% 210% -- -44% -45% -51% -54%
flesk2 40933/s 579% 454% 79% -- -2% -13% -17%
sln1 41752/s 592% 465% 82% 2% -- -11% -16%
slnNew 47088/s 681% 537% 105% 15% 13% -- -5%
fleskNew 49563/s 722% 571% 116% 21% 19% 5% --
Using a regex (although I'm sure it can be simpler):
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
Explanation:
s/PATTERN/REPLACEMENT/ge # g -> Apply to all occurrences
# e -> Evaluate replacement as a expression.
(?
(?=,+) # Check for one or more commas.
((?:,(?{ ++$i }))+) # If (?=,+) was true, increment variable '$i' with each comma found.
|
(?<no_comma>[^,]+,?)(?{ ++$i }) # If (?=,+) was false, get number between comma and increment the $i variable only once.
)
defined $+{no_comma} # If 'no_comma' was set in 'pattern' expression...
$i . qq[,] . $+{no_comma} # insert the position just before it.
qq[] # If wasn't set, it means that pattern matched only commas, so remove then.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
our $i = -1;
chomp;
printf qq[Orig = $_\n];
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
# s/,\Z//;
printf qq[Mod = $_\n\n];
}
__DATA__
,,10.3,,,,5.2,3.1,,,,,,,
10.3,,,,5.2,3.1,,,,,,,
,10.3,,,,5.2,3.1
,,10.3,5.2,3.1,
Run the script like:
perl script.pl
And output:
Orig = ,,10.3,,,,5.2,3.1,,,,,,,
Mod = 2,10.3,6,5.2,7,3.1,
Orig = 10.3,,,,5.2,3.1,,,,,,,
Mod = 0,10.3,4,5.2,5,3.1,
Orig = ,10.3,,,,5.2,3.1
Mod = 1,10.3,5,5.2,6,3.1
Orig = ,,10.3,5.2,3.1,
Mod = 2,10.3,3,5.2,4,3.1,
As you can see, it keeps last comma. I don't know how to remove it without an extra regex, just uncomment s/,\Z//; in previous code.

How can I escape a literal string I want to interpolate into a regular expression?

Is there a built-in way to escape a string that will be used within/as a regular expression? E.g.
www.abc.com
The escaped version would be:
www\.abc\.com
I was going to use:
$string =~ s/[.*+?|()\[\]{}\\]/\\$&/g; # Escapes special regex chars
But I just wanted to make sure that there's not a cleaner built-in operation that I'm missing?
Use quotemeta or \Q...\E.
Consider the following test program that matches against $str as-is, with quotemeta, and with \Q...\E:
#! /usr/bin/perl
use warnings;
use strict;
my $str = "www.abc.com";
my #test = (
"www.abc.com",
"www/abc!com",
);
sub ismatch($) { $_[0] ? "MATCH" : "NO MATCH" }
my #match = (
[ as_is => sub { ismatch /$str/ } ],
[ qmeta => sub { my $qm = quotemeta $str; ismatch /$qm/ } ],
[ qe => sub { ismatch /\Q$str\E/ } ],
);
for (#test) {
print "\$_ = '$_':\n";
foreach my $method (#match) {
my($name,$match) = #$method;
print " - $name: ", $match->(), "\n";
}
}
Notice in the output that using the string as-is could produce spurious matches:
$ ./try
$_ = 'www.abc.com':
- as_is: MATCH
- qmeta: MATCH
- qe: MATCH
$_ = 'www/abc!com':
- as_is: MATCH
- qmeta: NO MATCH
- qe: NO MATCH
For programs that accept untrustworthy inputs, be extremely careful about using such potentially nasty bits as regular expressions: doing so could create unexpected runtime errors, denial-of-service vulnerabilities, and security holes.
The best way to do this is to use \Q to begin a quoted string and \E to end it.
my $foo = 'www.abc.com';
$bar =~ /blah\Q$foo\Eblah/;
You can also use quotemeta on the variable first. E.g.
my $quoted_foo = quotemeta($foo);
The \Q trick is documented in perlre under "Escape Sequences."

How can Perl split a line on whitespace except when the whitespace is in doublequotes?

I have the following string:
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
I need a regular expression to split this line but ignore spaces in double quotes in Perl.
The following is what I tried but it does not work.
(".*?"|\S+)
Once upon a time I also tried to re-invent the wheel, and solve this myself.
Now I just use Text::ParseWords
and let it do the job for me.
Update: It looks like the fields are actually tab separated, not space. If that is guaranteed, just split on \t.
First, let's see why (".*?"|\S+) "does not work". Specifically, look at ".*?" That means zero or more characters enclosed in double-quotes. Well, the field that is giving you problems is ""C:\Program Files\ABC\ABC XYZ"". Note that each "" at the beginning and end of that field will match ".*?" because "" consists of zero characters surrounded with double quotes.
It is better to match as specifically as possible rather than splitting. So, if you have a configuration file with directives and a fixed format, form a regular expression match that is as close to the format you are trying to match as possible.
Move the quotation marks outside of the capturing parentheses if you don't want them.
#!/usr/bin/perl
use strict;
use warnings;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my #parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};
use Data::Dumper;
print Dumper \#parts;
Output:
$VAR1 = [
'StartProgram',
'1',
'""C:\\Program Files\\ABC\\ABC XYZ""',
'CleanProgramTimeout',
'1',
'30'
];
In that vein, here is a more involved script:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #strings = split /\n/, <<'EO_TEXT';
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
EO_TEXT
my $re = qr{
(?<directive>StartProgram)\s+
(?<instance>[0-9][0-9]?)\s+
(?<path>"".+?""|\S+)\s+
(?<timeout_directive>CleanProgramTimeout)\s+
(?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;
for (#strings) {
if ( $_ =~ $re ) {
print Dumper \%+;
}
}
Output:
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => 'c:\\opt\\perl',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
Update: I cannot get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the repeated quotation marks that delineate the substring that should not be split. The following code is my best (not very good) attempt at solving the generic problem by using split and then selective re-gathering of parts of the string.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};
print Dumper parse_line($s);
print Dumper parse_line($t);
sub parse_line {
my ($line) = #_;
my #parts = split /(\s+)/, $line;
my #real_parts;
for (my $i = 0; $i < #parts; $i += 1) {
unless ( $parts[$i] =~ /^""/ ) {
push #real_parts, $parts[$i] if $parts[$i] =~ /\S/;
next;
}
my $part;
do {
$part .= $parts[$i++];
} until ($part =~ /""$/);
push #real_parts, $part;
}
return \#real_parts;
}
my $x = 'StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';
my #parts = $x =~ /("".*?""|[^\s]+?(?>\s|$))/g;
my $str = 'StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';
print "str:$str\n";
#A = $str =~ /(".+"|\S+)/g;
foreach my $l (#A) {
print "<$l>\n";
}
That gives me:
$ ./test.pl
str:StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 130
<StartProgram>
<1>
<""C:\Program Files\ABC\ABC XYZ"">
<CleanProgramTimeout>
<1>
<30>