I have a small problem. I have a perl regexp with multiple capture groups. Some of them have quantifiers (like '+'). If no quantifier is added, then #- & #+ array are filled nicely with the matched position of the capture groups, but if a quantifier is added only the last match is detected. But I would like to have all of them.
An example:
my $s = 'xx1a2b3cyy';
my $re = qr/^xx(\d\w)+/;
So I'd like to know that matches are '1a', '2b', '3c' at 2, 4, 6.
Simple matching gives:
if ($s =~ $re) {
print "Match #-, #+\n";
for (my $i = 0; $i < #-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
Gives:
Match 0 6, 8 8
i: 0 - 'xx1a2b3c
i: 1 - '3c
So only the last capture group match is remembered.
My next simple try was which is not really what I want as the RE is different:
$re = qr/(\d\w)/;
my #s = ($s =~ /$re/g);
print "RE: '#s'\n";
while ($s =~ /$re/g) {
print "Match #-, #+\n";
for (my $i = 0; $i < #-; ++$i) {
print 'i: ', $i, " - '", substr($s, $-[$i], $+[$i] - $-[$i]), "\n";
}
}
gives:
RE: '1a 2b 3c'
Match 2 2, 4 4
i: 0 - '1a
i: 1 - '1a
Match 4 4, 6 6
i: 0 - '2b
i: 1 - '2b
Match 6 6, 8 8
i: 0 - '3c
i: 1 - '3c
But this not what I want, as it would match a string like 'ZZ1aAA2bBB3cZZ'.
So somehow I have to combine the two. The best what I could get:
$re = '^xx(?:\d\w)*?\G(\d\w)';
pos($s) = 2;
while ($s =~ m($re)g) {
print "Match pos: ", pos($s), ', G: ', $1, ", '#-', '#+'\n"
}
gives:
Match pos: 4, G: 1a, '0 2', '4 4'
Match pos: 6, G: 2b, '0 4', '6 6'
Match pos: 8, G: 3c, '0 6', '8 8'
This is almost nice, but for this I need to know the position of the first possible match. If it is not set properly it will not match anything. I can only determine the first position if I remove the non greedy part:
$re = '^xx(\d\w)';
if ($s =~ m($re)) {
print "Match: '#-', '#+'\n";
}
which gives:
Match: '0 2', '4 4'
So $-[1] gives the first position, but for this I have to modify the RE "manually".
If I add code execution into the pattern I almost get what I need:
use re 'eval';
$re = '^xx(\d\w)+(??{print "Code: <#-> <#+>\n"})';
$s =~ m($re) and print "Match\n";
gives:
Code: <0 6> <8 8>
Code: <0 4> <6 6>
Code: <0 2> <4 4>
For this I need to add the (?{ code }) part.
Does anybody know a simpler method (I mean not need to modify the original RE) to get all the possible matches of a capture group having a quantifier?
Thanks in advance!
There's no general solution; the regex engine simply doesn't store the necessary information. You're asking to use a regex as a parser, and that's a no-go.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my #matches;
push #matches, $1 while /\G (\d\w) /xg;
return #matches;
}
}
or
sub extract {
my ($pairs) = $_[0] =~ /^xx((?:\d\w)+)/
or return ();
return unpack('(a2)*', $pairs);
}
If you just want the positions, it's the same.
sub extract {
for ($_[0]) {
/^ xx /xg
or return ();
my #matches;
push #matches, $-[1] while /\G (\d\w) /xg;
return #matches;
}
}
or
sub extract {
$_[0] =~ /^xx((?:\d\w)+)/
or return ();
return map { $-[1] + ( $_ - 1 )*2 } 1..length($1)/2;
}
Even a non-general purpose solution is extremely hard using regular expressions. Say you had the following pattern:
xx(\d\w)+yy(\d\w)+zz
The correct solution would be:
use Storable qw( dclone );
my $s = "xx1a2byy3c4dZZ...xx5a6byy7c8dzz";
local our $rv;
if (
$s =~ /
(?{ [] })
xx
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[0] }, $^N; $r }) )+
yy
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[1] }, $^N; $r }) )+
zz
(?{ $rv = $^R; })
/x
) {
say "\$1: #{ $rv->[0] }";
say "\$2: #{ $rv->[1] }";
}
Output:
$1: 5a 6b
$2: 7c 8d
And something like
(zz(\d\w)+)+
would need
use Storable qw( dclone );
my $s = "zz1a2bzz3c4d";
local our $rv;
if (
$s =~ /
(?{ [] })
(?:
(?{ my $r = dclone($^R); push #$r, []; $r })
zz
(?: (\d\w) (?{ my $r = dclone($^R); push #{ $r->[-1] }, $^N; $r }) )+
)+
(?{ $rv = $^R; })
/x
) {
say "\$1: #$_" for #$rv;
}
Output:
$1: 1a 2b
$1: 3c 4d
I think I can give some explanation for the behavior you see:
In the first example I can see only one capture group. The quantifier allows it to be used multiple times, but it's one capture group nonetheless. So every new occurence of a matching subpattern would overwrite the value previously captured there. Even if the RE engine is already advanced behind it, but backtracking would occur (for e.g. a more advanced pattern with branching and the likes), it could be that the now again visited capture group would change. And since #- and #+ hold the positions to the capture groups (as opposed to occuring subpattern matches), this would explain why there's only the last occurence of the subpattern contained.
You could even play around with named subpatterns and %+/%- and would experience the same thing. It becomes more obvious with the already used (?{ }), at least for debugging purposes. But use re 'debug' is fine for shorter regexes / strings to match.
So be aware of the effects of backtracking to capture groups while matching is still in progress!
But if you don't have to care about backtracking, I can think of kind of a recipe to handle a capture group with a quantifier:
If your capture group is (bla) and your quantifier {0,3}, transform it into
(?:(bla)(?{ print $-[$#-],$+[$#-]."\n" })){0,3}.
You practically put the subpattern into another (non-capturing) group. If the RE engine is done with it, execute code regarding the last capture group matched so far. The quantifier outside the surrounding group is then responsible for the correct number of execution of the code fragment.
So you example becomes this:
use Data::Dumper;
my $s = 'xx1a2b3cyy';
my #submatches;
sub getem { push #submatches, [$-[$#-],$+[$#-]]; }
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+/;
print Dumper(\#submatches);
This also works with multiple capture groups transformed this way:
my $s = 'xx1a2b3cyy4de5fg6hihhh2';
$s =~ m/^xx(?:(\d\w)(?{ getem() }))+yy(?:(\d\w{2})(?{ getem() }))+hh/;
You have to adjust the index used, if your capture group contains more capture groups. That's why I prefer names capture groups.
Hope this helps.
I have a very large text file whose lines are comma-separated values. Some values are missing. For each line, I'd like to print the index and value of all the non-empty fields.
For example, a line might look like
,,10.3,,,,5.2,3.1,,,,,,,
in which case the output I want is
2,10.3,6,5.2,7,3.1
I know how to accomplish this by first splitting the input into an array, and then going through the array with a for loop, but these are huge files (multi-gigabyte) and I'm wondering if there is a faster way. (e.g. using some advanced regexp)
I haven't benchmarked it yet, but I would assume
my $line = ",,10.3,,,,5.2,3.1,,,,,,,";
my $index = 0;
print join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
is faster than some advanced regexp.
The problem is that as long as you have to know the index, you still have to keep track of those missing entries somehow.
Something like this might not be too slow though:
my ($i, #vars);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
print join ",", #vars;
You could probably leave out the first capturing group and use pos() to work out the index.
Here's a comparison of my two suggestions and sin's with 1M iterations:
Rate flesk1 sin flesk2
flesk1 87336/s -- -8% -27%
sin 94518/s 8% -- -21%
flesk2 120337/s 38% 27% --
Seems like my regex works better than I thought.
You might be able to mix and match regex and code -
$line =~ /(?{($cnt,#ary)=(0,)})^(?:([^,]+)(?{push #ary,$cnt; push #ary,$^N})|,(?{$cnt++}))+/x
and print join( ',', #ary);
expanded -
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
and print join( ',', #ary);
some benchmarks
With a slight tweak of flesk's and sln's (look for fleskNew and slnNew),
the winner is the fleskNew when the substitution operator is removed.
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1 => sub{
$index = 0;
join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
} );
numbers -
Rate flesk1 sln1 flesk2 slnNew fleskNew
flesk1 20325/s -- -51% -52% -56% -60%
sln1 41312/s 103% -- -1% -10% -19%
flesk2 41916/s 106% 1% -- -9% -17%
slnNew 45978/s 126% 11% 10% -- -9%
fleskNew 50792/s 150% 23% 21% 10% --
some benchmarks 2
Adds Birei's in-line replacment and trim (all-in-one) solution.
Abberations:
Flesk1 is modified to remove the final 'join' as it is not included in
the other regex solutions. This gives it a chance to bench better.
Birei deviates in the bench as it modifies the original string to be the final solution.
That aspect can't be taken out. The difference between Birei1 and BireiNew is that the
new one removes the final ','.
Flesk2, Birei1 and BireiNew have the additional overhead of restoring the original string
due to the substitution operator.
The winner still looks like FleskNew ..
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1a => sub{
$index = 0;
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
Birei1 => sub{
$i = -1;
$line =~
s/
(?(?=,+)
( (?: , (?{ ++$i }) )+ )
| (?<no_comma> [^,]+ ,? ) (?{ ++$i })
)
/
defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]
/xge;
$line = $samp;
},
BireiNew => sub{
$i = 0;
$line =~
s/
(?: , (?{++$i}) )*
(?<data> [^,]* )
(?: ,*$ )?
(?= (?<trailing_comma> ,?) )
/
length $+{data} ? "$i,$+{data}$+{trailing_comma}" : ""
/xeg;
$line = $samp;
},
} );
results -
Rate BireiNew Birei1 flesk1a flesk2 sln1 slnNew fleskNew
BireiNew 6030/s -- -18% -74% -85% -86% -87% -88%
Birei1 7389/s 23% -- -68% -82% -82% -84% -85%
flesk1a 22931/s 280% 210% -- -44% -45% -51% -54%
flesk2 40933/s 579% 454% 79% -- -2% -13% -17%
sln1 41752/s 592% 465% 82% 2% -- -11% -16%
slnNew 47088/s 681% 537% 105% 15% 13% -- -5%
fleskNew 49563/s 722% 571% 116% 21% 19% 5% --
Using a regex (although I'm sure it can be simpler):
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
Explanation:
s/PATTERN/REPLACEMENT/ge # g -> Apply to all occurrences
# e -> Evaluate replacement as a expression.
(?
(?=,+) # Check for one or more commas.
((?:,(?{ ++$i }))+) # If (?=,+) was true, increment variable '$i' with each comma found.
|
(?<no_comma>[^,]+,?)(?{ ++$i }) # If (?=,+) was false, get number between comma and increment the $i variable only once.
)
defined $+{no_comma} # If 'no_comma' was set in 'pattern' expression...
$i . qq[,] . $+{no_comma} # insert the position just before it.
qq[] # If wasn't set, it means that pattern matched only commas, so remove then.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
our $i = -1;
chomp;
printf qq[Orig = $_\n];
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
# s/,\Z//;
printf qq[Mod = $_\n\n];
}
__DATA__
,,10.3,,,,5.2,3.1,,,,,,,
10.3,,,,5.2,3.1,,,,,,,
,10.3,,,,5.2,3.1
,,10.3,5.2,3.1,
Run the script like:
perl script.pl
And output:
Orig = ,,10.3,,,,5.2,3.1,,,,,,,
Mod = 2,10.3,6,5.2,7,3.1,
Orig = 10.3,,,,5.2,3.1,,,,,,,
Mod = 0,10.3,4,5.2,5,3.1,
Orig = ,10.3,,,,5.2,3.1
Mod = 1,10.3,5,5.2,6,3.1
Orig = ,,10.3,5.2,3.1,
Mod = 2,10.3,3,5.2,4,3.1,
As you can see, it keeps last comma. I don't know how to remove it without an extra regex, just uncomment s/,\Z//; in previous code.
I have the following string:
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
I need a regular expression to split this line but ignore spaces in double quotes in Perl.
The following is what I tried but it does not work.
(".*?"|\S+)
Once upon a time I also tried to re-invent the wheel, and solve this myself.
Now I just use Text::ParseWords
and let it do the job for me.
Update: It looks like the fields are actually tab separated, not space. If that is guaranteed, just split on \t.
First, let's see why (".*?"|\S+) "does not work". Specifically, look at ".*?" That means zero or more characters enclosed in double-quotes. Well, the field that is giving you problems is ""C:\Program Files\ABC\ABC XYZ"". Note that each "" at the beginning and end of that field will match ".*?" because "" consists of zero characters surrounded with double quotes.
It is better to match as specifically as possible rather than splitting. So, if you have a configuration file with directives and a fixed format, form a regular expression match that is as close to the format you are trying to match as possible.
Move the quotation marks outside of the capturing parentheses if you don't want them.
#!/usr/bin/perl
use strict;
use warnings;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my #parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};
use Data::Dumper;
print Dumper \#parts;
Output:
$VAR1 = [
'StartProgram',
'1',
'""C:\\Program Files\\ABC\\ABC XYZ""',
'CleanProgramTimeout',
'1',
'30'
];
In that vein, here is a more involved script:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #strings = split /\n/, <<'EO_TEXT';
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
EO_TEXT
my $re = qr{
(?<directive>StartProgram)\s+
(?<instance>[0-9][0-9]?)\s+
(?<path>"".+?""|\S+)\s+
(?<timeout_directive>CleanProgramTimeout)\s+
(?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;
for (#strings) {
if ( $_ =~ $re ) {
print Dumper \%+;
}
}
Output:
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => 'c:\\opt\\perl',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
Update: I cannot get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the repeated quotation marks that delineate the substring that should not be split. The following code is my best (not very good) attempt at solving the generic problem by using split and then selective re-gathering of parts of the string.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};
print Dumper parse_line($s);
print Dumper parse_line($t);
sub parse_line {
my ($line) = #_;
my #parts = split /(\s+)/, $line;
my #real_parts;
for (my $i = 0; $i < #parts; $i += 1) {
unless ( $parts[$i] =~ /^""/ ) {
push #real_parts, $parts[$i] if $parts[$i] =~ /\S/;
next;
}
my $part;
do {
$part .= $parts[$i++];
} until ($part =~ /""$/);
push #real_parts, $part;
}
return \#real_parts;
}
my $x = 'StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';
my #parts = $x =~ /("".*?""|[^\s]+?(?>\s|$))/g;
my $str = 'StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';
print "str:$str\n";
#A = $str =~ /(".+"|\S+)/g;
foreach my $l (#A) {
print "<$l>\n";
}
That gives me:
$ ./test.pl
str:StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 130
<StartProgram>
<1>
<""C:\Program Files\ABC\ABC XYZ"">
<CleanProgramTimeout>
<1>
<30>