Why do @+ and @{^CAPTURE} differ in length?

I'm trying to understand how the regex variables work, so I can save submatch positions in the payload within embedded code expressions. According to perlvar, the positive indices of the array correspond to $1, $2, $3, etc., but that doesn't seem to be the case?
#!/usr/bin/perl -w
use v5.28;
use Data::Dumper;
"XY" =~ / ( (.*) (.) (?{
say Dumper { match_end => \@+ };
say Dumper { capture => \@{^CAPTURE} }
}) ) (.)/x;
Output:
$VAR1 = {
'match_end' => [
2,
undef,
1,
2,
undef
]
};
$VAR1 = {
'capture' => [
undef,
'X',
'Y'
]
};
$VAR1 = {
'match_end' => [
1,
2,
0,
1,
undef
]
};
$VAR1 = {
'capture' => [
'XY',
'',
'X'
]
};

The @+ array apparently gets allocated, or otherwise prepared, already at compilation:
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd @+}) ( (.) )/x'
prints
(0, undef, undef)
(0 for the whole match and an undef for each indicated capture group), while
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd @+}) ( (.) (.) )/x'
prints
(0, undef, undef, undef)
with one more element for one more capture group.
On the other hand, @{^CAPTURE} is just plain empty until there are actual patterns to capture, as we can see from mob's detailed analysis. This, I'd say, plays well with its name.
After the fact the arrays agree, with a shift of one in the indices since @+ also contains (the offset for) the whole match, at $+[0].
Another difference is that a trailing failed optional match doesn't get a slot in @{^CAPTURE}:
perl -MData::Dump=dd -we'$_=q(abc); /((x)? (.) (x)?)/x; dd @+; dd @{^CAPTURE}'
prints
(1, 1, undef, 1, undef)
("a", undef, "a")

The perlvar docs are unclear about what @{^CAPTURE} looks like in the middle of a regexp evaluation, but there is a clear progression that depends on where in the regexp you are looking at it.
use 5.026;
use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 0;
sub DEBUG_CAPTURE { say Dumper { a => $_[0], capture => \@{^CAPTURE} }; }
"XY" =~ /
(?{DEBUG_CAPTURE(0)})
(
(?{DEBUG_CAPTURE(1)})
(
(?{DEBUG_CAPTURE(2)})
(.*) (?{DEBUG_CAPTURE(3)})
(.) (?{DEBUG_CAPTURE(4)})
)
(?{DEBUG_CAPTURE(5)}) (.)
(?{DEBUG_CAPTURE(6)})
)
(?{DEBUG_CAPTURE(7)}) /x;
DEBUG_CAPTURE(8);
Output
$VAR1 = {'a' => 0,'capture' => []};
$VAR1 = {'a' => 1,'capture' => []};
$VAR1 = {'a' => 2,'capture' => []};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'XY']};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'X']};
$VAR1 = {'a' => 4,'capture' => [undef,undef,'X','Y']};
$VAR1 = {'a' => 5,'capture' => [undef,'XY','X','Y']};
$VAR1 = {'a' => 3,'capture' => [undef,'XY','','Y']};
$VAR1 = {'a' => 4,'capture' => [undef,'XY','','X']};
$VAR1 = {'a' => 5,'capture' => [undef,'X','','X']};
$VAR1 = {'a' => 6,'capture' => [undef,'X','','X','Y']};
$VAR1 = {'a' => 7,'capture' => ['XY','X','','X','Y']};
$VAR1 = {'a' => 8,'capture' => ['XY','X','','X','Y']};
The docs are correct if you are observing @{^CAPTURE} after a regexp has been completely evaluated. While evaluation is in process, @{^CAPTURE} seems to grow as the number of capture groups encountered increases. But it's not clear how useful it is to look at @{^CAPTURE} at least until you get to the end of the expression.

Related

Some capture groups seem lost when matching group repeatedly

While trying to parse the output of monitoring plugins, I ran into a match result that I did not expect.
First consider this debugger session with Perl 5.18.2:
DB<6> x $_
0 'last=0.508798;;;0'
DB<7> x $RE
0 (?^u:^((?^u:\'[^\'=]+\'|[^\'= ]+))=((?^u:\\d+(?:\\.\\d*)?|\\.\\d+))(s|%|[KMT]?B)?(;(?^u:\\d+(?:\\.\\d*)?|\\.\\d+)?){0,4}$)
-> qr/(?^u:^((?^u:'[^'=]+'|[^'= ]+))=((?^u:\d+(?:\.\d*)?|\.\d+))(s|%|[KMT]?B)?(;(?^u:\d+(?:\.\d*)?|\.\d+)?){0,4}$)/
DB<8> @m = /$RE/
DB<9> x @m
0 'last'
1 0.508798
2 undef
3 ';0'
DB<10>
OK, the regex $RE (intended to match "'label'=value[UOM];[warn];[crit];[min];[max]") looks terrifying at a first glance, so let me show the construction of it:
my $RE_label = qr/'[^'=]+'|[^'= ]+/;
my $RE_simple_float = qr/\d+(?:\.\d*)?|\.\d+/;
my $RE_numeric = qr/[-+]?$RE_simple_float(?:[eE][-+]?\d+)?/;
my $RE = qr/^($RE_label)=($RE_simple_float)(s|%|[KMT]?B)?(;$RE_simple_float?){0,4}$/;
The relevant part is (;$RE_simple_float?){0,4}$ intended to match ";[warn];[crit];[min];[max]" (still not perfect), so for ";;;0" I'd expect @m to end with ';', ';', ';0'.
However it seems the matches are lost, except for the last one.
Did I misunderstand something, or is it a Perl bug?
When you use {<number>} (or + or * for that matter) after a capture group, only the last value that is matched by the capture group is stored. This explains why you only end up with ;0 instead of ;;;0 in your fourth capture group: (;$RE_simple_float?){0,4} sets the fourth capture group to the last element it matches.
To fix that, I would recommend matching the whole end of the string and splitting it afterwards:
my $RE = qr/...((?:;$RE_simple_float?){0,4})$/;
my @m = /$RE/;
my @end = split /;/, $m[3]; # use /(?<=;)/ to keep the semicolons
Another solution is to repeat the capture group: replace (;$RE_simple_float?){0,4} with
(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?(;$RE_simple_float?)?
The capture groups that do not match will be set to undef. The issue with this approach is that it's a bit verbose, and it only works for {}, not for + or *.
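The last-iteration-wins rule is not specific to Perl, so it can be checked in other backtracking engines too. Below is a minimal sketch using Python's re module, offered only as a comparison. FLOAT is a simplified stand-in for $RE_simple_float, and the label pattern is reduced to [^=]+ for brevity; both are assumptions for illustration, not the regex from the question.

```python
import re

# Simplified stand-in for $RE_simple_float (assumption for illustration).
FLOAT = r"(?:\d+(?:\.\d*)?|\.\d+)"
line = "last=0.508798;;;0"

# A quantified capturing group keeps only its final iteration.
repeated = re.match(rf"^([^=]+)=({FLOAT})(;{FLOAT}?){{0,4}}$", line)
last_only = repeated.group(3)

# Capture the whole tail with a single group, then split it afterwards.
whole = re.match(rf"^([^=]+)=({FLOAT})((?:;{FLOAT}?){{0,4}})$", line)
tail = whole.group(3)
fields = tail.split(";")[1:]  # drop the empty piece before the first ';'

print(last_only)
print(tail)
print(fields)
```

Here last_only holds just the final ";0", while tail holds the complete ";;;0" that can then be split into the individual threshold fields.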
The following demo code uses split to extract the data of interest. Check whether it fits as a solution for your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
while( <DATA> ) {
chomp;
say;
my $record;
$record->@{qw/label value warn crit min max/} = split(/[=;]/,$_);
say Dumper($record);
}
exit 0;
#'label'=value[UOM];[warn];[crit];[min];[max]
__DATA__
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
'label 2'=10%;7%;18%;2%;28%
'label 3'=0.5us;2.3us
Output
'label 1'=0.3345s;0.8s;1.2s;0.2s;3.2s
$VAR1 = {
'crit' => '1.2s',
'warn' => '0.8s',
'value' => '0.3345s',
'label' => '\'label 1\'',
'max' => '3.2s',
'min' => '0.2s'
};
'label 2'=10%;7%;18%;2%;28%
$VAR1 = {
'min' => '2%',
'max' => '28%',
'label' => '\'label 2\'',
'value' => '10%',
'warn' => '7%',
'crit' => '18%'
};
'label 3'=0.5us;2.3us
$VAR1 = {
'min' => undef,
'max' => undef,
'label' => '\'label 3\'',
'warn' => '2.3us',
'value' => '0.5us',
'crit' => undef
};

Perl regex vs. Raku regex, differences in the engine?

I am trying to convert a regex based solution for the knapsack problem from Perl to raku. Details on Perlmonks
The Perl solution creates this regex:
(?<P>(?:vvvvvvvvvv)?)
(?<B>(?:vv)?)
(?<Y>(?:vvvv)?)
(?<G>(?:vv)?)
(?<R>(?:v)?)
0
(?=
(?(?{ $1 })wwww|)
(?(?{ $2 })w|)
(?(?{ $3 })wwwwwwwwwwww|)
(?(?{ $4 })ww|)
(?(?{ $5 })w|)
)
which gets matched against vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. After that the match hash %+ contains the items to put in the sack.
My raku conversion is:
$<B> = [ [ vv ]? ]
$<P> = [ [ vvvvvvvvvv ]? ]
$<R> = [ [ v ]? ]
$<Y> = [ [ vvvv ]? ]
$<G> = [ [ vv ]? ]
0
<?before
[ { say "B"; say $/<B>; say $0; say $1; $1 } w || { "" } ]
[ { say "P"; say $/<P>; say $0; say $1; $2 } wwww || { "" } ]
[ { say "R"; say $/<R>; say $0; say $1; $3 } w || { "" } ]
[ { say "Y"; say $/<Y>; say $0; say $1; $4 } wwwwwwwwwwww || { "" } ]
[ { say "G"; say $/<G>; say $0; say $1; $5 } ww || { "" } ]
which also matches vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. But the match object, $/ does not contain anything useful. Also, my debug says all say Nil, so at that point the backreference does not seem to work?
Here's my test script:
my $max-weight = 15;
my %items =
'R' => { w => 1, v => 1 },
'B' => { w => 1, v => 2 },
'G' => { w => 2, v => 2 },
'Y' => { w => 12, v => 4 },
'P' => { w => 4, v => 10 }
;
my $str = 'v' x %items.map(*.value<v>).sum ~
'0' ~
'w' x $max-weight;
say $str;
my $i = 0;
my $left = my $right = '';
for %items.keys -> $item-name
{
my $v = 'v' x %items{ $item-name }<v>;
my $w = 'w' x %items{ $item-name }<w>;
$left ~= sprintf( '$<%s> = [ [ %s ]? ] ' ~"\n", $item-name, $v );
$right ~= sprintf( '[ { say "%s"; say $/<%s>; say $0; say $1; $%d } %s || { "" } ]' ~ "\n", $item-name, $item-name, ++$i, $w );
}
use MONKEY-SEE-NO-EVAL;
my $re = sprintf( '%s0' ~ "\n" ~ '<?before ' ~ "\n" ~ '%s>' ~ "\n", $left, $right );
say $re;
dd $/ if $str ~~ m:g/<$re>/;
This answer only covers what's going wrong. It does not address a solution. I have not filed corresponding bugs. I have not yet even searched bug queues to see if I can find reports corresponding to either or both of the two issues I've surfaced.
my $lex-var;
sub debug { .say for ++$, :$<rex-var>, :$lex-var }
my $regex = / $<rex-var> = (.) { $lex-var = $<rex-var> } <?before . { debug }> / ;
'xx' ~~ $regex; say $/;
'xx' ~~ / $regex /; say $/;
displays:
1
rex-var => Nil
lex-var => 「x」
「x」
rex-var => 「x」
2
rex-var => Nil
lex-var => 「x」
「x」
Focusing first on the first call of debug (the lines starting with 1 and ending at rex-var => 「x」), we can see that:
Something's gone awry during the call to debug: $<rex-var> is reported as having the value Nil.
When the regex match is complete and we return to the mainline, the say $/ reports a full and correctly populated result that includes the rex-var named match.
To begin to get a sense of what's gone wrong, please consider reading the bulk of my answer to another SO question. You can safely skip the "Using ~" section. Footnotes 1, 2, and 6 are also probably completely irrelevant to your scenario.
For the second match, we see that not only is $<rex-var> reported as being Nil during the debug call, the final match variable, as reported back in the mainline with the second say $/, is also missing the rex-var match. And the only difference is that the regex $regex is called from within an outer regex.

Identify words in a sentence which are figures through regex [duplicate]

I have string:
$string = 'Five People';
I want to replace all number-words into numbers. So results are:
$string = '5 People';
I have this function to convert single words to int:
function words_to_number($data) {
$data = strtr(
$data,
array(
'zero' => '0',
'a' => '1',
'one' => '1',
'two' => '2',
'three' => '3',
'four' => '4',
'five' => '5',
'six' => '6',
'seven' => '7',
'eight' => '8',
'nine' => '9',
'ten' => '10',
'eleven' => '11',
'twelve' => '12',
'thirteen' => '13',
'fourteen' => '14',
'fifteen' => '15',
'sixteen' => '16',
'seventeen' => '17',
'eighteen' => '18',
'nineteen' => '19',
'twenty' => '20',
'thirty' => '30',
'forty' => '40',
'fourty' => '40', // common misspelling
'fifty' => '50',
'sixty' => '60',
'seventy' => '70',
'eighty' => '80',
'ninety' => '90',
'hundred' => '100',
'thousand' => '1000',
'million' => '1000000',
'billion' => '1000000000',
'and' => '',
)
);
// Coerce all tokens to numbers
$parts = array_map(
function ($val) {
return floatval($val);
},
preg_split('/[\s-]+/', $data)
);
$stack = new SplStack; // Current work stack
$sum = 0; // Running total
$last = null;
foreach ($parts as $part) {
if (!$stack->isEmpty()) {
// We're part way through a phrase
if ($stack->top() > $part) {
// Decreasing step, e.g. from hundreds to ones
if ($last >= 1000) {
// If we drop from more than 1000 then we've finished the phrase
$sum += $stack->pop();
// This is the first element of a new phrase
$stack->push($part);
} else {
// Drop down from less than 1000, just addition
// e.g. "seventy one" -> "70 1" -> "70 + 1"
$stack->push($stack->pop() + $part);
}
} else {
// Increasing step, e.g ones to hundreds
$stack->push($stack->pop() * $part);
}
} else {
// This is the first element of a new phrase
$stack->push($part);
}
// Store the last processed part
$last = $part;
}
return $sum + $stack->pop();
}
// test
$words = 'five';
echo words_to_number($words);
Works great (try it on ideone). I need to find a way to determine which words within a string are number words, and then replace all of these matching words with the corresponding numbers.
How can this be done? Maybe a regex approach?
I ported the text2num Python library to PHP, combined it with a regex for matching English spelled-out numbers, extended it up to the decillion, and here is the result:
function text2num($s) {
// Enhanced the regex at http://www.rexegg.com/regex-trick-numbers-in-english.html#english-number-regex
$reg = <<<REGEX
(?x) # free-spacing mode
(?(DEFINE)
# Within this DEFINE block, we'll define many subroutines
# They build on each other like lego until we can define
# a "big number"
(?<one_to_9>
# The basic regex:
# one|two|three|four|five|six|seven|eight|nine
# We'll use an optimized version:
# Option 1: four|eight|(?:fiv|(?:ni|o)n)e|t(?:wo|hree)|
# s(?:ix|even)
# Option 2:
(?:f(?:ive|our)|s(?:even|ix)|t(?:hree|wo)|(?:ni|o)ne|eight)
) # end one_to_9 definition
(?<ten_to_19>
# The basic regex:
# ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|
# eighteen|nineteen
# We'll use an optimized version:
# Option 1: twelve|(?:(?:elev|t)e|(?:fif|eigh|nine|(?:thi|fou)r|
# s(?:ix|even))tee)n
# Option 2:
(?:(?:(?:s(?:even|ix)|f(?:our|if)|nine)te|e(?:ighte|lev))en|
t(?:(?:hirte)?en|welve))
) # end ten_to_19 definition
(?<two_digit_prefix>
# The basic regex:
# twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety
# We'll use an optimized version:
# Option 1: (?:fif|six|eigh|nine|(?:tw|sev)en|(?:thi|fo)r)ty
# Option 2:
(?:s(?:even|ix)|t(?:hir|wen)|f(?:if|or)|eigh|nine)ty
) # end two_digit_prefix definition
(?<one_to_99>
(?&two_digit_prefix)(?:[- ](?&one_to_9))?|(?&ten_to_19)|
(?&one_to_9)
) # end one_to_99 definition
(?<one_to_999>
(?&one_to_9)[ ]hundred(?:[ ](?:and[ ])?(?&one_to_99))?|
(?&one_to_99)
) # end one_to_999 definition
(?<one_to_999_999>
(?&one_to_999)[ ]thousand(?:[ ](?&one_to_999))?|
(?&one_to_999)
) # end one_to_999_999 definition
(?<one_to_999_999_999>
(?&one_to_999)[ ]million(?:[ ](?&one_to_999_999))?|
(?&one_to_999_999)
) # end one_to_999_999_999 definition
(?<one_to_999_999_999_999>
(?&one_to_999)[ ]billion(?:[ ](?&one_to_999_999_999))?|
(?&one_to_999_999_999)
) # end one_to_999_999_999_999 definition
(?<one_to_999_999_999_999_999>
(?&one_to_999)[ ]trillion(?:[ ](?&one_to_999_999_999_999))?|
(?&one_to_999_999_999_999)
) # end one_to_999_999_999_999_999 definition
# ==== MORE ====
(?<one_to_quadrillion>
(?&one_to_999)[ ]quadrillion(?:[ ](?&one_to_999_999_999_999_999))?|
(?&one_to_999_999_999_999_999)
) # end one_to_quadrillion definition
(?<one_to_quintillion>
(?&one_to_999)[ ]quintillion(?:[ ](?&one_to_quadrillion))?|
(?&one_to_quadrillion)
) # end one_to_quintillion definition
(?<one_to_sextillion>
(?&one_to_999)[ ]sextillion(?:[ ](?&one_to_quintillion))?|
(?&one_to_quintillion)
) # end one_to_sextillion definition
(?<one_to_septillion>
(?&one_to_999)[ ]septillion(?:[ ](?&one_to_sextillion))?|
(?&one_to_sextillion)
) # end one_to_septillion definition
(?<one_to_octillion>
(?&one_to_999)[ ]octillion(?:[ ](?&one_to_septillion))?|
(?&one_to_septillion)
) # end one_to_octillion definition
(?<one_to_nonillion>
(?&one_to_999)[ ]nonillion(?:[ ](?&one_to_octillion))?|
(?&one_to_octillion)
) # end one_to_nonillion definition
(?<one_to_decillion>
(?&one_to_999)[ ]decillion(?:[ ](?&one_to_nonillion))?|
(?&one_to_nonillion)
) # end one_to_decillion definition
(?<bignumber>
zero|(?&one_to_decillion)
) # end bignumber definition
(?<zero_to_9>
(?&one_to_9)|zero
) # end zero to 9 definition
# (?<decimals>
# point(?:[ ](?&zero_to_9))+
# ) # end decimals definition
) # End DEFINE
####### The Regex Matching Starts Here ########
\b(?:(?&ten_to_19)\s+hundred|(?&bignumber))\b
REGEX;
return preg_replace_callback('~' . trim($reg) . '~i', function ($x) {
return text2num_internal($x[0]);
}, $s);
}
function text2num_internal($s) {
// Port of https://github.com/ghewgill/text2num/blob/master/text2num.py
$Small = [
'zero'=> 0,
'one'=> 1,
'two'=> 2,
'three'=> 3,
'four'=> 4,
'five'=> 5,
'six'=> 6,
'seven'=> 7,
'eight'=> 8,
'nine'=> 9,
'ten'=> 10,
'eleven'=> 11,
'twelve'=> 12,
'thirteen'=> 13,
'fourteen'=> 14,
'fifteen'=> 15,
'sixteen'=> 16,
'seventeen'=> 17,
'eighteen'=> 18,
'nineteen'=> 19,
'twenty'=> 20,
'thirty'=> 30,
'forty'=> 40,
'fifty'=> 50,
'sixty'=> 60,
'seventy'=> 70,
'eighty'=> 80,
'ninety'=> 90
];
$Magnitude = [
'thousand'=> 1000,
'million'=> 1000000,
'billion'=> 1000000000,
'trillion'=> 1000000000000,
'quadrillion'=> 1000000000000000,
'quintillion'=> 1000000000000000000,
'sextillion'=> 1000000000000000000000,
'septillion'=> 1000000000000000000000000,
'octillion'=> 1000000000000000000000000000,
'nonillion'=> 1000000000000000000000000000000,
'decillion'=> 1000000000000000000000000000000000,
];
$a = preg_split("~[\s-]+(?:and[\s-]+)?~u", $s);
$a = array_map('strtolower', $a);
$n = 0;
$g = 0;
foreach ($a as $w) {
if (isset($Small[$w])) {
$g = $g + $Small[$w];
}
else if ($w == "hundred" && $g != 0) {
$g = $g * 100;
}
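else if (isset($Magnitude[$w])) {
// Known magnitude word: fold the current group into the total
$n = $n + $g * $Magnitude[$w];
$g = 0;
}
else {
// Checking with isset() first avoids an undefined-index warning for unknown words
throw new Exception("Unknown number: " . $w);
}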
}
return $n + $g;
}
echo text2num("one") . "\n"; // 1
echo text2num("twelve") . "\n"; // 12
echo text2num("seventy two") . "\n"; // 72
echo text2num("three hundred") . "\n"; // 300
echo text2num("twelve hundred") . "\n"; // 1200
echo text2num("twelve thousand three hundred four") . "\n"; // 12304
echo text2num("six million") . "\n"; // 6000000
echo text2num("six million four hundred thousand five") . "\n"; // 6400005
echo text2num("one hundred twenty three billion four hundred fifty six million seven hundred eighty nine thousand twelve") . "\n"; # // 123456789012
echo text2num("four decillion") . "\n"; // 4000000000000000000000000000000000
echo text2num("five hundred and thirty-seven") . "\n"; // 537
echo text2num("five hundred and thirty seven") . "\n"; // 537
See the PHP demo.
The regex can actually match either just big numbers or numbers like "eleven hundred", see \b(?:(?&ten_to_19)\s+hundred|(?&bignumber))\b. It can be further enhanced. E.g. word boundaries may be replaced with other boundary types (like (?<!\S) and (?!\S) to match in between whitespaces, etc.).
The decimal part of the regex is commented out since, even if we match it, text2num_internal won't handle it.
You can use this regex:
\b(zero|a|one|tw(elve|enty|o)|th(irt(een|y)|ree)|fi(ft(een|y)|ve)|(four|six|seven|nine)(teen|ty)?|eight(een|y)?|ten|eleven|forty|hundred|thousand|(m|b)illion|and)+\b
By the way, there might be a better regex out there. Until someone posts it, you can use the following implementation:
function word_numbers_to_numbers($string) {
// Define the pattern inside the function; a variable from the global scope would not be visible here
$regex = '/\b(zero|a|one|tw(elve|enty|o)|th(irt(een|y)|ree)|fi(ft(een|y)|ve)|(four|six|seven|nine)(teen|ty)?|eight(een|y)?|ten|eleven|forty|hundred|thousand|(m|b)illion|and)+\b/i';
return preg_replace_callback($regex, function($m) {
return words_to_number($m[0]);
}, $string);
}
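To make the "match the number phrase, then convert it in a callback" idea concrete, here is a small sketch in Python rather than PHP. It is deliberately limited to 0 through 99, and the names UNITS, TENS, phrase_to_int and replace_number_words are hypothetical, chosen only for this illustration. It is not a port of either function above.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
ONES = [word for word, value in UNITS.items() if 1 <= value <= 9]


def alt(words):
    # Longest words first, so "seventeen" is tried before "seven".
    return "|".join(sorted(words, key=len, reverse=True))


# Either a tens word with an optional ones word, or a single 0-19 word.
PHRASE = re.compile(
    rf"\b(?:(?:{alt(TENS)})(?:[ -](?:{alt(ONES)}))?|(?:{alt(UNITS)}))\b",
    re.IGNORECASE,
)


def phrase_to_int(phrase):
    # Sum the value of each word; valid because PHRASE only allows tens + ones.
    words = re.split(r"[ -]", phrase.lower())
    return sum(UNITS.get(w, 0) + TENS.get(w, 0) for w in words)


def replace_number_words(text):
    # The callback receives each matched phrase and returns its digits.
    return PHRASE.sub(lambda m: str(phrase_to_int(m.group(0))), text)


print(replace_number_words("Five People"))
print(replace_number_words("seventy-two apples and nine pears"))
```

The word boundaries keep the pattern from firing inside longer words such as "stone". Larger numbers would need the hundred/thousand handling shown in the answers above.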

perl: capturing the replaced-with string

I have code in a loop similar to
for( my $i=0; $a =~ s/<tag>(.*?)<\/tag>/sprintf("&CITE%03d;",$i)/e ; $i++ ){
$cite{ $i } = $1;
}
but instead of just the integer index, I want to make the keys of the hash the actual replaced-with text (placeholder "&CITE001;", etc.) without having to redo the sprintf().
I was almost sure there was a way to do it (a variable similar to $& and such), but maybe I was thinking of vim's substitutions and not Perl. :)
Thanks!
my $i = 0;
s{<tag>(.*?)</tag>}{
my $entity = sprintf("&CITE%03d;", $i++);
$cite{$entity} = $1;
$entity
}eg;
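For comparison, the same record-as-you-replace pattern exists outside Perl: the replacement is computed by a callback that also stores the placeholder it returns. A rough sketch in Python follows. The name extract_citations is hypothetical, while the <tag> markup and the &CITE%03d; format are taken from the question.

```python
import re


def extract_citations(text):
    """Replace each <tag>...</tag> with a placeholder and remember what it held."""
    cites = {}

    def replace(match):
        # Build the placeholder once, use it as the key and as the replacement.
        placeholder = f"&CITE{len(cites):03d};"
        cites[placeholder] = match.group(1)
        return placeholder

    new_text = re.sub(r"<tag>(.*?)</tag>", replace, text, flags=re.DOTALL)
    return new_text, cites


new_text, cites = extract_citations("<tag>zero</tag>\n<tag>one</tag>")
print(new_text)
print(cites)
```

As in the s{}{}eg version, the text is scanned once and the placeholder string is never rebuilt.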
I did something of a hack, but really wanted something a bit more elegant. What I ended up doing (for now) is
my $t;
for( my $i=0; $t = sprintf("&CITE%04d;",$i), $all =~ s/($oct.*?$cct)/$t/s; $i++ ){
$cites{$t} = $1;
}
but I really wanted something even more "self-contained".
Just being able to grab the replacement string would've made things much simpler, though. This is a simple read-modify-write op.
True, adding the 'g' modifier should help shave some microseconds off it. :D
I think any method other than re-starting the search from the start of the target
is always the better choice.
In that vein and, as an alternative, you can move the logic inside the regex
via a Code Construct (?{ code }) and leverage the fact that $^N contains
the last capture content.
Perl
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
my $target = "<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>";
my %cite;
my ($cnt,$key) = (0,'');
$target =~ s/
<tag> (.*?) <\/tag>
(?{
$key = sprintf("&CITE%03d;", $cnt++);
$cite{$key} = $^N;
})
/$key/xg;
print $target, "\n";
print Dumper(\%cite);
Output
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};
Edited/code by @Ikegami
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
sub f {
my $target = "<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>";
my %cite;
my ($cnt,$key) = (0,'');
$target =~ s/
<tag> (.*?) <\/tag>
(?{
$key = sprintf("&CITE%03d;", $cnt++);
$cite{$key} = $^N;
})
/$key/xg;
print $target, "\n";
print Dumper(\%cite);
}
f() for 1..2;
Output
Variable "$key" will not stay shared at (re_eval 1) line 2.
Variable "$cnt" will not stay shared at (re_eval 1) line 2.
Variable "%cite" will not stay shared at (re_eval 1) line 3.
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};
$VAR1 = {};
This issue has been addressed in 5.18.
Perl by @sln
See, now I don't get that issue in version 5.20.
And, I don't believe I got it in 5.12 either.
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
sub wrapper {
my ($targ, $href) = @_;
my ($cnt, $key) = (0,'');
$$targ =~ s/<tag>(.*?)<\/tag>(?{ $key = sprintf("&CITE%03d;", $cnt++); $href->{$key} = $^N; })/$key/g;
}
my ($target,%cite) = ("<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>", ());
wrapper( \$target, \%cite );
print $target, "\n";
print Dumper(\%cite);
($target,%cite) = ("<tag>zero</tag>\n<tag>one</tag>\n<tag>two</tag>\n<tag>three</tag>", ());
wrapper( \$target, \%cite );
print $target, "\n";
print Dumper(\%cite);
Output
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};
&CITE000;
&CITE001;
&CITE002;
&CITE003;
$VAR1 = {
'&CITE000;' => 'zero',
'&CITE001;' => 'one',
'&CITE002;' => 'two',
'&CITE003;' => 'three'
};

Prepare string with regex for HTML tags management

P/S: I am a PHP programmer.
Given:
div{3|5|6|9}[id = abc| class=image], a[id=link|class=out]
I want to use regex to generate a result as an array, e.g:
array(
[div] => array(
"3|5|6|9",
"id = abc| class=image"
)
[a] => array(
"",
"id=link|class=out")
)
Would anyone please help?
Thank you a lot!
Have a try with this:
$str='div{3|5|6|9}[id = abc| class=image], a[id=link|class=out]';
preg_match_all('/(\w+)(\{(.*?)\})?\[(.*?)\](?:, |$)?/', $str, $m);
$out = array($m[1][0] => array($m[3][0], $m[4][0]), $m[1][1] => array($m[3][1], $m[4][1]));
print_r($out);
Output:
Array
(
[div] => Array
(
[0] => 3|5|6|9
[1] => id = abc| class=image
)
[a] => Array
(
[0] =>
[1] => id=link|class=out
)
)
If you can guarantee that a comma will not exist between { and } and between [ and ], you can first split the string by , and then use this regular expression:
/([a-z]+)(\{(.*?)\})?\[(.*?)\]/
The captured groups that you want are $1, $3, and $4 (those back-reference numbers should match up if you use preg_match_all)
Note: I tested this out in Javascript.
preg_match_all('/(\w+)(\{(.*?)\})?\[(.*?)\](?:, |$)?/', $str, $m);
I believe the above works fine, except for a string like:
$str='div{3|5|6|9}[id = abc| class=image], a[id=link|class=out], br, ul';
where the regex does not capture br and ul.
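One way to also pick up bare names such as br and ul is to make the [...] part optional as well as the {...} part. The sketch below shows this in Python rather than PHP. It assumes element names consist of word characters and that braces and brackets are never nested. ITEM and parse_spec are hypothetical names used only here.

```python
import re

# name, then optional {...}, then optional [...]; both parts may be missing.
ITEM = re.compile(r"(\w+)(?:\{(.*?)\})?(?:\[(.*?)\])?")


def parse_spec(spec):
    result = {}
    for match in ITEM.finditer(spec):
        # Groups that did not take part in the match are None; map them to "".
        result[match.group(1)] = [match.group(2) or "", match.group(3) or ""]
    return result


spec = "div{3|5|6|9}[id = abc| class=image], a[id=link|class=out], br, ul"
parsed = parse_spec(spec)
print(parsed)
```

Because each match consumes its own braces and brackets, the words inside them (id, abc, class) are not mistaken for element names.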