How should I handle regex-features labeled with "warning"?

How should I handle regex-features labeled with "warning" like "(?{ code })", "(??{ code })" or "Special Backtracking Control Verbs"? How seriously should I take the warnings?

I kinda think they’re here to stay, one way or the other — especially code escapes. Code escapes have been with us for more than a decade.
The scariness of them — that they can call code in unforeseen ways — is taken care of by use re "eval". Also, the regex matcher hasn’t been reëntrant until 5.12 IIRC, which could limit their usefulness.
The string-eval version, (??{ code }), used to be the only way to do recursion, but since 5.10 we have a much better way to do that; benchmarking the speed differences shows the eval way is way slower in most cases.
I mostly use the block-eval version, (?{ code }), for adding debugging, which happens at a different granularity than use re "debug". It used to vaguely bother me that the return value of the block-eval version wasn’t usable, until I realized that it was. You just have to use it as the test part of a conditional pattern, like this pattern for testing whether a number is made up of digits that increase by one at each position to the right (each next digit must equal the current digit plus one):
qr{
^ (
( \p{Decimal_Number} )
(?(?= ( \d )) | $)
(?(?{ ord $3 == 1 + ord $2 }) (?1) | $)
) $
}x
Before I figured out conditionals, I would have written that this way:
qr{
^ (
( \p{Decimal_Number} )
(?= $ | (??{ chr(1+ord($2)) }) )
(?: (?1) | $ )
) $
}x
which is much less efficient.
The backtracking control verbs are newer. I use them mostly for getting all possible permutations of a match, and that requires only (*FAIL). I believe it is the (*ACCEPT) feature that is especially marked “highly experimental”. These have only been with us since 5.10.

Related

Is there a shorter way to match /a|b|ab/

I am making a compiler and need to match one or both of two different patterns, e.g. +, =, += or else, if, else if.
So far I can do:
/\b(else( if)?|(else )?if)\b/
The regex above works, but the patterns if and else are mentioned twice.
Is there a better way that doesn't require making a copy of each of the words?
The general way to factor everything you describe above is this:
(?:\+=?|=|else(?:[ ]if)?|if)
No other boundary conditions are asserted around this.
And nothing other than a single space is assumed between the else if.
(?:
\+ =?
| =
| else
(?: [ ] if )?
| if
)
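For anyone who wants to try the factored pattern quickly, here is a minimal Python sketch. The helper name first_token and the sample list are my own illustration, not part of the answer above; the pattern itself is copied unchanged.

```python
import re

# The factored pattern from the answer: each keyword/operator is spelled once.
TOKEN = re.compile(r"(?:\+=?|=|else(?:[ ]if)?|if)")

def first_token(text):
    """Return the token matched at the start of text, or None (hypothetical helper)."""
    m = TOKEN.match(text)
    return m.group(0) if m else None

samples = ["+", "=", "+=", "else", "if", "else if"]
results = {s: first_token(s) for s in samples}
print(results)
```

As the answer notes, no boundary conditions are asserted, so a real lexer would still need to decide what may follow a keyword.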

Atomic groups clarity

Consider this regex.
a*b
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 67 steps in debugger to fail.
Now consider this regex.
(?>a*)b
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 133 steps in debugger to fail.
And lastly this regex:
a*+b (a variant of atomic group)
This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
This takes 67 steps in debugger to fail.
When I run a benchmark, the atomic group (?>a*)b performs 179% faster.
Atomic groups disable backtracking, so match performance is good.
But why are the number of steps more? Can somebody explain on this?
Why is there a difference in steps between the two atomic forms, (?>a*)b and a*+b?
Do they work differently?
Author note:
    This answer targets question 1 as delivered by the bounty text "I am looking forward to the exact reason why more steps are being needed by the debugger.I dont need answers explaining how atomic groups work."; Jerry's answer addresses the other concerns very well, while my other answer takes a ride through the mentioned constructs, how they work, and why they are important. For full knowledge, simply reading this post is not enough!
Every group in a regular expression takes a step to step into and out of the group.
    WHAT?!
Yeah, I'm serious, read on...
Firstly, I would like to present you with a non-capturing group versus the same pattern without the group:
Pattern 1: (?:c)at
Pattern 2: cat
So what exactly happens here? We'll match the patterns with the test string "concat" on a regex engine with optimizations disabled:
While we're at it, I present you some more groups:
    Oh no! I'm going to avoid using groups!
But wait! Please note that the number of steps taken to match has no correlation with the performance of the match. PCRE engines optimize away most of the "unnecessary steps", as I've mentioned. Atomic groups are still the most efficient, despite taking more steps on an engine with optimizations disabled.
Perhaps relevant:
Why is a character class faster than alternation?
So what's backtracking?
The engine comes to quantifiers that are greedy by default. Greedy quantifiers match as much as possible and backtrack on demand, allowing efficient matches,
as referenced by Greedy vs. Reluctant vs. Possessive Quantifiers:
A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less thing (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it "backtracks" one more step, making the greedy quantifier match one less thing again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success! [...]
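To see the difference the quote describes, here is a small Python sketch. The subject string is an assumed sample of my own (the quote above does not show its input); only the greedy .*foo idea comes from the quoted text, and the reluctant variant is added for contrast.

```python
import re

text = "xfooxxxxxxfoo"  # assumed sample input, not taken from the quote

# Greedy: .* first grabs the whole string, then backs off until "foo" fits.
greedy = re.search(r".*foo", text)

# Reluctant: .*? starts empty and grows one character at a time.
reluctant = re.search(r".*?foo", text)

print(greedy.span())     # span of the greedy match
print(reluctant.span())  # span of the reluctant match
```

The greedy match ends at the last foo because backtracking gives characters back from the right; the reluctant one stops at the first foo it can reach.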
What does this have to do with a*+b?
In /a*+b/:
a   The literal character "a".
*+  Zero or more, possessive.
b   The literal character "b".
As referenced by Greedy vs. Reluctant vs. Possessive Quantifiers:
A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.
Why does it matter?
The machine won't realize on its own whether it's doing an (in)efficient match. See here for a decent example: Program run forever when matching regex. In many scenarios, regexes written quickly may not be efficient and can easily become problematic in deployment.
So what's an atomic group?
After the pattern within the atomic group finishes matching, it will not let go, ever. Study this example:
Pattern: (?>\d\w{2}|12)c
Matching string: 12c
Looks perfectly legitimate, but this match fails. The steps are simple: The first alternation of the atomic group matches perfectly - \d\w{2} consumes 12c. The group then completes its match - now here is our pointer location:
Pattern: (?>\d\w{2}|12)c
                       ^
Matching string: 12c
                    ^
The pattern advances. Now we try to match c, but there is no c. Instead of trying to backtrack (releasing \d\w{2} and consuming 12), the match fails.
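The failure above can be reproduced even on engines without (?>...). The sketch below uses Python and, as a swapped-in technique, emulates the atomic group with a lookahead plus a backreference, (?=(...))\1, since older versions of Python's re module lack native atomic groups (they were added in 3.11, as far as I know). Treat it as an illustration of the behaviour, not as how any engine implements atomic groups.

```python
import re

subject = "12c"

# Ordinary group: after \d\w{2} eats "12c" and the final c fails,
# the engine backtracks into the group and tries the "12" branch.
plain = re.search(r"(?:\d\w{2}|12)c", subject)

# Emulated atomic group: the lookahead captures what the group matches,
# then \1 consumes exactly that text. The lookahead is never re-entered,
# so the "12" branch gets no second chance.
atomic_like = re.search(r"(?=(\d\w{2}|12))\1c", subject)

print(plain.group(0) if plain else None)
print(atomic_like)
```

The plain group finds 12c by backtracking; the atomic-style version reports no match, which is the behaviour described in the steps above.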
Well that's a bad idea then! Why would we prevent backtracking, Unihedron?
Now imagine we're working with a JSON object. This file is not small. Backtracking from the end is going to be a bad idea.
"2597401":[{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2- 5":{"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338497979",
"runTime":"1022",
"execType":"user:binary",
"exec":"ft.D.64",
"numNodes":"4",
"sha1":"5a79879235aa31b6a46e73b43879428e2a175db5",
"execEpoch":1336766742,
"execModify":"Fri May 11 15:05:42 2012",
"startTime":"Thu May 31 15:59:39 2012",
"numCores":"64",
"sizeT":{"bss":"1881400168","text":"239574","data":"22504"}},
{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338497946",
"runTime":"33" "execType":"user:binary",
"exec":"cg.C.64",
"numNodes":"4",
"sha1":"caf415e011e28b7e4e5b050fb61cbf71a62a9789",
"execEpoch":1336766735,
"execModify":"Fri May 11 15:05:35 2012",
"startTime":"Thu May 31 15:59:06 2012",
"numCores":"64",
"sizeT":{"bss":"29630984","text":"225749","data":"20360"}},
{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2-5": {"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338500447",
"runTime":"145",
"execType":"user:binary",
"exec":"mg.D.64",
"numNodes":"4",
"sha1":"173de32e1514ad097b1c051ec49c4eb240f2001f",
"execEpoch":1336766756,
"execModify":"Fri May 11 15:05:56 2012",
"startTime":"Thu May 31 16:40:47 2012",
"numCores":"64",
"sizeT":{"bss":"456954120","text":"426186","data":"22184"}},{"jobID":"2597401",
"account":"TG-CCR120014",
"user":"charngda",
"pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
"flavor":["default"]}},
"startEpoch":"1338499002",
"runTime":"1444",
"execType":"user:binary",
"exec":"lu.D.64",
"numNodes":"4",
"sha1":"c6dc16d25c2f23d2a3321d4feed16ab7e10c2cc1",
"execEpoch":1336766748,
"execModify":"Fri May 11 15:05:48 2012",
"startTime":"Thu May 31 16:16:42 2012",
"numCores":"64",
"sizeT":{"bss":"199850984","text":"474218","data":"27064"}}],
Uh oh...
Do you get what I mean now? :P
I'll leave you to figure out the rest, and try to find out more about possessive quantifiers and atomic groups; I'm not writing anything else into this post. Here is where the JSON came from, I saw the answer a few days ago, very inspiring: REGEX reformatting.
Read also
The Stack Overflow Regex Reference
ReDoS - Wikipedia
I think something is wrong...
I don't know how you did your benchmarking but both a*+b and (?>a*)b should be the same. To quote regular-expressions.info (emphasis mine):
Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.
And just to confirm the above, I ran the following on ideone:
$tests = 1000000;
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/a*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /a*b/ : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/(?>a*)b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /(?>a*)b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/a*+b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /a*+b/ : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
preg_match('/(?>a)*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );
printf( "For /(?>a)*b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );
Getting this as the output:
For /a*b/ : 0.000000879034996 per iteration for 1000000 iterations
For /(?>a*)b/: 0.000000876362085 per iteration for 1000000 iterations
For /a*+b/ : 0.000000880002022 per iteration for 1000000 iterations
For /(?>a)*b/: 0.000000883045912 per iteration for 1000000 iterations
Now, I am by no means a PHP expert, so I don't know if this is the right way to benchmark things there, but they all have about the same performance, which is kind of expected given the simplicity of the task.
Still, a couple of things I note from the above:
Neither (?>a)*b nor (?>a*)b is 179% faster than any other regex; all of the above are within 7% of each other.
Back to the actual question
But why are the number of steps more? Can somebody explain on this?
It is to be noted that the number of steps is not a direct representation of the performance of a regex. It is a factor, but not the ultimate determinant. There are more steps because the breakdown has steps before entering the group, and after entering the group...
1 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaa...
    ^
2 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaa...
    ^^^^^^^^
3 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
        ^^
4 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
           ^
5 / (?> a* ) b/x aaaaaaaaaaaaaaaaaaaaa...
             ^
That's 2 more steps because of the group than the 3 steps...
1 / a*+ b /x aaaaaaaaaaaaaaaaaaaa...
    ^
2 / a*+ b /x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
    ^^^
3 / a*+ b /x aaaaaaaaaaaaaaaaaaaaa...
        ^
Where you can say that a*b is the same as (?:a*)b but the latter having more steps:
1 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaa...
    ^
2 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaa...
    ^^^^^^^^
3 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
        ^^
4 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
           ^
5 / (?: a* ) b/x aaaaaaaaaaaaaaaaaaaaa...
             ^
Note: Even there, you see that regex101 has the steps optimised a bit for the number of steps in a*b.
Conclusion
Possessive quantifiers and atomic groups work differently depending on how you use them. If I take regular-expressions.info's example and tweak it a little:
(?>a|b)*ac matches aabaac
But
(?:a|b)*+ac and (?>(?:a|b)*)ac don't match aabaac.

Using regular expressions to find a word with the five letters abcde, each letter appearing exactly once, in any order, with no breaks in between

For example, the word debacle would work because of debac, but seabed would not work because: 1. there is no c in any 5-character sequence that can be formed, and 2. the letter e appears twice. As another example, feedback would work because of edbac. And remember, the solution must be done using only regular expressions.
A strategy I attempted to implement was: match the first letter if it's inside [a-e], and remember it. Then find the next letter in [a-e] but not the first letter. And so on. I wasn't sure what the syntax was (or even if some syntax existed) so my code didn't work:
open(DICT, "dictionary.txt");
@words = <DICT>;
foreach my $word (@words) {
    if ($word =~ /([a-e])([a-e^\1])([a-e^\1^\2])([a-e^\1^\2^\3])([a-e^\1^\2^\3^\4])/) {
        print $word;
    }
}
I was also thinking of using (?=regex) and \G but I wasn't sure how it would work out.
/
(?= .{0,4}a )
(?= .{0,4}b )
(?= .{0,4}c )
(?= .{0,4}d )
(?= .{0,4}e )
/xs
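The five lookaheads above all start at the same position, so each letter has to appear somewhere in the next five characters, which forces that window to be a permutation of abcde. Here is a hedged Python sketch of the same pattern; the function name has_abcde_run is made up for illustration, and the three test words are the ones from the question.

```python
import re

# Same idea as the /xs pattern above, written on one line.
WINDOW = re.compile(
    r"(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c)(?=.{0,4}d)(?=.{0,4}e)"
)

def has_abcde_run(word):
    """True if some 5-character window of word is a permutation of abcde."""
    return WINDOW.search(word) is not None

for word in ("debacle", "seabed", "feedback"):
    print(word, has_abcde_run(word))
```

Note that the match itself is empty (only lookaheads), so this tells you where a window starts, not the window text.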
It probably results in faster matching to generate a pattern from all permutations.
use Algorithm::Loops qw( NextPermute );
my @pats;
my @chars = 'a'..'e';
do { push @pats, quotemeta join '', @chars; } while NextPermute(@chars);
my $re = join '|', @pats;
abcde|abced|abdce|abdec|abecd|abedc|acbde|acbed|acdbe|acdeb|acebd|acedb|adbce|adbec|adcbe|adceb|adebc|adecb|aebcd|aebdc|aecbd|aecdb|aedbc|aedcb|bacde|baced|badce|badec|baecd|baedc|bcade|bcaed|bcdae|bcdea|bcead|bceda|bdace|bdaec|bdcae|bdcea|bdeac|bdeca|beacd|beadc|becad|becda|bedac|bedca|cabde|cabed|cadbe|cadeb|caebd|caedb|cbade|cbaed|cbdae|cbdea|cbead|cbeda|cdabe|cdaeb|cdbae|cdbea|cdeab|cdeba|ceabd|ceadb|cebad|cebda|cedab|cedba|dabce|dabec|dacbe|daceb|daebc|daecb|dbace|dbaec|dbcae|dbcea|dbeac|dbeca|dcabe|dcaeb|dcbae|dcbea|dceab|dceba|deabc|deacb|debac|debca|decab|decba|eabcd|eabdc|eacbd|eacdb|eadbc|eadcb|ebacd|ebadc|ebcad|ebcda|ebdac|ebdca|ecabd|ecadb|ecbad|ecbda|ecdab|ecdba|edabc|edacb|edbac|edbca|edcab|edcba
(This will get optimised into a trie in Perl 5.10+. Before 5.10, use Regexp::List.)
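The same alternation can be generated in other languages too. A minimal Python sketch using itertools.permutations follows; whether Python's engine optimises such an alternation the way Perl's trie does is something I have not checked, so take it only as a way to build the pattern.

```python
import re
from itertools import permutations

# Build all 120 orderings of the five letters and join them with "|".
alternatives = ["".join(p) for p in permutations("abcde")]
PERMS = re.compile("|".join(alternatives))

for word in ("debacle", "seabed", "feedback"):
    m = PERMS.search(word)
    print(word, m.group(0) if m else None)
```

Unlike the lookahead version, this one actually consumes the five letters, so the matched window is available directly.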
Your solution is clever but unfortunately [a-e^...] doesn't work, as you found. I don't believe there is a way to mix regular and negated character classes. I can think of a workaround using lookaheads though:
/(([a-e])(?!\2)([a-e])(?!\2)(?!\3)([a-e])(?!\2)(?!\3)(?!\4)([a-e])(?!\2)(?!\3)(?!\4)(?!\5)([a-e]))/
See it here: http://rubular.com/r/6pFrJe78b6.
UPDATE: Mob points out in the comments below, that alternation can be used to compact the above:
/(([a-e])(?!\2)([a-e])(?!\2|\3)([a-e])(?!\2|\3|\4)([a-e])(?!\2|\3|\4|\5)([a-e]))/
The new demo: http://rubular.com/r/UUS7mrz6Ze.
#! perl -lw
for (qw(debacle seabed feedback)) {
print if /([a-e])(?!\1)
([a-e])(?!\1)(?!\2)
([a-e])(?!\1)(?!\2)(?!\3)
([a-e])(?!\1)(?!\2)(?!\3)(?!\4)
([a-e])/x;
}

Is it possible to check if two groups are equal?

If I have some HTML like this:
<b>1<i>2</i>3</b>
And the following regex:
\<[^\>\/]+\>(.*?)\<\/[^\>]+\>
Then it will match:
<b>1<i>2</i>
I want it to only match HTML where the start and end tags are the same. Is there a way to do this?
Thanks,
Joe
Is there a way to do this?
Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.
Numbered Captures
Pretending for the nonce that HTML <i> and <b> tags are always denuded of attributes, and moreover, neither overlap nor nest, we have this simple solution:
#!/usr/bin/env perl
#
# solution A: numbered captures
#
use v5.10;
while (<>) {
say "$1: $2" while m{
< ( [ib] ) >
(
(?:
(?! < /? \1 > ) .
) *
)
</ \1 >
}gsix;
}
Which when run, produces this:
$ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
i: foo
b: bar
Named Captures
It would be better to use named captures, which leads to this equivalent solution:
#!/usr/bin/env perl
#
# Solution B: named captures
#
use v5.10;
while (<>) {
say "$+{name}: $+{contents}" while m{
< (?<name> [ib] ) >
(?<contents>
(?:
(?! < /? \k<name> > ) .
) *
)
</ \k<name> >
}gsix;
}
Recursive Captures
Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it therefore requires a recursive pattern to solve. Remembering that the trivial pattern to parse nested parens recursively is simply:
( \( (?: [^()]++ | (?-1) )*+ \) )
I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit of iterative processing to unwrap the inner bits, too.
#!/usr/bin/perl
use v5.10;
# Solution C: recursive captures, plus bonus iteration
while (my $line = <>) {
    my @input = ( $line );
    while (@input) {
        my $cur = shift @input;
        while ($cur =~ m{
                < (?<name> [ib] ) >
                (?<contents>
                    (?:
                        [^<]++
                      | (?0)
                      | (?! </ \k<name> > )
                        .
                    ) *+
                )
                </ \k<name> >
            }gsix)
        {
            say "$+{name}: $+{contents}";
            push @input, $+{contents};
        }
    }
}
Which when demo’d produces this:
$ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
i: foo <i>nested</i> and <b>bar</b> bits
i: nested
b: bar
That’s still fairly simple, so if it works on your data, go for it.
Grammatical Patterns
However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.
As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.
For example, this knows the attributes germane to the <i> (or <b>) tag. Here we define regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs, but now for regexes:
(?(DEFINE) # begin regex subroutine defs for grammatical regex
(?<i_tag_end> < / i > )
(?<i_tag_start> < i (?&attributes) > )
(?<attributes> (?: \s* (?&one_attribute) ) *)
(?<one_attribute>
\b
(?&legal_attribute)
\s* = \s*
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?&standard_attribute)
| (?&event_attribute)
)
(?<standard_attribute>
class
| dir
| ltr
| id
| lang
| style
| title
| xml:lang
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<event_attribute>
on click
| on dbl click
| on mouse down
| on mouse move
| on mouse out
| on mouse over
| on mouse up
| on key down
| on key press
| on key up
)
(?<nv_pair> (?&name) (?&equals) (?&value) )
(?<name> \b (?= \pL ) [\w\-] + (?<= \pL ) \b )
(?<equals> (?&might_white) = (?&might_white) )
(?<value> (?&quoted_value) | (?&unquoted_value) )
(?<unwhite_chunk> (?: (?! > ) \S ) + )
(?<unquoted_value> [\w\-] * )
(?<might_white> \s * )
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<start_tag> < (?&might_white) )
(?<end_tag>
(?&might_white)
(?: (?&html_end_tag)
| (?&xhtml_end_tag)
)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.
However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.
SUMMARY
I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:
You are using an impoverished regex language, not a modern one, and so you have no recourse to essential modern conveniences like recursive matching or grammatical patterns.
You might find such concepts as recursive and grammatical patterns too complicated for you to easily understand.
You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.
Any one or more of those might well apply. In which case, don’t do it this way.
For simple canned examples, this route is easy. The more robust you want this to work on things you’ve never seen before, the harder this route becomes.
Certainly you can’t do any of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or even worse, Javascript. Those are barely any better than the Unix grep program, and in some ways, are even worse. No, you need a modern pattern matching engine such as found in Perl or PHP to even start down this road.
But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.
Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.
This forum isn’t really in the right format for explaining all these things about modern pattern-matching. There are books, though, that do so equitably well.
You probably don't want to use regular expressions with HTML.
But if you still want to do this you need to take a look at backreferences.
Basically it's a way to capture a group (such as "b" or "i") to use it later in the same regular expression.
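A minimal sketch of the backreference idea in Python, assuming attribute-free <i> and <b> tags that do not nest inside a tag of the same name (the pattern is my own illustration, not taken from the question):

```python
import re

# \1 must repeat whatever group 1 captured, so the closing tag
# has to carry the same name as the opening tag.
TAG = re.compile(r"<([ib])>(.*?)</\1>", re.DOTALL)

m = TAG.search("<b>1<i>2</i>3</b>")
print(m.group(0))  # the whole matched element
print(m.group(1))  # tag name
print(m.group(2))  # contents between the tags
```

Here the lazy .*? skips past </i> because it does not repeat the captured b, and stops only at </b>.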
Related issues:
RegEx match open tags except XHTML self-contained tags

Is there something like a counter variable in regular expression replace?

Suppose I have a lot of matches, for example in multi-line mode, and I want to replace each of them with part of the match plus a counter number that increments.
I was wondering if any regex flavor has such a variable. I couldn't find one, but I seem to remember something like that exists...
I'm not talking about scripting languages in which you can use callbacks for replacement. It's about being able to do this in tools like RegexBuddy, sublime text, gskinner.com/RegExr, ... much in the same way you can refer to captured substrings with \1 or $1.
FMTEYEWTK about Fancy Regexes
Ok, I’m going to go from the simple to the sublime. Enjoy!
Simple s///e Solution
Given this:
#!/usr/bin/perl
$_ = <<"End_of_G&S";
This particularly rapid,
unintelligible patter
isn't generally heard,
and if it is it doesn't matter!
End_of_G&S
my $count = 0;
Then this:
s{
\b ( [\w']+ ) \b
}{
sprintf "(%s)[%d]", $1, ++$count;
}gsex;
produces this
(This)[1] (particularly)[2] (rapid)[3],
(unintelligible)[4] (patter)[5]
(isn't)[6] (generally)[7] (heard)[8],
(and)[9] (if)[10] (it)[11] (is)[12] (it)[13] (doesn't)[14] (matter)[15]!
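For readers outside Perl, the same replacement-callback idea can be sketched in Python. This is my own translation of the s///e approach above, not something the regex flavor itself provides: the counter lives in the host language (here itertools.count), and number_words is a hypothetical helper name.

```python
import re
from itertools import count

def number_words(text):
    """Wrap each word in parentheses and append a running index."""
    counter = count(1)  # fresh counter for every call
    return re.sub(
        r"\b([\w']+)\b",
        lambda m: "({})[{}]".format(m.group(1), next(counter)),
        text,
    )

print(number_words("This particularly rapid,"))
```

The lambda plays the role of the /e replacement block: it runs once per match and pulls the next number from the counter.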
Interpolated Code in Anon Array Solution
Whereas this:
s/\b([\w']+)\b/#@{[++$count]}=$1/g;
produces this:
#1=This #2=particularly #3=rapid,
#4=unintelligible #5=patter
#6=isn't #7=generally #8=heard,
#9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!
Solution with code in LHS instead of RHS
This puts the incrementation within the match itself:
s/ \b ( [\w']+ ) \b (?{ $count++ }) /#$count=$1/gx;
yields this:
#1=This #2=particularly #3=rapid,
#4=unintelligible #5=patter
#6=isn't #7=generally #8=heard,
#9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!
A Stuttering Stuttering Solution Solution Solution
This
s{ \b ( [\w'] + ) \b }
{ join " " => ($1) x ++$count }gsex;
generates this delightful answer:
This particularly particularly rapid rapid rapid,
unintelligible unintelligible unintelligible unintelligible patter patter patter patter patter
isn't isn't isn't isn't isn't isn't generally generally generally generally generally generally generally heard heard heard heard heard heard heard heard,
and and and and and and and and and if if if if if if if if if if it it it it it it it it it it it is is is is is is is is is is is is it it it it it it it it it it it it it doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't matter matter matter matter matter matter matter matter matter matter matter matter matter matter matter!
Exploring Boundaries
There are more robust approaches to word boundaries that work for plural possessives (the previous approaches don’t), but I suspect your mystery lies in getting the ++$count to fire, not with the subtleties of \b behavior.
I really wish people understood that \b isn’t what they think it is.
They always think it means there's white space or the edge of the string
there. They never think of it as \w\W or \W\w transitions.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
As you see, it's conditional depending on what it's touching. That’s what the (?(COND)THEN|ELSE) clause is for.
This becomes an issue with things like:
$_ = qq('Tis Paul's parents' summer-house, isn't it?\n);
my $count = 0;
s{
(?(?=[\-\w']) (?<![\-\w']) | (?<![^\-\w']) )
( [\-\w'] + )
(?(?<=[\-\w']) (?![\-\w']) | (?![^\-\w']) )
}{
sprintf "(%s)[%d]", $1, ++$count
}gsex;
print;
which correctly prints
('Tis)[1] (Paul's)[2] (parents')[3] (summer-house)[4], (isn't)[5] (it)[6]?
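The contrast is easy to reproduce elsewhere. Below is a Python sketch on the same sentence. As far as I know Python's re accepts only group references as conditions, not lookarounds, so instead of the conditional form above it uses a simplified pair of negative lookarounds. That simplification is valid here only because the token itself must start and end with a character from the class.

```python
import re

text = "'Tis Paul's parents' summer-house, isn't it?"

# Plain \b: a boundary is a \w/\W transition, so leading and trailing
# apostrophes and the hyphen all split the "words".
with_b = re.findall(r"\b[\w']+\b", text)

# Lookaround edges: a token may not be preceded or followed by a
# character from its own class [-\w'].
EDGE = re.compile(r"(?<![\-\w'])[\-\w']+(?![\-\w'])")
with_edges = EDGE.findall(text)

print(with_b)
print(with_edges)
```

The second list keeps 'Tis, parents' and summer-house intact, matching the Perl output above.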
Worrying about Unicode
1960s-style ASCII is about 50 years out of date. Just as whenever you see anyone write [a-z], it’s nearly always wrong, it turns out that things like dashes and quotation marks shouldn’t show up as literals in patterns, either. While we’re at it, you probably don’t want to use \w, because that includes numbers and underscores as well, not just alphabetics.
Imagine this string:
$_ = qq(\x{2019}Tis Ren\x{E9}e\x{2019}s great\x{2010}grandparents\x{2019} summer\x{2010}house, isn\x{2019}t it?\n);
which you could have as a literal with use utf8:
use utf8;
$_ = qq(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?\n);
This time I’ll go at the pattern a bit differently, separating out my definition of terms from their execution to try to make it more readable and thence maintainable:
#!/usr/bin/perl -l
use 5.10.0;
use utf8;
use open qw< :std :utf8 >;
use strict;
use warnings qw< FATAL all >;
use autodie;
$_ = q(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?);
my $count = 0;
s{ (?<WORD> (?&full_word) )
# the rest is just definition
(?(DEFINE)
(?<word_char> [\p{Alphabetic}\p{Quotation_Mark}] )
(?<full_word>
# next line won't compile cause
# fears variable-width lookbehind
#### (?<! (?&word_char) ) )
# so must inline it
(?<! [\p{Alphabetic}\p{Quotation_Mark}] )
(?&word_char)
(?:
\p{Dash}
| (?&word_char)
) *
(?! (?&word_char) )
)
) # end DEFINE declaration block
}{
sprintf "(%s)[%d]", $+{WORD}, ++$count;
}gsex;
print;
That code when run produces this:
(’Tis)[1] (Renée’s)[2] (great‐grandparents’)[3] (summer‐house)[4], (isn’t)[5] (it)[6]?
Ok, so that may have been FMTEYEWTK about fancy regexes, but aren’t you glad you asked? ☺
As far as I know, plain regular expressions have nothing like that.
On the other hand, there are several tools which offer it as an extension, for example grepWin. In the tool's help (press F1):
Internally it uses Boost's Perl Regular Expression engine but the ${count} is implemented within (as with other extensions).