Is there something like a counter variable in regular expression replace? - regex

Suppose I have a lot of matches, for example in multi-line mode, and I want to replace each of them with part of the match plus a counter number that increments.
I was wondering if any regex flavor has such a variable. I couldn't find one, but I seem to remember something like that exists...
I'm not talking about scripting languages in which you can use callbacks for replacement. It's about being able to do this in tools like RegexBuddy, sublime text, gskinner.com/RegExr, ... much in the same way you can refer to captured substrings with \1 or $1.

FMTEYEWTK about Fancy Regexes
Ok, I’m going to go from the simple to the sublime. Enjoy!
Simple s///e Solution
Given this:
#!/usr/bin/perl
$_ = <<"End_of_G&S";
This particularly rapid,
unintelligible patter
isn't generally heard,
and if it is it doesn't matter!
End_of_G&S
my $count = 0;
Then this:
s{
\b ( [\w']+ ) \b
}{
sprintf "(%s)[%d]", $1, ++$count;
}gsex;
produces this
(This)[1] (particularly)[2] (rapid)[3],
(unintelligible)[4] (patter)[5]
(isn't)[6] (generally)[7] (heard)[8],
(and)[9] (if)[10] (it)[11] (is)[12] (it)[13] (doesn't)[14] (matter)[15]!
Interpolated Code in Anon Array Solution
Whereas this:
s/\b([\w']+)\b/#@{[++$count]}=$1/g;
produces this:
#1=This #2=particularly #3=rapid,
#4=unintelligible #5=patter
#6=isn't #7=generally #8=heard,
#9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!
Solution with code in LHS instead of RHS
This puts the incrementation within the match itself:
s/ \b ( [\w']+ ) \b (?{ $count++ }) /#$count=$1/gx;
yields this:
#1=This #2=particularly #3=rapid,
#4=unintelligible #5=patter
#6=isn't #7=generally #8=heard,
#9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!
A Stuttering Stuttering Solution Solution Solution
This
s{ \b ( [\w'] + ) \b }
{ join " " => ($1) x ++$count }gsex;
generates this delightful answer:
This particularly particularly rapid rapid rapid,
unintelligible unintelligible unintelligible unintelligible patter patter patter patter patter
isn't isn't isn't isn't isn't isn't generally generally generally generally generally generally generally heard heard heard heard heard heard heard heard,
and and and and and and and and and if if if if if if if if if if it it it it it it it it it it it is is is is is is is is is is is is it it it it it it it it it it it it it doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't matter matter matter matter matter matter matter matter matter matter matter matter matter matter matter!
Exploring Boundaries
There are more robust approaches to word boundaries that work for plural possessives (the previous approaches don’t), but I suspect your mystery lies in getting the ++$count to fire, not with the subtleties of \b behavior.
I really wish people understood that \b isn’t what they think it is.
They always think it means there's white space or the edge of the string
there. They never think of it as \w\W or \W\w transitions.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
As you see, it's conditional depending on what it's touching. That’s what the (?(COND)THEN|ELSE) clause is for.
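To make the \w/\W transition point concrete, here is a minimal sketch (the sample string is only an illustration) showing what the simple \b-delimited pattern from the earlier solutions captures when apostrophes and hyphens sit at word edges:

```perl
use strict;
use warnings;

# \b only marks a \w/\W transition, so an apostrophe at the edge of a
# word falls outside the match even though the character class allows it.
my $text  = "'Tis Paul's parents' summer-house";
my @words = $text =~ /\b([\w']+)\b/g;

print join('|', @words), "\n";   # Tis|Paul's|parents|summer|house
```

The leading apostrophe of 'Tis and the trailing one of parents' are dropped, and summer-house is split in two, because none of those positions is a \w/\W transition that keeps the punctuation inside the match.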
This becomes an issue with things like:
$_ = qq('Tis Paul's parents' summer-house, isn't it?\n);
my $count = 0;
s{
(?(?=[\-\w']) (?<![\-\w']) | (?<![^\-\w']) )
( [\-\w'] + )
(?(?<=[\-\w']) (?![\-\w']) | (?![^\-\w']) )
}{
sprintf "(%s)[%d]", $1, ++$count
}gsex;
print;
which correctly prints
('Tis)[1] (Paul's)[2] (parents')[3] (summer-house)[4], (isn't)[5] (it)[6]?
Worrying about Unicode
1960s-style ASCII is about 50 years out of date. Just as whenever you see anyone write [a-z], it’s nearly always wrong, it turns out that things like dashes and quotation marks shouldn’t show up as literals in patterns, either. While we’re at it, you probably don’t want to use \w, because that includes numbers and underscores as well, not just alphabetics.
Imagine this string:
$_ = qq(\x{2019}Tis Ren\x{E9}e\x{2019}s great\x{2010}grandparents\x{2019} summer\x{2010}house, isn\x{2019}t it?\n);
which you could have as a literal with use utf8:
use utf8;
$_ = qq(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?\n);
This time I’ll go at the pattern a bit differently, separating out my definition of terms from their execution to try to make it more readable and thence maintainable:
#!/usr/bin/perl -l
use 5.10.0;
use utf8;
use open qw< :std :utf8 >;
use strict;
use warnings qw< FATAL all >;
use autodie;
$_ = q(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?);
my $count = 0;
s{ (?<WORD> (?&full_word) )
# the rest is just definition
(?(DEFINE)
(?<word_char> [\p{Alphabetic}\p{Quotation_Mark}] )
(?<full_word>
# next line won't compile cause
# fears variable-width lookbehind
#### (?<! (?&word_char) ) )
# so must inline it
(?<! [\p{Alphabetic}\p{Quotation_Mark}] )
(?&word_char)
(?:
\p{Dash}
| (?&word_char)
) *
(?! (?&word_char) )
)
) # end DEFINE declaration block
}{
sprintf "(%s)[%d]", $+{WORD}, ++$count;
}gsex;
print;
That code when run produces this:
(’Tis)[1] (Renée’s)[2] (great‐grandparents’)[3] (summer‐house)[4], (isn’t)[5] (it)[6]?
Ok, so that may have been FMTEYEWTK about fancy regexes, but aren’t you glad you asked? ☺

In plain regular expressions there isn't as far as I know.
On the other hand, there are several tools which offer it as an extension, for example grepWin. In the tool's help (press F1):
Internally it uses Boost's Perl Regular Expression engine but the ${count} is implemented within (as with other extensions).

Related

Get the second string of the URI with Perl regex

I need to get the second part of the URI, the possible URI are:
/api/application/v1/method
/web/application/v1/method
I can get "application" using:
([^\/api]\w*)
and
([^\/web]\w*)
But I know this is not the best approach. What would be a good way?
Thanks!
Edit: thank you all for the input. The goal was to set the second part of the URI into a header in Apache with rewrite rules.
A general regex (Perl or PCRE syntax) solution would be:
^/[^/]+/([^/]+)
Each section is delimited with /, so just capture as many non-/ characters as there are.
This is preferable to non-greedy regexes because it does not need to backtrack, and allows for whatever else the sections may contain, which can easily contain non-word characters such as - that won't be matched by \w.
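As a rough sketch of how that pattern might be used from Perl (the sample paths, including the hyphenated one, are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample paths; the second has a non-word character (-)
# that \w would not match but [^/] does.
my @paths = qw(
    /api/application/v1/method
    /web/my-app/v1/method
    /web/application
);

my @second;
for my $path (@paths) {
    # Skip the first section, capture the second.
    my ($segment) = $path =~ m{^/[^/]+/([^/]+)};
    push @second, $segment;
    print "$path -> $segment\n";
}
```

Because each [^/]+ simply runs up to the next slash or the end of the string, the same pattern also handles a path that ends right after the second section.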
There are many ways to do this and I'm not sure which would be best, but it could be as simple as:
\/(.+?)\/(.+?)\/.*
where the desired output is in the second capturing group, $2.
Demo 1
Example
#!/usr/bin/perl -w
use strict;
use warnings;
use feature qw( say );
main();
sub main{
my $string = '/api/application/v1/method
/web/application/v1/method';
my $pattern = '\/(.+?)\/(.+?)\/.*';
my $match = replace($pattern, '$2', $string);
say $match , " is a match 💚💚💚 ";
}
sub replace {
my ($pattern, $replacement, $string) = @_;
$string =~s/$pattern/$replacement/gee;
return $string;
}
Output
application
application is a match 💚💚💚
Advice
zdim advises that:
A legitimate approach, notes:
(1) there is no need for the trailing .*
(2) Need /|$ (not just /), in case the path finishes without / (to
terminate the non-greedy pattern at the end of string, if there is no
/)
(3) note though that /ee can be vulnerable (even just to errors),
since the second evaluation (e) will run code if the first evaluation
results in code. And it may be difficult to ensure that that is always
done under full control. More to the point, for this purpose there is
no reason to run a substitution --- just match and capture is enough.
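A minimal sketch of what those notes suggest, assuming a plain match-and-capture is all that is needed (the helper name second_segment is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: match and capture only, no substitution and no /ee.
# The (?:/|$) lets the non-greedy capture stop at the next slash or,
# when the path has no further slash, at the end of the string.
sub second_segment {
    my ($path) = @_;
    my ($segment) = $path =~ m{^/.+?/(.+?)(?:/|$)};
    return $segment;
}

print second_segment('/api/application/v1/method'), "\n";   # application
print second_segment('/web/application'), "\n";             # application
```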
With all the regex answers given, as explicitly asked for, I'd like to bring up other approaches.
These also parse only a (URI style) path, like the regex ones, and return the second directory.
The most basic and efficient one, just split the string on /
my $dir = ( split /\//, $path )[2];
The split returns '' first (before the first /) thus we need the third element. (Note that we can use an alternate delimiter for the separator pattern, it being regex: split m{/}, $path.)
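To see the leading empty field for yourself, a small sketch along these lines prints every element that split returns for one of the example paths:

```perl
use strict;
use warnings;

my $path  = '/api/application/v1/method';
my @parts = split m{/}, $path;

# The leading empty string comes from the text before the first slash.
print join(',', map { "'$_'" } @parts), "\n";   # '','api','application','v1','method'
print "second directory: $parts[2]\n";          # second directory: application
```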
Use appropriate modules, for example URI
use URI;
my $dir = ( URI->new($path)->path_segments )[2];
or Mojo::Path
use Mojo::Path;
my $dir = Mojo::Path->new($path)->parts->[1];
What to use depends on details of what you do -- if you've got any other work with URLs and web then you clearly want modules for that; otherwise they may (or may not) be an overkill.
I've benchmarked these for a sanity check of what one is paying with modules.
The split either beats regex by up to 10-15% (the regex using negated character class and the one based on non-greedy .+? come around the same), or is about the same with them. They are faster than Mojo by about 30%, and only URI lags seriously, by a factor of 5 behind Mojo.
That's for paths typical for real-life URLs, with a handful of short components. With only two very long strings (10k chars), Mojo::Path (surprisingly for me) is a factor of six ahead of split (!), which is ahead of character-class regex by more than an order of magnitude.
The negated-character-class regex for such long strings beats the non-greedy (.+?) one by a factor of 3, good to know in its own right.
In all this the URI and Mojo objects were created once, ahead of time.
Benchmark code. I'd like to note that the details of these timings are far less important than the structure and quality of code.
use warnings;
use strict;
use feature 'say';
use URI;
use Mojo::Path;
use Benchmark qw(cmpthese);
my $runfor = shift // 3; #/
#my $path = '/' . 'a' x 10_000 . '/' . 'X' x 10_000;
my $path = q(/api/app/v1/method);
my $uri = URI->new($path);
my $mojo = Mojo::Path->new($path);
sub neg_cc {
my ($dir) = $path =~ m{ [^/]+ / ([^/]+) }x; return $dir; #/
}
sub non_greedy {
my ($dir) = $path =~ m{ .+? / (.+?) (?:/|$) }x; return $dir; #/
}
sub URI_path {
my $dir = ( $uri->path_segments )[2]; return $dir;
}
sub Mojo_path {
my $dir = $mojo->parts->[1]; return $dir;
}
sub just_split {
my $dir = ( split /\//, $path )[2]; return $dir;
}
cmpthese( -$runfor, {
neg_cc => sub { neg_cc($path) },
non_greedy => sub { non_greedy($path) },
just_split => sub { just_split($path) },
URI_path => sub { URI_path($path) },
Mojo_path => sub { Mojo_path($path) },
});
With a (10-second) run this prints, on a laptop with v5.16
Rate URI_path Mojo_path non_greedy neg_cc just_split
URI_path 146731/s -- -82% -87% -87% -89%
Mojo_path 834297/s 469% -- -24% -28% -36%
non_greedy 1098243/s 648% 32% -- -5% -16%
neg_cc 1158137/s 689% 39% 5% -- -11%
just_split 1308227/s 792% 57% 19% 13% --
One should keep in mind that the overhead of the function-call is very large for such a simple job, and in spite of Benchmark's work these numbers are probably best taken as a cursory guide.
Your pattern ([^\/api]\w*) consists of a capturing group that starts with a negated character class, which matches exactly one character that is not /, a, p or i. See demo.
After that, a word character is matched 0 or more times. So the pattern could, for example, match just a single character that is not listed in the character class.
What you might do instead is use a capturing group and match \w+:
^/(?:api|web)/(\w+)/v1/method
Explanation
^ Start of string
(?:api|web) Non capturing group with alternation. Match either api or web
(\w+) Capturing group 1, match 1+ word chars
/v1/method Match literally as in your example data.
Regex demo

Using regular expressions to find a word with the five letters abcde, each letter appearing exactly once, in any order, with no breaks in between

For example, the word debacle would work because of debac, but seabed would not work because: 1. there is no c in any 5-character sequence that can be formed, and 2. the letter e appears twice. As another example, feedback would work because of edbac. And remember, the solution must be done using only regular expressions.
A strategy I attempted to implement was: match the first letter if it's inside [a-e], and remember it. Then find the next letter in [a-e] but not the first letter. And so on. I wasn't sure what the syntax was (or even if some syntax existed) so my code didn't work:
open(DICT, "dictionary.txt");
@words = <DICT>;
foreach my $word(@words){
if ($word =~ /([a-e])([a-e^\1])([a-e^\1^\2])([a-e^\1^\2^\3])([a-e^\1^\2^\3^\4])/
){
print $word;
}
}
I was also thinking of using (?=regex) and \G but I wasn't sure how it would work out.
/
(?= .{0,4}a )
(?= .{0,4}b )
(?= .{0,4}c )
(?= .{0,4}d )
(?= .{0,4}e )
/xs
It probably results in faster matching to generate a pattern from all the permutations.
use Algorithm::Loops qw( NextPermute );
my @pats;
my @chars = 'a'..'e';
do { push @pats, quotemeta join '', @chars; } while NextPermute(@chars);
my $re = join '|', @pats;
abcde|abced|abdce|abdec|abecd|abedc|acbde|acbed|acdbe|acdeb|acebd|acedb|adbce|adbec|adcbe|adceb|adebc|adecb|aebcd|aebdc|aecbd|aecdb|aedbc|aedcb|bacde|baced|badce|badec|baecd|baedc|bcade|bcaed|bcdae|bcdea|bcead|bceda|bdace|bdaec|bdcae|bdcea|bdeac|bdeca|beacd|beadc|becad|becda|bedac|bedca|cabde|cabed|cadbe|cadeb|caebd|caedb|cbade|cbaed|cbdae|cbdea|cbead|cbeda|cdabe|cdaeb|cdbae|cdbea|cdeab|cdeba|ceabd|ceadb|cebad|cebda|cedab|cedba|dabce|dabec|dacbe|daceb|daebc|daecb|dbace|dbaec|dbcae|dbcea|dbeac|dbeca|dcabe|dcaeb|dcbae|dcbea|dceab|dceba|deabc|deacb|debac|debca|decab|decba|eabcd|eabdc|eacbd|eacdb|eadbc|eadcb|ebacd|ebadc|ebcad|ebcda|ebdac|ebdca|ecabd|ecadb|ecbad|ecbda|ecdab|ecdba|edabc|edacb|edbac|edbca|edcab|edcba
(This will get optimised into a trie in Perl 5.10+. Before 5.10, use Regexp::List.)
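If installing Algorithm::Loops is not an option, the same alternation can be built with a small hand-rolled permutation routine. This is only a sketch; the permute helper is hypothetical and not part of the answer above, and the three test words are the ones from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every ordering of @left appended to $done.
sub permute {
    my ($done, @left) = @_;
    return $done unless @left;
    my @out;
    for my $i (0 .. $#left) {
        my @remaining = @left;
        my ($pick) = splice @remaining, $i, 1;   # take one letter out
        push @out, permute($done . $pick, @remaining);
    }
    return @out;
}

my @pats = permute('', 'a' .. 'e');    # 120 five-letter strings
my $alt  = join '|', @pats;
my $re   = qr/$alt/;

my @hits = grep { /$re/ } qw(debacle seabed feedback);
print "@hits\n";                       # debacle feedback
```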
Your solution is clever but unfortunately [a-e^...] doesn't work, as you found. I don't believe there is a way to mix regular and negated character classes. I can think of a workaround using lookaheads though:
/(([a-e])(?!\2)([a-e])(?!\2)(?!\3)([a-e])(?!\2)(?!\3)(?!\4)([a-e])(?!\2)(?!\3)(?!\4)(?!\5)([a-e]))/
See it here: http://rubular.com/r/6pFrJe78b6.
UPDATE: Mob points out in the comments below, that alternation can be used to compact the above:
/(([a-e])(?!\2)([a-e])(?!\2|\3)([a-e])(?!\2|\3|\4)([a-e])(?!\2|\3|\4|\5)([a-e]))/
The new demo: http://rubular.com/r/UUS7mrz6Ze.
#! perl -lw
for (qw(debacle seabed feedback)) {
print if /([a-e])(?!\1)
([a-e])(?!\1)(?!\2)
([a-e])(?!\1)(?!\2)(?!\3)
([a-e])(?!\1)(?!\2)(?!\3)(?!\4)
([a-e])/x;
}

Is it possible to check if two groups are equal?

If I have some HTML like this:
<b>1<i>2</i>3</b>
And the following regex:
\<[^\>\/]+\>(.*?)\<\/[^\>]+\>
Then it will match:
<b>1<i>2</i>
I want it to only match HTML where the start and end tags are the same. Is there a way to do this?
Thanks,
Joe
Is there a way to do this?
Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.
Numbered Captures
Pretending for the nonce that HTML <i> and <b> tags are always devoid of attributes, and moreover, neither overlap nor nest, we have this simple solution:
#!/usr/bin/env perl
#
# solution A: numbered captures
#
use v5.10;
while (<>) {
say "$1: $2" while m{
< ( [ib] ) >
(
(?:
(?! < /? \1 > ) .
) *
)
</ \1 >
}gsix;
}
Which when run, produces this:
$ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
i: foo
b: bar
Named Captures
It would be better to use named captures, which leads to this equivalent solution:
#!/usr/bin/env perl
#
# Solution B: named captures
#
use v5.10;
while (<>) {
say "$+{name}: $+{contents}" while m{
< (?<name> [ib] ) >
(?<contents>
(?:
(?! < /? \k<name> > ) .
) *
)
</ \k<name> >
}gsix;
}
Recursive Captures
Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it therefore requires a recursive pattern to solve. Remembering that the trivial pattern to parse nested parens recursively is simply:
( \( (?: [^()]++ | (?-1) )*+ \) )
I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit of iterative processing to unwrap the inner bits, too.
#!/usr/bin/perl
use v5.10;
# Solution C: recursive captures, plus bonus iteration
while (my $line = <>) {
my @input = ( $line );
while (@input) {
my $cur = shift @input;
while ($cur =~ m{
< (?<name> [ib] ) >
(?<contents>
(?:
[^<]++
| (?0)
| (?! </ \k<name> > )
.
) *+
)
</ \k<name> >
}gsix)
{
say "$+{name}: $+{contents}";
push @input, $+{contents};
}
}
}
}
Which when demo’d produces this:
$ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
i: foo <i>nested</i> and <b>bar</b> bits
i: nested
b: bar
That’s still fairly simple, so if it works on your data, go for it.
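For readers who want to try the nested-parens pattern quoted above in isolation before tackling tags, here is a small sketch; the input string is made up, and Perl 5.10 or later is assumed for the recursion and possessive quantifiers:

```perl
use v5.10;    # (?-1) recursion and possessive quantifiers need 5.10+
use strict;
use warnings;

my $text = 'f(a(b)c) + g(d)';

# (?-1) recurses into the enclosing capture group, so each match is one
# complete, balanced parenthesised chunk.
my @groups = $text =~ m{ ( \( (?: [^()]++ | (?-1) )*+ \) ) }gx;

say for @groups;    # prints "(a(b)c)" and then "(d)"
```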
Grammatical Patterns
However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.
As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.
For example, this knows the attributes germane to the <i> (or <b>) tag. Here we defined regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs but now for regexes:
(?(DEFINE) # begin regex subroutine defs for grammatical regex
(?<i_tag_end> < / i > )
(?<i_tag_start> < i (?&attributes) > )
(?<attributes> (?: \s* (?&one_attribute) ) *)
(?<one_attribute>
\b
(?&legal_attribute)
\s* = \s*
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?&standard_attribute)
| (?&event_attribute)
)
(?<standard_attribute>
class
| dir
| ltr
| id
| lang
| style
| title
| xml:lang
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<event_attribute>
on click
| on dbl click
| on mouse down
| on mouse move
| on mouse out
| on mouse over
| on mouse up
| on key down
| on key press
| on key up
)
(?<nv_pair> (?&name) (?&equals) (?&value) )
(?<name> \b (?= \pL ) [\w\-] + (?<= \pL ) \b )
(?<equals> (?&might_white) = (?&might_white) )
(?<value> (?&quoted_value) | (?&unquoted_value) )
(?<unwhite_chunk> (?: (?! > ) \S ) + )
(?<unquoted_value> [\w\-] * )
(?<might_white> \s * )
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<start_tag> < (?&might_white) )
(?<end_tag>
(?&might_white)
(?: (?&html_end_tag)
| (?&xhtml_end_tag)
)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.
However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.
SUMMARY
I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:
You are using an impoverished regex language, not a modern one, and so you have no recourse to essential modern conveniences like recursive matching or grammatical patterns.
You might find such concepts as recursive and grammatical patterns too complicated for you to easily understand.
You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.
Any one or more of those might well apply. In which case, don’t do it this way.
For simple canned examples, this route is easy. The more robust you want this to work on things you’ve never seen before, the harder this route becomes.
Certainly you can’t do any of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or even worse, Javascript. Those are barely any better than the Unix grep program, and in some ways, are even worse. No, you need a modern pattern matching engine such as found in Perl or PHP to even start down this road.
But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.
Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.
This forum isn’t really in the right format for explaining all these things about modern pattern-matching. There are books, though, that do so equitably well.
You probably don't want to use regular expressions with HTML.
But if you still want to do this you need to take a look at backreferences.
Basically it's a way to capture a group (such as "b" or "i") to use it later in the same regular expression.
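A minimal sketch of that idea, applied to the sample from the question: the tag name is captured once and \1 requires the closing tag to repeat it. It deliberately ignores attributes and same-name nesting, so treat it as an illustration rather than an HTML parser.

```perl
use strict;
use warnings;

my $html = '<b>1<i>2</i>3</b>';

# \1 refers back to whatever the first group captured, so the closing
# tag must carry the same name as the opening tag.
my ($tag, $content) = $html =~ m{<([^>/]+)>(.*?)</\1>};

print "$tag: $content\n";   # b: 1<i>2</i>3
```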
Related issues:
RegEx match open tags except XHTML self-contained tags

Is it possible to make conditional regex of following?

Hello I am wondering whether it is possible to do this type of regex:
I have certain characters representing objects, e.g. #, @, $, and operations that may be used on them, like +, -, %. Every object has a different set of operations, and I want my regex to find valid pairs.
So, for example, I want the pairs #+, @-, $+ to be matched, but the pair $- not to be matched, as it is invalid.
So is there any way to do this with regexes only, without doing some gymnastics inside language using regex engine?
every object with its own rules in []
/(#[+-]|\$[+]|@[+-])/
you need to properly escape special characters
Gymnastics is hard. Try something like /#\+|@-|\$\+/ or something like that.
Just remember, +, $, and ^ are reserved, so they'll need to be escaped.
Another approach: mix the disallowed combinations with the raw combinations, but this might be slower.
/(?!\$-|\$\%)([\#\$\@][+\-\%])/, though not if there are many alternations of the first character.
my $str = '
#+, @-, $+ to be matched,
but pair $- not to be matched as it is invalid.
$% $- #% $%
';
my $regex =
qr/
(?!\$-|\$\%) # Specific combinations not allowed
(
[\#\$\@][+\-\%] # Raw combinations allowed
)
/x;
while ( $str =~ /$regex/g ) {
print "found: '$1'\n";
}
__END__
Output:
found: '#+'
found: '@-'
found: '$+'
found: '#%'

How should I handle regex-features labeled with "warning"?

How should I handle regex-features labeled with "warning" like "(?{ code })", "(??{ code })" or "Special Backtracking Control Verbs"? How serious should I take the warnings?
I kinda think they’re here to stay, one way or the other — especially code escapes. Code escapes have been with us for more than a decade.
The scariness of them — that they can call code in unforeseen ways — is taken care of by use re "eval". Also, the regex matcher hasn’t been reëntrant until 5.12 IIRC, which could limit their usefulness.
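As a rough illustration of that safeguard, the sketch below interpolates a pattern string containing a code block, first without and then with use re "eval". The pattern text and the $main::count counter are made up for the example.

```perl
use strict;
use warnings;

our $count = 0;

# Pattern text that arrives at run time (hypothetical); it carries a code block.
my $pattern = '\w+(?{ $main::count++ })';

# Without "use re 'eval'", Perl refuses to compile an interpolated code block.
my $blocked = !eval { my $m = 'abc' =~ /$pattern/; 1 };
print "without use re 'eval': ", ($blocked ? "refused" : "ran"), "\n";

my $matched;
{
    use re 'eval';    # lexically scoped opt-in
    $matched = 'abc' =~ /$pattern/;
}
print "with use re 'eval': ", ($matched ? "matched" : "no match"), "\n";
```

The opt-in is lexically scoped, so it can be confined to the one block where the pattern source is trusted.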
The string-eval version, (??{ code }), used to be the only way to do recursion, but since 5.10 we have a much better way to do that; benchmarking the speed differences shows the eval way is way slower in most cases.
I mostly use the block-eval version, (?{ code }), for adding debugging, which happens at a different granularity than use re "debug". It used to vaguely bother me that the return value from the block-eval version wasn’t usable, until I realized that it was. You just had to use it as the test part of a conditional pattern, like this pattern for testing whether a number was made up of digits that were decreasing by one each position to the right:
qr{
^ (
( \p{Decimal_Number} )
(?(?= ( \d )) | $)
(?(?{ ord $3 == 1 + ord $2 }) (?1) | $)
) $
}x
Before I figured out conditionals, I would have written that this way:
qr{
^ (
( \p{Decimal_Number} )
(?= $ | (??{ chr(1+ord($2)) }) )
(?: (?1) | $ )
) $
}x
which is much less efficient.
The backtracking control verbs are newer. I use them mostly for getting all possible permutations of a match, and that requires only (*FAIL). I believe it is the (*ACCEPT) feature that is especially marked “highly experimental”. These have only been with us since 5.10.
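A small sketch of that (*FAIL) idiom, modelled on the example in perlre: a code block records each candidate match, and (*FAIL) then forces the engine to backtrack and try the next one. The @seen array and the sample string are illustrative only.

```perl
use v5.10;
use strict;
use warnings;

our @seen;

# Record what has matched so far, then fail on purpose so the engine
# backtracks and offers the next possibility.
'abc' =~ / \w+ (?{ push @seen, $& }) (*FAIL) /x;

say join ' ', @seen;    # abc ab a bc b c
```

The overall match always fails, so only the side effect of the code block is of interest here.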