Perl: Method to convert regexp with greedy quantifiers to non-greedy - regex

My user gives a regexp with quantifiers that default to being greedy. He can give any valid regexp. So the solution will have to deal with anything that the user can throw at me.
How do I convert the regexp so any greedy quantifier will be non-greedy?
Does Perl have a (?...:regexp) construct that forces the greedy default for quantifiers into a non-greedy one?
If not: Is there a different way I can force a regexp with greedy quantifiers into a non-greedy one?
E.g., a user may enter:
.*
[.*]
[.*]{4,10}
[.*{4,10}]{4,10}
While these four examples may look similar, they have completely different meanings.
If you simply add ? after every */} you will change the character sets in the last three examples.
Instead they should be changed to/behave like:
.*?
[.*]
[.*]{4,10}?
[.*{4,10}]{4,10}?
but the match should also be the shortest possible one, not the leftmost match that Perl finds by default:
$a="aab";
$a=~/(a.*?b)$/;
# Matches aab, not ab
print $1;
But given the non-greedy regexp, the minimal match can probably be obtained by prepending .*:
$a="aab";
$a=~/.*(a.*?b)$/;
# Matches ab
print $1;

"Greediness" is not a property of the whole regular expression. It is a property of each quantifier.
It can be controlled for each quantifier separately. Just add a ? after a quantifier to make it non-greedy, e.g.
[a-z]*?
a{2,3}?
[0-9]??
\s+?
And no, there isn't any built-in way to turn the whole regex to some "default-non-greedy" mode. You need to parse the regex, detect all quantifiers and change them accordingly. Maybe there's a regex-parsing library somewhere on CPAN.
The closest I've found so far is the Regexp::Parser module. I haven't tried it, but it looks like it could parse the regex, walk the tree, make the appropriate changes, and then build a modified regex. Please take a look.
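To illustrate that greediness belongs to the individual quantifier, here is a minimal sketch; the helper name first_capture and the sample string are made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture of $pattern in $text,
# or undef when the pattern does not match.
sub first_capture {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? $1 : undef;
}

my $text = '<b>one</b><b>two</b>';

# Greedy: .* takes as much as it can, then backtracks to the last </b>.
print first_capture(qr{<b>(.*)</b>},  $text), "\n";   # one</b><b>two
# Non-greedy: .*? stops at the first </b> that lets the match succeed.
print first_capture(qr{<b>(.*?)</b>}, $text), "\n";   # one
```

Only the quantifier followed by ? changes behavior; every other quantifier in the same pattern would stay greedy.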

You can use a state machine:
#!/usr/bin/perl
use strict;
use warnings;
my @regexes = ( ".*", "[.*]", "[.*]{4,10}", "[.*{4,10}]{4,10}" );
for (@regexes) {
print "give: $_\n";
my $ungreedy = make_ungreedy($_,0);
print "got: $ungreedy\n";
print "============================================\n"
}
sub make_ungreedy {
my $regex = shift;
my $class_state = 0;
my $escape_state = 0;
my $found = 0;
my $ungreedy = "";
for (split (//, $regex)) {
if ($found) {
$ungreedy .= "?" unless (/\?/);
$found = 0;
}
$ungreedy .= $_;
$escape_state = 0, next if ($escape_state);
$escape_state = 1, next if (/\\/);
$class_state = 1, next if (/\[/);
if ($class_state) {
$class_state = 0 if (/\]/);
next;
}
$found = 1 if (/[*}+]/);
}
$ungreedy .= '?' if $found;
return $ungreedy;
}

Perl regex exclude optional word from match

I have some strings and need to extract only the icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract the following data from the example above.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it only works for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three lines?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
# Print only when the line matches, so a stale $1 is never reused
print "$1\n" if m/number:([A-Z0-9]+)/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative to get only the values as a match using \K to reset the match buffer
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the right, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
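The pattern described above can be tried out with a short sketch; the sub name extract_ids is made up for this example, and the input is the sample data from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: collect Group 2 (the letters/digits after the colon)
# for every match in $text.
sub extract_ids {
    my ($text) = @_;
    my @ids;
    while ($text =~ /\b((?:icn)?number):([^\W_]+)/g) {
        push @ids, $2;   # $1 holds "icnnumber" or "number"
    }
    return @ids;
}

my $data = "icnnumber:9876AB54321_IN\nnumber:987654321FR\nicnnumber:987654321YQ\n";
print "$_\n" for extract_ids($data);
```

Because [^\W_] excludes the underscore, the match stops right before _IN on the first line.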
Just another suggestion maybe, but if your strings are always valid, you may consider just to split on a character class and pull the second index from the resulting array:
my $string= "number:987654321FR";
my @part = (split /[:_]/, $string)[1];
print @part;
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n"
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is another way to solve it.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
The first approach captures everything up to an _ or a word boundary:
@arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
Alternatively, capture a run of characters that are neither non-word characters (\W) nor _, and collect the matches in the array:
@arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Get the second string of the URI with Perl regex

I need to get the second part of the URI, the possible URI are:
/api/application/v1/method
/web/application/v1/method
I can get "application" using:
([^\/api]\w*)
and
([^\/web]\w*)
But I know this is not the best approach. What would be a good way?
Thanks!
Edit: thank you all for the input. The goal was to set the second part of the URI into a header in Apache with rewrite rules.
A general regex (Perl or PCRE syntax) solution would be:
^/[^/]+/([^/]+)
Each section is delimited with /, so just capture as many non-/ characters as there are.
This is preferable to non-greedy regexes because it does not need to backtrack, and allows for whatever else the sections may contain, which can easily contain non-word characters such as - that won't be matched by \w.
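As a minimal sketch of that pattern in use (the sub name second_segment and the second sample path are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: return the second path segment, or undef when the
# path has fewer than two segments.
sub second_segment {
    my ($path) = @_;
    my ($segment) = $path =~ m{^/[^/]+/([^/]+)};
    return $segment;
}

print second_segment('/api/application/v1/method'), "\n";   # application
print second_segment('/web/my-app/v1/method'), "\n";        # my-app
```

Using m{...} as the delimiter avoids having to escape every / in the pattern.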
There are many ways to do this, and I'm not sure which one is best, but it could be as simple as:
\/(.+?)\/(.+?)\/.*
Here the desired output is in the second capturing group, $2.
Example
#!/usr/bin/perl -w
use strict;
use warnings;
use feature qw( say );
main();
sub main{
my $string = '/api/application/v1/method
/web/application/v1/method';
my $pattern = '\/(.+?)\/(.+?)\/.*';
my $match = replace($pattern, '$2', $string);
say $match , " is a match 💚💚💚 ";
}
sub replace {
my ($pattern, $replacement, $string) = @_;
$string =~s/$pattern/$replacement/gee;
return $string;
}
Output
application
application is a match 💚💚💚
Advice
zdim advises that:
A legitimate approach, notes:
(1) there is no need for the trailing .*
(2) Need /|$ (not just /), in case the path finishes without / (to
terminate the non-greedy pattern at the end of string, if there is no
/)
(3) note though that /ee can be vulnerable (even just to errors),
since the second evaluation (e) will run code if the first evaluation
results in code. And it may be difficult to ensure that that is always
done under full control. More to the point, for this purpose there is
no reason to run a substitution --- just match and capture is enough.
With all the regex, explicitly asked for, I'd like to bring up other approaches.
These also parse only a (URI style) path, like the regex ones, and return the second directory.
The most basic and efficient one, just split the string on /
my $dir = ( split /\//, $path )[2];
The split returns '' first (before the first /) thus we need the third element. (Note that we can use an alternate delimiter for the separator pattern, it being regex: split m{/}, $path.)
Use appropriate modules, for example URI
use URI;
my $dir = ( URI->new($path)->path_segments )[2];
or Mojo::Path
use Mojo::Path;
my $dir = Mojo::Path->new($path)->parts->[1];
What to use depends on details of what you do -- if you've got any other work with URLs and web then you clearly want modules for that; otherwise they may (or may not) be an overkill.
I've benchmarked these for a sanity check of what one is paying with modules.
The split either beats regex by up to 10-15% (the regex using negated character class and the one based on non-greedy .+? come around the same), or is about the same with them. They are faster than Mojo by about 30%, and only URI lags seriously, by a factor of 5 behind Mojo.
That's for paths typical for real-life URLs, with a handful of short components. With only two very long strings (10k chars), Mojo::Path (surprisingly for me) is a factor of six ahead of split (!), which is ahead of character-class regex by more than an order of magnitude.
The negated-character-class regex for such long strings beats the non-greedy (.+?) one by a factor of 3, good to know in its own right.
In all this the URI and Mojo objects were created once, ahead of time.
Benchmark code. I'd like to note that the details of these timings are far less important than the structure and quality of code.
use warnings;
use strict;
use feature 'say';
use URI;
use Mojo::Path;
use Benchmark qw(cmpthese);
my $runfor = shift // 3; #/
#my $path = '/' . 'a' x 10_000 . '/' . 'X' x 10_000;
my $path = q(/api/app/v1/method);
my $uri = URI->new($path);
my $mojo = Mojo::Path->new($path);
sub neg_cc {
my ($dir) = $path =~ m{ [^/]+ / ([^/]+) }x; return $dir; #/
}
sub non_greedy {
my ($dir) = $path =~ m{ .+? / (.+?) (?:/|$) }x; return $dir; #/
}
sub URI_path {
my $dir = ( $uri->path_segments )[2]; return $dir;
}
sub Mojo_path {
my $dir = $mojo->parts->[1]; return $dir;
}
sub just_split {
my $dir = ( split /\//, $path )[2]; return $dir;
}
cmpthese( -$runfor, {
neg_cc => sub { neg_cc($path) },
non_greedy => sub { non_greedy($path) },
just_split => sub { just_split($path) },
URI_path => sub { URI_path($path) },
Mojo_path => sub { Mojo_path($path) },
});
With a (10-second) run this prints, on a laptop with v5.16
                Rate   URI_path  Mojo_path non_greedy     neg_cc just_split
URI_path    146731/s         --       -82%       -87%       -87%       -89%
Mojo_path   834297/s       469%         --       -24%       -28%       -36%
non_greedy 1098243/s       648%        32%         --        -5%       -16%
neg_cc     1158137/s       689%        39%         5%         --       -11%
just_split 1308227/s       792%        57%        19%        13%         --
One should keep in mind that the overhead of the function-call is very large for such a simple job, and in spite of Benchmark's work these numbers are probably best taken as a cursory guide.
Your pattern ([^\/api]\w*) consists of a capturing group and a negated character class that first matches exactly one character that is not /, a, p or i.
After that 0+ times a word char will be matched. The pattern could for example only match a single char which is not listed in the character class.
What you might do is use a capturing group and match \w+
^/(?:api|web)/(\w+)/v1/method
Explanation
^ Start of string
(?:api|web) Non capturing group with alternation. Match either api or web
(\w+) Capturing group 1, match 1+ word chars
/v1/method Match literally as in your example data.

Regular Expression to find $0.00

Need to count the number of "$0.00" in a string. I'm using:
my $zeroDollarCount = ("\Q$menu\E" =~ tr/\$0\.00//);
but it doesn't work. The issue is the $ sign is throwing the regex off. It works if I just want to count the number of $, but fails to find $0.00.
How is this a duplicate? Your solution does not address dollar sign which is an issue for me.
You are using the transliteration operator tr///. That doesn't have anything to do with a pattern. You need the match operator m// instead. And because you want it to find all occurrences of the pattern, use the /g modifier.
my $count = () = $menu =~ m/\$0\.00/g;
If we run this program, the output is 2.
use strict;
use warnings;
my $menu = '$0.00 and $0.00';
my $count = () = $menu =~ m/\$0\.00/g;
print $count;
Now let's take a look at what is going on. First, the pattern of the match.
/\$0\.00/
This is fairly straight-forward. There is a literal $, which we need to escape with a backslash \. The zero is followed by a literal dot ., which again we need to escape, because like the $ it has special meanings in regular expressions.
my $count = () = $menu =~ m/\$0\.00/g;
This whole line looks weird. We can break it up into a few lines to make it more readable.
my @matches = ( $menu =~ m/\$0\.00/g );
my $count = scalar @matches;
We need the /g switch on the regular expression match to make it match all occurrences. In list context, the match operation returns all matches (which will be the string "$0.00" a number of times). Because we want the count, we then force that into scalar context, which gives us the number of elements. That can be shortened to one line by the idiom shown above.
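To see why tr/// cannot do this job, here is a small sketch comparing what each operator counts on the sample string; the sub names count_chars and count_matches are made up for illustration:

```perl
use strict;
use warnings;

# tr/// treats its argument as a set of characters: here '$', '0' and '.'.
# With an empty replacement list it only counts and leaves the string alone.
sub count_chars {
    my ($s) = @_;
    return ($s =~ tr/\$0.//);
}

# m//g in list context returns every match; "() =" turns that into a count.
sub count_matches {
    my ($s) = @_;
    my $count = () = $s =~ m/\$0\.00/g;
    return $count;
}

my $menu = '$0.00 and $0.00';
print count_chars($menu), "\n";     # 10
print count_matches($menu), "\n";   # 2
```

The tr/// version counts ten individual characters, while the match operator counts the two occurrences of the whole string.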

How do I check if a string has exactly one of a certain character

I'm trying to scan strings to see if they have exactly one of a certain character.
For example if I'm looking for a question mark
Hello? I'm here
Will match the regex however
Hello? Are you listening?
Will not
I've tried ?{1} and ?{1}[^?]+ but they both don't work. Can anyone point me in the right direction?
Why not do:
(\?)
and count the number of matches.
Or even more simply, count number of ? in string using tr///
my $c = $string1 =~ tr/?//;
You could do something like
my $cnt = () = $str =~ m/\Q$pat/g;
if ($cnt == 1) {
# matched
}
else {
# failed
}
$pat is the pattern (character in this case) you want to match, such as '?'.
If you're looking for a particular character only, you can use the transliteration operator, tr///:
my $count = $string =~ tr/?/?/;
if( $count == 1 ) {
...
}
With the transliteration operator, I can leave off the replacement side and any characters not lined up with a replacement character will use the previous replacement character. If there isn't a previous replacement character, it makes no replacement. I just leave out the second part of the tr///:
my $count = $string =~ tr/?//;
if( $count == 1 ) {
...
}
This won't work for patterns though. This is strictly for character-to-character replacements. For a pattern, use the match-counting approach from Lee Duhem's answer.
You can use this regex:
^[^?]*\?[^?]*$
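A minimal sketch of this regex applied to the two strings from the question (the sub name has_exactly_one_qmark is made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: true when $s contains exactly one question mark.
# [^?]* on both sides forbids any second '?' anywhere in the string.
sub has_exactly_one_qmark {
    my ($s) = @_;
    return $s =~ /^[^?]*\?[^?]*$/ ? 1 : 0;
}

print has_exactly_one_qmark("Hello? I'm here"), "\n";            # 1
print has_exactly_one_qmark("Hello? Are you listening?"), "\n";  # 0
```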

Sub-pattern in regex can't be dereferenced?

I have following Perl script to extract numbers from a log. It seems that the non-capturing group with ?: isn't working when I define the sub-pattern in a variable. It's only working when I leave out the grouping in either the regex-pattern or the sub-pattern in $number.
#!/usr/bin/perl
use strict;
use warnings;
my $number = '(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)';
#my $number = '-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?';
open(FILE,"file.dat") or die "Exiting with: $!\n";
while (my $line = <FILE>) {
if ($line =~ m{x = ($number). y = ($number)}){
print "\$1= $1\n";
print "\$2= $2\n";
print "\$3= $3\n";
print "\$4= $4\n";
};
}
close(FILE);
The output for this code looks like:
$1= 12.15
$2= 12.15
$3= 3e-5
$4= 3e-5
for an input of:
asdf x = 12.15. y = 3e-5 yadda
Those doubled outputs aren't desired.
Is this because of the m{} style in contrast to the regular m// patterns for regex? I only know the former style to get variables (sub-strings) in my regex expressions. I just noticed this for the backreferencing so possibly there are other differences for metacharacters?
The delimiters you use for the regular expression aren't causing any problems but the following is:
(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)
 ^^
Notice this isn't a capturing group, it is an optional colon :
It is probably a typo, but it is not what causes the doubled output.
Edit: The typo is not the real cause. I substituted the variables into the regex and got this:
x = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)). y = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?))
^^ first and second group ^^ ^^ third and fourth group ^^
As you can see the first and second capturing group are capturing exactly the same thing, the same is happening for the third and fourth capturing group.
You're going to kick yourself...
Your regexp reads out as:
capture {
    maybe-colon
    maybe-minus
    cluster {                        (?:(?:\d+\.?\d*)|(?:\.\d+))
        cluster {                    (?:\d+\.?\d*)
            1+ digits
            maybe-dot
            0+ digits
        }
        -or-
        cluster {                    (?:\.\d+)
            dot
            1+ digits
        }
    }
    maybe cluster {
        E or e
        maybe + or -
        1+ digits
    }                                (?:[Ee][+-]?\d+)?
}
... which is what you're looking for.
However, when you then do your actual regexp, you do:
$line =~ m{x = ($number). y = ($number)}
(the curly braces are a distraction.... you may use any \W if the m or s has been specified)
What this is asking is to capture whatever the regexp defined in $number is.... which is, itself, a capture.... hence $1 and $2 being the same thing.
Simply remove the capturing parentheses from either $number or the regexp line.
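One way this fix might look, as a sketch rather than the asker's final code: the sub-pattern becomes a non-capturing cluster built with qr//, and only the outer parentheses capture. The sub name parse_xy is made up, and the separator dot is escaped here as an assumption (the original used an unescaped .):

```perl
use strict;
use warnings;

# The number pattern as a non-capturing cluster (note (?: rather than (:?).
my $number = qr/(?:-?(?:\d+\.?\d*|\.\d+)(?:[Ee][+-]?\d+)?)/;

# Hypothetical helper: return the x and y values found in $line.
sub parse_xy {
    my ($line) = @_;
    # Only the parentheses written here capture, so $1 is x and $2 is y.
    my ($x, $y) = $line =~ m{x = ($number)\. y = ($number)};
    return ($x, $y);
}

my ($x, $y) = parse_xy('asdf x = 12.15. y = 3e-5 yadda');
print "x = $x, y = $y\n";   # x = 12.15, y = 3e-5
```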