Perl greedy regex is not acting greedy - regex

Given the following code:
use strict;
use warnings;
my $text = "asdf(blablabla)";
$text =~ s/(.*?)\((.*)\)/$2/;
print "\nfirst match: $1";
print "\nsecond match: $2";
I expected that $2 would catch my last bracket, yet my output is:
first match: asdf
second match: blablabla
If .* is greedy by default, why did it stop at the bracket?

The .* is a greedy subpattern, but it does not account for grouping. Grouping is defined with a pair of unescaped parentheses (see Use Parentheses for Grouping and Capturing).
See where your group boundaries are:
s/(.*?)\((.*)\)/$2/
  | G1|  |G2|
So, the \( and \) matching ( and ) are outside the groups, and will be part of neither $1 nor $2.
If you need the ) to be part of $2, use
s/(.*?)\((.*\))/$2/
            ^^
A regex engine processes both the string and the pattern from left to right. The first (.*?) is handled first; because it is lazy (it matches as few chars as possible before the rest of the pattern can succeed), it matches up to the first literal ( symbol, and the whole part before the ( is placed into Group 1. Then the ( is matched but not captured, then (.*) matches any 0+ characters other than a newline up to the last ) symbol and places the capture into Group 2. Then the ) is just matched. The point is that .* grabs the whole string up to the end, but then backtracking happens since the engine must accommodate the final ) in the pattern. The ) must be matched, but it is not captured in your pattern; thus, it is not part of Group 2 due to the group boundary placement. You can see the regex debugger at this regex demo page to see how the pattern matches your string.
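The effect of the group boundaries can be checked directly. This is a minimal sketch in Python rather than Perl, on the assumption that Python's re module treats these two patterns the same way Perl does; the variable names are illustrative only.

```python
import re

text = "asdf(blablabla)"

# Original pattern: the literal \( and \) sit outside both groups.
outside = re.search(r"(.*?)\((.*)\)", text).groups()

# Moving \) inside the second group makes the bracket part of group 2.
inside = re.search(r"(.*?)\((.*\))", text).groups()

print(outside)  # ('asdf', 'blablabla')
print(inside)   # ('asdf', 'blablabla)')
```

In both cases .* is greedy; only the position of the closing group parenthesis decides whether the ) ends up in the capture.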

Related

Regex to remove all parentheses except most external ones

I have been trying and reading many similar SO answers with no luck.
I need to remove parentheses in the text inside parentheses keeping the text. Ideally with 1 regex... or maybe 2?
My text is:
Alpha (Bravo( Charlie))
I want to achieve:
Alpha (Bravo Charlie)
The best I got so far is:
\\(|\\)
but it gets:
Alpha Bravo Charlie
You can use a regex like this:
(\(.*?)\((.*?)\)
With this replacement string:
$1$2
Update: as per ııı's comment, since I don't know your full sample text, here is another regex in case you have this scenario:
(\([^)]*)\((.*?)\)
From your post and comments, it seems you want to remove only the innermost parentheses, for which you can use the following regex:
\(([^()]*)\)
And replace with $1 or \1, depending upon your language.
In this regex, \( matches an opening parenthesis and \) matches a closing parenthesis. ([^()]*) ensures the captured text contains neither ( nor ), which guarantees it is the innermost pair, and places the captured text in group 1. The whole match is then replaced by what was captured in group 1, thus getting rid of the innermost parentheses and retaining the text inside as it is.
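A short sketch of this replacement, written in Python as an assumption about the target language (the question does not name one); Python spells the group reference as \1.

```python
import re

text = "Alpha (Bravo( Charlie))"

# [^()]* cannot cross a parenthesis, so only an innermost pair can match.
result = re.sub(r"\(([^()]*)\)", r"\1", text)

print(result)  # Alpha (Bravo Charlie)
```

Note that re.sub replaces every match, so a string with several separate innermost pairs would have all of them stripped in one pass.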
Your pattern \(|\) uses an alternation, so it will match either an opening or a closing parenthesis.
If, according to the comments, there is only 1 pair of nested parentheses, you could match:
(\([^()]*)\(([^()]*\)[^()]*)\)
( Start capturing group 1
\( Match opening parenthesis
[^()]* Match 0+ times not ( or )
) Close group 1
\( Match the inner opening parenthesis (not captured)
( Start capturing group 2
[^()]*\) Match 0+ times not ( or ), then a closing parenthesis
[^()]* Match 0+ times not ( or )
) Close group 2
\) Match the final closing parenthesis (not captured)
And replace with the first and the second capturing group.
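To see what each group holds for the sample text, here is a small sketch in Python (an assumed language choice; the replacement string uses \1\2 instead of $1$2).

```python
import re

text = "Alpha (Bravo( Charlie))"
pattern = r"(\([^()]*)\(([^()]*\)[^()]*)\)"

m = re.search(pattern, text)
group1 = m.group(1)
group2 = m.group(2)
result = re.sub(pattern, r"\1\2", text)

print(repr(group1))  # '(Bravo'
print(repr(group2))  # ' Charlie)'
print(result)        # Alpha (Bravo Charlie)
```

The inner ( and the last ) are matched outside the groups, so they are the two characters that disappear.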

Optional regular expression operator in PowerShell

In $string, I'm trying to remove the first "-1" so that the output string will be "test test test-Long.xml".
$string = 'test test test-1-Long.xml'
$string -replace '^(.*)-?\d?(-?.*)\.xml$', '$1$2'
My issue is that I need to make that same first "-1" pattern optional, since the hyphen and the number may both be absent.
Why is the "?" operator not working? I've also tried {0,1} after each as well with no luck.
Quantifiers are greedy by default, so the leading (.*) consumes the "-1" itself, and the optional -?\d? that follows then matches nothing.
I am not sure it's the best solution, but I could make it work this way:
$string -replace '^([^\-]*)-?\d?(-?.*)\.xml$', '$1$2'
The only change: the first group must not contain a dash. That kind of "balances" the regex by keeping the first group from greedily consuming the "-1", and it yields:
test test test-Long
Note: the output is not test test test-Long.xml as required in your question. To get that, simply remove the \.xml$ suffix from the pattern:
$string -replace '^([^\-]*)-?\d?(-?.*)', '$1$2'
The $string -replace '^(.*?)(?:-\d+)?(-.*?)\.xml$', '$1$2' should work if the hyphen is obligatory in the input. Or $string -replace '^((?:(?!-\d+).)*)(?:-\d+)?(.*)\.xml$', '$1$2' in case the input may have no hyphen.
See the regex demo 1 and regex demo 2.
Pattern details:
^ - start of string
(.*?) - Group 1 capturing any 0+ characters other than a newline, as few as possible (as the *? quantifier is lazy), up to the first match of the subsequent subpatterns (NOTE: to increase regex performance, you may use a tempered greedy token based pattern instead of (.*?) - ((?:(?!-\d+).)*) - that matches any text other than a - followed by 1 or more digits, thus acting similarly to a negated character class, but for a sequence of symbols)
(?:-\d+)? - non-capturing group with a greedy ? quantifier (so this group has more priority for the regex engine, and the previous capture will end before this pattern) matching a hyphen followed by one or more digits
(-.*?) - Group 2 capturing an obligatory - and any 0+ chars other than LF, as few as possible, up to
\.xml - literal text .xml
$ - end of string.
Why is the "?" operator not working?
That is not true. The ? quantifier works well: it matches one or zero occurrences of the quantified subpattern. However, the issue arises in combination with the first .* greedy dot matching subpattern. See your regex in action: the first capture group grabs the whole substring up to the last .xml, and the second group is empty. Why?
Because of backtracking and how a greedy quantifier works. The .* matches any characters but a newline, as many as possible. Thus, it grabs the whole string up to the end. Then, backtracking starts: one character at a time is given back and tested against the subsequent subpatterns.
What are they? -?\d?(-?.*) - all of them can match an empty string. The -? matches an empty string before .xml, ok, \d? matches there as well, and -? and .* also match there.
The second .* then grabs the rest of the string, but it has to give it back so that \.xml can match. So, the second capture group ends up empty. In fact, there are more steps the regex engine performs (see the regex debugger page), but the main idea is like that.
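The backtracking result described above can be reproduced with a short sketch. It is written in Python rather than PowerShell, on the assumption that both engines behave the same for these particular patterns; the variable names are illustrative.

```python
import re

s = "test test test-1-Long.xml"

# Greedy first group: (.*) keeps everything up to ".xml", and the optional
# parts after it are all satisfied by matching nothing.
greedy = re.search(r"^(.*)-?\d?(-?.*)\.xml$", s).groups()

# Lazy first group plus an optional non-capturing group for "-<digits>".
fixed_pattern = r"^(.*?)(?:-\d+)?(-.*?)\.xml$"
with_number = re.sub(fixed_pattern, r"\1\2", s)
without_number = re.sub(fixed_pattern, r"\1\2", "test test test-Long.xml")

print(greedy)          # ('test test test-1-Long', '')
print(with_number)     # test test test-Long
print(without_number)  # test test test-Long
```

As with the PowerShell version, \.xml sits outside both groups, so the extension is not part of the replacement.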

Perl regex to extract digits from string with parenthesis

I have the following string:
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another variable.
The following statement does not work:
my ($NAdapter) = $string =~ /\((\d+)\)/;
What is the correct syntax?
\d+(?=[^(]*\))
You can use this. See the demo. Yours will not work because there is more data inside the () besides \d+.
https://regex101.com/r/fM9lY3/57
You could try something like
my ($NAdapter) = $string =~ /\(.*(\d+).*\)/;
After that, $NAdapter should include the number that you want.
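One caveat with the greedy .* in front of (\d+): it only gives back as much as it must, so a multi-digit number would be truncated to its last digit. A sketch of this in Python (an assumed stand-in for Perl here; the sample string with 1234 and the \D* variant are illustrative, not from the answer above):

```python
import re

s = "Ethernet FlexNIC (NIC 1234) LOM1:1-a"

# Greedy .* backtracks only as far as needed, leaving one digit for (\d+).
greedy = re.search(r"\(.*(\d+).*\)", s).group(1)

# \D* cannot consume digits, so (\d+) receives the whole number.
bounded = re.search(r"\(\D*(\d+)\D*\)", s).group(1)

print(greedy)   # 4
print(bounded)  # 1234
```

For the single-digit (NIC 1) in the question both patterns give the same answer, which is why the problem is easy to miss.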
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another
variable
Your regex (with some spaces for clarity):
/ \( (\d+) \) /x;
says to match:
A literal opening parenthesis, immediately followed by...
A digit, one or more times (captured in group 1), immediately followed by...
A literal closing parenthesis.
Yet, the substring you want to match:
(NIC 1)
is of the form:
A literal opening parenthesis, immediately followed by...
Some capital letters
STOP EVERYTHING! NO MATCH!
As an alternative, your substring:
(NIC 1)
could be described as:
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (NIC 1234) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
(\d+) #Match any digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
#Parentheses have a special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
1234
Another description of your substring:
(NIC 1)
could be:
A literal opening parenthesis, immediately followed by...
Some non-digits, immediately followed by...
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (ABC NIC789) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
789
If there might be spaces on some lines and not others, such as:
      spaces
      ||
      VV
(NIC 1  )
(NIC 2)
You can insert a \s* (any whitespace, zero or more times) in the appropriate place in the regex, for instance:
my ($match) = $string =~ /
#Parentheses have special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\s* #any whitespace, zero or more times, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
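The same commented pattern can be tried against several inputs at once. This is a sketch in Python, assuming its re.VERBOSE flag as the counterpart of Perl's /x; the sample lines are made up for illustration.

```python
import re

pattern = re.compile(r"""
    \(       # a literal opening parenthesis
    \D+      # one or more non-digits
    (\d+)    # one or more digits, captured in group 1
    \s*      # optional whitespace
    \)       # a literal closing parenthesis
""", re.VERBOSE)

# Hypothetical inputs: with and without spaces before the closing parenthesis.
lines = ["(NIC 1  )", "(NIC 2)", "(ABC NIC789 )"]
numbers = [pattern.search(line).group(1) for line in lines]

print(numbers)  # ['1', '2', '789']
```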

Could someone explain the regex /(.*)\.(.*)/?

I want to get the file extension in Groovy with a regex, for let's say South.6987556.Input.csv.cop.
http://www.regexplanet.com/advanced/java/index.html shows me that the second group would really contain the cop extension. Which is what I want.
0: [0,27] South.6987556.Input.csv.cop
1: [0,23] South.6987556.Input.csv
2: [24,27] cop
I just don't understand why the result won't be
0: [0,27] South.6987556.Input.csv.cop
1: [0,23] South
2: [24,27] 6987556.Input.csv.cop
What should be the regex to get this kind of result?
Here is a visualization of this regex
(.*)\.(.*)
Debuggex Demo
in words
(.*) matches anything, as large as possible, and captures it
\. matches one period, not captured (no parentheses)
(.*) matches anything again, may be empty, and captures it
in your case this is
(.*) : South.6987556.Input.csv
\. : .
(.*) : cop
It isn't South and 6987556.Input.csv.cop because the first part (.*) is greedy and must be followed by a period, so the engine gives it the largest possible string that still leaves a period for \. to match.
Your intended result would be produced by this regex: (.*?)\.(.*). The ? after a quantifier (in this case *) switches the quantifier to ungreedy (lazy) behaviour, so the smallest matching string is used. Quantifiers are greedy by default in most regex engines.
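The two behaviours can be compared side by side. The question is about Groovy; this sketch uses Python instead, assuming its re module splits this file name the same way for these two patterns.

```python
import re

name = "South.6987556.Input.csv.cop"

# Greedy: the first group extends to the last period.
greedy = re.match(r"(.*)\.(.*)", name).groups()

# Lazy: the first group stops at the first period.
lazy = re.match(r"(.*?)\.(.*)", name).groups()

print(greedy)  # ('South.6987556.Input.csv', 'cop')
print(lazy)    # ('South', '6987556.Input.csv.cop')
```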
To get the desired output, your regex should be:
((.*?)\.(.*))
DEMO
See the captured groups at right bottom of the DEMO site.
Explanation:
( group and capture to \1:
( group and capture to \2:
.*? any character except \n (0 or more
times); the ? after * makes the regex engine
do a non-greedy match (shortest possible match).
) end of \2
\. '.'
( group and capture to \3:
.* any character except \n (0 or more
times)
) end of \3
) end of \1

What do these Perl regexes mean?

What does the following syntax mean in Perl?
$line =~ /([^:]+):/;
and
$line =~ s/([^:]+):/$replace/;
See perldoc perlreref
[^:]
is a character class that matches any character other than ':'.
[^:]+
means match one or more of such characters.
I am not sure the capturing parentheses are needed. In any case,
([^:]+):
captures a sequence of one or more non-colon characters followed by a colon.
$line =~ /([^:]+):/;
The =~ operator is called the binding operator, it runs a regex or substitution against a scalar value (in this case $line). As for the regex itself, () specify a capture. Captures place the text that matches them in special global variables. These variables are numbered starting from one and correspond to the order the parentheses show up in, so given
"abc" =~ /(.)(.)(.)/;
the $1 variable will contain "a", the $2 variable will contain "b", and the $3 variable will contain "c" (if you haven't guessed yet . matches one character*). [] specifies a character class. Character classes will match one character in them, so /[abc]/ will match one character if it is "a", "b", or "c". Character classes can be negated by starting them with ^. A negated character class matches one character that is not listed in it, so [^abc] will match one character that is not "a", "b", or "c" (for instance, "d" will match). The + is called a quantifier. Quantifiers tell you how many times the preceding pattern must match. + requires the pattern to match one or more times. (the * quantifier requires the pattern to match zero or more times). The : has no special meaning to the regex engine, so it just means a literal :.
So, putting that information together we can see that the regex will match one or more non-colon characters (saving this part to $1) followed by a colon.
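The difference between the whole match and the capture can be made concrete with a small sketch. It uses Python as a stand-in for Perl (an assumption; group(0) plays the role of Perl's $& and group(1) the role of $1), and the input string is illustrative.

```python
import re

line = "abc:foo"
m = re.search(r"([^:]+):", line)

whole_match = m.group(0)  # everything the pattern matched, colon included
capture = m.group(1)      # only the part inside the parentheses

print(whole_match)  # abc:
print(capture)      # abc
```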
$line =~ s/([^:]+):/$replace/;
This is a substitution. Substitutions have two parts, the regex, and the replacement string. The regex part follows all of the same rules as normal regexes. The replacement part is treated like a double quoted string. The substitution replaces whatever matches the regex with the replacement, so given the following code
my $line = "key: value";
my $replace = "option";
$line =~ s/([^:]+):/$replace/;
The $line variable will hold the string "option value".
You may find it useful to read perldoc perlretut.
* except newline, unless the /s option is used, in which case it matches any character
The first one captures the part in front of a colon from a line, such as "abc" in the string "abc:foo". More precisely it matches at least one non-colon character (though as many as possible) directly before a colon and puts them into a capture group.
The second one substitutes said part, this time including the colon, with the contents of the variable $replace.
I may be misunderstanding some of the previous answers, but I think that there's some confusion about the second example. It will not replace only the captured item (i.e., one or more non-colons up until a colon) by $replace. It will replace all of ([^:]+): with $replace - the colon as well. (The substitution operates on the match, not just the capture.)
This means if you don't include a colon in $replace (and you want one), you will get bit:
my $line = 'http://www.example.com/';
my $replace = 'ftp';
$line =~ s/([^:]+):/$replace/;
print "Here's \$line now: $line\n";
Output:
Here's $line now: ftp//www.example.com/ # Damn, no colon!
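Two ways to keep the colon are sketched below in Python, used here as an assumed stand-in for Perl: put the colon into the replacement, or move it into a lookahead so it is never part of the match. count=1 mimics Perl's s/// without the /g flag.

```python
import re

line = "http://www.example.com/"

# The substitution replaces the whole match, so the colon is lost.
lost = re.sub(r"([^:]+):", "ftp", line, count=1)

# Option 1: include the colon in the replacement text.
kept = re.sub(r"([^:]+):", "ftp:", line, count=1)

# Option 2: a lookahead requires the colon without consuming it.
lookahead = re.sub(r"[^:]+(?=:)", "ftp", line, count=1)

print(lost)       # ftp//www.example.com/
print(kept)       # ftp://www.example.com/
print(lookahead)  # ftp://www.example.com/
```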
I'm not sure if you are just looking at example code, but unless you plan to use the capture, I'm not sure you really want it in these examples.
If you are very unfamiliar with regular expressions (or Perl), you should look at perldoc perlrequick before trying perldoc perlre or perldoc perlretut.
The first one matches one or more characters that are anything but :, followed by a :. The second one matches the same thing but replaces it with $replace.
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new('([^:]+):')->explain"
The regular expression:
(?-imsx:([^:]+):)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
$line =~ /([^:]+):/;
Matches a run of one or more characters other than : that is followed by a :
If $line = "http://www.google.com", it will match http (the variable $1 will contain http)
$line =~ s/([^:]+):/$replace/;
This time, the matched text (including the colon) is replaced by the content of the variable $replace