Why does the regex (aba?)+ not match abab? [duplicate]

This question already has answers here:
Reference - What does this regex mean?
(1 answer)
Match exact string
(3 answers)
Closed 5 years ago.
Given (aba?)+ as the regex and abab as the string:
Why does it only match aba?
Since the last a in the regex is optional, isn't abab a match as well?
Tested on https://regex101.com/

The reason (aba?)+ matches only aba out of abab is greedy matching: the optional a in the first iteration is tried before the group is repeated, and it matches. The remaining string is then b, which cannot start another iteration of (aba?).
If you want to turn off greedy matching for this optional a, use a??, or write your regex differently.
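The greedy and lazy behavior can be reproduced outside regex101. The following is a minimal sketch in Python, assuming its re module (a backtracking engine that handles a? and a?? like PCRE does for this pattern); the variable names are illustrative only:

```python
import re

text = "abab"

# Greedy a?: the first iteration consumes "aba", leaving "b",
# which cannot start another iteration of the group.
greedy = re.search(r"(aba?)+", text).group(0)

# Lazy a??: the first iteration stops at "ab", so the group can repeat.
lazy = re.search(r"(aba??)+", text).group(0)

print(greedy)  # aba
print(lazy)    # abab
```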

Since (aba?)+ is greedy, your pattern tries to match as much as possible. And since it first matches "aba", the remaining "b" is not matched.
Try the non-greedy version (it will match the first and second "ab"'s):
$ echo "abab" | grep -Po "(aba?)+"
aba
$ echo "abab" | grep -Po "(aba??)+"
abab

If the whole string must match, the correct regex is:
^(aba??)+$
and not (aba??)+, as discussed with @WiktorStribizew and YSelf.
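Why the anchors matter can be seen with a slightly different input. This is a sketch in Python (re module assumed; the test strings "aba" and "abab" are chosen for illustration): unanchored, the lazy pattern may settle for a shorter match, while the anchors force the engine to backtrack until the whole string is consumed.

```python
import re

# Unanchored, the lazy version may stop early: on "aba" it settles for "ab".
partial = re.search(r"(aba??)+", "aba").group(0)

# Anchored, the engine has to backtrack until the whole string is consumed.
anchored_lazy = re.search(r"^(aba??)+$", "aba").group(0)

# With anchors, even the greedy a? matches "abab": the failed end-of-string
# check forces the first iteration to give back its optional "a".
anchored_greedy = re.search(r"^(aba?)+$", "abab").group(0)

print(partial, anchored_lazy, anchored_greedy)  # ab aba abab
```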

Related

Extract string between combination of words and characters [duplicate]

This question already has answers here:
How to use sed/grep to extract text between two words?
(14 answers)
Closed last year.
I would like to keep the strings between FROM and as, and between FROM and the newline character.
Input:
FROM some_registry as registry1
FROM another_registry
Output:
some_registry
another_registry
Using the following sed command, I can extract the strings. Is there a way to combine the two sed commands?
sed -e 's/.*FROM \(.*\) as.*/\1/' | sed s/"FROM "//
Merging this into a single regular expression is hard because POSIX regex does not support lazy quantifiers.
With GNU sed, you can pass the command as
sed 's/.*FROM \(.*\) as.*/\1/;s/FROM //' file
See this online demo.
However, if you have GNU grep, you can use a somewhat more precise expression:
#!/bin/bash
s='FROM some_registry as registry1
From another_registry'
grep -oP '(?i)\bFROM\s+\K.*?(?=\s+as\b|$)' <<< "$s"
See the online demo. Details:
(?i) - case insensitive matching ON
\b - a word boundary
FROM - a word
\s+ - one or more whitespaces
\K - "forget" all text matched so far
.*? - zero or more chars other than line break chars, as few as possible
(?=\s+as\b|$) - a positive lookahead that matches a location immediately followed by one or more whitespaces and then the whole word as, or by the end of string (for grep, the end of each line).
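The same idea carries over to other regex engines. Below is a hedged sketch in Python, whose re module has no \K, so a capture group stands in for it; re.MULTILINE is an assumption made here so that $ matches at the end of each line, which grep gets for free by processing input line by line.

```python
import re

text = "FROM some_registry as registry1\nFrom another_registry"

# A capture group replaces \K, which Python's re does not support.
# re.MULTILINE lets $ match at the end of every line.
pattern = re.compile(r"(?i)\bFROM\s+(.*?)(?=\s+as\b|$)", re.MULTILINE)

names = pattern.findall(text)
print(names)  # ['some_registry', 'another_registry']
```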

Perl regular expression (regex) fails when I make it optional [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I am running the following snippet of code on Perl 5.22:
DB<41> x "up 34 days, 22:04 and more" =~ m/.*?(?:(\d+) days).*$/
0 34
The above code works as expected and pulls out the 34 from "34 days".
My question comes in when I make the capture group optional by adding a ? at the end of it like this:
DB<4> x "up 34 days, 22:04 and more" =~ m/.*?(?:(\d+) days)?.*$/
0 undef
Why does it no longer match the 34? I have searched the web, but couldn't find any questions that matched mine (if you do have a link that explains it, that would be fantastic).
Thanks, in advance, for your time.
Regexes work from left to right, always; and quantifiers always try first to match as much as they can, or as little as they can when made non-greedy (like .*?). Only when they reach an unmatchable state will they back up and try a new match (backtracking). The key to regexes is working around what the regex engine will try first.
.*? will first try to match the empty string at the beginning of the string, since that's the least it can match. With the first regex, that does not lead to a successful overall match, so the engine backtracks until .*? matches "up " and the following group can match "34 days". But if you make that group optional, the first thing the engine tries is .*? matching the empty string, followed by (?:(\d+) days)? matching the empty string (it cannot match digits followed by " days" at that position, but it can match nothing), followed by .* matching the rest of the string, followed by the end of the string: a successful match.
Regexp::Debugger can be nice to visualize the behavior, as well as https://regex101.com/ (just beware that PCRE is not exactly the same as Perl regex).
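The order of attempts described above can be checked in any backtracking engine. The following sketch uses Python's re module rather than Perl. The last pattern is one possible workaround that does not come from the answers: it moves the lazy prefix inside the optional group, so the greedy ? tries the group before skipping it.

```python
import re

s = "up 34 days, 22:04 and more"

# Mandatory group: the lazy prefix must expand until "34 days" can match.
mandatory = re.match(r".*?(?:(\d+) days).*$", s).group(1)

# Optional group right after a lazy prefix: both match nothing at position 0,
# .* takes the whole string, and the capture never participates.
optional = re.match(r".*?(?:(\d+) days)?.*$", s).group(1)

# Possible workaround (an assumption, not from the answers): lazy prefix
# inside the optional group.
inside = re.match(r"(?:.*?(\d+) days)?.*$", s).group(1)
no_days = re.match(r"(?:.*?(\d+) days)?.*$", "up 22:04").group(1)

print(mandatory, optional, inside, no_days)  # 34 None 34 None
```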
Both .*? and (?:(\d+) days)? can match the empty string, and .*$ then matches whatever remains, which here is the whole input string.
If you check the following
use strict;
use warnings;
my $s = "up 34 days, 22:04 and more";
# Mandatory group: .*? has to backtrack until "34 days" can match.
if ($s =~ m/.*?(?:(\d+) days)(.*)$/) {
    print("first:\n \$1=\"$1\"\n \$2=\"$2\"\n");
}
# Optional group: it matches the empty string at position 0, so $1 stays undefined.
if ($s =~ m/.*?(?:(\d+) days)?(.*)$/) {
    print("second:\n \$1=\"$1\"\n \$2=\"$2\"\n");
}
you'll get
first:
 $1="34"
 $2=", 22:04 and more"
second:
 $1=""
 $2="up 34 days, 22:04 and more"
as output (and a warning about $1 being undefined that you can ignore here) which illustrates that.

What is the difference between ".*" and ".*?" [duplicate]

This question already has answers here:
Regex plus vs star difference? [duplicate]
(9 answers)
Closed 4 years ago.
I wanted to capture comments in code (everything from "--" to the end of the line) using regular expressions in Tcl.
So I tried {\\-\\-.*$} that should be - then - then any number of any characters and then end of the line. But it doesn't work!
Another post here suggested using .*? instead of .*.
So I tried {\\-\\-.*?$} and that works.
Just wanted to understand the difference between the two. According to every regular expression tutorial and man page I have read, the ? condition should be a subset of *, so I am wondering what's going on there.
"?" makes the previous quantifier lazy, making it match as few characters as possible.
This is documented in the re_syntax man page. The question mark indicates the match should be non-greedy.
Let's look at an example:
% set string "-1234--ab-c-"
-1234--ab-c-
% regexp -inline -- {--.*-} $string
--ab-c-
% regexp -inline -- {--.*?-} $string
--ab-
The 1st match is greedy, matching to the last dash following the double dash.
The 2nd match is not greedy, only matching to the first dash following the double dash.
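The same greedy/lazy contrast shows up in other engines. Here is a minimal sketch in Python, whose re module is a Perl-style backtracking engine and not Tcl's; for these two patterns it produces the same results as the Tcl session above.

```python
import re

string = "-1234--ab-c-"

# Greedy .* runs to the last dash that still lets the pattern match.
greedy = re.search(r"--.*-", string).group(0)

# Lazy .*? stops at the first dash after the double dash.
lazy = re.search(r"--.*?-", string).group(0)

print(greedy)  # --ab-c-
print(lazy)    # --ab-
```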
Note that the Tcl regex engine has a quirk: the first quantifier's greediness sets the greediness of the whole regex. This is documented (IMO obscurely) in the MATCHING section:
... A branch has the same preference as the first quantified atom in it which has a preference.
Let's try to match all the digits and the double dash, and see how the non-greedy quantifiers work:
% regexp -inline -- {\d+--.*-} $string
1234--ab-c-
% regexp -inline -- {\d+--.*?-} $string
1234--ab-c-
Oops, the whole match is greedy, even though we asked for some non-greediness.
To get the non-greedy result, we need either to make the first quantifier non-greedy as well:
% regexp -inline -- {\d+?--.*?-} $string
1234--ab-
or make all the quantifiers greedy and use a negated bracket expression:
% regexp -inline -- {\d+--[^-]*-} $string
1234--ab-
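For comparison, a Perl-style engine does not share this quirk. The sketch below uses Python's re module (not Tcl), where each quantifier keeps its own greediness. It also shows that the negated bracket expression gives the same result without any lazy quantifier, which makes it the more portable of the two fixes.

```python
import re

string = "-1234--ab-c-"

# In Python each quantifier keeps its own greediness, so mixing greedy \d+
# with lazy .*? works, unlike the Tcl result for the same pattern.
mixed = re.search(r"\d+--.*?-", string).group(0)

# The negated bracket expression needs no lazy quantifier at all.
negated = re.search(r"\d+--[^-]*-", string).group(0)

print(mixed)    # 1234--ab-
print(negated)  # 1234--ab-
```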

Why can Perl match two places with '/$/g'? [duplicate]

This question already has answers here:
$ and Perl's global regular expression modifier
(3 answers)
Closed 8 years ago.
I wrote some sample code like this:
$var="123\n123\n\n\n\n\n1\n";
$var=~s/$/___/g;
print $var;
It outputs this:
123
123
1___
___
Why can '/$/g' match in two places? I think one match is at the last "\n" and the other is at the end of the string. But I thought it should only match on the last line.
Be careful of zero width regular expressions. They often will not behave entirely the way that you expect.
In this case, the $ assertion can match both directly before the last newline and directly after it. This is part of the definition of $.
Therefore, your fix is to use the end-of-string assertion \z instead of $:
$var = "abc\n";
$var =~ s/\z/<foo>/g;
print "'$var'";
Outputs:
'abc
<foo>'
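Python's re module treats $ the same way, which makes it a convenient place to compare the variants side by side. This is a sketch, not Perl: Python spells the strict end-of-string assertion \Z (Perl's \z), and count=1 is used here to approximate a substitution without the g modifier.

```python
import re

var = "123\n123\n\n\n\n\n1\n"

# $ matches before a trailing newline and again at the very end.
both = re.sub(r"$", "___", var)

# \Z in Python corresponds to Perl's \z: it matches only at the very end.
end_only = re.sub(r"\Z", "___", var)

# Limiting the substitution to one match approximates dropping Perl's /g.
first_only = re.sub(r"$", "___", var, count=1)

print(repr(both))
print(repr(end_only))
print(repr(first_only))
```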
g is the global modifier; that's why you're seeing a replacement at every place where $ can match. If you don't use g, only the first match is replaced. So without g the output will be:
123
123
1___
Also see: $ and Perl's global regular expression modifier

Using a positive lookahead to remove the middle of a string

I'm currently attempting to remove text in the middle of this string:
RenameMe_12345_12365_130706T234502.txt
using the following regex:
^[a-zA-Z]+(?=_[0-9]+_[0-9]+).+$
in an attempt to return:
RenameMe_130706T234502.txt
but the regex returns the entire string without excluding the middle:
RenameMe_12345_12365_130706T234502.txt
Am I using the positive lookahead incorrectly, or am I approaching the problem incorrectly? Can positive lookaheads not be used this way?
Replace this regex:
_.*_
with
_
Example with sed:
kent$ echo RenameMe_12345_12365_130706T234502.txt|sed 's/_.*_/_/'
RenameMe_130706T234502.txt
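A lookahead is zero-width: it checks what follows without consuming it. That is why the pattern from the question still matches the entire file name. Here is a small Python sketch (re module assumed, variable names illustrative) that shows this alongside the substitution approach:

```python
import re

name = "RenameMe_12345_12365_130706T234502.txt"

# The lookahead only checks what follows; .+$ then consumes the rest,
# so the overall match is still the entire file name.
whole = re.match(r"^[a-zA-Z]+(?=_[0-9]+_[0-9]+).+$", name).group(0)

# Greedy .* spans from the first underscore to the last one;
# replacing that span with a single "_" drops the middle.
renamed = re.sub(r"_.*_", "_", name)

print(whole == name)  # True
print(renamed)        # RenameMe_130706T234502.txt
```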
You could do it with your own tool/programming language.
EDIT for OP's comment:
@CodingUnderDuress _.*_ is a single regex (BRE). It relies on the greedy .* to achieve your goal.
If you don't want to do a substitution and only want a regex that matches the parts you need, you could use:
(^[^_]*|_[^_]*$)
Test with grep (-E means ERE):
kent$ echo "RenameMe_12345_12365_130706T234502.txt"|grep -Eo '(^[^_]*|_[^_]*$)'
RenameMe
_130706T234502.txt
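The match-only approach can be tried in other engines as well. Here is a sketch with Python's re.findall; the outer group is dropped so that findall returns the whole matches, and joining them rebuilds the wanted name.

```python
import re

name = "RenameMe_12345_12365_130706T234502.txt"

# First alternative: everything before the first underscore.
# Second alternative: the last underscore and everything after it.
parts = re.findall(r"^[^_]*|_[^_]*$", name)

print(parts)           # ['RenameMe', '_130706T234502.txt']
print("".join(parts))  # RenameMe_130706T234502.txt
```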
You can of course use look-behind/look-ahead if you really love them, but then you need PCRE. I don't see why look-around is needed for your requirement.
You can replace whatever this matches with an empty string:
_(\w+(?=_))*
Working
[1] Match the character `_`
[2] followed by a run of word characters
[3] The positive look-ahead `(?=_)` makes sure the last `_` is left in place
[4] Match the above 0 or more times
Use this
(?<=[^_])_\w+_(?=[^_]+)
to match the middle part together with both surrounding underscores, and replace it with a single _.
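Both look-around patterns can be checked with a short Python sketch (re module assumed). Because \w also matches an underscore, the first pattern needs count=1 so it is applied only once, and the second pattern consumes the trailing underscore, so the replacement has to put one back.

```python
import re

name = "RenameMe_12345_12365_130706T234502.txt"

# \w also matches "_", so the first pattern matches "_12345_12365";
# count=1 keeps the remaining underscore from being matched on its own.
first = re.sub(r"_(\w+(?=_))*", "", name, count=1)

# The second pattern matches "_12345_12365_", trailing underscore included,
# so the replacement is a single underscore rather than an empty string.
second = re.sub(r"(?<=[^_])_\w+_(?=[^_]+)", "_", name)

print(first)   # RenameMe_130706T234502.txt
print(second)  # RenameMe_130706T234502.txt
```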