What is the difference between ".*" and ".*?" [duplicate] - regex

I wanted to match comments in code (everything from "--" to the end of the line) using regular expressions in Tcl.
So I tried {\\-\\-.*$}, which should match - then - then any number of any characters and then the end of the line. But it doesn't work!
Another post here suggested using .*? instead of .*.
So I tried {\\-\\-.*?$} and that works.
I just wanted to understand the difference between the two. According to every regular expression tutorial/man page I have read, the ? condition should be a subset of *, so I am wondering what's going on there.

"?" makes the previous quantifier lazy, so it matches as few characters as possible.

This is documented in the re_syntax man page. The question mark indicates the match should be non-greedy.
Let's look at an example:
% set string "-1234--ab-c-"
-1234--ab-c-
% regexp -inline -- {--.*-} $string
--ab-c-
% regexp -inline -- {--.*?-} $string
--ab-
The 1st match is greedy, matching to the last dash following the double dash.
The 2nd match is not greedy, only matching to the first dash following the double dash.
Note that the Tcl regex engine has a quirk: the first quantifier's greediness sets the greediness of the whole regex. This is documented (IMO obscurely) in the MATCHING section:
... A branch has the same preference as the first quantified atom in it which has a preference.
Let's try to match all the digits and the double dash, and see how the non-greedy quantifier behaves:
% regexp -inline -- {\d+--.*-} $string
1234--ab-c-
% regexp -inline -- {\d+--.*?-} $string
1234--ab-c-
Oops, the whole match is greedy, even though we asked for some non-greediness.
To get the non-greedy match, we either need to make the first quantifier non-greedy as well:
% regexp -inline -- {\d+?--.*?-} $string
1234--ab-
or make all the quantifiers greedy and use a negated bracket expression:
% regexp -inline -- {\d+--[^-]*-} $string
1234--ab-
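For comparison, here is a minimal sketch of the same four patterns in Python. Python is used purely for illustration and the variable names are made up. Its `re` module does not have Tcl's branch-wide preference rule, so each quantifier keeps its own greediness and the mixed pattern that surprised us in Tcl returns the short match:

```python
import re

s = "-1234--ab-c-"

# Greedy: .* runs to the end, then backtracks to the last dash.
greedy = re.search(r"--.*-", s).group(0)
# Lazy: .*? stops at the first dash after the double dash.
lazy = re.search(r"--.*?-", s).group(0)
# Greedy \d+ followed by lazy .*? (no Tcl-style inheritance in Python).
mixed = re.search(r"\d+--.*?-", s).group(0)
# All quantifiers greedy, but [^-] cannot cross a dash.
negated = re.search(r"\d+--[^-]*-", s).group(0)

print(greedy)   # --ab-c-
print(lazy)     # --ab-
print(mixed)    # 1234--ab-
print(negated)  # 1234--ab-
```

The negated bracket expression is the form that gives the same answer in both engines, since it does not rely on laziness at all.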

Related

Issue in matching regexp in TCL

I have the following pattern:
Notif[0]:
some text multiple line
Notif[1]:
multiple line text
Notif[2]:
text again
Notif[3]:
text again
Finish
I am writing the following regexp:
set notifList [regexp -inline -all -nocase {Notif\[\d+\].*?(?=Notif|Finish)} $var]
It does not give the desired output.
Output needed:
I need a list with each `Notif` block.
The reason is that your .*? acts as a greedy subpattern (i.e. like .*, matching zero or more of any character, including newlines) because the first quantifier in the pattern is a greedy one (see \d+). See this Tcl regex reference:
A branch has the same preference as the first quantified atom in it which has a preference.
You just need to make the first +-quantified subpattern lazy by adding a ? after it:
Notif\[\d+?\].*?(?=Notif|Finish)
          ^
This prevents the .*? pattern from inheriting the greediness of \d+.
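To show what the pattern is meant to produce, here is a small sketch in Python, where quantifiers keep their individual greediness and the original pattern already works unchanged. The sample text is copied from the question; `re.S` and `re.I` are my stand-ins for "dot matches newline" and Tcl's -nocase, and the variable names are illustrative only:

```python
import re

var = """Notif[0]:
some text multiple line
Notif[1]:
multiple line text
Notif[2]:
text again
Notif[3]:
text again
Finish"""

# re.S lets "." cross newlines; re.I mirrors the -nocase switch.
pattern = re.compile(r"Notif\[\d+\].*?(?=Notif|Finish)", re.S | re.I)
notif_list = pattern.findall(var)

for block in notif_list:
    print(repr(block))
```

Each element is one `Notif[n]:` header plus the text up to, but not including, the next `Notif` or `Finish`, because the lookahead does not consume what it checks.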

Perl : Decoding Regex

I would highly appreciate it if somebody could help me understand the following:
=~/(?<![\w.])($val)(?![\w.])/gi)
This is what I picked up, but I don't understand it.
Lookarounds: (?=...) is a positive lookahead, (?!...) a negative lookahead, and (?<=...) and (?<!...) are lookbehinds (positive and negative, respectively).
The regex seems to search for $val (i.e. a string that matches the contents of the variable $val) not surrounded by word characters or dots.
Putting $val into parentheses remembers the corresponding matched part in $1.
See perlre for details.
Note that =~ is not part of the regex, it's the "binding operator".
Similarly, gi) is part of something bigger. g means the matching happens globally, which has different effects based on the context the matching occurs in, and i makes the match case insensitive (which could only influence $val here). The whole expression was in parentheses, probably, but we can't see the opening one.
Read (?<!PAT) as "not immediately preceded by text matching PAT".
Read (?!PAT) as "not immediately followed by text matching PAT".
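A rough Python equivalent makes the two readings concrete. This is only a sketch: `val` and the sample sentence are made up, and unlike Perl's plain interpolation of $val, it escapes the term with `re.escape` so that it is treated as literal text:

```python
import re

val = "dog"  # hypothetical search term

# Not preceded by a word character or dot, not followed by one either.
pattern = re.compile(r"(?<![\w.])(" + re.escape(val) + r")(?![\w.])", re.I)

text = "The lazy dog, a hotdog, dog.txt and DOG"
hits = [m.group(1) for m in pattern.finditer(text)]
print(hits)  # ['dog', 'DOG']
```

"hotdog" is rejected by the lookbehind (the preceding "t" is a word character) and "dog.txt" by the lookahead (a dot follows), while "dog," and the final "DOG" pass both checks.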
I use these sites to help with testing and learning and decoding regex:
https://regex101.com/: This one dissects and explains the expression the best IMO.
http://www.regexr.com/
Define $val, then watch the regex engine work with rxrx, a command-line REPL and wrapper for Regexp::Debugger.
It shows output like this, but in color:
Matched
|
VVV
/(?<![\w.])(dog)(?![\w.])/
|
V
'The quick brown fox jumps over the lazy dog'
^^^
[Visual of regex at 'rxrx' line 0] [step: 189]
It also gives descriptions like this:
(?<!       # Match negative lookbehind
  [\w.]    # Match any of the listed characters
)          # The end of negative lookbehind
(          # The start of a capturing block ($1)
  dog      # Match a literal sequence ("dog")
)          # The end of $1
(?!        # Match negative lookahead
  [\w.]    # Match any of the listed characters
)          # The end of negative lookahead

How does pattern matching work in Perl?

I want to know how pattern matching works in Perl.
My code is:
my $var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos";
if($var =~ m/(.*)\,(.*)/sgi)
{
print "$1\n$2";
}
I learnt that the first occurrence of the comma should be matched, but here the last occurrence is being matched. The output I got is:
VP KDC T. 20, pgcet. 5, Ch. 415
Refs %50 Annos
Can someone please explain to me how this matching works?
From docs:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match
So the first (.*) will take as much as possible.
A simple workaround is to use a non-greedy quantifier: *?. Alternatively, match not every character but everything except a comma: ([^,]*).
Greedy and Ungreedy Matching
Perl regular expressions normally match the longest string possible.
For instance:
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";
Run the preceding code, and here's what you get:
ississ
It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
Clearly, the use of the question mark makes the match ungreedy. But there's another problem, in that regular expressions always try to match as early as possible.
Source: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
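The "as early as possible" point is easy to miss, so here is a short sketch of it in Python, which follows Perl's greedy and lazy rules for these patterns. The third pattern is my own illustration and is not from the quoted source:

```python
import re

text = "mississippi"

greedy = re.search(r"i.*s", text).group(0)     # first i to the last s
lazy = re.search(r"i.*?s", text).group(0)      # first i to the nearest s
leftmost = re.search(r"s.*?p", text).group(0)  # lazy, but still starts at the FIRST s

print(greedy)    # ississ
print(lazy)      # is
print(leftmost)  # ssissip
```

Even though .*? is lazy, the last result is not the shortest candidate ("sip"): the engine commits to the earliest starting position that can succeed and only then minimizes the length from there.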
Use question mark in your regex:
if($var =~ m/(.*?)\,(.*)/sgi)
{
print "$1\n$2";
}
So:
(.*)\, means: "match as many characters as you can, as long as there is a comma after them"
(.*?)\, means: "match any characters until you stumble upon a comma"
With (.*)\, you might expect the match to run up to the first comma.
But .* is greedy, so it consumes all the characters up to the last comma instead of the first.
So the first group matches up to the last comma, and the second group is the rest of the line.
To avoid the greedy match, add a ? after the *.
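The three variants discussed in these answers can be compared side by side. This is a sketch in Python rather than Perl (the group semantics are the same for these patterns); the string is the one from the question and the variable names are illustrative:

```python
import re

var = "VP KDC T. 20, pgcet. 5, Ch. 415, Refs %50 Annos"

# Greedy: the first group swallows everything up to the LAST comma.
greedy = re.search(r"(.*),(.*)", var, re.S).groups()
# Lazy: the first group stops at the FIRST comma.
lazy = re.search(r"(.*?),(.*)", var, re.S).groups()
# Negated class: the first group cannot contain a comma at all.
negated = re.search(r"([^,]*),(.*)", var, re.S).groups()

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions return the same split here; the negated class is often preferred because it states the intent ("anything but a comma") directly.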

How to remove duplicate characters strictly using regexp in Tcl

How to remove duplicate characters in a string strictly using regexp in TCL?
e.g., I have a string like this aabbcddeffghh and I need only the characters "abcdefgh". I tried lsort -unique, and I am able to get the unique characters:
join [lsort -unique [split $mystring {}]]
but I need to use the regexp command only.
You can't remove all non-consecutive duplicate characters from a string with a single call to Tcl's regsub command. It doesn't support access to back-references in lookahead sequences, which means that any scheme for removal will necessarily run into problems with overlapping match regions.
The simplest fix is to wrap in a while loop (with an empty body), using the fact that regsub will return the number of substitutions performed when it's given a variable to store the result in (last argument to it below):
set str "mississippi mud pie"
while {[regsub -all {(.)(.*)\1+} $str {\1\2} str]} {}
puts $str; # Prints "misp ude"
Try this one:
regsub -linestop -lineanchor -all {([a-z])\1+} $subject {\1} result
or
regsub -linestop -nocase -lineanchor -all {([a-z])\1+} $subject {\1} result
Explanation
{
(        # Match the regular expression below and capture its match into backreference number 1
  [a-z]  # Match a single character in the range between "a" and "z"
)
\1       # Match the same text as most recently matched by capturing group number 1
  +      # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
}
regsub -all {(.)(?=.*\1)} $subject {} result
It uses a look-ahead to check if there are any more instances of the character. If there are, it removes the character.
This always retains the last occurrence of each character. It is not possible to do look-behinds in Tcl without extra libraries.
More information about look-arounds: Regex tutorial - Lookahead and Lookbehind Zero-Width Assertions
Edit: Hmmm... Seems to be a bug with backreferences in Tcl 8.5. {(.).*\1} matches, but not {(.)(?=.*\1)}. It complains about Invalid backreference number. I can't see any solution to this without a backreference inside a look-ahead.
It might just be the version I tested it on (ideone.com/pFS0Q). I can't find any other version of a Tcl interpreter online to test.
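Both ideas from this thread, the repeat-until-stable loop and the lookahead with a backreference, can be sketched in Python, whose `re` module does accept a backreference inside a lookahead. The function names are hypothetical, and this illustrates the techniques rather than offering a Tcl solution:

```python
import re

def dedupe_keep_first(s):
    """Loop approach: drop a later repeat of a character until nothing changes."""
    count = 1
    while count:
        # Keep the first copy (\1) and the text in between (\2), drop the repeat.
        s, count = re.subn(r"(.)(.*)\1+", r"\1\2", s)
    return s

def dedupe_keep_last(s):
    """Lookahead approach: delete a character if the same one occurs later."""
    return re.sub(r"(.)(?=.*\1)", "", s)

print(dedupe_keep_first("mississippi mud pie"))  # misp ude
print(dedupe_keep_last("mississippi mud pie"))   # smud pie
print(dedupe_keep_first("aabbcddeffghh"))        # abcdefgh
```

The two functions agree on strings where duplicates are adjacent, but differ in which copy survives otherwise: the loop keeps the first occurrence of each character, the lookahead keeps the last.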

vim regex with meta-characters

I have the following in a text file:
This is some text for cv_1 for example
This is some text for cv_001 for example
This is some text for cv_15 for example
I am trying to use regex cv_.*?\s to match cv_1, cv_001, cv_15 in the text. I know that the regex works. However, it doesn't match anything when I try it in Vim.
Do we need to do something special in Vim?
The non-greedy modifier ? isn't supported in Vim; you should use:
cv_.\{-}\s
...instead of:
cv_.*?\s
Here's a quick reference for matching:
*        (0 or more) greedy matching
\+       (1 or more) greedy matching
\{-}     (0 or more) non-greedy matching
\{-n,}   (at least n) non-greedy matching
Vim's regex syntax is a little different; what you're looking for is
cv_.\{-}\s
(the \{-} being the Vim equivalent of Perl's *?, i.e., non-greedy 0-or-more). See here for a good tutorial on Vim's regular expressions.
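To confirm that the original pattern is fine outside Vim, here is a quick check in Python, which uses the Perl-style *? syntax. The sample text is taken from the question and the variable names are illustrative; inside Vim the same search is written as cv_.\{-}\s:

```python
import re

text = """This is some text for cv_1 for example
This is some text for cv_001 for example
This is some text for cv_15 for example"""

# Lazy .*? stops at the first whitespace character after "cv_".
matches = re.findall(r"cv_.*?\s", text)
print(matches)  # ['cv_1 ', 'cv_001 ', 'cv_15 ']
```

Note that each match includes the trailing whitespace character, because \s is part of the pattern rather than a lookahead.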