Issue in matching regexp in TCL - regex

I have text in the following format:
Notif[0]:
some text multiple line
Notif[1]:
multiple line text
Notif[2]:
text again
Notif[3]:
text again
Finish
I wrote the following regexp:
set notifList [regexp -inline -all -nocase {Notif\[\d+\].*?(?=Notif|Finish)} $var]
It does not give the desired output.
I need a list with one element per `Notif` block.

The reason is that your .*? acts as a greedy subpattern (i.e. like .*, matching zero or more of any character, including a newline) because the first quantifier in the pattern is a greedy one (see \d+). See this Tcl regex reference:
A branch has the same preference as the first quantified atom in it which has a preference.
You need to just turn the first + quantified subpattern into a lazy one by adding a ? after it:
Notif\[\d+?\].*?(?=Notif|Finish)
          ^
This will prevent the .*? pattern from inheriting the greediness of the \d+.
See the IDEONE demo
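For comparison, here is a minimal sketch in Python rather than Tcl (an assumption made only so the example is easy to run). Python's `re` module decides greediness per quantifier, so the pattern from the question already yields one block per `Notif` there; it does not reproduce the Tcl behaviour described above. The sample text in `var` mirrors the question's input, and `re.DOTALL` is needed because `.` does not match a newline by default in Python.

```python
import re

# Sample input mirroring the question; the name `var` is taken from it.
var = (
    "Notif[0]:\n"
    "some text multiple line\n"
    "Notif[1]:\n"
    "multiple line text\n"
    "Notif[2]:\n"
    "text again\n"
    "Notif[3]:\n"
    "text again\n"
    "Finish\n"
)

# The lazy .*? stops at the first position where the lookahead
# sees "Notif" or "Finish" (case-insensitively, like -nocase).
pattern = re.compile(r"Notif\[\d+\].*?(?=Notif|Finish)", re.DOTALL | re.IGNORECASE)
notif_list = pattern.findall(var)

for block in notif_list:
    print(repr(block))
```

Each element of `notif_list` holds one `Notif[n]:` header together with the text that follows it, up to but not including the next header or `Finish`.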

Related

What is the difference between ".*" and ".*?" [duplicate]

I wanted to catch comment on code (everything from "--" to the end of the line) using regular expressions in TCL.
So I tried {\\-\\-.*$} that should be - then - then any number of any characters and then end of the line. But it doesn't work!
Another post here suggested using .*? instead of .*.
So I tried {\\-\\-.*?$} and that works.
Just wanted to understand the difference between the two. According to any regular expression tutorial/man I read the ? condition should be a subset of *, so I am wondering what's going on there.
"?" makes the previous quantifier lazy, so that it matches as few characters as possible.
This is documented in the re_syntax man page. The question mark indicates the match should be non-greedy.
Let's look at an example:
% set string "-1234--ab-c-"
-1234--ab-c-
% regexp -inline -- {--.*-} $string
--ab-c-
% regexp -inline -- {--.*?-} $string
--ab-
The 1st match is greedy, matching to the last dash following the double dash.
The 2nd match is not greedy, only matching to the first dash following the double dash.
Note that the Tcl regex engine has a quirk: the first quantifier's greediness sets the greediness of the whole regex. This is documented (IMO obscurely) in the MATCHING section:
... A branch has the same preference as the first quantified atom in it which has a preference.
Let's also match the digits before the double dash and see how the non-greedy quantifier behaves:
% regexp -inline -- {\d+--.*-} $string
1234--ab-c-
% regexp -inline -- {\d+--.*?-} $string
1234--ab-c-
Oops, the whole match is greedy, even though we asked for some non-greediness.
To get the non-greedy match we want, we can either make the first quantifier non-greedy as well:
% regexp -inline -- {\d+?--.*?-} $string
1234--ab-
or make all the quantifiers greedy and use a negated bracket expression:
% regexp -inline -- {\d+--[^-]*-} $string
1234--ab-
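As a point of contrast, the same four patterns can be run through a backtracking engine that treats each quantifier independently. The sketch below uses Python's `re` module (an assumption for illustration; it is not the Tcl engine discussed above, and the helper `first_match` is a hypothetical name). There the mixed pattern `\d+--.*?-` honours the lazy quantifier, while the negated bracket expression gives the same answer as in the Tcl session, which is why it is the more portable choice.

```python
import re

string = "-1234--ab-c-"


def first_match(pattern: str) -> str:
    """Return the text of the first match of `pattern` in `string`, or ''."""
    m = re.search(pattern, string)
    return m.group(0) if m else ""


greedy = first_match(r"--.*-")          # runs to the last dash
lazy = first_match(r"--.*?-")           # stops at the first dash after "--"
mixed = first_match(r"\d+--.*?-")       # in Python the lazy part stays lazy
negated = first_match(r"\d+--[^-]*-")   # all-greedy, but cannot cross a dash

for name, value in [("greedy", greedy), ("lazy", lazy),
                    ("mixed", mixed), ("negated", negated)]:
    print(f"{name:8s} {value}")
```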

PCRE Regular expression : only one matching

I want to capture the substrings of a subject string that match a pattern.
Patterns examples: ##name##, ##address##, ##bankAccount##, ...
Subject example: This is the template with patterns : ##name##Your bank account is : ##bankAccount##Your address is : ##address##
With the following regex: .*(#{2}[a-zA-Z]*#{2}).*, only the last pattern is matched.
How to capture all the patterns, not just the last or first ?
Now that I've formatted your regex properly, the problem shows. A * in your regex was hidden since markdown took it to make the text italics.
Your opening .* matches greedily as much as it can, only backing up enough to let (#{2}[a-zA-Z]*#{2}) match. This matches the last pattern found in the line, everything before it having been matched by the .*.
You need to remove .* as I mentioned in my comment, and use preg_match_all:
$re = '~#{2}[a-zA-Z]*#{2}~';
preg_match_all($re, "##name##, ##address##, ##bankAccount##", $m);
print_r($m);
See the PHP demo
The .*#{2}[a-zA-Z]*#{2}.* matched 0 or more characters other than a newline at first, grabbing the whole line, and then backtracked until the last occurrence of #{2}[a-zA-Z]*#{2} pattern, and the last .* only grabbed the rest of the line. Removing the .* and using preg_match_all, all substrings matching the #{2}[a-zA-Z]*#{2} pattern can be extracted.
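The same effect can be reproduced outside PHP. The following is a small sketch in Python (chosen here only for illustration; `re.findall` plays the role of `preg_match_all`), using the subject string from the question. It shows the greedy leading `.*` leaving only the last placeholder for the capture group, and the bare pattern returning all of them.

```python
import re

subject = (
    "This is the template with patterns : ##name##"
    "Your bank account is : ##bankAccount##"
    "Your address is : ##address##"
)

# The leading .* swallows the whole line and backtracks only as far as
# needed, so the group ends up on the LAST placeholder.
greedy = re.match(r".*(#{2}[a-zA-Z]*#{2}).*", subject)
last_only = greedy.group(1)

# Without the surrounding .*, a find-all call returns every placeholder.
all_placeholders = re.findall(r"#{2}[a-zA-Z]*#{2}", subject)

print(last_only)
print(all_placeholders)
```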

Multiline regex replacement in sed/vi

I need to replace this statement in a named.conf with regex
masters {
10.11.2.1;
10.11.2.2;
};
All my approaches with sed/vi do not work
%s/masters.*\}\;//g
does not match. Also tried with /s \s etc to match the newline.
In vim, you can force a pattern to match across newlines with \_, for example:
%s/masters {\_[^}]*};//g
It's important to replace .* with something more conservative like [^}]* if you prefix with \_, because * is greedy, so \_.* will try to match everything to the end of the document.
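The idea of using a negated class instead of `.*` carries over to other tools. As a rough sketch in Python (an assumption for illustration, not a sed or vim command), `[^}]*` matches newlines as well, so the block can be removed in one substitution. The surrounding `zone` stanza is hypothetical sample data; only the `masters { ... };` part comes from the question.

```python
import re

# Hypothetical named.conf fragment; only the masters block is from the question.
conf = (
    'zone "example.com" {\n'
    "    type slave;\n"
    "    masters {\n"
    "        10.11.2.1;\n"
    "        10.11.2.2;\n"
    "    };\n"
    '    file "example.com.db";\n'
    "};\n"
)

# A plain .* does not cross line boundaries by default, so this finds nothing.
single_line_attempt = re.search(r"masters.*\};", conf)

# [^}]* matches any character except "}", including newlines, and stops
# at the first closing brace instead of running to the end of the file.
cleaned = re.sub(r"[ \t]*masters \{[^}]*\};\n?", "", conf)

print(cleaned)
```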

Regex non greedy match with tab character

I do not understand why if given the following line (vi with set list on)
10.0.6.5^IVirtual^IVmware^IHTTP, MS SQL, Windows SVC^IcpHelpdesk $
Why the following regex:
^I.*?^I
does not match the string 'Virtual' in my line above? I am using the regex below in my VI search and replace
:%s/^I.*?^I/replace/g
this returns no match however on the same string if I use
^I.*^I
I would get
10.0.6.5replaceIcpHelpdesk $
What I am attempting to say with ^I.*?^I is: from the first tab character (^I), match anything (the dot matches everything except line breaks) zero or more times ( *? ) until you come to the next token, which is the tab character (^I).
I don't see what I am missing and any help would be appreciated. Thank you
Are you talking about vim regex here? In that case the non-greedy quantifier is \{-}:
\t.\{-}\t
Otherwise you can do it by excluding tab characters with a negated character class:
\t[^\t]*\t
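To see the difference between the greedy and the negated-class form on the line from the question, here is a short sketch in Python (used only as a convenient way to run the patterns; vim's own syntax is as shown above). The tab characters displayed as `^I` in vi are written as `\t`, and `count=1` limits the replacement to the first match.

```python
import re

# The line from the question, with ^I written as a real tab.
line = "10.0.6.5\tVirtual\tVmware\tHTTP, MS SQL, Windows SVC\tcpHelpdesk "

# Greedy: runs from the first tab to the LAST tab on the line.
greedy = re.sub(r"\t.*\t", "replace", line)

# Negated class: cannot cross a tab, so it stops at the next one.
negated = re.sub(r"\t[^\t]*\t", "replace", line, count=1)

# First field matched by the negated-class pattern.
first_field = re.search(r"\t([^\t]*)\t", line).group(1)

print(repr(greedy))
print(repr(negated))
print(first_field)
```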

How to remove duplicate characters strictly using regexp in Tcl

How to remove duplicate characters in a string strictly using regexp in TCL?
For example, I have a string like aabbcddeffghh and I need only the unique characters, "abcdefgh". With lsort -unique I am able to get the unique characters:
join [lsort -unique [split $mystring {}]]
but I need to do it using only the regexp command.
You can't remove all non-consecutive double characters from a string with just Tcl's regsub command. It doesn't support access to back-references in lookahead sequences, which means that any scheme for removal will necessarily run into problems with overlapping match regions.
The simplest fix is to wrap in a while loop (with an empty body), using the fact that regsub will return the number of substitutions performed when it's given a variable to store the result in (last argument to it below):
set str "mississippi mud pie"
while {[regsub -all {(.)(.*)\1+} $str {\1\2} str]} {}
puts $str; # Prints "misp ude"
Try this one:
regsub -linestop -lineanchor -all {([a-z])\1+} $subject {\1} result
or
regsub -linestop -nocase -lineanchor -all {([a-z])\1+} $subject {\1} result
Explanation
{
( # Match the regular expression below and capture its match into backreference number 1
[a-z] # Match a single character in the range between “a” and “z”
)
\1 # Match the same text as most recently matched by capturing group number 1
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
}
regsub -all {(.)(?=.*\1)} $subject {} result
It uses a look-ahead to check if there are any more instances of the character. If there are, it removes the character.
You will always retain the last character. It is not possible to do look-behinds in TCL without extra libraries.
More information about look-arounds: Regex tutorial - Lookahead and Lookbehind Zero-Width Assertions
Edit: Hmmm... Seems to be a bug with backreferences in Tcl 8.5. {(.).*\1} matches, but not {(.)(?=.*\1)}. It complains about Invalid backreference number. I can't see any solution to this without a backreference inside a look-ahead.
It might just be the version I tested it on (ideone.com/pFS0Q). I can't find any other version of a Tcl interpreter online to test.
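To illustrate what the two regex ideas in this thread are meant to do, here is a sketch in Python, whose `re` module does accept a backreference inside a lookahead (an assumption for illustration only; it says nothing about what a given Tcl version supports). The function names are hypothetical. The first collapses runs of a repeated character, the second drops every character that occurs again later, so the last occurrence of each is kept.

```python
import re


def squeeze_runs(text: str) -> str:
    """Collapse each run of a repeated character into one character."""
    return re.sub(r"(.)\1+", r"\1", text)


def keep_last_occurrence(text: str) -> str:
    """Remove a character whenever the same character appears later on."""
    return re.sub(r"(.)(?=.*\1)", "", text)


print(squeeze_runs("aabbcddeffghh"))
print(keep_last_occurrence("aabbcddeffghh"))
print(squeeze_runs("mississippi"))
print(keep_last_occurrence("mississippi"))
```

For the sample string from the question both functions give the same result, because its duplicates are all adjacent; they differ once the repeats are spread out, as in "mississippi".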