What does the -line flag do in a Tcl regular expression?

Below is the code I wrote. I don't understand what the -line flag does.
set value "hi this is venkat345
hi this is venkat435
hi this is venkat567"
regexp -all -line -- {(venkat.+)$} $value a b
puts "Full Match: $a"
puts "Sub Match1: $b"
The above code gives the following output:
Full Match: venkat567
Sub Match1: venkat567
Can anyone explain when and where I should use the -line flag in a Tcl regular expression?

I believe the man page defines it well:
-line
Enables newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning. With this flag, [^ bracket expressions and . never match newline, ^ matches an empty string after any newline in addition to its normal function, and $ matches an empty string before any newline in addition to its normal function. This flag is equivalent to specifying both -linestop and -lineanchor, or the (?n) embedded option (see the re_syntax manual page).
To put it another way, . and [^ ... ] match newlines by default. For example:
regexp -- {^....$} "ab\nc"
returns 1 (the regexp matches the string, with \n counting as one character), but the -line switch prevents . from matching \n.
Similarly:
regexp -- {^[^abc]+$} "de\nf"
also returns 1, because the negated class [^abc] matches any character other than a, b, or c, and that includes \n.
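The two examples above can be checked against -linestop, the half of -line that stops . and [^ ... ] from matching a newline. This is a minimal sketch; the variable names are illustrative and not part of the original post.

```tcl
# "ab\nc" is four characters: a, b, newline, c.
set plain   [regexp -- {^....$} "ab\nc"]                ;# . may match the newline
set stopped [regexp -linestop -- {^....$} "ab\nc"]      ;# . may not match the newline
set negated [regexp -linestop -- {^[^abc]+$} "de\nf"]   ;# neither may [^abc]
puts "$plain $stopped $negated"
# prints: 1 0 0
```

With the full -line flag the second pattern would still return 1, because ^ and $ then also anchor at line boundaries and the line de matches on its own.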
The second function of the -line switch makes ^ match at every beginning of line instead of matching only at the start of the whole string, and makes $ match at every end of line instead of matching only at the end of the whole string.
% set text {abc
abc}
abc
abc
% regexp -- {^abc$} $text
0
% regexp -line -- {^abc$} $text
1
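The anchoring half of -line can also be requested on its own with -lineanchor, or from inside the pattern with the (?n) embedded option that the man page mentions. A small sketch that collects every line-anchored match with -all -inline (byFlag and byEmbedded are illustrative names):

```tcl
set text "abc\nabc"
# -lineanchor only changes ^ and $; (?n) is the embedded equivalent of -line
set byFlag     [regexp -all -inline -lineanchor -- {^abc$} $text]
set byEmbedded [regexp -all -inline -- {(?n)^abc$} $text]
puts $byFlag
puts $byEmbedded
# each prints: abc abc
```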
As for when and where to use it, that depends on what you are trying to do. Judging by your sample code, you want every username beginning with venkat that appears at the end of a line. Since you want several matches, use the -all and -inline switches to get the matched strings back as a list. I would also recommend changing the regexp a bit:
set value "hi this is venkat345
hi this is venkat435
hi this is venkat567"
# I removed the capture group and changed . to \S to match non-space characters
set results [regexp -all -inline -line -- {venkat\S+$} $value]
puts $results
# venkat345 venkat435 venkat567

Among other things, -line makes sure that . never matches a newline.
According to the Tcl regexp documentation:
-line
Enables newline-sensitive matching. By default, newline is a
completely ordinary character with no special meaning. With this flag,
‘[^’ bracket expressions and ‘.’ never match newline, ‘^’ matches an
empty string after any newline in addition to its normal function, and
‘$’ matches an empty string before any newline in addition to its
normal function. This flag is equivalent to specifying both -linestop
and -lineanchor, or the (?n) embedded option (see METASYNTAX, below).
Here is the output without the -line option:
Full Match: venkat345
hi this is venkat435
hi this is venkat567
Sub Match1: venkat345
hi this is venkat435
hi this is venkat567
Without -line, .+ is free to match newlines, so it runs from the first venkat all the way to the end of the string.
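One more detail explains why the -line run prints only venkat567: per the regexp documentation, when -all is combined with match variables, the variables hold the last match only, while the return value is the number of matches. A quick check (count, full and sub are illustrative names):

```tcl
set value "hi this is venkat345\nhi this is venkat435\nhi this is venkat567"
# With -all, the return value is the match count; full and sub keep the last match
set count [regexp -all -line -- {(venkat.+)$} $value full sub]
puts "$count matches, last one: $full"
# prints: 3 matches, last one: venkat567
```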

Related

tcl how to split a string by using regexp

I have some strings in this format:
class(amber#good)
class(Back1#notgood)
class(back#good)
and I want to use regexp to extract a value from each of these strings.
Expected answer:
amber
Back1
back
Here is my command:
set string "class(amber#good)"
regexp -all {^\\([a-zA-z_0-9].\#$} $string $match
puts $match
But the answer is not what I expected
You can use
regexp {\(([^()#]+)} $string - match
See the Tcl demo online.
The \(([^()#]+) regex matches
\( - a ( char
([^()#]+) - Capturing group 1 (match): any one or more chars other than parentheses and #.
The hyphen is a throwaway variable name: the whole-match value is not needed, since we are only interested in the Group 1 value.
Sometimes using regular expressions is error prone and/or overkill.
Here's an alternate answer using split:
lindex [split $string "()#"] 1
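To cover all three sample strings, both approaches can be run in a loop. This is only a sketch; items, viaRegexp and viaSplit are illustrative names.

```tcl
set items {class(amber#good) class(Back1#notgood) class(back#good)}
set viaRegexp {}
set viaSplit {}
foreach item $items {
    # regexp route: "->" is only a throwaway variable for the whole match
    if {[regexp {\(([^()#]+)} $item -> name]} {
        lappend viaRegexp $name
    }
    # split route: field 1 sits between "(" and "#"
    lappend viaSplit [lindex [split $item "()#"] 1]
}
puts $viaRegexp
puts $viaSplit
# each prints: amber Back1 back
```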

tcl: regexp match substring at the end of string

I am trying to match a substring that occurs many times in a string:
str1 = st1.st2.{k}.st3.{k}.st4.{k}.
str2 = st1.st2.{k}.st3.{k}.st4.
I use regexp to match "{k}" at the end of str1:
regexp .*\.\{k\}\.$ $str1
but I got 0!
In fact, I used regsub to test the regexp:
regsub {.*\.\{k\}\.$} $str {}
result ==> empty
If the pattern matches, the matched string is removed.
What is missing in the regexp?
Your regexp actually returns 1, not 0. To capture the last occurrence of .{k}., use a sub-match after a greedy .*:
set str1 st1.st2.{k}.st3.{k}.st4.{k}.
# Brace the pattern so that \. and \{ reach the regexp engine as escaped characters
puts [regexp {.*(\.\{k\}\.)} $str1 whole last]
puts $last
Output :
1
.{k}.
The $ anchor is not needed here: the greedy .* consumes as much as possible, so the group captures the last occurrence.
With regsub, capture the leading part in a group and use a back-reference to keep it in the replacement:
puts [regsub {(.*)(\.\{k\}\.)} $str1 {\1}]
Output :
st1.st2.{k}.st3.{k}.st4
What is wrong with regsub {.*\.\{k\}\.$} $str {} ?
The pattern .*\.\{k\}\.$ matches the whole string, and you replace it with an empty string, which is why the result is empty.
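If the goal is to strip only a trailing .{k}. rather than the whole string, one option is to drop the leading .* and keep the $ anchor. A sketch using the two sample strings from the question (out1 and out2 are illustrative names):

```tcl
set str1 {st1.st2.{k}.st3.{k}.st4.{k}.}
set str2 {st1.st2.{k}.st3.{k}.st4.}
# Only a .{k}. sitting at the very end of the string can match
set out1 [regsub {\.\{k\}\.$} $str1 {}]
set out2 [regsub {\.\{k\}\.$} $str2 {}]
puts $out1
# prints: st1.st2.{k}.st3.{k}.st4
puts $out2
# prints: st1.st2.{k}.st3.{k}.st4.
```

The second string does not end in .{k}., so regsub finds nothing to replace and returns it unchanged.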
Reference : Noncapturing Subpatterns

Replace first matching character in string in PowerShell

In the following string,
apache:x:48:48:Apache:/var/www:/sbin/nologin
how could I replace the first colon (and this one only) with a comma so I would get the following string?
apache,x:48:48:Apache:/var/www:/sbin/nologin
Also, the code has to support a file with multiple lines and replace only the first colon in each line.
Use a regular expression:
PS C:\> $s = 'apache:x:48:48:Apache:/var/www:/sbin/nologin'
PS C:\> $s -replace '^(.*?):(.*)','$1,$2'
apache,x:48:48:Apache:/var/www:/sbin/nologin
Regexp breakdown:
^(.*?):: shortest match between the beginning of the string and a colon (i.e. the text before the first colon).
(.*): the remainder of the string (i.e. everything after the first colon).
The parentheses group the subexpressions, so they can be referenced in the replacement string as $1 and $2.
Further explanation:
^ matches the beginning of a string.
.* matches any number of characters (. ⇒ any character, * ⇒ zero or more times).
.*? does the same, but gives the shortest match (?) instead of the longest match. This is called a "non-greedy match".

Escaping braces with regexp in the middle of a string

I want to write a regular expression in Tcl that detects curly braces ({ and }) in the middle of a string and escapes them with a backslash.
Example input:
designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc
Expected output:
designs/abc/def {/designs/abc/def/abc\{123\}defg} {abc/sed/123erf} -conect abc
Since you mentioned that only braces surrounded by characters on both sides should be replaced, then I think that you need word boundaries:
% set input "designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc"
designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc
% regsub -all {\y[{}]\y} $input {\\\0} result
2
% puts $result
designs/abc/def {/designs/abc/def/abc\{123\}defg} {abc/sed/123erf} -conect abc
In Tcl, \y matches at a word boundary, that is, between a \w and a \W character, or between a word character and the beginning or end of the string.
The replacement \\\0 produces a backslash followed by the matched text.
If braces at the very beginning or end of the string need escaping too, you'll need something a bit different:
% set input "{/designs/abc/def/abc{123}defg}"
{/designs/abc/def/abc{123}defg}
% regsub -all {^[{}]|\y[{}]\y|[{}]$} $input {\\\0} result
4
% puts $result
\{/designs/abc/def/abc\{123\}defg\}
Lookaround would normally make this elegant, but Tcl has no lookbehind. You can fake it by capturing the neighbouring characters and putting them back in the replacement: replace (\S)([{}])(\S) with \1\\\2\3.
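Spelled out as a command, that capture-and-restore idea could look like the sketch below. It reuses the sample input from the question; the variable names are illustrative.

```tcl
set input "designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc"
# \1 and \3 put the neighbouring non-space characters back; \\ inserts the backslash
set count [regsub -all {(\S)([{}])(\S)} $input {\1\\\2\3} result]
puts $result
# prints: designs/abc/def {/designs/abc/def/abc\{123\}defg} {abc/sed/123erf} -conect abc
```

Because each match consumes its neighbours, a brace that directly follows another escaped brace (as in a{}b) can be missed, so this is best treated as a workaround for inputs like the one shown.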

how to remove duplicate characters strictly using regexp in tcl

How to remove duplicate characters in a string strictly using regexp in TCL?
For example, I have the string aabbcddeffghh and I need only the unique characters, abcdefgh. I tried lsort -unique and was able to get the unique characters:
join [lsort -unique [split $mystring {}]]
but I need to do it using the regexp command only.
You can't remove all non-consecutive duplicate characters from a string with a single call to Tcl's regsub command. It doesn't support back-references inside lookahead constraints, which means that any scheme for removal will necessarily run into problems with overlapping match regions.
The simplest fix is to wrap in a while loop (with an empty body), using the fact that regsub will return the number of substitutions performed when it's given a variable to store the result in (last argument to it below):
set str "mississippi mud pie"
while {[regsub -all {(.)(.*)\1+} $str {\1\2} str]} {}
puts $str; # Prints "misp ude"
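The same loop can be wrapped in a small helper so it is easy to reuse. The proc name dedupe is hypothetical, chosen only for this sketch.

```tcl
proc dedupe {str} {
    # Each pass drops later repeats of a character; stop once regsub reports 0 substitutions
    while {[regsub -all {(.)(.*)\1+} $str {\1\2} str]} {}
    return $str
}
puts [dedupe "aabbcddeffghh"]
# prints: abcdefgh
puts [dedupe "mississippi mud pie"]
# prints: misp ude
```

The first occurrence of each character is the one that survives, so the original order is preserved.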
Try this one:
regsub -linestop -lineanchor -all {([a-z])\1+} $subject {\1} result
or
regsub -linestop -nocase -lineanchor -all {([a-z])\1+} $subject {\1} result
Explanation
{
( # Match the regular expression below and capture its match into backreference number 1
[a-z] # Match a single character in the range between “a” and “z”
)
\1 # Match the same text as most recently matched by capturing group number 1
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
}
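Note that this pattern only collapses runs of adjacent repeats. A short sketch of what that means in practice follows (adjacentOnly and stillRepeats are illustrative names). The -linestop and -lineanchor switches are left out here, since they only affect ., [^ ... ], ^ and $, none of which the pattern uses.

```tcl
# Adjacent repeats collapse to a single character
set adjacentOnly [regsub -all {([a-z])\1+} "aabbcddeffghh" {\1}]
# Repeats separated by other characters are left alone
set stillRepeats [regsub -all {([a-z])\1+} "mississippi" {\1}]
puts $adjacentOnly
# prints: abcdefgh
puts $stillRepeats
# prints: misisipi
```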
regsub -all {(.)(?=.*\1)} $subject {} result
It uses a look-ahead to check whether the character occurs again later in the string. If it does, the character is removed.
This always retains the last occurrence of each character. It is not possible to do look-behinds in Tcl without extra libraries.
More information about look-arounds: Regex tutorial - Lookahead and Lookbehind Zero-Width Assertions
Edit: Hmmm... This fails in Tcl 8.5. {(.).*\1} matches, but {(.)(?=.*\1)} does not: Tcl complains about an invalid backreference number. That agrees with the re_syntax manual, which says lookahead constraints may not contain back references, so I can't see any solution that avoids a backreference inside a look-ahead.
I tested it on ideone.com/pFS0Q. I can't find any other online Tcl interpreter to test with.