Regexp not matching string with [] and / in Tcl

I am unable to match a regex against a pin name that contains / and []. How can I match such a string with Tcl's regexp?
ISSUE:
% set inst "channel/rptrw12\[5\]"
channel/rptrw12[5]
% set pin "channel/rptrw12\[5\]/rpinv\[11\]/vcc"
channel/rptrw12[5]/rpinv[11]/vcc
% regexp -nocase "^$inst" $pin
0
PASSING CASE:
% regexp -nocase vcc $pin
1
% set pat "ctrl/crdtfifo"
ctrl/crdtfifo
% set pin2 "ctrl/crdtfifo/iwdatabuf"
ctrl/crdtfifo/iwdatabuf
% regexp -nocase $pat $pin2
1

Your problem is that you are fighting with RE engine metacharacters, specifically […], which defines a character set. If you want to continue using your current approach, you'll need to add more backslashes.
But you don't have to do that!
If you are asking the question “does this string exist in that string?” you can also consider using one of these:
Use string first and check if the result (where the substring is) is not negative:
if {[string first $inst $pin] >= 0} {
puts "Found it"
}
Use regexp ***=, which means “interpret the rest of this as a literal string, no metacharacters”:
if {[regexp ***=$inst $pin]} {
puts "Found it"
}
If you only want to match for equality at the start of the string (you're asking “does this string start with that string?”) you probably should instead do one of these:
Use string first and check if the resulting index is zero:
if {[string first $inst $pin] == 0} {
puts "Found '$inst' at the start of '$pin'"
}
Use string equal with the right option (very much like strncmp() in C, if you know that):
if {[string equal -length [string length $inst] $inst $pin]} {
puts "'$pin' starts with '$inst'"
}
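As a minimal sketch, the prefix test can be wrapped in a small helper and compared with the regexp variants. The name startsWith is made up for illustration (it is not a Tcl built-in); the values are the ones from the question.

```tcl
# Hypothetical helper (not a Tcl built-in): literal "starts with" test.
proc startsWith {prefix str} {
    # Compare only the first [string length $prefix] characters, literally.
    return [string equal -length [string length $prefix] $prefix $str]
}

set inst "channel/rptrw12\[5\]"
set pin  "channel/rptrw12\[5\]/rpinv\[11\]/vcc"

puts [startsWith $inst $pin]     ;# 1: literal comparison, [] is not special
puts [regexp "^$inst" $pin]      ;# 0: [5] is read as a character set
puts [regexp "***=$inst" $pin]   ;# 1: ***= makes the rest of the RE literal
```

Note that ***= performs a substring search, so it answers "contains" rather than "starts with".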

If you remember your regular expressions, the [] syntax has special meaning in regexp. It defines a character group. For example:
[abc]
means match a or b or c.
Therefore the pattern:
channel/rptrw12[5]
means match the string:
channel/rptrw125
If you want to match the literal character [ in regexp you need to escape it (same with all other characters that have meaning in regexp like . or ? or ( etc.). So your pattern should be:
channel/rptrw12\[5\]
But remember, the characters \ and [ have special meaning in Tcl strings. So your code must do:
set inst "channel/rptrw12\\\[5\\\]"
The first \ escapes the \ character so that tcl will insert a single \ into the string. The third \ escapes the [ character so that tcl will not try to execute a command or function named 5.
Alternatively you can use {} instead of "":
set inst {channel/rptrw12\[5\]}
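A quick way to convince yourself that the two spellings are equivalent is to compare them directly. This is only a sketch; the variable names quoted and braced are chosen here for illustration.

```tcl
# Both spellings produce the same pattern text: channel/rptrw12\[5\]
set quoted "channel/rptrw12\\\[5\\\]"
set braced {channel/rptrw12\[5\]}
set pin    "channel/rptrw12\[5\]/rpinv\[11\]/vcc"

puts [string equal $quoted $braced]      ;# 1: identical strings
puts [regexp -nocase "^$braced" $pin]    ;# 1: the escaped brackets match literally
```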

Related

Glob pattern expression for a hexadecimal number in TCL?

I am trying understand the difference between glob and regex patterns. I need to do some pattern matching in TCL.
The purpose is to find out if a hexadecimal value has been entered.
The value may or may not start with 0x
The value shall contain between 1 and 12 hex characters i.e 0-9, a-f, A-F and these shall follow the 0x if it exists
The thing is that glob does not allow use of {a,b} to tell about how many characters to look for. Also, at start I tried to use (0x[Xx])? but I think this is not working.
It is not essential to use glob. I can see that there are subtle differences between glob and regex. I just want to know if this can be done only through regex and not glob.
Tcl's glob patterns are much simpler than regular expressions. All they support is:
* to mean any number of any character.
? to mean any single character.
[…] to mean any single character from the set (the chars inside the brackets, which may include ranges).
\x to mean a literal x (which can be any character). That's how you put a glob metacharacter in a glob pattern.
They're also always anchored at both ends. (Regular expressions are much more powerful. They're also slower. You pay for power.)
To match hex numbers like 0xF00d, you'd use a glob pattern like this:
0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]
(or, as an actual Tcl command; we put the pattern in {braces} to avoid needing lots of backslashes for all the brackets…)
string match {0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]} $value
Note that we have to match an exact number of characters. (You can shorten the pattern by using case-insensitive matching, to 0x[0-9a-f][0-9a-f][0-9a-f][0-9a-f].)
Matching hex numbers is better done with regexp or scan (which also parses the hex number). Everyone likes to forget scan for parsing, yet it's quite good at it…
regexp {^0x([[:xdigit:]]+)$} $value -> theHexDigits
scan $value "0x%x" theParsedValue
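One thing to keep in mind with scan is that it stops at the first character it cannot convert, so trailing garbage is silently ignored. A possible sketch of a stricter check follows; parseHex is a hypothetical helper name, and the extra %c conversion is just one way to detect leftover input.

```tcl
# Hypothetical helper: returns the parsed value, or "" if the input
# is not exactly "0x" followed by hex digits.
proc parseHex {value} {
    # %x consumes the hex digits; %c converts only if a character is
    # left over afterwards, so a count of exactly 1 means a clean parse.
    if {[scan $value "0x%x%c" parsed extra] == 1} {
        return $parsed
    }
    return ""
}

puts [parseHex 0xF00d]      ;# 61453
puts [parseHex 0xF00dzz]    ;# empty: trailing characters
puts [parseHex F00d]        ;# empty: prefix missing
```

Unlike the regexp, this sketch requires the 0x prefix and does not limit the number of digits.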
"The thing is that glob does not allow use of {a,b} to tell about how many characters to look for. Also, at start I tried to use (0x[Xx])? but I think this is not working."
A commonly used regular expression, not specific to Tcl at all, is ^(0[xX])?[A-Fa-f0-9]{1,12}$.
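Wrapped in a proc, that expression might be used as follows. The name isHex is an assumption made for this sketch; the pattern is the one given above, kept in braces so that Tcl leaves the brackets alone.

```tcl
# Hypothetical helper: optional 0x/0X prefix, then 1 to 12 hex digits.
proc isHex {value} {
    return [regexp {^(0[xX])?[A-Fa-f0-9]{1,12}$} $value]
}

foreach v {0xF00d F00d 0X1a 0x 0xg00d 1234567890abc} {
    puts "$v -> [isHex $v]"
}
# 0xF00d, F00d and 0X1a are accepted; 0x (no digits), 0xg00d (bad digit)
# and 1234567890abc (13 digits) are rejected.
```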
Update
As Donal writes, there is a power-cost tradeoff when it comes to regexp. I was curious and, for the given requirements (optional 0x prefix, range check [1,12]), found that a carefully crafted script using string operations incl. string match (see isHex1 below) outperforms regexp in this setting (see isHex2), whatever the input case:
proc isHex1 {str min max} {
set idx [string last "0x" $str]
if {$idx > 0} {
return 0
} elseif {$idx == 0} {
set str [string range $str 2 end]
}
set l [string length $str]
expr {$l >= $min && $l <= $max && [string match -nocase [string repeat {[0-9a-f]} $l] $str]}
}
proc isHex2 {str min max} {
set regex [format {^(0x)?[[:xdigit:]]{%d,%d}$} $min $max]
regexp $regex $str
}
isHex1 extends the idea of computing the string match pattern based on the input length (w/ or w/o prefix) and string repeat. My own timings suggest that isHex1 runs at least 40% faster than isHex2 (all using time, 10000 iterations), in a worst case (within range, final character decides). Other cases (e.g., out-of-range) are substantially faster.
The glob syntax is described in the string match documentation. Compared to regular expressions, glob is a blunt instrument.
With regular expressions, you get the standard character classes, including [:xdigit:] to match a hexadecimal digit.
To contrast with mrcalvin's answer, a Tcl-specific regex would be: (?i)^(0x)?[[:xdigit:]]{1,12}$
The leading (?i) means the expression will be matched case-insensitively, and (0x)? keeps the prefix optional, as the question requires.
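If all you care about is determining whether the input is a valid number, you can use string is integer. Note that it requires the 0x prefix for hex input, also accepts decimal and other integer notations, and in Tcl 8.x is limited to a 32-bit range (string is wideinteger covers longer values):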
set s 0xdeadbeef
string is integer $s ;# => 1
set s deadbeef
string is integer $s ;# => 0
set s 0xdeadbeetle
string is integer $s ;# => 0

Check if string end with substring, Tcl

I have a search string .state_s[0] and two lists of strings:
{cache.state_s[0]} {cache.state_s[1]}
and
{cache.state_s[0]a} {cache.state_s[1]}
I need Tcl command(s) that check whether the search string matches any of the items in a list. Importantly, the solution should return a positive result only for the first list. I tried:
set pattern {.state_s[0]}
set escaped_pattern [string map {* \\* ? \\? [ \\[ ] \\] \\ \\\\} $pattern]
set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
regexp $escaped_pattern $m1
regexp $escaped_pattern $m2
However, the above commands are returning "1" with both regexp calls.
Basically, I need a way to check if a substring (having special chars like [) is at the end of a string.
You have the elements as a list in a variable m1.
set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
But when you pass it to regexp, Tcl treats m1 as one whole string, not as a list.
Since both lists contain the string .state_s[0] somewhere, regexp returns 1 for both.
If you want to apply the regular expression to each element, I would recommend using lsearch with the -regexp flag.
% set m1 {{cache.state_s[0]} {cache.state_s[1]}}
{cache.state_s[0]} {cache.state_s[1]}
% set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
{cache.state_s[0]a} {cache.state_s[1]}
%
% lsearch -regexp -inline $m1 {\.state_s\[0]$};
cache.state_s[0]
% lsearch -regexp -inline $m2 {\.state_s\[0]$}
%
The pattern I have used here is {\.state_s\[0]$}. The trailing $ anchors the match at the end of the string, which ensures that the element has no further characters after the pattern. We don't have to escape the closing square bracket ] here, because outside a bracket expression it is an ordinary character.
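If the search string is always meant literally, the regular expression can be avoided altogether. The following is a sketch of that alternative, not the approach shown above; endsWith and anyEndsWith are hypothetical helper names.

```tcl
# Hypothetical helper: does $str end with the literal text $suffix?
proc endsWith {suffix str} {
    set n [string length $suffix]
    if {$n == 0} {
        return 1
    }
    # Take the last $n characters of $str and compare them literally.
    return [string equal [string range $str end-[expr {$n - 1}] end] $suffix]
}

# Hypothetical helper: 1 if any list element ends with $suffix.
proc anyEndsWith {suffix items} {
    foreach item $items {
        if {[endsWith $suffix $item]} {
            return 1
        }
    }
    return 0
}

set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}

puts [anyEndsWith {.state_s[0]} $m1]   ;# 1
puts [anyEndsWith {.state_s[0]} $m2]   ;# 0
```

Because nothing is interpreted as a pattern, no escaping of ., [ or ] is needed.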

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
If I specify "number n1" in the regexp, it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Currently it always prints the last line (numbers 56,4,5,5:) as output. How can I resolve this issue?
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
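The difference between the two patterns can be seen side by side in a short sketch. The data is a condensed copy of the question's input, and the variable names greedy and lazy are chosen here for illustration.

```tcl
set test "a for apple\nnumber n1\nnumbers 2,3,4,5,6\nd for doctor\nnumber n2\nnumbers 56,4,5,5\n"

# Greedy: the first .* runs on to the LAST occurrence of "numbers".
regexp {number n1.*(numbers .*)} $test -> greedy

# Non-greedy, with a newline forced after the capture.
regexp {number n1.*?(numbers .*)\n} $test -> lazy

puts "greedy: [string trim $greedy]"   ;# greedy: numbers 56,4,5,5
puts "lazy:   $lazy"                   ;# lazy:   numbers 2,3,4,5,6
```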
Documentation: regular expression syntax, regexp
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
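The regsub clean-up mentioned above could look roughly like this. It is a sketch under assumptions: reQuote and numbersFor are hypothetical names, the data is a condensed, colon-joined copy of the question's input, and the pattern expects the colon directly after the target rather than using the first character set.

```tcl
# Hypothetical helper: backslash-escape every non-word character so the
# text is matched literally when embedded in a regular expression.
proc reQuote {text} {
    regsub -all {\W} $text {\\&} quoted
    return $quoted
}

# Hypothetical helper: find the "numbers" field that follows "number <target>".
proc numbersFor {target data} {
    set pattern "number [reQuote $target]:(numbers \[^:\]*)"
    if {[regexp $pattern $data -> numbers]} {
        return $numbers
    }
    return ""
}

set data "a for apple:number n1:numbers 2,3,4,5,6:d for doctor:number n2:numbers 56,4,5,5:"

puts [numbersFor n1 $data]    ;# numbers 2,3,4,5,6
puts [numbersFor n2 $data]    ;# numbers 56,4,5,5
puts [reQuote {a.b[0]}]       ;# a\.b\[0\]
```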

tcl regexp from variable and special characters

I am a bit confused
my input string is " foo/1"
my motivation is to set foo as a variable and regexp it :
set line " foo/1"
set a foo
regexp "\s$a" $line     ;# does not work
I also noticed that it works only if I use curly braces and give the exact string:
regexp {\sfoo} $line    ;# works
regexp "\sfoo" $line    ;# doesn't work
can somebody explain why?
thanks
Quick answer:
"\\s" == {\s}
Long answer:
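In Tcl, if you type a string using "" for enclosing it, everything inside will be evaluated first and then used as a string. This means that \s is processed as a backslash escape sequence instead of being kept as two characters. Since \s is not an escape Tcl recognizes, the backslash is simply dropped and the regexp engine only sees s.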
If you want to type \ character inside "" string you have to escape it as well: \\. In your case you would have to type "\\sfoo".
In case of {} enclosed strings, they are always quoted, no need for repeated backslash.
Using "" is good if you want to use variables or inline commands in the string, for example:
puts "The value $var and the command result: [someCommand $arg]"
The above will evaluate $var and [someCommand $arg] and put them into the string.
If you'd have used braces, for example:
puts {The value $var and the command result: [someCommand $arg]}
The string will not be evaluated. It will contain all the $ and [ characters, just like you typed them.
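Applied to the question, the rules above can be checked with a few lines. This is a small sketch using the question's own values; the format variant is just one possible way to combine a braced pattern fragment with a variable.

```tcl
set line " foo/1"
set a foo

puts [string equal "\sfoo" "sfoo"]        ;# 1: the backslash was dropped
puts [string equal "\\sfoo" {\sfoo}]      ;# 1: both are \sfoo
puts [regexp "\sfoo" $line]               ;# 0: the pattern is really sfoo
puts [regexp "\\s$a" $line]               ;# 1: the pattern is \sfoo
puts [regexp [format {\s%s} $a] $line]    ;# 1: braces plus format
```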

TCL regsub isn't working when the expression has [0]

I tried the following code:
set exp {elem[0]}
set temp {elem[0]}
regsub $temp $exp "1" exp
if {$exp} {
puts "######### 111111111111111 ################"
} else {
puts "########### 0000000000000000 ############"
}
Of course, this is the easiest regsub possible (the words match completely), and still it doesn't work: no substitution is done. If I write elem instead of elem[0], everything works fine.
I tried using {elem[0]}, elem[0], "elem[0]" etc, and none of them worked.
Any clue anyone?
This is the easiest regsub possible (the words match completely)
Actually, no, the words don't match. You see, in a regular expression, square brackets have meaning. Your expression {elem[0]} actually means:
match the sequence of letters 'e'
followed by 'l'
followed by 'e'
followed by 'm'
followed by '0' (the character for the number zero)
So it would match the string "elem0" not "elem[0]" since the character after 'm' is not '0'.
What you want is {elem\[0\]} <-- backslash escapes special meaning.
Read the manual for tcl's regular expression syntax, re_syntax, for more info on how regular expressions work in tcl.
In addition to @slebetman's answer, if you want any special characters in your regular expression to be treated like plain text, there is special syntax for that:
set word {abd[0]}
set regex $word
regexp $regex $word ;# => returns 0, did not match
regexp "(?q)$regex" $word ;# => returns 1, matched
That (?q) marker must be the first part of the RE.
Also, if you're really just comparing literal strings, consider the simpler if {$str1 eq $str2} ... or the glob-style matching of [string match]
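For the regsub from the question, the same ideas might be applied like this. It is a sketch only; the variable names viaLiteral, viaMap and count are invented for illustration.

```tcl
set exp  {elem[0]}
set temp {elem[0]}

# (?q) makes the rest of the pattern a literal string.
regsub "(?q)$temp" $exp 1 viaLiteral

# string map never interprets its arguments as patterns
# (it replaces every occurrence, not just the first).
set viaMap [string map [list $temp 1] $exp]

# The original call: [0] is a character set, so nothing is replaced.
set count [regsub $temp $exp 1 unchanged]

puts $viaLiteral    ;# 1
puts $viaMap        ;# 1
puts $count         ;# 0
```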