Need explanation of tcl regexp inline example in the man page please - regex

While trying to understand regexp and its -inline option, I saw this example in the man page but couldn't understand how it works.
Link to the man page is: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm#M13
There, under the -inline option, this example is given:
regexp -inline -- {\w(\w)} " inlined "
=> {in n}
regexp -all -inline -- {\w(\w)} " inlined "
=> {in n li i ne e}
How does this "{\w(\w)}" yield "{in n}"? Can someone explain please.
Appreciate the help.
Thanks

If -inline is given but -all is not, regexp returns a list consisting of one value for the entire region matched and one value for each submatch (regions captured by parentheses). To see what the entire match is, ignore the parentheses: the pattern is now {\w\w}, matching the first two word characters in the string (in). The first submatch is what you get if you skip one word character (the \w outside the parentheses) and then capture the next word character (the \w inside the parentheses), getting n.
If both -inline and -all are given, regexp does this repeatedly, restarting at the first character beyond the last entire match.
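One way to see the restart behaviour is to ask for index ranges instead of substrings. This is only a sketch using the man page's own input and the standard -indices switch; the variable name ranges is chosen here for illustration.

```tcl
# Report where each match and submatch sits in the string " inlined ".
# Index 0 is the leading space, so the first word character is at index 1.
set ranges [regexp -all -inline -indices -- {\w(\w)} " inlined "]
puts $ranges
# => {1 2} {2 2} {3 4} {4 4} {5 6} {6 6}
```

Each pair of ranges is one iteration: the whole match, then the captured character. The second match starts at index 3, just past the end of the first match (index 2), and the final d is left over because no second word character follows it.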

I think that to understand -inline, you must first understand that -inline puts the matches (and submatches) in a list. Because if you had...
regexp -- {\w(\w)} " inlined " m1 m2
You will have...
% puts $m1
in
% puts $m2
n
The whole match, in, is stored in m1, while the submatch from the capture group, n, is stored in m2.
Putting those in a list (i.e. when using -inline) will give {in n}.
When you use -all and -inline together (assuming you already know that -all retrieves all non-overlapping matches), you can no longer put variable names after the input string. Instead you get a flat list containing all the matches and submatches. If I name them m and s (for match and submatch respectively), you have:
in n li i ne e
m  s m  s m  s
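The m/s pairing can be walked with a two-variable foreach. A minimal sketch, reusing the man page example; the names flat, m, s and lines are just illustrative.

```tcl
# -all -inline returns one flat list: match, submatch, match, submatch, ...
set flat [regexp -all -inline -- {\w(\w)} " inlined "]

# Take the elements two at a time: m is the whole match, s the captured character.
set lines {}
foreach {m s} $flat {
    lappend lines "match=$m submatch=$s"
}
puts [join $lines \n]
# => match=in submatch=n
# => match=li submatch=i
# => match=ne submatch=e
```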

Related

Check if string end with substring, Tcl

I have a search string .state_s[0] and two lists of strings:
{cache.state_s[0]} {cache.state_s[1]}
and
{cache.state_s[0]a} {cache.state_s[1]}
I need Tcl command(s) that check whether the search string matches any of the items in a list. Importantly, the solution should return a positive result only for the first list. I tried:
set pattern {.state_s[0]}
set escaped_pattern [string map {* \\* ? \\? [ \\[ ] \\] \\ \\\\} $pattern]
set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
regexp $escaped_pattern $m1
regexp $escaped_pattern $m2
However, the above commands are returning "1" with both regexp calls.
Basically, I need a way to check if a substring (having special chars like [) is at the end of a string.
You have the elements as a list in a variable m1.
set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
But when you pass it to regexp, Tcl treats the value of m1 as one whole string, not as a list.
Since both lists contain the string .state_s[0] somewhere, regexp returns 1 for each.
If you want to apply the regular expression to each element, I would recommend using lsearch with the -regexp flag.
% set m1 {{cache.state_s[0]} {cache.state_s[1]}}
{cache.state_s[0]} {cache.state_s[1]}
% set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
{cache.state_s[0]a} {cache.state_s[1]}
%
% lsearch -regexp -inline $m1 {\.state_s\[0]$};
cache.state_s[0]
% lsearch -regexp -inline $m2 {\.state_s\[0]$}
%
The pattern I have used here is {\.state_s\[0]$}. The trailing $ anchors the match at the end of the string, which ensures that the element has no further characters after the pattern. We don't have to escape the closing square bracket ] in a Tcl regular expression.
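Since the goal is only to test whether an element ends with a fixed string, a plain string comparison avoids regex escaping altogether. The following is a sketch rather than part of the answer above; endsWith and anyEndsWith are hypothetical helper names introduced here.

```tcl
# Hypothetical helper: 1 if str ends with suffix, compared literally (no regex, no glob).
proc endsWith {str suffix} {
    set n [string length $suffix]
    if {$n == 0} { return 1 }
    # Take the last n characters of str and compare them with the suffix.
    return [string equal [string range $str end-[expr {$n - 1}] end] $suffix]
}

# Hypothetical helper: 1 if any element of the list ends with suffix.
proc anyEndsWith {items suffix} {
    foreach item $items {
        if {[endsWith $item $suffix]} { return 1 }
    }
    return 0
}

set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
puts [anyEndsWith $m1 {.state_s[0]}]   ;# => 1
puts [anyEndsWith $m2 {.state_s[0]}]   ;# => 0
```

Because nothing is interpreted as a pattern, the brackets in the search string need no special treatment.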

Passing a variable to regexp when the variable may have brackets (TCL)

In my job, I deal a lot with entities whose names may contain square brackets. We mostly use tcl, so square brackets can sometimes cause havoc. I'm trying to do the following:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
echo [regexp "${pat}_affin.*" $aff]
However, this returns a 0 when I would expect a 1. I'm certain that when ${pat} is passed to the regexp engine, the brackets are being expanded and read as "[9]" instead of "\[9\]".
How do I phrase the regexp so a pattern contains a variable when the variable itself may have special regexp characters?
EDIT: An easy way would be to just escape the brackets when setting $pat. However, the value for $pat is passed to me by a function so I cannot easily do that.
Just ruthlessly escape all non-word chars:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexp "${pat}_affin.*" $aff] ;# ==> 0
set escaped_pat [regsub -all {\W} $pat {\\&}]
puts $escaped_pat ;# ==> pair_shap_val\[9\]
puts [regexp "${escaped_pat}_affin.*" $aff] ;# ==> 1
A second thought: this doesn't really seem to require regular expression matching. It appears you just need to check that the pat string is contained in the aff string:
% expr {[string first $pat $aff] != -1}
1
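Both ideas above can be wrapped up so they are easy to reuse. This is a sketch under assumptions: reQuote is a hypothetical helper name, and anchoring with ^ (and testing for position 0) assumes the name is expected at the start of the string, which the original unanchored regexp did not require.

```tcl
# Hypothetical helper: backslash-escape every non-word character
# so the regex engine treats the whole string literally.
proc reQuote {s} {
    return [regsub -all {\W} $s {\\&}]
}

set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}

# Regex route: escaped variable part plus a literal suffix.
set viaRegexp [regexp "^[reQuote $pat]_affin" $aff]

# String route: no regex at all; 0 means the text starts at the first character.
set viaString [expr {[string first "${pat}_affin" $aff] == 0}]

puts "$viaRegexp $viaString"
# => 1 1
```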

How to match nth occurrence in a string using regular expression

set test {stackoverflowa is a best solution finding site
stackoverflowb is a best solution finding site stackoverflowc is a
best solution finding sitestackoverflowd is a best solution finding
sitestackoverflowe is a best solution finding site}
regexp -all {stackoverflow} $test
The above gives "5" as output.
regexp {stackoverflow} $test
The above matches only the first occurrence of stackoverflow (i.e. the one in stackoverflowa).
My requirement is to match the 5th occurrence of stackoverflow (i.e. the one in stackoverflowe) in the string given above.
Could someone please clarify how to do this? Thanks.
Then another one question
Try
set results [regexp -inline -all {stackoverflow.} $test]
# => stackoverflowa stackoverflowb stackoverflowc stackoverflowd stackoverflowe
puts [lindex $results 4]
I'll be back to explain this further shortly, making pancakes right now.
So.
The command returns a list (-inline) of all (-all) substrings of the string contained in test that match the string "stackoverflow" (less quotes) plus one character, which can be any character. This list is stored in the variable results, and by indexing with 4 (because indexing is zero-based), the fifth element of this list can be retrieved (and, in this case, printed).
The dot at the end of the expression wasn't in your expression: I added it to check that I really did get the right match. You can of course omit the dot to match "stackoverflow" exactly.
ETA (from Donal's comment): in many cases it's convenient to extract not the string itself, but its position and extent within the searched string. The -indices option gives you that (I'm not using the dot in the expression now: the index list makes it obvious which one of the "stackoverflow"s I'm getting anyway):
set indices [regexp -inline -all -indices {stackoverflow} $test]
# => {0 12} {47 59} {94 106} {140 152} {186 198}
You can then use string range to get the string match:
puts [string range $test {*}[lindex $indices 4]]
The lindex $indices 4 gives me the list 186 198; the {*} prefix makes the two elements in that list appear as two separate arguments in the invocation of string range.
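The index-based lookup can be packaged as a small procedure. A minimal sketch, assuming the pattern has no capture groups (otherwise the list would also contain submatch ranges) and Tcl 8.5 or later for {*}; nthMatchRange and the shortened sample string are made up for illustration.

```tcl
# Hypothetical helper: return the {start end} range of the n-th match (1-based),
# or an empty list if there are fewer than n matches.
proc nthMatchRange {re str n} {
    set ranges [regexp -all -inline -indices -- $re $str]
    if {$n < 1 || $n > [llength $ranges]} {
        return {}
    }
    return [lindex $ranges [expr {$n - 1}]]
}

set sample "stackoverflowa x stackoverflowb y stackoverflowc"
set r [nthMatchRange {stackoverflow.} $sample 3]
puts $r                                ;# => 34 47
puts [string range $sample {*}$r]      ;# => stackoverflowc
```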

greedy matching in regexp

I have the following output:
Player name: RON_98
Player name: RON_97
player name: RON_96
I need to get the RON part and the numeric part after it (for example 98). I used the following regexp: regexp "(RON)_(\[0-9]*)". Will this also match the RON_96 on the last line? "*" is a greedy match, so how do I match only the first line of the output? Is there something like (RON)_(match digits only) that prevents it from matching the rest of the line?
Always put regular expressions in braces in Tcl.
It's not technically necessary (you can use Tcl's language definition to exactly work out what backslashes would be needed to do it any other way) but it's simpler in all cases that you're likely to encounter normally.
The examples below will use this.
Regular expressions start matching as soon as they can. Then, under normal (greedy) circumstances they match as much text as they can. Thus, with your sample code and text, the matcher starts trying to match at the R on the first line and goes on to consume up to the 8, at which point it has a match and stops. You can verify this by asking regexp to report the indices into the string where the match happened instead of the substring that was matched (via the -indices option, documented on the manual page).
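The check with -indices might look like the sketch below. It assumes the three sample lines are stored in a variable named thedata, joined by newlines.

```tcl
set thedata "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"

# With -indices, each match variable receives a {start end} pair instead of text.
regexp -indices {(RON)_([0-9]*)} $thedata whole name digits
puts "whole=$whole name=$name digits=$digits"
# => whole=13 18 name=13 15 digits=17 18
```

The whole match ends at index 18, the 8 on the first line, so the greedy * did not run on into the following lines.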
To get all the matches in a string, you have two options:
Pass the -all -inline options to regexp and process the list of results with foreach:
# Three variables in foreach; one for whole match, one for each substring
foreach {a b c} [regexp -all -inline {(RON)_([0-9]*)} $thedata] {
    puts "matched '$a', with b=$b and c=$c"
}
Use the -indices option together with the -start option, all in a while loop, so you step through the string:
set idx 0
while {[regexp -start $idx -indices {(RON)_([0-9]*)} $thedata a b c]} {
    puts "matched at '$a', with subranges '$b' and '$c'"
    set extracted [string range $thedata {*}$c]
    puts "the extracted value is '$extracted'"
    # Advance the place where the next search will start from
    set idx [expr {[lindex $a 1] + 1}]
}
I'd normally recommend using the first option; it's much easier to use! Sometimes the second is better as it provides more information and uses less intermediate storage, but it's also much trickier to get right.
Even if you apply your regex to multiple lines, it will not match more than the first occurrence of what you stated, and this is "RON_98". It will stop after the last digit of the first match. You can also anchor the match with $ at the end of your regex; note that in Tcl, $ matches at the end of a line only when the -line option is given, and otherwise matches only at the end of the string.
For reference, [0-9] can be written more simply as \d (digit):
(RON)_(\d*)
is easier to read.
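The effect of a trailing $ with and without -line can be checked directly. A minimal sketch, assuming the three sample lines are stored in thedata; the variable names are illustrative, and -> is just a throwaway variable name for the whole match.

```tcl
set thedata "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"

# Without -line, $ only matches at the very end of the string,
# so the only candidate is the entry on the last line.
regexp {(RON)_(\d+)$} $thedata -> name lastNum

# With -line, $ also matches just before each newline,
# so the leftmost candidate is the entry on the first line.
regexp -line {(RON)_(\d+)$} $thedata -> name firstNum

puts "$lastNum $firstNum"
# => 96 98
```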

How do I extract all matches with a Tcl regex?

Hi everybody, I want a solution for this regular expression problem: I need to extract all the hex numbers of the form H'xxxx. I used the regexp below, but I get only one number instead of all the hex values. How do I get every hex number from this string?
set hex "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set res [regexp -all {H'([0-9A-Z]+)&} $hex match hexValues]
puts "$res H$hexValues"
The output I am getting is 5 H4D52
On -all -inline
From the documentation:
-all : Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.
-inline : Causes the command to return, as a list, the data that would otherwise be placed in match variables. When using -inline, match variables may not be specified. If used with -all, the list will be concatenated at each iteration, such that a flat list is always returned. For each match iteration, the command will append the overall match data, plus one element for each subexpression in the regular expression.
Thus to return all matches --including captures by groups-- as a flat list in Tcl, you can write:
set matchTuples [regexp -all -inline $pattern $text]
If the pattern has groups 0…N-1, then each match is an N-tuple in the list. Thus the number of actual matches is the length of this list divided by N. You can then use foreach with N variables to iterate over each tuple of the list.
If N = 2 for example, you have:
set numMatches [expr {[llength $matchTuples] / 2}]
foreach {group0 group1} $matchTuples {
    ...
}
References
regular-expressions.info/Tcl
Sample code
Here's a solution for this specific problem, annotated with output as comments (see also on ideone.com):
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set pattern {H'([0-9A-F]{4})}
set matchTuples [regexp -all -inline $pattern $text]
puts $matchTuples
# H'22EF 22EF H'2354 2354 H'4BD4 4BD4 H'4C4B 4C4B H'4D52 4D52 H'4DC9 4DC9
# \_________/ \_________/ \_________/ \_________/ \_________/ \_________/
# 1st match 2nd match 3rd match 4th match 5th match 6th match
puts [llength $matchTuples]
# 12
set numMatches [expr {[llength $matchTuples] / 2}]
puts $numMatches
# 6
foreach {whole hex} $matchTuples {
    puts $hex
}
# 22EF
# 2354
# 4BD4
# 4C4B
# 4D52
# 4DC9
On the pattern
Note that I've changed the pattern slightly:
Instead of [0-9A-Z]+, e.g. [0-9A-F]{4} is more specific for matching exactly 4 hexadecimal digits
If you insist on matching the &, then the last hex string (H'4DC9 in your input) can not be matched
This explains why you get 4D52 in the original script, because that's the last match with &
Maybe get rid of the &, or use (&|$) instead, i.e. a & or the end of the string $.
References
regular-expressions.info/Finite Repetition, Anchors
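If the trailing & should stay part of the pattern, the "& or end of string" idea can be written with a non-capturing group so that each match still yields exactly two list elements. This is only a sketch of that variant; the variable hexValues is introduced here for illustration.

```tcl
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"

# (?:&|$) requires a following & or the end of the string, but captures nothing,
# so the flat list still alternates: whole match, hex digits, whole match, ...
set hexValues {}
foreach {whole hex} [regexp -all -inline {H'([0-9A-F]{4})(?:&|$)} $text] {
    lappend hexValues $hex
}
puts $hexValues
# => 22EF 2354 4BD4 4C4B 4D52 4DC9
```

With a capturing (&|$) instead, every match would contribute three elements, and the foreach would need a third variable.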
I'm not Tclish, but I think you need to use both the -inline and -all options:
regexp -all -inline {H'([0-9A-Z]+)&} $string
EDIT: Here it is again, this time with a corrected regex (see the comments):
regexp -all -inline {H'[0-9A-F]+&} $string