tcl list index in regexp

Is there a way to use the list index in a regexp ?
I've tried with this but it doesn't work :
if {[regexp {.*\[lindex $mylist 2\].*} $mystring]} {
puts "OK"
}
The value at index 2 of the list mylist is not substituted into the regexp.
Thanks.

If you just want to see if some short simple string is present in another, regexp is not the right approach. Instead, use string first:
if {[string first [lindex $mylist 2] $mystring] >= 0} {
puts "OK"
}
If the list really has a regular expression in its third element, then it's enough to do this, because Tcl always detects if an RE matches anywhere (its patterns are unanchored by default):
if {[regexp -- [lindex $mylist 2] $mystring]} {
puts "OK"
}
The -- is just in case the RE starts with a - character, which can cause confusion. You could also use regexp to work a bit like that string first recipe:
if {[regexp ***=[lindex $mylist 2] $mystring]} {
    puts "OK"
}
But the code with string first will be faster! If you want anything much more complicated than this, it's probably a good idea to stop and think whether what you're doing is the right approach; when one finds oneself doing complicated substitutions in regular expressions, one is usually getting into a mess. (Or at least that's when I know I need to rethink.) Asking here, while providing a bit more context, can help you figure things out.
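As a rough sketch of how the three approaches above differ, the following uses made-up data (mylist, hit and miss are illustrative values, not from the question) in which the third list element contains the metacharacter ".":

```tcl
# Hypothetical data: the third element contains a regexp metacharacter.
set mylist {alpha beta a.c}
set hit  "xx a.c yy"   ;# contains the element literally
set miss "xx abc yy"   ;# matches only if "." is treated as a wildcard

# Literal substring test with string first.
set firstHit  [expr {[string first [lindex $mylist 2] $hit] >= 0}]
set firstMiss [expr {[string first [lindex $mylist 2] $miss] >= 0}]

# Element used as a regular expression: "." matches any character.
set reMiss [regexp -- [lindex $mylist 2] $miss]

# ***= prefix: the rest of the pattern is taken literally.
set litHit  [regexp ***=[lindex $mylist 2] $hit]
set litMiss [regexp ***=[lindex $mylist 2] $miss]

puts "string first: $firstHit $firstMiss"
puts "regexp --:    $reMiss"
puts "regexp ***=:  $litHit $litMiss"
```

Only the plain regexp call reports a match on "xx abc yy", because it is the only one that interprets the "." in the element.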

I'm not sure why you're using regexp that way. You don't have to match the whole string, you know?
You should be able to use:
if {[regexp [lindex $mylist 2] $mystring]} {
puts "OK"
}
Note that as long as there is a match anywhere in the string, regexp will match and return 1.
However, this might give you unexpected results if [lindex $mylist 2] contains regexp metacharacters. If it doesn't contain any, you should be good.
If the element in the list was intended to be a regexp, then metacharacters won't be an issue either.
If you have metacharacters in the element of the list, you might use another regexp first to escape them:
if {[regexp [regsub -all {[\]\[+*.^${}()?\\]} [lindex $mylist 2] {\\\0}] $mystring]} {
puts "OK"
}
[regsub -all {[\]\[+*.^${}()?\\]} [lindex $mylist 2] {\\\0}] puts a backslash in front of every character matched by the bracket expression [\]\[+*.^${}()?\\], i.e. each of the characters ] [ + * . ^ $ { } ( ) ? \ (note that | is not in this set).
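To see what the escaping step produces, here is a small sketch with an invented element value (a.b[1] is only an illustration):

```tcl
# Hypothetical list element containing metacharacters.
set element {a.b[1]}

# Put a backslash in front of each metacharacter.
set escaped [regsub -all {[\]\[+*.^${}()?\\]} $element {\\\0}]
puts $escaped

# Unescaped, "." and "[1]" are interpreted by the regexp engine.
set unescapedMatch [regexp -- $element "axb1"]
# Escaped, only the literal text matches.
set escapedMatch   [regexp -- $escaped "axb1"]
set escapedLiteral [regexp -- $escaped {xx a.b[1] yy}]

puts "$unescapedMatch $escapedMatch $escapedLiteral"
```

The unescaped pattern accepts "axb1", while the escaped one only accepts strings that contain the element text verbatim.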

The problem is that the braces around the regex prevent evaluation of the lindex command.
There are various ways you can solve this. One of them is to give up the braces and use quotation marks, which allow command substitution. In that case any [, $ or \ that is meant for the regexp engine itself must be escaped with a backslash, while the brackets around lindex must stay unescaped so that the command runs:
if {[regexp ".*[lindex $mylist 2].*" $mystring]} {
puts "OK"
}
Alternatively you can use the subst command, but this is more verbose:
if {[regexp [subst -nobackslashes {.*[lindex $mylist 2].*}] $mystring]} {
    puts "OK"
}
The subst command with the -nobackslashes switch performs command and variable substitution on its argument but leaves backslashes untouched, so regexp escapes such as \d survive.
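A quick way to compare the quoting styles is to print the pattern each one hands to regexp. This is a sketch with invented values for mylist and mystring; the \m and \M word-boundary escapes in the last pattern are only there to show that backslashes pass through subst -nobackslashes:

```tcl
# Hypothetical data for illustration.
set mylist {alpha beta gamma}
set mystring "xx gamma yy"

# Braces: no substitution, the text reaches regexp unchanged.
set braced {.*[lindex $mylist 2].*}
# Double quotes: the lindex command is substituted.
set quoted ".*[lindex $mylist 2].*"
# subst -nobackslashes: command substituted, backslashes kept as-is.
set substed [subst -nobackslashes {.*[lindex $mylist 2].*}]
set wordRE  [subst -nobackslashes {\m[lindex $mylist 2]\M}]

puts "braced:  $braced"
puts "quoted:  $quoted"
puts "substed: $substed"
puts "wordRE:  $wordRE"
puts [regexp $quoted $mystring]
puts [regexp $wordRE $mystring]
```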

Related

TCL Find in list

I need to search in a list an exact word CUT_LEVEL_EP1.5 without being able to use the elements that are before and after my string. below is my list :
set l {L1555(CUT_LEVEL_EP1.5-1) L1560(CUT_LEVEL_EP1.5-2) L1565(CUT_LEVEL_EP1.5) L1570(CUT_LEVEL_EP1.5-3)}
here’s what I tried to do :
set index [lsearch -regexp $l {\yCUT_LEVEL_EP1.5\M}]
if {$index > -1} {
puts [lindex $l $index]
}
The result is not what I expect: I get L1555(CUT_LEVEL_EP1.5-1), but I would like L1565(CUT_LEVEL_EP1.5).
As your requirements don't match the built-in definition of a "word", you cannot use the \M constraint escape. An alternative may be to use a negative lookahead:
lsearch -regexp $l {\yCUT_LEVEL_EP1.5(?![A-Z0-5.-])}
It looks to me like you want to find the string only if it is the entirety of the contents of the parentheses. (You also want to avoid the trap of . matching things other than itself.) You can do it easily, but only with a little trickery.
set str "CUT_LEVEL_EP1.5"
set index [lsearch -regexp $l ***=($str)]
The ***= says that the rest of the RE is a literal, and then we put the parentheses directly (as well as substituting in the string we are really looking for, safe because of the force-literal trick).
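Putting the question's list and the suggestions above side by side gives the following sketch (the variable names wordIdx, lookIdx and litIdx are just for this illustration):

```tcl
set l {L1555(CUT_LEVEL_EP1.5-1) L1560(CUT_LEVEL_EP1.5-2) L1565(CUT_LEVEL_EP1.5) L1570(CUT_LEVEL_EP1.5-3)}
set str "CUT_LEVEL_EP1.5"

# Original attempt: \M is satisfied before the "-", so the first element matches.
set wordIdx [lsearch -regexp $l {\yCUT_LEVEL_EP1.5\M}]

# Negative lookahead: reject a following "-", ".", digit 0-5 or capital letter.
set lookIdx [lsearch -regexp $l {\yCUT_LEVEL_EP1.5(?![A-Z0-5.-])}]

# Literal search for the string including its surrounding parentheses.
set litIdx [lsearch -regexp $l "***=($str)"]

puts "word-boundary attempt: index $wordIdx -> [lindex $l $wordIdx]"
puts "negative lookahead:    index $lookIdx -> [lindex $l $lookIdx]"
puts "literal with parens:   index $litIdx -> [lindex $l $litIdx]"
```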

Passing a variable to regexp when the variable may have brackets (TCL)

In my job, I deal a lot with entities whose names may contain square brackets. We mostly use tcl, so square brackets can sometimes cause havoc. I'm trying to do the following:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
echo [regexp "${pat}_affin.*" $aff]
However, this returns a 0 when I would expect a 1. I'm certain that when ${pat} is passed to the regexp engine, the brackets are being expanded and read as "[9]" instead of "\[9\]".
How do I phrase the regexp so a pattern contains a variable when the variable itself may have special regexp characters?
EDIT: An easy way would be to just escape the brackets when setting $pat. However, the value for $pat is passed to me by a function so I cannot easily do that.
Just ruthlessly escape all non-word chars:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexp "${pat}_affin.*" $aff] ;# ==> 0
set escaped_pat [regsub -all {\W} $pat {\\&}]
puts $escaped_pat ;# ==> pair_shap_val\[9\]
puts [regexp "${escaped_pat}_affin.*" $aff] ;# ==> 1
A second thought: this doesn't really seem to require regular expression matching. It appears you just need to check that the pat string is contained in the aff string:
% expr {[string first $pat $aff] != -1}
1
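If regular expressions are dropped altogether, the check for the _affin suffix can be done with plain string commands. This is only a sketch; the variable names prefix, contains and startsWith are made up here, and the startsWith variant is stricter than the unanchored regexp in the question:

```tcl
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}

# The text that must appear; the brackets need no escaping in string commands.
set prefix "${pat}_affin"

# Equivalent of the unanchored regexp: the text occurs somewhere in aff.
set contains [expr {[string first $prefix $aff] != -1}]

# Stricter variant: aff must begin with the text.
set startsWith [string equal -length [string length $prefix] $prefix $aff]

puts "contains=$contains startsWith=$startsWith"
```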

need help in tcl command usage for regsub

I am a new learner of Tcl. I have an issue, described below, when using regsub. Consider the following scenario:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
regsub -all ${test2}_[1-9] $test1 [list] test1
I expected the value of test1 to be [prefix_abc_3 AAA_0].
However, regsub has also removed the partially matched string in prefix_abc_3. Does anyone here have any idea how to regsub only the exact words in a list?
I tried to find a solution on the net but could not get any clues or hints. I'd appreciate it if someone here can help me.
\m and \M in regexps match the beginning and end of a word respectively. But you don't have a string of words in test1, but a list of elements, and sometimes there's a difference so don't mix the two. regsub only handles strings while lsearch works with lists:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
set test1 [lsearch -all -inline -not -regexp $test1 "^${test2}_\[1-9\]\$"]
If the pattern is that simple, you can use the -glob option (the default) instead of -regexp and maybe save some processor time.
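For comparison, here is a sketch of both variants run on the question's data; the glob pattern needs no anchors because glob matching always covers the whole element (viaRegexp and viaGlob are names chosen for this example):

```tcl
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc

# Regexp version: explicit ^ and $ anchors, brackets escaped inside quotes.
set viaRegexp [lsearch -all -inline -not -regexp $test1 "^${test2}_\[1-9\]\$"]

# Glob version: the pattern must match the entire element.
set viaGlob [lsearch -all -inline -not -glob $test1 "${test2}_\[1-9\]"]

puts $viaRegexp
puts $viaGlob
```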
What exactly did you execute?
When I type the commands above into tclsh, it displays an error -
% set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
prefix_abc_3 abc_1 abc_2 AAA_0
% set test2 abc
abc
% regsub -all ${test2}_[1-9] [list] test1
invalid command name "1-9"
I'm unsure what you are trying to do. You start by initialising test1 as a list. You then treat it as a string by passing it to regsub. This is a completely legal thing to do, but may indicate that you are confused by something. Are you trying to test your substitution by applying it four times, to each of prefix_abc_3, abc_1, abc_2 and AAA_0? You can certainly do that the way you are, but a more natural way would be to do
foreach test $test1 {
regsub $pattern $test [list] testResult
puts stdout $testResult
}
Then again, what are you trying to achieve with your substitution? It looks as though you are trying to replace the string abc with a null string, i.e. remove it altogether. Passing [list] as a null string is perfectly valid, but again may indicate confusion between lists and strings.
To achieve the result you want, all you need to do is add a leading space to your pattern, pass a space as the substitution string and escape the square brackets, i.e.
regsub -all " ${test2}_\[1-9\]" $test1 " " test1
but I suspect that this is a made-up example and you're really trying to do something slightly different.
Edit
To obtain a list that contains just those list entries that don't exactly match your pattern, I suggest
proc removeExactMatches {input} {
    set result [list]; # Initialise the result list
    foreach inputElement $input {
        if {![regexp {^abc_[0-9]$} $inputElement]} {
            lappend result $inputElement
        }
    }
    return $result
}
set test1 [removeExactMatches [list prefix_abc_3 abc_1 abc_2 AAA_0]]
Notes:
i) I don't use regsub at all.
ii) Although it's safe and legal to switch around between lists and strings, it all takes time and it obscures what I'm trying to do, so I avoid it wherever possible. You seem to have a list of strings and you want to remove some of them, so that's what I use in my suggested solution. The regular expression commands in Tcl handle strings so I pass them strings.
iii) To ensure that the list elements match exactly, I anchor the pattern to the start and end of the string that I'm matching against using ^ and $.
iv) To prevent the interpreter from recognising the [1-9] in the regular expression pattern and trying to execute a (non-existent) command 1-9, I enclose the whole pattern string within curly brackets.
v) For greater generality, I might want to pass the pattern to the proc as well as the input list (of strings), in that case, I'd do
proc removeExactMatches {inputPattern input} {
.
.
.
set pattern "^"
append pattern $inputPattern
append pattern "\$"
.
.
.
if {![regexp $pattern $inputElement]} {
.
.
.
}
set test1 [removeExactMatches {abc_[1-9]} {prefix_abc_3 abc_1 abc_2 AAA_0}]
to minimise the number of characters that had to be escaped. (Actually I probably wouldn't use the quotation marks for the start and end anchors within the proc - they aren't really needed and I'm a lazy typist!)
Looking at your original question, it seems that you might want to vary only the abc part of the pattern, in which case you might want to just pass that to your proc and append the _[0-9] as well as the anchors within it - don't forget to escape the square brackets or use curly brackets if you go down this route.
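One possible way to write out the generalised proc in full is sketched below. The body is an assumption about what the elided parts might look like, not the answer's actual code; the (?:...) grouping and the -- switch are additions made here so that patterns containing | or starting with - still behave as exact matches:

```tcl
# Sketch: return the elements of input that do NOT exactly match inputPattern.
proc removeExactMatches {inputPattern input} {
    # Anchor the caller's pattern at both ends; (?:...) keeps any
    # alternation in the pattern inside the anchors (an added safeguard).
    set pattern "^(?:${inputPattern})\$"
    set result [list]
    foreach inputElement $input {
        if {![regexp -- $pattern $inputElement]} {
            lappend result $inputElement
        }
    }
    return $result
}

set test1 [removeExactMatches {abc_[1-9]} {prefix_abc_3 abc_1 abc_2 AAA_0}]
puts $test1
```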

greedy matching in regexp

I have the following output:
Player name: RON_98
Player name: RON_97
player name: RON_96
I need to get the RON part and the numeric part after it (for example 98). I used the following regexp: regexp "(RON)_(\[0-9]*)". Will this match the RON_96 of the last line? "*" is a greedy match, so how do I match only the first line of the output? Is there something like (RON)_(digits only) that prevents it from matching the rest of the line?
Always put regular expressions in braces in Tcl.
It's not technically necessary (you can use Tcl's language definition to exactly work out what backslashes would be needed to do it any other way) but it's simpler in all cases that you're likely to encounter normally.
The examples below will use this.
Regular expressions start matching as soon as they can. Then, under normal (greedy) circumstances they match as much text as they can. Thus, with your sample code and text, the matcher starts trying to match at the R on the first line and goes on to consume up to the 8, at which point it has a match and stops. You can verify this by asking regexp to report the indices into the string where the match happened instead of the substring that was matched (via the -indices option, documented on the manual page).
To get all the matches in a string, you have two options:
Pass the -all -inline options to regexp and process the list of results with foreach:
# Three variables in foreach; one for whole match, one for each substring
foreach {a b c} [regexp -all -inline {(RON)_([0-9]*)} $thedata] {
    puts "matched '$a', with b=$b and c=$c"
}
Use the -indices option together with the -start option, all in a while loop, so you step through the string:
set idx 0
while {[regexp -start $idx -indices {(RON)_([0-9]*)} $thedata a b c]} {
    puts "matched at '$a', with subranges '$b' and '$c'"
    set extracted [string range $thedata {*}$c]
    puts "the extracted value is '$extracted'"
    # Advance the place where the next search will start from
    set idx [expr {[lindex $a 1] + 1}]
}
I'd normally recommend using the first option; it's much easier to use! Sometimes the second is better as it provides more information and uses less intermediate storage, but it's also much trickier to get right.
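To check the claim that the first match stops at the end of RON_98, here is a sketch that assumes the question's output is held in a variable called thedata (a name taken from the examples above, with the contents reconstructed from the question):

```tcl
set thedata "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"

# A plain regexp call reports only the first match.
regexp {(RON)_([0-9]+)} $thedata whole name number
puts "first match: $whole (name=$name, number=$number)"

# -indices shows where in the string that match lies.
regexp -indices {(RON)_([0-9]+)} $thedata span
puts "span: $span"

# -all -inline returns every match together with its submatches.
set all [regexp -all -inline {(RON)_([0-9]+)} $thedata]
puts $all
```

The span covers only the six characters of RON_98 on the first line, and the -all -inline list has three entries per match.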
Even if you apply your stated regex to multiple lines, it will not match more than the first occurrence of what you stated, and this is "RON_98". It will stop after the last digit of the first match. You could even force it to stop at the end of a line by using $ at the end of your regex; in Tcl, $ matches at the end of each line only when regexp is given the -line (or -lineanchor) option, otherwise it matches only at the end of the string.
For reference, the [0-9] can be written more compactly as \d (digit):
(RON)_(\d*)
is easier to read.
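The effect of a trailing $ in Tcl depends on the -line option, as this sketch shows (thedata is again assumed to hold the question's three lines; the other variable names are invented for the example):

```tcl
set thedata "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"

# Without -line, $ only matches at the end of the whole string,
# so the match is the one on the last line.
regexp {(RON)_(\d+)$} $thedata -> nameA numberA

# With -line, $ also matches at the end of each line,
# so the first line already satisfies the pattern.
regexp -line {(RON)_(\d+)$} $thedata -> nameB numberB

puts "without -line: $numberA"
puts "with -line:    $numberB"
```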

How to implement a NFA or DFA based regexp matching algorithm to find all matches?

The matches can be overlapped.
But if multiple matches are found starting from the same position, pick the shortest one.
For example, to find the regexp pattern "a.*d" in the string "abcabdcd", the answer should be {"abcabd", "abd"}. And "abcabdcd" and "abdcd" should not be included.
Most RE engines only match an RE once and greedily by default, and standard iteration strategies built around them tend to restart the search after the end of the previous match. To do other than that requires some extra trickery. (This code is Tcl, but you should be able to replicate it in many other languages.)
proc matchAllOverlapping {RE string} {
    set matches {}
    set nonGreedyRE "(?:${RE}){1,1}?"
    set idx 0
    while {[regexp -indices -start $idx $nonGreedyRE $string matchRange]} {
        lappend matches [string range $string {*}$matchRange]
        set idx [expr { [lindex $matchRange 0] + 1 }]
    }
    return $matches
}
puts [matchAllOverlapping a.*d abcabdcd]
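The role of the (?:...){1,1}? wrapper can be seen by comparing a single match with and without it. This sketch uses the example string from the question; the variable names greedy, lazy and wrapped are chosen here for illustration:

```tcl
# Default behaviour: the quantifier is greedy, so the longest match wins.
regexp {a.*d} abcabdcd greedy

# Non-greedy quantifier written directly into the pattern.
regexp {a.*?d} abcabdcd lazy

# The wrapper used by the proc: makes an arbitrary RE prefer the shortest match.
regexp {(?:a.*d){1,1}?} abcabdcd wrapped

puts "greedy:  $greedy"
puts "lazy:    $lazy"
puts "wrapped: $wrapped"
```

The wrapper is useful when the pattern is supplied by the caller and cannot be rewritten with non-greedy quantifiers by hand.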
This function is rather inefficient, but it solves your problem:
import re

def find_shortest_overlapping_matches(pattern, line):
    pat = re.compile(pattern)
    n = len(line)
    ret = []
    for start in range(n):
        # Grow the candidate substring until it first matches.
        for end in range(start + 1, n + 1):
            tmp = line[start:end]
            if pat.match(tmp) is not None:
                ret.append(tmp)
                break
    return ret

print(find_shortest_overlapping_matches("a.*d", "abcabdcd"))
Output:
['abcabd', 'abd']
The ranges assume your pattern contains at least one character and does not match an empty string. Additionally, you should consider using ? to make your patterns match non-greedily to improve performance and avoid the inner loop.