How to match nth occurrence in a string using regular expression

set test {stackoverflowa is a best solution finding site
stackoverflowb is a best solution finding site stackoverflowc is a
best solution finding sitestackoverflowd is a best solution finding
sitestackoverflowe is a best solution finding site}
regexp -all {stackoverflow} $test
The above gives 5 as output.
regexp {stackoverflow} $test
The above gives stackoverflow as the result; it matches the first occurrence of stackoverflow (i.e. the one in stackoverflowa).
My requirement is to match the 5th occurrence of stackoverflow (i.e. the one in stackoverflowe) in the string given above.
Could someone please clarify this? Thanks.
Then another one question

Try
set results [regexp -inline -all {stackoverflow.} $test]
# => stackoverflowa stackoverflowb stackoverflowc stackoverflowd stackoverflowe
puts [lindex $results 4]
I'll be back to explain this further shortly, making pancakes right now.
So.
The command returns a list (-inline) of all (-all) substrings of the string contained in test that match the string "stackoverflow" (without the quotes) plus one character, which can be any character. This list is stored in the variable results, and by indexing with 4 (because indexing is zero-based), the fifth element of this list can be retrieved (and, in this case, printed).
The dot at the end of the expression wasn't in your expression: I added it to check that I really did get the right match. You can of course omit the dot to match "stackoverflow" exactly.
ETA (from Donal's comment): in many cases it's convenient to extract not the string itself, but its position and extent within the searched string. The -indices option gives you that (I'm not using the dot in the expression now: the index list makes it obvious which one of the "stackoverflow"s I'm getting anyway):
set indices [regexp -inline -all -indices {stackoverflow} $test]
# => {0 12} {47 59} {94 106} {140 152} {186 198}
You can then use string range to get the string match:
puts [string range $test {*}[lindex $indices 4]]
The lindex $indices 4 gives me the list 186 198; the {*} prefix makes the two elements in that list appear as two separate arguments in the invocation of string range.
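The steps above can be wrapped in a small helper. This is only a sketch: the proc name nthMatch and the shortened sample string are made up for illustration, it assumes the pattern has no capture groups (with groups, the index list would also contain the submatches), and {*} needs Tcl 8.5 or later.

```tcl
# Hypothetical helper: return the n-th (1-based) match of pattern in str,
# or an empty string when there are fewer than n matches.
# Assumes the pattern contains no capturing parentheses.
proc nthMatch {pattern str n} {
    set indices [regexp -all -inline -indices -- $pattern $str]
    if {$n < 1 || $n > [llength $indices]} {
        return ""
    }
    # Each element of indices is a {first last} pair for one match
    return [string range $str {*}[lindex $indices [expr {$n - 1}]]]
}

set sample "stackoverflowa stackoverflowb stackoverflowc"
puts [nthMatch {stackoverflow.} $sample 2]
# => stackoverflowb
```

Asking for an occurrence that does not exist (for example the 9th here) returns an empty string instead of raising an error.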

Related

In Tcl how can I remove all zeroes to the left but the zeroes to the right should remain?

Folks! I ran into a problem that I can't solve by myself.
In Tcl, the numbers "08" and "09" cannot be read like the others (01, 02, 03, 04, etc.) and must be treated separately.
I can't find a way to remove all the leading zeros (I say ALL because there is more than one on the same line) while leaving the zeros on the right intact.
This may sound simple to those who are already thoroughly familiar with Tcl/Tk, but I am just starting out and still looking for more information. I have read a lot of material on the internet, including this: https://stackoverflow.com/questions/2110864/handling-numbers-with-leading-zeros-in-tcl#2111822
But nothing showed me how to eliminate all the leading zeros in one sweep.
I need a result like this: 2:9:10
I need this so that I can later manipulate the result with the expr [arithmetic expression] command.
My attempt below only removes the leading zero at the start of the string:
set time {02:09:10}
puts [regsub {^0*(.+)} $time {\1}]
# Return: 2:09:10
Can anyone give me a hand with this? Thanks in advance.
The group (^|:) matches either the beginning of the string or a colon.
0+ matches one or more zeros. Replace with the group match \1,
otherwise the colons get lost. And of course, use -all to do all of
the matches in the target string.
% set z 02:09:10
02:09:10
% regsub -all {(^|:)0+} $z {\1} x
2
% puts $x
2:9:10
%
Edit: As Barmar points out, this will reduce a field such as :00 to an empty field (leaving just the colon).
A better regex might be:
regsub -all {(^|:)0} $z {\1} x
This will only remove a single leading 0.
You're only matching the 0 at the beginning of the string, you need to match after each : as well.
puts [regsub -all {(^|:)0*([^:])} $time {\1\2}]
In general it is best to use scan $str %d to convert a decimal number with possible leading zeroes to its actual value.
But in your case this will also work (and seems simpler to me than the answers given earlier and doesn't rely on the separator being a colon):
regsub -all {0*(\d+)} $time {\1}
This will remove any number of leading zeroes, but doesn't trim 00 down to an empty string. Also trailing zeroes will not be affected.
regsub -all {0*(\d+)} {0003:000:1000} {\1} => 3:0:1000
The scan command is useful here to extract three decimal numbers out of that string:
% set time {02:09:10}
02:09:10
% scan $time {%d:%d:%d} h m s
3
% puts [list $h $m $s]
2 9 10
There are a few tricky edge cases here. Specifically, the string 02:09:10:1001:00 covers the key ones (including middle zeroes, only zeroes). We can use a single substitution command to do the work:
regsub -all {\m0+(?=\d)} $str {}
(This uses a word start anchor and lookahead constraint.)
However, I would be more inclined to use other tools for this sort of thing. For times, for example, parsing them is better done with scan:
set time "02:09:10"
scan $time "%d:%d:%d" h m s
Or, depending on what is going on, clock scan (which handles dates as well, making it more useful in some cases and less in others).
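As a rough illustration of the two approaches on the edge-case string mentioned above, here is a sketch that runs the substitution and a scan-based alternative side by side. The variable names are made up, and the scan loop assumes every colon-separated field is a plain decimal number (scan would leave n unchanged on a non-numeric field).

```tcl
set str "02:09:10:1001:00"

# Approach 1: strip leading zeros with the word-start anchor and lookahead
set viaRegsub [regsub -all {\m0+(?=\d)} $str {}]

# Approach 2: split on the separator and let scan parse each field as decimal
set fields [list]
foreach part [split $str :] {
    scan $part %d n
    lappend fields $n
}
set viaScan [join $fields :]

puts $viaRegsub
puts $viaScan
# both lines print 2:9:10:1001:0
```

The scan version does not depend on a fixed number of fields, at the cost of a loop.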

Need explanation of tcl regexp inline example in the man page please

While trying to understand regexp and its -inline option, I saw this example but couldn't understand how it works.
Link to the man page is: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm#M13
There, under the -inline option, this example is given:
regexp -inline -- {\w(\w)} " inlined "
=> {in n}
regexp -all -inline -- {\w(\w)} " inlined "
=> {in n li i ne e}
How does this "{\w(\w)}" yield "{in n}"? Can someone explain please.
Appreciate the help.
Thanks
If -inline but not -all is given, regexp returns a list consisting of one value for the entire region matched and one value for each submatch (regions captured by parentheses). To see what the entire match is, ignore the parentheses: the pattern is now {\w\w}, matching the first two word characters in the string (in). The first submatch is what you get if you skip one word character (the \w outside the parentheses) and then capture the next word character (the \w inside the parentheses), getting n.
If both -inline and -all are given, regexp does this repeatedly, restarting at the first character beyond the last entire match.
I think that to understand -inline, you must first understand that -inline puts the matches (and submatches) in a list. Because if you had...
regexp -- {\w(\w)} " inlined " m1 m2
You will have...
% puts $m1
in
% puts $m2
n
The whole match, in, is stored in m1, while the submatch from the capture group, n, is stored in m2.
Putting those in a list (i.e. when using -inline) will give {in n}.
When you have -all and -inline at the same time (assuming that you already know that -all retrieves all non-overlapping matches), you can no longer use variable names after the input string. Instead you get a list containing all the matches and submatches. If I name them m and s (for match and submatch respectively), you have:
in n li i ne e
m  s m  s m  s
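The m/s pairing above can be walked with a two-variable foreach. A small sketch, with the variable names chosen only for illustration:

```tcl
set pairs [regexp -all -inline -- {\w(\w)} " inlined "]

# Consume the flat list two elements at a time: whole match, then submatch
foreach {match sub} $pairs {
    puts "match=$match submatch=$sub"
}
# match=in submatch=n
# match=li submatch=i
# match=ne submatch=e
```

The trailing d is left over because the pattern needs two word characters per match.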

need help in tcl command usage for regsub

I am new to Tcl and have run into the following issue when using regsub. Consider this scenario:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
regsub -all ${test2}_[1-9] $test1 [list] test1
I expected $test1 to end up as [prefix_abc_3 AAA_0].
However, regsub also removed the partially matching part of prefix_abc_3. Does anyone have an idea how to regsub only the exactly matching words in a list?
I tried to find a solution on the net but could not get any clues or hints. I would appreciate any help.
\m and \M in regexps match the beginning and end of a word respectively. But you don't have a string of words in test1, but a list of elements, and sometimes there's a difference so don't mix the two. regsub only handles strings while lsearch works with lists:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
set test1 [lsearch -all -inline -not -regexp $test1 "^${test2}_\[1-9\]\$"]
If the pattern is that simple, you can use the -glob option (the default) instead of -regexp and maybe save some processor time.
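A sketch of the -glob variant mentioned above, using the same sample data. The variable name filtered is only for illustration. Glob patterns have to match the whole element, so no anchors are needed, but the square brackets still have to be escaped inside double quotes.

```tcl
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc

# -not inverts the match; -all -inline returns every non-matching element
set filtered [lsearch -all -inline -not -glob $test1 "${test2}_\[1-9\]"]
puts $filtered
# => prefix_abc_3 AAA_0
```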
What exactly did you execute?
When I type the commands above into tclsh, it displays an error -
% set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
prefix_abc_3 abc_1 abc_2 AAA_0
% set test2 abc
abc
% regsub -all ${test2}_[1-9] $test1 [list] test1
invalid command name "1-9"
I'm unsure what you are trying to do. You start by initialising test1 as a list. You then treat it as a string by passing it to regsub. This is a completely legal thing to do, but may indicate that you are confused by something. Are you trying to test your substitution by applying it four times, to each of prefix_abc_3, abc_1, abc_2 and AAA_0? You can certainly do that the way you are, but a more natural way would be to do
foreach test $test1 {
    regsub $pattern $test [list] testResult
    puts stdout $testResult
}
Then again, what are you trying to achieve with your substitution? It looks as though you are trying to replace the string abc with a null string, i.e. remove it altogether. Passing [list] as a null string is perfectly valid, but again may indicate confusion between lists and strings.
To achieve the result you want, all you need to do is add a leading space to your pattern, pass a space as the substitution string and escape the square brackets, i.e.
regsub -all " ${test2}_\[1-9\]" $test1 " " test1
but I suspect that this is a made-up example and you're really trying to do something slightly different.
Edit
To obtain a list that contains just those list entries that don't exactly match your pattern, I suggest
proc removeExactMatches {input} {
    set result [list]; # Initialise the result list
    foreach inputElement $input {
        if {![regexp {^abc_[0-9]$} $inputElement]} {
            lappend result $inputElement
        }
    }
    return $result
}
set test1 [removeExactMatches [list prefix_abc_3 abc_1 abc_2 AAA_0]]
Notes:
i) I don't use regsub at all.
ii) Although it's safe and legal to switch around between lists and strings, it all takes time and it obscures what I'm trying to do, so I avoid it wherever possible. You seem to have a list of strings and you want to remove some of them, so that's what I use in my suggested solution. The regular expression commands in Tcl handle strings so I pass them strings.
iii) To ensure that the list elements match exactly, I anchor the pattern to the start and end of the string that I'm matching against using ^ and $.
iv) To prevent the interpreter from recognising the [1-9] in the regular expression pattern and trying to execute a (non-existent) command 1-9, I enclose the whole pattern string within curly brackets.
v) For greater generality, I might want to pass the pattern to the proc as well as the input list (of strings), in that case, I'd do
proc removeExactMatches {inputPattern input} {
.
.
.
set pattern "^"
append pattern $inputPattern
append pattern "\$"
.
.
.
if {![regexp $pattern $inputElement]} {
.
.
.
}
set test1 [removeExactMatches {abc_[1-9]} {prefix_abc_3 abc_1 abc_2 AAA_0}]
to minimise the number of characters that had to be escaped. (Actually I probably wouldn't use the quotation marks for the start and end anchors within the proc - they aren't really needed and I'm a lazy typist!)
Looking at your original question, it seems that you might want to vary only the abc part of the pattern, in which case you might want to just pass that to your proc and append the _[0-9] as well as the anchors within it - don't forget to escape the square brackets or use curly brackets if you go down this route.
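For readers who want the elided parts of note v) spelled out, here is one possible complete version. It is a sketch and not necessarily what the answer's author had in mind: the (?:...) wrapper is an addition that keeps the anchors applying to the whole pattern even if the caller passes an alternation.

```tcl
# Sketch: return the elements of input that do NOT match inputPattern exactly
proc removeExactMatches {inputPattern input} {
    # Anchor the caller's pattern so it must cover the whole element
    set pattern "^(?:$inputPattern)\$"
    set result [list]
    foreach inputElement $input {
        if {![regexp -- $pattern $inputElement]} {
            lappend result $inputElement
        }
    }
    return $result
}

set test1 [removeExactMatches {abc_[1-9]} {prefix_abc_3 abc_1 abc_2 AAA_0}]
puts $test1
# => prefix_abc_3 AAA_0
```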

greedy matching in regexp

I have the following output:
Player name: RON_98
Player name: RON_97
player name: RON_96
I need to get the RON part and the numeric part after it (for example 98). I used the following regexp: regexp "(RON)_(\[0-9]*)". Will this also match the RON_96 on the last line? Since "*" is a greedy match, how do I match only the first line of the output? Is there something like (RON)_(digits only) that prevents it from matching the rest of the line?
Always put regular expressions in braces in Tcl.
It's not technically necessary (you can use Tcl's language definition to exactly work out what backslashes would be needed to do it any other way) but it's simpler in all cases that you're likely to encounter normally.
The examples below will use this.
Regular expressions start matching as soon as they can. Then, under normal (greedy) circumstances they match as much text as they can. Thus, with your sample code and text, the matcher starts trying to match at the R on the first line and goes on to consume up to the 8, at which point it has a match and stops. You can verify this by asking regexp to report the indices into the string where the match happened instead of the substring that was matched (via the -indices option, documented on the manual page).
To get all the matches in a string, you have two options:
Pass the -all -inline options to regexp and process the list of results with foreach:
# Three variables in foreach; one for whole match, one for each substring
foreach {a b c} [regexp -all -inline {(RON)_([0-9]*)} $thedata] {
    puts "matched '$a', with b=$b and c=$c"
}
Use the -indices option together with the -start option, all in a while loop, so you step through the string:
set idx 0
while {[regexp -start $idx -indices {(RON)_([0-9]*)} $thedata a b c]} {
    puts "matched at '$a', with subranges '$b' and '$c'"
    set extracted [string range $thedata {*}$c]
    puts "the extracted value is '$extracted'"
    # Advance the place where the next search will start from
    set idx [expr {[lindex $a 1] + 1}]
}
I'd normally recommend using the first option; it's much easier to use! Sometimes the second is better as it provides more information and uses less intermediate storage, but it's also much trickier to get right.
Even if you set your regex to match across multiple lines, it will not match more than the first occurrence of the pattern, which is "RON_98". It stops after the last digit of the first match. You could even force it to stop at the end of a line by adding $ to the end of your regex (in Tcl, $ matches at the end of each line only when the -line or -lineanchor option is given; otherwise it matches the end of the string).
For reference, [0-9] can be written more concisely as \d (digit):
(RON)_(\d*)
which is easier to read.
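To see the "first match only" behaviour in practice, here is a small sketch using the sample output from the question. The variable names are illustrative, and \d+ is used instead of \d* on the assumption that at least one digit is required.

```tcl
set thedata "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"

# Without -all, regexp stops at the first match and fills the match variables
if {[regexp {(RON)_(\d+)} $thedata whole name number]} {
    puts "$name $number"
}
# => RON 98
```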

How do I extract all matches with a Tcl regex?

Hi everybody, I am looking for help with a regular expression. My problem is to extract all the hex numbers of the form H'xxxx. I used the regexp below, but I only get one number instead of all the hex values. How can I get every hex number from this string?
set hex "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set res [regexp -all {H'([0-9A-Z]+)&} $hex match hexValues]
puts "$res H$hexValues"
The output I am getting is 5 H4D52
On -all -inline
From the documentation:
-all : Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.
-inline : Causes the command to return, as a list, the data that would otherwise be placed in match variables. When using -inline, match variables may not be specified. If used with -all, the list will be concatenated at each iteration, such that a flat list is always returned. For each match iteration, the command will append the overall match data, plus one element for each subexpression in the regular expression.
Thus to return all matches --including captures by groups-- as a flat list in Tcl, you can write:
set matchTuples [regexp -all -inline $pattern $text]
If the pattern has groups 0…N-1, then each match is an N-tuple in the list. Thus the number of actual matches is the length of this list divided by N. You can then use foreach with N variables to iterate over each tuple of the list.
If N = 2 for example, you have:
set numMatches [expr {[llength $matchTuples] / 2}]
foreach {group0 group1} $matchTuples {
...
}
References
regular-expressions.info/Tcl
Sample code
Here's a solution for this specific problem, annotated with output as comments (see also on ideone.com):
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set pattern {H'([0-9A-F]{4})}
set matchTuples [regexp -all -inline $pattern $text]
puts $matchTuples
# H'22EF 22EF H'2354 2354 H'4BD4 4BD4 H'4C4B 4C4B H'4D52 4D52 H'4DC9 4DC9
# \_________/ \_________/ \_________/ \_________/ \_________/ \_________/
#  1st match   2nd match   3rd match   4th match   5th match   6th match
puts [llength $matchTuples]
# 12
set numMatches [expr {[llength $matchTuples] / 2}]
puts $numMatches
# 6
foreach {whole hex} $matchTuples {
puts $hex
}
# 22EF
# 2354
# 4BD4
# 4C4B
# 4D52
# 4DC9
On the pattern
Note that I've changed the pattern slightly:
Instead of [0-9A-Z]+, something like [0-9A-F]{4} is more specific: it matches exactly 4 hexadecimal digits
If you insist on matching the &, then the last hex string (H'4DC9 in your input) cannot be matched
This explains why you get 4D52 in the original script, because that's the last match followed by &
Maybe get rid of the &, or use (&|$) instead, i.e. either a & or the end of the string $.
References
regular-expressions.info/Finite Repetition, Anchors
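If the trailing & should stay in the pattern, the (&|$) idea can be sketched as below. This variant uses a non-capturing group (?:&|$), which is a small change from the suggestion above, so that each match still contributes exactly two list elements. The variable hexValues is reused from the question for illustration.

```tcl
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"

# Require a & or the end of the string after the four hex digits;
# (?:...) groups without adding another submatch to the result list
set pattern {H'([0-9A-F]{4})(?:&|$)}

set hexValues [list]
foreach {whole hex} [regexp -all -inline -- $pattern $text] {
    lappend hexValues $hex
}
puts $hexValues
# => 22EF 2354 4BD4 4C4B 4D52 4DC9
```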
I'm not Tclish, but I think you need to use both the -inline and -all options:
regexp -all -inline {H'([0-9A-Z]+)&} $string
EDIT: Here it is again, this time with a corrected regex (see the comments):
regexp -all -inline {H'[0-9A-F]+&} $string