greedy matching in regexp

I have the following output:
Player name: RON_98
Player name: RON_97
player name: RON_96
I need to get the RON part and the digits after it (for example, 98). I used the following regexp: regexp "(RON)_(\[0-9]*)". Will this match the RON_96 on the last line? "*" is a greedy match, so how do I match only the first line of the output? Is there something like (RON)_(only match digits) that prevents it from matching the rest of the line?

Always put regular expressions in braces in Tcl.
It's not technically necessary (you can use Tcl's language definition to exactly work out what backslashes would be needed to do it any other way) but it's simpler in all cases that you're likely to encounter normally.
The examples below will use this.
Regular expressions start matching as soon as they can. Then, under normal (greedy) circumstances they match as much text as they can. Thus, with your sample code and text, the matcher starts trying to match at the R on the first line and goes on to consume up to the 8, at which point it has a match and stops. You can verify this by asking regexp to report the indices into the string where the match happened instead of the substring that was matched (via the -indices option, documented on the manual page).
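As a quick check of that claim, here is a small sketch using -indices on the three sample lines. The variable names (thedata, whole, name, digits) are only illustrative, and {*} assumes Tcl 8.5 or later.

```tcl
set thedata "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"
# -indices stores {first last} character positions instead of substrings
regexp -indices {(RON)_([0-9]*)} $thedata whole name digits
puts "whole=$whole name=$name digits=$digits"
# The match ends at index 18, the "8" of RON_98 on the first line
set matched [string range $thedata {*}$whole]
puts $matched
```

This should print whole=13 18 name=13 15 digits=17 18 followed by RON_98, so the later lines are never touched.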
To get all the matches in a string, you have two options:
Pass the -all -inline options to regexp and process the list of results with foreach:
# Three variables in foreach; one for whole match, one for each substring
foreach {a b c} [regexp -all -inline {(RON)_([0-9]*)} $thedata] {
    puts "matched '$a', with b=$b and c=$c"
}
Use the -indices option together with the -start option, all in a while loop, so you step through the string:
set idx 0
while {[regexp -start $idx -indices {(RON)_([0-9]*)} $thedata a b c]} {
    puts "matched at '$a', with subranges '$b' and '$c'"
    set extracted [string range $thedata {*}$c]
    puts "the extracted value is '$extracted'"
    # Advance the place where the next search will start from
    set idx [expr {[lindex $a 1] + 1}]
}
I'd normally recommend using the first option; it's much easier to use! Sometimes the second is better as it provides more information and uses less intermediate storage, but it's also much trickier to get right.

Even if you let your regex match across multiple lines, it will not match more than the first occurrence of the pattern, which is "RON_98". It stops after the last digit of the first match. You could even force it to stop at the end of a line by putting $ at the end of your regex. Note that in Tcl, $ matches at the end of each line only when the -line option is given; by default it matches only at the end of the whole string.
For reference, [0-9] can be written more simply as \d (digit):
(RON)_(\d*)
which is easier to read.
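A minimal sketch of the end-of-line anchor in Tcl, assuming the same three sample lines. It uses \d+ rather than \d* so that at least one digit is required (my choice, not part of the original pattern), and passes -line so that $ can match at the end of each line.

```tcl
set thedata "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"
# -line makes $ match at the end of every line, not just at the end of the string
if {[regexp -line {(RON)_(\d+)$} $thedata -> name digits]} {
    puts "name=$name digits=$digits"
}
```

This should print name=RON digits=98, since regexp without -all stops at the first match.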

Related

In Tcl how can I remove all zeroes to the left but the zeroes to the right should remain?

Folks! I ran into a problem that I can't solve by myself.
In Tcl, the numbers "08" and "09" cannot be read like the others (01, 02, 03, 04, etc.) and must be treated separately.
I can't find a way to remove all the leading zeros (I say ALL because there is more than one on the same line) while leaving the zeros on the right intact.
It may sound simple to those who are already thoroughly familiar with Tcl/Tk. But I am just starting out and looking for more information about Tcl/Tk. I have read a lot of material on the internet, including this: https://stackoverflow.com/questions/2110864/handling-numbers-with-leading-zeros-in-tcl#2111822
But nothing shows me how to do this in one sweep, eliminating all leading zeros.
I need you to give me a return like this: 2:9:10
I need this to later manipulate the result with the expr [arithmetic expression] command.
In this example it just removes a single leading zero:
set time {02:09:10}
puts [regsub {^0*(.+)} $time {\1}]
# Return: 2:09:10
Can anyone give me a hand with this? Thanks in advance.

The group (^|:) matches either the beginning of the string or a colon.
0+ matches one or more zeros. Replace with the group match \1,
otherwise the colons get lost. And of course, use -all to do all of
the matches in the target string.
% set z 02:09:10
02:09:10
% regsub -all {(^|:)0+} $z {\1} x
2
% puts $x
2:9:10
%
Edit: As Barmar points out, this will change :00 to an empty string.
A better regex might be:
regsub -all {(^|:)0} $z {\1} x
This will only remove a single leading 0.
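To see the difference between the two patterns, here is a small sketch on an input with all-zero fields. The sample value 00:05:00 and the variable names are made up for illustration.

```tcl
set z 00:05:00
# 0+ strips every zero at the start of a field, so all-zero fields vanish
set stripped_all [regsub -all {(^|:)0+} $z {\1}]
puts $stripped_all   ;# prints :5:
# a single 0 strips at most one zero per field
set stripped_one [regsub -all {(^|:)0} $z {\1}]
puts $stripped_one   ;# prints 0:5:0
```

The flip side is that the single-zero pattern leaves a field like 007 as 07, so it only fits two-digit fields.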

You're only matching the 0 at the beginning of the string, you need to match after each : as well.
puts [regsub -all {(^|:)0*([^:])} $time {\1\2}]

In general it is best to use scan $str %d to convert a decimal number with possible leading zeroes to its actual value.
But in your case this will also work (and seems simpler to me than the answers given earlier and doesn't rely on the separator being a colon):
regsub -all {0*(\d+)} $time {\1}
This will remove any number of leading zeroes, but doesn't trim 00 down to an empty string. Also trailing zeroes will not be affected.
regsub -all {0*(\d+)} {0003:000:1000} {\1} => 3:0:1000
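A short sketch of both suggestions, with made-up sample values: the same regsub on a dash-separated string, and scan with %d on a single field.

```tcl
# Same pattern, different separator
set dashed [regsub -all {0*(\d+)} {02-09-10} {\1}]
puts $dashed                   ;# prints 2-9-10
# scan with %d reads one field as decimal, leading zero included
set minutes [scan 09 %d]
puts [expr {$minutes + 1}]     ;# prints 10
```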

The scan command is useful here to extract three decimal numbers out of that string:
% set time {02:09:10}
02:09:10
% scan $time {%d:%d:%d} h m s
3
% puts [list $h $m $s]
2 9 10

There are a few tricky edge cases here. Specifically, the string 02:09:10:1001:00 covers the key ones (including middle zeroes, only zeroes). We can use a single substitution command to do the work:
regsub -all {\m0+(?=\d)} $str {}
(This uses a word start anchor and lookahead constraint.)
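Running that substitution on the edge-case string looks like this (a sketch; str and cleaned are just illustrative variable names):

```tcl
set str 02:09:10:1001:00
# \m anchors at the start of each number; (?=\d) keeps the last digit of a field
set cleaned [regsub -all {\m0+(?=\d)} $str {}]
puts $cleaned   ;# prints 2:9:10:1001:0
```

Zeros in the middle of 1001 are left alone because they are not at a word start, and 00 shrinks to a single 0 because the lookahead demands a digit after the removed zeros.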
However, I would be more inclined to use other tools for this sort of thing. For times, for example, parsing them is better done with scan:
set time "02:09:10"
scan $time "%d:%d:%d" h m s
Or, depending on what is going on, clock scan (which handles dates as well, making it more useful in some cases and less in others).
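For the clock scan route, a minimal sketch, assuming Tcl 8.5 or later and that seconds since midnight is what you want. The -base 0 and -timezone :UTC options are my additions to make the result independent of today's date and the local time zone.

```tcl
# With base 0 in UTC the date part is 1970-01-01, so the result is
# just the number of seconds since midnight
set secs [clock scan "02:09:10" -format "%H:%M:%S" -base 0 -timezone :UTC]
puts $secs   ;# 2*3600 + 9*60 + 10 = 7750
```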

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
    puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
In the regexp in the script, if I specify "number n1", it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Right now it always prints the last one (the final line, numbers 56,4,5,5:) as output. How do I resolve this issue?
Thanks,
Kumar

Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
Documentation: regular expression syntax, regexp
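Here is a runnable sketch of the non-greedy pattern applied to both keys, using a shortened version of the question's data. The variable names first and second are illustrative.

```tcl
set test {
a for apple
number n1
numbers 2,3,4,5,6
d for doctor
number n2
numbers 56,4,5,5
}
# .*? makes the whole RE non-greedy; the trailing \n stops the capture
# from collapsing to a zero-length match
regexp {number n1.*?(numbers .*)\n} $test -> first
regexp {number n2.*?(numbers .*)\n} $test -> second
puts "n1: $first"
puts "n2: $second"
```

This should print n1: numbers 2,3,4,5,6 and n2: numbers 56,4,5,5. The n2 case only works because the braced string ends with a newline before the closing brace.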

To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
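One common way to do that tidying, sketched under my own assumptions: backslash-escape every non-word character of the variable before splicing it into the pattern. The key n1.5 and the data string are hypothetical, chosen so that an unescaped dot would match the wrong entry.

```tcl
set target "n1.5"   ;# hypothetical key containing an RE metacharacter
# Put a backslash in front of every non-word character so it matches literally
set escaped [regsub -all {\W} $target {\\&}]
set data "number n1x5:numbers 1,2:number n1.5:numbers 3,4:"
if {[regexp "number ${escaped}:(numbers \[^:\]*)" $data -> y]} {
    puts "numbers are : $y"
}
```

This should print numbers are : numbers 3,4, because the escaped dot no longer matches the x in n1x5.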

Need explanation of tcl regexp inline example in the man page please

While trying to understand regexp and the use of -inline, I saw this example but couldn't understand how it works.
Link to the man page is: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm#M13
In there, under the -inline option, this example is given:
regexp -inline -- {\w(\w)} " inlined "
=> {in n}
regexp -all -inline -- {\w(\w)} " inlined "
=> {in n li i ne e}
How does this "{\w(\w)}" yield "{in n}"? Can someone explain please.
Appreciate the help.
Thanks

If -inline but not -all is given, regexp returns a list consisting of one value for the entire region matched and one value for each submatch (regions captured by parentheses). To see what the entire match is, ignore the parentheses: the pattern is now {\w\w}, matching the first two word characters in the string (in). The first submatch is what you get if you skip one word character (the \w outside the parentheses) and then capture the next word character (the \w inside the parentheses), getting n.
If both -inline and -all are given, regexp does this repeatedly, restarting at the first character beyond the last entire match.

I think that to understand -inline, you must first understand that -inline puts the matches (and submatches) in a list. Because if you had...
regexp -- {\w(\w)} " inlined " m1 m2
You will have...
% puts $m1
in
% puts $m2
n
As the whole match in is stored in m1 while the submatch of the capture group n is stored in m2.
Putting those in a list (i.e. when using -inline) will give {in n}.
When you now have -all and -inline at the same time (assuming that you already know that -all retrieves all non-overlapping matches in regexp), you can no longer use variable names after the input string, so you get a flat list containing all the matches and submatches. If I name them m and s (for match and submatch respectively), you have:
in n li i ne e
m  s m  s m  s
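A small sketch of walking that flat list two elements at a time, which is how the match/submatch pairing is usually consumed. The names m, s and pairs are illustrative.

```tcl
set pairs {}
# foreach with two loop variables takes one match and one submatch per step
foreach {m s} [regexp -all -inline -- {\w(\w)} " inlined "] {
    puts "match=$m submatch=$s"
    lappend pairs [list $m $s]
}
```

This should print three lines, for in/n, li/i and ne/e. The final d has no second word character after it, so it produces no match.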

How to match nth occurrence in a string using regular expression

set test {stackoverflowa is a best solution finding site
stackoverflowb is a best solution finding site stackoverflowc is a
best solution finding sitestackoverflowd is a best solution finding
sitestackoverflowe is a best solution finding site}
regexp -all {stackoverflow} $test
The above gives "5" as output.
regexp {stackoverflow} $test
The above gives stackoverflow as the result; here it matches the first occurrence of stackoverflow (i.e. stackoverflowa).
My requirement is to match the 5th occurrence of stackoverflow (i.e. stackoverflowe) in the string given above.
Could someone please help with this? Thanks.
Then another one question

Try
set results [regexp -inline -all {stackoverflow.} $test]
# => stackoverflowa stackoverflowb stackoverflowc stackoverflowd stackoverflowe
puts [lindex $results 4]
I'll be back to explain this further shortly, making pancakes right now.
So.
The command returns a list (-inline) of all (-all) substrings of the string contained in test that match the string "stackoverflow" (less quotes) plus one character, which can be any character. This list is stored in the variable results, and by indexing with 4 (because indexing is zero-based), the fifth element of this list can be retrieved (and, in this case, printed).
The dot at the end of the expression wasn't in your expression: I added it to check that I really did get the right match. You can of course omit the dot to match "stackoverflow" exactly.
ETA (from Donal's comment): in many cases it's convenient to extract not the string itself, but its position and extent within the searched string. The -indices option gives you that (I'm not using the dot in the expression now: the index list makes it obvious which one of the "stackoverflow"s I'm getting anyway):
set indices [regexp -inline -all -indices {stackoverflow} $test]
# => {0 12} {47 59} {94 106} {140 152} {186 198}
You can then use string range to get the string match:
puts [string range $test {*}[lindex $indices 4]]
The lindex $indices 4 gives me the list 186 198; the {*} prefix makes the two elements in that list appear as two separate arguments in the invocation of string range.
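If you need this more than once, the idea can be wrapped in a small helper. This is only a sketch: nthMatch is a hypothetical name, it counts from 1, and it assumes the pattern has no capturing parentheses (otherwise the -inline list also contains submatches and the index would be off).

```tcl
# Return the n-th (1-based) match of re in str, or an empty string if
# there are fewer than n matches. Assumes re has no capture groups.
proc nthMatch {re str n} {
    set matches [regexp -inline -all -- $re $str]
    return [lindex $matches [expr {$n - 1}]]
}

set sample "stackoverflowa x stackoverflowb y stackoverflowc"
puts [nthMatch {stackoverflow.} $sample 2]   ;# prints stackoverflowb
```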

Regular expression in TCL

I have to parse this format using regexp in TCL.
Here is the format
wl -i eth1 country
Q1 (Q1/27) Q1
I'm trying to use the word country as a keyword to parse the format 'Q1 (Q1/27) Q1'.
I can do it if it is in a same line as country using the following regexp command.
regexp {([^country]*)country(.*)} $line match test country_value
But how can I tackle the above case?

Firstly, the regular expression you are using isn't doing quite the right thing in the first place, because [^country] matches a set of characters that consists of everything except the letters in country (so it matches from the h in eth1 onwards only, given the need to have country afterwards).
By default, Tcl uses the whole string to match against and newlines are just ordinary characters. (There is an option to make them special by also specifying -line, but it's not on by default.) This means that if I use your whole string and feed it through regexp with your regular expression, it works (well, you probably want to string trim $country_value at some point). This means that your real problem is in presenting the right string to match against.
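To illustrate the whole-string case, here is a sketch that matches across the newline in one go. The pattern is a simplified one of my own (it looks for the keyword followed by whitespace), not the one from the question, and output is an illustrative variable name.

```tcl
set output "wl -i eth1 country\nQ1 (Q1/27) Q1"
# Without -line, both \s and . match the newline, so one match spans both lines
if {[regexp {country\s+(.*)} $output -> country_value]} {
    set country_value [string trim $country_value]
    puts $country_value
}
```

This should print Q1 (Q1/27) Q1.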
If you're presenting lines one at a time (read from a file, perhaps) and you want to use a match against one line to trigger processing in the next, you need some processing outside the regular expression match:
set found_country 0
while {[gets $channel line] >= 0} {
    if {$found_country} {
        # Process the data...
        puts "post-country data is $line"
        # Reset the flag
        set found_country 0
    } elseif {[regexp {(.*) country$} $line -> leading_bits]} {
        # Process some leading data...
        puts "pre-country data is $leading_bits"
        # Set the flag to handle the next line specially
        set found_country 1
    }
}
If you want to skip blank lines completely, put an if {$line eq ""} continue before the if {$found_country} ....
