Tcl Regexp confusion

I have the following code in a Tcl script:
set a_list {Hello1.my.name.is.not.adam.go.away
Hello2.my.name.is.not.adam
Hello3.my.name.is.not.adam.leave.me}
foreach a $a_list {
    if {[regexp adam [regsub {.*\.} $a {}]] == 1} {puts $a}
}
My understanding is that this looks for the string adam in each element of $a_list and matches only when adam is the last dot-separated component.
For example
Hello1.my.name.is.not.adam.go.away ---> NO MATCH
Hello2.my.name.is.not.adam ---> MATCH
Hello3.my.name.is.not.adam.leave.me ---> NO MATCH
The problem I am facing is that I want to match adam anywhere in the string and then strip everything from adam onward, including adam itself.
For example
Hello1.my.name.is.not.adam.go.away ---> MATCH
Hello2.my.name.is.not.adam ---> MATCH
Hello3.my.name.is.not.adam.leave.me ---> MATCH
In all cases above, it should change the string to
Hello1.my.name.is.not ---> MATCH
Hello2.my.name.is.not ---> MATCH
Hello3.my.name.is.not ---> MATCH
Your help is appreciated.
Thanks

Method 1 :
With simple string commands, we can get the desired result.
set input {Hello1.my.name.is.not.adam.go.away, Hello2.my.name.is.not.adam, Hello3.my.name.is.not.adam.leave.me noobuntu dinesh}
foreach elem $input {
    # Get the index of the word 'adam' in each element
    set idx [string first "adam" $elem]
    # If the word is not present, 'idx' is -1
    if {$idx != -1} {
        # string range returns the substring between the given indices
        puts "->[string range $elem 0 [expr {$idx-1}]]"
    }
}
This gives the following output:
->Hello1.my.name.is.not.
->Hello2.my.name.is.not.
->Hello3.my.name.is.not.
Method 2:
If you prefer a purely regex-based solution, regsub can do the same job:
set input {Hello1.my.name.is.not.adam.go.away, Hello2.my.name.is.not.adam, Hello3.my.name.is.not.adam.leave.me noobuntu dinesh}
foreach elem $input {
    if {[regsub {(.*?)adam.*$} $elem {\1} result]} {
        puts "->$result"
    }
}
This produces the output:
->Hello1.my.name.is.not.
->Hello2.my.name.is.not.
->Hello3.my.name.is.not.
Reference : string, regsub

The simplest way to strip the word adam and everything after it from each element of a list is a plain regsub combined with lmap:
% lmap s $a_list {regsub {\madam\M.*} $s ""}
Hello1.my.name.is.not. Hello2.my.name.is.not. Hello3.my.name.is.not.
The \m only matches at the start of a word, and the \M only matches at the end of a word. It works because if the word isn't there, regsub does nothing.
Using Tcl 8.5? You won't have lmap, and will need to do this instead:
set result {}
foreach s $a_list {
    lappend result [regsub {\madam\M.*} $s ""]
}
# The altered list is now in $result
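The word anchors matter when a longer word merely contains adam. The sketch below (Tcl 8.6 for lmap) uses a made-up third element, foo.adamant.bar, to show that it is left untouched. It also adds an optional \.? in front of the pattern, which is an assumption about the asker's desired output (no trailing dot), not part of the answer above.

```tcl
# Sample data; the third element is hypothetical and only contains "adam"
# as part of the longer word "adamant".
set a_list {Hello1.my.name.is.not.adam.go.away Hello2.my.name.is.not.adam foo.adamant.bar}

# \.? also removes the dot in front of "adam"; \m and \M anchor the word.
set trimmed [lmap s $a_list {regsub {\.?\madam\M.*} $s ""}]

puts $trimmed
# Hello1.my.name.is.not Hello2.my.name.is.not foo.adamant.bar
```

Dropping the \.? restores the behaviour shown above, where the trailing dot is kept.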

Related

Using a capturing group of the match pattern of a regsub in the substitution itself?

In the code below, $html is a string of HTML. The code captures the full match and the capturing group for each match in a list. Then, if the list is not empty, it iterates through the list to replace the span tags with em tags along with the original text that was between them.
For example, if the HTML is:
This is <span class='add'>span 1</span> and this is <span class='add'>span 2</span>.
then $a would be a list of length 4: {<span class='add'>span 1</span>} {span 1} {<span class='add'>span 2</span>} {span 2}.
The sample code generates:
This is <em>span 1</em> and this is <em>span 2</em>.
as expected; but it seems that this must be an inefficient way to do this and, somehow, the capturing group should be usable directly within the regsub expression.
Is this true and how is it done?
Something like:
set html [regsub "<span class='add'>(.+?)</span>" $html "<em>...</em>"]
where the ... is something that points to the captured group.
Thank you.
set a [regexp -all -inline -- {<span class='add'>(.+?)</span>} $html]
if { [llength $a] > 0 } {
    foreach {x y} $a {
        set html [regsub "<span class='add'>${y}</span>" $html "<em>${y}</em>"]
    }
}

You can use a backreference in the replacement:
regsub -all {<span class='add'>(.+?)</span>} $html {<em>\1</em>}
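EDIT: To trim leading and trailing spaces from the captured string, match them outside the parentheses. Because the first quantifier in a Tcl regular expression sets the greediness of the whole match, putting a greedy \s* in front of (.+?) would let a single match run to the last </span>. The version below therefore uses only greedy quantifiers and a negated class, which assumes the span text contains no < character:
regsub -all {<span class='add'>\s*([^<]*[^<\s])\s*</span>} $html {<em>\1</em>}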
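As a small illustration of the two substitution markers regsub understands, the sketch below runs the question's sample HTML through the same pattern twice: once with \1 (the first capture group) and once with & (the whole match). The variable names with_group and with_whole are only for this example.

```tcl
set html "This is <span class='add'>span 1</span> and this is <span class='add'>span 2</span>."
set re {<span class='add'>(.+?)</span>}

# \1 inserts only the text captured by the first parenthesised group.
set with_group [regsub -all $re $html {<em>\1</em>}]

# & inserts the entire matched text, tags included.
set with_whole [regsub -all $re $html {[&]}]

puts $with_group
# This is <em>span 1</em> and this is <em>span 2</em>.
puts $with_whole
# This is [<span class='add'>span 1</span>] and this is [<span class='add'>span 2</span>].
```

The replacement strings are in braces so that Tcl itself does not touch the backslash or the square brackets before regsub sees them.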

Get multiple matches with tcl regexp

How do I get all the matches in Tcl using the regexp command? For example, I have the following string:
set line "foo \"aaa\"zz bar \"aaa:ccc\" ccc"
puts $line
foo "aaa"zz bar "aaa:ccc" ccc
I now want to get aaa and aaa:ccc. There could be any number of such matches.
I tried
regexp -all -inline {"(.*)"} $line
which returns
{"aaa"zz bar "aaa:ccc"} {aaa"zz bar "aaa:ccc}
As you can see, this didn't work. What's the right way to get multiple matches and match everything within double quotes?

You can capture everything between two quotes with the pattern "([^"]*)". Use it with regexp -all -inline:
set matches [regexp -all -inline {"([^"]*)"} $line]
you will get all overall match values and all the captured substrings.
You will have to post-process the matches in order to get the final list of captures:
set line "foo \"aaa\"zz bar \"aaa:ccc\" ccc"
set matches [regexp -all -inline {"([^"]*)"} $line]
set res {}
foreach {match capture} $matches {
    lappend res $capture
}
puts $res
Output:
aaa aaa:ccc
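To see why the question's original pattern failed, the following sketch (Tcl 8.6 for lmap) compares the greedy "(.*)" with the negated-class pattern on the same line. The names greedy and precise are only for this illustration.

```tcl
set line "foo \"aaa\"zz bar \"aaa:ccc\" ccc"

# Greedy .* : the single match runs from the first quote to the last one.
set greedy [lmap {whole cap} [regexp -all -inline {"(.*)"} $line] {set cap}]

# Negated class: a capture cannot contain a quote, so each pair matches on its own.
set precise [lmap {whole cap} [regexp -all -inline {"([^"]*)"} $line] {set cap}]

puts [llength $greedy]    ;# 1
puts $precise             ;# aaa aaa:ccc
```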

An alternate method to extract the quoted strings is:
set unquoted [lmap quoted [regexp -all -inline {".*?"} $line] {string trim $quoted {"}}]
Or, split the string using quote as the delimiter, and take every second field.
set unquoted [lmap {a b} [split $line {"}] {set b}]
That gives you a trailing empty element since this split invocation results in a list with an odd number of elements.
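If the trailing empty element is unwanted, one option is to drop it with lrange. This is a minimal sketch that assumes the quotes in the line are balanced, in which case split always produces an odd number of fields and the last lmap iteration always yields an empty string.

```tcl
set line "foo \"aaa\"zz bar \"aaa:ccc\" ccc"

# Every second field of the split is quoted text; the final iteration
# runs out of fields, so b is empty there.
set unquoted [lmap {a b} [split $line {"}] {set b}]

# Drop that trailing empty element (assumes balanced quotes).
set unquoted [lrange $unquoted 0 end-1]

puts $unquoted   ;# aaa aaa:ccc
```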

Passing a match in regsub with & to a procedure (Tcl is being used)

I want to go through a comma separated string and replace matches with more comma separated elements.
For example, 5-A,B should become 1-A,2-A,3-A,4-A,5-A,B after the regsub.
The following is not working for me as & is being passed as an actual & instead of the actual match:
regsub -all {\d+\-\w+} $string [myConvertProc &]
However not attempting to pass the & and using it directly works:
regsub -all o "Hello World" &&&
> Hellooo Wooorld
Not sure what I am doing wrong in attempting to pass the value & holds to myConvertProc
Edit: I think my initial problem is the [myConvertProc &] is getting evaluated first, so I am actually passing '&' to the procedure.
How do I get around this within the regex realm? Is it possible?
Edit 2: I've already solved it using a foreach on a split list, so I'm just looking to see if this is possible within a regsub. Thanks!

You are correct in your first edit: the problem is that each argument to regsub is fully evaluated before executing the command.
One solution is to insert a command substitution string into the string, and then use subst on it:
set string [regsub -all {\d+\-\w+} $string {[myConvertProc &]}]
# -> [myConvertProc 5-A],B
set string [subst $string]
# -> 1-A,2-A,3-A,4-A,5-A,B
This will only work if there is nothing else in string that is subject to substitution (but you can of course turn off variable and backslash substitution).
The foreach solution is much better. An alternative foreach solution is to iterate over the result of regexp -indices -inline -all, but iterating over the parts of a split list is preferable if it works.
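Here is a runnable sketch of the regsub plus subst approach with variable and backslash substitution turned off. The body of myConvertProc is a guess at what the asker's procedure does (the question does not show it), and the literal $x in the input is only there to show that it survives untouched. Square brackets in the data would still be executed, so this assumes the input contains none.

```tcl
# Hypothetical stand-in for the asker's myConvertProc:
# expands "5-A" into "1-A,2-A,3-A,4-A,5-A".
proc myConvertProc {item} {
    lassign [split $item -] count suffix
    set parts {}
    for {set i 1} {$i <= $count} {incr i} {
        lappend parts "$i-$suffix"
    }
    return [join $parts ,]
}

# The literal text $x stands in for data that must not be substituted.
set string {5-A,B,$x}

# Step 1: wrap each match in a command substitution (braces keep it literal here).
set template [regsub -all {\d+-\w+} $string {[myConvertProc &]}]

# Step 2: run only command substitution; variables and backslashes are left alone.
set result [subst -novariables -nobackslashes $template]

puts $result   ;# 1-A,2-A,3-A,4-A,5-A,B,$x
```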
Update:
A typical foreach solution goes like this:
set res {}
foreach elem [split $string ,] {
    if {[regexp -- {^\d+-\w+$} $elem]} {
        lappend res [myConvertProc $elem]
    } else {
        lappend res $elem
    }
}
join $res ,
That is, you collect a result list by looking at each element in the raw list. If the element matches your requirement, you convert it and add the result to the result list. If the element doesn't match, you just add it to the result list.
It can be simplified somewhat in Tcl 8.6:
join [lmap elem [split $string ,] {
    if {[regexp -- {^\d+-\w+$} $elem]} {
        myConvertProc $elem
    } else {
        set elem
    }
}] ,
Which is the same thing, but the lmap command handles the result list for you.
Documentation: foreach, lappend, lmap, regexp, regsub, set, split, subst

Having issue with back reference in TCL

I have the following code:
set a "10.20.30.40"
regsub -all {.([0-9]+).([0-9]+).} $a {\2 \1} b
I am trying to grep 2nd and 3rd octet of the IP address.
Expected output:
20 30
Actual output:
20 04 0
What is my mistake here?

I'd stay away from regular expressions altogether:
set b [join [lrange [split $a .] 1 2]]
Split the value on dots, take the 2nd and 3rd elements, and join them with a space.

You need to set the variables for the match and captured groups, then you can access them. Here is an example:
set a "10.20.30.40"
set rest [regexp {[0-9]+\.([0-9]+)\.([0-9]+)\.[0-9]+} $a match submatch1 submatch2]
puts $submatch1
puts $submatch2
Output:
20
30
EDIT:
You can use regsub and backreferences this way (here I swap the 2nd and 3rd octets, just for demonstration). Note that a literal dot must be escaped:
set a "10.20.30.40"
regsub -all {\.([0-9]+)\.([0-9]+)\.} $a {.\2.\1.} b
puts $b
Output:
10.30.20.40
To obtain a "20 30" string, you need to use
regsub -all {^[0-9]+\.([0-9]+)\.([0-9]+)\.[0-9]+$} $a {\1 \2} b
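To make the effect of the unescaped dot concrete, this sketch runs the question's pattern and the escaped, anchored pattern side by side. The variable names loose and strict are only for this illustration.

```tcl
set a "10.20.30.40"

# Unescaped: each . matches any character, so digits get consumed as
# "separators" and the pattern matches twice ("10.20." and "30.40").
regsub -all {.([0-9]+).([0-9]+).} $a {\2 \1} loose

# Escaped and anchored: the pattern matches the whole address exactly once.
regsub {^[0-9]+\.([0-9]+)\.([0-9]+)\.[0-9]+$} $a {\1 \2} strict

puts $loose    ;# 20 04 0
puts $strict   ;# 20 30
```

In the first call, the match on "10.20." yields "20 0" and the match on "30.40" yields "4 0", which concatenate to the output reported in the question.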

Match any repetitive pattern using tcl

I have a binary file whose contents I don't know. I convert it to a hex string using binary scan $bin_data "H*" hex_data. The problem is how to match any repeated pattern (byte).
Example 1:
In the file: 0cabab79
Expected Output: abab
Example 2:
In the file: 0c1f1f03035d
Expected Output: 1f1f0303
Example 3:
In the file: 0c678967895d13
Expected Output: 67896789

You can use a more flexible regexp to get all the repeated patterns (at least 2 characters):
set inputs [list 0cabab79 0c1f1f03035d 0c678967895d13]
foreach input $inputs {
    set out ""
    foreach {whole sub} [regexp -all -inline {(..+)\1} $input] {
        append out $whole
    }
    puts $out
}
# Output:
# abab
# 1f1f0303
# 67896789
If you want to make sure the repeated unit has an even number of characters, i.e. consists of whole bytes (so that aaaaaa gives aaaa and not aaaaaa), then:
set inputs [list 0cabab79 0c1f1f03035d 0c678967895d13 aaaaaa]
foreach input $inputs {
    set out ""
    foreach {whole sub} [regexp -all -inline {((?:..)+)\1} $input] {
        append out $whole
    }
    puts $out
}
# Output:
# abab
# 1f1f0303
# 67896789
# aaaa

You can do this with a regexp using a backreference:
regexp -all -inline {(..)\1} 0c1f1f03035d
This returns a list containing, for each match, the full-length repetition followed by the repeated element.
So for this one you would get
1f1f 1f 0303 03
Looping through these you can build your Expected Output e.g.
set op {};
foreach {rep sing} [regexp -all -inline {(..)\1} 0c1f1f03035d] {
    append op $rep
}
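Putting the pieces together, here is a sketch of the whole path from binary data to the repeated-byte string. Since the asker's file is not available, the bytes are built with binary format from Example 2, which is an assumption made only so the snippet runs on its own. The pattern is the even-length variant from the first answer.

```tcl
# Stand-in for the file contents: the bytes of Example 2.
set bin_data [binary format H* 0c1f1f03035d]

# Convert the bytes to a lowercase hex string, as in the question.
binary scan $bin_data H* hex_data

# Collect every repetition whose unit is a whole number of bytes.
set out ""
foreach {whole unit} [regexp -all -inline {((?:..)+)\1} $hex_data] {
    append out $whole
}

puts $out   ;# 1f1f0303
```

Note that the regular expression works on hex characters, not bytes, so a match is not guaranteed to start on a byte boundary for arbitrary input.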
