Case matching regexp - regex

I have been wondering about a regexp matching pattern in Tcl for some time and I've remained stumped as to how it was working. I'm using Wish and Tcl/Tk 8.5 by the way.
I have a random string MmmasidhmMm stored in $line and the code I have is:
while {[regexp -all {[Mm]} $line match]} {
puts $data $match
regsub {[Mm]} $line "" line
}
$data is a text file.
This is what I got:
m
m
m
m
m
m
While I was expecting:
M
m
m
m
M
m
I was trying some things to see how changing a bit would affect the results when I got this:
while {[regexp -all {^[Mm]} $line match]} {
puts $data $match
regsub {[Mm]} $line "" line
}
I get:
M
m
m
Surprisingly, $match keeps the case.
I was wondering why in the first case, $match automatically becomes lowercase for some reason. Unless I am not understanding how the regexp actually is working, I'm not sure what I could be doing wrong. Maybe there's a flag that fixes it that I don't know about?
I'm not sure I'll really use this kind of code some day, but I guess learning how it works might help me in other ways. I hope I didn't miss anything in there. Let me know if you need more information!

The key here is your -all flag. The documentation for it says:
-all -- Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.
That means the variable match contains the very last match, which is a lower case 'm'. Drop the -all flag and you will get what you want.
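A minimal sketch of the difference, using the string from the question (the names count, last and out are only for illustration, and the matches are collected in a list instead of being written to $data):

```tcl
set line MmmasidhmMm

# With -all, regexp returns the number of matches and the
# match variable holds only the last one.
set count [regexp -all {[Mm]} $line last]
puts "$count matches, last one is $last"

# Without -all, each pass of the loop sees the first remaining match.
set out {}
while {[regexp {[Mm]} $line match]} {
    lappend out $match
    regsub {[Mm]} $line "" line
}
puts $out
```

The loop collects the characters in the order they appear in the string, with their case preserved.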
Update
If your goal is to remove all 'm' regardless of case, that whole block of code can be condensed into just one line:
regsub -all {[Mm]} $line "" line
Or, more intuitively:
set line [string map -nocase {m ""} $line]; # Map all M's into nothing
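As a quick check, here is a small sketch that runs both one-liners on the string from the question (viaRegsub and viaMap are illustrative names, not part of the original code):

```tcl
set line MmmasidhmMm

# Remove every M or m with a regular expression.
set viaRegsub [regsub -all {[Mm]} $line ""]

# Remove every M or m with a case-insensitive literal mapping.
set viaMap [string map -nocase {m ""} $line]

puts $viaRegsub
puts $viaMap
```

Both approaches should leave the same remainder; string map avoids the regular expression engine entirely.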

Related

Regex a var that contains square brackets in tcl

I'm trying to edit a verilog file by finding a match in lines of a file and replacing the match by "1'b1". The problem is that the match is a bus with square brackets in the form "busname[0-9]".
for example in this line:
XOR2X1 \S12/gen_fa[8].fa_i/x0/U1 ( .A(\S12/bcomp [8]), .B(abs_gx[8]), .Y(
I need to replace "abs_gx[8]" by "1'b1".
So I tried to find a match by using this code:
#gets abs_gx[8]
set net "\{[lindex $data 0]\}"
#gets 1'b1
set X [lindex $data 1]
#open and read lines of file
set netlist [open "./$circuit\.v" r]
fconfigure $netlist -buffering line
gets $netlist line
#let's assume the line is XOR2X1 \S12/gen_fa[8].fa_i/x0/U1 ( .A(\S12/bcomp [8]), .B(abs_gx[8]), .Y(
if {[regexp "(.*.\[A-X\]\()$net\(\).*)" $line -inline]} {
puts $new "$1 1'b$X $2" }
elseif {[regexp "(.*.\[Y-Z\]\()$net(\).*)" $line]} {
puts $new "$1$2" }
else {puts $new $line}
gets $netlist line
I tried so many things and nothing seems to really match, or I get an error saying that 8 is not a command because [8] gets interpreted as a command substitution.
Any sneaky trick to place a variable in a regex without having it interpreted as a regular expression itself?
If you have an arbitrary string that you want to match exactly as part of a larger regular expression, you should precede all non-alphanumeric characters in the string by a backslash (\). Fortunately, _ is also not special in Tcl's REs, so you can use \W (equivalent to [^\w]) to match the characters you need to fix:
set reSafe [regsub -all {\W} $value {\\&}]
If you're going to be doing that a lot, make a helper procedure.
proc reSafe {value} {
regsub -all {\W} $value {\\&}
}
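For illustration, a sketch of how the helper might be used on the net name from the question. The shortened line and the variable names pattern and result are assumptions made for the example:

```tcl
proc reSafe {value} {
    # Put a backslash in front of every non-word character.
    regsub -all {\W} $value {\\&}
}

set net {abs_gx[8]}
set line {.A(\S12/bcomp [8]), .B(abs_gx[8]), .Y(}

set pattern [reSafe $net]
puts $pattern

# The escaped pattern now matches the bus name literally.
set result [regsub -- $pattern $line "1'b1"]
puts $result
```

Because the brackets are escaped, [8] is treated as literal text rather than as a bracket expression.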
(Yes, I'd like a way of substituting variables more directly, but the RE engine's internals are code I don't want to touch…)
If I understand correctly, you want to substitute $X for $net, except when $net is preceded by Y( or Z(, in which case you just delete $net. You could avoid the complications of regexp by using string map, which does only literal substitutions (see https://www.tcl-lang.org/man/tcl8.6/TclCmd/string.htm#M34). You would then need to specify the Y( and Z( cases separately, but that's easy enough when there are only two. So instead of the regexp lines you would do:
set line [string map [list Y($net Y( Z($net Z( $net $X] $line]
puts $new $line
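A small sketch of that mapping on a made-up line, assuming $net holds the plain bus name (without the extra braces the question adds) and that the result is kept in a variable rather than written to $new:

```tcl
set net {abs_gx[8]}
set X "1'b1"

# Hypothetical input: the net appears once as an input and once after Y(.
set line {.B(abs_gx[8]), .Y(abs_gx[8])}

# Keys are tried in list order at each position, so the Y( and Z(
# cases are checked before the bare net name.
set line [string map [list Y($net Y( Z($net Z( $net $X] $line]
puts $line
```

The order of the pairs matters here: listing the bare $net first would replace it everywhere, including after Y( and Z(.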

regexp tcl to search for variables

I am trying to find a matching pattern using the regexp command in an if statement. I'm still a newbie in Tcl. The code is shown below:
set A 0;
set B 2;
set address {my_street[0]_block[2]_road};
if {[regexp {street\[$A\].*block\[$B\]} $address]} {
puts "the location is found"
}
I expect it to print "the location is found", since $address contains the matching A and B values. I am hoping to be able to change the A and B numbers for a list of addresses, but I cannot get it to print "the location is found".
Thank you.
Tcl's regular expression engine doesn't do variable interpolation. (Should it? Perhaps. It doesn't though.) That means that you need to do it at the generic level, which is in general quite annoying but OK here as the variables only have numbers in, which are never RE metacharacters by themselves.
Basic version (with SO. MANY. BACKSLASHES.):
if {[regexp "street\\\[$A\\\].*block\\\[$B\\\]" $address]} {
Nicer version with format:
if {[regexp [format {street\[%d\].*block\[%d\]} $A $B] $address]} {
You could also use subst -nocommands -nobackslashes but that's getting less than elegant.
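For completeness, a sketch of what the subst variant could look like. The address is set in braces here so that the brackets are not treated as command substitutions, and re is just an illustrative name:

```tcl
set A 0
set B 2
set address {my_street[0]_block[2]_road}

# Only variable substitution is performed; the backslashes and
# brackets in the braced template are left untouched.
set re [subst -nocommands -nobackslashes {street\[$A\].*block\[$B\]}]
puts $re

if {[regexp -- $re $address]} {
    puts "the location is found"
}
```

This keeps the pattern readable, at the cost of one more command to remember.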
If you need to support general substitutions, it's sufficient to use regsub to do the protection.
proc protect {string} {
regsub -all {\W} $string {\\&}
}
# ...
if {[regexp [format {street\[%s\].*block\[%s\]} [protect $A] [protect $B]] $address]} {
It's overkill when you know you're working with alphanumeric substitutions into the RE.

Tcl regsub used with subst produces unexpected result

Edit:
I was trying to replace "xor_in0" with "xor_in[0]" and "xor_in1" with "xor_in[1]" for a given str parameter. Here "xor_in0", "xor_in1" is parameter passed in and I represent it as "key", and "xor_in[0]", "xor_in[1]" is the value parameter stored in an array. Notice the point here is to replace every "key" in "str" with "value" . Here is my testing code:
set str "(xor_in0^xor_in1)"
set str1 "xor_in0^xor_in1" ;# another input
set key "xor_in0"
set value "xor_in\[0\]"
set newstr ""
set nonalpha "\[^0-9a-zA-Z\]"
regsub -all [subst {^\[(*\]($key)($nonalpha+)}] $str [subst -nobackslashes {$value\2}] newstr
puts $newstr
But somehow it doesn't work... I also tried to remove [subst ...] and it still failed to match anything. This is somehow against my knowledge of regular expression. Please help.
Everything seems a bit over-complicated to me.
Let's look at the regsub that you're actually going to execute. There's a trick to doing that easily; if your command is:
regsub -all [subst {^\[(*\]($key)($nonalpha+)}] $str [subst -nobackslashes {$value\2}] newstr
Then we can print out what it's going to try to do with:
puts [list regsub -all [subst {^\[(*\]($key)($nonalpha+)}] $str [subst -nobackslashes {$value\2}] newstr]
That reveals that you're really doing this:
regsub -all {^[(*](xor_in0)([^0-9a-zA-z]+)} (xor_in0^xor_in1) {xor_in[0]\2} newstr
The part that looks a bit strange in there is the ([^0-9a-zA-z]+) at the end of the RE. It's legal but odd as we can write things a bit differently with \W for matching a non-alpha:
regsub -all {^[(*](xor_in0)(\W+)} $str {xor_in[0]\2} newstr
And that seems to work. What might the bug be then? The definition of nonalpha, as you're using "\[^0-9a-zA-z\]" instead of "\[^0-9a-zA-Z\]". Yes, a literal ^ lies in the ASCII (and Unicode) range from A to z…
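A short sketch that shows the effect of the typo in isolation (the variable names typo and fixed are made up for the example):

```tcl
# With A-z, the caret falls inside the range, so the negated
# class refuses to match it.
set typo [regexp -- {[^0-9a-zA-z]} "^"]

# With A-Z, the caret is outside every listed range and matches.
set fixed [regexp -- {[^0-9a-zA-Z]} "^"]

puts "A-z: $typo, A-Z: $fixed"
```

The same applies to the other characters that sit between Z and a in ASCII, such as the square brackets, the backslash and the underscore.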
OTOH, I'd actually expect a transformation to really be done like this:
set newstr [regsub -all {(\y[a-zA-Z]+_in)(\d+)} $str {\1[\2]}]
The only things you're not used to there are \y (a word boundary constraint) and \d (match any digit). Or, for a simple transformation (mapping all instances of a literal substring to another literal substring):
set newstr [string map [list $key $value] $str]
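To compare the two suggestions, here is a sketch that applies both to the test string from the question (viaRegsub and viaMap are illustrative names):

```tcl
set str "(xor_in0^xor_in1)"
set key "xor_in0"
set value {xor_in[0]}

# Pattern-based: rewrites every name of the form <letters>_in<digits>.
set viaRegsub [regsub -all {(\y[a-zA-Z]+_in)(\d+)} $str {\1[\2]}]

# Literal: rewrites only the exact key that was supplied.
set viaMap [string map [list $key $value] $str]

puts $viaRegsub
puts $viaMap
```

The regsub version handles every index in one pass, while string map needs one key/value pair per name to be replaced.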
Actually, the real problem in my question was the A-z typo :)
Simple is generally better:
regsub -all {\d+} $s {[&]} s
Takes care of your examples.

Passing a match in regsub with & to a procedure (Tcl is being used)

I want to go through a comma separated string and replace matches with more comma separated elements.
i.e., 5-A,B after the regsub should give me 1-A,2-A,3-A,4-A,5-A,B
The following is not working for me as & is being passed as an actual & instead of the actual match:
regsub -all {\d+\-\w+} $string [myConvertProc &]
However not attempting to pass the & and using it directly works:
regsub -all o "Hello World" &&&
> Hellooo Wooorld
Not sure what I am doing wrong in attempting to pass the value & holds to myConvertProc
Edit: I think my initial problem is the [myConvertProc &] is getting evaluated first, so I am actually passing '&' to the procedure.
How do I get around this within the regex realm? Is it possible?
Edit 2: I've already solved it using a foreach on a split list, so I'm just looking to see if this is possible within a regsub. Thanks!
You are correct in your first edit: the problem is that each argument to regsub is fully evaluated before executing the command.
One solution is to insert a command substitution string into the string, and then use subst on it:
set string [regsub -all {\d+\-\w+} $string {[myConvertProc &]}]
# -> [myConvertProc 5-A],B
set string [subst $string]
# -> 1-A,2-A,3-A,4-A,5-A,B
This will only work if there is nothing else in string that is subject to substitution (but you can of course turn off variable and backslash substitution).
The foreach solution is much better. An alternative foreach solution is to iterate over the result of regexp -indices -inline -all, but iterating over the parts of a split list is preferable if it works.
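A rough sketch of the regexp -indices variant mentioned above. The body of myConvertProc is a guess at what the asker's procedure does, based only on the 5-A example, and result is an illustrative name:

```tcl
# Hypothetical stand-in for the asker's procedure: 5-A -> 1-A,...,5-A
proc myConvertProc {item} {
    lassign [split $item -] count suffix
    set parts {}
    for {set i 1} {$i <= $count} {incr i} {
        lappend parts "$i-$suffix"
    }
    return [join $parts ,]
}

set string "5-A,B,3-C"
set result $string

# Walk the matches from last to first so that earlier indices
# stay valid while later parts of the string change length.
foreach pair [lreverse [regexp -indices -inline -all {\d+-\w+} $string]] {
    lassign $pair first last
    set piece [string range $string $first $last]
    set result [string replace $result $first $last [myConvertProc $piece]]
}
puts $result
```

Unlike the subst approach, nothing in the input is ever evaluated as Tcl code, so stray brackets or dollar signs in the data are harmless.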
Update:
A typical foreach solution goes like this:
set res {}
foreach elem [split $string ,] {
if {[regexp -- {^\d+-\w+$} $elem]} {
lappend res [myConvertProc $elem]
} else {
lappend res $elem
}
}
join $res ,
That is, you collect a result list by looking at each element in the raw list. If the element matches your requirement, you convert it and add the result to the result list. If the element doesn't match, you just add it to the result list.
It can be simplified somewhat in Tcl 8.6:
join [lmap elem [split $string ,] {
if {[regexp -- {^\d+-\w+$} $elem]} {
myConvertProc $elem
} else {
set elem
}
}] ,
Which is the same thing, but the lmap command handles the result list for you.
Documentation: foreach, lappend, lmap, regexp, regsub, set, split, subst

changing several expressions in one line in perl

I want to take a line containing several expressions of the same structure, each containing a 4-digit hex number, and change the number in each according to a hash table. I tried using this piece of code:
while ($line =~ s/14'h([0-9,a-f][0-9,a-f][0-9,a-f][0-9,a-f])/14'h$hash_point->{$1}/g){};
Where $hash_point is a pointer to the hash table.
But it tells me that I am trying to use an undefined value. When I tried running the following code:
while ($line =~ s/14'h([0-9,a-f][0-9,a-f][0-9,a-f][0-9,a-f])/14'h----/g){print $1," -> ",$hash_point->{$1},"\n";};
It changed all the wanted numbers to "----" but printed out the values only 2 times (there were much more changes).
Where is the problem?
This is what I used in the end:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".$hash_point->{$1}/ge;
and in order to account for numbers not in the hash I've added:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".(($hash_point->{$1}) or ($1))/ge;
I also wanted to know what numbers don't appear at the hash:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".(($hash_point->{$1}) or (print "number $1 didn't change\n") &&($1))/ge;
and finally, I wanted to be able to control whether the message from the previous stage would be printed, so I've added $flag, which is defined only if I want the messages to appear:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".(($hash_point->{$1}) or (((defined($flag)) && (print "number $1 didn't change\n")) or ($1)))/ge;
Your regexp seems to work well for me except when the hex number is not present in the hash.
I tried:
#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;
use Data::Dumper;
my $line = q!14'hab63xx14'hab88xx14'hab64xx14'hab65xx14'hcdef!;
my $hash_point = {
ab63 => 'ONE',
ab64 => 'TWO',
ab65 => 'THREE',
};
while ($line =~ s/14'h([0-9,a-f][0-9,a-f][0-9,a-f][0-9,a-f])/14'h$hash_point->{$1}/g){};
say $line;
This produces:
Use of uninitialized value in concatenation (.) or string at C:\tests\perl\test5.pl line 15.
Use of uninitialized value in concatenation (.) or string at C:\tests\perl\test5.pl line 15.
14'hONExx14'hxx14'hTWOxx14'hTHREExx14'h
The errors are for numbers ab88 and cdef that are not keys in the hash.
Just a small correction, but neither of your regexes does what you think it does.
/[a-f,0-9]/
Matches any character from a to f, 0 to 9, and a comma. You are looking for
/[a-f0-9]/
Not that this is what is breaking your program (M42 probably got it right, but we can't be sure unless you show us the hash).
Also, apologies, not enough rep to actually answer to other posts.
EDIT:
Well, you go through a lot of hoops in that answer, so here's how I'd do it instead:
s/14'h\K(\p{AHex}{4})/if (defined($hash_point->{$1})) {
$hash_point->{$1};
} else {
say $1 if $flag;
$1;
}/ge
Mainly because chaining and's and &&'s and so on generally makes for fairly hard-to-understand code. All whitespace is optional, so squash it for the one-liner!