Is there a way to do multiple substitutions using regsub? - regex

Is it possible to do several different substitutions in an expression using regsub?
Example:
set a ".a/b.c..d/e/f//g"
Now, in this expression, is it possible to substitute
"." as "yes"
".." as "no"
"/" as "true"
"//" as "false" in a single regsub command?

With a regsub, no. There's a long-standing feature request for this sort of thing (which requires substitution with the result of evaluating a command on the match information) but it's not been acted on to date.
But you can use string map to do what you want in this case:
set a ".a/b.c..d/e/f//g"
set b [string map {".." "no" "." "yes" "//" "false" "/" "true"} $a]
puts "changed $a to $b"
# changed .a/b.c..d/e/f//g to yesatruebyescnodtrueetrueffalseg
Note that when building the map, if any from-value is a prefix of another, the longer from-value should be put first. (This is because the string map implementation checks which change to make in the order you list them in…)
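To see why the ordering matters, here is a small sketch that runs the same input through both orderings of the map. The variable names wrongOrder and rightOrder are only illustrative.

```tcl
set a ".a/b.c..d/e/f//g"

# Shorter keys listed first: "." and "/" always win, so ".." and "//" never match
set wrongOrder [string map {"." "yes" ".." "no" "/" "true" "//" "false"} $a]
puts $wrongOrder
# yesatruebyescyesyesdtrueetrueftruetrueg

# Longer keys listed first: ".." and "//" get a chance before their prefixes
set rightOrder [string map {".." "no" "." "yes" "//" "false" "/" "true"} $a]
puts $rightOrder
# yesatruebyescnodtrueetrueffalseg
```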
It's possible to use regsub and subst to do multiple-target replacements in a two-step process, but I don't advise it for anything other than very complex cases! A nice string map is far easier to work with.

You may also try to do it yourself. This is a draft proc which you could use as a starting point. It is not production ready, and you must be careful, because every substitution after the first one works on the already-substituted string.
These are the parameters:
options is a list of options that will be passed to every call to regsub
resubList is a list of key/value pairs, where the key is a regular expression and the value is a substitution
string is the string you want to substitute
This is the procedure, and it simply calls regsub multiple times, once for every element in resubList and, at the end, it returns the final string.
proc multiregsub {options resubList string} {
foreach {re sub} $resubList {
set string [regsub {*}$options -- $re $string $sub]
}
return $string
}
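The caveat about later substitutions seeing earlier results can be made concrete with a short sketch. The proc is repeated so the snippet runs on its own; the input "aabb" and the pattern list are made up for illustration.

```tcl
proc multiregsub {options resubList string} {
    foreach {re sub} $resubList {
        set string [regsub {*}$options -- $re $string $sub]
    }
    return $string
}

# Sequential passes: every "a" becomes "b", then every "b" (old and new) becomes "c"
set chained [multiregsub {-all} {a b b c} "aabb"]
puts $chained
# cccc

# string map makes a single pass and never rescans its own output
set mapped [string map {a b b c} "aabb"]
puts $mapped
# bbcc
```

If the chained behaviour is not what you want, string map is probably the safer choice for constant strings.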

Related

Is there a Perl regex metacharacter or a way to specify a default value, if a subpattern capture does not match?

Here's the idea. I am parsing command line options but doing it across the entire command line, not by each @ARGV element separately.
program --format="%H:%M:%S" --timeout 12 --nofail
I want the parsing to work with these cases.
--name=value, easy to parse
--name value, pretty easy
--name no value, default the value to 1
Here is the regex which works, except it cannot do the missing value case
%options = "@ARGV" =~ /--([A-Za-z]+)[= ]([^-]\S*)/g;
i.e. match --name=value or --name value but not --name --name, --name --name is two names, not a --name=value pair.
If a --name has no value following it that matches the second capture in the regex, is there a way, within the regex, to specify a default, in my case a 1, to indicate "true". i.e. if an --name has no argument, like --nofail then set that argument to 1 indicating true.
Actually, in asking this I figured out a workaround using separate match statements which is fine. However, just out of curiosity, the question still stands, is there a Perl regex way to have a default if a submatch fails?
I don't see how to return a list reflecting a changed input from a regex alone. To change the input we need the s{}{}er operator, as we need code in its replacement part to analyze the captures and decide what to change; and we get a string, not a list, which needs to be further processed (split).
Here is then one such take, with a minimal intrusion of code.
Match name and value, with = or space between them, and if value ($2) is undefined give it a value; so we need /e to implement that.† While we are at it, put a space between all name-value pairs. This goes under /r so that the changed string is returned, and passed through split
my %arg = split ' ',
$args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1.' '.($2//'1 ') }ergx;
The split can be done by another regex instead but that's still extra processing.
A complete program (with more flags added to the input)
use warnings;
use strict;
use feature 'say';
my $args = shift // q(--fmt="%H:%M" --f1 --time 12 --f2 --f3);
say $args;
my %arg = split ' ',
$args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1 . ' ' . ($2//'1 ') }ergx;
say "$_ => $arg{$_}" for keys %arg;
This prints as expected. But note that there may be edge cases, and in particular having a space inside (a quoted) argument value, like "%H %M", would require a far more complex pattern.
I presume that the regex ask is for play/study. Normally this goes by libraries, like Getopt::Long. If that is somehow not possible then processing @ARGV term by term is nice and easy -- and fast.
† In order to actually do "if value ($2) is undefined give it a value" we need to run code in the replacement part, which is done under the /e modifier.

Why do special characters in my variable disappear on doing an lindex in Tcl?

I have a list in my application that I work on. It's basically like this:
$item = {text1 text2 text3}
Then I pick up the first member in the list with:
lindex $item 0
On doing this, text1, which used to be (say) abcdef\12345, becomes abcdef12345.
But it's very important for me not to lose this \. Why is it disappearing? There are other characters like - and > which don't disappear. Please note that I cannot escape the \ in the text beforehand. If there's anything I can do before operating on the $item with lindex, please suggest.
The problem is that \ is a Tcl list metasyntax character, unlike -, > or any alphanumeric. You need to convert your string into a proper Tcl list before using lindex (or any other list-consuming operation) on it. To do that, you need to understand exactly what you mean by “words” in your input data. If your input data is a sequence of non-whitespace characters separated by single whitespace characters, you can use split to do the conversion to a list:
set properList [split $item]
# Now we can use it...
set theFirstWord [lindex $properList 0]
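As a quick check of the difference, the following sketch feeds the same raw string to lindex directly and via split. The sample value abc\def is hypothetical; which character you end up with after the lost backslash depends on what follows it.

```tcl
# Hypothetical input: three whitespace-separated words, the first containing a backslash
set item {abc\def text2 text3}

# lindex parses $item as a list, so \d is treated as an escape and the backslash is dropped
set viaLindex [lindex $item 0]
puts $viaLindex
# abcdef

# split only looks for separator characters, so the backslash survives
set viaSplit [lindex [split $item] 0]
puts $viaSplit
# abc\def
```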
If you've got a different separator, split takes an optional extra character to say what to split by. For example, to split by colons (:) you do:
set properList [split $item ":"]
However, if you have other sorts of splitting rules, this doesn't work so well. For example, if you can split by multiple whitespace characters, it's actually better to use regexp (with the -all -inline options) to do the word-identification:
# Strictly, this *chooses* all sequences of one or more non-whitespace characters
set properList [regexp -all -inline {\S+} $item]
You can also do splitting by multi-character sequences, though in that case it is most easily done by mapping (with string map) the multi-character sequence to a single rare character first. Unicode means that there are lots of such characters to pick…
# NUL, \u0000, is a great character to pick for text, and terrible for binary data
# For binary data, choose something beyond \u00ff
set properList [split [string map {"BOUNDARY" "\u0000"} $item] "\u0000"]
Even more complex options are possible, but that's when you use splitx from Tcllib.
package require textutil::split
# Regular expression to describe the separator; very sophisticated approach
set properList [textutil::split::splitx $item {SPL+I*T}]
In Tcl, lists can be created in several ways:
by setting a variable to be a list of values
set lst {{item 1} {item 2} {item 3}}
with the split command
set lst [split "item 1.item 2.item 3" "."]
with the list command.
set lst [list "item 1" "item 2" "item 3"]
And an individual list member can be accessed with the lindex command.
set x "a b c"
puts "Item 2 of the list {$x} is: [lindex $x 2]\n"
This will give output:
Item 2 of the list {a b c} is: c
With respect to the question asked:
You need to define the variable like this: abcdef\\12345
In order to make this clear try to run the following command.
puts "\nI gave $100.00 to my daughter."
and
puts "\nI gave \$100.00 to my daughter."
The second one will give you the proper result.
If you don't have the option to change the text, try to save the text in curly braces, as mentioned in the first example.
set x {abcd\12345}
puts "A simple substitution: $x\n"
Output:
A simple substitution: abcd\12345
set y [set x {abcdef\12345}]
And check for this output:
puts "Remember that set returns the new value of the variable: X: $x Y: $y\n"
Output:
Remember that set returns the new value of the variable: X: abcdef\12345 Y: abcdef\12345

How do I use regex capture group as array index?

I'm trying to use regsub in TCL to replace a string with the value from an array.
array set myArray "
one 1
two 2
"
set myString "\[%one%\],\[%two%\]"
regsub -all "\[%(.+?)%\]" $myString "$myArray(\\1)" newString
My goal is to convert a string from "[%one%],[%two%]" to "1,2". The problem is that the capture group index is not resolved. I get the following error:
can't read "myArray(\1)": no such element in array
while executing
"regsub -all "\[%(.+?)%\]" $myString "$myArray(\\1)" newString"
This is a 2 step process in Tcl. Your main mistake here is using double quotes everywhere:
array set myArray {one 1 two 2}
set myString {[%one%],[%two%]}
regsub -all {\[%(.+?)%\]} $myString {$myArray(\1)} new
puts $new
puts [subst -nobackslash -nocommand $new]
$myArray(one),$myArray(two)
1,2
So we use regsub to search for the expression and replace it with the string representation of the variable we want to expand. Then we use the rarely-used subst command to perform the variable (only) substitution.
Apart from using regsub+subst (which is a decidedly tricky pair of commands to use safely in general) you can also do relatively simple transformations using string map. The trick is in how you prepare the mapping:
# It's conventional to use [array set] like this…
array set myArray {
one 1
two 2
}
set myString "\[%one%\],\[%two%\]"
# Build the transform
set transform {}
foreach {from to} [array get myArray] {
lappend transform "\[%$from%\]" $to
}
# Apply the transform
set changedString [string map $transform $myString]
puts "transformed from '$myString' to '$changedString'"
As long as each individual thing you want to go from and to is a constant string at the time of application, you can use string map to do it. The advantage? It's obviously correct. It's very hard to make a regsub+subst transform obviously correct (but necessary if you need a more complex transform; that's the correct way to do %XX encoding and decoding in URLs for example).
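For the %XX decoding case mentioned above, a minimal sketch of the regsub+subst approach might look like this. The proc name urlDecode is hypothetical, and the code assumes the encoded bytes represent UTF-8 text; it is meant to show the shape of the technique, not to be a complete URL decoder.

```tcl
# Hypothetical helper: decode %XX escapes in a URL-encoded string (assumes UTF-8 payload)
proc urlDecode {str} {
    # Step 0: square brackets would be taken as commands by subst, so turn
    # literal ones into their own %XX form and let the decoder restore them
    set str [string map {"[" "%5B" "]" "%5D"} $str]
    # Step 1: regsub rewrites each %XX into a format command in brackets
    set script [regsub -all {%([0-9A-Fa-f]{2})} $str {[format %c 0x\1]}]
    # Step 2: subst runs only those commands; variables and backslashes stay literal
    set bytes [subst -novariables -nobackslashes $script]
    # Interpret the decoded bytes as UTF-8 text
    return [encoding convertfrom utf-8 $bytes]
}

puts [urlDecode "a%20b%5Bc%5D"]
# a b[c]
```

The pre-mapping of brackets is what keeps subst from running anything other than the format calls that regsub generated, which is the part that is easy to get wrong.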

tcl regsub will not work

I'm trying to write an extremely simple piece of code and tcl is not cooperating. I can only imagine there is a very simple error I am missing in my code but I have absolutely no idea what it could be please help I'm so frustrated!!
My code is the following ...
proc substitution {stringToSub} {
set afterSub $stringToSub
regsub {^.*?/projects} "$stringToSub" "//Path/projects" afterSub
regsub {C:/projects} "$stringToSub" "//Path/projects" afterSub
return $afterSub
}
puts "[substitution /projects] "
puts "[substitution C:/projects] "
The substitution works fine for the second expression but not the first one. Why is that??
I have also tried using
regsub {^/projects} "$stringToSub" "//Path/projects" afterSub
and
regsub {/projects} "$stringToSub" "//Path/projects" afterSub
but neither are working. What is going on??
Your two regsub calls don't change the input string ($stringToSub); each one puts its result in $afterSub, which the function returns. So you will always obtain the result of the last regsub call, because the result of the first call in $afterSub is always overwritten.
Note that the first pattern is more general and includes all the strings matched by the second (assuming that $stringToSub is always a path). If you hope to obtain "//Path/projects" for your sample strings, you can simply remove the second regsub call:
proc substitution {stringToSub} {
set afterSub $stringToSub
regsub {^.*?/projects} "$stringToSub" "//Path/projects" afterSub
return $afterSub
}
The first two lines in your procedure will effectively do nothing, since regsub always overwrites the destination variable (afterSub) even when there's 0 matches/substitutions made. From the regsub manual:
This command matches the regular expression exp against string, and either copies string to the variable whose name is given by varName or returns string if varName is not present. (Regular expression matching is described in the re_syntax reference page.) If there is a match, then while copying string to varName (or to the result of this command if varName is not present) the portion of string that matched exp is replaced with subSpec.
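A small sketch of that behaviour, using the strings from the question: the destination variable is overwritten even though nothing matched. The initial value of afterSub and the variable count are only there for illustration.

```tcl
set afterSub "value from an earlier step"

# The pattern does not match, yet regsub still copies the input into afterSub
set count [regsub {C:/projects} "/projects" "//Path/projects" afterSub]
puts "$count -> $afterSub"
# 0 -> /projects
```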
There's no need to match C:/projects specifically, because ^.*?/projects will match that text as well.
The issue is that your second use of the regsub operation is overwriting the substituted value from the first regsub use.
We could simplify the code to just this:
proc substitution {stringToSub} {
return [regsub {^.*?/projects} $stringToSub "//Path/projects"]
}

Making a dynamic hash of arrays in foreach in perl based on regex

So I'm trying to make a hash of arrays based on a regex inside a foreach.
I'm getting some file paths, and they are of the format:
longfilepath/name.action.gz
So basically there will be files with the same name but different actions, and I want to make a hash with keys of name whose values are arrays of actions. I'm apparently doing something wrong as I keep getting this error when I run the code:
Not an ARRAY reference at ....the file I'm writing in
Which I don't get since I'm checking to see if its set, and if not declaring it as an array. I'm still getting used to perl, so I'm guessing my problem is something simple.
I should also say, that I've verified my regex is generating both the 'name' and 'action' strings properly so the problem is definitely in my foreach;
Thanks for your help. :)
My code is thus.
my %my_hash;
my $file_paths = glom("/this/is/mypath/*.*\.gz");
foreach my $path (@$bdr_paths){
$path =~ m"\/([^\/\.]+)\.([^\.]+)\.gz";
print STDERR "=>".Dumper($1)."\n\r";
print STDERR "=>".Dumper($2)."\n\r";
#add the entity type to a hash with the recipe as the key
if($my_hash{$1})
{
push($my_hash{$1}, $2);
}
else
{
$my_hash{$1} = ($2);
}
}
It’s glob, not glom. In glob expressions, the period is not a metacharacter. → glob '/this/is/mypath/*.gz'.
The whole reason for using alternate regex delimiters is to avoid unnecessary escapes. The forward slash is not a regex metacharacter, but a delimiter. Inside character classes, many operators lose their special meaning; there is no need to escape the period. Ergo m!/([^/.]+)\.([^.]+)\.gz!.
Don't append \n\r to your output. ① The Dumper function already appends a newline. ② If you are on an OS that expects a CRLF, then use the :crlf PerlIO layer, which transforms all \ns to a CRLF. You can add layers via binmode STDOUT, ':crlf'. ③ If you are doing networking, it might be better to specify the exact bytes you want to emit, e.g. \x0D\x0A or \015\012. (But in this case, also remove all PerlIO layers).
Using a reference as the first argument to push doesn't work on perls older than v5.14, and the feature was removed again in v5.24.
Don't manually check whether you populated a slot in your hash or not; if it is undef and used as an arrayref, an array reference is automatically created there. This is known as autovivification. Of course, this requires you to perform this dereference (and skip the short form for push).
In Perl, parens only sort out precedence, and create list context when used on the LHS of an assignment. They do not create arrays. To create an anonymous array reference, use brackets: [$var]. Using parens like you do is useless; $x = $y and $x = ($y) are absolutely identical.
So you either want
push @{ $my_hash{$1} }, $2;
or
if ($my_hash{$1}) {
push @{ $my_hash{$1} }, $2;
} else {
$my_hash{$1} = [$2];
}
Edit: Three things I overlooked.
If glob is used in scalar context, it turns into an iterator. This is usually unwanted, unless it is used in a while(my $path = glob(...)) { ... } loop. Otherwise it is more difficult to make sure the iterator is exhausted. Rather, use glob in list context to get all matches at once: my @paths = glob(...).
Where does $bdr_paths come from? What is inside?
Always check that a regex actually matched. This can avoid subtle bugs, as the captures $1 etc. keep their value until the next successful match.
When you say $my_hash{$1} = ($2); the right-hand side is evaluated in scalar context (the parens do not make a list here), so with several comma-separated values only the last one is stored in the hash.
my %h;
$h{a} = ('foo');
$h{b} = ['bar'];
$h{c} = ('foo', 'bar', 'bat'); # Will cause warning if 'use warnings;'
print Dumper(\%h);
Gives
$VAR1 = {
'c' => 'bat',
'b' => [
'bar'
],
'a' => 'foo'
};
You can see that 'foo' is stored as a plain value and not as an array reference. So you can store an anonymous array ref with $my_hash{$1} = [$2]; and then push onto it with push( @{ $my_hash{$1} }, $2);