Regexp trouble in TCL - regex

I have a question about regexp in Tcl.
How can I find and change some text in a Tcl string variable with the regexp function?
Example of the text:
/folder/folder2/test-c+a+t -test1 -test2
I want to receive:
/folder/folder2/test-d+o+g
Or for example it can be just:
test-c+a+t
and I want to receive:
test-d+o+g
Sorry for this addition:
In this situation:
/test-c+a+t/folder2/test-c+a+t -test1 -test2
I want to receive:
/test-c+a+t/folder2/test-d+o+g -test1 -test2

% set old {/folder/folder2/test-c+a+t -test1 -test2}
/folder/folder2/test-c+a+t -test1 -test2
% set new [regsub {(test)-c\+a\+t.*} $old {\1-d+o+g}]
/folder/folder2/test-d+o+g
Note the literal + symbols need to be escaped because they are regular expression quantifiers.
http://tcl.tk/man/tcl8.5/TclCmd/re_syntax.htm
http://tcl.tk/man/tcl8.5/TclCmd/regsub.htm
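For the later case in the question, where test-c+a+t also appears as a directory name and the trailing options should be kept, one possible sketch is to match only the occurrence that has no "/" after it. The helper name replace_last_component is made up for illustration, and the sketch assumes the trailing options never contain a slash.

```tcl
# Hypothetical helper: replace the test-c+a+t that sits in the last
# path component, leaving earlier components and trailing options alone.
# Assumption: nothing after the last component contains a slash.
proc replace_last_component {str} {
    # The lookahead (?=[^/]*$) only allows a match when no slash follows.
    return [regsub {test-c\+a\+t(?=[^/]*$)} $str {test-d+o+g}]
}

puts [replace_last_component {/test-c+a+t/folder2/test-c+a+t -test1 -test2}]
puts [replace_last_component {/folder/folder2/test-c+a+t -test1 -test2}]
puts [replace_last_component {test-c+a+t}]
```

Unlike the .* version above, this keeps -test1 -test2 in the result, so it only fits if that is what you want.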

In the specific case you mention here you would do better to use string map. Regular expressions are more flexible though so it all depends how specific your task is.
set modified [string map {test-c+a+t test-d+o+g} $original]
Otherwise, there is no substitute for learning how to use regular expression syntax. It is useful pretty much all the time, so read the manual page, try various expressions and re-read the manual when you fail to match what you expected. Also try out sed, awk and grep for learning to use regexps.

Either use string map or use regsub (possibly with the -all flag). Here are some examples of the two approaches:
set myString [string map [list "test-c+a+t" "test-d+o+g"] $myString]
set myString [regsub -all "***=test-c+a+t" $myString "test-d+o+g"]
### Or equivalently, for older Tcl versions...
regsub -all "***=test-c+a+t" $myString "test-d+o+g" myString
The string map can apply multiple changes in one sweep (the mapping a b b a would swap all a and b characters) but it only ever replaces literal strings and always replaces everything it can. The regsub command can do much more complex transformations and can be much more selective about what it replaces, but it does require you to use regular expressions and it is slower in the case where a string map can do an equivalent job. However, the special leading ***= in the pattern means that the rest of the pattern is a literal string.
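A small sketch of the differences described above. The sample strings are invented for illustration only.

```tcl
set text "c+a+t and caaat"

# string map: purely literal, several pairs applied in one sweep.
set swapped [string map {a b b a} "abba"]
puts $swapped                                   ;# baab

# ***= makes the rest of the pattern a literal string.
set literal [regsub -all {***=c+a+t} $text "d+o+g"]
puts $literal                                   ;# d+o+g and caaat

# Without ***=, each + is a quantifier, so only "caaat" matches.
set quantified [regsub -all {c+a+t} $text "d+o+g"]
puts $quantified                                ;# c+a+t and d+o+g
```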

Related

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
sub repl {
my ($r) = @_;
my @terms = grep { $_ } split /(\$\d)/, $r;
return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other, in their original order and with everything (other than trailing whitespace) kept. This includes possible (leading) empty strings so those need be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings directly deny many patterns since interpolation is done -- for example, "\w" is just an escaped w (which is unrecognized). Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we are getting a true regex. Then all modifiers may be used as well.

tcl regular expression, attempting to pull out a string between two patterns

Greetings!
I am trying to use tcl regular expressions to strip off unwanted characters and keep the desired string.
The 4 basic string types are
I34/pAVDD_3
I32/pDVDD_15_2
I999/pAGND
I3/pDOUT_LG0
What I want to capture is what's in-between the p and the end of the string or the last underscore & number if it exists. With the strings above I want to capture AVDD, DVDD_15, AGND, and DOUT_LG0.
I thought I had it with [p](\w*)?[_][\d*] but it doesn't work with I3/pDOUT_LG0, and after quite a while of trying different things, I can't find a pattern that will work.
Thanks!
How about
regexp {p(?:(\w+)_\d|(\w+))$} $str -> c1 c2
set result $c1$c2
One or the other will be empty, so the result is a simple concatenation of them.
Another possible solution is to strip off the unwanted parts:
regsub -all {.+p|_\d$} $str {}
Documentation:
regexp,
regsub,
Syntax of Tcl regular expressions
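To check the first pattern against all four sample strings from the question, a short loop like this can be used. The proc name pin_name is only illustrative.

```tcl
# Hypothetical wrapper around the regexp shown above.
proc pin_name {str} {
    # c1 is set when a trailing _<digit> is present, c2 otherwise.
    if {[regexp {p(?:(\w+)_\d|(\w+))$} $str -> c1 c2]} {
        return $c1$c2
    }
    return ""
}

foreach s {I34/pAVDD_3 I32/pDVDD_15_2 I999/pAGND I3/pDOUT_LG0} {
    puts "$s -> [pin_name $s]"
}
```

The question asks for AVDD, DVDD_15, AGND and DOUT_LG0, so those are the values to compare the output against.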

How to use Tcl regexp when the search string contains variables and spaces?

Here's the code where I would like to match $line with $ram using Tcl regexp.
set line { 0 DI /hdamrf};
set ram {/hdamrf};
regexp {\s+\d\s+DI\s+$ram} $line match; ## --> 0
Please help me construct a search string which can match the regular expression. It so happens that to use variables in search strings they have to be enclosed in curly braces. But curly braces don't allow me to use \s for detecting white space. Thanks.
This is one possibility:
regexp [format {\s+\d\s+DI\s+%s} $ram] $line match
This is another:
regexp "\\s+\\d\\s+DI\\s+$ram" $line match
Documentation: format, Syntax of Tcl regular expressions, regexp
Try it this way:
regexp [subst -nocommands -nobackslashes {\s+\d\s+DI\s+$ram}] $line match
For example:
% regexp [subst -nocommands -nobackslashes {\s+\d\s+DI\s+$ram}] $line match
1
See the manual page for subst:
This command performs variable substitutions, command substitutions, and backslash substitutions on its string argument and returns the fully-substituted result. The substitutions are performed in exactly the same way as for Tcl commands. As a result, the string argument is actually substituted twice, once by the Tcl parser in the usual fashion for Tcl commands, and again by the subst command.
If any of the -nobackslashes, -nocommands, or -novariables are specified, then the corresponding substitutions are not performed. For example, if -nocommands is specified, command substitution is not performed: open and close brackets are treated as ordinary characters with no special interpretation.
Since it's already in {...}, the part about the parser interpreting it twice isn't completely accurate :)
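The three suggestions can be compared side by side with the values from the question. The second half is an assumption beyond the question: ram2 and line2 are made-up values showing what could happen if the variable itself held regex metacharacters, with the backslash-escaping done by a plain regsub.

```tcl
set line { 0 DI /hdamrf}
set ram  {/hdamrf}

# Three ways to get the variable into the pattern while keeping
# the regex escapes for whitespace and digits intact.
set viaFormat [regexp [format {\s+\d\s+DI\s+%s} $ram] $line]
set viaQuotes [regexp "\\s+\\d\\s+DI\\s+$ram" $line]
set viaSubst  [regexp [subst -nocommands -nobackslashes {\s+\d\s+DI\s+$ram}] $line]
puts "$viaFormat $viaQuotes $viaSubst"

# Assumption (hypothetical values): the name contains a metacharacter.
set ram2  {/hda+mrf}
set line2 { 0 DI /hda+mrf}
# Put a backslash in front of every non-word character.
set escaped [regsub -all {\W} $ram2 {\\&}]
set rawMatch     [regexp [format {\s+\d\s+DI\s+%s} $ram2] $line2]
set escapedMatch [regexp [format {\s+\d\s+DI\s+%s} $escaped] $line2]
puts "$rawMatch $escapedMatch"
```

With the unescaped ram2 the + is read as a quantifier, so the literal text no longer matches.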

Regular expression literal-text span

Is there any way to indicate to a regular expression a block of text that is to be searched for explicitly? I ask because I have to match a very very long piece of text which contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.
Rinse, repeat.
Needless to say, I don't really want to have to run through the entire thing and have to escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?
Edit:
Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?
The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:
proc re_escape str {
# Every non-word char gets a backslash put in front
regsub -all {\W} $str {\\&}
}
set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString
Where you have a whole literal string, you have two other alternatives:
regexp "***=$literal" $someString
regexp "(?q)$literal" $someString
However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
No, Tcl does not have such a feature.
If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:
set fixed1 {.*?[]} ;# match the literal five-byte sequence .*?[]
set fixed2 {???} ;# match the literal three byte sequence ???
set pattern "this.*and.*that"
regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]" $someString
You would need to supply the definition for re_escape but the solution should be pretty obvious.
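A runnable sketch of this approach, borrowing the re_escape definition from the other answer. The flexible middle part and the two test strings are made up for illustration.

```tcl
# Every non-word char gets a backslash put in front.
proc re_escape {str} {
    regsub -all {\W} $str {\\&}
}

set fixed1 {.*?[]}   ;# literal text, full of metacharacters
set fixed2 {???}     ;# more literal text

# Literal, then optional spaces, digits, optional spaces, then literal.
set re "[re_escape $fixed1]\\s*\\d+\\s*[re_escape $fixed2]"

set good {prefix .*?[] 42 ??? suffix}
set bad  {prefix abc 42 ??? suffix}
set goodMatch [regexp $re $good]
set badMatch  [regexp $re $bad]
puts "$goodMatch $badMatch"   ;# 1 0
```

Only the literal pieces go through re_escape, so the pattern part between them keeps its regex meaning.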
A Tcl regular expression can be specified with the q metasyntactical directive to indicate that the expression is literal text:
% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}
This does however apply to the entire expression.
What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
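As a rough sketch of that idea, using the same sample string: take the indices of two literal matches and pull out what lies between them. It assumes exactly two matches were found, and lassign needs Tcl 8.5 or later.

```tcl
set text {this string contains *emphasis* and 2+2 math?}

# Each hit is a {start end} pair for one literal "*".
set hits [regexp -inline -all -indices {(?q)*} $text]

# Assumption: exactly two hits, taken as opening and closing marker.
lassign $hits first second
set from [expr {[lindex $first 1] + 1}]
set to   [expr {[lindex $second 0] - 1}]
set between [string range $text $from $to]
puts $between   ;# emphasis
```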
I believe Perl and Java support the \Q \E escape. so
\Q.*.*()\E
..will actually match the literal ".*.*()"
OR
Bit of a hack, but replace the literal section with some text which does not need escaping and that will not appear elsewhere in your searched string. Then build the regex using this meta-character-free text. A 100 digit random sequence, for example. Then when your regex matches at a certain position and length in the doctored string you can calculate whereabouts it should appear in the original string and what length it should be.

Embedding evaluations in Perl regex

So I'm writing a quick Perl script that cleans up some HTML code and runs it through an html -> pdf program. I want to lose as little information as possible, so I'd like to extend my textareas to fit all the text that is currently in them. This means, in my case, setting the number of rows to a calculated value based on the value of the string inside the textbox.
This is currently the regex I'm using
$file=~s/<textarea rows="(.+?)"(.*?)>(.*?)<\/textarea>/<textarea rows="(?{ length($3)/80 })"$2>$3<\/textarea>/gis;
Unfortunately Perl doesn't seem to be recognizing what I was told was the syntax for embedding Perl code inside search-and-replace regexs
Are there any Perl junkies out there willing to tell me what I'm doing wrong?
Regards,
Zach
The (?{...}) pattern is an experimental feature for executing code on the match side, but you want to execute code on the replacement side. Use the /e regular-expression switch for that:
#! /usr/bin/perl
use warnings;
use strict;
use POSIX qw/ ceil /;
while (<DATA>) {
s[<textarea rows="(.+?)"(.*?)>(.*?)</textarea>] {
my $rows = ceil(length($3) / 80);
qq[<textarea rows="$rows"$2>$3</textarea>];
}egis;
print;
}
__DATA__
<textarea rows="123" bar="baz">howdy</textarea>
Output:
<textarea rows="1" bar="baz">howdy</textarea>
The syntax you are using to embed code is only valid in the "match" portion of the substitution (the left hand side). To embed code in the right hand side (which is a normal Perl double quoted string), you can do this:
$file =~ s{<textarea rows="(.+?)"(.*?)>(.*?)</textarea>}
{<textarea rows="@{[ length($3)/80 ]}"$2>$3</textarea>}gis;
This uses the Perl idiom of "some string @{[ embedded_perl_code() ]} more string".
But if you are working with a very complex statement, it may be easier to put the substitution into "eval" mode, where it treats the replacement string as Perl code:
$file =~ s{<textarea rows="(.+?)"(.*?)>(.*?)</textarea>}
{'<textarea rows="' . (length($3)/80) . qq{"$2>$3</textarea>}}gise;
Note that in both examples the regex is structured as s{}{}. This not only eliminates the need to escape the slashes, but also allows you to spread the expression over multiple lines for readability.
Must this be done with regex? Parsing any markup language (or even CSV) with regex is fraught with error. If you can, try to utilize a standard library:
http://search.cpan.org/dist/HTML-Parser/Parser.pm
Otherwise you risk the revenge of Cthulhu:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
(Yes, the article leaves room for some simple string-manipulation, so I think your soul is safe, though. :-)
I believe your problem is an unescaped /
If it's not the problem, it certainly is a problem.
Try this instead, note the \/80
$file=~s/<textarea rows="(.+?)"(.*?)>(.*?)<\/textarea>/<textarea rows="(?{ length($3)\/80 })"$2>$3<\/textarea>/gis;
The basic pattern for this code is:
$file =~ s/some_search/some_replace/gis;
The gis are options, which I'd have to look up. I think g = global, i = case insensitive, s = nothing comes to mind right now.
First, you need to quote the / inside the expression in the replacement text (otherwise perl will see a s/// operator followed by the number 80 and so on). Or you can use a different delimiter; for complex substitutions, matching brackets are a good idea.
Then you get to the main problem, which is that (?{...}) is only available in patterns. The replacement text is not a pattern, it's (almost) an ordinary string.
Instead, there is the e modifier to the s/// operator, which lets you write a replacement expression rather than replacement string.
$file =~ s(<textarea rows="(.+?)"(.*?)>(.*?)</textarea>)
("<textarea rows=\"" . (length($3)/80) . "\"$2>$3</textarea>")egis;
As per http://perldoc.perl.org/perlrequick.html#Search-and-replace, this can be accomplished with the "evaluation modifier s///e", i.e., your gis must have an extra e in it.
The evaluation modifier s///e wraps an eval{...} around the replacement string and the evaluated result is substituted for the matched substring. Some examples:
# convert percentage to decimal
$x = "A 39% hit rate";
$x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"