Regular expression literal-text span - regex

Is there any way to indicate to a regular expression a block of text that is to be searched for explicitly? I ask because I have to match a very, very long piece of text which contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.
Rinse, repeat.
Needless to say, I don't really want to have to run through the entire thing and have to escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?
Edit:
Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?

The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:
proc re_escape str {
# Every non-word char gets a backslash put in front
regsub -all {\W} $str {\\&}
}
set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString
Where you have a whole literal string, you have two other alternatives:
regexp "***=$literal" $someString
regexp "(?q)$literal" $someString
However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
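As a minimal sketch of the difference between the two approaches: the helper lets an exactly matched literal sit between ordinary regex fragments, while ***= treats the whole pattern as literal text. The sample literal and haystack below are made up for illustration and are not from the question.

```tcl
# Escape every non-word character so the string matches literally in a regex.
proc re_escape {str} {
    return [regsub -all {\W} $str {\\&}]
}

# Made-up sample data, not from the question.
set literal  {**$^[x]*$%}
set haystack {start   **$^[x]*$%   end}

# Flexible whitespace around an exactly matched literal.
set pattern "start\\s+[re_escape $literal]\\s+end"
set mixed [regexp $pattern $haystack]

# Pure-literal search of the same text, using the ***= director.
set pure [regexp "***=$literal" $haystack]

puts "mixed=$mixed pure=$pure"
```

Both searches should report a match here; only the first one could be extended with further regex syntax around the literal.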

No, Tcl does not have such a feature.
If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:
set fixed1 {.*?[]} ;# match the literal five-byte sequence .*?[]
set fixed2 {???} ;# match the literal three-byte sequence ???
set pattern "this.*and.*that"
regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]" $someString
You would need to supply the definition for re_escape but the solution should be pretty obvious.

A Tcl regular expression can be specified with the q metasyntactical directive to indicate that the expression is literal text:
% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}
This does, however, apply to the entire expression.
What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
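One possible reading of that suggestion, as a sketch: treat the returned index pairs as opening and closing delimiters and pull out the text between them with string range. The variable names (hits, between, openIdx, closeIdx) are illustrative.

```tcl
set string {this string contains *emphasis* and 2+2 math?}

# Find every literal "*"; (?q) makes the whole pattern literal text.
set hits [regexp -inline -all -indices {(?q)*} $string]

# Walk the matches two at a time and extract what lies between each pair.
set between {}
foreach {openIdx closeIdx} $hits {
    if {$closeIdx eq ""} { break }   ;# unpaired trailing delimiter
    set from [expr {[lindex $openIdx 1] + 1}]
    set to   [expr {[lindex $closeIdx 0] - 1}]
    lappend between [string range $string $from $to]
}
puts $between
```

For the sample string this collects the single word between the two asterisks.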

I believe Perl and Java support the \Q \E escape, so
\Q.*.*()\E
..will actually match the literal ".*.*()"
OR
Bit of a hack, but replace the literal section with some text which does not need escaping and which will not appear elsewhere in your searched string. Then build the regex using this metacharacter-free text. A 100-digit random sequence, for example. Then, when your regex matches at a certain position and length in the doctored string, you can calculate whereabouts it should appear in the original string and what length it should be.

Related

matching cond in perl using double exclamation

if ($a =~ m!^$var/!)
$var is a key in a two dimensional hash and $a is a key in another hash.
What is the meaning of this expressions?
This is a regular expression ("regex"), where the ! character is used as the delimiter for the pattern that is to be matched in the string that it binds to via the =~ operator (the $a† here).
It may clear it up to consider the same regex with the usual delimiter instead, $a =~ /^$var\// (then m may be omitted); but now any / used in the pattern clearly must be escaped. To avoid that unsightly and noisy \/ combo one often uses another character for the delimiter, as nearly any character may be used (my favorite is the curlies, m{^$var/}). ‡ §
This regex in the question tests whether the value in the variable $a begins with (by ^ anchor) the value of the variable $var followed by / (variables are evaluated and the result used). §
† Not a good choice for a variable name since $a and $b are used by the builtin sort
‡ With the pattern prepared ahead of time the delimiter isn't even needed
my $re = qr{^$var/};
if ($string =~ $re) ...
(but I do like to still use // then, finding it clearer altogether)
Above I use qr but a simple q() would work just fine (while I absolutely recommend qr). These take nearly any characters for the delimiter, as well.
§ Inside a pattern the evaluated variables are used as regex patterns, which is wrong in general (when this is intended they should be compiled using qr and thus used as subpatterns).
An unimaginative example: a variable $var = q(\s) (literal backslash followed by letter s) evaluated inside a pattern yields the \s sequence which is then treated as a regex pattern, for whitespace. (Presumably unintended; we just wanted \ and s.)
This is remedied by using quotemeta, /\Q$var\E/, so that possible metacharacters in $var are escaped; this results in the correct pattern for the literal characters, \\s. So a correct way to write the pattern is m{^\Q$var\E/}.
Failure to do this also allows the injection bug. Thanks to ikegami for commenting on this.
The match operator (m/.../) is one of Perl's "quote-like" operators. The standard usage is to use slashes before and after the regex that goes in the middle of the operator (and if you use slashes, then you can omit the m from the start of the operator). But if the regex itself contains a slash then it is convenient to use a different delimiter instead to avoid having to escape the embedded slash. In your example, the author has decided to use exclamation marks, but any non-whitespace character can be used.
Many Perl operators work like this - m/.../, s/.../.../, tr/.../.../, q/.../, qq/.../, qr/.../, qw/.../, qx/.../ (I've probably forgotten some).

tcl regular expression, attempting to pull out a string between two patterns

Greetings!
I am trying to use tcl regular expressions to strip off unwanted characters and keep the desired string.
The 4 basic string types are
I34/pAVDD_3
I32/pDVDD_15_2
I999/pAGND
I3/pDOUT_LG0
What I want to capture is what's in-between the p and the end of the string or the last underscore & number if it exists. With the strings above I want to capture AVDD, DVDD_15, AGND, and DOUT_LG0.
I thought I had it with [p](\w*)?[_][\d*] but it doesn't work with I3/pDOUT_LG0, and after quite a while of trying different things, I can't find a pattern that will work.
Thanks!
How about
regexp {p(?:(\w+)_\d|(\w+))$} $str -> c1 c2
set result $c1$c2
One or the other will be empty, so the result is a simple concatenation of them.
Another possible solution is to strip off the unwanted parts:
regsub -all {.+p|_\d$} $str {}
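A quick way to check both suggestions is to run them over the four sample strings from the question. This is only a sketch; the list and variable names (samples, captured, stripped) are illustrative.

```tcl
set samples {I34/pAVDD_3 I32/pDVDD_15_2 I999/pAGND I3/pDOUT_LG0}

set captured {}
set stripped {}
foreach str $samples {
    # Capture approach: one of c1/c2 is empty, so concatenation picks the hit.
    regexp {p(?:(\w+)_\d|(\w+))$} $str -> c1 c2
    lappend captured $c1$c2
    # Strip approach: drop everything up to "p" and a trailing "_<digit>".
    lappend stripped [regsub -all {.+p|_\d$} $str {}]
}
puts $captured
puts $stripped
```

Both lists should come out as the four names the question asks for.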
Documentation: regexp, regsub, Syntax of Tcl regular expressions

regexp if start with \{ or \"

I'm trying to write a regular expression that tests whether a variable starts with a string character in Tcl. I wrote this code, but it doesn't work:
if {[regexp {^\"\{.*} $data]} {puts "something" }
*string char in TCL starts with { or "
You need to pick the right regular expression and use it correctly. This can get a lot less confusing if you store the RE in a variable first, particularly with large regular expressions, but even in this case it helps you understand the difference between the literal RE and how it is used.
set RE {^[\"\{]}
if {[regexp $RE $theString]} {
puts "something"
}
Note that Tcl does not anchor its RE matching by default, so you don't need a leading or trailing .* if you are just determining if a RE matches.
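A small sketch of that RE against a few made-up inputs, two that start with a delimiter and one that does not:

```tcl
set RE {^[\"\{]}

# Hypothetical inputs, chosen only for illustration.
set results {}
foreach candidate [list "\"quoted\"" "{braced}" "plain"] {
    lappend results [regexp $RE $candidate]
}
puts $results
```

Because the match is unanchored at the end, the RE only has to describe the first character.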

Tcl Remove all characters after a string without removing the string

In Tcl, is there a way to trim out all characters AFTER a designated string? I have seen lots of posts on removing everything after and including the string, but that is not what I am hoping to do. I have a script that searches for file names with the suffix .sv, but some of them are .sv.**bunch of random stuff**, and I don't need the random stuff as it is not relevant to me.
I have experimented with different regsub and string trim commands but they always remove the .sv as well.
The results being appended to a list are similar to as follows...
test_module_1.sv.random_stuff
test_module_2.sv.random_stuff
test_module_3.sv.random_stuff
test_module_4.sv.random_stuff
test_module_5.sv.random_stuff
etc etc
You can put matched parts of a regex pattern back into the replacement when you use regsub. An example:
regsub {(\.sv).*} $str {\1} new
This removes .sv and anything after it, if present, and replaces that with the first matched group, that is, the part between parens (in this case .sv), so that an input of example.sv.random becomes example.sv.
However, you can also easily replace with .sv like so:
regsub {\.sv.*} $str {.sv} new
Or another approach not involving replacing would be to get the part of the string up until the .sv part:
string range $str 0 [expr {[string first ".sv" $str]+2}]
Here [string first ".sv" $str] gets the position of .sv in the string (if there are multiple, it will get the first), adds 2 characters (sv after . are 2 chars long) to it and string range gets all characters up to and including .sv.
Or if you want to stick with regexes:
regexp {.+?\.sv} $str match
$match will contain the result string. The expression used grabs all characters up to and including .sv.
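One caveat with the string range variant: when a name contains no .sv at all, string first returns -1 and the range would keep just the first two characters. A guarded sketch, using made-up file names:

```tcl
# Hypothetical file names; the last one has no .sv at all.
set names {test_module_1.sv.random_stuff test_module_2.sv plain.txt}

set trimmedNames {}
foreach str $names {
    set pos [string first ".sv" $str]
    if {$pos < 0} {
        # No suffix found: keep the name unchanged instead of cutting it.
        set trimmed $str
    } else {
        # Keep everything up to and including ".sv".
        set trimmed [string range $str 0 [expr {$pos + 2}]]
    }
    lappend trimmedNames $trimmed
}
puts $trimmedNames
```

Whether unmatched names should be kept or skipped depends on the script; keeping them is just the assumption made here.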

Regexp trouble in TCL

I have a question about regexp in Tcl.
How can I find and change some text in a Tcl string variable with the regexp function?
Example of the text:
/folder/folder2/test-c+a+t -test1 -test2
I want to receive:
/folder/folder2/test-d+o+g
Or for example it can be just:
test-c+a+t
and I want to receive:
test-d+o+g
Sorry for this addition:
In this situation:
/test-c+a+t/folder2/test-c+a+t -test1 -test2
I want to receive:
/test-c+a+t/folder2/test-d+o+g -test1 -test2
% set old {/folder/folder2/test-c+a+t -test1 -test2}
/folder/folder2/test-c+a+t -test1 -test2
% set new [regsub {(test)-c\+a\+t.*} $old {\1-d+o+g}]
/folder/folder2/test-d+o+g
Note the literal + symbols need to be escaped because they are regular expression quantifiers.
http://tcl.tk/man/tcl8.5/TclCmd/re_syntax.htm
http://tcl.tk/man/tcl8.5/TclCmd/regsub.htm
In the specific case you mention here you would do better to use string map. Regular expressions are more flexible though so it all depends how specific your task is.
set modified [string map {test-c+a+t test-d+o+g} $original]
Otherwise, there is no substitute for learning how to use regular expression syntax. It is useful pretty much all the time so read the manual page, try various expressions and re-read the manual when you fail to match what you expected. Also try out sed, awk and grep for learning to use regexp's.
Either use string map or use regsub (possibly with the -all flag). Here are some examples of the two approaches:
set myString [string map [list "test-c+a+t" "test-d+o+g"] $myString]
set myString [regsub -all "***=test-c+a+t" $myString "test-d+o+g"]
### Or equivalently, for older Tcl versions...
regsub -all "***=test-c+a+t" $myString "test-d+o+g" myString
The string map can apply multiple changes in one sweep (the mapping a b b a would swap all a and b characters) but it only ever replaces literal strings and always replaces everything it can. The regsub command can do much more complex transformations and can be much more selective about what it replaces, but it does require you to use regular expressions and it is slower in the case where a string map can do an equivalent job. However, the special leading ***= in the pattern means that the rest of the pattern is a literal string.
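For the follow-up case in the question, where only the last occurrence should change, neither string map nor regsub -all fits directly. One possible sketch, which is an alternative technique rather than anything proposed above, uses string last and string replace so that no regex escaping is involved:

```tcl
set original {/test-c+a+t/folder2/test-c+a+t -test1 -test2}
set old {test-c+a+t}
set new {test-d+o+g}

# Locate the last literal occurrence of the old text.
set pos [string last $old $original]
if {$pos >= 0} {
    set stop [expr {$pos + [string length $old] - 1}]
    set modified [string replace $original $pos $stop $new]
} else {
    # Nothing to replace: leave the string as it is.
    set modified $original
}
puts $modified
```

The earlier occurrence in the path is left untouched and the trailing options are preserved.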