Tcl regexp does not escape asterisk (*) - regex

In my script I get a string that looks like this:
Reading thisfile.txt
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
** Error: (errorcode) Cannot access file "somedir/anotherlib". <--
No such file or directory. (errno = ENOENT) <--
Reading anotherfile.txt
.....
But the two marked lines with the error code only appear from time to time.
I'm trying to use a regular expression to get the lines after Reading thisfile.txt, up to the line before either Reading anotherfile.txt or, if it is present, the line starting with **.
So result should in every case look like this:
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
I have tried it with this regexp:
set pattern ".*Reading thisfile.txt\n(.*)\n.*Reading .*$"
Then I do
regexp -all $pattern $data -> result
But that only works if there is no error message.
So I'm trying to look for the *.
set pattern ".*Reading thisfile.txt\n(.*)\n.*\[\*|Reading\].*$"
But that also does not work. The part with ** Error is still there.
I wonder why. This one doesn't even compile:
set pattern ".*Reading thisfile.txt\n(.*)\n.*\*?.*Reading .*$"
Any idea how I can find the * without including it in the match?

From the way you wrote your regex, you will have to use braces:
set pattern {.*Reading thisfile\.txt\n(.*)\n.*\*?.*Reading .*$}
If you used quotes, you would have had to use:
set pattern ".*Reading thisfile\\.txt\n(.*)\n.*\\*?.*Reading .*$"
i.e. basically put a second backslash to escape the first ones.
The above will be able to grab something; albeit everything between the first and the last Reading.
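To see what the quoting does before the regex engine ever sees the pattern, here is a minimal sketch. It uses only the tail of your third pattern, and the subject string "a*b" is just a placeholder I picked for illustration:

```tcl
# The same pattern text, once in double quotes and once in braces.
set quoted ".*\*?.*"
set braced {.*\*?.*}

# In double quotes Tcl performs backslash substitution first, so \* becomes *.
puts $quoted
# In braces nothing is substituted, so the regex engine receives \* intact.
puts $braced

# The quoted form ends up with one quantifier directly after another,
# which the regex compiler rejects; the braced form compiles fine.
set quotedFails [catch {regexp -- $quoted "a*b"} msg]
set bracedFails [catch {regexp -- $braced "a*b"}]
puts "quoted fails: $quotedFails, braced fails: $bracedFails"
```

That is why the third pattern in the question does not even compile, while the braced version does.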
If you want to match from Reading thisfile.txt to the next line beginning with asterisk, then you could use:
set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}
regexp -all -lineanchor -- $pattern $data -> result
(?=^Reading|^\*) is a positive lookahead and I changed your (.*) to (.*?) so that you really get all the occurrences and not from the first to the last Reading.
The positive lookahead will match if either Reading or * is ahead and are both starting on a new line.
-lineanchor makes ^ match at every beginning of line instead of at the start of the string.
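Here is a small self-contained run of that pattern, as a sketch. The two input strings are reconstructed from the sample in the question (one with the error lines, one without), so treat them as assumptions about what your real data looks like:

```tcl
# Input with the optional error lines present.
set withError [join {
    {Reading thisfile.txt}
    {"lib" maps to directory somedir/work.}
    {"superlib" maps to directory somedir/work.}
    {"anotherlib" maps to directory somedir/anotherlib.}
    {** Error: (errorcode) Cannot access file "somedir/anotherlib".}
    {No such file or directory. (errno = ENOENT)}
    {Reading anotherfile.txt}
} \n]

# The same input without the error lines.
set withoutError [join {
    {Reading thisfile.txt}
    {"lib" maps to directory somedir/work.}
    {"superlib" maps to directory somedir/work.}
    {"anotherlib" maps to directory somedir/anotherlib.}
    {Reading anotherfile.txt}
} \n]

set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}

# Both inputs should yield the same three "maps to directory" lines.
regexp -lineanchor -- $pattern $withError -> resultA
regexp -lineanchor -- $pattern $withoutError -> resultB
puts $resultA
puts [expr {$resultA eq $resultB}]
```

The lazy (.*?) stops at the first newline that is followed by a line starting with either Reading or *, whichever comes first.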
I forgot to mention that if you have more than one match, you will have to set the results of the regexp and use the -inline modifier instead of using the above construct (else you'll get only the last submatch)...
set results [regexp -all -inline -lineanchor -- $pattern $data]
foreach {main sub} $results {
    puts $sub
}

I'm unfamiliar with tcl but the following regex will give you matches of which the 1st capture-group contains the filename and the 2nd capture-group contains all the lines you want:
^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)
Basically the (?:[^\n]|\n(?!Reading|\*\*))* is saying "Match anything that isn't a new-line character or a new-line character not followed by either Reading or **".
What I'm getting from Jerry's answer is you'd define that in tcl like so:
set pattern {^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)}

Related

regular expression with negative matching

I want to do a regular expression to get the comment.
I want to distinguish between a single-line comment, /*afdafad */, and a multi-line comment, /* appple .......
Single comment is ok, but I am confused with multiple line comment.
I tried this:
set line "/* using cmos4 delaymodel */"
regexp {\/\*.+[^*][^/]} $line
puts [regexp -inline {\/\*.*[^\*][^/]} $line]
Output:
{/* using cmos4 delaymodel *}
I can't escape the * symbol.
I expected to match a line that contains /* but no */ in $line, but I failed. How should I modify my regular expression?
Strictly speaking, this is not an answer to the regex-centric question. I just wanted to point out that in Tcl, you do not have to resort to a regex in your particular case (esp. if you assume your commented sources being well-formed etc.).
Suggestion
You may want to consider an exercise of textual polishing, i.e., pre-processing your commented source into a source containing Tcl command sequences: [cmd ...]. In your case, delimiters opening and closing comments, respectively, turn into opening and closing brackets of a command sequence. The command executed could be a proc such as comment below, capturing and further handling your comment bodies or returning a placeholder into the processed text. Actual command execution (that is, comment capture) is then triggered by applying [subst] on the preformatted source.
Watch:
set input {/* this is a
multiline comment */ /* This is a [single line] comment */}
proc comment {body} {
    puts "got a comment: '$body'"
    return "/* ---%<--- */"
}
set tmp [string map {"[" "\[" "]" "\]" "/*" "[comment {" "*/" "}]"} $input]
set output [subst -novariables -nobackslashes $tmp]
Comments
Obviously, this gives you no direct means to validate the use of comment syntax etc. Either you are in a position to assume valid syntax use or, alternatively, you may check the pre-formatted Tcl string to be a complete Tcl script: [info complete $tmp]. This will only catch certain occurrences of unbalanced brackets (comment delimiters), though.
The discrimination between single-line vs. multi-line comments is not critical for capturing comments.
Depending on the source syntax, you would have to protect characters that could be misinterpreted as Tcl syntax during [subst]. E.g., brackets as genuine syntax elements or $. This must be controlled for using escapes using [string map] and restrictions to [subst] (-novariables, -nobackslashes).
It doesn't work because while [^\*] doesn't accept the *, [^/] will. The engine solves the match by letting [^\*] consume the blank before the *.
If you do
regexp -inline {(/\*.*)\*/} $line
you get
{/* using cmos4 delaymodel */} {/* using cmos4 delaymodel }
This is probably the easiest. You can get the capture by either one of
lindex [regexp -inline {(/\*.*)\*/} $line] 1
regexp {(/\*.*)\*/} $line -> a
In the latter case, the variable -> gets the full match and a gets the capture.
If the comments don't contain any asterisks, you could also use the regex /\*[^*]*, i.e. match everything from a comment start up to but not including the first asterisk.
(And you don't need to escape slashes in Tcl regexes, they are slash-friendly.)
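If the actual goal is only to tell whether a line opens a comment that it does not close, a plain string search may be simpler than a regex. This is a sketch of a different technique than the regexes above; the proc name opensUnclosedComment is made up for illustration, and it assumes a line holds at most one comment start that matters:

```tcl
# Hypothetical helper: returns 1 if the line contains /* with no */ after it.
proc opensUnclosedComment {line} {
    set start [string first "/*" $line]
    if {$start < 0} {
        # No comment opener at all.
        return 0
    }
    # Look for the closer only after the opener, so "/*/" does not count.
    set close [string first "*/" $line [expr {$start + 2}]]
    return [expr {$close < 0}]
}

puts [opensUnclosedComment "/* using cmos4 delaymodel */"]
puts [opensUnclosedComment "/* apple"]
puts [opensUnclosedComment "no comment here"]
```

The first and third calls report 0 and the second reports 1, since only "/* apple" opens a comment without closing it.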
Assuming that what you are matching the regex against doesn't contain irregularities such as strings that look like comments (e.g. in JavaScript something like var s = '/* incorrect comment */'), and that you are not too familiar with Tcl's regexp, then the chances are pretty high that your method to distinguish single comments might also be wrong. This is because by default, . in Tcl's regex matches newlines.
Thus for single line comments only, you might need something like:
regexp -linestop -inline -- {/\*.*\*/} $line
Without -linestop, the above will be able to match both single line comments and multi line comments.
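A quick way to check the effect of -linestop, as a sketch with two made-up comment strings:

```tcl
set single "/* one line */"
set multi  "/* first line\nsecond line */"
set re {/\*.*\*/}

# Default: . also matches newlines, so both comments match.
set defaultSingle [regexp -- $re $single]
set defaultMulti  [regexp -- $re $multi]

# With -linestop, . stops at newlines, so only the single-line comment matches.
set stopSingle [regexp -linestop -- $re $single]
set stopMulti  [regexp -linestop -- $re $multi]

puts "default:   single=$defaultSingle multi=$defaultMulti"
puts "-linestop: single=$stopSingle multi=$stopMulti"
```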
And for multiline comments only, something like the below to force a newline inside the comment:
regexp -linestop -inline -- {/\*(?:[^*]|\*[^/])*?(?:[\r\n]+.*?)+\*/} $comment
Note: the second .* being lazy or the + being greedy has no impact on the regex here, because all of these are lazy due to the first quantifier being lazy. I made the second .* lazy because to me it looks a bit more explicit that this one absolutely needs to be lazy. The edge case it takes care of is something like this:
/* this is a
multiline comment */ /* This is a single line comment */

Tcl Remove all characters after a string without removing the string

In Tcl, is there a way to trim out all characters AFTER a designated string? I have seen lots of posts on removing everything after and including the string, but not what I am hoping to do. I have a script that searches for file names with the suffix .sv but some of them are .sv.**bunch of random stuff**. and I don't need the random stuff as it is not relevant to me.
I have experimented with different regsub and string trim commands but they always remove the .sv as well.
The results being appended to a list are similar to as follows...
test_module_1.sv.random_stuff
test_module_2.sv.random_stuff
test_module_3.sv.random_stuff
test_module_4.sv.random_stuff
test_module_5.sv.random_stuff
etc etc
You can put matched parts of a regex pattern back into the replacement when you use regsub. An example:
regsub {(\.sv).*} $str {\1} new
This will match .sv and anything after it, if any, and replace that with the first matched group, that is the part between parens, or in this case, .sv, so that an input of example.sv.random will become example.sv.
However, you can also easily replace with .sv like so:
regsub {\.sv.*} $str {.sv} new
Or another approach not involving replacing would be to get the part of the string up until the .sv part:
string range $str 0 [expr {[string first ".sv" $str]+2}]
Here [string first ".sv" $str] gets the position of .sv in the string (if there are multiple, it will get the first), adds 2 characters (sv after . are 2 chars long) to it and string range gets all characters up to and including .sv.
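One thing to watch with this approach: if .sv does not occur at all, string first returns -1 and the range would keep only the first two characters. A minimal sketch that guards against that, with a made-up proc name trimAfterSv and file names modelled on your list:

```tcl
# Hypothetical helper: keep everything up to and including the first ".sv".
proc trimAfterSv {name} {
    set pos [string first ".sv" $name]
    if {$pos < 0} {
        # No ".sv" in the name: return it unchanged.
        return $name
    }
    # ".sv" is three characters, so the last one sits at $pos + 2.
    return [string range $name 0 [expr {$pos + 2}]]
}

foreach name {test_module_1.sv.random_stuff test_module_2.sv plain.txt} {
    puts [trimAfterSv $name]
}
```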
Or if you want to stick with regexes:
regexp {.+?\.sv} $str match
$match will contain the result string. The expression used grabs all characters up to and including .sv.

Non-greedy match from end of string with regsub

I have a folder path like following:
/h/apps/new/app/k1999
I want to remove the /app/k1999 part with the following regular expression:
set folder "/h/apps/new/app/k1999"
regsub {\/app.+$} $folder "" new_folder
But the result is /h: too many elements are being removed.
I noticed that I should use non-greedy matching, so I change the code to:
regsub {\/app.+?$} $folder "" new_folder
but the result is still /h.
What's wrong with the above code?
Non-greedy simply means that it will try to match the least amount of characters and increase that amount if the whole regex didn't match. The opposite - greedy - means that it will try to match as many characters as it can and reduce that amount if the whole regex didn't match.
$ in regex means the end of the string. Therefore something.+$ and something.+?$ will be equivalent, it is just that one will do more retries before it matches.
In your case /app.+ is matched by /apps and this is the first occurrence of /app in your string. You can fix it by being more explicit and adding the / that follows /app:
regsub {/app/.+$} $folder "" new_folder
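Putting the three variants side by side makes the point, as a small sketch using the path from the question (the variable names greedy, lazy and fixed are only for this illustration):

```tcl
set folder "/h/apps/new/app/k1999"

# Greedy and non-greedy both start at the first /app (inside /apps)
# and both have to run to the end of the string because of $.
regsub {/app.+$}  $folder "" greedy
regsub {/app.+?$} $folder "" lazy

# Requiring the slash after app skips /apps and starts at /app/k1999.
regsub {/app/.+$} $folder "" fixed

puts $greedy
puts $lazy
puts $fixed
```

The first two print /h and the last one prints /h/apps/new.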
If you are looking to match app as a whole word, you can make use of the word boundaries that in Tcl are \m and \M:
\m matches only at the beginning of a word
\M matches only at the end of a word
We only need the \M as / is a non-word character and we do not need \m:
set folder "/h/apps/new/app/k1999"
regsub {/app\M.+$} $folder "" newfolder
puts $newfolder
Result: /h/apps/new (we remove everything from a whole word app up to the end.)
If you want to remove just a part of the string inside the path, you can use negated class [^/]+ to make sure you only match a subpart of a path:
regsub {/app/[^/]+} $folder "" newfolder
The regular expression engine always starts matching as soon as it can; the greediness doesn't affect this. This means that in this case, it always starts matching too soon; you want the last match, not the first one.
If you use regexp -all -indices -inline, you can find out where the last match starts. That lets you then remove the part you actually don't want (e.g., by replacing it with an empty string):
set folder "/h/apps/new/app/k1999"
set indices [regexp -all -indices -inline {/app} $folder]
# This gets this value: {2 5} {11 14}
# If we have indices — if we had a match — we can do the rest of our processing
if {[llength $indices] > 0} {
    # Get the '11'; the first sub-element of the last element
    set index [lindex $indices end 0]
    # Replace '/app/k1999' with the empty string
    set newfolder [string replace $folder $index end ""]
} else {
    set newfolder $folder; # In case there's no match...
}
You can use a regular expression substitution operation to remove a directory suffix from a path name, but that doesn't mean you should.
file join {*}[lmap dir [file split $folder] {if {$dir ne {app}} {set dir} break}]
# -> /h/apps/new
A path name is a string, but more properly it's a list of directory names:
file split $folder
# -> / h apps new app k1999
What you want is the sublist of directory names up to, but not including, the directory named "app".
lmap dir [file split $folder] {if {$dir ne {app}} {set dir} break}
# -> / h apps new
(The directory name can be tested however you wish; a couple of possibilities are {$dir ni {foo app bar}} to stop at alternative names, or {![string match app-* $dir]} for any name beginning with "app-".)
And when you've gotten the list of directory names you wanted, you join the elements of it back to a path name again, as above.
So why should you do it this way instead of by using a regular expression substitution operation? This question illustrates the problem well. Unless one is an RE expert or takes great care to read the documentation, one is likely to formulate a regular expression based on a hunch. In the worst case, it works the first time. If not, one is tempted to tinker with it until it does. And any sufficiently ununderstood (yep, that is a word) RE will seem to work most of the time with occasional false positives and negatives to keep things interesting.
Split it, truncate it, join it. Can't go wrong. And if it does, it goes obviously wrong, forcing you to fix it.
Documentation: break, file, if, lmap, set

Extract text from a multiline string using Perl

I have a string that covers several lines. I need to extract the text between two strings. For example:
Start Here Some example
text covering a few
lines. End Here
I need to extract the string, Start Here Some example text covering a few lines.
How do I go about this?
Use the /s regex modifier to treat the string as a single line:
/s
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
$string =~ /(Start Here.*)End Here/s;
print $1;
This will capture up to the last End Here, in case it appears more than once in your text.
If this is not what you want, then you can use:
$string =~ /(Start Here.*?)End Here/s;
print $1;
This will stop matching at the very first occurrence of End Here.
print $1 if /(Start Here.*?)End Here/s;
Wouldn't the correct modifier to treat the string as a single line be (?s) rather than (/s) ? I've been wrestling with a similar problem for quite a while now and the RegExp Tester embedded in JMeter's View Results Tree listener shows my regular expression extractor with the regex
(?s)<FMSFlightPlan>(.*?)</FMSFlightPlan>
matches
<FMSFlightPlan>
C87D
AN NTEST/GL
- FPN/FN/RP:DA:GCRR:AA:EIKN:F:SAMAR,N30540W014249.UN873.
BAROK,N35580W010014..PESUL,N40529W008069..RELVA,N41512W008359..
SIVIR,N46000W008450..EMPER,N49000W009000..CON,N53545W008492
</FMSFlightPlan>
while the regex
(?s)<FMSFlightPlan>(.*?)</FMSFlightPlan>
does not match. Other regex testers show the same result. However, when I try to execute the script I get the Beanshell Assertion error:
Assertion failure message: org.apache.jorphan.util.JMeterException: Error invoking bsh method: eval Sourced file: inline evaluation of: ``import java.io.*; //write out the data results to a file outfile = "/Users/Dani . . . '' Token Parsing Error: Lexical error at line 12, column 380. Encountered: "\n" (10),
So something else is definitely wrong with mine. Anyway, just a suggestion

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello#stack\n#overflow\n";
$data =~ s/^?#//g ;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^?#//g ;
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]+)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]+) and matches all leading whitespace (spaces and tabs).
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.