regular expression with negative matching - regex

I want to write a regular expression that extracts a comment.
I want to distinguish between a single-line comment such as /*afdafad */ and a multi-line comment such as /* appple .......
Single-line comments work, but I am confused by multi-line comments.
I tried this:
set line "/* using cmos4 delaymodel */"
regexp {\/\*.+[^*][^/]} $line
puts [regexp -inline {\/\*.*[^\*][^/]} $line]
Output:
{/* using cmos4 delaymodel *}
I can't exclude the * symbol.
I expected this to match a line that contains /* but no */, but it failed. How should I modify my regular expression?

Strictly speaking, this is not an answer to the regex-centric question. I just wanted to point out that in Tcl, you do not have to resort to a regex in your particular case (esp. if you assume your commented sources being well-formed etc.).
Suggestion
You may want to consider an exercise of textual polishing, i.e., pre-processing your commented source into a source containing Tcl command sequences: [cmd ...]. In your case, delimiters opening and closing comments, respectively, turn into opening and closing brackets of a command sequence. The command executed could be a proc such as comment below, capturing and further handling your comment bodies or returning a placeholder into the processed text. Actual command execution (that is, comment capture) is then triggered by applying [subst] on the preformatted source.
Watch:
set input {/* this is a
multiline comment */ /* This is a [single line] comment */}
proc comment {body} {
    puts "got a comment: '$body'"
    return "/* ---%<--- */"
}
set tmp [string map {"[" "\[" "]" "\]" "/*" "[comment {" "*/" "}]"} $input]
set output [subst -novariables -nobackslashes $tmp]
Comments
Obviously, this gives you no direct means to validate the use of comment syntax etc. Either you are in a position to assume valid syntax use or, alternatively, you may check the pre-formatted Tcl string to be a complete Tcl script: [info complete $tmp]. This will only catch certain occurrences of unbalanced brackets (comment delimiters), though.
The discrimination between single-line vs. multi-line comments is not critical for capturing comments.
Depending on the source syntax, you would have to protect characters that could be misinterpreted as Tcl syntax during [subst], e.g., brackets as genuine syntax elements, or $. This must be handled with escapes via [string map] and with restrictions on [subst] (-novariables, -nobackslashes).

It doesn't work because while [^\*] doesn't accept the *, [^/] will. The engine solves the match by letting [^\*] consume the blank before the *.
If you do
regexp -inline {(/\*.*)\*/} $line
you get
{/* using cmos4 delaymodel */} {/* using cmos4 delaymodel }
This is probably the easiest. You can get the capture by either one of
lindex [regexp -inline {(/\*.*)\*/} $line] 1
regexp {(/\*.*)\*/} $line -> a
In the latter case, the variable -> gets the full match and a gets the capture.
If the comments don't contain any asterisks, you could also use the regex /\*[^*]*, i.e. match everything from a comment start up to but not including the first asterisk.
(And you don't need to escape slashes in Tcl regexes, they are slash-friendly.)
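The backtracking described above can be reproduced outside Tcl. The following is a minimal Python sketch, not Tcl: Python's re module is a different engine, so treat it only as an analogue, although it handles these two patterns the same way. The variable name line comes from the question; the names broken and fixed are made up for illustration.

```python
import re

line = "/* using cmos4 delaymodel */"

# Two negated classes in a row do not mean "not followed by */":
# [^*] consumes the space before the asterisk, and [^/] then accepts the '*'.
broken = re.search(r"/\*.*[^*][^/]", line)
print(broken.group(0))   # /* using cmos4 delaymodel *

# Match the closing delimiter literally and capture what precedes it.
fixed = re.search(r"(/\*.*)\*/", line)
print(fixed.group(0))    # /* using cmos4 delaymodel */
print(repr(fixed.group(1)))  # '/* using cmos4 delaymodel '
```

The first result ends in a lone asterisk, which is the same output the question reports from Tcl.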

Assuming that what you are matching the regex against doesn't contain irregularities such as strings that look like comments (e.g. in JavaScript something like var s = '/* incorrect comment */'), and that you are not too familiar with Tcl's regexp, chances are pretty high that your method for detecting single-line comments is also wrong. This is because, by default, . in Tcl's regex matches newlines.
Thus for single line comments only, you might need something like:
regexp -linestop -inline -- {/\*.*\*/} $line
Without -linestop, the above will be able to match both single line comments and multi line comments.
And for multiline comments only, something like the below to force a newline inside the comment:
regexp -linestop -inline -- {/\*(?:[^*]|\*[^/])*?(?:[\r\n]+.*?)+\*/} $comment
Note: whether the second .* is lazy or the + is greedy has no impact on the regex here, because all of these are lazy due to the first quantifier being lazy. I made the second .* lazy because to me it makes it more explicit that this one absolutely needs to be lazy. The edge case it takes care of is something like this:
/* this is a
multiline comment */ /* This is a single line comment */
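An alternative to encoding the newline requirement in the pattern itself is to match every comment first and classify it afterwards. The sketch below is in Python rather than Tcl, so the flag handling differs: Python's . does not cross newlines unless re.DOTALL is given, which is the opposite of Tcl's default. The function name classify_comments is hypothetical, and the same assumption applies as above: the input contains no strings that merely look like comments.

```python
import re

# DOTALL lets '.' cross newlines (Tcl's behaviour when -linestop is absent);
# the lazy '.*?' stops at the first closing delimiter.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

def classify_comments(text):
    """Return (kind, comment) pairs, where kind is 'single' or 'multi'."""
    return [("multi" if "\n" in c else "single", c)
            for c in COMMENT.findall(text)]

source = "/* this is a\nmultiline comment */ /* This is a single line comment */"
for kind, comment in classify_comments(source):
    print(kind, repr(comment))
```

On the edge case above, this reports the first comment as multi-line and the second as single-line, without needing a separate pattern for each kind.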


Raku Regex to capture and modify the LFM code blocks

Update: Corrected code added below
I have a Leanpub flavored markdown* file named sample.md. I'd like to convert its code blocks into GitHub flavored markdown style using Raku regexes.
Here's a sample **ruby** code, which
prints the elements of an array:
{:lang="ruby"}
    ['Ian','Rich','Jon'].each {|x| puts x}
Here's a sample **shell** code, which
removes the ending commas and
finds all folders in the current path:
{:lang="shell"}
    sed s/,$//g
    find . -type d
In order to capture the lang value, e.g. ruby from the {:lang="ruby"} and convert it into
```ruby
I use this code
my @in="sample.md".IO.lines;
my @out;
for @in.kv -> $key,$val {
    if $val.starts-with("\{:lang") {
        if $val ~~ /^{:lang="([a-z]+)"}$/ { # capture lang
            @out[$key]="```$0"; # convert it into ```ruby
            $key++;
            while @in[$key].starts-with(" ") {
                @out[$key]=@in[$key].trim-leading;
                $key++;
            }
            @out[$key]="```";
        }
    }
    @out[$key]=$val;
}
The line containing the Regex gives
Cannot modify an immutable Pair (lang => True) error.
I've just started out using Regexes. Instead of ([a-z]+) I've tried (\w) and it gave the Unrecognized backslash sequence: '\w' error, among other things.
How to correctly capture and modify the lang value using Regex?
* The LFM format shown here is only approximate.
Corrected code:
my @in="sample.md".IO.lines;
my \len=@in.elems;
my @out;
my $k = 0;
while ($k < len) {
    if @in[$k] ~~ / ^ '{:lang="' (\w+) '"}' $ / {
        push @out, "```$0";
        $k++;
        while @in[$k].starts-with(" ") {
            push @out, @in[$k].trim-leading;
            $k++;
        }
        push @out, "```";
    }
    push @out, @in[$k];
    $k++;
}
for @out {print "$_\n"}
TL;DR
TL? Then read @jjemerelo's excellent answer, which not only provides a one-line solution but much more in a compact form.
DR? Aw, imo you're missing some good stuff in this answer that JJ (reasonably!) ignores. Though, again, JJ's is the bomb. Go read it first. :)
Using a Perl regex
There are many dialects of regex. The regex pattern you've used is a Perl regex but you haven't told Raku that. So it's interpreting your regex as a Raku regex, not a Perl regex. It's like feeding Python code to perl. So the error message is useless.
One option is to switch to Perl regex handling. To do that, this code:
/^{:lang="([a-z]+)"}$/
needs m :P5 at the start:
m :P5 /^{:lang="([a-z]+)"}$/
The m is implicit when you use /.../ in a context where it is presumed you mean to immediately match, but because the :P5 "adverb" is being added to modify how Raku interprets the pattern in the regex, one has to also add the m.
:P5 only supports a limited set of Perl's regex patterns. That said, it should be enough for the regex you've written in your question.
Using a Raku regex
If you want to use a Raku regex you have to learn the Raku regex language.
The "spirit" of the Raku regex language is the same as Perl's, and some of the absolute basic syntax is the same as Perl's, but it's different enough that you should view it as yet another dialect of regex, just one that's generally "powered up" relative to Perl's regexes.
To rewrite the regex in Raku format I think it would be:
/ ^ '{:lang="' (<[a..z]>+) '"}' $ /
(Taking advantage of the fact whitespace in Raku regexes is ignored.)
Other problems in your code
After fixing the regex, one encounters other problems in your code.
The first problem I encountered is that $key is read-only, so $key++ fails. One option is to make it writable, by writing -> $key is copy ..., which makes $key a read-write copy of the index passed by the .kv.
But fixing that leads to another problem. And the code is so complex I've concluded I'd best not chase things further. I've addressed your immediate obstacle and hope that helps.
This one-liner seems to solve the problem:
say S:g /\{\: "lang" \= \" (\w+) \" \} /```$0/ given "text.md".IO.slurp;
Let's try and explain what was going on, however. The error was a regular expression grammar error, caused by having a : followed by a name, all inside curly braces; {} runs code inside a regex. Raiph's answer is (obviously) correct in changing it to a Perl regular expression. What I've done here instead is to use Raku's non-destructive substitution, S, with the :g global flag so that it acts on the whole file (slurped at the end of the line; I've saved it to a file called text.md). So this slurps your target file; given puts it in the $_ topic variable, and the result is printed once the substitution has been made. The good thing is that if you want to make more substitutions, you can put another such expression in front, and it will act on the output.
Using this kind of expression is always going to be conceptually simpler, and possibly faster, than dealing with a text line by line.
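For comparison, here is the whole conversion (opening fence, de-indented body, closing fence) as a short line-based sketch in Python rather than Raku. It is only an illustration under stated assumptions: code lines are taken to be indented by four spaces, and the names lfm_to_gfm, LANG_LINE and FENCE are made up here, not part of any library.

```python
import re

FENCE = "`" * 3                                 # the three-backtick GFM fence
LANG_LINE = re.compile(r'^\{:lang="(\w+)"\}$')  # e.g. {:lang="ruby"}

def lfm_to_gfm(lines, indent="    "):
    """Convert LFM-style code blocks to GFM fenced blocks (assumed 4-space indent)."""
    out = []
    i = 0
    while i < len(lines):
        m = LANG_LINE.match(lines[i])
        if m:
            out.append(FENCE + m.group(1))      # opening fence with the language
            i += 1
            # Copy the indented code lines, stripping the assumed indent.
            while i < len(lines) and lines[i].startswith(indent):
                out.append(lines[i][len(indent):])
                i += 1
            out.append(FENCE)                   # closing fence
        else:
            out.append(lines[i])
            i += 1
    return out

sample = [
    "Here's a sample **ruby** code:",
    '{:lang="ruby"}',
    "    ['Ian','Rich','Jon'].each {|x| puts x}",
    "Some trailing text",
]
print("\n".join(lfm_to_gfm(sample)))
```

The bounds check in the inner loop also covers a code block that ends at the last line of the file.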

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
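sub repl {
    my ($r) = @_;
    # split returns the captured $N separators too; drop empty strings
    # (testing length rather than truth, so a literal "0" is kept)
    my @terms = grep { length } split /(\$\d)/, $r;
    return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;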
With capturing /(...)/ in split's pattern, the separators are returned as part of the list. Thus this extracts from $r an array of terms which are either $N or other text, in their original order and with everything kept (split drops only trailing empty fields). The list can include a leading empty string, so empty strings need to be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings break many patterns because interpolation is done first: for example, "\w" is just an escaped w (which is an unrecognized escape). Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we then get a true regex, and all modifiers may be used as well.
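Since the question mentions reusing the same data file from Python later, here is a minimal sketch of how the $N replacement syntax could be expanded there without evaluating anything as code. The function names expand and substitute are hypothetical, and only single-digit references $1 to $9 are handled, mirroring the \$\d pattern used in repl above.

```python
import re

def expand(template, match):
    """Replace each $N in template with capture group N of match."""
    def group_text(ref):
        # A group that did not participate in the match is None; use "".
        return match.group(int(ref.group(1))) or ""
    return re.sub(r"\$(\d)", group_text, template)

def substitute(pattern, template, text):
    """Apply pattern to text globally, filling template for every match."""
    return re.sub(pattern, lambda m: expand(template, m), text)

print(substitute(r"b(.)d", "_$1$1_", "abcdeabCde"))  # a_cc_ea_CC_e
```

Because the template is expanded by plain string substitution in a single pass, captured text that itself contains $1 is not expanded again.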

How to use Tcl regexp when the search string contains variables and spaces?

Here's the code where I would like to match $line with $ram using Tcl regexp.
set line { 0 DI /hdamrf};
set ram {/hdamrf};
regexp {\s+\d\s+DI\s+$ram} $line match; ## --> 0
Please help me construct a search string that matches the regular expression. It seems that to use variables in search strings, they have to be enclosed in curly braces. But curly braces don't allow me to use \s for detecting white space. Thanks.
This is one possibility:
regexp [format {\s+\d\s+DI\s+%s} $ram] $line match
This is another:
regexp "\\s+\\d\\s+DI\\s+$ram" $line match
Documentation: format, Syntax of Tcl regular expressions, regexp
Try it this way:
regexp [subst -nocommands -nobackslashes {\s+\d\s+DI\s+$ram}] $line match
For example:
% regexp [subst -nocommands -nobackslashes {\s+\d\s+DI\s+$ram}] $line match
1
See the manual page for subst:
This command performs variable substitutions, command substitutions, and backslash substitutions on its string argument and returns the fully-substituted result. The substitutions are performed in exactly the same way as for Tcl commands. As a result, the string argument is actually substituted twice, once by the Tcl parser in the usual fashion for Tcl commands, and again by the subst command.
If any of the -nobackslashes, -nocommands, or -novariables are specified, then the corresponding substitutions are not performed. For example, if -nocommands is specified, command substitution is not performed: open and close brackets are treated as ordinary characters with no special interpretation.
Since it's already in {...}, the part about the parser interpreting it twice isn't completely accurate :)
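The same idea, keeping the fixed part of the pattern free of substitution and splicing in only the variable part, can be sketched in Python for comparison. This is not Tcl. One addition that is not in the answers above: re.escape is applied to the variable, so that any regex metacharacters in it are matched literally; the Tcl snippets insert $ram unescaped. The values of line and ram are taken from the question.

```python
import re

line = " 0 DI /hdamrf"
ram = "/hdamrf"

# A raw string keeps \s and \d intact; re.escape makes the variable part literal.
pattern = r"\s+\d\s+DI\s+" + re.escape(ram)
match = re.search(pattern, line)
print(match.group(0) if match else None)
```

Escaping matters once the variable can contain characters such as . or +, which would otherwise be read as part of the pattern.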

Tcl regexp does not escape asterisk (*)

In my script I get a string that looks like this:
Reading thisfile.txt
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
** Error: (errorcode) Cannot access file "somedir/anotherlib". <--
No such file or directory. (errno = ENOENT) <--
Reading anotherfile.txt
.....
But the two marked lines with the error code only appear from time to time.
I'm trying to use a regular expression to get the lines after Reading thisfile.txt, up to the line before either Reading anotherfile.txt or, if it is there, the line starting with **.
So result should in every case look like this:
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
I have tried it with this regexp:
set pattern ".*Reading thisfile.txt\n(.*)\n.*Reading .*$"
Then I do
regexp -all $pattern $data -> result
But that only works if there is no error message.
So I'm trying to look for the *.
set pattern ".*Reading thisfile.txt\n(.*)\n.*\[\*|Reading\].*$"
But that also does not work. The part with ** Error is still there.
I wonder why. This one doesn't even compile:
set pattern ".*Reading thisfile.txt\n(.*)\n.*\*?.*Reading .*$"
Any idea how I can look for the * without including it in the match?
From the way you wrote your regex, you will have to use braces:
set pattern {.*Reading thisfile\.txt\n(.*)\n.*\*?.*Reading .*$}
If you used quotes, you would have had to use:
set pattern ".*Reading thisfile\\.txt\n(.*)\n.*\\*?.*Reading .*$"
i.e. basically put a second backslash to escape the first ones.
The above will be able to grab something; albeit everything between the first and the last Reading.
If you want to match from Reading thisfile.txt to the next line beginning with asterisk, then you could use:
set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}
regexp -all -lineanchor -- $pattern $data -> result
(?=^Reading|^\*) is a positive lookahead and I changed your (.*) to (.*?) so that you really get all the occurrences and not from the first to the last Reading.
The positive lookahead will match if either Reading or * is ahead and are both starting on a new line.
-lineanchor makes ^ match at every beginning of line instead of at the start of the string.
I forgot to mention that if you have more than one match, you will have to set the results of the regexp and use the -inline modifier instead of using the above construct (else you'll get only the last submatch)...
set results [regexp -all -inline -lineanchor -- $pattern $data]
foreach {main sub} $results {
    puts $sub
}
I'm unfamiliar with tcl but the following regex will give you matches of which the 1st capture-group contains the filename and the 2nd capture-group contains all the lines you want:
^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)
Basically the (?:[^\n]|\n(?!Reading|\*\*))* is saying "Match anything that isn't a new-line character or a new-line character not followed by either Reading or **".
What I'm getting from Jerry's answer is you'd define that in tcl like so:
set pattern {^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)}
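The lookahead approach can also be tried in Python, whose syntax for (?=...) is the same. The sketch below is an analogue, not Tcl: re.MULTILINE plays the role of -lineanchor, and re.DOTALL is needed because Python's . does not match newlines by default. The function name lib_lines is hypothetical, and the sample log is abbreviated from the question.

```python
import re

PATTERN = re.compile(
    r"^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)",
    re.MULTILINE | re.DOTALL,
)

def lib_lines(data):
    """Return the lines between the header and the next 'Reading' or '*' line."""
    m = PATTERN.search(data)
    return m.group(1).split("\n") if m else []

LIBS = [
    '"lib" maps to directory somedir/work.',
    '"superlib" maps to directory somedir/work.',
]
with_error = "\n".join(
    ["Reading thisfile.txt", *LIBS, "** Error: (errorcode) ...", "Reading anotherfile.txt"]
)
without_error = "\n".join(["Reading thisfile.txt", *LIBS, "Reading anotherfile.txt"])

print(lib_lines(with_error) == LIBS)
print(lib_lines(without_error) == LIBS)
```

Both inputs yield the same list, because the lazy group stops at whichever of the two lookahead alternatives comes first.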

Regular expression literal-text span

Is there any way to indicate to a regular expression a block of text that is to be searched for literally? I ask because I have to match a very, very long piece of text which contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.
Rinse, repeat.
Needless to say, I don't really want to have to run through the entire thing and have to escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?
Edit:
Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?
The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:
proc re_escape str {
    # Every non-word char gets a backslash put in front
    regsub -all {\W} $str {\\&}
}
set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString
Where you have a whole literal string, you have two other alternatives:
regexp "***=$literal" $someString
regexp "(?q)$literal" $someString
However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
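For readers comparing with other languages: Python ships the equivalent of the re_escape helper as re.escape, so the mix of literal and pattern parts can be sketched as follows. This is an analogue of the Tcl code above, not a replacement for it; the variable names awkward and large_string are made up for illustration.

```python
import re

awkward = "**$^{*$%\\)"   # the literal characters **$^{*$%\)

# Literal part escaped, flexible parts written as ordinary regex.
pattern = "simpleWord *" + re.escape(awkward) + " *simpleWord"

large_string = "before simpleWord **$^{*$%\\) simpleWord after"
print(bool(re.search(pattern, large_string)))  # True
```

Only the awkward literal goes through the escaping step, so the surrounding pattern stays readable.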
No, tcl does not have such a feature.
If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:
set fixed1 {.*?[]} ;# match the literal five-byte sequence .*?[]
set fixed2 {???} ;# match the literal three byte sequence ???
set pattern "this.*and.*that"
regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]" $someString
You would need to supply the definition for re_escape but the solution should be pretty obvious.
A Tcl regular expression can be specified with the q metasyntactical directive to indicate that the expression is literal text:
% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}
This does however apply to the entire expression.
What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
I believe Perl and Java support the \Q...\E escape, so
\Q.*.*()\E
..will actually match the literal ".*.*()"
OR
It's a bit of a hack, but you can replace the literal section with some text which does not need escaping and which will not appear elsewhere in your searched string, then build the regex using this metacharacter-free text. A 100-digit random sequence, for example. Then, when your regex matches at a certain position and length in the doctored string, you can calculate whereabouts it should appear in the original string and what length it should be.