RegExp - Block comments to single comment optimization - regex

Optimize?
I am having trouble with the following regular expression:
/^\s*(/?\*{1,2}(\s*(\b.*\b)\s*(\*/)?)?|\*/?\s*)$/g
Can this expression be improved? Also, if anyone can find a problem with it, please point it out. Here is my live demo in action. It works for all the conditions that I have set up below.
Test constraints
These Match
/**
/*
*
*/
/** Javadoc */
/* Block */
* Multi-line
/* Single Line */
/** A
/** A */
/* A
/* A */
These Shouldn't
7 * 8
// Regular comment
Results
After replacing the match with: // $3
I successfully converted them, despite some of them having trailing white-space:
//
//
//
// Javadoc
// Block
// Multi-line
// Single Line
// A
// A
// A
// A
Regex explained
/
^ Line start
\s* 0 or more white-space
( Start group 1
/? forward-slash (OPTIONAL)
\*{1,2} 1 to 2 asterisks
( Start group 2
\s* 0 or more white-space
( Start group 3
\b Start word boundary
.* 0 or more of anything
\b End word boundary
) End group 3
\s* 0 or more white-space
( Start group 4 (OPTIONAL)
\* Asterisk
/ Forward-slash
)? End group 4
)? End group 2 (OPTIONAL)
| OR
\* Asterisk
/? Forward-slash (OPTIONAL)
\s* 0 or more white-space
) End group 1
$ Line end
/
g Global; match all
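The conversion described above can be tried out with a short Python sketch. It applies the question's pattern to one line at a time; the helper name `to_line_comment` is made up for illustration, and trailing whitespace is trimmed here, which the original `// $3` replacement does not do.

```python
import re

# The pattern from the question, applied to a single line at a time.
BLOCK_LINE = re.compile(r'^\s*(/?\*{1,2}(\s*(\b.*\b)\s*(\*/)?)?|\*/?\s*)$')

def to_line_comment(line):
    """Return the line rewritten as a // comment, or unchanged if it does not match."""
    m = BLOCK_LINE.match(line)
    if m is None:
        return line
    text = m.group(3) or ""          # group 3 is unset for lines such as "/**" or "*/"
    return ("// " + text).rstrip()   # trim the trailing space left by an empty group 3

for sample in ["/** Javadoc */", "/*", " * Multi-line", "*/", "7 * 8", "// Regular comment"]:
    print(repr(sample), "->", repr(to_line_comment(sample)))
```

Two behaviours worth checking against your own input: a continuation line such as `* 8` is treated as a comment line, and a comment whose text starts with a non-word character, such as `/* (note) */`, is not matched at all because of the `\b` anchors.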

Parsing C/C++ style comments is a little more complex when you have to take into account string embedding, escaping, and line continuation.
This also requires a single-character consumption pass-through method (after all else is checked). That being the case, of the many possible match alternations, there is only one item you're interested in: the C-style comment /* */.
So, in the case of a global search with a lot of matches, only one kind interests you, say a capture group 1 match. Of course, in your case a simple replace won't cut it.
So, you have to sit in a global find (but not replace) loop. Each match will be a part of the source string, and no part is skipped, so each one will be appended to a destination string.
During the loop, when capture group 1 matches (a C-style comment), you can do your not-so-simple substitutions to make it a C++ comment, then append that result to the destination string.
That's the gist of it. If this can't be done in whatever language you use, then it can't be done right, that is for sure!
Here is a list of regexes to perform the conversion to // once you've captured a C-style comment. They have to be performed in order of appearance (the notation is Perl, for example only):
# s{ ^ /\* (?: [^\S\n] | \*)* }{// }xmg;
# s{ ^ (?: [^\S\n] | \*)* \*/ $ }{// }xmg;
# s{ ^ $ }{// }xmg;
# s{ ^ (?: [^\S\n] | \*)+ }{// }xmg;
# s{ (?<![\s*]) (?: [^\S\n] | \*)+ $ }{}xmg;
# s{ (?<![\s*]) (?: [^\S\n] | \*)* \*/ $ }{}xmg;
This is the regex to use in your find loop (as explained above). This regex can be found on any Perl newsgroup; it's part of the FAQ.
# (?:(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|//(?:[^\\]|\\\n?)*?\n)|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
# -----------------------------------------------------------------
(?: # Comments
( # (1 start)
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
) # (1 end)
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
)
|
(?: # Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
)
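The find loop described above might look like the following in Python. This is only a sketch: the tokenizer is the compact regex shown above, but `to_line_comments` is a simplified stand-in for the six Perl substitutions (it strips the delimiters and any leading asterisks on each line), and both function names are invented for this example.

```python
import re

# Tokenizer from the answer: group 1 is a /* ... */ comment; every other
# alternative (// comment, string literal, other text) is copied through.
TOKEN = re.compile(
    r'''(?:(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|//(?:[^\\]|\\\n?)*?\n)'''
    r'''|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)'''
)

def to_line_comments(block):
    """Simplified conversion of one /* ... */ comment into // lines."""
    body = block[2:-2]                      # drop the /* and */ delimiters
    lines = []
    for line in body.split("\n"):
        text = line.strip().lstrip("*").strip()
        lines.append(("// " + text).rstrip())
    return "\n".join(lines)

def convert(src):
    """Global find loop: append every match, rewriting only group 1 matches."""
    out = []
    for m in TOKEN.finditer(src):
        if m.group(1) is not None:
            out.append(to_line_comments(m.group(1)))
        else:
            out.append(m.group(0))
    return "".join(out)

print(convert('x = "/* not a comment */";\n/* Block */\ny = 7 * 8;\n'))
```

Note that this sketch does not handle code that follows a block comment on the same line (it would end up behind the `//`), and it drops the indentation of continuation lines.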

Related

VB.NET 2010: Matching Java multiline comments with Regex

I want to remove (Java/C/C++/..) multiline comments from a file. For this, I have written a regular expression:
/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/
This regular expression works well with Notepad++ and Geany (search and replace all with nothing). The regex behaves differently in VB.NET.
I am using:
Microsoft Visual Studio 2010 (Version 10.0.40219.1 SP1Rel)
Microsoft .NET Framework (4.7.02053 SP1Rel)
The file I'm running replacements on is not that complex. I do not need to take care of any quoted text that might start or end comments.
@sln thank you for your detailed reply, I'll also quickly explain my regex as nicely as you did!
/\* Find the beginning of the comment.
[^\*]* Match any chars, but not an asterisk.
We need to deal with finding an asterisk now:
(\*+[^\*/][^\*]*)* This regex breaks down to:
\*+ Consume asterisk(s).
[^\*/] Match any other char that is not an asterisk or a / (would end the comment!).
[^\*]* Match any other chars that are not asterisks.
( )* Try to find more asterisks followed by other chars.
\*+/ Match 1 to n asterisks and finish the comment with /.
Here are two code snippets:
First:
text
/*
* block comment
*
*/ /* comment1 */ /* comment2 */
My text to keep.
/* more comments */
more text
Second:
text
/*
* block comment
*
*/ /* comment1 *//* comment2 */
My text to keep.
/* more comments */
more text
The only difference is the space between
/* comment1 *//* comment2 */
Deleting found matches with Notepad++ and Geany works perfectly for both cases. Using regular expressions from VB.NET fails for the second example. The result for the second example after deletion looks like this:
text
more text
But it should look like this:
text
My text to keep.
more text
I am using System.Text.RegularExpressions:
Dim content As String = IO.File.ReadAllText(file_path_)
Dim multiline_comment_remover As Regex = New Regex("/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")
content = multiline_comment_remover.Replace(content, "")
I would like to have the same results with VB.NET as with Notepad++ and Geany. As answered by sln, my regex "should work in a weird way". The question is why does VB.NET fail to process this regex as intended? This question is still open.
Since sln's answer got my code working, I'll accept this answer. Although this doesn't explain why VB.NET doesn't like my regex. Thanks for all your help! I learned a lot!
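As a cross-check outside VB.NET, the same pattern can be run against both snippets in Python. This sketch does not explain the VB.NET result; it only shows that, under Python's `re` engine, the pattern keeps "My text to keep." in both cases. The variable and function names are illustrative.

```python
import re

# The asker's pattern, unchanged.
MULTILINE_COMMENT = re.compile(r'/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/')

# Second snippet from the question (no space between comment1 and comment2).
second = (
    "text\n"
    "/*\n"
    " * block comment\n"
    " *\n"
    " */ /* comment1 *//* comment2 */\n"
    "My text to keep.\n"
    "/* more comments */\n"
    "more text"
)
# First snippet: identical except for the space between the two comments.
first = second.replace("*//*", "*/ /*")

def remove_comments(src):
    return MULTILINE_COMMENT.sub("", src)

for name, snippet in (("first", first), ("second", second)):
    print(name, repr(remove_comments(snippet)))
```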
I think you could use a generalized C++ comment stripper.
It's basically:
Globally find with the regex below, replace with $2
Demo PCRE: https://regex101.com/r/UldYK5/1
Demo Python: https://regex101.com/r/avfSfB/1
# raw: (?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
# delimited: /(?m)((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:\/\*|\/\/))|[^\/"'\\\r\n]*))+|[^\/"'\\\r\n]+)+|[\S\s][^\/"'\\\r\n]*)/
(?m) # Multi-line modifier
( # (1 start), Comments
(?:
(?: ^ [ \t]* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
[ \t]* \r? \n
(?=
[ \t]*
(?: \r? \n | /\* | // )
)
)?
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: # End // comment
\r? \n
(?= # <- To preserve formatting
[ \t]*
(?: \r? \n | /\* | // )
)
| (?= \r? \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
# Quotes
# ======================
(?: # Quote and Non-Comment blocks
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| # --------------
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| # --------------
(?: # Qualified Linebreak's
\r? \n
(?:
(?= # If comment ahead just stop
(?: ^ [ \t]* )?
(?: /\* | // )
)
| # or,
[^/"'\\\r\n]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
)
)+
| # --------------
[^/"'\\\r\n]+ # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
)+ # Grab multiple instances
| # or,
# ======================
# Pass through
[\S\s] # Any other char
[^/"'\\\r\n]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end), Non - comments
If you use a particular engine that doesn't support assertions,
then you'd have to use this.
This won't preserve formatting though.
Usage same as above.
# (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
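For reference, here is how the assertion-free version could be used from Python: find globally and replace each match with group 2. It is a sketch under the assumption that Python's `re` is an acceptable stand-in for the target engine; `strip_comments` is a name chosen for this example.

```python
import re

# Group 1 = comments (dropped), group 2 = strings and other text (kept).
STRIPPER = re.compile(
    r'''(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)'''
    r'''|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)'''
)

def strip_comments(src):
    # Equivalent of "replace with $2": an unmatched group 2 becomes "".
    return STRIPPER.sub(lambda m: m.group(2) or "", src)

source = 'a = 1; /* c1 *//* c2 */\ns = "/* keep */"; // trailing\nb = 2;\n'
print(strip_comments(source))
```

The line break that ended the `//` comment is removed together with the comment, which is what "won't preserve formatting" means in practice.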

Matching free() and malloc() calls with regular expressions

I'm creating a powershell script that parses a file containing C code and detects if it contains calls to free(), malloc() or realloc() functions.
file_one.c
int MethodOne()
{
return 1;
}
int MethodTwo()
{
free();
return 1;
}
file_two.c
int MethodOne()
{
//free();
return 1;
}
int MethodTwo()
{
free();
return 1;
}
check.ps1
$regex = "(^[^/]*free\()|(^[^/]*malloc\()|(^[^/]*realloc\()"
$file_one= "Z:\PATH\file_one.txt"
$file_two= "Z:\PATH\file_two.txt"
$contentOne = Get-Content $file_one -Raw
$contentOne -match $regex
$contentTwo = Get-Content $file_two -Raw
$contentTwo -match $regex
Processing the whole file at once seems to work well with contentOne:
I get True (because of the free() in MethodTwo).
Processing contentTwo is not so lucky: it returns False, although it should
return True because of the free() in MethodTwo.
Can someone help me to write a better regex that works in both cases?
Sure, this is it
Raw:
^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())
Stringed:
"^(?>(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n))|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\b(?:free|malloc|realloc)\\()[\\S\\s](?:(?!\\b(?:free|malloc|realloc)\\()[^/\"'\\\\])*))*(?:(\\bfree\\()|(\\bmalloc\\()|(\\brealloc\\())"
Verbatim:
@"^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:""[^""\\]*(?:\\[\S\s][^""\\]*)*""|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/""'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())"
Explained
^
(?>
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: \r? \n ) # End // comment
)
| # OR,
(?: # Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| # OR,
(?! # ASSERT: Here, cannot be free / malloc / realloc {}
\b
(?: free | malloc | realloc )
\(
)
[\S\s] # Any char which could start a comment, string, etc..
# (Technically, we're going past a C++ source code error)
(?: # -------------------------
(?! # ASSERT: Here, cannot be free / malloc / realloc {}
\b
(?: free | malloc | realloc )
\(
)
[^/"'\\] # Char which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)* # -------------------------
) # Done Non - comments
)*
(?:
( \b free\( ) # (1), Free()
|
( \b malloc\( ) # (2), Malloc()
|
( \b realloc\( ) # (3), Realloc()
)
Some notes:
This only finds the first one from the beginning of string using ^ anchor.
To find them all, just remove the ^ from the regex.
This works because it matches everything up to what you're looking for.
In this case, what it found is in capture group 1, 2, or 3.
Good Luck !!
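The same idea (consume comments and strings first, so that only real calls reach the capture group) can be sketched in Python. This is a simpler variation and not the regex above: it lists comments and string literals as non-capturing alternatives and captures the call name in group 1, which avoids the atomic group. The name `find_calls` is invented for this example.

```python
import re

# Comments and string/char literals are matched first and ignored;
# only a call outside of them reaches capture group 1.
CALL = re.compile(
    r'''/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*'''
    r'''|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|\b(free|malloc|realloc)\('''
)

def find_calls(src):
    """Return the names of free/malloc/realloc calls found outside comments and strings."""
    return [m.group(1) for m in CALL.finditer(src) if m.group(1)]

file_two = (
    "int MethodOne()\n{\n//free();\nreturn 1;\n}\n"
    "int MethodTwo()\n{\nfree();\nreturn 1;\n}\n"
)
print(find_calls(file_two))
```

An empty result means no call was found, so the truth value of the returned list plays the role of the `-match` result in the PowerShell script.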
What the regex contains:
----------------------------------
* Format Metrics
----------------------------------
Atomic Groups = 1
Cluster Groups = 10
Capture Groups = 3
Assertions = 2
( ? ! = 2
Free Comments = 25
Character Classes = 12
edit
Per request, explanation of the part of the regex that handles
/**/ comments. This -> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
This is a modified unrolled-loop regex that assumes an opening delimiter
of /* and a closing one of */.
Notice that the open and close delimiters share a common character, /, in their
delimiter sequences.
To be able to do this without lookaround assertions, a method is used
to shift the trailing delimiter's asterisk inside the loop.
Using this factoring, all that's needed is to check for a closing /
to complete the delimited sequence.
/\* # Opening delimiter /*
[^*]* # Optionally, consume all non-asterisks
\*+ # This must be 1 or more anchor asterisks, or FAIL.
# This is matched here to align the optional loop below
# because it is looking for the closing /.
(?: # The optional loop part
[^/*] # Specifically a single non-/ character (nor asterisk).
# Since a / will be the next closing delimiter, it must be excluded.
[^*]* # Optional non-asterisks.
# This will accept a / because it is supposed to consume ALL
# opening delimiters as it goes
# and will consider the very next */ as a close.
\*+ # This must be 1 or more anchor asterisks, or FAIL.
)* # Repeat 0 to many times.
/ # Closing delimiter /
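A few edge cases make the behaviour of this unrolled loop easier to see. The Python sketch below only exercises the pattern as written; `first_comment` is a helper name made up for the demonstration.

```python
import re

BLOCK_COMMENT = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')

def first_comment(src):
    """Return the first /* ... */ comment in src, or None if there is none."""
    m = BLOCK_COMMENT.search(src)
    return m.group(0) if m else None

samples = [
    "/**/",                  # empty comment
    "a /* x * y **/ b */",   # inner asterisks, closes at the first */
    "/*/ still open",        # "/*/" is an opener followed by "/", not a full comment
    "/* a */ /* b */",       # stops at the first closer
]
for s in samples:
    print(repr(s), "->", repr(first_comment(s)))
```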

Golang regex replace excluding quoted strings

I'm trying to implement the removeComments function in Golang from this Javascript implementation. I'm hoping to remove any comments from the text. For example:
/* this is comments, and should be removed */
However, "/* this is quoted, so it should not be removed*/"
In the Javascript implementation, quoted matches are not captured in groups, so I can easily filter them out. However, in Golang, it seems it's not easy to tell whether the matched part is captured in a group or not. So how can I implement the same removeComments logic in Golang as in the Javascript version?
BACKGROUND
The correct way to do the task is to match and capture quoted strings (bearing in mind there can be escaped entities inside) and then match the multiline comments.
REGEX IN-CODE DEMO
Here is the code to deal with that:
package main
import (
"fmt"
"regexp"
)
func main() {
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
txt := `random text
/* removable comment */
"but /* never remove this */ one"
more random *text*`
fmt.Println(reg.ReplaceAllString(txt, "$1"))
}
See the Playground demo
EXPLANATION
The regex I suggest is written with the Best Regex Trick Ever concept in mind and consists of 2 alternatives:
("[^"\\]*(?:\\.[^"\\]*)*") - Double quoted string literal regex - Group 1 (see the capturing group formed with the outer pair of unescaped parentheses and later accessible via replacement backreferences) matching double quoted string literals that can contain escaped sequences. This part matches:
" - a leading double quote
[^"\\]* - 0+ characters other than " and \ (as [^...] construct is a negated character class that matches any characters but those defined inside it) (the * is a zero or more occurrences matching quantifier)
(?:\\.[^"\\]*)*" - 0+ sequences (see the last * and the non-capturing group used only to group subpatterns without forming a capture) of an escaped sequence (the \\. matches a literal \ followed with any character) followed with 0+ characters other than " and \
| - or
/\*[^*]*\*+(?:[^/*][^*]*\*+)*/ - multiline comment regex part; it matches without forming a capture group (thus, it is unavailable from the replacement pattern via backreferences) and matches
/ - the / literal slash
\* - the literal asterisk
[^*]* - zero or more characters other than an asterisk
\*+ - 1 or more (the + is a one or more occurrences matching quantifier) asterisks
(?:[^/*][^*]*\*+)* - 0+ sequences (non-capturing, we do not use it later) of any character but a / or * (see [^/*]), followed with 0+ characters other than an asterisk (see [^*]*) and then followed with 1+ asterisks (see \*+).
/ - a literal (trailing, closing) slash.
NOTE: This multiline comment regex is the fastest I have ever tested. Same goes for the double quoted literal regex as "[^"\\]*(?:\\.[^"\\]*)*" is written with the unroll-the-loop technique in mind: no alternations, only character classes with * and + quantifiers are used in a specific order to allow the fastest matching.
NOTES ON PATTERN ENHANCEMENTS
If you plan to extend to matching single quoted literals, there is nothing easier, just add another alternative into the 1st capture group by re-using the double quoted string literal regex and replacing the double quotes with single ones:
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
^-------------------------^
Here is the single- and double-quoted literal supporting regex demo removing the multiline comments
Adding single line comment support is similar: just add a //.*[\r\n]* alternative to the end:
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)
^-----------^
Here is the single- and double-quoted literal supporting regex demo removing the multiline and single-line comments
These do not preserve formatting
Preferred way (produces a NULL if group 1 is not matched)
works in golang playground -
# https://play.golang.org/p/yKtPk5QCQV
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// [^\n]* # Start // comment
(?: \n | $ ) # End // comment
)
|
( # (1 start), Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
) # (1 end)
Alternative way (group 1 is always matched, but could be empty)
works in golang playground -
# https://play.golang.org/p/7FDGZSmMtP
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))?((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)?)
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// [^\n]* # Start // comment
(?: \n | $ ) # End // comment
)?
( # (1 start), Non - comments
(?:
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
)?
) # (1 end)
The Cadillac - Preserves Formatting
(Unfortunately, this can't be done in Golang because Golang's regexp package does not support lookaround assertions)
Posted in case you move to a different regex engine.
# raw: ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
# delimited: /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/
( # (1 start), Comments
(?:
(?: ^ [ \t]* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
[ \t]* \r? \n
(?=
[ \t]*
(?: \r? \n | /\* | // )
)
)?
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: # End // comment
\r? \n
(?= # <- To preserve formatting
[ \t]*
(?: \r? \n | /\* | // )
)
| (?= \r? \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
|
(?: \r? \n | [\S\s] ) # Linebreak or Any other char
[^/"'\\\s]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
I've never read/written anything in Go, so bear with me. Fortunately, I know regex. I did a little research on Go regexes, and it would seem that they lack most modern features (such as references).
Despite that, I've developed a regex that seems to be what you're looking for. I'm assuming that all strings are single line. Here it is:
reg := regexp.MustCompile(`(?m)^([^"\n]*)/\*([^*]+|(\*+[^/]))*\*+/`)
txt := `random text
/* removable comment */
"but /* never remove this */ one"
more random *text*`
fmt.Println(reg.ReplaceAllString(txt, "${1}"))
Variation: The version above will not remove comments that happen after quotation marks. This version will, but it may need to be run multiple times.
reg := regexp.MustCompile(
`(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`
)
txt := `
random text
what /* removable comment */
hi "but /* never remove this */ one" then /*whats here*/ i don't know /*what*/
more random *text*
`
newtxt := reg.ReplaceAllString(txt, "${1}")
fmt.Println(newtxt)
newtxt = reg.ReplaceAllString(newtxt, "${1}")
fmt.Println(newtxt)
Explanation
(?m) means multiline mode. Regex101 gives a nice explanation of this:
The ^ and $ anchors now match at the beginning/end of each line respectively, instead of beginning/end of the entire string.
It needs to be anchored to the beginning of each line (with ^) to ensure a quote hasn't started.
The first regex has this: [^"\n]*. Essentially, it's matching everything that's not " or \n. I've added parentheses because this stuff isn't comments, so it needs to be put back.
The second regex has this: (([^"\n]*|("[^"\n]*"))*). The regex, with this statement can either match [^"\n]* (like the first regex does), or (|) it can match a pair of quotes (and the content between them) with "[^"\n]*". It's repeating so it works when there are more than one quote pair, for example. Note that, like the simpler regex, this non-comment stuff is being captured.
Both regexes use this: /\*([^*]+|(\*+[^/]))*\*+/. It matches /* followed by any amount of either:
[^*]+ Non * chars
or
\*+[^/] One or more *s that are not followed by /.
And then it matches the closing */
During replacement, the ${1} refers to the non-comment things that were captured, so they're reinserted into the string.
Just for fun, another approach: a minimal lexer implemented as a state machine, inspired by and well described in Rob Pike's talk http://cuddle.googlecode.com/hg/talk/lex.html. The code is more verbose but more readable, understandable and hackable than a regexp. Also, it can work with any Reader and Writer, not only strings, so it doesn't need to hold the whole input in RAM and should even be faster.
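package main

import (
	"io"
	"os"
	"strings"
)

// stateFn is one lexer state; it returns the next state, or nil to stop.
type stateFn func(*lexer) stateFn

func run(l *lexer) {
	for state := lexText; state != nil; {
		state = state(l)
	}
}

// lexer reads runes from an io.RuneScanner (a RuneReader that can push one
// rune back) and writes the comment-free text to an io.Writer.
type lexer struct {
	io.RuneScanner
	io.Writer
}

// lexText copies ordinary text until it meets a quote or a comment opener.
func lexText(l *lexer) stateFn {
	for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
		switch r {
		case '"':
			l.Write([]byte(string(r)))
			return lexQuoted
		case '/':
			next, _, err := l.ReadRune()
			if err == nil && next == '*' {
				return lexComment
			}
			l.Write([]byte("/"))
			if err == nil {
				l.UnreadRune() // re-scan it: it may be a quote or another '/'
			}
		default:
			l.Write([]byte(string(r)))
		}
	}
	return nil
}

// lexQuoted copies a double-quoted string verbatim, honoring backslash escapes.
func lexQuoted(l *lexer) stateFn {
	for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
		l.Write([]byte(string(r)))
		switch r {
		case '\\':
			// Copy the escaped rune so that \" does not end the string.
			if esc, _, err := l.ReadRune(); err == nil {
				l.Write([]byte(string(esc)))
			}
		case '"':
			return lexText
		}
	}
	return nil
}

// lexComment discards everything up to and including the closing */.
func lexComment(l *lexer) stateFn {
	prev := rune(0)
	for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
		if prev == '*' && r == '/' { // also closes comments ending in **/
			return lexText
		}
		prev = r
	}
	return nil
}

func main() {
	src := "random text\n/* removable comment */\n\"but /* never remove this */ one\"\n"
	run(&lexer{strings.NewReader(src), os.Stdout})
}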
You can see it works http://play.golang.org/p/HyvEeANs1u
Demo
Play golang demo
(The workings at each stage are output and the end result can be seen by scrolling down.)
Method
A few "tricks" are used to work around Golang's somewhat limited regex syntax:
1. Replace start quotes and end quotes with a unique character. Crucially, the characters used to identify start and end quotes must be different from each other and extremely unlikely to appear in the text being processed.
2. Replace all comment starters (/*) that are not preceded by an unterminated start quote with a unique sequence of one or more characters.
3. Similarly, replace all comment enders (*/) that are not succeeded by an end quote that does not have a start quote before it with a different unique sequence of one or more characters.
4. Remove all remaining /*...*/ comment sequences.
5. Unmask the previously "masked" comment starters/enders by reversing the replacements made in steps 2 and 3 above.
Limitations
The current demo doesn't address the possibility of a double quote appearing within a comment, e.g. /* Not expected: " */. Note: My feeling is this could be handled - just haven't put the effort in yet - so let me know if you think it could be an issue and I'll look into it.
Try this example..
play golang

Regex skip in C++

This is my string:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
# Block3 { }
Block4 {
anything here
}
I am using this regex to get each block name and inside content.
regex e(R"~((\w+)\s+\{([^}]+)\})~", std::regex::optimize);
But this regex gets all inside of description too. There is a “skip” option in PHP that you can use to skip all descriptions.
What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match
But this is C++ and I cannot use this skip method. What should I do to skip all descriptions and just get Block4 in C++ regex?
This regex detects Block1, Block2, Block3 and Block4 but I want to skip Block1, Block2, Block3 and just get Block4 (skip descriptions). How do I have to edit my regex to get just Block4 (everything outside the descriptions)?
Since you requested this long regex, here it is.
This will not handle nested blocks like block{ block{ } };
it would match only up to the first closing brace: block{ block{ }.
Since you specified you are using C++11 as the engine, I didn't use
recursion. This is easily changed to use recursion say if you were to use
PCRE or Perl, or even BOOST::Regex. Let me know if you'd want to see that.
As it is it's flawed, but works for your sample.
Another thing it won't do is parse Preprocessor Directives '#...' because
I forgot the rules for that (thought I did it recently, but can't find a record).
To use it, sit in a while ( regex_search() ) loop looking for a match on
capture group 1, if (m[1].matched) etc. That will be your block.
The rest of the matches are for comments, quotes, or non-comments, unrelated
to the block. These have to be matched to progress the match position.
The regex is long and redundant because there are no function calls (recursion) in the C++11 ECMAScript grammar. Like I said, use boost::regex or something.
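The loop described above (match everything, act only on capture group 1) can be illustrated with a shorter pattern in Python. This is a simplified sketch rather than the regex from this answer: it builds the pattern from named pieces, captures the block name and body separately, and, like the original, does not handle nested blocks or preprocessor lines. It is also not hardened against pathological input such as an unterminated block. All names are illustrative.

```python
import re

COMMENT = r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*'
DQ = r'"[^"\\]*(?:\\[\S\s][^"\\]*)*"'
SQ = r"'[^'\\]*(?:\\[\S\s][^'\\]*)*'"
# Inside a block: a comment, a string, or any single char that is not } or a quote.
BODY = '(?:' + COMMENT + '|' + DQ + '|' + SQ + r'''|[^}"'])*'''
# Comments and strings are consumed first; group 1 = block name, group 2 = body.
TOKEN = re.compile(COMMENT + '|' + DQ + '|' + SQ + r'|(\w+)\s*\{(' + BODY + r')\}')

def blocks(src):
    """Return (name, body) for every block found outside comments and strings."""
    return [(m.group(1), m.group(2))
            for m in TOKEN.finditer(src) if m.group(1) is not None]

sample = (
    "/*\nBlock1 {\nanythinghere\n}\n*/\n"
    "// Block2 { }\n"
    "Block4 {\n// CommentedBlock{ asdfasdf }\nanyth\"}\"ing here\n}\n"
    "Block5 {\n/* CommentedBlock{ asdfasdf }\nanyth}\"ing here\n*/\n}\n"
)
print([name for name, _ in blocks(sample)])
```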
Benchmark
Sample:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
Block4 {
// CommentedBlock{ asdfasdf }
anyth"}"ing here
}
Block5 {
/* CommentedBlock{ asdfasdf }
anyth}"ing here
*/
}
Results:
Regex1: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 8
Elapsed Time: 1.95 s, 1947.26 ms, 1947261 µs
Regex Explained:
# Raw: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
# Stringed: "(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})|[\\S\\s](?:(?!\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})[^/\"'\\\\])*)"
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
)
| # OR,
(?: # Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
|
( # (1 start), BLOCK
\w+ \s* \{
####################
(?: # ------------------------
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ------------------------
#####################
\}
) # (1 end), BLOCK
| # OR,
[\S\s] # Any other char
(?: # -------------------------
(?! # ASSERT: Here, cannot be a BLOCK{ }
\w+ \s* \{
(?: # ==============================
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ==============================
\}
) # ASSERT End
[^/"'\\] # Char which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)* # -------------------------
) # Done Non - comments
TL;DR: Regular expressions cannot be used to parse full-blown computer languages. What you want to do cannot be done with regular expressions. You need to develop a mini-C++ parser to filter out comments. The answer to this related question might point you in the right direction.
Regexes can be used to process regular languages, but computer languages such as C++, PHP, Java, C#, HTML, etc. have a more complex syntax that includes a property named "middle recursion". Middle recursion includes complications such as an arbitrary number of matching parentheses, begin/end quotes, and comments that can contain symbols.
If you want to understand this in more detail, read the answers to this question about the difference between regular expressions and context free grammars. If you are really curious, enroll in a Formal Language Theory class.
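To make the nesting point concrete: matching a brace with its partner needs a counter, which is exactly what a classic regular expression lacks. The Python sketch below is a minimal illustration with a hypothetical helper, `matching_brace`; it deliberately ignores comments and string literals, which a real parser would also have to track.

```python
def matching_brace(text, open_idx):
    """Return the index of the } that closes the { at open_idx, or -1 if unbalanced."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "{":
            depth += 1          # one level deeper
        elif text[i] == "}":
            depth -= 1          # one level back out
            if depth == 0:
                return i
    return -1

source = "block{ block{ } }"
start = source.index("{")
end = matching_brace(source, start)
print(source[start:end + 1])
```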

Deleting multiple lines from recursive pcregrep search

In the header of every file in my project I have a block of comments that were generated by VCC with revision history for the file. We have moved away from VCC and no longer want to have the revision history in the file since it is obsolete.
I currently have a pcregrep search that returns the exact results I'm looking for:
pcregrep -rM '(\$Rev)(?s)(.+?(?=\*\*))' *
I have tried piping the results into an xargs sed along with many other attempts to remove all the returned lines from the files but I get various errors including "File name too long"
I want to delete the entire block
Since you are talking about C++ files, you can't just find comments;
you have to parse them, because string literals could contain comment
delimiters.
This has been done before, no use re-inventing the wheel.
A simple grep is not going to cut it. You need a simple macro or C# console app
that has better capabilities.
If you want to go this route, below is a regex for you.
Each match will match either group 1 (a comment block) or group 2 (non-comments).
You need to either build a new string by appending the results of each match,
or use a callback function to do the replacement.
Each time it matches group 2, just append that (or return it, if using a callback) unchanged.
When it matches group 1, you have to run another regular expression on the
contents to see if the comment block contains the revision information.
If it does, don't append its contents (or return "" if using a callback).
If it doesn't, just append it unchanged.
So, it's a two-step process.
Pseudo-code:
// Read the source string in from the file.
string strSrc = ReadFile( name );
string strNew = "";
Matcher CmtMatch, RevMatch;
while ( GloballyFind( CommentRegex, strSrc, CmtMatch ) )
{
    if ( CmtMatch.matched(1) )
    {
        // Group 1: a comment block. Drop it if it holds revision info.
        string strComment = CmtMatch.value(1);
        if ( FindFirst( RevisionRegex, strComment, RevMatch ) )
            continue;
        else
            strNew += strComment;
    }
    else
        // Group 2: non-comment text, copied through unchanged.
        strNew += CmtMatch.value(2);
}
// Write out the new string here.
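For reference, the loop above could be sketched in Python roughly as follows. This is only a sketch under some assumptions: it uses the simpler of the regexes given in this answer (the format-preserving one relies on \h, which Python's re module does not support), the function name strip_revision_comments is made up, and the `\$Rev` marker is borrowed from the pcregrep command in the question and may need adjusting for the real headers.

```python
import re

# Simpler comment / non-comment pattern from this answer (no format preserving).
# Group 1: a /* ... */ or // comment.  Group 2: a string literal or other text.
COMMENT_RE = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)

# Assumed marker for a revision-history block (taken from the question's search).
REVISION_RE = re.compile(r"\$Rev")


def strip_revision_comments(src: str) -> str:
    """Rebuild src, leaving out comments that contain the revision marker."""
    out = []
    # Every character belongs to some match, so appending each piece in
    # order reconstructs the whole source.
    for m in COMMENT_RE.finditer(src):
        comment = m.group(1)
        if comment is None:
            out.append(m.group(2))      # non-comment text: copy unchanged
        elif not REVISION_RE.search(comment):
            out.append(comment)         # ordinary comment: keep it
        # otherwise: a revision comment, so append nothing
    return "".join(out)


sample = (
    '/* $Rev: 12 $\n * history\n **/\n'
    'int x = 1; /* keep */\n'
    'char *s = "/* $Rev not a comment */";\n'
)
result = strip_revision_comments(sample)
print(result)
```

The newline that followed the removed block is left behind here, because the simpler pattern makes no attempt to preserve formatting.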
The same could be done via a ReplaceAll() using a callback function, if
using a Macro language. The logic goes in the callback.
It's not as hard as it looks, but if you want to do it right, I'd do it this way.
And then, hey, you've got a nifty utility to be used again.
Here is the regex expanded, formatted and compressed.
(constructed with RegexFormat 6 (Unicode))
# raw: ((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)
# delimited: /((?:(?:^\h*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:\h*\n(?=\h*(?:\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|\/\*|\/\/))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\\s]*)/
# Dbl-quoted: "((?:(?:^\\h*)?(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/(?:\\h*\\n(?=\\h*(?:\\n|/\\*|//)))?|//(?:[^\\\\]|\\\\\\n?)*?(?:\\n(?=\\h*(?:\\n|/\\*|//))|(?=\\n))))+)|(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[\\S\\s][^/\"'\\\\\\s]*)"
# Sing-quoted: '((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\\]|\\\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\\[\S\s]|[^"\\\])*"|\'(?:\\\[\S\s]|[^\'\\\])*\'|[\S\s][^/"\'\\\\\s]*)'
( # (1 start), Comments
(?:
(?: ^ \h* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
\h* \n
(?=
\h*
(?: \n | /\* | // )
)
)?
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
(?: # End // comment
\n
(?= # <- To preserve formatting
\h*
(?: \n | /\* | // )
)
| (?= \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\\s]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
In case you want something simpler:
This is the same regex without multiple comment block capture or format preserving. The same grouping and replacement principle applies.
# Raw: (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
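The callback variant of the replacement could look something like this in Python, with the simpler regex written out in verbose mode. It is a hedged sketch: the function name drop_revision_comments and the default `\$Rev` marker are assumptions for illustration, not part of the original answer.

```python
import re

# The simpler regex from above, laid out with re.VERBOSE.
PATTERN = re.compile(r"""
    (                                   # (1) comments
        /\* [^*]* \*+ (?: [^/*] [^*]* \*+ )* /
      | // (?: [^\\] | \\ \n? )*? \n
    )
  | (                                   # (2) non-comments
        " (?: \\ [\S\s] | [^"\\] )* "
      | ' (?: \\ [\S\s] | [^'\\] )* '
      | [\S\s] [^/"'\\]*
    )
""", re.VERBOSE)


def drop_revision_comments(src, marker=r"\$Rev"):
    """Replace-all with a callback: remove only comments matching marker."""
    marker_re = re.compile(marker)

    def callback(m):
        comment = m.group(1)
        if comment is not None and marker_re.search(comment):
            return ""               # revision comment: replace with nothing
        return m.group(0)           # anything else: return it unchanged

    return PATTERN.sub(callback, src)


sample = '// $Rev: 7 $\nint a = 1; // keep me\nchar q = \'"\'; /* $Rev: 8 $ */\n'
cleaned = drop_revision_comments(sample)
print(cleaned)
```

Because quoted text is consumed by group 2 before a comment can start inside it, the lone double quote in the character literal does not confuse the comment matching.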