In the header of every file in my project I have a block of comments that were generated by VCC with revision history for the file. We have moved away from VCC and no longer want to have the revision history in the file since it is obsolete.
I currently have a pcregrep search that returns the exact results I'm looking for:
pcregrep -rM '(\$Rev)(?s)(.+?(?=\*\*))' *
I have tried piping the results into xargs sed, along with many other attempts to remove all the returned lines from the files, but I get various errors, including "File name too long".
I want to delete the entire block.
Since you are talking about C++ files, you can't just find comments,
you have to parse comments because literal strings could contain comment
delimiters.
This has been done before, no use re-inventing the wheel.
A simple grep is not going to cut it. You need a simple macro or C# console app
that has better capabilities.
If you want to go this route, below is a regex for you.
Each match will either match group 1 (comments block) or group 2 (non-comments).
You need to either build a new string by appending the result of each match,
or use a callback function to do the replacement.
Each time it matches group 2 just append that (or return it if callback) unchanged.
When it matches group 1, you have to run another regular expression on the
contents to see if the comment block contains the Revision information.
If it does contain it, don't append (or return "" if callback) its contents.
If it doesn't, just append it unchanged.
So, it's a two-step process.
Pseudo-code:
// Read the source string in from the file.
string strSrc = ReadFile( name );
string strNew = "";
Matcher CmtMatch, RevMatch;
while ( GloballyFind( CommentRegex, strSrc, CmtMatch ) )
{
    if ( CmtMatch.matched(1) )
    {
        // Group 1: a comment block. Drop it if it holds revision info.
        string strComment = CmtMatch.value(1);
        if ( FindFirst( RevisionRegex, strComment, RevMatch ) )
            continue;
        else
            strNew += strComment;
    }
    else
        strNew += CmtMatch.value(2);   // Group 2: non-comment text, kept unchanged.
}
// Write out the new string here.
The same could be done via a ReplaceAll() using a callback function, if
using a Macro language. The logic goes in the callback.
It's not as hard as it looks, but if you want to do it right I'd do it this way.
And then, hey, you got a nifty utility to be used again.
Here is the regex, compressed and then expanded and formatted.
(constructed with RegexFormat 6 (Unicode))
# raw: ((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)
# delimited: /((?:(?:^\h*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:\h*\n(?=\h*(?:\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|\/\*|\/\/))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\\s]*)/
# Dbl-quoted: "((?:(?:^\\h*)?(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/(?:\\h*\\n(?=\\h*(?:\\n|/\\*|//)))?|//(?:[^\\\\]|\\\\\\n?)*?(?:\\n(?=\\h*(?:\\n|/\\*|//))|(?=\\n))))+)|(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[\\S\\s][^/\"'\\\\\\s]*)"
# Sing-quoted: '((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\\]|\\\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\\[\S\s]|[^"\\\])*"|\'(?:\\\[\S\s]|[^\'\\\])*\'|[\S\s][^/"\'\\\\\s]*)'
( # (1 start), Comments
(?:
(?: ^ \h* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
\h* \n
(?=
\h*
(?: \n | /\* | // )
)
)?
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
(?: # End // comment
\n
(?= # <- To preserve formatting
\h*
(?: \n | /\* | // )
)
| (?= \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\\s]* # Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
In case you want something simpler:
This is the same regex without multiple comment block capture or format preserving. Same grouping and replacement principle applies.
# Raw: (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
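The two-step process described above (match comments or non-comments, then test each comment block for revision info) could be sketched in Python with a replacement callback. This is only a sketch under assumptions: it uses the simpler regex, and the function name `strip_revision_comments` and the `\$Rev` marker pattern are chosen here for illustration, not taken from the original answer.

```python
import re

# Group 1 = comment, group 2 = string literal or other non-comment text.
COMMENT_OR_CODE = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)

# Assumed marker for a revision-history comment (hypothetical, adjust as needed).
REVISION = re.compile(r"\$Rev")


def strip_revision_comments(source):
    """Drop every comment that contains the revision marker; keep the rest."""

    def keep_or_drop(match):
        comment = match.group(1)
        if comment is not None:
            # Step 2: run the second regex on the comment only.
            return "" if REVISION.search(comment) else comment
        return match.group(2)  # non-comment text is returned unchanged

    return COMMENT_OR_CODE.sub(keep_or_drop, source)


if __name__ == "__main__":
    demo = 'const char *s = "/* $Rev: 1 $ */";\n/* $Rev: 42 $\n ** history */\nint x; /* keep */\n'
    print(strip_revision_comments(demo))
```

Because string literals are matched as group 2, a `$Rev` marker inside a quoted string is left alone; only real comments are tested.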
Related
I want to remove (Java/C/C++/..) multiline comments from a file. For this, I have written a regular expression:
/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/
This regular expression works well with Notepad++ and Geany (search and replace all with nothing). The regex behaves differently in VB.NET.
I am using:
Microsoft Visual Studio 2010 (Version 10.0.40219.1 SP1Rel)
Microsoft .NET Framework (4.7.02053 SP1Rel)
The file I'm running replacements on is not that complex. I do not need to take care of any quoted text that might start or end comments.
@sln, thank you for your detailed reply. I'll also quickly explain my regex as nicely as you did!
/\* Find the beginning of the comment.
[^\*]* Match any chars, but not an asterisk.
We need to deal with finding an asterisk now:
(\*+[^\*/][^\*]*)* This regex breaks down to:
\*+ Consume asterisk(s).
[^\*/] Match any other char that is not an asterisk or a / (would end the comment!).
[^\*]* Match any other chars that are not asterisks.
( )* Try to find more asterisks followed by other chars.
\*+/ Match 1 to n asterisks and finish the comment with /.
Here are two code snippets:
First:
text
/*
* block comment
*
*/ /* comment1 */ /* comment2 */
My text to keep.
/* more comments */
more text
Second:
text
/*
* block comment
*
*/ /* comment1 *//* comment2 */
My text to keep.
/* more comments */
more text
The only difference is the space between
/* comment1 *//* comment2 */
Deleting found matches with Notepad++ and Geany works perfectly for both cases. Using regular expressions from VB.NET fails for the second example. The result for the second example after deletion looks like this:
text
more text
But it should look like this:
text
My text to keep.
more text
I am using System.Text.RegularExpressions:
Dim content As String = IO.File.ReadAllText(file_path_)
Dim multiline_comment_remover As Regex = New Regex("/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")
content = multiline_comment_remover.Replace(content, "")
I would like to have the same results with VB.NET as with Notepad++ and Geany. As answered by sln, my regex "should work in a weird way". The question is why does VB.NET fail to process this regex as intended? This question is still open.
Since sln's answer got my code working, I'll accept this answer. Although this doesn't explain why VB.NET doesn't like my regex. Thanks for all your help! I learned a lot!
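As a cross-check, here is a small sketch that applies the same regex with Python's `re` module to the second snippet. The function name `remove_block_comments` is made up for this example, and the sketch says nothing about why the .NET run behaved differently.

```python
import re

# The regex from the question, unchanged.
MULTILINE_COMMENT = re.compile(r"/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")


def remove_block_comments(text):
    """Replace every /* ... */ match with nothing."""
    return MULTILINE_COMMENT.sub("", text)


if __name__ == "__main__":
    second = (
        "text\n"
        "/*\n"
        " * block comment\n"
        " *\n"
        " */ /* comment1 *//* comment2 */\n"
        "My text to keep.\n"
        "/* more comments */\n"
        "more text\n"
    )
    print(remove_block_comments(second))
```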
I think you could use a generalized C++ comment stripper.
It's basically
Globally find with the regex below and replace with $2.
Demo PCRE: https://regex101.com/r/UldYK5/1
Demo Python: https://regex101.com/r/avfSfB/1
# raw: (?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
# delimited: /(?m)((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:\/\*|\/\/))|[^\/"'\\\r\n]*))+|[^\/"'\\\r\n]+)+|[\S\s][^\/"'\\\r\n]*)/
(?m) # Multi-line modifier
( # (1 start), Comments
(?:
(?: ^ [ \t]* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
[ \t]* \r? \n
(?=
[ \t]*
(?: \r? \n | /\* | // )
)
)?
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: # End // comment
\r? \n
(?= # <- To preserve formatting
[ \t]*
(?: \r? \n | /\* | // )
)
| (?= \r? \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
# Quotes
# ======================
(?: # Quote and Non-Comment blocks
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| # --------------
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| # --------------
(?: # Qualified Linebreak's
\r? \n
(?:
(?= # If comment ahead just stop
(?: ^ [ \t]* )?
(?: /\* | // )
)
| # or,
[^/"'\\\r\n]* # Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)
)+
| # --------------
[^/"'\\\r\n]+ # Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)+ # Grab multiple instances
| # or,
# ======================
# Pass through
[\S\s] # Any other char
[^/"'\\\r\n]* # Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end), Non - comments
If you use a particular engine that doesn't support assertions,
then you'd have to use this.
This won't preserve formatting though.
Usage same as above.
# (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
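A minimal Python sketch of the "find globally, replace with $2" usage, built on the assertion-free regex above. The function name `strip_comments` is an assumption for illustration. A callback is used instead of a `\2` template so that an unmatched group 2 simply becomes an empty string.

```python
import re

# Group 1 = comment (dropped), group 2 = everything else (kept).
STRIPPER = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)


def strip_comments(source):
    """Equivalent of replacing each match with $2."""
    return STRIPPER.sub(lambda m: m.group(2) or "", source)


if __name__ == "__main__":
    demo = 'int a = 1; // set a\nchar *u = "http://x"; /* c */\n'
    print(repr(strip_comments(demo)))
```

Note that this variant swallows the newline that ends a `//` comment, which is the formatting loss mentioned above.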
I'm creating a powershell script that parses a file containing C code and detects if it contains calls to free(), malloc() or realloc() functions.
file_one.c
int MethodOne()
{
return 1;
}
int MethodTwo()
{
free();
return 1;
}
file_two.c
int MethodOne()
{
//free();
return 1;
}
int MethodTwo()
{
free();
return 1;
}
check.ps1
$regex = "(^[^/]*free\()|(^[^/]*malloc\()|(^[^/]*realloc\()"
$file_one= "Z:\PATH\file_one.txt"
$file_two= "Z:\PATH\file_two.txt"
$contentOne = Get-Content $file_one -Raw
$contentOne -match $regex
$contentTwo = Get-Content $file_two -Raw
$contentTwo -match $regex
Processing the whole file at once seems to work well with contentOne;
in fact I get True (because of the free() in MethodTwo).
Processing contentTwo is not so lucky: it returns False instead of True
(it should match because of the free() in MethodTwo).
Can someone help me to write a better regex that works in both cases?
Sure, this is it
Raw:
^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())
Stringed:
"^(?>(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n))|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\b(?:free|malloc|realloc)\\()[\\S\\s](?:(?!\\b(?:free|malloc|realloc)\\()[^/\"'\\\\])*))*(?:(\\bfree\\()|(\\bmalloc\\()|(\\brealloc\\())"
Verbatim:
@"^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:""[^""\\]*(?:\\[\S\s][^""\\]*)*""|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/""'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())"
Explained
^
(?>
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: \r? \n ) # End // comment
)
| # OR,
(?: # Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| # OR,
(?! # ASSERT: Here, cannot be free / malloc / realloc {}
\b
(?: free | malloc | realloc )
\(
)
[\S\s] # Any char which could start a comment, string, etc..
# (Technically, we're going past a C++ source code error)
(?: # -------------------------
(?! # ASSERT: Here, cannot be free / malloc / realloc {}
\b
(?: free | malloc | realloc )
\(
)
[^/"'\\] # Char which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)* # -------------------------
) # Done Non - comments
)*
(?:
( \b free\( ) # (1), Free()
|
( \b malloc\( ) # (2), Malloc()
|
( \b realloc\( ) # (3), Realloc()
)
Some notes:
This only finds the first one from the beginning of string using ^ anchor.
To find them all, just remove the ^ from the regex.
This works because it matches everything up to what you're looking for.
In this case, what it found is in capture group 1, 2, or 3.
Good Luck !!
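For comparison, here is a Python sketch of the same idea using a simpler scan rather than the atomic-group regex above. It is a swapped-in technique, not the answer's regex: comments and strings are matched whole, and `finditer` skips any other character on its own, so a call name can only match in live code. The names `SCANNER` and `find_alloc_calls` are assumptions, as is the `\Z` alternative that lets a `//` comment end at end of input.

```python
import re

# Alternatives are tried in order at each position: comments and strings are
# consumed whole, so group 1 can only match outside of them.
SCANNER = re.compile(
    r"""
      /\*[^*]*\*+(?:[^/*][^*]*\*+)*/            # block comment
    | //(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n|\Z)    # line comment, up to newline or end
    | "[^"\\]*(?:\\[\S\s][^"\\]*)*"             # double-quoted text
    | '[^'\\]*(?:\\[\S\s][^'\\]*)*'             # single-quoted text
    | \b(free|malloc|realloc)\(                 # (1) the call being looked for
    """,
    re.VERBOSE,
)


def find_alloc_calls(source):
    """Return the names of free/malloc/realloc calls found in live code."""
    return [m.group(1) for m in SCANNER.finditer(source) if m.group(1)]


if __name__ == "__main__":
    file_two = (
        "int MethodOne()\n{\n//free();\nreturn 1;\n}\n"
        "int MethodTwo()\n{\nfree();\nreturn 1;\n}\n"
    )
    print(find_alloc_calls(file_two))
```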
What the regex contains:
----------------------------------
* Format Metrics
----------------------------------
Atomic Groups = 1
Cluster Groups = 10
Capture Groups = 3
Assertions = 2
( ? ! = 2
Free Comments = 25
Character Classes = 12
edit
Per request, explanation of the part of the regex that handles
/**/ comments. This -> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
This is a modified unrolled-loop regex that assumes an opening delimiter
of /* and a closing one of */.
Notice that the open and close delimiters share a common character, /, in their
sequences.
To be able to do this without lookaround assertions, a method is used
to shift the trailing delimiter's asterisk inside the loop.
Using this factoring, all that's needed is to check for a closing /
to complete the delimited sequence.
/\* # Opening delimiter /*
[^*]* # Optionally, consume all non-asterisks
\*+ # This must be 1 or more asterisks anchor's or FAIL.
# This is matched here to align the optional loop below
# because it is looking for the closing /.
(?: # The optional loop part
[^/*] # Specifically a single non / character (nor asterisk).
# Since a / will be the next closing delimiter, it must be excluded.
[^*]* # Optional non-asterisks.
# This will accept a / because it is supposed to consume ALL
# opening delimiter's as it goes
# and will consider the very next */ as a close.
\*+ # This must be 1 or more asterisks anchor's or FAIL.
)* # Repeat 0 to many times.
/ # Closing delimiter /
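To see the unrolled loop behave as described, here is a short Python sketch that tries the pattern on a few edge cases. The helper name `first_block_comment` and the sample strings are made up for this illustration.

```python
import re

BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")


def first_block_comment(text):
    """Return the comment matched at the start of text, or None."""
    match = BLOCK_COMMENT.match(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    samples = [
        "/**/",               # empty comment
        "/***/",              # run of asterisks before the closing /
        "/* a * b */",        # lone asterisk inside the body
        "/* a /* b */ c */",  # an inner /* is consumed; the first */ closes
        "/*/",                # not a complete comment
        "/* unterminated",    # no closing delimiter
    ]
    for text in samples:
        print(repr(text), "->", repr(first_block_comment(text)))
```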
This is my string:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
# Block3 { }
Block4 {
anything here
}
I am using this regex to get each block name and inside content.
regex e(R"~((\w+)\s+\{([^}]+)\})~", std::regex::optimize);
But this regex also matches the blocks inside the descriptions (comments). There is a “skip” option in PHP that you can use to skip all descriptions.
What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match
But this is C++ and I cannot use this skip method. What should I do to skip all descriptions and just get Block4 in C++ regex?
This regex detects Block1, Block2, Block3 and Block4 but I want to skip Block1, Block2, Block3 and just get Block4 (skip descriptions). How do I have to edit my regex to get just Block4 (everything outside the descriptions)?
Since you requested this long regex, here it is.
This will not handle nested blocks like block{ block{ } };
there it would match block{ block{ } only (up to the first closing brace).
Since you specified you are using C++11 as the engine, I didn't use
recursion. This is easily changed to use recursion say if you were to use
PCRE or Perl, or even BOOST::Regex. Let me know if you'd want to see that.
As it is it's flawed, but works for your sample.
Another thing it won't do is parse Preprocessor Directives '#...' because
I forgot the rules for that (thought I did it recently, but can't find a record).
To use it, sit in a while ( regex_search() ) loop looking for a match on
capture group 1, if (m[1].success) etc.. That will be your block.
The rest of the matches are for comments, quotes, or non-comments, unrelated
to the block. These have to be matched to progress the match position.
The code is long and redundant because there are no function calls (recursion) in the C++11 ECMAScript grammar. Like I said, use boost::regex or something.
Benchmark
Sample:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
Block4 {
// CommentedBlock{ asdfasdf }
anyth"}"ing here
}
Block5 {
/* CommentedBlock{ asdfasdf }
anyth}"ing here
*/
}
Results:
Regex1: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 8
Elapsed Time: 1.95 s, 1947.26 ms, 1947261 µs
Regex Explained:
# Raw: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
# Stringed: "(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})|[\\S\\s](?:(?!\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})[^/\"'\\\\])*)"
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
)
| # OR,
(?: # Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
|
( # (1 start), BLOCK
\w+ \s* \{
####################
(?: # ------------------------
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ------------------------
#####################
\}
) # (1 end), BLOCK
| # OR,
[\S\s] # Any other char
(?: # -------------------------
(?! # ASSERT: Here, cannot be a BLOCK{ }
\w+ \s* \{
(?: # ==============================
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ==============================
\}
) # ASSERT End
[^/"'\\] # Char which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)* # -------------------------
) # Done Non - comments
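The find loop described above (keep only matches where capture group 1 is set) can be sketched in Python. This is a simplified stand-in, not Regex1 itself: it drops the pass-through alternative and relies on `finditer` skipping unmatched characters. The names `SCANNER` and `find_blocks` are assumptions, and like the original it does not handle nested blocks or preprocessor lines.

```python
import re

BLOCK_COMMENT = r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/"
LINE_COMMENT = r"//(?:[^\\]|\\\n?)*?\n"
DQ_STRING = r'"[^"\\]*(?:\\[\S\s][^"\\]*)*"'
SQ_STRING = r"'[^'\\]*(?:\\[\S\s][^'\\]*)*'"
SKIP = "|".join([BLOCK_COMMENT, LINE_COMMENT, DQ_STRING, SQ_STRING])

# (1) name { ... }: comments and strings inside the block are consumed whole,
# so a } inside them does not end the block.
BLOCK = r"(\w+\s*\{(?:" + SKIP + r"|(?!\})[\S\s][^}/\"'\\]*)*\})"

SCANNER = re.compile(SKIP + "|" + BLOCK)


def find_blocks(source):
    """Return every block found outside comments and strings."""
    return [m.group(1) for m in SCANNER.finditer(source) if m.group(1)]


SAMPLE = '''/*
Block1 {
anythinghere
}
*/
// Block2 { }
Block4 {
// CommentedBlock{ asdfasdf }
anyth"}"ing here
}
Block5 {
/* CommentedBlock{ asdfasdf }
anyth}"ing here
*/
}
'''

if __name__ == "__main__":
    for block in find_blocks(SAMPLE):
        print(block.split()[0])
```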
TL;DR: Regular expressions cannot be used to parse full-blown computer languages. What you want to do cannot be done with regular expressions. You need to develop a mini C++ parser to filter out comments. The answer to this related question might point you in the right direction.
Regular expressions can process regular languages, but computer languages such as C++, PHP, Java, C#, HTML, etc. have a more complex syntax that includes a property named "middle recursion". Middle recursion includes complications such as an arbitrary number of matching parentheses, begin/end quotes, and comments that can contain symbols.
If you want to understand this in more detail, read the answers to this question about the difference between regular expressions and context free grammars. If you are really curious, enroll in a Formal Language Theory class.
I have a code base of thousands of files and need to grep for headers that have a certain token Q_OBJECT present but not in a comment. This includes single line // comments and multi-line /* ... */ comments.
What is the regex expression for this search?
This should work.
Do a global search. Each match will be one of:
Comments (group 1)
Quoted strings or non-token text (group 2)
Token text (group 3)
You only care whether capture group 3 matched; it contains the token.
# (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?!Q_OBJECT)[\S\s](?:(?!Q_OBJECT)[^/"'\\])*)|(Q_OBJECT)
# '(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\\]|\\\\\n?)*?\n)|("(?:\\\[\S\s]|[^"\\\])*"|\'(?:\\\[\S\s]|[^\'\\\])*\'|(?!Q_OBJECT)[\S\s](?:(?!Q_OBJECT)[^/"\'\\\])*)|(Q_OBJECT)'
# "(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|(?!Q_OBJECT)[\\S\\s](?:(?!Q_OBJECT)[^/\"'\\\\])*)|(Q_OBJECT)"
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
|
(?! Q_OBJECT )
[\S\s] # Any other char, but not these special tokens
# Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
(?: # But not these special tokens
(?! Q_OBJECT )
[^/"'\\]
)*
) # (2 end)
|
( # (3 start), Special Tokens
Q_OBJECT
) # (3 end)
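A minimal Python sketch of the global search described above, checking only whether capture group 3 ever matches. The function name `has_live_token` and the sample headers are assumptions for illustration.

```python
import re

# Group 1 = comments, group 2 = strings / non-token text, group 3 = the token.
TOKEN_SCANNER = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'"""
    r"""|(?!Q_OBJECT)[\S\s](?:(?!Q_OBJECT)[^/"'\\])*)"""
    r"""|(Q_OBJECT)"""
)


def has_live_token(source):
    """True if Q_OBJECT appears outside comments and quoted strings."""
    return any(m.group(3) for m in TOKEN_SCANNER.finditer(source))


if __name__ == "__main__":
    commented = "// Q_OBJECT here\nclass A {\n/* Q_OBJECT */\n};\n"
    live = "class B : public QObject {\n    Q_OBJECT\n};\n"
    print(has_live_token(commented), has_live_token(live))
```

To scan thousands of files, the same function could be called on each file's contents in a loop.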
Optimize?
I am having trouble with the following regular expression:
/^\s*(/?\*{1,2}(\s*(\b.*\b)\s*(\*/)?)?|\*/?\s*)$/g
I am wondering if I can improve this expression? Also, if anyone can find a problem with this expression, could you take note of it. Here is my live demo in action. It works for all the conditions that I have set up below
Test constraints
These Match
/**
/*
*
*/
/** Javadoc */
/* Block */
* Multi-line
/* Single Line */
/** A
/** A */
/* A
/* A */
These Shouldn't
7 * 8
// Regular comment
Results
After replacing the match with: // $3
I successfully converted them, despite some of them having trailing white-space:
//
//
//
// Javadoc
// Block
// Multi-line
// Single Line
// A
// A
// A
// A
Regex explained
/
^ Line start
\s* 0 or more white-space
( Start group 1
/? forward-slash (OPTIONAL)
\*{1,2} 1 to 2 asterisks
( Start group 2
\s* 0 or more white-space
( Start group 3
\b Start word boundary
.* 0 or more of anything
\b End word boundary
) End group 3
\s* 0 or more white-space
( Start group 4 (OPTIONAL)
\* Asterisk
/ Forward-slash
)? End group 4
)? End group 2 (OPTIONAL)
| OR
\* Asterisk
/? Forward-slash (OPTIONAL)
\s* 0 or more white-space
) End group 1
$ Line end
/
g Global; match all
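The breakdown above can be tried out with a short Python sketch that applies the expression one line at a time and substitutes `// ` plus group 3. The helper name `to_line_comment` is an assumption, and a callback stands in for the `// $3` replacement so that an unmatched group 3 becomes an empty string.

```python
import re

# The expression from the question, applied to one line at a time.
COMMENT_LINE = re.compile(r"^\s*(/?\*{1,2}(\s*(\b.*\b)\s*(\*/)?)?|\*/?\s*)$")


def to_line_comment(line):
    """Rewrite a block-comment line as a // comment; leave other lines alone."""
    return COMMENT_LINE.sub(lambda m: "// " + (m.group(3) or ""), line)


if __name__ == "__main__":
    lines = [
        "/**",
        " * Multi-line",
        "*/",
        "/** Javadoc */",
        "/* Single Line */",
        "7 * 8",
        "// Regular comment",
    ]
    for line in lines:
        print(repr(to_line_comment(line)))
```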
Parsing C/C++ style comments is a little more complex when you have to take into account string embedding, escaping, and line continuation.
This also requires a single-character consumption pass-through method (after all else is checked). That being the case, of the many possible match alternations, there is only one item you're interested in: the C-style comment /* */.
So, in the case of a global search and a lot of matches, only one interests you, say a capture group 1 match. Of course, in your case a simple replace won't cut it.
So, you have to sit in a global find (but not replace) loop. Each match will be a part of the source string, and no part is skipped, so each one will be appended to a destination string.
During the loop, when capture group 1 matches (a C-style comment), you can do your not-so-simple substitutions to make it a C++ comment, then append that result to the destination string.
That's the gist of it. If this can't be done in whatever language you use, then it can't be done right, that is for sure!
Here is a list of regexes to perform the conversion to // once you've captured a C-style comment. They have to be performed in the order of appearance (the notation is Perl, for example only):
# s{ ^ /\* (?: [^\S\n] | \*)* }{// }xmg;
# s{ ^ (?: [^\S\n] | \*)* \*/ $ }{// }xmg;
# s{ ^ $ }{// }xmg;
# s{ ^ (?: [^\S\n] | \*)+ }{// }xmg;
# s{ (?<![\s*]) (?: [^\S\n] | \*)+ $ }{}xmg;
# s{ (?<![\s*]) (?: [^\S\n] | \*)* \*/ $ }{}xmg;
This is the regex to use in your find loop (as explained above). This regex can be found on any Perl newsgroup; it's part of the FAQ.
# (?:(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|//(?:[^\\]|\\\n?)*?\n)|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
# -----------------------------------------------------------------
(?: # Comments
( # (1 start)
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
) # (1 end)
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
)
|
(?: # Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)
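The find loop described above can be sketched in Python with a replacement callback. This is an illustration under assumptions, not a port of the Perl substitutions: `block_to_line_comments` is a simplified stand-in that strips the delimiters and leading asterisks, and the sketch assumes that no code follows a block comment on the same line (otherwise that code would end up commented out).

```python
import re

# Group 1 = C-style /* ... */ comment; every other match is passed through.
SCANNER = re.compile(
    r"""(?:(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)


def block_to_line_comments(block):
    """Simplified conversion: one // line per line of the block comment."""
    body = block[2:-2]  # drop the opening /* and the closing */
    lines = [line.strip().lstrip("*").strip() for line in body.split("\n")]
    return "\n".join(("// " + text).rstrip() for text in lines)


def convert_comments(source):
    """Append each match unchanged, except group 1, which is converted."""

    def convert(match):
        if match.group(1):
            return block_to_line_comments(match.group(1))
        return match.group(0)

    return SCANNER.sub(convert, source)


if __name__ == "__main__":
    demo = (
        'char *s = "/* not a comment */";\n'
        "/* Block */\n"
        "/*\n * Multi-line\n */\n"
        "int x = 7 * 8; // regular\n"
    )
    print(convert_comments(demo))
```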