I have some C++ code with /* */ and // style comments, and I want a way to remove them all automatically. At first glance, an editor (e.g. UltraEdit) with a regexp search for /*, */ and // should do the job. On closer inspection, though, a complete solution isn't that simple, because the sequences /* or // do not start a comment when they appear inside another comment, a string literal or a character literal. For example,
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
is a comment sequence inside a string literal, and it isn't a simple task to determine whether some text is inside a pair of valid double quotes. Whereas this
// this is a single line comment /* <--- this does not start a comment block
// this is a second comment line with an */ within
contains comment sequences inside other comments.
Is there a more comprehensive way to remove comments from C++ source that takes string literals and comments into account? For example, can we instruct the preprocessor to remove comments without carrying out, say, #include directives?
The C pre-processor can remove the comments.
Edited:
I have updated the answer so that the predefined macros are available and the #if statements are evaluated correctly.
> cat t.cpp
/*
* Normal comment
*/
// this is a single line comment /* <--- this does not start a comment block
// this is a second comment line with an */ within
#include <stdio.h>
#if __SIZEOF_LONG__ == 4
int bits = 32;
#else
int bits = 16;
#endif
int main()
{
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
/*
* comment with a single // line comment enbedded.
*/
int x;
// A single line comment /* Normal enbedded */ Comment
}
Because we want the #if statements to expand correctly we need a list of defines.
That's relatively trivial. cpp -E -dM.
Then we pipe the #defines and the original file back through the pre-processor but prevent the includes from being expanded this time.
> cpp -E -dM t.cpp > /tmp/def
> cat /tmp/def t.cpp | sed -e s/^#inc/-#inc/ | cpp - | sed s/^-#inc/#inc/
# 1 "t.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "t.cpp"
#include <stdio.h>
int bits = 32;
int main()
{
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
int x;
}
Our SD C++ Formatter has an option to pretty print the source text and remove all comments. It uses our full C++ front end to parse the text, so it is not confused by whitespace, line breaks, string literals or preprocessor issues, nor will it break the code by its formatting changes.
If you are removing comments, you may be trying to obfuscate the source code. The Formatter also comes in an obfuscating version.
Thanks to Martin York's idea, I found that in Visual Studio the solution looks very simple (subject to further testing). Just rename ALL preprocessor directives to something else (something that is not valid C++ syntax is fine) and run cl.exe with /P:
cl target.cpp /P
This produces target.i, which contains the source minus the comments. Rename the directives back and there you go. You will probably need to remove the #line directives generated by cl.exe.
This works because, according to MSDN, the phases of translation are:
Character mapping
Characters in the source file are mapped to the internal source representation. Trigraph sequences are converted to single-character internal representation in this phase.
Line splicing
All lines ending in a backslash (\) and immediately followed by a newline character are joined with the next line in the source file forming logical lines from the physical lines. Unless it is empty, a source file must end in a newline character that is not preceded by a backslash.
Tokenization
The source file is broken into preprocessing tokens and white-space characters. Comments in the source file are replaced with one space character each. Newline characters are retained.
Preprocessing
Preprocessing directives are executed and macros are expanded into the source file. The #include statement invokes translation starting with the preceding three translation steps on any included text.
Character-set mapping
All source character set members and escape sequences are converted to their equivalents in the execution character set. For Microsoft C and C++, both the source and the execution character sets are ASCII.
String concatenation
All adjacent string and wide-string literals are concatenated. For example, "String " "concatenation" becomes "String concatenation".
Translation
All tokens are analyzed syntactically and semantically; these tokens are converted into object code.
Linkage
All external references are resolved to create an executable program or a dynamic-link library
Comments are removed during Tokenization, before the Preprocessing phase. So just make sure that nothing is available for processing during the preprocessing phase (by disguising all the directives); the output is then just what the first three phases produce.
As for the user-defined .h files, use the /FI option to include them manually. The resulting .i file will be a combination of the .cpp and the .h files, without comments. Each piece is preceded by a #line with the proper filename, so it is easy to split them up in an editor. If we don't want to split them up manually, we probably need the macro/scripting facility of some editor to do it automatically.
So now we don't have to care about any of the preprocessor directives. Even better, the line continuation character (backslash) is handled.
e.g.
// vc8.cpp : Defines the entry point for the console application.
//
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
/* comment here */
whatever error line is ok
-#else
some error line if NOERR not defined
// comment here
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
pr();
return 0;
}
/*comment*/
void pr() {
printf(" /* "); /* comment inside string " */
// comment terminated by \
continue a comment line
printf(" "); /** " " string inside comment */
printf/* this is valid comment within line continuation */\
("some weird lines \
with line continuation");
}
After cl.exe vc8.cpp /P, it becomes this, and can then be fed to cl.exe again after restoring the directives (and removing the #line)
#line 1 "vc8.cpp"
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
whatever error line is ok
-#else
some error line if NOERR not defined
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
pr();
return 0;
}
void pr() {
printf(" /* ");
printf(" ");
printf\
("some weird lines \
with line continuation");
}
You can use a rule-based parser (e.g. boost::spirit) to write syntax rules for comments. You will need to decide whether to process nested comments or not depending on your compiler. Semantic actions removing comments should be pretty straightforward.
Regexes are not meant to parse languages; attempting it is frustrating at best.
You actually need a full-blown parser for this. You might wish to consider Clang: rewriting is an explicit goal of the Clang library suite, and there are existing rewriters that you could draw inspiration from.
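#include <fstream>
#include <iostream>

// Copies input.txt to output.txt without // and /* */ comments.
// String and character literals are copied untouched. Not handled: raw
// string literals, digit separators (1'000), trigraphs, line continuations.
int main() {
    std::ifstream fin("input.txt");
    std::ofstream fout("output.txt");
    if (!fin || !fout) {
        std::cerr << "cannot open input.txt or output.txt\n";
        return 1;
    }
    enum State { Code, Slash, LineComment, BlockComment, BlockStar, String, Char };
    State state = Code;
    bool escaped = false;  // previous character in a literal was a backslash
    char ch;
    while (fin.get(ch)) {  // test the read itself rather than eof()
        switch (state) {
        case Slash:  // a '/' is pending: comment start or plain division?
            if (ch == '/') { state = LineComment; break; }
            if (ch == '*') { state = BlockComment; break; }
            fout << '/';
            state = Code;
            // not a comment: fall through and treat ch as ordinary code
        case Code:
            if (ch == '/') { state = Slash; break; }
            if (ch == '"') state = String;
            else if (ch == '\'') state = Char;
            fout << ch;
            break;
        case LineComment:  // drop everything up to the newline, keep the newline
            if (ch == '\n') { fout << ch; state = Code; }
            break;
        case BlockComment:
            if (ch == '*') state = BlockStar;
            break;
        case BlockStar:  // saw '*' inside a block comment
            if (ch == '/') { fout << ' '; state = Code; }  // comment becomes one space
            else if (ch != '*') state = BlockComment;
            break;
        case String:
        case Char:
            fout << ch;
            if (escaped) escaped = false;
            else if (ch == '\\') escaped = true;
            else if (ch == (state == String ? '"' : '\'')) state = Code;
            break;
        }
    }
    if (state == Slash) fout << '/';  // file ended right after a lone '/'
    return 0;
}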
Related
Given the following C++ program:
#define SOME_MACRO \
(void) x; /* some comment in macro */ \
int main()
{
int x = 0;
/* some comment in main */
SOME_MACRO
SOME_MACRO
return 0;
}
I would like libclang to call me back on the comments expanded in SOME_MACRO.
I tried to register a comment handler and set up the preprocessor output options as follows:
struct CommentPrinter : clang::CommentHandler {
bool HandleComment(Preprocessor & pp, SourceRange comment) {
llvm::outs() << "new comment : \n";
comment.dump(pp.getSourceManager());
return false;
}
};
struct frontend_t : clang::ASTFrontendAction
{
std::unique_ptr<clang::ASTConsumer>
CreateASTConsumer(clang::CompilerInstance& CI,
clang::StringRef source_file) override
{
CI.getPreprocessorOutputOpts().ShowComments = 1;
CI.getPreprocessorOutputOpts().ShowMacroComments = 1;
CI.getPreprocessor().addCommentHandler(new CommentPrinter);
return std::make_unique<ast_consumer_t>();
}
};
But my comment handler was called only on the comments on lines 2 and 7. Do you know if it is possible to also be called on the comments on lines 8 and 9?
I don't think you can.
The C++ standard (link to draft) defines how files are processed in "translation phases". By the time the macro processor can get to work on your #define and macro expansions, the comments have already been removed:
phase 3:
The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments).
A source file shall not end in a partial preprocessing token or in a partial comment.
Each comment is replaced by one space character.
phase 4:
Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed.
[...]
All preprocessing directives are then deleted.
I am trying to write a macro like the following:
taken from link
I applied the same rule in my software without success.
I notice some differences between C and C++, but I don't understand why, since macros are the preprocessor's job!
I also notice a difference when passing enumerator values to the macro.
#include <stdio.h>
#define CONCAT(string) "start"string"end"
int main(void)
{
printf(CONCAT("-hello-"));
return 0;
}
The link above points to a demo on ideone, which allows the code to be tried online in different languages.
C is OK, but after changing to C++ it doesn't work.
It also doesn't work in my IDE, Visual Studio Code (MinGW C++).
My final goal is to write a macro that parametrizes the printf() function for a virtual console application using some escape codes. I tried adding # to the macro concatenation and it seems to work, but when I pass an enumerator to the macro I get an unexpected result. The code is:
#include <stdio.h>
#define def_BLACK_TXT 30
#define def_Light_green_bck 102
#define CSI "\e["
#define concat_csi(a, b) CSI #a ";" #b "m"
#define setTextAndBackgColor(tc, bc) printf(concat_csi(bc, tc))
enum VtColors { RESET_COLOR = 0, BLACK_TXT = 30, Light_green_bck = 102 };
int main(void){
setTextAndBackgColor(30, 102);
printf("\r\n");
setTextAndBackgColor(def_BLACK_TXT , def_Light_green_bck );
printf("\r\n");
setTextAndBackgColor(VtColors::BLACK_TXT , VtColors::Light_green_bck );
printf("\r\n");
printf("\e[102;30m");// <<--- this is the expected result of macro expansion
}
// and the output is (in line 3 the preprocessor does not seem to resolve the enum; the other lines are OK):
[102;30m
[102;30m
[VtColors::Light_green_bck;VtColors::BLACK_TXTm
[102;30m
Obviously I want to use enumerators as parameters (or I will change to #define).
But I'm curious to understand why this happens, and why the preprocessor behaves differently between C and C++.
If anyone knows the solution, many thanks.
There appears to be some compiler disagreement here.
MSVC compiles it as C++ without any issues.
gcc produces a compilation error.
The compilation error references a C++ feature called "user-defined literals", where the syntax "something"suffix gets parsed as a user-defined literal (presuming that this user-defined literal gets properly declared).
Since the preprocessor phase should be happening before the compilation phase, I conclude that the compilation error is a compiler bug.
Note that adding some whitespace produces the same result whether it gets compiled as C or C++ (and makes gcc happy):
#define CONCAT(string) "start" string "end"
EDIT: as of C++11, user-defined literals are considered to be distinct tokens:
Phase 3
The source file is decomposed into comments, sequences of whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), and preprocessing tokens, which are the following:
a) header names such as <iostream> or "myfile.h"
b) identifiers
c) preprocessing numbers
d) character and string literals, including user-defined (since C++11)
emphasis mine.
This occurs before phase 4: preprocessor execution, so a compilation error here is the correct result. "start"string, with no intervening whitespace, gets parsed as a user-defined literal, before the preprocessor phase.
To summarize, the behavior is the following (see the comments in the code):
#include <stdio.h>
#define CONCAT_1(string) "start"#string"end"
#define CONCAT_2(string) "start"string"end"
#define CONCAT_3(string) "start" string "end"
int main(void)
{
printf(CONCAT_1("-hello-")); // wrong insert double quote
printf("\r\n");
printf(CONCAT_1(-hello-)); // OK but is without quote
printf("\r\n");
#if false
printf(CONCAT_2("-hello-")); // compiler error
printf("\r\n");
#endif
printf(CONCAT_3("-hello-")); // OK
printf("\r\n");
printf("start" "-hello-" "end"); // OK
printf("\r\n");
printf("start""-hello-""end"); // OK
printf("\r\n");
return 0;
}
output:
start"-hello-"end <<<--- wrong insert double quote
start-hello-end
start-hello-end
start-hello-end
start-hello-end
I have a definition which includes a path (with no escape sequence) like this one:
// Incorrect
#define PATH "c:\blah\blah\file.cfg"
I would rather like it as this:
// Corrected
#define PATH "c:\\blah\\blah\\file.cfg"
Though unfortunately I can not modify the macro definition (actually the script that generates the source that includes the macro...), except for adding prefixes. Now I need to open the file given in this path. I tried c++11 raw string literals like this:
// Modified definition
#define PATH R"c:\blah\blah\file.cfg"
std::ifstream(PATH); // error: unrecognised escape sequence
Now the question is how to replace all \ using a macro?
Notes (if it matters):
Compiler: MSVC 14.0
OS: Windows 7
The syntax for raw string that you generated is NOT correct.
Here's the correct one:
#define PATH R"(c:\blah\blah\file.cfg)"
Check the (6) syntax format at CPP reference:
prefix(optional) R "delimiter( raw_characters )delimiter" (6)
See: string literal
Example: http://ideone.com/OZggmK
You could make use of the preprocessor's stringify-operator #, which does not only encapsulate the parameter in double quotes but also escapes "ordinary" backslashes in the string. Then - at runtime - cut off the extra double quotes introduced by the stringify.
So the idea is the following:
somehow stringify PATH such that "c:\blah\blah\file.cfg" becomes
"\"c:\\blah\\blah\\file.cfg\"". Note that the string itself
contains double quotes as the first and the last character then.
at runtime, use substr to cut the value between the (unwanted)
double quotes
A bit tricky is to stringify a value that is itself provided as a macro. To do that, you can use a macro with variadic arguments (as these get expanded).
So the complete code would look as follows:
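#include <cstring>   // strlen
#include <iostream>
#include <string>
using namespace std;
#define PATH "c:\blah\blah\file.cfg"
#define STRINGIFY_HELPER(A) #A
#define STRINGIFY(...) STRINGIFY_HELPER(__VA_ARGS__)
// Stringify PATH (which escapes the backslashes), then strip the surrounding quotes
#define NORMALIZEPATH(P) string(STRINGIFY(P)).substr(1,strlen(STRINGIFY(P))-2)
int main() {
string filename = NORMALIZEPATH(PATH);
cout << "filename: " << filename << endl;
return 0;
}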
Output:
filename: c:\blah\blah\file.cfg
What does this line mean? Especially, what does ## mean?
#define ANALYZE(variable, flag) ((Something.##variable) & (flag))
Edit:
A little bit confused still. What will the result be without ##?
Usually you won't notice any difference. But there is a difference. Suppose that Something is of type:
struct X { int x; };
X Something;
And look at:
int X::*p = &X::x;
ANALYZE(x, flag)
ANALYZE(*p, flag)
Without token concatenation operator ##, it expands to:
#define ANALYZE(variable, flag) ((Something.variable) & (flag))
((Something. x) & (flag))
((Something. *p) & (flag)) // . and * are not concatenated to one token. syntax error!
With token concatenation it expands to:
#define ANALYZE(variable, flag) ((Something.##variable) & (flag))
((Something.x) & (flag))
((Something.*p) & (flag)) // .* is a newly generated token, now it works!
It's important to remember that the preprocessor operates on preprocessor tokens, not on text. So if you want to concatenate two tokens, you must explicitly say it.
## is the token concatenation (token pasting) operator; it joins two tokens in a macro's replacement list.
See this:
Macro Concatenation with the ## Operator
One very important part is that this token concatenation follows some very special rules:
e.g. IBM doc:
Concatenation takes place before any
macros in arguments are expanded.
If the result of a concatenation is a
valid macro name, it is available for
further replacement even if it
appears in a context in which it
would not normally be available.
If more than one ## operator and/or #
operator appears in the replacement
list of a macro definition, the order
of evaluation of the operators is not
defined.
The examples are also self-explanatory:
#define ArgArg(x, y) x##y
#define ArgText(x) x##TEXT
#define TextArg(x) TEXT##x
#define TextText TEXT##text
#define Jitter 1
#define bug 2
#define Jitterbug 3
With output:
ArgArg(lady, bug) "ladybug"
ArgText(con) "conTEXT"
TextArg(book) "TEXTbook"
TextText "TEXTtext"
ArgArg(Jitter, bug) 3
Source is the IBM documentation. May vary with other compilers.
To your line:
It concatenates the variable argument to "Something." and addresses a member which is then bitwise ANDed with flag, so the result tells whether Something.variable has the flag set.
So here is an example for my last comment and your question (compilable with g++):
// this one fails with a compiler error
// #define ANALYZE1(variable, flag) ((Something.##variable) & (flag))
// this one will address Something.a (struct)
#define ANALYZE2(variable, flag) ((Something.variable) & (flag))
// this one will be Somethinga (global)
#define ANALYZE3(variable, flag) ((Something##variable) & (flag))
#include <iostream>
using namespace std;
struct something{
int a;
};
int Somethinga = 0;
int main()
{
something Something;
Something.a = 1;
if (ANALYZE2(a,1))
cout << "Something.a is 1" << endl;
if (!ANALYZE3(a,1))
cout << "Somethinga is 0" << endl;
return 0;
}
This is not an answer to your question, just a CW post with some tips to help you explore the preprocessor yourself.
The preprocessing step is actually performed prior to any actual code being compiled. In other words, when the compiler starts building your code, no #define statements or anything like that is left.
A good way to understand what the preprocessor does to your code is to get hold of the preprocessed output and look at it.
This is how to do it for Windows:
Create a simple file called test.cpp and put it in a folder, say c:\temp.
Mine looks like this:
#define dog_suffix( variable_name ) variable_name##dog
int main()
{
int dog_suffix( my_int ) = 0;
char dog_suffix( my_char ) = 'a';
return 0;
}
Not very useful, but simple. Open the Visual studio command prompt, navigate to the folder and run the following commandline:
c:\temp>cl test.cpp /P
So it's the compiler you're running (cl.exe) with your file, and the /P option tells the compiler to store the preprocessed output in a file.
Now in the folder next to test.cpp you'll find test.i, which for me looks like this:
#line 1 "test.cpp"
int main()
{
int my_intdog = 0;
char my_chardog = 'a';
return 0;
}
As you can see, no #define left, only the code it expanded into.
According to Wikipedia
Token concatenation, also called token pasting, is one of the most subtle — and easy to abuse — features of the C macro preprocessor. Two arguments can be 'glued' together using ## preprocessor operator; this allows two tokens to be concatenated in the preprocessed code. This can be used to construct elaborate macros which act like a crude version of C++ templates.
Check Token Concatenation
Let's consider a different example:
#define MYMACRO(x,y) x##y
Without the ##, the preprocessor would see xy as a single identifier rather than the parameters x and y, so it could not substitute them.
In your example,
#define ANALYZE(variable, flag) ((Something.##variable) & (flag))
## is simply not needed as you are not making any new identifier. In fact, compiler issues "error: pasting "." and "variable" does not give a valid preprocessing token"
Can a multi-line raw string literal be an argument of a preprocessor macro?
#define IDENTITY(x) x
int main()
{
IDENTITY(R"(
)");
}
This code compiles with neither g++ 4.7.2 nor VC++11 (Nov. CTP).
Is it a compiler (lexer) bug?
Multi-line macro invocations are legal; since you are using a raw string literal, it should have compiled.
There is a known GCC bug for this:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52852
If you had been using regular (nonraw) strings it would have been illegal.
This should have compiled:
printf(R"(HELLO
WORLD\n)");
But not this:
printf("HELLO
WORLD\n");
This should be coded as
printf("HELLO\nWORLD\n");
if a new line is intended between HELLO and WORLD or as
printf("HELLO "
"WORLD\n");
if no intervening new line was intended.
Do you want a new line in your literal? If so then couldn't you use
IDENTITY("(\n)");
The GNU C preprocessor documentation at
http://gcc.gnu.org/onlinedocs/cpp.pdf
states in section 3.3 (Macro Arguments) that
"The invocation of the macro need not be
restricted to a single logical line—it can cross
as many lines in the source file as you wish."