Given the following C++ program:
#define SOME_MACRO \
(void) x; /* some comment in macro */ \

int main()
{
int x = 0;
/* some comment in main */
SOME_MACRO
SOME_MACRO
return 0;
}
I would like libclang to call me back on the comments expanded in SOME_MACRO.
I tried to register a comment handler and set up the preprocessor output options as follows:
struct CommentPrinter : clang::CommentHandler {
bool HandleComment(clang::Preprocessor& pp, clang::SourceRange comment) override {
llvm::outs() << "new comment : \n";
comment.dump(pp.getSourceManager());
return false;
}
};
struct frontend_t : clang::ASTFrontendAction
{
std::unique_ptr<clang::ASTConsumer>
CreateASTConsumer(clang::CompilerInstance& CI,
clang::StringRef source_file) override
{
CI.getPreprocessorOutputOpts().ShowComments = 1;
CI.getPreprocessorOutputOpts().ShowMacroComments = 1;
CI.getPreprocessor().addCommentHandler(new CommentPrinter);
return std::make_unique<ast_consumer_t>();
}
};
But my comment handler was called only for the comments on lines 2 and 7. Do you know if it is possible to also be called for the comments on lines 8 and 9?
I don't think you can.
The C++ standard (link to draft) defines how files are processed in "translation phases". By the time the macro processor can get to work on your #define and macro expansions, the comments have already been removed:
phase 3:
The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments).
A source file shall not end in a partial preprocessing token or in a partial comment.
Each comment is replaced by one space character.
phase 4:
Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed.
[...]
All preprocessing directives are then deleted.
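One way to observe this from inside a program is to stringize a macro argument that contains a comment. This is only a minimal sketch: STRINGIZE and check_comment_is_gone are names made up for illustration, not part of the question's code or of libclang. By the time the # operator runs in phase 4, the comment has already been turned into white space:

```cpp
#include <cassert>
#include <cstring>

// STRINGIZE is a hypothetical helper macro used only for this illustration.
#define STRINGIZE(x) #x

// Phase 3 replaces the comment with a single space, so in phase 4 the
// # operator only sees the tokens `a` and `b` separated by white space.
inline void check_comment_is_gone() {
    const char* text = STRINGIZE(a /* some comment in macro */ b);
    assert(std::strcmp(text, "a b") == 0);
}
```

Calling check_comment_is_gone() from main passes silently, which is consistent with the handler never seeing the comment at the expansion sites.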
I am trying to write a macro like the one below (taken from link). I applied the same approach in my software without success.
I notice some differences between C and C++, but I don't understand why, since macros are the preprocessor's job!
I also notice a difference when passing enumerator values to the macro.
#include <stdio.h>
#define CONCAT(string) "start"string"end"
int main(void)
{
printf(CONCAT("-hello-"));
return 0;
}
The link above points to a demo on ideone, which allows selecting different languages.
It works as C, but after changing to C++ it does not.
It also does not work in my IDE, Visual Studio Code (MinGW C++).
My final goal is to write a macro that parametrizes printf() for a virtual console application using escape codes. I tried adding # to the macro concatenation and it seems to work, but when I pass an enumerator to the macro I get an unexpected result. The code is:
#include <stdio.h>
#define def_BLACK_TXT 30
#define def_Light_green_bck 102
#define CSI "\e["
#define concat_csi(a, b) CSI #a ";" #b "m"
#define setTextAndBackgColor(tc, bc) printf(concat_csi(bc, tc))
enum VtColors { RESET_COLOR = 0, BLACK_TXT = 30, Light_green_bck = 102 };
int main(void){
setTextAndBackgColor(30, 102);
printf("\r\n");
setTextAndBackgColor(def_BLACK_TXT , def_Light_green_bck );
printf("\r\n");
setTextAndBackgColor(VtColors::BLACK_TXT , VtColors::Light_green_bck );
printf("\r\n");
printf("\e[102;30m");// <<--- this is the expected result of macro expansion
}
The output is below. In line 3 the preprocessor does not seem to resolve the enumerators (the other lines are OK):
[102;30m
[102;30m
[VtColors::Light_green_bck;VtColors::BLACK_TXTm
[102;30m
Obviously I want to use enumerators as parameters (otherwise I will switch to #define).
But I'm curious to understand why this happens, and why the preprocessor behaves differently between C and C++.
If anyone knows the solution, many thanks.
There appears to be some compiler disagreement here.
MSVC compiles it as C++ without any issues.
gcc produces a compilation error.
The compilation error references a C++ feature called "user-defined literals", where the syntax "something"suffix gets parsed as a user-defined literal (presuming that this user-defined literal gets properly declared).
Since the preprocessor phase should be happening before the compilation phase, I conclude that the compilation error is a compiler bug.
Note that adding some whitespace produces the same result whether it gets compiled as C or C++ (and makes gcc happy):
#define CONCAT(string) "start" string "end"
EDIT: as of C++11, user-defined literals are considered to be distinct tokens:
Phase 3
The source file is decomposed into comments, sequences of
whitespace characters (space, horizontal tab, new-line, vertical tab,
and form-feed), and preprocessing tokens, which are the following:
a) header names such as <iostream> or "myfile.h"
b) identifiers
c) preprocessing numbers
d) character and string literals, including user-defined (since C++11)
emphasis mine.
This occurs before phase 4: preprocessor execution, so a compilation error here is the correct result. "start"string, with no intervening whitespace, gets parsed as a user-defined literal, before the preprocessor phase.
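On the enumerator part of the question: # stringizes the spelling of its argument, and the preprocessor never learns enumerator values, because enum declarations are only interpreted after preprocessing. def_BLACK_TXT works because setTextAndBackgColor expands its arguments before handing them to concat_csi; VtColors::BLACK_TXT is not a macro, so there is nothing to expand. One possible workaround, sketched under the assumption that building the sequence at run time is acceptable (set_colors_sequence is a hypothetical name):

```cpp
#include <cassert>
#include <string>

enum VtColors { RESET_COLOR = 0, BLACK_TXT = 30, Light_green_bck = 102 };

// Hypothetical helper: builds "ESC[<background>;<text>m" from enumerator values.
// "\033" is the standard spelling of the escape character ("\e" is an extension).
inline std::string set_colors_sequence(VtColors text, VtColors background) {
    return "\033[" + std::to_string(static_cast<int>(background)) + ";" +
           std::to_string(static_cast<int>(text)) + "m";
}

inline void check_set_colors_sequence() {
    assert(set_colors_sequence(BLACK_TXT, Light_green_bck) == "\033[102;30m");
}
```

It could then be used as printf("%s", set_colors_sequence(VtColors::BLACK_TXT, VtColors::Light_green_bck).c_str());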
To summarize, the behavior is the following (see the comments in the code):
#include <stdio.h>
#define CONCAT_1(string) "start"#string"end"
#define CONCAT_2(string) "start"string"end"
#define CONCAT_3(string) "start" string "end"
int main(void)
{
printf(CONCAT_1("-hello-")); // wrong: the double quotes end up in the output
printf("\r\n");
printf(CONCAT_1(-hello-)); // OK, but the argument is passed without quotes
printf("\r\n");
#if false
printf(CONCAT_2("-hello-")); // compiler error
printf("\r\n");
#endif
printf(CONCAT_3("-hello-")); // OK
printf("\r\n");
printf("start" "-hello-" "end"); // OK
printf("\r\n");
printf("start""-hello-""end"); // OK
printf("\r\n");
return 0;
}
output:
start"-hello-"end <<<--- wrong: double quotes inserted
start-hello-end
start-hello-end
start-hello-end
start-hello-end
Considering the following snippet:
#include <stdio.h>
#define MY_MACRO(\
arg) \
arg
#define MY_MACRO2(t1, t2) t1##t2
#define MY_ a
#define MACRO b
int main() {
printf("%d\n", MY_MACRO2(MY_,MACRO)(45));
return 0;
}
It turns out to compile and display 45, however, if MY_ and MACRO were expanded before substitution, this code should not compile.
I noticed this when I read the following in the C standard:
6.10.3.1 (also present in the C++ standard)
After the arguments for the invocation of a function-like macro have been identified, argument substitution takes place. A parameter in the replacement list, unless preceded by a # or ## preprocessing token or followed by a ## preprocessing token (see below), is replaced by the corresponding argument after all macros contained therein have been expanded. Before being substituted, each argument’s preprocessing tokens are completely macro replaced as if they formed the rest of the preprocessing file; no other preprocessing tokens are available.
So if all macros contained in the arguments were expanded before replacement, why don't we end up with ab(45)?
To let constructions like X(X()) work. Note that while X() is expanded, the X macro is disabled to avoid infinite recursion. Expanding arguments before expanding the macro lets one use X in the arguments.
A practical application of X(X()):
#define TEN(x) x x x x x x x x x x
#define HUNDRED(x) TEN(TEN(x))
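As for why the snippet in the question does not end up as ab(45): the quoted paragraph carves out an exception. A parameter that sits next to ## (or after #) is substituted without being expanded first, so MY_MACRO2(MY_,MACRO) pastes the raw tokens MY_ and MACRO into MY_MACRO, which is then found to be a macro on rescanning. A small sketch of the difference, using hypothetical helper macros PASTE_RAW, PASTE, STR_RAW, and STR that are not in the question:

```cpp
#include <cassert>
#include <cstring>

#define MY_ a
#define MACRO b

// All four helpers below are hypothetical names chosen for this sketch.
#define PASTE_RAW(t1, t2) t1##t2         // operands of ## are NOT expanded first
#define PASTE(t1, t2) PASTE_RAW(t1, t2)  // extra level: arguments are expanded first
#define STR_RAW(x) #x
#define STR(x) STR_RAW(x)                // expands x, then stringizes the result

inline void check_paste_order() {
    // Direct pasting glues the unexpanded spellings together.
    assert(std::strcmp(STR(PASTE_RAW(MY_, MACRO)), "MY_MACRO") == 0);
    // With one level of indirection, MY_ and MACRO become a and b before pasting.
    assert(std::strcmp(STR(PASTE(MY_, MACRO)), "ab") == 0);
}
```

The indirection through PASTE is the usual way to ask for expansion before pasting when that is what you want.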
I read about macros using Cppreference.
__LINE__ : expands to the source file line number, an integer constant, can be changed by the #line directive
I wrote a C++ program to test the __LINE__ macro.
#include <iostream>
using namespace std;
#line 10
#define L __LINE__
int main()
{
#line 20
int i = L;
cout<<i<<endl;
return 0;
}
Output :
20
Why is the output of the above code 20? Why is it not 10?
If you want to print 10 then change L into something that is not a macro:
constexpr int L = __LINE__;
Otherwise the macro L would be substituted on the line int i = L; and become:
int i = __LINE__;
There __LINE__ has to be substituted in turn, this time by the line number, which is taken from the most recent #line directive.
Recall that macros perform token substitution. When you #define L __LINE__ it only specifies what tokens should L be substituted for when it appears in the source. It does not substitute anything at the point of L's own definition.
C++ - [cpp.replace]/9 or C - [6.10.3 Macro replacement]/9
A preprocessing directive of the form
# define identifier replacement-list new-line
defines an object-like macro that causes each subsequent instance of
the macro name to be replaced by the replacement list of preprocessing
tokens that constitute the remainder of the directive. The replacement
list is then rescanned for more macro names as specified below.
#define y 42
#define x y
This makes x defined as a sequence of preprocessing tokens that contains one token y. Not the token 42.
cout << x;
This will expand x to y and then y to 42.
#undef y
#define y "oops"
x is still defined as y.
cout << x;
You can guess what happens. __LINE__ isn't special in this regard.
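Put together as one compilable sketch (the function names are invented for illustration, and an integer is used for the second definition of y instead of "oops" so that both results have the same type):

```cpp
#include <cassert>

#define y 42
#define x y   // x is defined as the single token y, not as 42

inline int x_before_redefinition() { return x; }  // x -> y -> 42

#undef y
#define y 7   // x itself is untouched; it still means "y"

inline int x_after_redefinition() { return x; }   // x -> y -> 7

inline void check_late_expansion() {
    assert(x_before_redefinition() == 42);
    assert(x_after_redefinition() == 7);
}

// Clean up so the short macro names cannot leak into later code.
#undef x
#undef y
```

The same textual definition of x yields two different values, because expansion happens at each point of use.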
Your #line preprocessor directive changes the value to 20:
#line 20
The macro is expanded where used (not where defined) which is in your main() function, after the preprocessor directive which changes the value to 20.
What does this line mean? Especially, what does ## mean?
#define ANALYZE(variable, flag) ((Something.##variable) & (flag))
Edit:
A little bit confused still. What will the result be without ##?
A little bit confused still. What will the result be without ##?
Usually you won't notice any difference. But there is a difference. Suppose that Something is of type:
struct X { int x; };
X Something;
And look at:
int X::*p = &X::x;
ANALYZE(x, flag)
ANALYZE(*p, flag)
Without token concatenation operator ##, it expands to:
#define ANALYZE(variable, flag) ((Something.variable) & (flag))
((Something. x) & (flag))
((Something. *p) & (flag)) // . and * are not concatenated to one token. syntax error!
With token concatenation it expands to:
#define ANALYZE(variable, flag) ((Something.##variable) & (flag))
((Something.x) & (flag))
((Something.*p) & (flag)) // .* is a newly generated token, now it works!
It's important to remember that the preprocessor operates on preprocessor tokens, not on text. So if you want to concatenate two tokens, you must explicitly say it.
## is called token concatenation, used to concatenate two tokens in a macro invocation.
See this:
Macro Concatenation with the ## Operator
One very important part is that this token concatenation follows some very special rules:
e.g. IBM doc:
Concatenation takes place before any macros in arguments are expanded.
If the result of a concatenation is a valid macro name, it is available for further replacement even if it appears in a context in which it would not normally be available.
If more than one ## operator and/or # operator appears in the replacement list of a macro definition, the order of evaluation of the operators is not defined.
The examples are also quite self-explanatory:
#define ArgArg(x, y) x##y
#define ArgText(x) x##TEXT
#define TextArg(x) TEXT##x
#define TextText TEXT##text
#define Jitter 1
#define bug 2
#define Jitterbug 3
With these results (invocation on the left, expansion on the right):
ArgArg(lady, bug)      "ladybug"
ArgText(con)           "conTEXT"
TextArg(book)          "TEXTbook"
TextText               "TEXTtext"
ArgArg(Jitter, bug)    3
Source is the IBM documentation. May vary with other compilers.
To your line:
It concatenates the variable argument onto "Something." to address a member, which is then bitwise ANDed with flag; the result tells you whether Something.variable has that flag set.
So here is an example for my last comment and your question (compilable with g++):
// this one fails with a compiler error
// #define ANALYZE1(variable, flag) ((Something.##variable) & (flag))
// this one will address Something.a (struct)
#define ANALYZE2(variable, flag) ((Something.variable) & (flag))
// this one will be Somethinga (global)
#define ANALYZE3(variable, flag) ((Something##variable) & (flag))
#include <iostream>
using namespace std;
struct something{
int a;
};
int Somethinga = 0;
int main()
{
something Something;
Something.a = 1;
if (ANALYZE2(a,1))
cout << "Something.a is 1" << endl;
if (!ANALYZE3(a,1))
cout << "Somethinga is 0" << endl;
return 0;
}
This is not an answer to your question, just a CW post with some tips to help you explore the preprocessor yourself.
The preprocessing step is actually performed prior to any actual code being compiled. In other words, when the compiler starts building your code, no #define statements or anything like that is left.
A good way to understand what the preprocessor does to your code is to get hold of the preprocessed output and look at it.
This is how to do it for Windows:
Create a simple file called test.cpp and put it in a folder, say c:\temp.
Mine looks like this:
#define dog_suffix( variable_name ) variable_name##dog
int main()
{
int dog_suffix( my_int ) = 0;
char dog_suffix( my_char ) = 'a';
return 0;
}
Not very useful, but simple. Open the Visual Studio command prompt, navigate to the folder, and run the following command line:
c:\temp>cl test.cpp /P
So it's the compiler you're running (cl.exe) with your file, and the /P option tells the compiler to store the preprocessed output in a file.
Now in the folder next to test.cpp you'll find test.i, which for me looks like this:
#line 1 "test.cpp"
int main()
{
int my_intdog = 0;
char my_chardog = 'a';
return 0;
}
As you can see, no #define left, only the code it expanded into.
According to Wikipedia
Token concatenation, also called token pasting, is one of the most subtle — and easy to abuse — features of the C macro preprocessor. Two arguments can be 'glued' together using ## preprocessor operator; this allows two tokens to be concatenated in the preprocessed code. This can be used to construct elaborate macros which act like a crude version of C++ templates.
Check Token Concatenation
Let's consider a different example:
#define MYMACRO(x,y) x##y
Without the ##, you would have to write xy, which is a single token, so the preprocessor clearly cannot see x and y as separate parameters, can it?
In your example,
#define ANALYZE(variable, flag) ((Something.##variable) & (flag))
## is simply not needed, as you are not creating a new identifier. In fact, the compiler issues: error: pasting "." and "variable" does not give a valid preprocessing token
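To make the contrast concrete, here is a small sketch; GLUE, HAS_FLAG, Settings, and the variable names are hypothetical and chosen only for illustration. ## is needed when two pieces must become one token, such as a new identifier, while member access already works token by token:

```cpp
#include <cassert>

// Hypothetical macros for illustration only.
#define GLUE(x, y) x##y  // x and y are pasted into ONE identifier
#define HAS_FLAG(object, member, flag) (((object).member) & (flag))  // no ## needed

struct Settings { int flags; };

inline void check_glue_and_member_access() {
    int GLUE(retry, _count) = 3;  // declares: int retry_count = 3;
    assert(retry_count == 3);

    Settings settings{0x5};
    assert(HAS_FLAG(settings, flags, 0x4) != 0);  // bit 0x4 is set in 0x5
    assert(HAS_FLAG(settings, flags, 0x2) == 0);  // bit 0x2 is not
}
```

Here `.` and `flags` stay two separate tokens, which is exactly what the member access expression needs.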
I have some C++ code with /* */ and // style comments, and I want a way to remove them all automatically. Apparently, using an editor (e.g. UltraEdit) with a regexp search for /*, */ and // should do the job. But on closer inspection, a complete solution isn't that simple, because the sequences /* or // may not represent a comment if they're inside another comment, a string literal, or a character literal. E.g.
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
is a comment sequence inside a string literal, and it isn't a simple task to determine whether a sequence is inside a pair of valid double quotes. While this
// this is a single line comment /* <--- this does not start a comment block
// this is a second comment line with an */ within
are comment sequences inside other comments.
Is there a more comprehensive way to remove comments from C++ source that takes string literals and comments into account? For example, can we instruct the preprocessor to remove comments without carrying out, say, #include directives?
The C pre-processor can remove the comments.
Edited:
I have updated the answer so that the predefined macros are used to evaluate the #if statements.
> cat t.cpp
/*
* Normal comment
*/
// this is a single line comment /* <--- this does not start a comment block
// this is a second comment line with an */ within
#include <stdio.h>
#if __SIZEOF_LONG__ == 4
int bits = 32;
#else
int bits = 16;
#endif
int main()
{
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
/*
* comment with a single // line comment enbedded.
*/
int x;
// A single line comment /* Normal enbedded */ Comment
}
Because we want the #if statements to expand correctly we need a list of defines.
That's relatively trivial. cpp -E -dM.
Then we pipe the #defines and the original file back through the pre-processor but prevent the includes from being expanded this time.
> cpp -E -dM t.cpp > /tmp/def
> cat /tmp/def t.cpp | sed -e s/^#inc/-#inc/ | cpp - | sed s/^-#inc/#inc/
# 1 "t.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "t.cpp"
#include <stdio.h>
int bits = 32;
int main()
{
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
int x;
}
Our SD C++ Formatter has an option to pretty print the source text and remove all comments. It uses our full C++ front end to parse the text, so it is not confused by whitespace, line breaks, string literals or preprocessor issues, nor will it break the code by its formatting changes.
If you are removing comments, you may be trying to obfuscate the source code. The Formatter also comes in an obfuscating version.
Could someone vote up my own answer to my own question?
Thanks to Martin York's idea, I found that in Visual Studio the solution looks very simple (subject to further testing). Just rename ALL preprocessor directives to something else (something that is not valid C++ syntax is OK) and use cl.exe with /P:
cl target.cpp /P
This produces target.i, which contains the source minus the comments. Just rename the directives back and there you go. You will probably need to remove the #line directives generated by cl.exe.
This works because, according to MSDN, the phases of translation are:
Character mapping
Characters in the source file are mapped to the internal source representation. Trigraph sequences are converted to single-character internal representation in this phase.
Line splicing
All lines ending in a backslash (\) and immediately followed by a newline character are joined with the next line in the source file forming logical lines from the physical lines. Unless it is empty, a source file must end in a newline character that is not preceded by a backslash.
Tokenization
The source file is broken into preprocessing tokens and white-space characters. Comments in the source file are replaced with one space character each. Newline characters are retained.
Preprocessing
Preprocessing directives are executed and macros are expanded into the source file. The #include statement invokes translation starting with the preceding three translation steps on any included text.
Character-set mapping
All source character set members and escape sequences are converted to their equivalents in the execution character set. For Microsoft C and C++, both the source and the execution character sets are ASCII.
String concatenation
All adjacent string and wide-string literals are concatenated. For example, "String " "concatenation" becomes "String concatenation".
Translation
All tokens are analyzed syntactically and semantically; these tokens are converted into object code.
Linkage
All external references are resolved to create an executable program or a dynamic-link library
Comments are removed during Tokenization, prior to the Preprocessing phase. So just make sure that nothing is available for processing during the preprocessing phase (by disguising all the directives), and the output should be just what the previous 3 phases produced.
As for user-defined .h files, use the /FI option to include them manually. The resulting .i file will be a combination of the .cpp and .h files, without comments. Each piece is preceded by a #line with the proper filename, so it is easy to split them up in an editor. If we don't want to split them up manually, we probably need to use the macro/scripting facility of some editor to do it automatically.
So now we don't have to care about any of the preprocessor directives. Even better, the line continuation character (backslash) is handled.
e.g.
// vc8.cpp : Defines the entry point for the console application.
//
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
/* comment here */
whatever error line is ok
-#else
some error line if NOERR not defined
// comment here
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
pr();
return 0;
}
/*comment*/
void pr() {
printf(" /* "); /* comment inside string " */
// comment terminated by \
continue a comment line
printf(" "); /** " " string inside comment */
printf/* this is valid comment within line continuation */\
("some weird lines \
with line continuation");
}
After cl.exe vc8.cpp /P, it becomes this, and can then be fed to cl.exe again after restoring the directives (and removing the #line)
#line 1 "vc8.cpp"
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
whatever error line is ok
-#else
some error line if NOERR not defined
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
pr();
return 0;
}
void pr() {
printf(" /* ");
printf(" ");
printf\
("some weird lines \
with line continuation");
}
You can use a rule-based parser (e.g. boost::spirit) to write syntax rules for comments. You will need to decide whether to process nested comments or not depending on your compiler. Semantic actions removing comments should be pretty straightforward.
Regexes are not meant to parse languages; using them for that is a frustrating attempt at best.
You actually need a full-blown parser for this. You might wish to consider Clang: rewriting is an explicit goal of the Clang libraries suite, and there are existing rewriters that you could take inspiration from.
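#include <fstream>
#include <iostream>

// Copies input.txt to output.txt with // and /* */ comments removed.
// String and character literals are copied verbatim, so comment markers
// inside them survive. Not handled: raw string literals, digit separators
// (1'000), and a backslash-newline splice inside a comment delimiter.
// Needs C++17 for [[fallthrough]].
int main() {
    std::ifstream fin("input.txt");
    std::ofstream fout("output.txt");
    if (!fin || !fout) {
        std::cerr << "cannot open input.txt or output.txt\n";
        return 1;
    }

    enum class State { Code, Slash, LineComment, BlockComment, BlockStar, Literal, Escape };
    State state = State::Code;
    char quote = '\0';     // delimiter of the literal currently being copied
    bool spliced = false;  // previous character of a // comment was a backslash
    char ch;
    while (fin.get(ch)) {  // stops cleanly at end of file
        switch (state) {
        case State::Slash:  // one '/' seen but not yet written
            if (ch == '/') { state = State::LineComment; spliced = false; break; }
            if (ch == '*') { state = State::BlockComment; break; }
            fout << '/';    // it was an ordinary slash, e.g. a division
            state = State::Code;
            [[fallthrough]];
        case State::Code:
            if (ch == '/') { state = State::Slash; break; }
            if (ch == '"' || ch == '\'') { quote = ch; state = State::Literal; }
            fout << ch;
            break;
        case State::LineComment:  // ends at a newline not preceded by a backslash
            if (ch == '\n' && !spliced) { fout << ch; state = State::Code; }
            spliced = (ch == '\\');
            break;
        case State::BlockComment:
            if (ch == '*') state = State::BlockStar;
            break;
        case State::BlockStar:  // inside /* ... and the last character was '*'
            if (ch == '/') { fout << ' '; state = State::Code; }  // comment becomes one space
            else if (ch != '*') state = State::BlockComment;
            break;
        case State::Literal:
            fout << ch;
            if (ch == '\\') state = State::Escape;
            else if (ch == quote) state = State::Code;
            break;
        case State::Escape:  // character right after a backslash inside a literal
            fout << ch;
            state = State::Literal;
            break;
        }
    }
    if (state == State::Slash) fout << '/';  // file ended right after a lone '/'
    return 0;
}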