Trying to understand the C preprocessor - c++

Why do these blocks of code yield different results?
Some common code:
#define PART1PART2 works
#define STRINGAFY0(s) #s
#define STRINGAFY1(s) STRINGAFY0(s)
case 1:
#define GLUE(a,b,c) a##b##c
STRINGAFY1(GLUE(PART1,PART2,*))
//yields
"PART1PART2*"
case 2:
#define GLUE(a,b) a##b##*
STRINGAFY1(GLUE(PART1,PART2))
//yields
"works*"
case 3:
#define GLUE(a,b) a##b
STRINGAFY1(GLUE(PART1,PART2*))
//yields
"PART1PART2*"
I am using MSVC++ from VS.net 2005 sp1
Edit:
it is currently my belief that the preprocessor works like this when expanding macros:
Step 1:
- take the body
- remove any whitespace around ## operators
- parse the string; when an identifier is found that matches the name of a parameter:
  - if it is next to a ## operator, replace the identifier with the literal value of the parameter (i.e. the string passed in)
  - if it is NOT next to a ## operator, run this whole explanation process on the value of the parameter first, then replace the identifier with that result
  (ignoring the stringify single '#' case for now)
- remove all ## operators
Step 2:
- take that resultant string and parse it for any macros
now, from that I believe that all 3 cases should produce the exact same resultant string:
PART1PART2*
and hence after step 2, should result in
works*
but at very least should result in the same thing.

Cases 1 and 2 have no defined behavior, since you are attempting to paste a * into one preprocessing token. Depending on the association rules of your preprocessor, this tries to glue together the tokens PART1PART2 (or just PART2) and *. In your case this probably fails silently, which is one of the possible outcomes when things are undefined. The token PART1PART2 followed by * will then not be considered for macro expansion again. Stringification then produces the result you see.
My gcc behaves differently on your examples:
/usr/bin/gcc -O0 -g -std=c89 -pedantic -E test-prepro.c
test-prepro.c:16:1: error: pasting "PART1PART2" and "*" does not give a valid preprocessing token
"works*"
So to summarize, your case 1 has two problems:
- pasting two tokens that don't result in a valid preprocessing token
- the evaluation order of the ## operator
In case 3, your compiler is giving the wrong result. It should:
- evaluate the arguments to STRINGAFY1
- to do that, it has to expand GLUE
- GLUE results in PART1PART2*
- which must be expanded again
- the result is works*
- which is then passed to STRINGAFY1
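For reference, here is how a conforming preprocessor arrives at that, written as a minimal sketch with the steps as comments (the step comments are mine, not tool output):
#define PART1PART2 works
#define STRINGAFY0(s) #s
#define STRINGAFY1(s) STRINGAFY0(s)
#define GLUE(a,b) a##b

STRINGAFY1(GLUE(PART1,PART2*))
// 1. s in STRINGAFY1 is not next to # or ##, so the argument is fully
//    macro-expanded first.
// 2. Inside GLUE, a and b ARE next to ##, so they are pasted unexpanded:
//    PART1 ## PART2 gives PART1PART2, followed by *.
// 3. Rescanning finds the object-like macro PART1PART2, yielding works*.
// 4. Substituting gives STRINGAFY0(works*), which stringizes to "works*".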

It's doing exactly what you are telling it to do. The first and second take the symbol names passed in and paste them together into a new symbol. The third takes two symbols and pastes them; then you are placing the * in the string yourself (which will eventually evaluate into something else).
What exactly is the question with the results? What did you expect to get? It all seems to be working as I would expect it to.
Then of course there is the question of why you are playing with the dark arts of symbol munging like this anyway? :)

Related

How to stringize string with trailing backslash

When I build my C++ project the compiler generates this equivalent macro:
#define SOLUTION_DIR "c:\dev\my_project\"
In a macro #defined normally in source, the trailing escaped double quote would trigger compiler errors due to the unterminated string, but the compiler can do whatever it wants here, and it makes this available to the code literally even though the string is invalid.
The usual way to expand macro values to C strings:
#define STRINGIZE( x ) #x
#define EXPAND( x ) STRINGIZE( x )
doesn't work in this case due to the unterminated string passed as argument.
std::string s = EXPAND( SOLUTION_DIR );
...
error: newline in constant
Is there a way to extract the string value of this macro and use it in my code equivalent to:
std::string str = R"(c:\dev\my_project\)";
where R is the raw string literal prefix described here: https://en.cppreference.com/w/cpp/language/string_literal
Notes:
- I tried re-writing these macros using the R prefix to avoid escaping the final quote mark, but couldn't get to a functional version.
- I can tell the compiler to define the SOLUTION_DIR string without the surrounding quotes, but I can't avoid the trailing backslash. In this case, however, I get other warnings and errors due to the unknown escape sequences (\d) and the fact that the trailing backslash is taken to indicate that the macro continues on the next line.
Update:
Here's the context for those who think something is broken and needs to be fixed.
I use Visual Studio 2019 (VS). In the project properties "C++/Preprocessor/Preprocessor Definitions" one can define various macros in the format:
NAME1=VALUE1;NAME2=VALUE2;...
which are then made available at compile time as
#define NAME1 VALUE1
#define NAME2 VALUE2
VS generates a number of predefined macros (not C++ but build environment macros) for various directories and other values (debug/release, 32 or 64 bit etc). They take the form $(Name) and are set to some string value such as:
$(Configuration) Debug
$(SolutionDir) C:\dev\some_project\
They are used to create location independent project settings such as the temp or binary output directories, or set the correct environment for whatever version of the project is being built (for instance Debug/x64).
In my case I need to get a hold of the current solution path directly in my code, and using the $(SolutionDir) VS macro seemed the easiest way to do it.
So here's how I defined my SOLUTION_DIR macro in "Properties/Preprocessor/Preprocessor Definitions":
SOLUTION_DIR="$(SolutionDir)"
which translates into the compile time macro described initially:
#define SOLUTION_DIR "c:\dev\my_project\"
However, by default many macros that expand to paths, including $(SolutionDir), contain a trailing backslash which can't be removed, hence the "broken" macro above.
Generally an executable binary doesn't need to and should not know anything about its build directories, so the path related macros are not necessarily designed to be used to define C++ macros, and the trailing backslash is not an issue. But my project needs that information because it itself triggers other build actions that depend on the current environment.
So this is not a malfunction of any of the components, everything works as designed, it just happens that for my specific project it would be very useful to be able to do things this way, even if it's non-standard.
I was able to make this work by adding a trailing ".":
SOLUTION_DIR="$(SolutionDir)."
which results in the equivalent:
#define SOLUTION_DIR "C:\dev\my_project\."
which points to the same directory and now compiles with no errors.
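For completeness, here is a minimal usage sketch under that fixed definition. The #ifndef fallback exists only so the snippet compiles standalone; in the real project the macro comes from the project settings, and the trailing-dot trimming is an optional extra:
#include <iostream>
#include <string>

#ifndef SOLUTION_DIR
#define SOLUTION_DIR "C:\\dev\\my_project\\."  // stand-in for "$(SolutionDir)."
#endif

int main() {
    std::string dir = SOLUTION_DIR;  // e.g. C:\dev\my_project\.
    // Optionally drop the trailing "." so the path ends with the backslash:
    if (dir.size() >= 2 && dir.compare(dir.size() - 2, 2, "\\.") == 0)
        dir.pop_back();
    std::cout << dir << '\n';
}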

How Should We Interpret a Macro with an Embedded Comma

How should we interpret the following macro definition according to the C++ standard? Note that the main issue is that the replacement list for AA contains an embedded comma (for, S):
#define AA for, S //<---note the embedded comma
#define VALUE_TO_STRING(x) ^x!
#define VALUE(x) VALUE_TO_STRING(x)
int _tmain(int argc, _TCHAR* argv[])
{
VALUE(AA)
return 0;
}
I've done a test with VC++2010 and the final result of the above looks like the following, without any error, but I have a problem interpreting the steps it took to come up with this result using the C++03 (or C++11) standard:
int wmain(int argc, _TCHAR* argv[])
{
^for, S!
return 0;
}
I've done some step by step tests with VC++2010. First I commented out the 2nd macro to see what was happening in the first step:
#define AA for, S
//#define VALUE_TO_STRING(x) ^x!
#define VALUE(x) VALUE_TO_STRING(x)
The macro replacement is straightforward and yielded a sequence that looks like another function-like macro invocation having TWO arguments:
int wmain(int argc, _TCHAR* argv[])
{
VALUE_TO_STRING(for, S)
return 0;
}
According to [cpp.rescan] the next step is to re-scan this for more macro names. The question here is: should this new invocation be interpreted as a function-like macro call with 2 arguments, or with 1 argument, "for, S"?
The normal interpretation is to consider that VALUE_TO_STRING() is given 2 arguments, which is invalid, and hence a preprocessor error results. But how come VC++ came up with a result without any error? Obviously, in the second step VC++ took for, S as 1 single argument, which doesn't make sense and isn't defined by the C++ standard.
I've done a test with VC++2010...
MS's preprocessor was never made standard. They phrase it this odd way:
C99 __func__ and Preprocessor Rules ... For C99 preprocessor rules, "Partial" is listed because variadic macros are supported.
In other words, "we support variadic macros; therefore we qualify as partially compliant". AFAIK standard compliance for the preprocessor is considered very low priority by the MS team. So I wouldn't tend to use VC or VC++ as a model of the standard preprocessor. gcc's a better model of the standard preprocessor here.
Since this is about the preprocessor I'm going to focus the story on just this snippet:
#define AA for, S
#define VALUE_TO_STRING(x) ^x!
#define VALUE(x) VALUE_TO_STRING(x)
VALUE(AA)
I'll be referencing ISO-14882 2011 here, which uses different numbers than 1998/2003. Using those numbers, here's what happens starting at the expansion step, step by step... except for steps not relevant here which I'll skip.
The preprocessor sees VALUE(AA), which is a function-like invocation of a previously defined function-like macro. So the first thing it does is argument identification, referencing 16.3 paragraph 4:
[if not variadic] the number of arguments (including those arguments consisting of no preprocessing tokens) in an invocation of a function-like macro shall equal the number of parameters in the macro definition
...and a portion of 16.3.1 paragraph 1:
After the arguments for the invocation of a function-like macro have been identified,
At this step, the preprocessor identifies that there is indeed one argument, that the macro was defined with one argument, and that the parameter x matches the invocation argument AA. So far, argument matching and x is AA is all that happened.
Then we get to the next step, which is argument expansion. With respect to this step, the only thing about the replacement list that really matters is where the parameters are in it, and whether or not they are part of a stringification (# x) or a paste (x ## ... or ... ## x). Parameters that are neither have their corresponding arguments expanded (stringified or pasted occurrences of the parameters don't count during this step). This expansion happens first, before anything else interesting goes on in the invocation, and it occurs just as if the preprocessor were expanding only the invocation argument.
In this case, the replacement list is VALUE_TO_STRING(x). Again, VALUE_TO_STRING might be a function-like macro, but since we're doing argument expansion right now we really don't care. The only thing we care about is that x is there, and it's not being stringified or pasted. x is being invoked with AA, so the preprocessor evaluates AA as if AA were on a line instead of VALUE(AA). AA is an object-like macro that expands to for, S. So the replacement list transforms into VALUE_TO_STRING(for, S).
This is the rest of 16.3.1 paragraph 1 in action:
A parameter in the replacement list, unless [stringified or pasted] is replaced by the corresponding argument after all macros contained therein have been expanded [...] as if they formed the rest of the preprocessing file
So far so good. But now we reach the next part, in 16.3.4:
After all parameters in the replacement list have been substituted and [stuff not happening here] the resulting preprocessing token sequence
is rescanned, along with all subsequent preprocessing tokens of the source file, for more macro names to replace.
This part evaluates VALUE_TO_STRING(for, S), as if that were the preprocessing token set (except that it also temporarily forgets that VALUE is a macro per 16.3.4p2, but that doesn't come into play here). That evaluation recognizes VALUE_TO_STRING as a function-like macro, being invoked like one, so argument identification begins again. Only here, VALUE_TO_STRING was defined to take one argument, but is invoked with two. That fails 16.3 p 4.
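For comparison, a conforming preprocessor rejects that rescan. gcc, for instance, reports something along these lines (the exact wording varies by version):
test.c:4:1: error: macro "VALUE_TO_STRING" passed 2 arguments, but takes just 1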
I think the answer lies in the order of expansion.
Your simulation of the preprocessor's expansion, i.e. your choice of which macro to expand first, does in my opinion not match what the preprocessor does.
I, acting as a preprocessor (conforming to the standard, I believed at first; but a comment contradicts that), would expand your code in this order:
VALUE(AA)
VALUE_TO_STRING(AA)
^AA!
^for, S!
This matches the result of the preprocessor for the original code.
Note that by this order it never sees the code VALUE_TO_STRING(for, S), the closest it gets is VALUE_TO_STRING(AA). That code does not cause the question concerning the number of arguments.
I did not quote anything from the standard, I think your quotes are sufficient.
As mentioned in a comment below, my answer is now an attempt at explaining how the result could come about, without assuming a conforming preprocessor. Any answer explaining it with conforming behaviour is definitely better.
By the way, acting as a compiler, I would probably not understand the ^anything! as a way to make a string from a value either. But that is not the question, and I assume that the meaning was lost when you prepared the minimal example. That is of course perfectly all right. It might however influence the expansion, if it ever expands to a quoted macro name, e.g. "AA". That would stop the expanding, and the result could unveil what happened.

Why is C/C++ preprocessor adding a space here?

I have a tiny problem with the preprocessor that puzzles me, and I cannot find any explanation for it in the documentation/preprocessor/language spec.
#define booboo() aaa
booboo()bbb
booboo().bbb
is preprocessed into:
aaa bbb <--- why is space added here
aaa.bbb
After handling trigraphs, continued lines, and comments, the preprocessor processes directives and divides the input into preprocessing tokens and whitespace. booboo's replacement list comprises one pp-token, the identifier 'aaa'. booboo()bbb is divided into the pp-tokens 'booboo', '(', ')', 'bbb'. The sequence 'booboo', '(', ')' is recognised as a function-like macro invocation, and it should be expanded to 'aaa', so imho the output should look like 'aaabbb'. I say look like since, to a human, it would look like one token, whereas the compiler would get 2 tokens, 'aaa' and 'bbb', since no '##' operator was used that allows pp-token concatenation. Why, and by what rule, does cpp (the C preprocessor) place an additional space between 'aaa' and 'bbb', when 'booboo().bbb' results in 'aaa.bbb' without a space?
Is this because cpp tries to make the output (which is mostly for humans) unambiguous? A human is not able to tell that 'aaabbb' is composed of 2 tokens, as it sees the tokens' spelling only. Am I right? I've read the C99 documentation about the preprocessor and gcc's documentation for cpp, and I see nothing about it.
If I am right we have similar situation here:
#define baba() +
baba()+
baba()-
results in:
+ +
+-
Otherwise (if '++' were the output) it would look to a human like a '++' token, but there would be 2 tokens, '+' and '+'. Is it like the '##' operator, where cpp checks whether concatenation produces a valid token, except that in the shown cases it wants to keep the human from thinking a concatenation was performed? '+-' is not ambiguous, hence no space is added.
The result of preprocessing is to transform the source file into a list of tokens. In your case the list of tokens would look like this after tokenization:
....
booboo()
bbb
....
and then after macro replacement:
....
aaa
bbb
....
Then the compiler translates the list of tokens into an executable.
The whitespace you are seeing is just an implementation detail: it is how your compiler chose to lay out the preprocessing tokens when displaying an intermediate result to you. The standards say nothing about any intermediate processing files. It is not required that there be a separate program to do preprocessing, either.
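To illustrate (PASTE is an illustrative helper, not from the question): without ##, two adjacent expansion results remain two tokens no matter how the textual output displays them; actual concatenation into one token requires ##:
#define booboo() aaa
#define PASTE(x,y) x ## y

booboo()bbb      // two tokens, aaa and bbb; any space shown is display only
PASTE(aaa,bbb)   // one token: aaabbb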
I wrote an ANSI C compiler myself in the early '90s. As far as I remember, a comment token /* ... */ should be replaced by a single white-space. Macros do text replacement, in place. The tokens which result from the text replacement of such macro expansion(s) need not be legal C language tokens. When a macro is defined as the text 'aaa', it is just that text 'aaa' that makes its way into the input stream. C's parser may or may not see valid tokens as a result of that!
Hence, given:
#define booboo() aaa
Expanding booboo()bbb should result in the text aaabbb.
What that aaabbb means is up to the user. But that aaabbb will not be preprocessed again, even if it happens to be the name of a macro. That is for sure. And aaabbb could be a user identifier - no issues there.

Why do I need double layer of indirection for macros?

At: C++ FAQ - Miscellaneous technical issues - [39.6] What should be done with macros that need to paste two tokens together?
Could someone explain to me why? All I read is "trust me", but I simply can't trust something just because someone said so.
I tried the approach and I can't find any bugs appearing:
#define mymacro(a) int a ## __LINE__
mymacro(prefix) = 5;
mymacro(__LINE__) = 5;
int test = prefix__LINE__*__LINE____LINE__; // fine
So why do I need to do it like this instead (quote from the webpage):
However you need a double layer of indirection when you use ##.
Basically you need to create a special macro for "token pasting" such
as:
#define NAME2(a,b) NAME2_HIDDEN(a,b)
#define NAME2_HIDDEN(a,b) a ## b
Trust me on this — you really need to do
this! (And please nobody write me saying it sometimes works without
the second layer of indirection. Try concatenating a symbol with
__LINE__ and see what happens then.)
Edit: Could someone also explain why he uses NAME2_HIDDEN before it is defined below? It seems more logical to define the NAME2_HIDDEN macro before using it. Is it some sort of trick here?
The relevant part of the C spec:
6.10.3.1 Argument substitution
After the arguments for the invocation of a function-like macro have been identified,
argument substitution takes place. A parameter in the replacement list, unless preceded
by a # or ## preprocessing token or followed by a ## preprocessing token (see below), is
replaced by the corresponding argument after all macros contained therein have been
expanded. Before being substituted, each argument’s preprocessing tokens are
completely macro replaced as if they formed the rest of the preprocessing file; no other
preprocessing tokens are available.
The key part that determines whether you want the double indirection or not is the second sentence and the exception in it -- if the parameter is involved in a # or ## operation (such as the params in mymacro and NAME2_HIDDEN), then any other macros in the argument are NOT expanded prior to doing the # or ##. If, on the other hand, there's no # or ## IMMEDIATELY in the macro body (as with NAME2), then other macros in the parameters ARE expanded.
So it comes down to what you want -- sometimes you want all macros expanded FIRST and then the # or ## done (in which case you want the double layer of indirection), and sometimes you DO NOT want the macros expanded first (in which case you CAN'T HAVE double-layer macros; you need to do it directly).
__LINE__ is a special macro that is supposed to resolve to the current line number. When you do a token paste with __LINE__ directly, however, it doesn't get a chance to resolve, so you end up with the token prefix__LINE__ instead of, say, prefix23, like you would probably expect if you wrote this code in the wild.
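A minimal sketch of the contrast, reusing the FAQ's NAME2/NAME2_HIDDEN and adding NAME1 (my name, for illustration) as the single-layer version:
#define NAME1(a,b) a ## b               // single layer: pastes immediately
#define NAME2(a,b) NAME2_HIDDEN(a,b)    // double layer, from the FAQ
#define NAME2_HIDDEN(a,b) a ## b

int NAME1(prefix, __LINE__);  // declares prefix__LINE__
int NAME2(prefix, __LINE__);  // declares e.g. prefix6, the actual line number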
Chris Dodd has an excellent explanation for the first part of your question. As for the second part, about the definition sequence, the short version is that #define directives by themselves are not evaluated at all; they are only evaluated and expanded when the symbol is found elsewhere in the file. For example:
#define A a //adds A->a to the symbol table
#define B b //adds B->b to the symbol table
int A;
#undef A //removes A->a from the symbol table
#define A B //adds A->B to the symbol table
int A;
The first int A; becomes int a; because that is how A is defined at that point in the file. The second int A; becomes int b; after two expansions. It is first expanded to int B; because A is defined as B at that point in the file. The preprocessor then recognizes that B is a macro when it checks the symbol table. B is then expanded to b.
The only thing that matters is the definition of the symbol at the point of expansion, regardless of where the definition is.
The most non-technical answer, which I gathered from all the links here, and links of links ;), is that a single-layer macro(x) #x stringifies the inputted macro's name, but by using two layers, it stringifies the inputted macro's value.
#include <stdio.h>

#define valueOfPi 3
#define macroHlp(x) #x
#define macro(x) macroHlp(x)
#define myVarOneLayer "Apprx. value of pi = " macroHlp(valueOfPi)
#define myVarTwoLayers "Apprx. value of pi = " macro(valueOfPi)

int main(void) {
    printf(myVarOneLayer);  // out: Apprx. value of pi = valueOfPi
    printf(myVarTwoLayers); // out: Apprx. value of pi = 3
    return 0;
}
What happens at printf(myVarOneLayer):
printf(myVarOneLayer) is expanded to printf("Apprx. value of pi = " macroHlp(valueOfPi)).
macroHlp(valueOfPi) tries to stringify the input; the input itself is not evaluated. Its only purpose in life is to take an input and stringify it. So it expands to "valueOfPi".
So, what happens at printf(myVarTwoLayers):
printf(myVarTwoLayers) is expanded to printf("Apprx. value of pi = " macro(valueOfPi)).
macro(valueOfPi) has no stringification operation, i.e. there is no #x in its expansion, but there is an x, so it has to evaluate x and pass the value to macroHlp for stringification. It expands to macroHlp(3), which in turn stringifies the number 3, since it uses #x.
The order in which macros are declared is not important; the order in which they are used is. If you were to actually use that macro before it was declared (in actual code, that is, not in a macro which remains dormant until summoned), then you would get an error of sorts. But most sane people don't go around doing these kinds of things: writing a macro and then a function that uses a macro not yet defined further down, etc. It seems your question isn't just one question, but I'll just answer that one part; I think you should have broken this down a little more.

Parsing C/C++ source: How are token boundaries/interactions specified in lex/yacc?

I want to parse some C++ code, and as a guide I've been looking at the C lex/yacc definitions here: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html and http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
I understand the specifications of the tokens themselves, but not how they interact. E.g. it's OK to have an operator such as = directly follow an identifier without intervening whitespace (i.e. "foo="), but it's not OK to have a numerical constant immediately followed by an identifier (i.e. 123foo). However, I don't see any way that such rules are represented.
What am I missing? Or is this lex/yacc too liberal in its acceptance of errors?
The lexer converts a character stream into a token stream (I think that's what you mean by token specification). The grammar specifies what sequences of tokens are acceptable. Hence, you won't see that something is not allowed; you only see what is allowed. Does that make sense?
EDIT
If the point is to get the lexer to distinguish the sequence "123foo" from the sequence "123 foo" one way is to add a specification for "123foo". Another way is to treat spaces as significant.
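A hedged sketch of what such an added lex rule could look like (the pattern and action are illustrative, not from the linked grammar; it assumes <stdio.h> is included in the lex prologue). Because lex prefers the longest match, this rule wins over the separate constant and identifier rules for input like 123foo:
[0-9]+[a-zA-Z_][a-zA-Z0-9_]*  { fprintf(stderr, "invalid token: %s\n", yytext); }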
EDIT2
A syntax error can be "detected" from the lexer or the grammar production or the later stages of the compiler (think of, say, type errors, which are still "syntax errors"). Which part of the whole compilation process detects which error is largely a design issue (as it affects the quality of error messages), I think. In the given example, it probably makes more sense to outlaw "123foo" via a tokenization to an invalid token rather than relying on the non-existence of a production with a numeric literal followed by an identifier (at least, this is the behaviour of gcc).
The lexer is fine with 123foo and will split it into two tokens:
- an integer constant
- and an identifier.
But try to find the part in the grammar that allows those two tokens to sit side by side like that. Thus I bet the parser generates an error when it sees these two tokens.
Note the lexer does not care about whitespace (unless you explicitly tell it to worry about it). In this case it just throws the whitespace away:
[ \t\v\n\f] { count(); } // Throw away white space without looking.
Just to check this is what I built:
wget -O l.l http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
wget -O y.y http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
Edited the file l.l to stop the compiler complaining about undeclared functions:
#include "y.tab.h"
// Add the following lines
int yywrap();
void count();
void comment();
int check_type();
// Done adding lines
%}
Create the following file: main.c:
#include <stdio.h>
extern int yylex();
int main()
{
int x;
while((x = yylex()) != 0)
{
fprintf(stdout, "Token(%d)\n", x);
}
}
Build it:
$ bison -d y.y
y.y: conflicts: 1 shift/reduce
$ flex l.l
$ gcc main.c lex.yy.c
$ ./a.out
123foo
123Token(259)
fooToken(258)
Yes it split it into two tokens.
What's essentially going on is that the lexical rules for each token type are greedy. For instance, the character sequence foo= cannot be interpreted as a single identifier, because identifiers don't contain symbols. On the other hand, 123abc is actually a numerical constant, though malformed, because numerical constants can end with a sequence of alphabetic characters that are used to express the type of the numerical constant.
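A classic, valid illustration of that greediness (my example, not the poster's): the lexer always takes the longest possible token, so a+++b lexes as a++ + b, never a + ++b:
#include <stdio.h>

int main(void)
{
    int a = 1, b = 2;
    int c = a+++b;            /* maximal munch: parsed as (a++) + b */
    printf("%d %d\n", c, a);  /* prints 3 2 */
    return 0;
}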
You won't be able to parse C++ with lex and yacc alone, as it has an ambiguous grammar. You'd need a more powerful approach such as GLR, or some hackish solution which modifies the lexer at runtime (that's what most of the current C++ parsers do).
Take a look at Elsa/Elkhound.