How do parsers handle preprocessors and conditional compilation? - c++

I am trying to figure out how parsers handle preprocessor directives and conditional compilation. Using C++ as an example: are preprocessor directives included in the C++ grammar rules, or is the preprocessor a separate language whose processing happens before parsing? In either case, how can a parser report errors in all possible branches and recover information about the original code layout from before preprocessing (such as the line number where the error occurred)?

Taken from the C Preprocessor docs:
The C preprocessor informs the C compiler of the location in your source code where each token came from.
So in the case of GCC, the parser knows where errors occur because the preprocessor tells it. I am unsure whether this quotation refers to preprocessing tokens or to all C++ tokens.
This page has a few more details on how the magic happens.
The cpp_token structure contains line and col members. The lexer fills these in with the line and column of the first character of the token. Consequently, but maybe unexpectedly, a token from the replacement list of a macro expansion carries the location of the token within the #define directive, because cpplib expands a macro by returning pointers to the tokens in its replacement list.
[...] This variable therefore uniquely enumerates each line in the translation unit. With some simple infrastructure, it is straightforward to map from this to the original source file and line number pair
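This mapping is also visible from the user's side: the standard #line directive (and the # 1 "file.h" line markers that gcc -E emits) overrides the file and line the compiler reports in diagnostics. A minimal sketch, with an illustrative file name:

#line 100 "original.cpp"
int oops = "not an int";  // the diagnostic is reported at original.cpp:100,
                          // regardless of where this line physically sits

Compilers use exactly this kind of presumed-location bookkeeping to point errors back at the pre-preprocessing source.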
Here is a copy of the C++14(?) draft standard. The preprocessing grammar is in Appendix A.14. I'm not sure it matters whether you want to call it a separate language or not. Per [lex.phases] (section 2.2), C++ compilers behave as if preprocessing happens before the main translation/parsing.

Related

How to disable syntax check in c header file

I'm embedding Lua code in C++; it's okay to write it like this:
char const *lua_scripts = R"rawstring(
-- lua code
)rawstring";
But the Lua code inside the string doesn't get syntax highlighting, so I split it into three files:
The first file is called head.txt
char const *lua_scripts = R"rawstring(
The second file is called body.lua
-- lua code
The third file is called tail.txt
)rawstring";
Then the original .cpp file changes to:
#include "head.txt"
#include "body.lua"
#include "tail.txt"
But when I compile, a syntax error is reported, because the compiler checks each file before inclusion. So how can I disable the compiler's syntax checking?
In C++, programs are parsed after preprocessing. But dividing the input into lexemes is done before preprocessing. The input to the preprocessor is a stream of tokens, not a stream of characters.
So a token cannot span two input files. And a string literal is a single token.
You also may not split preprocessor directives over two files, so #endif, #else, etc. must all be in the same file as the #if or #ifdef, and the last line in a file cannot end with a backslash line-splice.
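For instance, splitting a conditional across two files is ill-formed (a sketch):

// cond.h -- unterminated conditional: error at the end of this file
#if defined(SOMETHING)
int x = 1;
// (no #endif before the file ends)

// main.cpp
#include "cond.h"   // error: unterminated #if in cond.h
#endif              // error: #endif without a matching #if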
You could easily write your own little merging program which builds a C++ file from the C++ and Lua source files; you could even write it in Lua, it's not that complicated (a sketch in C++ follows below). Or you could do it with the M4 macro processor, which is most likely already installed in your compilation environment.
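For illustration, a minimal sketch of such a merging program in C++ (file names match the question; error handling kept minimal):

// merge.cpp: generate a C++ source file that embeds body.lua in a raw string
#include <fstream>

int main() {
    std::ifstream body("body.lua");
    std::ofstream out("lua_scripts.generated.cpp");
    out << "char const *lua_scripts = R\"rawstring(\n";
    out << body.rdbuf();                // copy the Lua source verbatim
    out << ")rawstring\";\n";
    return (body && out) ? 0 : 1;       // nonzero on I/O failure
}

You would run this as a pre-build step and compile lua_scripts.generated.cpp instead of hand-maintaining the three split files.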
There are nine phases of translation that occur when C++ code is compiled. Phase 3 is when string literals are identified. Phase 4 is the preprocessor. By the time the compiler gets to #include your files, all the string literals in your original source file have been found and marked as such. There will not be another pass of your source file looking for more literals after the preprocessor is done.
When the preprocessor brings in a file, that file goes through the first four phases of translation before being inserted into your original source file. This is slightly different from the common, simplified perception of a header file being directly copied into a source file. Rather than a character-by-character copy, the header is copied token-by-token, where "token" means "preprocessing token", which includes such things as identifiers, operators, and literals.
In practice, the simplified view is adequate until you try to have language elements cross file boundaries. In particular, neither comments nor string literals can start in one file and extend into another. (There are other examples, but it's a bit more contrived to bring them into play.) You tried to have a string literal begin in one file, extend into a second, and end in a third. This does not work.
When the preprocessor brings in head.txt, the first three phases analyze it as five preprocessor tokens followed by a non-terminated raw string literal. This is what gets copied into your source file. Note that the non-terminated literal remains a non-terminated literal; it does not become a literal looking for an end.
When body.lua is brought in, it is treated just like any other header file. The preprocessor is not concerned about extensions. The file is brought in and subject to the phases of translation just like any other #include. Phase 3 will identify, using C++ syntax rules, string literals that begin in body.lua, but no part of body.lua will become part of a string literal that begins outside body.lua. Phase 3, including the identification of string literals, happens on this file in isolation.
Your approach has failed.
So how can I disable compiler checking syntax?
You cannot disable compiler syntax checking. That's like asking how a person can read a book without picking out letters and words. You've asked the compiler to process your code, and the first step of that is making sure the code is understandable, i.e. has valid syntax. It's questions like this that remind us that XY problems are as prevalent as ever.
Fortunately, though, you did mention your real problem: "doesn't have syntax highlight". Unfortunately, you did not provide enough information about your real problem, such as what program is providing the syntax highlighting. I subjected the following to two different syntax highlighters; one highlighted the Lua code as Lua code, and the other did not.
R"rawstring(
-- lua code
)rawstring"
If you are willing to ignore the highlighting on the first and last lines, and if your editor successfully applies the desired syntax highlighting, you could make this your body.lua file. Then the following C++ code should work.
char const *lua_scripts =
#include "body.lua"
;
Statements are not identified until phase seven – well after the preprocessor – so you can split statements across files.
You could use the unix xxd utility in a pre-build step to preprocess your body.lua file as follows:
xxd -i body.lua body.xxd
Then, in your C++ code (note that xxd -i derives the symbol names from the input file name, so body.lua produces body_lua and body_lua_len):
#include <string>
#include "body.xxd"
const std::string lua_scripts(reinterpret_cast<const char *>(body_lua), body_lua_len);
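For reference, the generated body.xxd will look something like this (the byte values here correspond to an illustrative body.lua containing just the line "-- lua code"):

unsigned char body_lua[] = {
  0x2d, 0x2d, 0x20, 0x6c, 0x75, 0x61, 0x20, 0x63, 0x6f, 0x64, 0x65, 0x0a
};
unsigned int body_lua_len = 12;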

Preprocessing: Is defining a shorthand for `import` legal?

For solving a code-golf challenge, I want to produce the smallest possible code. I had the idea of defining a shorthand for import:
#define I import
I<vector>;
short godbolt example
Of course, the intention here is to reuse I to actually save bytes.
Is this legal in C++20?
Thoughts / What I found out so far:
According to cppreference, "The module and import directives are also preprocessing directives". So I think it would boil down to the question whether we have a guarantee that the preprocessor first has to replace I with our definition?
I think handling the import directive should happen in translation phase 4, and for the whole phase, I should not be macro-expanded unless specified otherwise ([cpp.pre]-7). Is it specified otherwise for this case?
Is it possible this works as part of the preprocessor rescan?
Clang and GCC on godbolt do not compile it, but AFAIK they don't yet support importing standard library headers without extra steps, and they give the same error message with and without the shorthand, which indicates it would work(?)
The same approach, but with include instead of import, does not work with gcc and clang and thus probably isn't legal.
No.
[cpp.pre]/1:
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints: At the start of translation phase 4, the first token in the sequence, referred to as a directive-introducing token, begins with the first character in the source file (optionally after whitespace containing no new-line characters) or follows whitespace containing at least one new-line character, and is [...]
Preprocessing-directive-ness is determined at the start of translation phase 4, prior to any macro replacement. Therefore, I<vector>; is not recognized as a directive, and the import from the macro expansion of I is not replaced by the import-keyword token. This in turn means that it is not recognized as a module-import-declaration during translation phase 7, and is instead simply an ill-formed attempt to use the identifier import without a preceding declaration.
The point of this dance is to ensure that build systems can know a file's dependencies without having to fully preprocess the file, which would be required if imports could be formed by macro replacement.
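A short contrast may help (a sketch; HDR is an illustrative name). Macro replacement can complete a line that already begins as a directive, but it can never turn ordinary text into one:

#define HDR <vector>
#include HDR   // OK: the line is already a #include directive at the start
               // of phase 4, so its remaining tokens are macro-expanded

#define I import
I<vector>;     // not a directive: I expands to the plain identifier
               // import, which is ill-formed here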

How do I use a preprocessor macro inside an include?

I am trying to build freetype2 using my own build system (I do not want to use Jam, and I am prepared to put the time into figuring it out). I found something odd in the headers. Freetype defines macros like this:
#define FT_CID_H <freetype/ftcid.h>
and then uses them later like this:
#include FT_CID_H
I didn't think that this was possible, and indeed Clang 3.9.1 complains:
error: expected "FILENAME" or <FILENAME>
#include FT_CID_H
What is the rationale behind these macros?
Is this valid C/C++?
How can I convince Clang to parse these headers?
This is related to How to use a macro in an #include directive? but different because the question here is about compiling freetype, not writing new code.
I will address your three questions out of order.
Question 2
Is this valid C/C++?
Yes, this is indeed valid. Macro expansion can be used to produce the final version of a #include directive. Quoting C++14 (N4140) [cpp.include] 16.2/4:
A preprocessing directive of the form
# include pp-tokens new-line
(that does not match one of the two previous forms) is permitted. The preprocessing tokens after include in the directive are processed just as in normal text (i.e., each identifier currently defined as a macro name is replaced by its replacement list of preprocessing tokens). If the directive resulting after all replacements does not match one of the two previous forms, the behavior is undefined.
The "previous forms" mentioned are #include "..." and #include <...>. So yes, it is legal to use a macro which expands to the header/file to include.
Question 1
What is the rationale behind these macros?
I have no idea, as I've never used the freetype2 library. That would be a question best answered by its support channels or community.
Question 3
How can I convince Clang to parse these headers?
Since this is legal C++, you shouldn't have to do anything. Indeed, user #Fanael has demonstrated that Clang is capable of parsing such code. There must be some other problem in your setup or something else you haven't shown.
Is this valid C/C++?
The usage is valid C, provided that the macro definition is in scope at the point where the #include directive appears. Specifically, paragraph 6.10.2/4 of C11 says
A preprocessing directive of the form
# include pp-tokens new-line
(that does not match one of the two previous forms) is permitted. The preprocessing tokens after include in the directive are processed just as in normal text. (Each identifier currently defined as a macro name is replaced by its replacement list of preprocessing tokens.) The directive resulting after all replacements shall match one of the two previous forms.
(Emphasis added.) Inasmuch as the preprocessor has the same semantics in C++ as in C, to the best of my knowledge, the usage is also valid in C++.
What is the rationale behind these macros?
I presume it is intended to provide for indirection of the header name or location (by providing alternative definitions of the macro).
How can I convince Clang to parse these headers?
Provided, again, that the macro definition is in scope at the point where the #include directive appears, you shouldn't have to do anything. If indeed it is, then Clang is buggy in this regard. In that case, after filing a bug report (if this issue is not already known), you probably need to expand the troublesome macro references manually.
But before you do that, be sure that the macro definitions really are in scope. In particular, they may be guarded by conditional compilation directives; in that case, the best course of action would probably be to provide whatever macro definition is needed (via the compiler command line) to satisfy the condition, as sketched below. If you are expected to do this manually, then surely the build documentation discusses it. Read the build instructions.
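A sketch of that pitfall (the guard macro name here is illustrative, not freetype's actual configuration macro):

// in the library's headers:
#ifdef MY_CONFIG_OPTION
#  define FT_CID_H <freetype/ftcid.h>
#endif

// in your code:
#include FT_CID_H   // fails with "expected FILENAME" unless the build
                    // defined MY_CONFIG_OPTION, e.g. via -DMY_CONFIG_OPTION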

Why do parens prevent macro substitution?

While researching solutions to the Windows min/max macro problem, I found an answer that I really like, but I do not understand why it works. Is there something within the C++ specification that says that macro substitution doesn't occur within parens? If so, where is that? Is this just a side effect of something else, or is the language designed to work that way? If I use extra parens, the max macro doesn't cause a problem:
(std::numeric_limits<int>::max)()
I'm working in a large scale MFC project, and there are some windows libraries that use those macros so I'd prefer not to use the #undef trick.
My other question is this: does #undef max within a .cpp file only affect the file it is used in, or would it undefine max for other compilation units?
Function-like macros only expand when the next preprocessing token after the macro name is an opening parenthesis. When you surround the name with parentheses, the next token after the name is a closing parenthesis, so no expansion occurs.
From C++11 § 16.3 [cpp.replace]/10:
Each subsequent instance of the function-like macro name followed by a ( as the next preprocessing token introduces the sequence of preprocessing tokens that is replaced by the replacement list in the definition (an invocation of the macro).
To answer the other question, preprocessing happens before normal compilation and linking, so doing an #undef in an implementation file will only affect that file. In a header, it affects every file that includes that header.
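A minimal sketch of both points, using a stand-in definition for the macro that windows.h would otherwise supply:

#include <limits>
#define max(a, b) (((a) > (b)) ? (a) : (b))   // stand-in for windows.h's max

int main() {
    // int bad = std::numeric_limits<int>::max();  // max( would expand: error
    int ok = (std::numeric_limits<int>::max)();    // ) follows max: no expansion
#undef max
    int fine = std::numeric_limits<int>::max();    // plain call works after #undef
    return ok == fine ? 0 : 1;
}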

Multiple preprocessor directives on one line in C++

A hypothetical question: Is it possible to have a C++ program, which includes preprocessor directives, entirely on one line?
Such a line would look like this:
#define foo #ifdef foo #define bar #endif
What are the semantics of such a line?
Further, are there any combinations of directives which are impossible to construct on one line?
If this is compiler-specific then both VC++ and GCC answers are welcome.
A preprocessing directive must be terminated by a newline, so this is actually a single preprocessing directive that defines an object-like macro, named foo, that expands to the following token sequence:
# ifdef foo # define bar # endif
Any later use of the name foo in the source (until it is #undefed) will expand to this, but after the macro is expanded, the resulting tokens are not evaluated as a preprocessing directive.
This is not compiler-specific; this behavior is defined by the C and C++ standards.
Preprocessor directives are somewhat different from language statements, which are terminated by ; and use whitespace to delimit tokens. In the case of the preprocessor, the directive is terminated by a newline, so it's impossible to do what you're attempting using the C++ language itself.
One way you could kind of simulate this is to put your desired lines into a separate header file and then #include it where you want. The separate header still has to have each directive on one line, but the point where you include it is just a single line, effectively doing what you asked.
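For example (the file and macro names here are illustrative):

// directives.h -- each directive on its own line, as required:
#ifndef FEATURE
#define FEATURE 1
#endif

// main.cpp -- a single line pulls all of them in:
#include "directives.h"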
Another way to accomplish something like that is to have a pre-C++ file that you use an external process to process into a C++ source file prior to compiling with your C++ compiler. This is probably rather more trouble than it's worth.