Parsing irregular c++ prototypes - c++

I am trying to build a program that parses and lists the content of header files. So far, so good, I found it easy parsing and listing headers I wrote, but when I started parsing cross platform API headers things got messy.
My current approach is rather simplistic, here is a pseudocode example of parsing the following function:
void foo(int a);
void is a type, so we are dealing with instancing a type
foo is the name of that type
foo is followed by brackets, meaning it is a function of type void named foo
int is a type...
a is the name of that type instance
foo is a function of type void that takes one parameter of type int named a
However, when I got into bigger and more complex headers I stumbled upon somewhat irregular prototypes, involving macros and god knows what. An example:
GLAPI void APIENTRY glEvalCoord1d( GLdouble u );
GLAPI and APIENTRY are platform dependent macros. Which kind of spoils my simple parsing scheme, since it expects the name of an object to follow its type. Those two macros happen to translate to either __stdcall, __declspec(dllimport) or extern but in theory they could mean anything, with their meaning being unclear until compile time.
How can I write my parser so it can deal with such scenarios and not get confused? The macros themselves are defined at an earlier stage, so the parser can know that GLAPI and APIENTRY are macros and simply ignore them. Is this the way to go? Naturally this is just one of the many irregularities the parser may stumble upon when parsing different headers, so any general techniques for parsing any "legal" header content are welcome.

There isn't any real alternative to expanding the macros before you parse, at least if you want to process header files with the same complexity as Microsoft's, or any other header files associated with a compiler system that has been around for 10 years or more.
The unpreprocessed source code is NOT C; it is simply unpreprocessed source code. The macros (and preprocessor conditionals, which you surprisingly didn't mention) can edit the apparent source in not arbitrary but spectacularly complex ways. And you often can't know which macros are used, or how the conditionals expand, unless you process the #includes as well.
You can get GCC to do preprocessor expansion for you, and then parse the result. That would be by far the easiest way to approach this.
That still leaves the problem of parsing real C code, with all the complexities of declarators, and ambiguities in fragments such as T X; where the meaning of the statement depends on the declaration of T. To parse the headers accurately, you need a full C parser.
Our C Front End can do full preprocessing, or you can invoke it in a mode in which some macros are expanded and some are not. By tuning this set, you can often parse such headers without expanding every macro. Preprocessor conditionals are much more difficult, because they can occur at inconvenient (unstructured) places.
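As a small illustration of that ambiguity, here is a sketch using the classic variant T * x, where the same tokens are a declaration or an expression depending on what T names. The namespace and function names are made up for this example.

```cpp
#include <cassert>

namespace t_is_a_type {
typedef int T;
inline bool demo() {
    T * x = nullptr;   // declaration: x is a pointer to T
    return x == nullptr;
}
}  // namespace t_is_a_type

namespace t_is_a_value {
inline int demo() {
    int T = 6, x = 7;
    return T * x;      // expression: T multiplied by x
}
}  // namespace t_is_a_value

// Both readings compile; which one applies depends only on the declaration of T.
inline void check_both_readings() {
    assert(t_is_a_type::demo());
    assert(t_is_a_value::demo() == 42);
}
```

A parser that does not track declarations cannot tell these two cases apart, which is why a symbol table is needed on top of the grammar.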

If all you want is the name and signature of functions, then a simple search and replace for macros should be sufficient.
However, you need to check whether a macro contains keywords (like the return type). This may be possible by stripping macro definitions of everything but keywords as they are defined, but tracking them and using a simple preprocessor will be necessary.
The platform dependent keywords, such as __declspec and __attribute__ have very limited syntax and there are only a few of them, so specifically removing those is possible.
You may want to take a look at how doxygen handles this, because it does almost exactly what you want and does handle macros. It allows a list of macros to be expanded as defined, and ones that should be expanded to a custom value. You could adapt that to expand __declspec(x) to nothing, and expand all others to their defined value by default.
This certainly isn't foolproof, but a search and replace is about the simplest functional solution you'll get. You need to follow the standard C++ preprocessor rules, which aren't terribly complex, with additional macros (const, declspec, etc) to strip extra attributes, and parse the final results.
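A minimal sketch of the search-and-replace idea, assuming the macro table has already been collected elsewhere. The helpers replace_macros and squeeze_spaces are hypothetical names invented here; the sketch only handles object-like macros and ignores comments, string literals and function-like macros such as __declspec(x).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: replaces every identifier found in `macros` with its
// mapped text. An empty replacement simply drops the identifier.
std::string replace_macros(const std::string& decl,
                           const std::map<std::string, std::string>& macros) {
    std::string out;
    std::size_t i = 0;
    while (i < decl.size()) {
        unsigned char c = static_cast<unsigned char>(decl[i]);
        if (std::isalpha(c) || c == '_') {
            // Collect one whole identifier.
            std::size_t start = i;
            while (i < decl.size() &&
                   (std::isalnum(static_cast<unsigned char>(decl[i])) || decl[i] == '_')) {
                ++i;
            }
            std::string word = decl.substr(start, i - start);
            auto it = macros.find(word);
            out += (it == macros.end()) ? word : it->second;
        } else {
            out += decl[i++];
        }
    }
    return out;
}

// Drops leading spaces and collapses runs of spaces left by removed macros.
std::string squeeze_spaces(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == ' ' && (out.empty() || out.back() == ' ')) continue;
        out += c;
    }
    return out;
}

// Example check on the declaration from the question.
void demo_strip() {
    const std::map<std::string, std::string> macros = {{"GLAPI", ""}, {"APIENTRY", ""}};
    const std::string cleaned = squeeze_spaces(
        replace_macros("GLAPI void APIENTRY glEvalCoord1d( GLdouble u );", macros));
    assert(cleaned == "void glEvalCoord1d( GLdouble u );");
}
```

The cleaned string can then go through the simple "type, then name" parsing scheme from the question.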

Related

Use of '#' in unexpected way

There's a macro defined as:
#define SET_ARRAY(field, type) \
foo.field = bar[#field].data<type>();
foo is a structure with members that are of type int or float *. bar is of type cnpy::npz_t (data loaded from .npz file). I understand that the macro is setting the structure member pointer so that it is pointing to the corresponding data in bar from the .npy file contained in the .npz file, but I'm wondering about the usage bar[#field].
When I ran the code through the preprocessor, I get:
foo.struct_member_name = bar["struct_member_name"].data<float>();
but I've never seen that type of usage either. It looks like the struct member variable name is somehow getting converted to an array index or memory offset that resolves to the data within the cnpy::npz_t structure. Can anyone explain how that is happening?
# is actually a preprocessor marker. That means preprocessor commands (not functions), formally called "preprocessor directives", are being executed at compile time. Apart from commands, you'll also find something akin to constants (meaning they have predefined values, either static or dynamic - yes I used the term constants loosely, but I am oversimplifying this right now), but they aren't constants "in that way", they just seem like that to us.
A number of preprocessor commands that you will find are:
#define, #include, #undef, #if (yes, different from the normal "if" in code), #elif, #endif, #error - all those must be prefixed by a "#".
Some values might be the __FILE__, __LINE__, __cplusplus and more. These are not prefixed by #, but can be used in preprocessor macros. The values are dynamically set by the compiler, depending on context.
For more information on macros, you can check the MS Learn page for MSVS or the GNU page for GCC. For other preprocessor values, you can also see this SourceForge page.
And of course, you can define your own macro or pseudo-constants using the #define directive.
#define test_integer 7
Every use of test_integer in your code (or macros) will be replaced by 7 during preprocessing, before the compiler proper ever sees it. Note that macros are case-sensitive, just like everything else in C and C++.
Now, let's talk about special cases of "#":
string-izing a parameter (also called "to stringify")
What that means is you can pass a parameter and it is turned into a string, which is what happened in your case. An example:
#define NAME_TO_STRING(x) #x
std::cout << NAME_TO_STRING(Hello) << std::endl;
This will turn Hello which is NOT a string, but an identifier, to a string.
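To connect this back to the question, here is a sketch of the same pattern with a plain std::map standing in for cnpy::npz_t. Foo, SET_FIELD and load are hypothetical names for this illustration and are not part of cnpy.

```cpp
#include <cassert>
#include <map>
#include <string>

struct Foo {
    int width = 0;
    int height = 0;
};

// #field turns the member name into the string literal used as the map key,
// while the unquoted field is still used as the member name.
#define SET_FIELD(target, source, field) \
    (target).field = (source).at(#field)

Foo load(const std::map<std::string, int>& bar) {
    Foo foo;
    SET_FIELD(foo, bar, width);   // expands to (foo).width = (bar).at("width")
    SET_FIELD(foo, bar, height);  // expands to (foo).height = (bar).at("height")
    return foo;
}

void check_load() {
    const std::map<std::string, int> bar = {{"width", 3}, {"height", 4}};
    const Foo foo = load(bar);
    assert(foo.width == 3 && foo.height == 4);
}
```

So bar["struct_member_name"] is an ordinary lookup by string key; nothing is converted to an index or memory offset.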
concatenating two parameters
#define CONCAT(x1, x2) x1##x2
#define STRINGIFY(x) #x
#define TO_STRING(x) STRINGIFY(x)
#define CONCATENATE(x1, x2) TO_STRING(CONCAT(x1, x2))
(yes, stringifying the pasted result doesn't work directly, because # applies to its argument exactly as written; you need a level of indirection, which means passing it again to a different macro so the argument is expanded first).
std::cout << CONCATENATE(Hello,World) << std::endl;
This will turn Hello and World which are identifiers, to a concatenated string: HelloWorld.
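The indirection remark can be seen in isolation with a short sketch. STR, XSTR and PASTE are hypothetical macro names chosen for this example; test_integer is the macro defined earlier in this answer.

```cpp
#include <cassert>
#include <string>

#define test_integer 7

#define STR(x) #x          // stringifies the argument exactly as written
#define XSTR(x) STR(x)     // expands the argument first, then stringifies

#define PASTE(a, b) a##b   // glues two tokens into a single identifier

int PASTE(answer_, value) = 42;  // declares a variable named answer_value

std::string direct()   { return STR(test_integer); }   // "test_integer"
std::string indirect() { return XSTR(test_integer); }  // "7"

void check_operators() {
    assert(direct() == "test_integer");
    assert(indirect() == "7");
    assert(answer_value == 42);
}
```

Note that ## produces a new token (here an identifier), not a string; getting a string out of it takes a separate stringification step.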
Now, regarding usage of # and ##, that's a more advanced topic. There are many use cases from macro-magic (which might seem cool when you see it implemented - for examples, check the Unreal Engine as it's extensively used there, but be warned, such programming methods are not encouraged), helpers, some constant definitions (think #define TERRA_GRAV 9.807) and even help in some compile-time checks, for example using constexpr from the newest standards.
If you're curious what the advantage of #define over a const float or const double is, one reason is that a macro is not part of the compiled code: it is plain text substitution, and its body gets no syntax check unless the macro is actually used.
In regards to helper macros, the most common are defining exports when building a library (search __declspec for MSVS and __attribute__ for GCC), the old style include guards (now often replaced by #pragma once) to stop a *.h, *.hxx or *.hpp from being included multiple times in projects, and debug handling (search for _DEBUG and assertions on Google). These are slightly more advanced topics, so I won't cover them here.
I tried to keep the explanation as simple as possible, so the terminology is not that formal. But if you really are curious, I am sure you can find more details online or you can post a comment on this answer :)

Extract all declared function names from header into boost.preprocessor

I have a C header file containing various declarations of functions, enums, structs, etc, and I hope to extract all declared function names into a boost.preprocessor data structure for iteration, using only the C preprocessor.
All function declarations have two fixed distinct macros around the return type, something like,
// my_header.h
FOO int * BAR f(long, double);
FOO void BAR g();
My goal is to somehow transform it into one of the above linked boost.preprocessor types, such as (f, g) or (f)(g). I believe it is possible by cleverly defining FOO and BAR, but have not succeeded after trying to play around with boost.preprocessor and P99.
I believe this task can only be done with the preprocessor as,
As a hard requirement, I need to stringify the function names as string literals later when iterating the list, so runtime string manipulation or existing C++ static reflection frameworks with template magic are out AFAIK.
While it can be done with the help of other tools, they are either fragile (awk or grep as ad-hoc parsers) or overly complex for the task (LLVM/GCC plugin for something robust). It is also a motivation to avoid external dependencies other than those strictly necessary i.e. a conforming C compiler.
I don't think this is going to work, due to limitations on where parentheses and commas need to occur.
What you can do, though, is the opposite. You could make a Boost.PP sequence that contains the signatures in some structured form and use it to generate the declarations as you showed them. In the end, you have the representation you want as well as the compiler's view of the declarations.
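A sketch of that "opposite" direction, using a plain X-macro list instead of a Boost.PP sequence so that it has no dependencies. MY_FUNCTIONS, DECLARE, NAME and function_names are hypothetical names; the two signatures are the ones from the question.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical central list: one X(return_type, name, parameter_list) per function.
#define MY_FUNCTIONS(X)            \
    X(int*, f, (long, double))     \
    X(void, g, ())

// Pass 1: generate the declarations the compiler sees.
#define DECLARE(ret, name, params) ret name params;
MY_FUNCTIONS(DECLARE)
#undef DECLARE

// Pass 2: generate a table of the names as string literals.
#define NAME(ret, name, params) #name,
static const char* const function_names[] = { MY_FUNCTIONS(NAME) };
#undef NAME

static const int function_count =
    sizeof(function_names) / sizeof(function_names[0]);

inline void check_names() {
    assert(function_count == 2);
    assert(std::strcmp(function_names[0], "f") == 0);
    assert(std::strcmp(function_names[1], "g") == 0);
}
```

This only works when you control the place where the functions are listed, which is exactly the constraint the question cannot satisfy for a third-party header.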
After a closer look at the internals of preprocessor tricks, I believe this is theoretically impossible. This answer is a more detailed expansion on top of @sehe's nice answer.
The fundamental working principle of arbitrary preprocessor lists like those in boost.preprocessor is indirect recursion. As such, it requires a way to consume one argument and pass the remaining on. The only two ways for CPP are commas (which can separate arguments) and enclosing parentheses (which can invoke macros).
In the case of f(long, double), f is neither followed by a comma nor surrounded by a pair of parentheses, so there is no way for the preprocessor to separate it from the following list without knowing the name in advance.
It could have changed the game if there were a BAZ after f, but sadly there is not and I have no control over the said library header :(
There are other issues, albeit not as fatal, such as the UB of having preprocessor directives within macro definition or arguments.
Perhaps someday it would become possible to leverage the reflection TS to get all declared function names within a namespace as a consteval compile-time list and then iterate it with something along the lines of constexpr for, all in a semantic and type-safe manner... who knows

Is there a way to have the same #define statement in different files that are included into the same file

So, I have a file structure like this:
FileA
FileB
FileC
FileA includes FileB and FileC
FileB has:
#define image(i, j, w) (image[ ((i)*(w)) + (j) ])
and FileC has:
#define image(i, j, h) (image[ ((j)*(h)) + (i) ])
On compilation I get:
warning: "image" redefined
note: this is the location of the previous definition ...
Does this warning mean it changes the definition of the other file where it found it initially when compiling ?
Is there any way to avoid this warning while maintaining these two defines, and them applying their different definitions on their respective files?
Thank you in advance :)
Does this warning mean it changes the definition of the other file where it found it initially when compiling ?
The program is ill-formed. The language doesn't specify what happens in this case. If the compiler accepts an ill-formed program, then you must read the documentation of the compiler to find out what they do in such case.
Note that the program might not even compile with other compilers.
Is there any way to avoid this warning while maintaining these two defines, and them applying their different definitions on their respective files?
Technically, you could use a hack like this without touching either header:
#include "FileB"
#undef image
#include "FileC"
But a good solution - if you can modify the headers - is to not use macros. Edit the headers to get rid of them. Use functions instead, and declare them in distinct namespaces so that their names don't conflict.
Some rules of thumb:
Don't use unnecessary macros. Functions and variables are superior to macros.
Follow the common convention of using only upper case for macro names, if you absolutely need to use macros. It is important to make sure that macro names don't mix with non-macros because macros don't respect namespaces nor scopes.
If you need a macro within a single header, then undefine it immediately when it's no longer needed instead of leaking it into other headers.
Don't use names without namespaces. That will lead to name conflicts. Macros don't respect C++ namespaces, but you can instead prefix their names. For example, you could have FILE_B_IMAGE and FILE_C_IMAGE (or something more descriptive based on the concrete context).
They are not functionally equivalent, one can be seen as a row-wise iteration and the other a column-wise
This seems like a good argument for renaming the functions (or the macros, if you for some reason cannot replace them). Call one row_wise and the other column_wise or something along those lines. Use descriptive names!
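A minimal sketch of what that replacement could look like. The namespaces file_b and file_c and the choice of std::vector<int> as the image type are assumptions made for this example; the index arithmetic is taken from the two macros in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace file_b {
// Same arithmetic as FileB's macro: element (i, j) with row length w.
inline int& row_wise(std::vector<int>& image,
                     std::size_t i, std::size_t j, std::size_t w) {
    return image[i * w + j];
}
}  // namespace file_b

namespace file_c {
// Same arithmetic as FileC's macro: element (i, j) with column length h.
inline int& column_wise(std::vector<int>& image,
                        std::size_t i, std::size_t j, std::size_t h) {
    return image[j * h + i];
}
}  // namespace file_c

inline void check_accessors() {
    std::vector<int> image = {0, 1, 2, 3, 4, 5};
    assert(file_b::row_wise(image, 1, 2, 3) == 5);     // index 1*3 + 2
    assert(file_c::column_wise(image, 0, 1, 2) == 2);  // index 1*2 + 0
}
```

Unlike the macros, these take the image as an explicit argument, obey scopes, and can coexist in one translation unit without warnings.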
Does this warning mean it changes the definition of the other file where it found it initially when compiling ?
For GCC (tagged) it means that the definition processed second is used from the point of the redefinition onward, including not only in the same file but at any places later in the translation unit where the macro identifier appears followed by a (. Previous appearances will have used the previous definition.
Neither the C language specification nor the C++ language specification provides a more general answer: the redefinition other than with an identical token sequence violates language constraints, therefore both the translation behavior and the execution behavior of a program containing such a non-matching redefinition are undefined.
Is there any way to avoid this warning while maintaining these two
defines, and them applying their different definitions on their
respective files?
If these definitions are meant to be used only within their respective files, then the easiest solution would be for each file to #undef image at the end. This would work in both C and C++.
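The effect of #undef can be sketched in a single file that stands in for the three files. The array contents and the functions from_file_b and from_file_c are made up for this illustration; the two macro bodies are the ones from the question.

```cpp
#include <cassert>

static int image[6] = {0, 1, 2, 3, 4, 5};

// --- what FileB would contain ---
#define image(i, j, w) (image[((i) * (w)) + (j)])
inline int from_file_b() { return image(1, 0, 3); }  // index 1*3 + 0 = 3
#undef image  // stop the macro from leaking past this file

// --- what FileC would contain ---
#define image(i, j, h) (image[((j) * (h)) + (i)])
inline int from_file_c() { return image(1, 0, 2); }  // index 0*2 + 1 = 1
#undef image

inline void check_scoped_macros() {
    assert(from_file_b() == 3);
    assert(from_file_c() == 1);
}
```

Each function was expanded with the definition that was active at that point in the translation unit, and no redefinition warning is produced because the name is undefined before it is defined again.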
If both are intended to be exposed for use by other files then you have a name collision that you will have to resolve one way or another. You might, for instance, add a distinguishing prefix to the definition and all uses of each one. In C++ only, you also have the option of resolving the name collision by changing the macros to [inline] functions and putting them in different namespaces. That would probably make it easier to adapt each one's users to the new names than prefixing the names would do.

Preprocessor and whitespaces rules

I am interested in defining my own language inside a C++ block (let's say main, for example), and for that purpose I need to use the preprocessor and its directives. My problem relates to the rule below:
#define INSERT create() ...
This is called a function-like definition, and the preprocessor does not allow any whitespace in the name being defined.
So when I use a statement of my own language, I have to translate the statement below:
INSERT INTO variable_name VALUES(arg_list)
to a different two function calls lets say
insertINTO(variable_name) and valuePARSE(arg_list)
but since the preprocessor directive rules do not allow me to have whitespaces in my definition how I can reach the variable_name and then make the call to the first function call I want to achieve?
Any clues would be helpful.
PS: I tried using g++ -E file.cpp to see how preprocessor works and to adjust the syntax to be valid c++ rules.
The preprocessor included with most C++ compilers is probably way too weak for this kind of task. It was never designed for this kind of abuse. The boost preprocessor library could help you on the way, but I still think you're heading down a one-way street here.
If you really want to define your language this way, I suggest you either write your own preprocessor, or use one that is more powerful than the default one. Here is one chap who tried using Python as a C++ preprocessor.
1) #define INSERT create() is not a function-like macro, it's object-like; something like #define INSERT(a, b, c) create(a, b, c) would be function-like.
2) if you want to expand INSERT INTO variable_name VALUES(arg_list) into insertINTO(variable_name); valuePARSE(arg_list); you can do something like:
#define INSERT insertINTO(
#define INTO
#define VALUES(...) ); valuePARSE(__VA_ARGS__);
3) as you can see, macros get ugly pretty easily, and even the slightest error in your syntax will have you spend a lot of time tracking it down
4) since it's tagged C++ take a look at Boost.Proto or Boost.Preprocessor.
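For completeness, here is a sketch of the three macros from point 2 in use. The bodies of insertINTO and valuePARSE are placeholders invented here (they only record their arguments), and the example assumes integer values.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical back end: records what the "statement" asked for.
static std::vector<std::string> targets;
static std::vector<int> values;

void insertINTO(const std::string& name) { targets.push_back(name); }

template <typename... Args>
void valuePARSE(Args... args) {
    // Expand the parameter pack, pushing each argument in order.
    int unpack[] = {0, (values.push_back(args), 0)...};
    (void)unpack;
}

#define INSERT insertINTO(
#define INTO
#define VALUES(...) ); valuePARSE(__VA_ARGS__);

void run_statement() {
    std::string my_table = "my_table";
    INSERT INTO my_table VALUES(1, 2, 3)
    // expands to: insertINTO( my_table ); valuePARSE(1, 2, 3);
}
```

The missing semicolon after the statement is not a typo: the VALUES macro supplies it, which is one example of how fragile this style is.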

Pointless 'MIDL_INTERFACE' Macro in winapi?

After browsing some old code, I noticed that some classes are defined in this manner:
MIDL_INTERFACE("XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX")
Classname: public IUnknown {
/* classmembers ... */
};
However, the macro MIDL_INTERFACE is defined as:
#define MIDL_INTERFACE(x) struct
in C:/MinGW/include/rpcndr.h (somewhere around line 17). The macro itself is rather obviously entirely pointless, so what's the true purpose of this macro?
In the Windows SDK version that macro expands to
struct __declspec(uuid(x)) __declspec(novtable)
The first one allows use of the __uuidof keyword which is a nice way to get the guid of an interface from the typename. The second one suppresses the generation of the v-table, one that is never used for an interface. A space optimization.
This is because MinGW does not support COM (or rather, supports it extremely poorly). MIDL_INTERFACE is used when defining a COM component, and it is generated by the IDL compiler, which generates COM type libraries and class definitions for you.
On MSVC, this macro typically expands to more complicated initialization and annotations to expose the given C++ class to COM.
If I had to guess, it's for one of two use cases:
It's possible that there's an external tool that parses the files looking for declarations like these. The idea is that by having the macro evaluate to something harmless, the code itself compiles just fine, but the external tool can still look at the source code and extract information out of it.
Another option might be that the code uses something like the X Macro Trick to selectively redefine what this preprocessor directive means so that some other piece of the code can interpret the data in some other way. Depending on where the #define is this may or may not be possible, but it seems reasonable that this might be the use case. This is essentially a special-case of the first option.