Extract all declared function names from header into boost.preprocessor

Extract all declared function names from header into boost.preprocessor - c++

I have a C header file containing various declarations of functions, enums, structs, etc, and I hope to extract all declared function names into a boost.preprocessor data structure for iteration, using only the C preprocessor.
All function declarations have two fixed distinct macros around the return type, something like,
// my_header.h
FOO int * BAR f(long, double);
FOO void BAR g();
My goal is to somehow transform it into one of the above linked boost.preprocessor types, such as (f, g) or (f)(g). I believe it is possible by cleverly defining FOO and BAR, but have not succeeded after trying to play around with boost.preprocessor and P99.
I believe this task can only be done with the preprocessor as,
As a hard requirement, I need to stringify the function names as string literals later when iterating the list, so runtime string manipulation or existing C++ static reflection frameworks with template magic are out AFAIK.
While it can be done with the help of other tools, they are either fragile (awk or grep as ad-hoc parsers) or overly complex for the task (LLVM/GCC plugin for something robust). It is also a motivation to avoid external dependencies other than those strictly necessary i.e. a conforming C compiler.

I don't think this is going to work, due to limitations on where parentheses and commas need to occur.
What you can do, though, is the opposite. You could make a Boost.PP sequence that contains the signatures in some structured form and use it to generate the declarations as you showed them. In the end, you have the representation you want as well as the compiler's view of the declarations.

After some closer look at the internals of preprocessor tricks, I believe this is theoretically impossible. This answer is kind of a more detailed expansion on top of #sehe's nice answer.
The fundamental working principle of arbitrary preprocessor lists like those in boost.preprocessor is indirect recursion. As such, it requires a way to consume one argument and pass the remaining on. The only two ways for CPP are commas (which can separate arguments) and enclosing parentheses (which can invoke macros).
In the case of f(int, long), f is neither followed by a comma nor surrounded by a pair of parenthese, so there is no way for it to be separated from the following list by the preprocessor without knowing the name in advance.
It could have changed the game if there were a BAZ after f, but sadly there is not and I have no control over the said library header :(
There are other issues, albeit not as fatal, such as the UB of having preprocessor directives within macro definition or arguments.
Perhaps someday it would become possible to leverage the reflection TS to get all declared function names within a namespace as a consteval compile-time list and then iterate it with something along the lines of constexpr for, all in a semantic and type-safe manner... who knows

Related

Is it recommended to give variable names in a function declaration?

As I understand it, in a function declaration there is no need to give variable names, but it is still recommended.
For example: the setcur() function accepts two parameters : row number and
and column number.Hence can be declared as follows:
void setcur(int, int);
void setcur(int row, int col);
Why is it recommended to give variable names in a function declaration?

It really is more for readability sake. You only really need the variable type in the function declaration; however, it is good for someone reading your code to understand what these inputs actually are, assuming your name them something appropriate.
It will make your life a lot easier if you are working on a large file and you don't remember what your function in your .h file takes in as input.

Adding a decent variable name improves readability and helps documenting your function. Consider e.g.
void area(int, int); // which parameter comes first? area of what?
vs
void area(int radius, int height); // now it's clear that it is a cylinder, order is also clear
You may also want to use e.g. doxygen to generate the documentation automatically, in which case you usually document the header files. In that case, it makes sense to name the function parameters, so they correspond to the names of the doxygen's \params.

There's been significant discussion here if you're really curious:
Why do function prototypes include parameter names when they're not required?
The short answer is readability, but to use your example, basically
void setcur(int,int);
can be implemented as
(1) void setcur(int row,int col);
or
(2) void setcur(int col,int row);
(any other implementations are possible, but allow 2 as an example).
With the variable names in the header, you can easily get a sense of what to pass for row and col. Otherwise you need to dig into the code file which may exist separate from the header. Imagine if all you had was
void setcur(int, int)
and assumed (1), but it really was (2). You would have a hard to debug failure in your code.

While the names are not technically required, they serve as documentation to help the user. Also, they help your IDE to show better signatures, completions, etc.
Adding documentation in some tool-aware form might look like an alternative, but your code might also be used with different IDEs and, unlike comments, the variable names in the declarations are the only standardized way of providing more semantic information in the declaration of a function.
Also note that in case of libraries, the user of the library might not even see the definition of the function, all he has is the header file with the declaration and a binary library to link against.

Parsing irregular c++ prototypes

I am trying to build a program that parses and lists the content of header files. So far, so good, I found it easy parsing and listing headers I wrote, but when I started parsing cross platform API headers things got messy.
My current approach is rather simplistic, here is a pseudocode example of parsing the following function:
void foo(int a);
void is a type, so we are dealing with instancing a type
foo is the name of that type
foo is followed by brackets, meaning it is a function of type void named foo
int is a type...
a is the name of that type instance
foo is a function of type void that takes one parameter of type int named a
However, when I got into bigger and more complex headers I stumbled upon somewhat irregular prototypes, involving macros and god knows what. An example:
GLAPI void APIENTRY glEvalCoord1d( GLdouble u );
GLAPI and APIENTRY are platform dependent macros. Which kind of spoils my simple parsing scheme, since it expects the name of an object to follow its type. Those two macros happen to translate to either __stdcall, __declspec(dllimport) or extern but in theory they could mean anything, with their meaning being unclear until compile time.
How to write my parser so it can deal with such scenarios and not get confused? The macros themselves are defined at an earlier stage, so the parser can be aware GLAPI and APIENTRY are macros so they can simply be ignored, is this the way to go? Naturally this is just one of the many variations of irregularities the parser may stumble upon parsing through different headers, so any general techniques of how to deal with the parsing of any "legal" header content are welcome.

There isn't any real alternative to expanding the macros before you parse, at least if you want process header files with the same complexity as Microsoft's, or any other header files associated with a compiler system that has been around for 10 years or more.
The unpreprocessed source code is NOT C; it is simply unpreprocessed source code. The macros (and prepreprocessor conditionals which you surprising didn't mention) can edit the apparant source in not arbitrary but spectacularly complex fashion. And you can't often know what the macros used, or conditionals expanded, unless you process the #includes as well.
You can get GCC to do preprocessor expansion for you, and then parse it. That would be far
the easiest way to approach this.
That still leaves the problem of parsing real C code, with all the complexities of declarators, and ambiguities in fragments suchas T X; where the meaning of the statement depends on the declaration of T. To parse the headers accurately, you need a full C parser.
Our C Front End can do full preprocessing, or you can invoke it a mode in which some macros are expanded, and some are not. By tuning this set, you often parse such headers without exapanding every macro. Preprocessor conditionals are much more difficult, because they can occur at inconvenient (unstructured) places.

If all you want is the name and signature of functions, then a simple search and replace for macros should be sufficient.
However, you need to check if a macro contains keywords (like the return value). This may be possible by stripping macro definitions of every but keywords as they are defined, but tracking them and using a simple preprocessor will be necessary.
The platform dependent keywords, such as __declspec and __attribute__ have very limited syntax and there are only a few of them, so specifically removing those is possible.
You may want to take a look at how doxygen handles this, because it does almost exactly what you want and does handle macros. It allows a list of macros to be expanded as defined, and ones that should be expanded to a custom value. You could adapt that to expand __declspec(x) to nothing, and expand all others to their defined value by default.
This certainly isn't foolproof, but a search and replace is about the simplest functional solution you'll get. You need to follow the standard C++ preprocessor rules, which aren't terribly complex, with additional macros (const, declspec, etc) to strip extra attributes, and parse the final results.

Preprocessor and whitespaces rules

I am interested in defining my own language inside a C++ block (lets say for example main) and for that purpose I need to use the preprocessor and its directives my problem relies to the below rule:
#define INSERT create() ...
Is called a function-like definition and preprocessor does not allow any whitespaces in what we define ,
So when I use a function of my own language I got to parse right handy the below statement:
INSERT INTO variable_name VALUES(arg_list)
to a different two function calls lets say
insertINTO(variable_name) and valuePARSE(arg_list)
but since the preprocessor directive rules do not allow me to have whitespaces in my definition how I can reach the variable_name and then make the call to the first function call I want to achieve?
Any clues would be helpful.
PS: I tried using g++ -E file.cpp to see how preprocessor works and to adjust the syntax to be valid c++ rules.

The preprocessor included with most C++ compilers is probably way too weak for this kind of task. It was never designed for this kind of abuse. The boost preprocessor library could help you on the way, but I still think you're heading down a one-way street here.
If you really want to define your language this way, I suggest you either write your own preprocessor, or use one that is more powerful than the default one. Here is one chap who tried using Python as a C++ preprocessor.

1) define INSERT create() is not a function-like macro it's object-like, something like define INSERT(a, b, c) create(a, b, c) would be;
2) if you want to expand INSERT INTO variable_name VALUES(arg_list) into insertINTO(variable_name); valuePARSE(arg_list); you can do something like:
#define INSERT insertINTO(
#define INTO
#define VALUES(...) ); valueParse(__VA_ARGS__);
3) as you can see macros get ugly pretty easy and even the slightest error in your syntax will have you spend a lot of time tracking it down
4) since it's tagged C++ take a look at Boost.Proto or Boost.Preprocessor.

C++ Function Conventions?

Just had a 'Fundamentals of Programming' lecture at uni and was told that the convention for using/declaring functions is to have the main() function at the top of the program, with functions/procedures below it and to use forward declarations to prevent compiler errors.
However, I've always done it the other way - functions at top main() at bottom and not using forward declarations and don't think I've ever seen it otherwise.
Which is right? Or is it more a case of personal preference? Some clarification would be hugely appreciated.

It's up to you. Personally, I keep main on bottom because the times when I have other functions in there, it's either just one or two other little functions or a code snippet.
In real code, you'd have hopefully split your project up (having multiple "unrelated" functions in a file is bad) and so main would likely be nearly alone in the file. You'd just #include the things main needs to get things going and use them.

There could be the case when your functions are related to each other. If you simply write them above the main() without forward-declaration you have to order them so they'll know the functions they depend on. In some cases (circular references) it won't even be possible to compile without forward declaration.
When forward declaring the functions you won't run into this issue.
Additionally when having main() as first function, you'll produce more readable code, but thats perhaps just personal preference.
It could also be more readable cause another coder has on overview about the functions he'll find in the file.

If there is a standard, it's to have the function declarations in a .h file which is included. That way, it doesn't matter what the order of functions in the file is. I almost never write functions without a declaration, and it takes me aback when other people do.

The standard most professional C++ developers use is this.
two files per class: the header file (named e.g., Class.h), which only declares all of the class' data members and functions; and the implementation file (named Class.cpp in the same example), containing only implementations of those functions.
the main function (if your solution has it) goes into a file named, e.g. Main.cpp all by itself.
It is OK to put multiple classes or even the entire program into one file when you work on small-scale home or school projects. Because of the small scale, it doesn't really matter how you order the classes or the functions. You can do what is more convenient to you. But if your professor requires students to follow certain code standards for homework (e.g. put main() first), then you have to follow those standards.

This approach is most likely favoured by your professor and most likely the reason she's teaching it as convention. There are two options here, go with the flow (i.e. don't try to disrupt and potentially get marked down for stuff because your professor feels that you're not following "convention") or explain to her - there is no requirement to do it this way (or any other way), as long as the intention of the code is clear!

The advice to have declarations (function prototypes) is more related to C then to C++, because in C++ there are no implicit function declarations, so you must always declare before use. Therefore you will definitely have at least one function declaration which is not definition if you have a recursion involving more then one function.
For a small project use whatever style you want but be consistent.
For a large project you'll probably need several .cpp files and have the interfaces (be they classes or functions) defined in the header files. Also be consistent at least within a single file.
And the last thing, have I said to be consistent?

I normally use the second form because you have to maintain function declarations - they must have the right signature and name, and imo, they're just excessive typing when you could just define the functions first. No maintenance, no wasted time.

I think the principle that applies here is "Don't Repeat Yourself". If you use a forward declaration, you're repeating yourself unnecessarily.

What are the advantages and disadvantages of separating declaration and definition as in C++?

In C++, declaration and definition of functions, variables and constants can be separated like so:
function someFunc();
function someFunc()
{
//Implementation.
}
In fact, in the definition of classes, this is often the case. A class is usually declared with it's members in a .h file, and these are then defined in a corresponding .C file.
What are the advantages & disadvantages of this approach?

Historically this was to help the compiler. You had to give it the list of names before it used them - whether this was the actual usage, or a forward declaration (C's default funcion prototype aside).
Modern compilers for modern languages show that this is no longer a necessity, so C & C++'s (as well as Objective-C, and probably others) syntax here is histotical baggage. In fact one this is one of the big problems with C++ that even the addition of a proper module system will not solve.
Disadvantages are: lots of heavily nested include files (I've traced include trees before, they are surprisingly huge) and redundancy between declaration and definition - all leading to longer coding times and longer compile times (ever compared the compile times between comparable C++ and C# projects? This is one of the reasons for the difference). Header files must be provided for users of any components you provide. Chances of ODR violations. Reliance on the pre-processor (many modern languages do not need a pre-processor step), which makes your code more fragile and harder for tools to parse.
Advantages: no much. You could argue that you get a list of function names grouped together in one place for documentation purposes - but most IDEs have some sort of code folding ability these days, and projects of any size should be using doc generators (such as doxygen) anyway. With a cleaner, pre-processor-less, module based syntax it is easier for tools to follow your code and provide this and more, so I think this "advantage" is just about moot.

It's an artefact of how C/C++ compilers work.
As a source file gets compiled, the preprocessor substitutes each #include-statement with the contents of the included file. Only afterwards does the compiler try to interpret the result of this concatenation.
The compiler then goes over that result from beginning to end, trying to validate each statement. If a line of code invokes a function that hasn't been defined previously, it'll give up.
There's a problem with that, though, when it comes to mutually recursive function calls:
void foo()
{
bar();
}
void bar()
{
foo();
}
Here, foo won't compile as bar is unknown. If you switch the two functions around, bar won't compile as foo is unknown.
If you separate declaration and definition, though, you can order the functions as you wish:
void foo();
void bar();
void foo()
{
bar();
}
void bar()
{
foo();
}
Here, when the compiler processes foo it already knows the signature of a function called bar, and is happy.
Of course compilers could work in a different way, but that's how they work in C, C++ and to some degree Objective-C.
Disadvantages:
None directly. If you're using C/C++ anyway, it's the best way to do things. If you've got a choice of language/compiler, then maybe you can pick one where this is not an issue. The only thing to consider with splitting declarations into header files is to avoid mutually recursive #include-statements - but that's what include guards are for.
Advantages:
Compilation speed: As all included files are concatenated and then parsed, reducing the amount and complexity of code in included files will improve compilation time.
Avoid code duplication/inlining: If you fully define a function in a header file, each object file that includes this header and references this function will contain it's own version of that function. As a side-note, if you want inlining, you need to put the full definition into the header file (on most compilers).
Encapsulation/clarity: A well defined class/set of functions plus some documentation should be enough for other developers to use your code. There is (ideally) no need for them to understand how the code works - so why require them to sift through it? (The counter-argument that it's may be useful for them to access the implementation when required still stands, of course).
And of course, if you're not interested in exposing a function at all, you can usually still choose to define it fully in the implementation file rather than the header.

The standard requires that when using a function, a declaration must be in scope. This means, that the compiler should be able to verify against a prototype (the declaration in a header file) what you are passing to it. Except of course, for functions that are variadic - such functions do not validate arguments.
Think of C, when this was not required. At that time, compilers treated no return type specification to be defaulted to int. Now, assume you had a function foo() which returned a pointer to void. However, since you did not have a declaration, the compiler will think that it has to return an integer. On some Motorola systems for example, integeres and pointers would be be returned in different registers. Now, the compiler will no longer use the correct register and instead return your pointer cast to an integer in the other register. The moment you try to work with this pointer -- all hell breaks loose.
Declaring functions within the header is fine. But remember if you declare and define in the header make sure they are inline. One way to achieve this is to put the definition inside the class definition. Otherwise prepend the inline keyword. You will run into ODR violation otherwise when the header is included in multiple implementation files.

There are two main advantages to separating declaration and definition into C++ header and source files. The first is that you avoid problems with the One Definition Rule when your class/functions/whatever are #included in more than one place. Secondly, by doing things this way, you separate interface and implementation. Users of your class or library need only to see your header file in order to write code that uses it. You can also take this one step farther with the Pimpl Idiom and make it so that user code doesn't have to recompile every time the library implementation changes.
You've already mentioned the disadvantage of code repetition between the .h and .cpp files. Maybe I've written C++ code for too long, but I don't think it's that bad. You have to change all user code every time you change a function signature anyway, so what's one more file? It's only annoying when you're first writing a class and you have to copy-and-paste from the header to the new source file.
The other disadvantage in practice is that in order to write (and debug!) good code that uses a third-party library, you usually have to see inside it. That means access to the source code even if you can't change it. If all you have is a header file and a compiled object file, it can be very difficult to decide if the bug is your fault or theirs. Also, looking at the source gives you insight into how to properly use and extend a library that the documentation might not cover. Not everyone ships an MSDN with their library. And great software engineers have a nasty habit of doing things with your code that you never dreamed possible. ;-)

Advantage
Classes can be referenced from other files by just including the declaration. Definitions can then be linked later on in the compilation process.

You basically have 2 views on the class/function/whatever:
The declaration, where you declare the name, the parameters and the members (in the case of a struct/class), and the definition where you define what the functions does.
Amongst the disadvantages are repetition, yet one big advantage is that you can declare your function as int foo(float f) and leave the details in the implementation(=definition), so anyone who wants to use your function foo just includes your header file and links to your library/objectfile, so library users as well as compilers just have to care for the defined interface, which helps understanding the interfaces and speeds up compile times.

One advantage that I haven't seen yet: API
Any library or 3rd party code that is NOT open source (i.e. proprietary) will not have their implementation along with the distribution. Most companies are just plain not comfortable with giving away source code. The easy solution, just distribute the class declarations and function signatures that allow use of the DLL.
Disclaimer: I'm not saying whether it's right, wrong, or justified, I'm just saying I've seen it a lot.

One big advantage of forward declarations is that when used carefully you can cut down the compile time dependencies between modules.
If ClassA.h needs to refer to a data element in ClassB.h, you can often use just a forward references in ClassA.h and include ClassB.h in ClassA.cc rather than in ClassA.h, thus cutting down a compile time dependency.
For big systems this can be a huge time saver on a build.

Disadvantage
This leads to a lot of repetition. Most of the function signature needs to be put in two or more (as Paulious noted) places.

Separation gives clean, uncluttered view of program elements.
Possibility to create and link to binary modules/libraries without disclosing sources.
Link binaries without recompiling sources.

When done correctly, this separation reduces compile times when only the implementation has changed.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js