How to programmatically alter source code in C++ file?

How to programmatically alter source code in C++ file? - c++

I need a way to take a C/C++ source code file, inspect and perform some modifications to it, and then write the modified variant back to disk. Possible use cases that I have for it are:
Mutation testing, such as intentionally corrupting calculation in order to check if tests can catch it.
Altering visibility scope or annotating functions and methods. In order to split a large file into several smaller files but still being able to link them together, I want to turn some static functions into external functions so that the linker can find them later.
Generation of mock implementations of existing functions methods. For all externally visible functions, create a function with identical prototype but with empty/dummy body so that other code can link against it.
Are there existing solutions for such workflow?
I am mostly interested in dealing with functions/methods. The rest of information contained in a file, such as includes, type definitions etc. are less important for me, but they must be preserved in the output so that the end result remains syntactically correct.
A straightforward approach of applying a bunch of regular expressions to extract/modify the text is possible. But it is obviously not reliable in a any way. I would like to avoid writing a full-blown C++ parser. Even having such a parser does not solve the follow-up task of saving the modified parse tree back to a file.

LibTooling and libclang are commonly used to develop such refactoring tools (clang-format, clang-tidy, etc.).

Related

C++ Source-to-Source Transformation with Clang

I am working on a project for which I need to "combine" code distributed over multiple C++ files into one file. Due to the nature of the project, I only need one entry function (the function that will be defined as the top function in the Xilinx High-Level-Synthesis software -> see context below). The signature of this function needs to be preserved in the transformation. Whether other functions from other files simply get copied into the file and get called as a subroutine or are inlined does not matter. I think due to variable and function scopes simply concatenating the files will not work.
Since I did not write the C++ code myself and it is quite comprehensive, I am looking for a way to do the transformation automatically. The possibilities I am aware of to do this are the following:
Compile the code to LLVM IR with inlining flags and use a C++/C backend to turn the LLVM code into the source language again. This will result in bad source code and require either an old release of Clang and LLVM or another backend like JuliaComputing. 1
The other option would be developing a tool that relies on using the AST and a library like LibTooling to restructure the code. This option would probably result in better code and put everything into one file without the unnecessary inlining. 2 However, this options seems too complicated to put the all the code into one file.
Hence my question: Are you aware of a better or simply alternative approach to solve this problem?
Context: The project aims to run some of the code on a Xilinx FPGA and the Vitis High-Level-Synthesis tool requires all code that is to be made into a single IP block to be contained in a single file. That is why I need to realise this transformation.

How to process c and c++ source code to calculate metrics for static code analysis?

Iam extending a software tool to calculate metrics for software projects.
The metrics are then used to do a static code analysis.
My task is to implement the calculation of metrics for c and c++ projects.
In the developing process i encountered problems which led to reset and starting over again with a different tool or programming language.
I will state the process, problems and things i tried to solve them in chronological order and as good as possible.
Some metrics:
Lines of Code for Classes, Structs, Unions, Functions/Methods and Sourcefiles
Method Count for Classes and Structs
Complexity for Classes, Structs and Functions/Methods
Dependencies for/between Classes and Structs
Since c++ is a difficult language to parse and writing a c++ parser on my own is out of scale i tend to use an existing c++ parser.
Therefore i began using libraries from the LLVM Project to gather syntactic and semantic information about a source file.
LLVM Tooling link: https://clang.llvm.org/docs/Tooling.html
First i started with LibTooling written in c++ since it promised me "full controll" over the Abstract Syntax Tree (AST).
I tried the RecursiveASTVistor and the Matchfinder approaches without success.
So LibTooling was dismissed because i couldnt retrieve context information about the surrounding of a node in the AST.
I was only able to react on a callback when a specific node in the AST was visited. But i didnt know in what context i currently was.
Eg. When I visit a C++RecordDeclaration (class, struct, union) i did not know if it is a nested record or not.
But that information is needed to calculate the lines of code for a single class.
Second approach was using the LibClang interface via Python Bindings.
With the LibClang interface i was able to traverse the AST node by node recursively and store needed context information on a stack.
Here i encountered a general problem with LibClang:
Before creating the AST for a file the preprocessor is started and resolves all preprocessor directives. Just as he is supposed to do.
This is good because if the preprocessor cant resolve all the include directives the output AST will be incomplete.
This is very bad because i wont be able to provide all the include files or directories for any c++ project.
This is bad because code which is surrounded by conditional preprocessor directives is not part of the AST if a preprocessor variable is defined or not. Parsing the same file multiple times with different setups of defined or undefined preprocessor variable is out of scope.
This lead to the third and current attempt with using a c++ parser generated by Antlr provided a c++14 grammar.
No preprocessor is executed before the parser.
This is good because the full source code is parsed and preprocessor directives are being ignored.
Bad thing is that the parser does not seem to be that tough. It fails on code which can be compiled leading to a broken AST. So this solution is not sufficient aswell.
My questions are:
Is there an option to deactivate the preprocessor before parsing a c/c++ source or header file with libClang?
So the source code is untouched and the AST is complete and detailed.
Is there a way to parse a c/c++ source code file without providing all the necessary include directories but still resulting in a detailed AST?
Since iam running out of options. What other approaches may be worth looking at when it comes to analysing/parsing c/c++ source code?
If you think this is not the right place to ask such questions feel free to redirect me to another place.

To answer your last question,
Since iam running out of options. What other approaches may be worth
looking at when it comes to analysing/parsing c/c++ source code?
Another approach is to parse the source code as if it were merely text. This avoids the need to preprocess the source, and to bring in a complex parser. See this paper for an example/introduction: "The Conceptual Cohesion of Classes" by Andrian Marcus, Denys Poshyvanyk. You can still collect such information as LOC and number of methods from this approach, without needing a full parser.
This approach has drawbacks (as does any approach):
It either 1) parses comments along with the source code, or 2) requires that you remove comments from the source. But the latter is an easy step. The reason that might be OK is that even the comments contain information regarding the code, which may help determine which modules are more closely coupled, etc.
It will lump local variables, method names, parameter names, etc. all into the "bag of words" that you are working with.

A tool to make the division of C++ code into .h and .cpp files transparent to the programmer

C++ was the the first language I've learnt so dividing source code into .h and .cpp files seemed obvious - but having learnt C# and Java they now appear to me terribly clumsy. They might have been useful back in the 60s, maybe still even in 80s, but in the era of modern tools, such as IDEs with section folding, and documentation generators, they seem obsolete.
Anyway, my question is: is there a tool which makes the presence of these two kind of files transparent to the programmer? For example by letting the coder write the definition of a method seemingly in the header file but actually saving it to the .cpp file?
(I know one can try to write a C++ program solely in header files but AFAIK this is not considered best practice, makes the program build even longer and makes it virtually impossible for two classes to reference each other.)

The discussion that I am seeing in the question, comments and the comments to the other answer seem to focus on the textual representation of the components. From the point of view of plain text, it makes sense to remove the headers altogether.
On the other hand, there is a second level to the separation of headers and cpp files, which is separating the interface from the implementation, and in doing so, removing implementation details from the interface.
This happens in different ways, in the simplest level how you implement a particular piece of functionality is irrelevant to the users of your component[*]. In many cases, you can have more types and functions in the .cpp file that are used just as details of implementation. Additionally, if you decide to implement a particular functionality directly or by depending on another library is a detail of implementation and should not leak to your users.
That separation might or might not be easy to implement in a tool that managed the separation of files automatically, and cannot be done by those that like the use of header only libraries.
Where some of you are claiming that there is no point in having to Go to definition, and would like to see the code altogether, what I see is the flexibility of deciding what parts of your component are details that should not be known by users. In many IDEs (and, heck, even in vim) there is a single keystroke combination that will take you from one to the other. In IDEs you can refactor the signature of the function and have the IDE apply the changes to both the header and the implementation (sometimes even to the uses)...
If you were to have a tool provide a unified view of both the header and implementation, it would probably be much harder to make the tool knowledgable of what parts of the code you are writing are part of the interface or the implementation, and the decisions that the tool might have could have an impact on the generated program.
The separate compilation model has its disadvantages but also advantages, and I feel that the discussion that is being held here is just scratching the surface of deeper design decisions.
[*] There seems to be quite a few people that believe that each class should have its own header and .cpp files, and I disagree, each header represents a component that might be a single class, or multiple classes and free functions the separation of code in files is part of the design, and in a single component you might have one or more public types together with possibly none or more internal types.

I don't know of any that makes the division into source/header completely transparent.
I do, however, know of some that make it considerably easier to handle. For example, Visual Assist X will let write your entire class in a header, then select member functions, and move them to a source file (i.e., a .cpp file).
That's not a complete cure (by any means), but it can/does make them much more bearable anyway.
Personally, I think it would be interesting to get rid of files completely, and instead use something like a database directly -- i.e., you have a database of functions, where the source code to that function is one column, object code another, internal compiler information about how to use it yet another, and so on. This would also make integrating version control pretty straightforward (basically just a stored transaction log).

I, for one, like the header files in C++, because they can serve as a "table of contents" of a sort, which gives you a quick overview of a class. This is, of course, assuming that you keep your headers short, and put the implementation of most functions into the .cpp.
Also, you can get plugins for Visual Studio that will move the implementation of a member function from the header into a .cpp for you. That, plus the "go to definition" and "go to declaration" commands in Visual Studio make headers very helpful for navigating a large code base.

Headers are pretty fundamental to the current compilation model of C++. I think there are things that would be a bit difficult to express without exposing the programmer directly to the header/cpp split. For example, in certain circumstances there's a significant difference between:
// header
class Foo {
Foo() = default;
}
and
// header
class Foo {
Foo();
};
// cpp
Foo::Foo() = default;
Despite that I think it would be worthwhile to think about improving the compilation model or making it easier to deal with.
For example, I can imagine a 'code bubbles' style IDE where the programmer writes definitions in code bubbles, and then groups bubbles together into translation units. The programmer would mark definitions as 'exported' to make them available to other TUs and for each TU he'd select exported items from other units to import. And all the while the IDE is maintaining it's own code bubble representation of the program, and when you build it generates the actual cpp files necessary for passing off to the compiler, creating forward declarations, etc. as needed. The cpp files would just be an intermediate build product.
There would be subtleties like the one I showed above though, and I'm not sure how the IDE I described would deal with that.

Why are forward declarations necessary? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Should C++ eliminate header files?
In languages like C# and Java there is no need to declare (for example) a class before using it. If I understand it correctly this is because the compiler does two passes on the code. In the first it just "collects the information available" and in the second one it checks that the code is correct.
In C and C++ the compiler does only one pass so everything needs to be available at that time.
So my question basically is why isn't it done this way in C and C++. Wouldn't it eliminate the needs for header files?

The short answer is that computing power and resources advanced exponentially between the time that C was defined and the time that Java came along 25 years later.
The longer answer...
The maximum size of a compilation unit -- the block of code that a compiler processes in a single chunk -- is going to be limited by the amount of memory that the compiling computer has. In order to process the symbols that you type into machine code, the compiler needs to hold all the symbols in a lookup table and reference them as it comes across them in your code.
When C was created in 1972, computing resources were much more scarce and at a high premium -- the memory required to store a complex program's entire symbolic table at once simply wasn't available in most systems. Fixed storage was also expensive, and extremely slow, so ideas like virtual memory or storing parts of the symbolic table on disk simply wouldn't have allowed compilation in a reasonable timeframe.
The best solution to the problem was to chunk the code into smaller pieces by having a human sort out which portions of the symbol table would be needed in which compilation units ahead of time. Imposing a fairly small task on the programmer of declaring what he would use saved the tremendous effort of having the computer search the entire program for anything the programmer could use.
It also saved the compiler from having to make two passes on every source file: the first one to index all the symbols inside, and the second to parse the references and look them up. When you're dealing with magnetic tape where seek times were measured in seconds and read throughput was measured in bytes per second (not kilobytes or megabytes), that was pretty meaningful.
C++, while created almost 17 years later, was defined as a superset of C, and therefore had to use the same mechanism.
By the time Java rolled around in 1995, average computers had enough memory that holding a symbolic table, even for a complex project, was no longer a substantial burden. And Java wasn't designed to be backwards-compatible with C, so it had no need to adopt a legacy mechanism. C# was similarly unencumbered.
As a result, their designers chose to shift the burden of compartmentalizing symbolic declaration back off the programmer and put it on the computer again, since its cost in proportion to the total effort of compilation was minimal.

Bottom line: there have been advances in compiler technology that make forward declarations unnecessary. Plus computers are thousands of times faster, and so can make the extra calculations necessary to handle the lack of forward declarations.
C and C++ are older and were standardized at a time when it was necessary to save every CPU cycle.

No, it would not obviate header files. It would eliminate the requirement to use a header to declare classes/functions in the same file. The major reason for headers is not to declare things in the same file though. The primary reason for headers is to declare things that are defined in other files.
For better or worse, the rules for the semantics of C (and C++) mandate the "single pass" style behavior. Just for example, consider code like this:
int i;
int f() {
i = 1;
int i = 2;
}
The i=1 assigns to the global, not the one defined inside of f(). This is because at the point of the assignment, the local definition of i hasn't been seen yet so it isn't taken into account. You could still follow these rules with a two-pass compiler, but doing so could be non-trivial. I haven't checked their specs to know with certainty, but my immediate guess would be that Java and C# differ from C and C++ in this respect.
Edit: Since a comment said my guess was incorrect, I did a bit of checking. According to the Java Language Reference, §14.4.2, Java seems to follow pretty close to the same rules as C++ (a little different, but not a whole lot.
At least as I read the C# language specification, (warning: Word file) however, it is different. It (§3.7.1) says: "The scope of a local variable declared in a local-variable-declaration (§8.5.1) is the block in which the declaration occurs."
This appears to say that in C#, the local variable should be visible throughout the entire block in which it is declared, so with code similar to the example I gave, the assignment would be to the local variable, not the global.
So, my guess was half right: Java follows (pretty much0 the same rule as C++ in this respect, but C# does not.

This is because of smaller compilation modules in C/C++. In C/C++, each .c/.cpp file is compiled separately, creating an .obj module. Thus the compiler needs the information about types and variables, declared in other compilation modules. This information is supplied in form of forward declarations, usually in header files.
C#, on the other side, compiles several .cs files into one big compilation module at once.
In fact, when referencing different compiled modules from a C# program, the compiler needs to know the declarations (type names etc.) the same way as C++ compiler does. This information is obtained from the compiled module directly. In C++ the same information is explicitly separated (that's why you cannot find out the variable names from C++-compiled DLL, but can determine it from .NET assembly).

The forward declarations in C++ are a way to provide metadata about the other pieces of code that might be used by the currently compiled source to the compiler, so it can generate the correct code.
That metadata can come from the author of the linked library/component. However, it can also be automatically generated (for example there are tools that generate C++ header files for COM objects). In any case, the C++ way of expressing that metadata is through the header files you need to include in your source code.
The C#/.Net also consume similar metadata at compile time. However, that metadata is automatically generated when the assembly it applies to is built and is usually embedded into it. Thus, when you reference in your C# project an assembly, you are essentially telling the compiler "look for the metadata you need in this assembly as well, please".
In other words, the metadata generation and consumption in C# is more transparent to the developers, allowing them to focus on what really matters - writing their own code.
There are also other benefits to having the metadata about the code bundled with the assembly as well. Reflection, code emitting, on-the-fly serialization - they all depend on the metadata to be able to generate the proper code at run-time.
The C++ analogue to this would be RTTI, although it's not widely-adopted due ot incompatible implementations.

From Eric Lippert, blogger of all things internal to C#: http://blogs.msdn.com/ericlippert/archive/2010/02/04/how-many-passes.aspx:
The C# language does not require that
declarations occur before usages,
which has two impacts, again, on the
user and on the compiler writer. [...]
The impact on the compiler writer is
that we have to have a “two pass”
compiler. In the first pass, we look
for declarations and ignore bodies.
Once we have gleaned all the
information from the declarations that
we would have got from the headers in
C++, we take a second pass over the
code and generate the IL for the
bodies.
To sum up, using something does not require declaring it in C#, whereas it does in C++. That means that in C++, you need to explicitly declare things, and it's more convenient and safe to do that with header files so you don't violate the One Definition Rule.

How to generate empty definitions given a header file

I have a 3rd-party library which for various reasons I don't wish to link against yet. I don't want to butcher my code though to remove all reference to its API, so I'd like to generate a dummy implementation of it.
Is there any tool I can use which spits out empty definitions of classes given their header files? It's fine to return nulls, false and 0 by default. I don't want to do anything on-the-fly or anything clever - the mock object libraries I've looked at appear quite heavy-weight? Ideally I want something to use like
$ generate-definition my_header.h > dummy_implemtation.cpp
I'm using Linux, GCC4.1

This is a harder problem than you might like, as parsing C++ can quickly become a difficult task. Your best bet would be to pick an existing parser with a nice interface.
A quick search found this thread which has many recommendations for parsers to do something similar.
At the very worst you might be able to use SWIG --> Python, and then use reflection on that to print a dummy implementation.
Sorry this is only a half-answer, but I don't think there is an existing tool to do this (other than a mocking framework, which is probably the same amount of work as using a parser).

Create one test application which reads the header file and creates the source file. Test application should parse the header file to know the function names.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js