How to extract headers from source file using clang? - c++

I am using the Clang AST matchers to extract some information from a source file. Now I would also like to know the list of headers, including transitively included headers, that the source file uses. For example, the source file abc.c has the following includes:
#include <def.h>
//#include <def_private.h>
While running the matcher, I need to make sure Clang knows about def.h, which is in the same directory. def.h includes the following headers:
#include <iostream.h>
#include <string.h>
#include <float.h>
#include <math.h>
/*#include <boost>
* #inclde <fstream>*/
I run the AST matchers to extract or identify information from abc.c. Now I would like to extract all the headers or includes. The result should include all of them:
#include <def.h>
#include <iostream.h>
#include <string.h>
#include <float.h>
#include <math.h>
I did some research online, but unfortunately everything I found either involves regular expressions (Regular expression to extract header name from c file) or explains how to do it in Visual Studio (Displaying the #include hierarchy for a C++ file in Visual Studio).
I wonder if this is possible using Clang. Also, please let me know if there is any other way to programmatically extract the headers that goes beyond regular expressions.

OP says Any other way to programmatically extract the headers that is more than just using a regular expression. .... without clang is ok.
We both agree that regexes are simply incapable of doing this right. You need the source text parsed as a tree with the #include directives explicitly appearing in the tree.
I'm not a Clang expert. I suspect its internal tree reflects preprocessed source, so the #include constructs have vanished. The problem is then one of insisting on preprocessing the source text to parse it.
Our DMS Software Reengineering Toolkit with its C++17 capable parser can handle such parsing without expanding the directives. It can do this in two ways: a) where preprocessor directives are "well structured" with respect to the source code, the C++ front end can be configured to capture a parse tree with the directives also parsed as trees in appropriate places; this works pretty well in practice at the price of sometimes having to hand-patch a particularly ugly conditional or macro call to make it "well structured", or b) parse capturing the preprocessor directives placed in an (almost) arbitrary way; this captures the directives sometimes at the price of automatically duplicating small bits of code to in essence cause the good restructuring liked by case a).
In either case, the #include directives now appear explicitly in the AST, with the included file pretty much built as an auxiliary tree representing the included file. Such tree nodes are easily found by a tree walk looking for such explicit include nodes.
DMS's ASTInterface provides ScanTree to walk across nodes and take actions when some provided predicate is true of a node; checking for #include nodes is easy. It is useful to note that because the conditional directives are also retained, by walking up the tree from a #include one can construct the condition under which that include file is actually included.
Of course, the header file itself is also parsed, producing a tree. Any includes it has appear in its tree body. One would have to run ScanTree over each of these trees to collect all the includes.
OP didn't say what they wanted to do with the #includes. DMS provides a lot beyond parsing to help OP achieve their purpose, including symbol table construction, control and dataflow analysis, tree pattern matching, tree-to-tree transformations expressed in terms of source language (C++) syntax, and finally source code (re)generated from a modified syntax tree.

Related

In C++, is it possible to include all headers in all sub-folders of a folder?

#include <iostream>
#include<eigen3/Eigen/src/Core/Matrix.h> // does not define Eigen::StorageOptions
// need something like this
#include<eigen3/Eigen/src/ everything_in_here >
int main()
{
Eigen::Matrix<double,2,2> mat;
std::cout << mat(0,0) <<std::endl;
return 0;
}
In there, I'm trying to build a matrix object and it always asks for 6 template parameters, with the error message:
wrong number of template arguments (3, should be 6)
So I started adding them, 4th one is Eigen::StorageOptions and is not defined in the Matrix.h header. Also there are too many headers to search. So, can I include all files in there with a single #include?
You are not supposed to include anything from Eigen/src/... directly. If you need only the core components of Eigen, use
#include <Eigen/Core>
If you want to include everything related to dense matrix operations (e.g., decompositions and the Geometry module):
#include <Eigen/Dense>
If you also want to include the Sparse module
#include <Eigen/Sparse>
// or this to include Dense+Sparse:
#include <Eigen/Eigen>
You should make sure via compilation options, that the top-level Eigen-directory is in your include-path, e.g. on many Linux-environments, add -I/usr/include/eigen3 as an argument. Your IDE probably also has an option for that. If you use something like CMake there are lots of related questions for that.
If for whatever reason this does not work or you just need a quick-and-dirty work-around, you can write
#include <eigen3/Eigen/Core> //etc
No, there isn't. Only individual files can be included. (Though technically what that means is implementation-defined.) The preprocessor has no filesystem/directory concept. It considers the file names used in #include directives only as strings, not paths.
I am also not aware of any compiler extension to support this (although there may be ones).
You can generate the necessary #include directives in a script as a preparation step in the build process.

How to throw a compile error if the include paths for a library were set up not as intended

Our C++ library contains a file with a name that is (considered) equal to one of the standard library's headers. In our case this is "String.h", which Windows considers to be the same as "string.h", but for the sake of this question it could be any other file name used in the standard library.
Normally, this file name ambiguity is not a problem since a user is supposed to set up the include paths to only include the parent of the library folder (therefore requiring to include "LibraryFolder/String.h") and not the folder containing the header.
However, sometimes users get this wrong and directly set the include path to the containing folder. This means that "String.h" will be included in place of "string.h" in both the user code and in the standard library headers, resulting in a lot of compile errors that may not be easy to resolve or understand for beginners.
Is it possible, during compile-time, to detect such wrongly set up include paths in our libraries' header and throw a compile #warning or #error right away via directive, based on some sort of check on how the inclusion path was?
There's no failsafe way. If the compiler finds another file, it won't complain.
However, you could make it so you can detect it. In your own LibraryName/string.h, you could define a unique symbol, like
#define MY_STRING_H412a55af_7643_4bd6_be5c_4315d3a1e6b7
Then later in dependent code you could check
#ifndef MY_STRING_H412a55af_7643_4bd6_be5c_4315d3a1e6b7
#error "Custom standard library path not configured correctly"
#endif
Likewise you could use this to detect when the wrong version of the library was included.
[edit - as per comments]
Header inclusion can be summarized as :
Parse #include line to determine header name to look up
Depending on <Foo.h> or "Foo.h" form, determine set of locations (usually directories) to search
Interpret the header name, in an implementation-dependent way. (usually as a relative path). Note that this is not necessarily as a string, e.g. MSVC doesn't treat \ as a string escape character.
If the header is found (usually, if a file is found), replace the #include line with the content of that file. If not, fail the compilation.
(The parenthesized "usually" apply to MSVC, GCC, clang, etc but theoretically a compiler could compile directly from a git repository instead of disk files)
The problem here is that the test imagined (spelling of header name) must be located in the included header file. This test would necessarily be part of the replaced #include line, which therefore no longer exists and cannot be tested.
C++17 introduces __has_include but this does not affect the analysis: It would still have to occur in the included header file, and would not have the character sequence from the #include "Foo.h" available.
[old]
Probably the easiest way, especially for beginners is to have a LibraryName/LibraryName.h. Hopefully that name is unique.
The benefit is that once that works, users can replace #include "LibraryName.h" with just #include "String.h" as you know the path is right.
That said, "String.h" is asking for problems. Windows isn't case sensitive.
Use namespaces. In your case this would translate into something like this:
MyString/String.h
namespace my_namespace {
class string {
...
};
}
Now to make sure std::string or any other class named string is not accidentally used instead of my_namespace::string (by any means, including but not limited to setting up your include paths incorrectly) you need to refer to your type using its fully qualified name, namely my_namespace::string. By doing this you avoid any naming clashes and are guaranteed to get a compile error if you don't include the correct header file (unless there actually exists another class called my_namespace::string that is not yours). There are other ways to avoid these clashes (such as using my_namespace::string) but I'd rather be explicit about the types I'm using. This solution is costly however because it probably needs changes all over your code base (changing all strings to my_namespace::string).
A somewhat less cumbersome alternative would be to change the name of the header String.h to something like MyString.h. This would quickly introduce compile errors but requires changing all your includes from #include "String.h" into #include "MyString.h" (which should be much less effort compared to the first option).
I cannot think of any other way that requires less effort as of now. Since you were looking for a solution that would work in all similar scenarios I'd go with the namespaces if I were you and solve the problem once and for all. This would prevent any other existing/future naming clashes that may be in your code.

String concatenation for include path

Is there a way to concatenate 2 strings literals to form an include path?
Code stub:
#define INCLUDE_DIR "/include"
#include INCLUDE_DIR "/dummy.h"
Looking at this question, the answers point in a different direction (compiler command line). It is mentioned here that it is seemingly not possible, but I wonder if the topic has been dug enough.
(I do have a use case in which this is relevant, please focus your answers/comments on this question only.)
It really seems this is not possible. I will report here the relevant section from Eric Postpischil's answer (he doesn't seem to be active anymore).
The compiler will do macro replacement on an #include line (per C
2011 [N1570] 6.10.2 4), but the semantics are not fully defined and
cannot be used to concatenate file path components without additional
assistance from the C implementation. So about all this allows you to
do is some simple substitution that provides a complete path, such as:
#define MyPath "../../path/to/my/file.h"
#include MyPath
Link to documentation. In particular this section doesn't leave much hope for portable solutions:
The method by which a sequence of preprocessing tokens between
a < and a > preprocessing token pair or a pair of " characters
is combined into a single header name preprocessing token is
implementation-defined.
For completeness, maybe something can be tried using https://stackoverflow.com/a/27830271/2436175. I'll investigate that when I have a moment...
I'm not sure that this is exactly what you want but anyway.
#define DECORATE(x) <x>
#define MAKE_PATH(root, file) DECORATE(root file)
#define SYS_DIR(file) MAKE_PATH(sys/, file)
#define ARPA_DIR(file) MAKE_PATH(arpa/, file)
#include SYS_DIR(types.h)
#include SYS_DIR(socket.h)
#include ARPA_DIR(inet.h)
Note that the generated filenames contain an extra space, <sys/ types.h>, so it may not be a cross-compiler solution. But at least for me it works on a Linux host with GCC 4.8 / 4.9.
P.S. It would be nice if someone could check this snippet with other compilers, e.g. MSVC.
Simply avoid the space and the concatenation (##) and use the < > form; it makes everything simpler:
#include <QtCore/QtGlobal>
#define QT_VERSION_PREFIX QT_VERSION_MAJOR.QT_VERSION_MINOR.QT_VERSION_PATCH
#define _CONCATE(a, c) <a/QT_VERSION_PREFIX/a/private/c>
#include _CONCATE(QtWidgets, qwidgettextcontrol_p.h)

#include anywhere

Is the #include <file> meant to be used for headers only or is it simply a mechanical "inject this code here" that can be used anywhere in the code?
What if I use it in the middle of a cpp function to just "inject" code from a single source? will this work or will compilers scream about this?
It is a mechanical inject the code here device. You can include a text file containing Goethe's Faust if you wish to. You can put it anywhere, even in the middle of a function (of course, #include needs a fresh line!).
However, it's a strong convention to only use #include for header files. There may be cases where I wouldn't object to it, for example pulling in machine-generated code or merging all translation units in a single file.
Not only does it work anywhere, but it can lead to some interesting techniques. Here's an example that generates an enumeration and a corresponding string table that are guaranteed to be in sync.
Animals.h:
ANIMAL(Anteater)
ANIMAL(Baboon)
...
ANIMAL(Zebra)
AnimalLibrary.h:
#define ANIMAL(name) name,
enum Animals {
#include "Animals.h"
AnimalCount
};
#undef ANIMAL
extern const char * AnimalTable[AnimalCount];
AnimalLibrary.cpp:
#include "AnimalLibrary.h"
#define ANIMAL(name) #name,
const char * AnimalTable[AnimalCount] = {
#include "Animals.h"
};
main.cpp:
#include "AnimalLibrary.h"
#include <iostream>
int main()
{
std::cout << AnimalTable[Baboon];
return 0;
}
Be sure not to put the usual include guards in any file that will be included multiple times!
Gotta agree with William Pursell though that this technique will have people scratching their heads.
Compilers will not complain, but everyone who has to maintain the code will.
It will work. Its semantic meaning is, more or less: place the code from that file here.
EDIT: For abusing usages of #include I can just recommend the following:
#include "/dev/console"
This allows for everything: a one-liner that can do everything, an error, it's just a matter of compilation...
Should work, it's processed by your preprocessor, your compiler won't even see it.
#include and other preprocessor directives like #define or #import, can appear anywhere in the source, but will only apply to the code after that inclusion. It is meant to include the referenced code into the source file that calls it.
This MSDN page explains it quite well. http://msdn.microsoft.com/en-us/library/36k2cdd4(v=VS.71).aspx
#include is handled by the preprocessor and is a mechanism to inject code. There are no restrictions on the file being included or where this #include is placed in your code (though it should be on its own line). As long as the file specified can be found by the preprocessor it will import its contents into the current file.
Conventionally you do this for header files. I've seen this being done with cpp files during template instantiation (with proper #ifdef so you don't include it multiple times causing multiple symbol definition error).
If you have a long constant, you can do this for other file types as well. (Though there are better ways of handling long string constants)

Preprocessor directives

When we see #include <iostream>, it is said to be a preprocessor directive.
#include ---> directive
And, I think:
<iostream> ---> preprocessor
But, what is meant by "preprocessor" and "directive"?
It may help to think of the relationship between a "directive" and being "given directions" (i.e. orders). "preprocessor directives" are directions to the preprocessor about changes it should make to the code before the later stages of compilation kick in.
But, what's the preprocessor? Well, its name reflects that it processes the source code before the "main" stages of compilation. It's simply there to process the textual source code, modifying it in various ways. The preprocessor doesn't even understand the tokens it operates on - it has no notion of types or variables, classes or functions - it's all just quoted- and/or parentheses- grouped, comma- and/or whitespace separated text to be manhandled. This extra process gives more flexibility in selecting, combining and even generating parts of the program.
EDIT addressing @SWEngineer's comment: Many people find it helpful to think of the preprocessor as a separate program that modifies the C++ program, then gives its output to the "real" C++ compiler (this is pretty much the way it used to be). When the preprocessor sees #include <iostream> it thinks "ahhha - this is something I understand, I'm going to take care of this and not just pass it through blindly to the C++ compiler". So, it searches a number of directories (some standard ones like /usr/include and wherever the compiler installed its own headers, as well as others specified using -I on the command line) looking for a file called "iostream". When it finds it, it then replaces the line in the input program saying "#include <iostream>" with the complete contents of the file called "iostream", adding the result to the output. BUT, it then moves to the first line it read from the "iostream" file, looking for more directives that it understands.
So, the preprocessor is very simple. It can understand #include, #define, #if/#elif/#endif, #ifdef and #ifndef, #warning and #error, but not much else. It doesn't have a clue what an "int" is, a template, a class, or any of that "real" C++ stuff. It's more like some automated editor that cuts and pastes parts of files and code around, preparing the program that the C++ compiler proper will eventually see and process. The preprocessor is still very useful, because it knows how to find parts of the program in all those different directories (the next stage in compilation doesn't need to know anything about that), and it can remove code that might work on some other computer system but wouldn't be valid on the one in use. It can also allow the program to use short, concise macro statements that generate a lot of real C++ code, making the program more manageable.
#include is the preprocessor directive, <iostream> is just an argument supplied in addition to this directive, which in this case happens to be a file name.
Some preprocessor directives take arguments, some don't, e.g.
#define FOO 1
#ifdef _NDEBUG
....
#else
....
#endif
#warning Untested code !
The common feature is that they all start with #.
In Olden Times the preprocessor was a separate tool which pre-processed source code before passing it to the compiler front-end, performing macro substitutions and including header files, etc. These days the pre-processor is usually an integral part of the compiler, but it essentially just does the same job.
Preprocessor directives, such as #define and #ifdef, are typically used to make source programs easy to change and easy to compile in different execution environments. Directives in the source file tell the preprocessor to perform specific actions. For example, the preprocessor can replace tokens in the text, insert the contents of other files into the source file...
#include is a preprocessor directive, meaning that it is used by the preprocessor part of the compiler. This happens before the compilation proper. The #include needs to specify what to include; this is supplied by the argument <iostream>, which tells the preprocessor to include the standard header named iostream.
More information:
Preprocessor Directives on MSDN
Preprocessor directives on cplusplus.com