How to use LLVM to generate a call graph? - llvm

I'm looking into generating a call-graph for the linux kernel that would include function pointers (see my previous question Static call graph generation for the Linux kernel for more information). I've been told LLVM should be suitable for this purpose, however I was unable to find the relevant information on llvm.org
Any help, including pointers to relevant documentation, would be appreciated.

First, you have to compile your kernel into LLVM IR (instead of native object files). Then, using llvm-ld, combine all the IR object files into a single large module. It could be quite a tricky thing to do, you'll have to modify the makefiles heavily, but I believe it is doable.
Now you can do your analysis. A simple call graph can be generated using the opt tool with -dot-callgraph pass. It is unlikely to handle function pointers, so you may want to modify it.
Tracking all the possible data flow paths that would carry your function pointers is quite a challenge, and in general case it is impossible to do (if there are any pointer to integer casts, if pointers are stored in complicated data structures, etc.). For a majority of specific cases you can try to implement a global abstract interpretation to approximate all the possible data flow paths for your pointers. It would not be accurate, of course, but then you'll get at least a conservative approximation.

Related

Converting endianness of struct-Data

What i have is:
a hex file with the bytes of a c-struct in it, orderd in big-endian
the struct definition as *.h file
the struct information as dwarf2 debug info
My application has to be written in C / C++. Intermediate scripts using for example python would be ok.
What i have to do is read the bytes of the hex-file and cast it into the struct type on a system that is little-endian.
And during this process, i will have to reverse the bytes of each struct member.
The obvious solution would be to write a conversion function, that does byteswapping for each struct-member, but since the struct has multiple layers and ~1200 members that are changing faster than i can update my conversion function, writing that by hand is no solution.
So i could generate the conversion function automatically by:
Finding and parsing the types inside multiple *.h files
Iterating members of all struct-types and generate swaps for them -> without some sort of reflection api not that easy)
loading the struct via the conversion function.
Since this solution seems like quite a bit of work, i was wandering if there is easier way like telling the compiler to swap it or use debug-info somehow.
Does anybody know a trick that might help in this case?
Thanks and greetings!
Remark:
Changing any of the processes leading to this / changing the input-conditions or delegating responsibilities to other developers involved is not pssible.
Changing something about the hex-file as an input is not possible. This file comes out of some other system that will not change to fix this problem here.
Padding, Datatype-sizes etc. are identical. This is ensured by other measures, too. So endianess is defenetly the only problem. This is also why i see no reason against using dwarf2 info to identify the bytes of every struct member.
I agree that the layout of the struct is very bad. But It has some reasons why it is that way and to be short, i can/am not allowed not change that anyway because of process-reasons and backwards compatibility.
To give some more scope:
The Software that all of this is used in is deployed to multiple different embedded devices (multiple types). The hex-file containes the calibration information of the software and is thus stored in a specific system that can only output this hex-file.
I am now porting the software to a little-endian device and i have to use the hex-file given from the "main" branch of software, which is big-endian, as an input.
There is no way to tell C or C++ compiler to swap bytes from LE to BE or vice versa automatically. You really have to do it yourself. If your data structs are really huge, probably the best way is to implement automatic conversion code generation.
The problem, as far as I understand it, is tricky but tractable. As far as I understand, data extraction won't be running on an embedded device, so it won't be resource constrained. I say - embrace the runtime inefficiency that desktop hardware allows, and go for easy to debug instead.
Instead of thinking of the source file as "almost what I need modulo a couple of minor adjustments", think of it as "generic binary file with an open ended, evolving schema". The schema description is the DWARF data.
What I would do: start a Python project. Use the pyelftools PyPI module to parse the DWARF. Scroll for the compile units (CUs). In each CU, scroll through the top level entries (DIEs). Look for a DW_TAG_structure_type DIE with a specific value of DW_AT_name (I hope the struct name is known in advance). Then go through the DW_TAG_member sub-DIEs. DW_AT_data_member_location will give you the offset, letting you work around the padding. Look at DW_AT_type to detect the member type (you'd have to resolve the DIE reference for that). Recurse into struct- and array-type members as necessary.
From that, generate a format string for the struct.unpack method - it can read big-endian ints seamlessly. Then use struct.pack to format it into whatever format the C++ consumer expects.
This depends on you being able to track the data file to the DWARF info of the generating executable, exactly the same build. I hope the processes of the organization allow for that.
Recent versions of GCC allow the declaration of the desired endianness irrespective of the target platform for a source code section using the pragma scalar_storage_order or a specific type using an attribute with the same identifier. The main catch: g++ does not support this. Also, this won't work in all cases. For example, taking a pointer to a member with transparent endianness conversion leads to an error. Unless you're okay with sticking to C for struct access (it all depends on your current codebase), this is not an option.
The persistence layout is based on the original struct layout - so be it. However, a more explicit approach of serializing the structs should be preferred for exactly the reason you bring this up. Besides the endianness issue, struct packing also affects compatibility and should be explicitly specified. For persistence, a packing of 1 would be optimal. For in-memory data structures, that alignment is far from optimal in terms of performance and concurrency characteristics. Also, different platforms might have incompatible data types (e.g. sizeof(long) on 64-bit Linux/Windows - LP64 vs. LLP64). So, keeping the persistence layout separate from in-memory data structures tends to have a long list of advantages and therefore usually outweights the disadvantage of having to maintain the serialization code separately. Particularly, if portability is a major concern.
You could take advantage of C/C++-based reflection libraries or implement one yourself. In case of C, this will definitely require macros (e.g. Metaresc). In case of C++, you might actually get away your original struct definitions (e.g. Boost.Precise and Flat Reflection).
If reflection is not an option, you could generate the serialization code either by parsing the headers or debug symbols. Generally, parsing C/C++ is more complex. By moving the structs involved into dedicated headers, you might get away with a simple C/C++ parser. To make things easier, you could simplify parsing by processing the gdb output of ptype based on debug symbols. Or, you could parse debug symbols directly. With a scripting language like Python, both approaches should be feasible (pygccxml and pyelftools come to mind).
Rather than sticking to generating the serialization code as part of the build process, you could generate that code once and require updates whenever the structs change in the future. That's what I would do in a multi-platform scenario. Doing that would also spare you the pain of implementing a perfect parser that can deal with all kinds of C/C++ input, it would only have to be good enough for one-time generation.

gcc - how to detect pointer-based memory access

I'm focusing on micropython, specifically the branch dynamic-native-modules.
This feature will, in the future, allow you to compile a C/C++ function into a native .obj and package it together with a .py interface for a huge speed boost.
Awesome! But the issue is that if you're using a RTOS, which doesn't have virtual memory, then any executing native code can access any part of the address space including peripherals, the RTOS' state etc.
You don't want the user to be able to do something like this:
void user_func()
{
/* point to arbitrary memory, potentially the reset registers, flash erase . . . you get the point */
int * a = (int*)0x1234;
*a = 0x10110000; // DESTROY!!!
}
Even the following should be disallowed:
void user_func()
{
int a;
(int*)(&a-1000) = 0x10010111;
}
SOLUTIONS?
Create own version of gcc (for each binary format)
Decompile .obj files and detect use of pointers (for each binary binary format)
FEEDBACK TO COMMENTS
I get that it may be impossible to stop a malicious user but that's not the #1 worry. We want to stop well-meaning but accidental code. If it's not possible to stop every, single case that's ok.
If we can prohibit/detect explicit pointer accesses and simply provide warnings regarding array use, that is still very valuable.
WARNING: YOU'RE USING AN ARRAY! MAKE SURE YOU DON'T GO OUT-OF-BOUNDS
Your best chance is a GCC plugin which looks at the frontend-generated GENERIC or GIMPLE IRs and implements the policies you want. Depending on the policies and the source code you want to accept, this could be a lot of work and very difficult.
If you want a purely syntax-based or type-based approach (simply rejecting all pointer arithmetic), Clang with its ASTs is easier to work with than GCC.
There could be a way of doing what you want - as long as you protect from honest mistakes, not deliberate attempts to break things.
First, you embrace C++ Core Guidelines and use support from the tools: https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-tools
For your purposes, it will stop your code from using raw pointers and pointer arithmetic in non-library code. Period. Core guidelines do more than that, but this is what is related to your question.
Than, you do a simple grep on the code to make sure it is not using pointers blessed by Guidelines - this would be a case of more or less simple grep.
In practice, it is not reasonably possible.
You might consider changing the compilation (e.g. with a GCC plugin, like mentioned in Florian Weimer's answer) to check every access to arrays, but that makes the generated code significantly slower. Your native hardware is slow enough, you don't want to make it even more slow.
And Python is not exactly the right source language. Its dynamic typing would make its quite slow already.
Perhaps you might consider a statically typed language, like Ocaml (combined with a JIT or AOT compilation library like GCCJIT, etc). The type system (and its inference) increase the safety (and the speed) of the generated code, and with lot of hard work (several years, probably worth a PhD
since you'll need do new research) you might improve the type inference to even infer the occurrences where the array index is never out-of-bounds (and an index boundary check is not even needed).
In most case, upgrading the hardware (to a Raspberry Pi style one, with an MMU, and probably the ability to have a real OS with some virtual memory) is probably the most pragmatic approach
PS. Be aware of Rice's theorem. Most static source analysis cannot work reliably and well (be sound and complete, in technical terms).

Creating a modular language in LLVM?

I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has a llvm::Linker class that I can use during compilation to combine source files (llvm::Module), however I wanted to guarantee library compatibility via metadata version numbers or at least the publicly exposed interface between separate compilation runs.
Much of the information available on metadata in LLVM suggest that it should only be used for extended information that would not break correctness when silently removed.
llvm
blog
IntrinsicsMetadataAttributes
pdf
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
I also know there is a method in IRReader to parseIRFile so I can load some previously built bc files. I would be curious if it would be reasonable practice to include size and CRC information for comparison when loading these files.
My language has concepts similar to C# including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (Much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language specific information in the interface without needing to encode it in the IR as both the library and the calling code would be required to build with the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters so there is no predetermined function parameter order. This allows call sites to be more explicit, the compiler to catch erroneous parameter usage, and authors have more liberty in determining default parameters as they are not restricted to the last parameters to the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Is version number, size, and CRC information reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units" which is meant to be a separately compiled object, but currently what I do is simply drag in the source file and compile it into the main llvm::Module - that's neither efficient nor flexible (can't use the linker to choose between the "Linux" and "Windows" version of some code, for example - not that I think there is 5% chance that my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form - then
you don't depend on LLVM versions changing the IR format.
You can add whatever metadata you like. There won't be much
difference in generating LLVM-IR from this during your link phase or
building the IR at compile and then reading the IR to figure out if
the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way]
Like I started out, I'm not sure this is an answer, but it's my thoughts so far on the subject. Now I'll go back to adding debug symbol stuff to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...

tool for finding which functions can ultimately cause a call to a (list of) low level functions

I have a very large C++ program where certain low level functions should only be called from certain contexts or while taking specific precautions. I am looking for a tool that shows me which of these low-level functions are called by much higher level functions. I would prefer this to be visible in the IDE with some drop down or labeling, possibly in annotated source output, but any easier method than manually searching the call-graph will help.
This is a problem of static analysis and I'm not helped by a profiler.
I am mostly working on mac, linux is OK, and if something is only available on windows then I can live with that.
Update
Just having the call-graph does not make it that much quicker to answer the question, "does foo() potentially cause a call to x() y() or z()". (or I'm missing something about the call-graph tools, perhaps I need to write a program that traverses it to get a solution?)
There exists Clang Static Analyzer which uses LLVM which should also be present on OS X. Actually i'm of the opinion that this is integrated in Xcode. Anyway, there exists a GUI.
Furthermore there are several LLVM passes, where you can generate call graphs, but i'm not sure if this is what you want.
The tool Scientific Toolworks "Understand" tool is supposed to be able to produce call graphs for C and C++.
Doxygen also supposedly produces call graphs.
I don't have any experience with either of these, but some harsh opinions. You need to keep in mind that I'm a vendor of another tool, so take this opinion with a big grain of salt.
I have experience building reasonably accurate call graphs for massive C systems (25 million lines) with 250,000 functions.
One issue I encounter in building a realistic call graph are indirect function calls, and for C++, overloaded method function calls. In big systems, there are a lot of both of these. To determine what gets called when FOO gets invoked, your tool has to have to deep semantic understanding of how the compiler/language resolves an overloaded call, and for indirect function calls, a reasonably precise determination of what a function pointer might actually point-to in a big system. If you don't get these reasonably right, your call graph will contain a lot of false positives (e.g., bogus claims of A calls B), and on scale false positives are a disaster.
For C++, you must have what amounts to the full compiler front end. Neither Understand or Doxygen have this, so I don't see how they can actually understand C++'s overloading/Koenig lookup rules. Neither Understand or Doxygen make any attempt that I know of to reason about indirect function calls.
Our DMS Software Reengineering Toolkit does build calls graphs for C reasonably well, even with indirect function pointers, using a C-language precise front end.
We have C++ language precise front end, and it does the overload resolution correctly (to the extent the C++ committee agrees on it, and we understand what they said, and what the individual compilers do [they don't always agree]), and we have something like Doxygen that shows this information. We don't presently have function pointer analysis for C++ but we are working on it (we have full control flow graphs within methods and that's a big step).
I understand CLANG has some option for computing call graphs, and I'd expect that to be accurate on overloads since Clang is essentially a C++ compiler implemented with a bunch of components. I don't know what, if anything Clang does to analyze function pointers.

Open-source C++ scanning library

Rationale: In my day-to-day C++ code development, I frequently need to
answer basic questions such as who calls what in a very large C++ code
base that is frequently changing. But, I also need to have some
automated way to exactly identify what the code is doing around a
particular area of code. "grep" tools such as Cscope are useful (and
I use them heavily already), but are not C++-language-aware: They
don't give any way to identify the types and kinds of lexical
environment of a given use of a type or function a such way that is
conducive to automation (even if said automation is limited to
"read-only" operations such as code browsing and navigation, but I'm
asking for much more than that below).
Question: Does there exist already an open-source C/C++-based library
(native, not managed, not Microsoft- or Linux-specific) that can
statically scan or analyze a large tree of C++ code, and can produce
result sets that answer detailed questions such as:
What functions are called by some supplied function?
What functions make use of this supplied type?
Ditto the above questions if C++ classes or class templates are involved.
The result set should provide some sort of "handle". I should be able
to feed that handle back to the library to perform the following types
of introspection:
What is the byte offset into the file where the reference was made?
What is the reference into the abstract syntax tree (AST) of that
reference, so that I can inspect surrounding code constructs? And
each AST entity would also have file path, byte-offset, and
type-info data associated with it, so that I could recursively walk
up the graph of callers or referrers to do useful operations.
The answer should meet the following requirements:
API: The API exposed must be one of the following:
C or C++ and probably is "C handle" or C++-class-instance-based
(and if it is, must be generic C o C++ code and not Microsoft- or
Linux-specific code constructs unless it is to meet specifics of
the given platform), or
Command-line standard input and standard output based.
C++ aware: Is not limited to C code, but understands C++ language
constructs in minute detail including awareness of inter-class
inheritance relationships and C++ templates.
Fast: Should scan large code bases significantly faster than
compiling the entire code base from scratch. This probably needs to
be relaxed, but only if Incremental result retrieval and Resilient
to small code changes requirements are fully met below.
Provide Result counts: I should be able to ask "How many results
would you provide to some request (and no don't send me all of the
results)?" that responds on the order of less than 3 seconds versus
having to retrieve all results for any given question. If it takes
too long to get that answer, then wastes development time. This is
coupled with the next requirement.
Incremental result retrieval: I should be able to then ask "Give me
just the next N results of this request", and then a handle to the
result set so that I can ask the question repeatedly, thus
incrementally pulling out the results in stages. This means I
should not have to wait for the entire result set before seeing
some subset of all of the results. And that I can cancel the
operation safely if I have seen enough results. Reason: I need to
answer the question: "What is the build or development impact of
changing some particular function signature?"
Resilient to small code changes: If I change a header or source
file, I should not have to wait for the entire code base to be
rescanned, but only that header or source file
rescanned. Rescanning should be quick. E.g., don't do what cscope
requires you to do, which is to rescan the entire code base for
small changes. It is understood that if you change a header, then
scanning can take longer since other files that include that header
would have to be rescanned.
IDE Agnostic: Is text editor agnostic (don't make me use a specific
text editor; I've made my choice already, thank you!)
Platform Agnostic: Is platform-agnostic (don't make me only use it
on Linux or only on Windows, as I have to use both of those
platforms in my daily grind, but I need the tool to be useful on
both as I have code sandboxes on both platforms).
Non-binary: Should not cost me anything other than time to download
and compile the library and all of its dependencies.
Not trial-ware.
Actively Supported: It is likely that sending help requests to mailing lists
or associated forums is likely to get a response in less than 2
days.
Network agnostic: Databases the library builds should be able to be used directly on
a network from 32-bit and 64-bit systems, both Linux and Windows
interchangeably, at the same time, and do not embed hardcoded paths
to filesystems that would otherwise "root" the database to a
particular network.
Build environment agnostic: Does not require intimate knowledge of my build environment, with
the notable exception of possibly requiring knowledge of compiler
supplied CPP macro definitions (e.g. -Dmacro=value).
I would say that CLang Index is a close fit. However I don't think that it stores data in a database.
Anyway the CLang framework offer what you actually need to build a tool tailored to your needs, if only because of its C, C++ and Objective-C parsing / indexing capabitilies. And since it's provided as a set of reusable libraries... it was crafted for being developed on!
I have to admit that I haven't used either because I work with a lot of Microsoft-specific code that uses Microsoft compiler extensions that i don't expect them to understand, but the two open source analyzers I'm aware of are Mozilla Pork and the Clang Analyzer.
If you are looking for results of code analysis (metrics, graphs, ...) why not use a tool (instead of API) to do that? If you can, I suggest you to take a look at Understand.
It's not free (there's a trial version) but I found it very useful.
Maybe Doxygen with GraphViz could be the answer of some of your constraints but not all,for example the analysis of Doxygen is not incremental.