gcc - how to auto instrument every basic block

gcc - how to auto instrument every basic block - c++

GCC has an auto-instrument options for function entry/exit.
-finstrument-functions Generate instrumentation calls for entry and exit to functions. Just after function entry and just before function
exit, the following profiling functions will be called with the
address of the current function and its call site. (On some platforms,
__builtin_return_address does not work beyond the current function, so the call site information may not be available to the profiling
functions otherwise.)
void __cyg_profile_func_enter (void *this_fn,
void *call_site);
void __cyg_profile_func_exit (void *this_fn,
void *call_site);
I would like to have something like this for every "basic block" so that I can log, dynamically, execution of every branch.
How would I do this?

There is a fuzzer called American Fuzzy Lop, it solves very similar problem of instrumenting jumps between basic blocks to gather edge coverage: if basic blocks are vertices what jumps (edges) were encountered during execution. It may be worth to see its sources. It has three approaches:
afl-gcc is a wrapper for gcc that substitutes as by a wrapper rewriting assembly code according to basic block labels and jump instructions
plugin for Clang compiler
patch for QEMU for instrumenting already compiled code
Another and probably the simplest option may be to use DynamoRIO dynamic instrumentation system. Unlike QEMU, it is specially designed to implement custom instrumentation (either as rewriting machine code by hand or simply inserting calls that even may be automatically inlined in some cases, if I get documentation right). If you think dynamic instrumentation is something very hard, look at their examples -- they are only about 100-200 lines (but you still need to read their documentation at least here and for used functions since it may contain important points: for example DR constructs dynamic basic blocks, which are distinct from a compiler's classic basic blocks). With dynamic instrumentation you can even instrument used system libraries. In case it is not what you want, you may use something like
static module_data_t *traced_module;
// in dr_client_main
traced_module = dr_get_main_module();
// in basic block event handler
void *app_pc = dr_fragment_app_pc(tag);
if (!dr_module_contains_addr(traced_module, app_pc)) {
return DR_EMIT_DEFAULT;
}

Related

How to get a caller graph from a given symbol in a binary

This question is related to a question I've asked earlier this day: I wonder if it's possible to generate a caller graph from a given function (or symbol name e.g. taken from nm), even if the function of interest is not part of "my" source code (e.g. located in a library, e.g. malloc())
For example to know where malloc is being used in my program named foo I would first lookup the symbol name:
nm foo | grep malloc
U malloc##GLIBC_2.2.5
And then run a tool (which might need a specially compiled/linked version of my program or some compiler artifacts):
find_usages foo-with-debug-symbols "malloc##GLIBC_2.2.5"
Which would generate a (textual) caller graph I can then process further.
Reading this question I found radare2 which seems to accomplish nearly everything you can imagine but somehow I didn't manage to generate a caller graph from a given symbol yet..
Progress
Using radare2 I've managed to generate a dot caller graph from an executable, but something is still missing. I'm compiling the following C++ program which I'm quite sure has to use malloc() or new:
#include <string>
int main() {
auto s = std::string("hello");
s += " welt";
return 0;
}
I compile it with static libraries in order to be sure all calls I want to analyze can be found in the binary:
g++ foo.cpp -static
By running nm a.out | grep -E "_Znwm|_Znam|_Znwj|_Znaj|_ZdlPv|_ZdaPv|malloc|free" you can see a lot of symbols which are used for memory allocation.
Now I run radare2 on the executable:
r2 -qAc 'agCd' a.out > callgraph.dot
With a little script (inspired by this answer) I'm looking for a call-path from any symbol containing "sym.operatornew" but there seems to be none!
Is there a way to make sure all information needed to generate a call graph from/to any function which get's called inside that binary?
Is there a better way to run radare2? It looks like the different call graph visualization types provide different information - e.g. the ascii art generator does provide names for symbols not provided by the dot generator while the dot generator provides much more details regarding calls.

In general, you cannot extract an exact control flow graph from a binary, because of indirect jumps and calls there. A machine code indirect call is jumping into the content of some register, and you cannot reliably estimate all the values that register could take (doing so could be proven equivalent to the halting problem).
Is there a way to make sure all information needed to generate a call graph from/to any function which get's called inside that binary?
No, and that problem is equivalent to the halting problem, so there would be never a sure way to get that call graph (in a complete and sound way).
The C++ compiler would (usually) generate indirect jumps for virtual function calls (they jump thru the vtable) and probably when using a shared library (read Drepper's How To Write Shared Libraries paper for more).
Look into the BINSEC tool (developed by colleagues from CEA, LIST and by INRIA), at least to find references.
If you really want to find most (but not all) dynamic memory allocations in your C++ source code, you might use static source code analysis (like Frama-C or Frama-Clang) and other tools, but they are not a silver bullet.
Remember that allocating functions like malloc or operator new could be put in function pointer locations (and your C++ code might have some allocator deeply buried somewhere, then you are likely to have indirect calls to malloc)
Maybe you could spend months of effort in writing your own GCC plugin to look for calls to malloc after optimizations inside the GCC compiler (but notice that GCC plugins are tied to one particular version of GCC). I am not sure it is worth the effort. My old (obsolete, non maintained) GCC MELT project was able to find calls to malloc with a size above some given constant. Perhaps in at least a year -end of 2019 or later- my successor project (bismon, funded by CHARIOT H2020 project) might be mature enough to help you.
Remember also that GCC is capable of quite fancy optimizations related to malloc. Try to compile the following C code
//file mallfree.c
#include <stdlib.h>
int weirdsum(int x, int y) {
int*ar2 = malloc(2*sizeof(int));
ar2[0] = x; ar2[1] = y;
int r = ar2[0] + ar2[1];
free (ar2);
return r;
}
with gcc -S -fverbose-asm -O3 mallfree.c. You'll see that the generated mallfree.s assembler file contain no call to malloc or to free. Such an optimization is permitted by the As-if rule, and is practically useful to optimize most usages of C++ standard containers.
So what you want is not simple even for apparently "simple" C++ code (and is impossible in the general case).
If you want to code a GCC plugin and have more than a full year to spend on that issue (or could pay at least 500k€ for that), please contact me. See also
https://xkcd.com/1425/ (your question is a virtually impossible one).
BTW, of course, what you really care about is dynamic memory allocation in optimized code (you really want inlining and dead code elimination, and GCC does that quite well with -O3 or -O2). When GCC is not optimizing at all (e.g. with -O0 which is the implicit optimization) it would do a lot of "useless" dynamic memory allocation, specially with C++ code (using the C++ standard library). See also CppCon 2017: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” talk.

ELF INIT section code to prepopulate objects used at runtime

I'm fairly new to c++ and am really interested in learning more. Have been reading quite a bit. Recently discovered the init/fini elf sections.
I started to wonder if & how one would use the init section to prepopulate objects that would be used at runtime. Say for example you wanted
to add performance measurements to your code, recording the time, filename, linenumber, and maybe some ID (monotonic increasing int for ex) or name.
You would place for example:
PROBE(0,"EventProcessing",__FILE__,__LINE__)
...... //process event
PROBE(1,"EventProcessing",__FILE__,__LINE__)
......//different processing on same event
PROBE(2,"EventProcessing",__FILE__,__LINE__)
The PROBE could be some macro that populates a struct containing this data (maybe on an array/list, etc using the id as an indexer).
Would it be possible to have code in the init section that could prepopulate all of this data for each PROBE (except for the time of course), so only the time would need to be retrieved/copied at runtime?
As far as I know the __attribute__((constructor)) can not be applied to member functions?
My initial idea was to create some kind of
linked list with each node pointing to each probe and code in the init secction could iterate it populating the id, file, line, etc, but
that idea assumed I could use a member function that could run in the "init" section, but that does not seem possible. Any tips appreciated!

As far as I understand it, you do not actually need an ELF constructor here. Instead, you could emit descriptors for your probes using extended asm statements (using data, instead of code). This also involves switching to a dedicated ELF section for the probe descriptors, say __probes.
The linker will concatenate all the probes and in an array, and generate special symbols __start___probes and __stop___probes, which you can use from your program to access thes probes. See the last paragraph in Input Section Example.
Systemtap implements something quite similar for its userspace probes:
User Space Probe Implementation
Adding User Space Probing to an Application (heapsort example)
Similar constructs are also used within the Linux kernel for its self-patching mechanism.

There's a pretty simple way to have code run on module load time: Use the constructor of a global variable:
struct RunMeSomeCode
{
RunMeSomeCode()
{
// your code goes here
}
} do_it;
The .init/.fini sections basically exist to implement global constructors/destructors as part of the ABI on some platforms. Other platforms may use different mechanisms such as _start and _init functions or .init_array/.deinit_array and .preinit_array. There are lots of subtle differences between all these methods and which one to use for what is a question that can really only be answered by the documentation of your target platform. Not all platforms use ELF to begin with…
The main point to understand is that things like the .init/.fini sections in an ELF binary happen way below the level of C++ as a language. A C++ compiler may use these things to implement certain behavior on a certain target platform. On a different platform, a C++ compiler will probably have to use different mechanisms to implement that same behavior. Many compilers will give you tools in the form of language extensions like __attributes__ or #pragmas to control such platform-specific details. But those generally only make sense and will only work with that particular compiler on that particular platform.

You don't need a member function (which gets a this pointer passed as an arg); instead you can simply create constructor-like functions that reference a global array, like
#define PROBE(id, stuff, more_stuff) \
__attribute__((constructor)) void \
probeinit##id(){ probes[id] = {id, stuff, 0/*to be written later*/, more_stuff}; }
The trick is having this macro work in the middle of another function. GNU C / C++ allows nested functions, but IDK if you can make them constructors.
You don't want to declare a static int dummy#id = something because then you're adding overhead to the function you profile. (gcc has to emit a thread-safe run-once locking mechanism.)
Really what you'd like is some kind of separate pass over the source that identifies all the PROBE macros and collects up their args to declare
struct probe global_probes[] = {
{0, "EventName", 0 /*placeholder*/, filename, linenum},
{1, "EventName", 0 /*placeholder*/, filename, linenum},
...
};
I'm not confident you can make that happen with CPP macros; I don't think it's possible to #define PROBE such that every time it expands, it redefines another macro to tack on more stuff.
But you could easily do that with an awk/perl/python / your fave scripting language program that scans your program and constructs a .c that declares an array with static storage.
Or better (for a single-threaded program): keep the runtime timestamps in one array, and the names and stuff in a separate array. So the cache footprint of the probes is smaller. For a multi-threaded program, stores to the same cache line from different threads is called false sharing, and creates cache-line ping-pong.
So you'd have #define PROBE(id, evname, blah blah) do { probe_times[id] = now(); }while(0)
and leave the handling of the later args to your separate preprocessing.

Automatically wrap C/C++ function at compile-time with annotation

In my C/C++ code I want to annotate different functions and methods so that additional code gets added at compile-time (or link-time). The added wrapping code should be able to inspect context (function being called, thread information, etc.), read/write input variables and modify return values and output variables.
How can I do that with GCC and/or Clang?

Take a look at instrumentation functions in GCC. From man gcc:
-finstrument-functions
Generate instrumentation calls for entry and exit to functions. Just after function entry and just before function exit, the following profiling functions will be called with the address of the current function and its call site. (On some platforms,
"__builtin_return_address" does not work beyond the current function, so the call site information may not be available to the profiling functions otherwise.)
void __cyg_profile_func_enter (void *this_fn,
void *call_site);
void __cyg_profile_func_exit (void *this_fn,
void *call_site);
The first argument is the address of the start of the current function, which may be looked up exactly in the symbol table.
This instrumentation is also done for functions expanded inline in other functions. The profiling calls will indicate where, conceptually, the inline function is entered and exited. This means that addressable versions of such functions must be available. If
all your uses of a function are expanded inline, this may mean an additional expansion of code size. If you use extern inline in your C code, an addressable version of such functions must be provided. (This is normally the case anyways, but if you get lucky
and the optimizer always expands the functions inline, you might have gotten away without providing static copies.)
A function may be given the attribute "no_instrument_function", in which case this instrumentation will not be done. This can be used, for example, for the profiling functions listed above, high-priority interrupt routines, and any functions from which the
profiling functions cannot safely be called (perhaps signal handlers, if the profiling routines generate output or allocate memory).

Insert text into C++ code between functions

I have following requirement:
Adding text at the entry and exit point of any function.
Not altering the source code, beside inserting from above (so no pre-processor or anything)
For example:
void fn(param-list)
{
ENTRY_TEXT (param-list)
//some code
EXIT_TEXT
}
But not only in such a simple case, it'd also run with pre-processor directives!
Example:
void fn(param-list)
#ifdef __WIN__
{
ENTRY_TEXT (param-list)
//some windows code
EXIT_TEXT
}
#else
{
ENTRY_TEXT (param-list)
//some any-os code
if (condition)
{
return; //should become EXIT_TEXT
}
EXIT_TEXT
}
So my question is: Is there a proper way doing this?
I already tried some work with parsers used by compilers but since they all rely on running a pre-processor before parsing, they are useless to me.
Also some of the token generating parser, which do not need a pre-processor are somewhat useless because they generate a memory-mapping of tokens, which then leads to a complete new source code, instead of just inserting the text.
One thing I am working on is to try it with FLEX (or JFlex), if this is a valid option, I would appreciate some input on it. ;-)
EDIT:
To clarify a little bit: The purpose is to allow something like a stack trace.
I want to trace every function call, and in order to follow the call-hierachy, I need to place a macro at the entry-point of a function and at the exit point of a function.
This builds a function-call trace. :-)
EDIT2: Compiler-specific options are not quite suitable since we have many different compilers to use, and many that are propably not well supported by any tools out there.

Unfortunately, your idea is not only impractical (C++ is complex to parse), it's also doomed to fail.
The main issue you have is that exceptions will bypass your EXIT_TEXT macro entirely.
You have several solutions.
As has been noted, the first solution would be to use a platform dependent way of computing the stack trace. It can be somewhat imprecise, especially because of inlining: ie, small functions being inlined in their callers, they do not appear in the stack trace as no function call was generated at assembly level. On the other hand, it's widely available, does not require any surgery of the code and does not affect performance.
A second solution would be to only introduce something on entry and use RAII to do the exit work. Much better than your scheme as it automatically deals with multiple returns and exceptions, it suffers from the same issue: how to perform the insertion automatically. For this you will probably want to operate at the AST level, and modify the AST to introduce your little gem. You could do it with Clang (look up the c++11 migration tool for examples of rewrites at large) or with gcc (using plugins).
Finally, you also have manual annotations. While it may seem underpowered (and a lot of work), I would highlight that you do not leave logging to a tool... I see 3 advantages to doing it manually: you can avoid introducing this overhead in performance sensitive parts, you can retain only a "summary" of big arguments and you can customize the summary based on what's interesting for the current function.

I would suggest using LLVM libraries & Clang to get started.
You could also leverage the C++ language to simplify your process. If you just insert a small object into the code that is constructed on function scope entrance & rely on the fact that it will be destroyed on exit. That should massively simplify recording the 'exit' of the function.

This does not really answer you question, however, for your initial need, you may use the backtrace() function from execinfo.h (if you are using GCC).
How to generate a stacktrace when my gcc C++ app crashes

Logging code execution in C++

Having used gprof and callgrind many times, I have reached the (obvious) conclusion that I cannot use them efficiently when dealing with large (as in a CAD program that loads a whole car) programs. I was thinking that maybe, I could use some C/C++ MACRO magic and somehow build a simple (but nice) logging mechanism. For example, one can call a function using the following macro:
#define CALL_FUN(fun_name, ...) \
fun_name (__VA_ARGS__);
We could add some clocking/timing stuff before and after the function call, so that every function called with CALL_FUN gets timed, e.g
#define CALL_FUN(fun_name, ...) \
time_t(&t0); \
fun_name (__VA_ARGS__); \
time_t(&t1);
The variables t0, t1 could be found in a global logging object. That logging object can also hold the calling graph for each function called through CALL_FUN. Afterwards, that object can be written in a (specifically formatted) file, and be parsed from some other program.
So here comes my (first) question: Do you find this approach tractable ? If yes, how can it be enhanced, and if not, can you propose a better way to measure time and log callgraphs ?
A collegue proposed another approach to deal with this problem, which is annotating with a specific comment each function (that we care to log). Then, during the make process, a special preprocessor must be run, parse each source file, add logging logic for each function we care to log, create a new source file with the newly added (parsing) code, and build that code instead. I guess that reading CALL_FUN... macros (my proposal) all over the place is not the best approach, and his approach would solve this problem. So what is your opinion about this approach?
PS: I am not well versed in the pitfalls of C/C++ MACROs, so if this can be developed using another approach, please say it so.
Thank you.

Well you could do some C++ magic to embed a logging object. something like
class CDebug
{
CDebug() { ... log somehow ... }
~CDebug() { ... log somehow ... }
};
in your functions then you simply write
void foo()
{
CDebug dbg;
...
you could add some debug info
dbg.heythishappened()
...
} // not dtor is called or if function is interrupted called from elsewhere.

I am bit late, but here is what I am doing for this:
On Windows there is a /Gh compiler switch which makes the compiler to insert a hidden _penter function at the start of each function. There is also a switch for getting a _pexit call at the end of each function.
You can utilizes this to get callbacks on each function call. Here is an article with more details and sample source code:
http://www.johnpanzer.com/aci_cuj/index.html
I am using this approach in my custom logging system for storing the last few thousand function calls in a ring buffer. This turned out to be useful for crash debugging (in combination with MiniDumps).
Some notes on this:
The performance impact very much depends on your callback code. You need to keep it as simple as possible.
You just need to store the function address and module base address in the log file. You can then later use the Debug Interface Access SDK to get the function name from the address (via the PDB file).
All this works suprisingly well for me.

Many nice industrial libraries have functions' declarations and definitions wrapped into void macros, just in case. If your code is already like that -- go ahead and debug your performance problems with some simple asynchronous trace logger. If no -- post-insertion of such macros can be an unacceptably time-consuming.
I can understand the pain of running an 1Mx1M matrix solver under valgrind, so I would suggest starting with so called "Monte Carlo profiling method" -- start the process and in parallel run pstack repeatedly, say each second. As a result you will have N stack dumps (N can be quite significant). Then, the mathematical approach would be to count relative frequencies of each stack and make a conclusion about the ones most frequent. In practice you either immediately see the bottleneck or, if no, you switch to bisection, gprof, and finally to valgrind's toolset.

Let me assume the reason you are doing this is you want to locate any performance problems (bottlenecks) so you can fix them to get higher performance.
As opposed to measuring speed or getting coverage info.
It seems you're thinking the way to do this is to log the history of function calls and measure how long each call takes.
There's a different approach.
It's based on the idea that mainly the program walks a big call tree.
If time is being wasted it is because the call tree is more bushy than necessary,
and during the time that's being wasted, the code that's doing the wasting is visible on the stack.
It can be terminal instructions, but more likely function calls, at almost any level of the stack.
Simply pausing the program under a debugger a few times will eventually display it.
Anything you see it doing, on more than one stack sample, if you can improve it, will speed up the program.
It works whether or not the time is being spent in CPU, I/O or anything else that consumes wall clock time.
What it doesn't show you is tons of stuff you don't need to know.
The only way it can not show you bottlenecks is if they are very small,
in which case the code is pretty near optimal.
Here's more of an explanation.

Although I think it will be hard to do anything better than gprof, you can create a special class LOG for instance and instantiate it in the beginning of each function you want to log.
class LOG {
LOG(const char* ...) {
// log time_t of the beginning of the call
}
~LOG(const char* ...) {
// calculate the total time spent,
//by difference between current time and that saved in the constructor
}
};
void somefunction() {
LOG log(__FUNCTION__, __FILE__, ...);
.. do other things
}
Now you can integrate this approach with the preprocessing one you mentioned. Just add something like this in the beginning of each function you want to log:
// ### LOG
and then you replace the string automatically in debug builds (shoudn't be hard).

May be you should use a profiler. AQTime is a relatively good one for Visual Studio. (If you have VS2010 Ultimate, you already have a profiler.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js