Automatically wrap C/C++ function at compile-time with annotation - c++

In my C/C++ code I want to annotate different functions and methods so that additional code gets added at compile-time (or link-time). The added wrapping code should be able to inspect context (function being called, thread information, etc.), read/write input variables and modify return values and output variables.
How can I do that with GCC and/or Clang?

Take a look at instrumentation functions in GCC. From man gcc:
-finstrument-functions
Generate instrumentation calls for entry and exit to functions. Just after function entry and just before function exit, the following profiling functions will be called with the address of the current function and its call site. (On some platforms, "__builtin_return_address" does not work beyond the current function, so the call site information may not be available to the profiling functions otherwise.)
void __cyg_profile_func_enter (void *this_fn, void *call_site);
void __cyg_profile_func_exit (void *this_fn, void *call_site);
The first argument is the address of the start of the current function, which may be looked up exactly in the symbol table.
This instrumentation is also done for functions expanded inline in other functions. The profiling calls will indicate where, conceptually, the inline function is entered and exited. This means that addressable versions of such functions must be available. If all your uses of a function are expanded inline, this may mean an additional expansion of code size. If you use extern inline in your C code, an addressable version of such functions must be provided. (This is normally the case anyways, but if you get lucky and the optimizer always expands the functions inline, you might have gotten away without providing static copies.)
A function may be given the attribute "no_instrument_function", in which case this instrumentation will not be done. This can be used, for example, for the profiling functions listed above, high-priority interrupt routines, and any functions from which the profiling functions cannot safely be called (perhaps signal handlers, if the profiling routines generate output or allocate memory).
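The two hooks described in the man page excerpt could be defined roughly as follows. This is a minimal sketch, assuming GCC or Clang and a build with -finstrument-functions; the counters and the helpers enter_count, exit_count and check_counts are illustrative names, not part of GCC. The hooks only count calls; they receive addresses, so they can identify the function and call site but cannot by themselves read or modify arguments or return values.

```cpp
#include <cassert>

// Illustrative counters (not part of GCC). They are updated with compiler
// builtins so that the hooks never call a function that is itself instrumented.
static unsigned long g_enter_count = 0;
static unsigned long g_exit_count = 0;

#define NO_INSTRUMENT __attribute__((no_instrument_function))

extern "C" {

// Called just after entry to every instrumented function.
NO_INSTRUMENT void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    (void)this_fn;    // address of the function being entered
    (void)call_site;  // address it was called from
    __atomic_fetch_add(&g_enter_count, 1UL, __ATOMIC_RELAXED);
}

// Called just before exit from every instrumented function.
NO_INSTRUMENT void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    (void)this_fn;
    (void)call_site;
    __atomic_fetch_add(&g_exit_count, 1UL, __ATOMIC_RELAXED);
}

}  // extern "C"

NO_INSTRUMENT unsigned long enter_count() {
    return __atomic_load_n(&g_enter_count, __ATOMIC_RELAXED);
}

NO_INSTRUMENT unsigned long exit_count() {
    return __atomic_load_n(&g_exit_count, __ATOMIC_RELAXED);
}

// Single-threaded sanity check: a function cannot exit before it was entered.
NO_INSTRUMENT void check_counts() {
    const unsigned long exits = exit_count();
    const unsigned long enters = enter_count();
    assert(exits <= enters);
}
```

The helpers are marked no_instrument_function as well, so reading the counters does not change them. Without -finstrument-functions the hooks are ordinary functions that are never called automatically.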

Related

gcc - how to auto instrument every basic block

GCC has an auto-instrument option for function entry/exit.
-finstrument-functions Generate instrumentation calls for entry and exit to functions. Just after function entry and just before function exit, the following profiling functions will be called with the address of the current function and its call site. (On some platforms, __builtin_return_address does not work beyond the current function, so the call site information may not be available to the profiling functions otherwise.)
void __cyg_profile_func_enter (void *this_fn, void *call_site);
void __cyg_profile_func_exit (void *this_fn, void *call_site);
I would like to have something like this for every "basic block" so that I can log, dynamically, execution of every branch.
How would I do this?
There is a fuzzer called American Fuzzy Lop (AFL) that solves a very similar problem: it instruments jumps between basic blocks to gather edge coverage, i.e. if basic blocks are vertices, which jumps (edges) were taken during execution. It may be worth looking at its sources. It has three approaches:
afl-gcc is a wrapper for gcc that replaces the assembler (as) with its own wrapper, which rewrites the assembly code according to basic block labels and jump instructions
a plugin for the Clang compiler
a patch for QEMU that instruments already compiled code
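The edge-coverage idea can be sketched in a few lines. This follows the scheme described in AFL's technical notes (a hit-count map indexed by the current block ID XOR the shifted previous block ID); the names g_edge_map, g_prev_block and on_basic_block are made up for illustration, and a real tool would inject the call with a compile-time random ID at the start of each basic block rather than have you call it by hand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 64 KiB hit-count map, one byte per (hashed) edge.
constexpr std::size_t kMapSize = 1u << 16;
static std::array<std::uint8_t, kMapSize> g_edge_map{};

// ID of the previously executed block, shifted right by one bit.
static std::uint16_t g_prev_block = 0;

// Hypothetical hook: instrumentation would call this at the start of every
// basic block, passing an ID chosen at compile time.
void on_basic_block(std::uint16_t cur_block) {
    const std::size_t index =
        static_cast<std::size_t>(cur_block ^ g_prev_block);
    assert(index < g_edge_map.size());
    ++g_edge_map[index];  // count the edge prev -> cur
    // The shift keeps A->B distinguishable from B->A and A->A from B->B.
    g_prev_block = static_cast<std::uint16_t>(cur_block >> 1);
}
```

Calling on_basic_block(0x10) and then on_basic_block(0x20) increments the map at 0x10 (the edge from the start state into block 0x10) and at 0x28 (0x20 XOR 0x08, the edge from block 0x10 to block 0x20).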
Another, and probably the simplest, option may be to use the DynamoRIO dynamic instrumentation system. Unlike QEMU, it is specifically designed for implementing custom instrumentation (either by rewriting machine code by hand or by simply inserting calls, which may even be inlined automatically in some cases, if I understand the documentation correctly). If you think dynamic instrumentation is very hard, look at their examples: they are only about 100-200 lines each (but you still need to read their documentation, at least here and for the functions you use, since it may contain important points: for example, DR constructs dynamic basic blocks, which are distinct from a compiler's classic basic blocks). With dynamic instrumentation you can even instrument the system libraries your program uses. If that is not what you want, you can restrict tracing to the main module with something like:
static module_data_t *traced_module;
// in dr_client_main
traced_module = dr_get_main_module();
// in basic block event handler
app_pc pc = dr_fragment_app_pc(tag);
if (!dr_module_contains_addr(traced_module, pc)) {
    // block lies outside the main module: leave it uninstrumented
    return DR_EMIT_DEFAULT;
}

Lookup table to Function Pointer Array C++ performance

I have the following code to emulate a basic system on my PC (x86):
typedef void (*op_fn) ();
void add()
{
    //add Opcode
    //fetch next opcode
    opcodes[opcode]();
}
void nop()
{
    //NOP opcode
    //fetch next opcode
    opcodes[opcode]();
}
const op_fn opcodes[256] =
{
    add,
    nop,
    etc...
};
and I call this "table" via opcodes[opcode]()
I am trying to improve the performance of my interpreter.
What about inlining every function, like
inline void add()
inline void nop()
Are there any benefits to doing so?
Is there any way to make it go faster?
Thanks
Flagging a function as inline doesn't require the compiler to inline it; it's more of a hint than an order.
Given that you are storing the opcode handlers in an array, the compiler needs to place each function's address into the array, so it has to emit an out-of-line copy and cannot inline calls made through the table.
There's actually nothing wrong with your approach. If you really think you've got performance issues then get some metrics; otherwise don't worry (at this point!). The concept of a table of pointers to functions is nothing new: it's how C++ compilers typically implement virtual functions (i.e. the vtable).
"Inline" means "don't emit a function call; instead, substitute the function body at compile time."
Calling through a function pointer means "do a function call, the details of which won't be known until runtime."
The two features are fundamentally opposed. (The best you could hope for is that a sufficiently advanced compiler could statically determine which function is being called through a function pointer in very limited circumstances and inline those.)
switch blocks are typically implemented as jump tables, which could have less overhead than function calls, so replacing your function pointer array with a switch block and using inline might make a difference.
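For comparison, a switch-based dispatch loop might look like the sketch below. The opcode set (NOP, ADD with one immediate byte, HALT) and the Machine struct are invented for illustration and are not the asker's instruction set. Because the handlers are case bodies inside one function, the compiler is free to build a jump table and there is no call through a pointer; whether this is actually faster has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcodes for the sketch.
enum Opcode : std::uint8_t { OP_NOP = 0, OP_ADD = 1, OP_HALT = 2 };

struct Machine {
    std::int64_t acc = 0;  // accumulator
    std::size_t pc = 0;    // index of the next byte to fetch
};

// Runs the program and returns the accumulator. Stops at OP_HALT, at the end
// of the program, on an unknown opcode, or on a truncated operand.
std::int64_t run(const std::vector<std::uint8_t>& program) {
    Machine m;
    while (m.pc < program.size()) {
        const std::uint8_t opcode = program[m.pc++];
        assert(m.pc <= program.size());
        switch (opcode) {
        case OP_NOP:
            break;
        case OP_ADD:
            if (m.pc >= program.size()) return m.acc;  // missing operand
            m.acc += program[m.pc++];
            break;
        case OP_HALT:
            return m.acc;
        default:
            return m.acc;  // unknown opcode
        }
    }
    return m.acc;
}
```

For example, the program {OP_ADD, 2, OP_NOP, OP_ADD, 3, OP_HALT} leaves 5 in the accumulator. The loop also avoids the unbounded call depth that arises when every handler calls the next one through the table.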
inline is just a hint to your compiler; it does not guarantee that any inlining is done. You should read up on inlining (maybe at the ISO C++ FAQ), as too much inlining can actually make your code slower (through code bloat and the associated virtual memory thrashing).

Detouring and using a _thiscall as a hook (GCC calling convention)

I've recently been working on detouring functions (only in Linux) and so far I've had great success. I was developing my own detouring class until I found this. I modernized the code a bit and converted it to C++ (as a class of course). That code is just like any other detour implementation, it replaces the original function address with a JMP to my own specified 'hook' function. It also creates a 'trampoline' for the original function.
Everything works flawlessly but I'd like to make one simple adjustment. I program in pure C++, I use no global functions and everything is enclosed in classes (just like Java/C#). The problem is that this detouring method breaks my pattern. The 'hook' function needs to be a static/non-class function.
What I want to do is to implement support for _thiscall hooks (which should be pretty simple with the GCC _thiscall convention). I've had no success modifying this code to work with _thiscall hooks. What I want as an end result is something just as simple as this; PatchAddress(void * target, void * hook, void * class);. I'm not asking anyone to do this for me, but I would like to know how to solve/approach my problem?
From what I know, I should only need to increase the 'patch' size (i.e it's now 5 bytes, and I should require an additional 5 bytes?), and then before I use the JMP call (to my hook function), I push my 'this' pointer to the stack (which should be as if I called it as a member function). To illustrate:
push 'my class pointer'
jmp <my hook function>
Instead of just having the 'jmp' call directly/only. Is that the correct approach or is there something else beneath that needs to be taken into account (note: I do not care about support for VC++ _thiscall)?
NOTE: here is my implementation of the above mentioned code: header : source, uses libudis86
I tried several different methods. Among these was JIT compilation (using libjit), which proved successful but did not provide enough performance to be usable. Instead I turned to libffi, which is used for calling functions dynamically at run-time. The libffi library has a closure API (ffi_prep_closure_loc) which enabled me to supply my 'this' pointer to each closure generated. So I used a static callback function, converted the void pointer to my object type, and from there I could call any non-static function I wished!
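The last step (a static callback that converts a void pointer back into the object) can be illustrated without libffi. In this sketch the HookFn signature, the Detour class and invoke_hook are all hypothetical stand-ins; with libffi, the user-data pointer would instead be the one registered through ffi_prep_closure_loc.

```cpp
#include <cassert>

// Assumed C-style hook signature that carries a user-data pointer.
using HookFn = int (*)(void *user_data, int arg);

class Detour {
public:
    explicit Detour(int bias) : bias_(bias) {}

    // The member function we actually want to run as the hook.
    int onHook(int arg) {
        ++calls_;
        return arg + bias_;
    }

    // Static callback: recovers 'this' from user_data and forwards the call.
    static int thunk(void *user_data, int arg) {
        assert(user_data != nullptr);
        return static_cast<Detour *>(user_data)->onHook(arg);
    }

    int calls() const { return calls_; }

private:
    int bias_;
    int calls_ = 0;
};

// Stand-in for whatever machinery ends up invoking the hook.
int invoke_hook(HookFn fn, void *user_data, int arg) {
    return fn(user_data, arg);
}
```

With Detour d(10), the call invoke_hook(&Detour::thunk, &d, 5) reaches d.onHook(5) and returns 15.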

calling kernel32.dll function without including windows.h

If kernel32.dll is guaranteed to be loaded into a process's virtual memory, why can't I call a function such as Sleep without including windows.h?
Below is an excerpt from vividmachine.com:
5. So, what about windows? How do I find the addresses of my needed DLL functions? Don't these addresses change with every service pack upgrade?
There are multitudes of ways to find the addresses of the functions that you need to use in your shellcode. There are two methods for addressing functions; you can find the desired function at runtime or use hard coded addresses. This tutorial will mostly discuss the hard coded method. The only DLL that is guaranteed to be mapped into the shellcode's address space is kernel32.dll. This DLL will hold LoadLibrary and GetProcAddress, the two functions needed to obtain any functions address that can be mapped into the exploits process space. There is a problem with this method though, the address offsets will change with every new release of Windows (service packs, patches etc.). So, if you use this method your shellcode will ONLY work for a specific version of Windows. Further dynamic addressing will be referenced at the end of the paper in the Further Reading section.
The article you quoted focuses on getting the address of the function. You still need the function prototype of the function (which doesn't change across versions), in order to generate the code for calling the function - with appropriate handling of input and output arguments, register values, and stack.
The windows.h header provides the function prototype that you wish to call to the C/C++ compiler, so that the code for calling the function (the passing of arguments via register or stack, and getting the function's return value) can be generated.
After knowing the function prototype by reading windows.h, a skillful assembly programmer may also be able to write the assembly code to call the Sleep function. Together with the function's address, these are all you need to make the function call.
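The point that an address is only useful together with a prototype can be shown with a portable sketch. Here fake_sleep is a stand-in for a function whose address was obtained by some other means, and SleepLikeFn is an assumed prototype (one unsigned long argument, no return value); none of these names come from the Windows API.

```cpp
#include <cassert>

// Assumed prototype of the target. On 32-bit Windows the real Sleep also
// uses the __stdcall calling convention, which this portable sketch omits.
using SleepLikeFn = void (*)(unsigned long milliseconds);

static unsigned long g_last_requested_ms = 0;

// Stand-in target so the sketch runs anywhere; it only records its argument.
void fake_sleep(unsigned long ms) { g_last_requested_ms = ms; }

// Pretend this address came from an export-table lookup.
void *address_of_fake_sleep() {
    return reinterpret_cast<void *>(&fake_sleep);
}

// The raw address says nothing about arguments or return value; the cast to
// SleepLikeFn is what tells the compiler how to generate the call.
void call_at_address(void *address, unsigned long ms) {
    assert(address != nullptr);
    SleepLikeFn fn = reinterpret_cast<SleepLikeFn>(address);
    fn(ms);
}

unsigned long last_requested_ms() { return g_last_requested_ms; }
```

If the prototype in the cast does not match the real function, the call has undefined behaviour, which is why the declarations from windows.h are still needed as a reference even when the header itself is not included.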
With some black magic you can ;). There have been many custom implementations of GetProcAddress, which would let you get away without windows.h. This, however, isn't the be-all and end-all, and it could run into problems due to internal Windows changes. Another method is to use the Tool Help (toolhelp) API to enumerate the modules in the process to get kernel32.dll's base, then spelunk its PE header for the EAT (export address table) and grab the address of GetProcAddress. From there you just need function pointer prototypes to call the addresses correctly (and any structure definitions needed), which isn't too hard, merely labour-intensive if you have many functions. In fact, under Windows XP this is required to disable DEP because of differences between service packs. Of course, you need windows.h as a reference to get these prototypes; you just don't need to include it.
You'd still need to declare the function in order to call it, and you'd need to link with kernel32.lib. The header file isn't anything magic, it's basically just a lot of function declarations.
I can do it with one line of assembly and some helper functions that walk the PEB (Process Environment Block) by hard-coding the correct offsets to its members.
You'll have to start here:
static void*
JMIM_ASM_GetBaseAddr_PEB_x64()
{
    void* base_address = 0;
    unsigned long long var_out = 0;
    __asm__(
        " movq %%gs:0x60, %[sym_out] ; \n\t"
        :[sym_out] "=r" (var_out) //:OUTPUTS
    );
    //: printf("[var_out]:%d\n", (int)var_out);
    base_address=(void*)var_out;
    return( base_address );
}
Then use windbg on an executable file to inspect the data structures on your machine.
A lot of the values you'll need are hard to find and only really documented by random hackers. You'll find yourself on a lot of malware-writing sites digging for answers.
dt nt!_PEB -r @$peb
was pretty useful in windbg for getting information on the PEB.
There is a full working implementation of this in my game engine.
Just look in: /DEP/PEB2020 for the code.
https://github.com/KanjiCoder/AAC2020
I don't include <windows.h> in my game engine, yet I use "GetProcAddress" and "LoadLibraryA". It might be inadvisable to do this, but my thought was: the more moving parts, the more that can go wrong. So I figured I'd take "#define WIN32_LEAN_AND_MEAN" to its absurd conclusion and not include <windows.h> at all.

What is the purpose of __cxa_pure_virtual?

Whilst compiling with avr-gcc I have encountered linker errors such as the following:
undefined reference to `__cxa_pure_virtual'
I've found this document which states:
The __cxa_pure_virtual function is an error handler that is invoked when a pure virtual function is called.
If you are writing a C++ application that has pure virtual functions you must supply your own __cxa_pure_virtual error handler function. For example:
extern "C" void __cxa_pure_virtual() { while (1); }
Defining this function as suggested fixes the errors but I'd like to know:
what the purpose of this function is,
why I should need to define it myself and
why it is acceptable to code it as an infinite loop?
If, anywhere in the runtime of your program, an object exists whose virtual function table entry for some function has not been filled in, and that function is called, you will be calling a 'pure virtual function'.
The handler you describe should be defined in the default libraries that come with your development environment. If you happen to omit the default libraries, you will find this handler undefined: the linker sees a reference to it, but no definition. That's when you need to provide your own version.
The infinite loop is acceptable because it's a 'loud' error: users of your software will immediately notice it. Any other 'loud' implementation is acceptable, too.
1) What's the purpose of the function __cxa_pure_virtual()?
Pure virtual functions can get called during object construction/destruction. If that happens, __cxa_pure_virtual() gets called to report the error. See Where do "pure virtual function call" crashes come from?
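The mechanism can be seen safely with a non-pure function. In this sketch (the Base, Derived and demo names are illustrative) the base constructor makes a virtual call indirectly, and the call resolves to the base version because the derived part does not exist yet. If Base::name were declared pure (= 0), that same call would have no body to land on, and with GCC's runtime it would end up in __cxa_pure_virtual instead.

```cpp
#include <cassert>
#include <string>

struct Base {
    std::string seen_in_ctor;

    // Indirect virtual call during construction: the dynamic type is still
    // Base here, so Base::name() runs even when a Derived is being built.
    Base() { seen_in_ctor = describe(); }
    virtual ~Base() = default;

    virtual const char *name() const { return "Base"; }

    // Non-virtual helper that performs the virtual call.
    std::string describe() const { return name(); }
};

struct Derived : Base {
    const char *name() const override { return "Derived"; }
};

void demo() {
    Derived d;
    assert(d.seen_in_ctor == "Base");    // resolved during Base's constructor
    assert(d.describe() == "Derived");   // resolved after construction
}
```

The indirection through describe() matters: compilers usually diagnose a direct call to a pure virtual function from a constructor, but they cannot see through a helper, which is how the error reaches run time.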
2) Why might you need to define it yourself?
Normally this function is provided by libstdc++ (e.g. on Linux), but avr-gcc and the Arduino toolchain don't provide a libstdc++.
The Arduino IDE manages to avoid the linker error when building some programs because it compiles with the options "-ffunction-sections -fdata-sections" and links with "-Wl,--gc-sections", which drops some references to unused symbols.
3) Why is it acceptable to code __cxa_pure_virtual() as an infinite loop?
Well, this is at least safe; it does something predictable. It would be more useful to abort the program and report the error. An infinite loop would be awkward to debug, though, unless you have a debugger that can interrupt execution and give a stack backtrace.