My goal is to determine all the possible exit points from an LLVM function. I know that terminator instructions end basic blocks, either to exit the function or to branch to another part of the same function. Among the terminator instructions, I am clear on most of them:
ret and resume exit the function
br, switch, indirectbr branch to other blocks in the same function
invoke and catchswitch are related to exception control flow, and also should not exit the function
(unreachable can be ignored for this purpose)
I would like to seek clarification on catchret and cleanupret. I have compiled example exception handling code (clang++ on Mac and Ubuntu) and do not see these instructions in the compiled LLVM IR. Are these only used for specific ABIs?
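To make the classification above concrete, here is a minimal sketch of it as a lookup function. It models the opcodes as plain strings and does not use the LLVM API; `ExitKind` and `classify_terminator` are hypothetical names made up for illustration. The entries for invoke and catchswitch simply mirror the assumption stated above, and catchret and cleanupret are left as the open question.

```cpp
#include <string>

// Hypothetical model of the list in the question; not LLVM API.
enum class ExitKind { ExitsFunction, StaysInFunction, Ignored, Unclear };

ExitKind classify_terminator(const std::string &opcode) {
    if (opcode == "ret" || opcode == "resume")
        return ExitKind::ExitsFunction;
    if (opcode == "br" || opcode == "switch" || opcode == "indirectbr")
        return ExitKind::StaysInFunction;
    // As assumed in the question, not verified here.
    if (opcode == "invoke" || opcode == "catchswitch")
        return ExitKind::StaysInFunction;
    if (opcode == "unreachable")
        return ExitKind::Ignored;
    // catchret, cleanupret: the part the question asks about.
    return ExitKind::Unclear;
}
```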
I would like to call ARM/ARM64 assembly code from C++. The assembly code contains a syscall and a relocation to an external function.
The ARM architecture is not so important here; I just want to understand how to solve my problem conceptually.
I have the following syscall wrapper (output from objdump -d), which is called inside a shared library:
198: d28009e8 mov x8, #0x4f // #79
19c: d4000001 svc #0x0
1a0: b140041f cmn x0, #0x1, lsl #12
1a4: da809400 cneg x0, x0, hi
1a8: 54000008 b.hi 0 <__set_errno_internal>
1ac: d65f03c0 ret
This piece of code calls the fstatat64 syscall and sets errno through the external __set_errno_internal function.
readelf -r shows the following relocation for the __set_errno_internal function:
00000000000001a8 R_AARCH64_CONDBR19 __set_errno_internal
I want to call this piece of code from C++, so I converted it to a buffer:
unsigned char machine_code[] __attribute__((section(".text"))) =
"\xe8\x09\x80\xd2"
"\x01\x00\x00\xd4"
"\x1f\x04\x40\xb1"
"\x00\x94\x80\xda"
"\x08\x00\x00\x54" // Here we have mentioned relocation
"\xc0\x03\x5f\xd6";
EDIT: Important detail - I chose to use a buffer (not inline assembly etc.) because I want to run extra processing on this buffer (for example, a decryption function on a string literal as a software protection mechanism, but that's not important here) before it gets evaluated as machine code.
Afterwards, the buffer can be cast to a function pointer and called directly to execute the machine code. Obviously there is a problem with the relocation: it is not fixed up automatically, so I have to fix it manually. But I can't do that at run time because the .text section is read-only and executable.
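For reference, this is roughly the arithmetic that resolving an R_AARCH64_CONDBR19 relocation involves: the branch distance in units of 4 bytes goes into the 19-bit immediate at bits [23:5] of the B.cond instruction. The sketch below is only an illustration of that computation; `apply_condbr19` is a made-up helper name, it does no range checking, and it says nothing about where or when the patch could legally be applied.

```cpp
#include <cstdint>

// Returns `insn` (an AArch64 B.cond encoding) with its imm19 field rewritten
// so that, when placed at address `place`, it branches to `target`.
// Assumes (target - place) is a multiple of 4 and fits in +/-1 MiB.
std::uint32_t apply_condbr19(std::uint32_t insn,
                             std::uint64_t place,
                             std::uint64_t target) {
    const std::int64_t delta = static_cast<std::int64_t>(target - place);
    // Instruction-count offset, truncated to 19 bits (two's complement).
    const std::uint32_t imm19 =
        static_cast<std::uint32_t>(delta / 4) & 0x7FFFFu;
    // Clear bits [23:5], then insert the new immediate.
    return (insn & ~(0x7FFFFu << 5)) | (imm19 << 5);
}
```

With the unresolved instruction word 0x54000008 from the dump above, a target 8 bytes ahead yields 0x54000048.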
Although I have almost full control over the source code, I must not turn off stack protection and other features to make that section writable (don't ask why). So it seems that the relocation fix-up has to be performed at link time somehow. As far as I know, a shared library contains relative offsets (for similar external function calls) after the linker has resolved the relocations, and the binary *.so file should contain the correct offsets (without any need for run-time relocation work), so fixing up the machine_code buffer during linking should be possible.
I'm using a manually built Clang 7 compiler and I have full control over the LLVM passes, so I thought it might be possible to write some kind of LLVM pass that executes at link time. However, it looks like ld is called at the end, so maybe LLVM passes will not help here (I'm not an expert on this).
Different ideas would also be appreciated.
As you can see, the problem is pretty complicated. Maybe you have some directions or ideas on how to solve it? Thanks!
There's already a working, packaged mechanism to handle relocations. It's called dlsym(). While it doesn't directly give you a function pointer, all major C++ compilers support reinterpret_casting the result of dlsym to any ordinary function pointer. (Member functions are another issue altogether, but that's not relevant here)
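A minimal sketch of that pattern, assuming a POSIX system with `<dlfcn.h>`: it looks up `strlen` in the running program, which is only a stand-in symbol chosen so the example is self-contained. `lookup_strlen` and `strlen_fn` are illustrative names.

```cpp
#include <dlfcn.h>
#include <cstddef>

using strlen_fn = std::size_t (*)(const char *);

// Resolves a symbol by name at run time and casts it to a function pointer.
// "strlen" is a placeholder; any exported function name could be used.
strlen_fn lookup_strlen() {
    void *self = dlopen(nullptr, RTLD_LAZY);  // handle for the main program
    if (self == nullptr) {
        return nullptr;
    }
    void *sym = dlsym(self, "strlen");        // nullptr if not found
    return reinterpret_cast<strlen_fn>(sym);
}
```

On older glibc versions this needs to be linked with -ldl.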
GCC has an auto-instrumentation option for function entry/exit.
-finstrument-functions
Generate instrumentation calls for entry and exit to functions. Just after function entry and just before function exit, the following profiling functions will be called with the address of the current function and its call site. (On some platforms, __builtin_return_address does not work beyond the current function, so the call site information may not be available to the profiling functions otherwise.)
void __cyg_profile_func_enter (void *this_fn,
void *call_site);
void __cyg_profile_func_exit (void *this_fn,
void *call_site);
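As a point of comparison, a minimal sketch of user-supplied hooks for this option might look like the following. The counters `g_enters` and `g_exits` are illustrative; real hooks would more likely log the addresses. The hooks are only called automatically when the rest of the program is compiled with -finstrument-functions.

```cpp
// Illustrative counters; not part of GCC's interface.
static unsigned long g_enters = 0;
static unsigned long g_exits = 0;

extern "C" {

// The hooks themselves must not be instrumented, or they would recurse.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    (void)this_fn;
    (void)call_site;
    __atomic_fetch_add(&g_enters, 1UL, __ATOMIC_RELAXED);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    (void)this_fn;
    (void)call_site;
    __atomic_fetch_add(&g_exits, 1UL, __ATOMIC_RELAXED);
}

}  // extern "C"
```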
I would like to have something like this for every "basic block" so that I can log, dynamically, execution of every branch.
How would I do this?
There is a fuzzer called American Fuzzy Lop. It solves a very similar problem: instrumenting jumps between basic blocks to gather edge coverage (if basic blocks are vertices, which jumps, i.e. edges, were encountered during execution). It may be worth looking at its sources. It has three approaches:
afl-gcc is a wrapper for gcc that replaces as with a wrapper which rewrites the assembly code according to basic block labels and jump instructions
a plugin for the Clang compiler
a patch for QEMU for instrumenting already compiled code
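The edge-coverage idea can be sketched in a few lines. The update rule below follows the one described in AFL's technical notes (XOR the current block ID with the shifted previous one and bump a counter in a 64 KiB map); the names `on_basic_block`, `g_edge_map`, and `g_prev_location` are made up for this sketch, and in AFL the per-block IDs are random constants chosen at compile time.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMapSize = 1u << 16;     // 64 KiB hit-count map
static std::uint8_t g_edge_map[kMapSize];
static std::uint16_t g_prev_location = 0;

// Called at the start of every basic block with that block's ID.
void on_basic_block(std::uint16_t cur_location) {
    // The (previous, current) pair identifies the edge just taken.
    g_edge_map[cur_location ^ g_prev_location]++;
    // Shifting keeps A->B distinct from B->A and A->A distinct from B->B.
    g_prev_location = static_cast<std::uint16_t>(cur_location >> 1);
}
```

A logging variant would replace the counter increment with a write to a trace buffer.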
Another, and probably the simplest, option may be to use the DynamoRIO dynamic instrumentation system. Unlike QEMU, it is specifically designed for implementing custom instrumentation (either by rewriting machine code by hand or by simply inserting calls, which may even be inlined automatically in some cases, if I read the documentation right). If you think dynamic instrumentation is something very hard, look at their examples: they are only about 100-200 lines each (but you still need to read their documentation, at least here and for the functions you use, since it may contain important points: for example, DR constructs dynamic basic blocks, which are distinct from a compiler's classic basic blocks). With dynamic instrumentation you can even instrument the system libraries the program uses. In case that is not what you want, you can restrict instrumentation to the main module with something like
static module_data_t *traced_module;
// in dr_client_main
traced_module = dr_get_main_module();
// in basic block event handler
void *app_pc = dr_fragment_app_pc(tag);
if (!dr_module_contains_addr(traced_module, app_pc)) {
return DR_EMIT_DEFAULT;
}
I'm importing some stack-tracing C code (found somewhere on Stack Overflow) into my code to trace where memory blocks have been allocated:
struct layout
{
struct layout *ebp;
void *ret;
};
struct layout *fr;
__asm__("movl %%ebp, %[fp]" : /* output */ [fp] "=r" (fr));
for (int i=1 ; i<8 && (unsigned char*) fr > dsRAM; i++) {
x[i] = (size_t) fr->ret;
fr = fr->ebp;
}
Things work fairly well, except that in some calls, the code is missing some functions near the top of the stack, e.g. GDB will report:
malloc() at main.cpp
operator new() from libstdc++.so.6
TestBasicScript() at BasicScript.cpp
main() at main.cpp
While the code fills x[] with the addresses of malloc, new operator and main(), missing TestBasicScript.
The code got compiled by g++ 4.5.1 (old devkit for homebrew console programming) with the following flags:
CFLAGS += -I libgeds/source/ -I wrappers -I $(DEVKITPRO)/include -DARM9 \
-include wrappers/nds/system.h -include wrappers/fake.h
CFLAGS += -m32 -Duint=uint32_t -g -Wall -Weffc++ -fno-omit-frame-pointer
I tried to use __builtin_return_address() instead, but I get pretty much the same result with much longer code.
EDIT: I noted that I'm systematically missing the caller of operator new, which could be explained if the code of _Znwj doesn't set up a stack frame. So the list of questions becomes:
How does GDB manage to find that TestBasicScript() function call if it's not in the stack frames list?
How do I configure the linking steps so that a debug-friendly variant of libstdc++ (if any) is used?
The original sub-question "Are there compile-time options that guarantee I can trace 100% of the calls to my malloc clone?" is thus answered by @chqrlie: -O0 is all I should need. But it will be effective only if applied to all my binaries, shared libraries included.
There are many reasons why some frames might be omitted, like for example inlining and optimization (although the provided CFLAGS do not contain optimization flags and the default is AFAIK no optimization).
Anyway, with GCC there is built-in support for stack walking: backtrace() and backtrace_symbols(), perhaps combined with abi::__cxa_demangle(). You can try those as well.
Another option is to use libunwind. I tried it as well, with quite good results (and in its source code you can see some useful techniques for in-app stack walking).
None of the above usually works very well with optimized (release) executables. In particular, if they do not contain the debug info (although it might have been generated and stored separately), the printed stack will be useless (besides the frames skipped because of the optimization).
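A minimal sketch of those calls, assuming glibc's `<execinfo.h>` and the Itanium C++ ABI header `<cxxabi.h>` are available. `capture_depth` and `demangle` are illustrative helper names. The demangling example uses `_Znwj`, the mangled name of operator new mentioned in the question.

```cpp
#include <cxxabi.h>
#include <execinfo.h>
#include <cstdlib>
#include <string>

// Number of return addresses backtrace() could collect (at most 16 here).
int capture_depth() {
    void *frames[16];
    return backtrace(frames, 16);
}

// Demangles a C++ symbol name; returns the input unchanged on failure.
std::string demangle(const char *mangled) {
    int status = 0;
    char *out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the buffer was allocated with malloc()
    return result;
}
```

Passing the collected addresses to backtrace_symbols() would give printable strings, which on many systems need -rdynamic at link time to include function names.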
An ultimate technique which works even for optimized code is generating a core dump. There you have all the information about the stack (the binary itself does not need to contain the debuginfo, it just can be left aside and only used for examining the core offline), and as a bonus values of all variables on the stack, information about all threads currently running etc.
For tracing memory allocations it is probably an overkill (it is also quite slow), but sometimes it can be pretty useful. In one of my projects I created a working implementation of such core dumper which is still present in the production code.
Note that you can actually generate a core dump of the app without terminating the application - the implementation I created basically works as follows:
fork() the process at the point where the core dump should be generated
the child process calls abort() to generate the core dump (the call stack of the forked process is the same as the original process), i.e. only the forked process is terminated by the abort()
the original parent process uses waitpid() to wait until the child process generates the core dump and terminates (with a guard counter to not wait forever)
then the original process continues running (and writes to the log that the diagnostic core has been generated along with the PID of the forked process which was used to generate the core)
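The steps above can be sketched as follows. This is a simplified illustration, not the production implementation mentioned here: `dump_core_via_fork` and the polling parameters are assumptions, and whether a core file is actually written depends on system settings such as the core size limit.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks, lets the child abort() to trigger a core dump, and waits for it
// with a guard counter. Returns true if the child died from SIGABRT.
bool dump_core_via_fork(int max_polls = 500) {
    const pid_t child = fork();
    if (child < 0) {
        return false;                  // fork failed
    }
    if (child == 0) {
        std::abort();                  // child only: same stack as the parent
    }
    for (int i = 0; i < max_polls; ++i) {
        int status = 0;
        const pid_t r = waitpid(child, &status, WNOHANG);
        if (r == child) {
            return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
        }
        if (r < 0) {
            return false;              // waitpid error
        }
        usleep(10 * 1000);             // 10 ms between polls
    }
    return false;                      // gave up waiting
}
```

The parent keeps running after the call returns, so the result can simply be written to the log together with the child's PID.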
This turned out to work pretty well in some situations where a diagnostic stack trace was required for release production application.
EDIT: Another option, which I also tried, is using ptrace() (if I remember well, that is also one of the techniques used by the libunwind mentioned above, and actually also by GDB). It works in a similar way: spawn a child process with fork() and call ptrace(PTRACE_TRACEME) in it; the parent process can then issue various ptrace() calls to examine the stack of the child (which happens to be the same as the stack of the parent at the point of fork()). I think the libunwind source code contains its use, so you can examine it there.
The compiler may not always generate a stack frame with %ebp pointing to the previous frame. For some functions, it may generate code that uses %esp-based addressing to retrieve the arguments; for others, it may perform tail-call optimization, using a jump instead of a call/ret sequence. The stack trace as you are trying to scan it may therefore be incomplete.
Try compiling the whole project with optimisation disabled (-O0).
Following up on this question, is it possible for llvm to generate code that may jump to an arbitrary address within a function in the same address space?
i.e.
void func1() {
...
<code that jumps to addr2>
...
}
void func2() {
...
addr2:
<some code in func2()>
...
}
Yes, no, yes, no, (yes). It depends on the level you look at and what you mean by "possible":
Yes, as the LLVM backend will produce target-specific assembler instructions, and those instructions allow setting the program counter to an arbitrary value.
No, because, as far as I know, LLVM IR (the intermediate representation into which a frontend like clang compiles your C code) doesn't have any instructions that would allow arbitrary jumps between (LLVM IR) functions.
Yes, because the frontend COULD certainly produce code that simulates that behaviour (by breaking up func2 into multiple separate functions).
No, because C and C++ don't allow such jumps to ARBITRARY positions, and so clang will not compile any program that tries to do that (e.g. via goto).
(Yes) The C longjmp function jumps back to a place in the control flow that you have already visited (where you called setjmp) and also restores (most of) the system state. EDIT: However, this is UB if func2 isn't somewhere up in the current call stack from where you jump.
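A minimal sketch of the well-defined setjmp/longjmp case, where the jump target is still on the call stack. The function names and the global `g_env` are illustrative.

```cpp
#include <csetjmp>

static std::jmp_buf g_env;

// Jumps back to wherever setjmp(g_env) was last called.
static void deep_callee() {
    std::longjmp(g_env, 1);
}

// Returns true if control came back through longjmp.
bool jump_back_demo() {
    if (setjmp(g_env) == 0) {
        // First pass: setjmp returned directly.
        deep_callee();
        return false;  // not reached
    }
    // Second pass: we arrived here from longjmp in deep_callee().
    return true;
}
```

This works because jump_back_demo() is still active when deep_callee() jumps. Jumping into a function that has already returned is the undefined case described above.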
Our project uses a few boost 1.48 libraries on several platforms, including Windows, Mac, Android, and IOS.
We are able to consistently get the IOS version of the project to crash (nontrivially but reliably), and
from our investigation we see that ~thread_data_base is being called on the thread's thread_info while its thread is still running.
This seems to happen as a result of the smart pointer reaching a zero count, even though it is obviously still
in scope in the thread_proxy function which creates it and runs the requested function in the thread.
This seems to happen in various cases - the call stack is not identical between crashes, though there are a few
variations which are common.
Just to be clear - this often requires running code which is creating hundreds of threads, though there are
never more than about 30 running simultaneously. I have "been lucky" and got it very very early in the
run also, but that's rare.
I created a version of the destructor which actually catches the code red-handed:
in libs/thread/src/pthread/thread.cpp:
thread_data_base::~thread_data_base()
{
boost::detail::thread_data_base* const thread_info=detail::get_current_thread_data();
void *void_thread_info = (void *) thread_info;
void *void_this = (void *) this;
// is somebody destructing the thread_data other than its own thread?
// (remember that its own which should no longer point to it anyway,
// because of the call to detail::set_current_thread_data(0) in thread_proxy)
if (void_thread_info) { // == void_this) {
__builtin_trap();
}
}
I should note that (as seen from the commented-out code) I had previously checked to see that void_thread_info == void_this because I
was only checking for the case where the thread's current thread_info was killing itself.
I have also seen cases where the value returned by get_current_thread_data is non-zero and
different from "this", which is really weird.
Also when I first wrote that version of the code, I wrote:
if (((void*)thread_info) == ((void*)this))
and at run-time I got some very weird exception that said something about a virtual function table
or something like that - I don't remember. I decided that it was trying to call "==" for this object type
and was unhappy with that, so I rewrote as above, putting the conversions to void * as separate
lines of code. That in itself is quite suspicious to me. I am not one to rush to blame compilers, but...
I should also note that when we did catch this happening in the trap, we saw the destructor for
~shared_count appear twice consecutively on the stack in Xcode source. Very doubleweird.
We tried to look at the disassembly, but couldn't make much out of it.
Again - it looks like this is always a result of the shared_count which seems to be owned by
the shared_ptr which owns the thread_info reaching zero too early.
Update: it seems that it is possible to get into situations which reach the above trap without the situation doing any harm. Since fixing the issue (see answer) I have seen it happen, but always after thread_info->run() has finished executing. Don't yet understand how...but it's working.
Some additional info:
I should note that the boost.sh from Pete Goodliffe (and modified by others) that is commonly used to compile boost for IOS
has the following note in the header:
: ${EXTRA_CPPFLAGS:="-DBOOST_AC_USE_PTHREADS -DBOOST_SP_USE_PTHREADS"}
# The EXTRA_CPPFLAGS definition works around a thread race issue in
# shared_ptr. I encountered this historically and have not verified that
# the fix is no longer required. Without using the posix thread primitives
# an invalid compare-and-swap ARM instruction (non-thread-safe) was used for the
# shared_ptr use count causing nasty and subtle bugs.
#
# Should perhaps also consider/use instead: -BOOST_SP_USE_PTHREADS
I use those flags, but to no avail.
I found the following which is very tantalizing - it looks like they had the same issue in std::thread:
http://llvm.org/bugs/show_bug.cgi?format=multiple&id=12730
That was suggestive of using an alternate implementation inside boost for arm processors which seems also to directly address this issue:
spinlock_gcc_arm.hpp
The version included with boost 1.48 uses outdated arm assembly.
I took the updated version from boost 1.52, but I'm having trouble compiling it.
I get the following error:
predicated instructions must be in IT block
I found a reference to what looks to be a similar use of this instruction here:
https://zeromq.jira.com/browse/LIBZMQ-414
I was able to use the same idea to get the 1.52 code to compile by modifying
the code as follows (I inserted an appropriate IT instruction)
__asm__ __volatile__(
"ldrex %0, [%2]; \n"
"cmp %0, %1; \n"
"it ne; \n"
"strexne %0, %1, [%2]; \n"
BOOST_SP_ARM_BARRIER :
"=&r"( r ): // outputs
"r"( 1 ), "r"( &v_ ): // inputs
"memory", "cc" );
But in any case, there are ifdefs in this file which look for the arm architecture, which is not defined that way in my environment. After I simply edited the file so that only ARM 7 code
was left, the compiler complains about the definition of BOOST_SP_ARM_BARRIER:
In file included from ./boost/smart_ptr/detail/spinlock.hpp:35:
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:39:13: error: instruction requires a CPU feature not currently enabled
BOOST_SP_ARM_BARRIER :
^
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:13:32: note: expanded from macro 'BOOST_SP_ARM_BARRIER'
# define BOOST_SP_ARM_BARRIER "dmb"
Any ideas??
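For comparison, the ldrex/cmp/strexne sequence shown above is a try-lock: it stores 1 into the lock word only if the word was not already 1, and the barrier orders later accesses after it. A portable sketch with the same observable behaviour, using std::atomic instead of inline assembly, could look like this. It is a swapped-in technique for illustration, not Boost's code, and `SketchSpinlock` is a made-up name.

```cpp
#include <atomic>

// Illustrative stand-in for an assembly-based spinlock; not Boost's class.
class SketchSpinlock {
public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        return v_.exchange(1, std::memory_order_acquire) == 0;
    }

    // Releases the lock; earlier writes become visible to the next owner.
    void unlock() {
        v_.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> v_{0};
};
```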
Figured this out. It turns out that the boost.sh script that I mention in the question chose the incorrect boost flag to address this problem - instead of BOOST_SP_USE_PTHREADS (and the other flag there with it, BOOST_AC_USE_PTHREADS) it turns out that what is needed on IOS is BOOST_SP_USE_SPINLOCK. This ends up giving pretty much the identical solution used in the std::thread issue referred to in the question.
If you are compiling for any modern IOS device which uses ARM 7, but using an older boost (we are using 1.48), you need to copy the file spinlock_gcc_arm.hpp from a more recent boost (like 1.52). That file is #ifdef'd for the different arm architectures, but it is not clear to me that the defines it is looking for are defined in the IOS compile environment using the script. So you can either edit the file (violent but effective) or invest some time to figure out how to make this tidy and correct.
In any case, you may need to insert the extra assembly instruction that I did above in the question:
"it ne; \n"
I have not yet gone back to see if I can delete that now that I have my compile environment working properly.
However, we're not done yet. The code used in boost for this option includes, as discussed, ARM assembly language instructions. The ARM chips support two instruction sets which can't be mixed in a given module (not sure of the scope, but evidently file by file is an acceptable granularity when compiling). The instructions used in boost for this locking include non-Thumb instructions, but IOS by default uses the Thumb instruction set. The boost code, aware of the instruction set issue, checks to see that you have arm enabled but not thumb, but by default in IOS, thumb is on.
Getting the compiler to generate non-thumb ARM code depends on which compiler you are using in IOS - Apple's LLVM or LLVM GCC. GCC is deprecated, and Apple's LLVM is the default when you use Xcode.
For the default Clang + Apple LLVM 4.1, you need to compile using the -mno-thumb flag. Any files in your IOS app which use any part of boost which uses smart pointers will also have to be compiled using -mno-thumb.
To compile boost like this, I think you can just add -mno-thumb to the EXTRA_CPPFLAGS in the script. (I modified the user-config.jam directly while experimenting and haven't yet gone back to clean up.)
For your app, in Xcode you need to select your target, then go into the Build Phases tab, and there select Compile sources. There you have the option of adding compile flags, so for each relevant file (which includes boost), add the -mno-thumb flag. You can do this directly in project.pbxproj also where each file has
settings = { COMPILER_FLAGS = ""; };
you just change this to
settings = { COMPILER_FLAGS = "-mno-thumb"; };
But there's a little more. You also have to modify the darwin.jam file in the tools/build/v2/tools directory. In boost 1.48, there is code that says:
case arm :
{
options = -arch armv6;
}
This has to be modified to
case arm :
{
options = -arch armv7 ;
}
Finally, in the boost.sh script, in the function writeBjamUserConfig(), you should remove the references to -arch armv6.
If somebody knows how to do this a little more generally and cleanly, I'm sure we'd all benefit. For now, this is where I've gotten to, and I hope that this will help other IOS boost threads users. I hope that the various variants on the boost.sh IOS script out there will be updated. I plan to add some more links to this answer later.
Update: For a great article which describes the issue on the processor level,
see here:
http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu
Enjoy!
I use boost.asio, boost.thread, boost.smart_ptr etc. on the iOS platform, and the app always crashes with a SIGABRT signal when run in release mode. The crash call stack is:
__stack_chk_fail
boost::asio::detail::completion_handle
boost::asio::detail::task_ios_service_operation::complete
boost::asio::detail::task_io_service::do_run_one
boost::asio::detail::task_ios_service::run
boost::asio::io_service::run
![when create a asio work with creating new thread and io_service][1]
When trying to solve the problem, I found the following articles:
[boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-build][2]
[The issue of spin_lock and thumb on iOS][3]
Then I tried adding -mno-thumb to my project's compile flags, and the problem that occurred in release mode was gone.
However, a new bug appeared: EXC_ARM_DA_ALIGN, which crashed at the point where I try to convert network data to host byte order.
As [this article][4] says, ARM instructions require that the data in memory be aligned.
Following the article [Exc_arm_da_align][5], I fixed it by using memcpy for the data conversion instead of converting directly through the pointer.
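A minimal sketch of that fix, assuming a POSIX system for ntohl(). `read_be32` is an illustrative helper name. Copying the bytes into a properly aligned local variable avoids dereferencing a misaligned uint32_t pointer.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

// Reads a 32-bit big-endian (network order) value from a byte buffer that
// may start at any address, aligned or not.
std::uint32_t read_be32(const unsigned char *p) {
    std::uint32_t net;
    std::memcpy(&net, p, sizeof net);  // byte-wise copy, no alignment needed
    return ntohl(net);                 // network to host byte order
}
```

The crashing variant would be something like `ntohl(*(const uint32_t *)p)` with p not 4-byte aligned.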
[1]: http://i.stack.imgur.com/3ijF4.png
[2]: http://stackoverflow.com/questions/4201262/boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-builds/4245821#4245821
[3]: http://groups.google.com/group/boost-list/browse_thread/thread/7dc1e80659182ab3
[4]: https://brewx.qualcomm.com/bws/content/gi/common/appseng/en/knowledgebase/docs/kb95.html
[5]: http://www.cnblogs.com/unionfind/archive/2013/02/25/2932262.html