Segmentation fault in multithreaded program and incomplete information in gdb backtrace - C++

I am writing a program which uses both OS threads and user threads (fibers; I wrote this user-threading library myself, with context switching done in assembly language). The problem is that the program sometimes ends with a segmentation fault, but other times it doesn't.
The problem is due to a function being called with invalid arguments when it shouldn't be called at all. I think the gdb backtrace isn't giving the proper information. Here is the output from gdb:
#0 0x0000000000000000 in ?? ()
#1 0x0000555555555613 in thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
#2 0x000055555555c791 in start_thread () at contextSwitch2.s:57
#3 0x0000000000000000 in ?? ()
fn is the function I want to run as a user thread; arg is the argument passed to that function.
I have a function Spawn in my user-threading library which pushes the two arguments (fn and arg) and a pointer to start_thread onto the new thread's stack; thus start_thread, an assembly function, gets called, which in turn calls the C++ function thread_entry to invoke fn with argument arg.
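In simplified form, the mechanism works roughly like the sketch below (the layout and the Fiber type are a simplification for illustration, not the repository code). Note how a null fn planted in the seeded stack, or a later overwrite of those slots, would produce exactly the thread_entry (fn=0x0, arg=0x0) frame from the backtrace:

#include <cstdint>
#include <cstdio>

using Fn = void (*)(void *);

// Illustrative sketch only -- the real code lives in userThread2.cpp and
// contextSwitch2.s; names and layout here are simplified guesses.
struct Fiber {
    uint64_t *sp;                      // saved stack pointer for the context switch
    alignas(16) uint64_t stack[1024];  // the fiber's private stack
};

void thread_entry(Fn fn, void *arg) { fn(arg); }  // crashes if fn == nullptr

void Spawn(Fiber &f, Fn fn, void *arg) {
    uint64_t *sp = f.stack + 1024;            // start at the top of the new stack
    *--sp = reinterpret_cast<uint64_t>(arg);  // consumed by start_thread
    *--sp = reinterpret_cast<uint64_t>(fn);   // consumed by start_thread
    *--sp = 0;  // in the real code: the address of start_thread, so that the
                // first context switch "returns" into it
    f.sp = sp;
}

int main() {
    Fiber f;
    Spawn(f, [](void *a) { std::printf("fiber ran, arg=%p\n", a); }, nullptr);
    // Simulate what start_thread does after the first switch: pop fn and arg
    // off the seeded stack and hand them to thread_entry.
    thread_entry(reinterpret_cast<Fn>(f.sp[1]),
                 reinterpret_cast<void *>(f.sp[2]));
}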
I am not expecting a call to start_thread or thread_entry at the point of the error, so I am not sure how start_thread gets called. Even if it does get called, Spawn() must have been the caller, as it is the only function which calls start_thread; yet Spawn is not shown in the gdb backtrace.
Some online posts mention stack corruption or something similar as a possible cause of errors like this, and they prescribe the use of "record btrace pt". I have spent considerable time setting up Intel Processor Trace support in the kernel and gdb, but I was unable to get it working, so I am not going down that route.
Here is a link to my code with compilation instructions:
https://github.com/smartWaqar/userThreading

I set a breakpoint on thread_entry, and observed:
...
[Thread 0x7ffff7477700 (LWP 203995) exited]
parentId: 1
OST 1 Hello A0 on CPU 2
current_thread_num 0 next_thread_num 1
After Thread Exit
After changeOSThread
OST 1 Hello C1 on CPU 2 ---------------
Before changeOSThread
**************** In changeOSThread **************
current_thread_num 1 next_thread_num 2
Thread 3 "a.out" hit Breakpoint 1, thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
243 fn(arg) ;
(gdb) bt
#0 thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
#1 0x000055555555c181 in start_thread () at context.s:57
#2 0x0000000000000000 in ?? ()
Conclusions:
GDB is giving you the correct crash stack trace.
You do in fact call thread_entry with fn==0, which of course promptly crashes.
There is something racy going on, as this doesn't happen every time.
Even if it gets called then Spawn() should have called start_thread as it is the only function which calls start_thread
I've observed the following "call" to start_thread:
Thread 2 "a.out" hit Breakpoint 1, start_thread () at context.s:53
53 push %rbp
(gdb) bt
#0 start_thread () at context.s:53
#1 0x0000555555555e4f in changeOSThread (parentId=<error reading variable>) at t.cc:196
#2 0x0000000000000000 in ?? ()
So I think your mental model of who calls start_thread is wrong.
This is a bit too much code for me to look at. If you want additional help, please reduce the test case to the bare minimum.

Related

What is __lll_lock_wait_private and what can cause a hang while malloc_consolidate is called?

I have used two threads, but they are getting stuck with the following stack traces:
Thread 2:
(gdb) bt
#0 0x00007f9e1d7625bc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f9e1d6deb35 in _L_lock_17166 () from /lib64/libc.so.6
#2 0x00007f9e1d6dbb73 in malloc () from /lib64/libc.so.6
#3 0x00007f9e1d6c4bad in __fopen_internal () from /lib64/libc.so.6
#4 0x00007f9e1dda2210 in std::__basic_file<char>::open(char const*, std::_Ios_Openmode, int) () from /lib64/libstdc++.so.6
#5 0x00007f9e1dddd5ba in std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode) () from /lib64/libstdc++.so.6
#6 0x00000000005e1244 in fatalSignalHandler(int, siginfo*, void*) ()
#7 <signal handler called>
#8 0x00007f9e1d6d6839 in malloc_consolidate () from /lib64/libc.so.6
#9 0x00007f9e1d6d759e in _int_free () from /lib64/libc.so.6
_int_free is getting called as the result of a default destructor.
Thread 1:
(gdb) bt
#0 0x00007f9e2a4ed54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f9e2a4e8e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f9e2a4e8d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
Via Threads getting stuck with few threads at point "in __lll_lock_wait" I learned that __lll_lock_wait() is called when we cannot acquire a lock on a mutex because something else (in this case, I guess, Thread 2) is still holding it.
But Thread 2 is also stuck, with the stack trace shown above, and since the libraries come without debug symbols, I can't check who owns the mutex. So my questions are:
What is the use of / cause of __lll_lock_wait_private()?
Is there any hint as to what and where the issue could be, without debug symbols being available?
Several times I have seen hangs in malloc_consolidate() on Linux. Is this a well-known, yet-to-be-solved issue?
Frames 6 and 7 of thread 2 suggest a custom signal handler was installed. Frame 5 suggests it is trying to do something like write to a file (std::ofstream?).
That is not allowed. Very little is allowed in signal handlers, and definitely not iostreams.
Suppose you are in a function like malloc_consolidate, which may have to touch the global arena and take a lock to do it, and a signal comes along. If you then allocate memory in the signal handler, you need that same lock, which is already held. Thread 2 is deadlocking itself.
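A minimal sketch of the anti-pattern (the handler name is taken from frame #6 of the trace; everything else is assumed for illustration):

#include <csignal>
#include <fstream>

// Anti-pattern sketch: this handler allocates (ofstream -> filebuf -> malloc).
// If the signal interrupts malloc_consolidate while this thread already holds
// the arena lock, the handler blocks on that same lock forever.
void fatalSignalHandler(int sig, siginfo_t *, void *) {
    std::ofstream log("crash.log");  // NOT async-signal-safe
    log << "fatal signal " << sig << '\n';
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = fatalSignalHandler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
    // A safe handler restricts itself to async-signal-safe calls such as
    // write(2), or just sets a flag and calls _exit().
}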

How does calling zmq_poll() lead to calling zmq_abort()?

I'm using zmq version 4.2.2. My program crashes because of a call to zmq_abort(), which calls abort(). According to the stack trace, if I understand correctly, zmq_abort() is called from src/socket_poller.cpp:54. However, that line is the beginning of the function definition:
zmq::socket_poller_t::~socket_poller_t ()
The function does not contain direct calls to zmq_abort() or any assert macros that would call it, and there are not many asserts, or any direct calls to zmq_abort(), in the whole file either. However, the other lines in the stack trace seem to match the source code on GitHub:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/socket_poller.cpp#L54
How does execution end up in zmq_abort()?
Beginning of stack trace:
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12cc9d4700 (LWP 23680))]
(gdb) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f12ce123415 in __GI_abort () at abort.c:90
#2 0x00007f12ce8db9c9 in zmq::zmq_abort (errmsg_=<optimized out>) at ../zeromq-4.2.2/src/err.cpp:87
#3 0x00007f12ce918cbe in zmq::socket_poller_t::~socket_poller_t (this=0x7f12c8004150,
__in_chrg=<optimized out>) at ../zeromq-4.2.2/src/socket_poller.cpp:54
#4 0x00007f12ce91793a in zmq_poller_destroy (poller_p_=0x7f12cc9d2af8)
at ../zeromq-4.2.2/src/zmq.cpp:1236
#5 0x00007f12ce917e14 in zmq_poller_poll (timeout_=<optimized out>, nitems_=2, items_=0x1)
at ../zeromq-4.2.2/src/zmq.cpp:854
#6 zmq_poll (items_=items_@entry=0x7f12cc9d2c20, nitems_=nitems_@entry=2, timeout_=timeout_@entry=5000)
at ../zeromq-4.2.2/src/zmq.cpp:866
zmq_abort() was called from an assertion macro in signaler_t's destructor:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/signaler.cpp#L143. The signaler_t object is a member of socket_poller_t. I don't know for sure why the call to the destructor is not shown in the stack trace.
I was trying not to ask (directly) what was wrong with my code, because it was infeasible to provide a code sample, but I'll mention what it turned out to be: a file descriptor was erroneously closed twice in another thread. Between the two close operations, zmq_poll() created a socket_poller_t object. signaler_t's constructor opened an eventfd, which received the same fd number that had been closed earlier. Then the other thread closed that same fd again, leading the destructor to get EBADF on close() and call zmq_abort().
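The race is easy to reproduce in miniature (a single-threaded sketch; in the real bug the two close() calls came from another thread):

#include <sys/eventfd.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>

int main() {
    int fd = eventfd(0, 0);   // stands in for the fd the other thread owned
    close(fd);                // first, legitimate close
    int efd = eventfd(0, 0);  // signaler_t's constructor opens an eventfd...
    assert(efd == fd);        // ...and the kernel hands back the same number
    close(fd);                // second, erroneous close steals the signaler's fd
    if (close(efd) == -1)     // the destructor's close now fails...
        perror("close");      // ...with EBADF, which trips zmq's assertion
}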

gdb program core: can't identify the specific bug

I use the gdb command below to localize the segmentation fault, but it shows ?? and leaves me confused. What does it mean? How do I avoid it?
$ gdb program core
...
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000048d0000048c in ?? ()
(gdb) bt
#0 0x0000046a00000469 in ?? ()
#1 0x0000046c0000046b in ?? ()
#2 0x0000046e0000046d in ?? ()
#3 0x000004700000046f in ?? ()
#4 0x0000047300000472 in ?? ()
#5 0x0000047600000475 in ?? ()
#6 0x0000047800000477 in ?? ()
#7 0x0000047a00000479 in ?? ()
#8 0x0000047d0000047b in ?? ()
...
I found that an array was going out of bounds and I fixed it, but I am still confused by the phenomenon above.
0x0000048d0000048c
This looks like you've called a function through a function pointer, but that pointer has been overwritten with two integers: 0x48d == 1165 and 0x48c == 1164 (do these values look like something that your program is using?).
You should use bt to tell you how you got there.
You should probably use Valgrind or Address Sanitizer to check for uninitialized or dangling memory and buffer overflows (which are some of the common ways to end up with an invalid function pointer).
Update:
Now that you show the stack trace, it's almost a 100% guarantee that you have some local array of integers which you've overflowed (filling it with values like 1129, 1130, 1131, etc.), thus corrupting your stack.
Address Sanitizer (available in recent versions of GCC) should point you straight at where the bug is.
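A sketch of the bug class being described (hypothetical code, with values chosen to match the 0x469, 0x46a, ... pattern in the trace above):

#include <cstddef>

// Writing far past the end of a small local array clobbers the saved return
// address and neighbouring stack slots with consecutive integers -- producing
// the bogus 0x0000046a00000469-style frames gdb then reports.
void fill(int *a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0x469 + static_cast<int>(i);
}

int main() {
    int buf[4];
    fill(buf, 64);  // out of bounds: -fsanitize=address pinpoints this store
    return buf[0];
}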
This means that your program crashed in a function unknown to gdb (a function not present in the symbol table).
Try these two options, in the given order:
If you are debugging a target, make sure that all of your code layers are compiled with the -g option (if you are using gcc).
You can manually supply the symbol table to gdb with the command file "binary_with_symbol_table", and it will give you the function and the address of the bug.
Note that many different errors may be hidden behind a segmentation fault.

Simultaneous abort() in two threads

I have a backtrace with something I haven't seen before. See frame 2 in these threads:
Thread 31 (process 8752):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0b139 in sigprocmask () from /lib/libc.so.6
#2 0x00b0c7a2 in abort () from /lib/libc.so.6
#3 0x00752aa0 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x00750505 in ?? () from /usr/lib/libstdc++.so.6
#5 0x00750542 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x00750c65 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x00299c63 in ApplicationFunction()
Thread 1 (process 8749):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0ad80 in raise () from /lib/libc.so.6
#2 0x00b0c691 in abort () from /lib/libc.so.6
#3 0x00b4324b in __libc_message () from /lib/libc.so.6
#4 0x00b495b6 in malloc_consolidate () from /lib/libc.so.6
#5 0x00b4b3bd in _int_malloc () from /lib/libc.so.6
#6 0x00b4d3ab in malloc () from /lib/libc.so.6
#7 0x08147f03 in AnotherApplicationFunction ()
When I open the core with gdb and get a backtrace, it gives me thread 1. Only later did I see the weird state that thread 31 is in. That thread comes from the library we have had problems with, so I believe the crash is caused by it.
So what does it mean? Two threads simultaneously doing something illegal? Or is it one of them somehow causing the abort() in the other?
The OS is Red Hat Enterprise Linux 5.3; it's a multiprocessor server.
It is hard to be sure, but my first suspicion upon seeing these stack traces would be a memory corruption (possibly a buffer overrun on the heap). If that's the case, then the corruption is probably the root cause of both threads ending up in abort.
Can you valgrind your app?
Looks like it could be heap corruption, detected by malloc in thread 1, causing or caused by the error in thread 31.
Some broken piece of code overwriting, among other things, the vtable in thread 31 could easily cause this.
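For reference, a minimal sketch (a constructed example, unrelated to the questioner's code) of how execution reaches __cxa_pure_virtual and the abort seen in frames #2 to #6 of thread 31:

// During ~Base the derived part of the object is already destroyed, so the
// virtual dispatch lands on the pure-virtual stub in Base's vtable, and the
// runtime calls __cxa_pure_virtual -> std::terminate -> abort. A heap overrun
// that rewrites an object's vptr can reach the same stub.
struct Base {
    virtual void work() = 0;
    void indirect() { work(); }   // virtual dispatch via the current vtable
    virtual ~Base() { indirect(); }
};

struct Derived : Base {
    void work() override {}
};

int main() {
    Derived d;  // aborts with "pure virtual method called" at end of scope
}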
It's possible that the reason thread 31 aborted is that it trashed the application heap in some way. Then, when the main thread tried to allocate memory, the heap data structures were in a bad state, causing the allocation to fail and abort the application again.

What caused the mysterious duplicate entry in my stack?

I'm investigating a deadlock bug. I took a core with gcore, and found that one of my functions seems to have called itself - even though it does not make a recursive function call.
Here's a fragment of the stack from gdb:
Thread 18 (Thread 4035926944 (LWP 23449)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x005133de in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0x00510017 in _L_mutex_lock_182 () from /lib/tls/libpthread.so.0
#3 0x080d653c in ?? ()
#4 0xf7c59480 in ?? () from LIBFOO.so
#5 0x081944c0 in ?? ()
#6 0x081944b0 in ?? ()
#7 0xf08f3b38 in ?? ()
#8 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#9 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#10 0xf7c36006 in FOO::RequesterImpl::releaseObject ()
from LIBFOO.so
#11 0xf7e2afbf in BAR::BAZ::unsubscribe (this=0x80d0070, sSymbol=@0xf6ded018)
at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:176
...more stack
I've elided some of the names: FOO and BAR are namespaces; BAZ is a class.
The interesting part is frames #8 and #9, the call to Service::releaseObject(). This function does not call itself, nor does it call any function that calls it back... it is not recursive. Why then does it appear on the stack twice?
Is this an artefact created by the debugger, or could it be real?
You'll notice that the innermost call is waiting for a mutex - I think this could be my deadlock. Service::releaseObject() locks a mutex, so if it magically teleported back inside itself, then a deadlock most certainly could occur.
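That reasoning can be sanity-checked in isolation: a default (non-recursive) pthread mutex locked twice by the same thread blocks forever in __lll_mutex_lock_wait, the innermost frame above. Here is a sketch using an error-checking mutex so the self-deadlock is reported instead of hanging (illustrative code, not from the application):

#include <pthread.h>
#include <cstdio>
#include <cstring>

int main() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_t m;
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);           // the outer releaseObject takes the lock
    int rc = pthread_mutex_lock(&m);  // a re-entrant attempt from the same thread
    std::printf("second lock: %s\n", std::strerror(rc));  // reports EDEADLK
    return 0;
}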
Some background:
This is compiled using g++ v3.4.6 on RHEL4. It's a 64-bit OS, but this is 32-bit code, compiled with -m32. It's optimised at -O3. I can't guarantee that the application code was compiled with exactly the same options as the LIBFOO code.
Class Service has no virtual functions, so there's no vtable. Class RequesterImpl inherits from a fully-virtual interface, so it does have a vtable.
Stack traces are unreliable on x86 at any optimization level above -O0: -O1 and higher enable -fomit-frame-pointer.
The reason you get a "bad" stack is that __lll_mutex_lock_wait has an incorrect unwind descriptor (it is written in hand-coded assembly). I believe this was fixed somewhat recently (in 2008), but I can't find the exact patch.
Once the GDB stack unwinder goes "off balance", it creates bogus frames (#2 through #8), but it eventually stumbles on a frame that uses the frame pointer, and produces a correct stack trace for the rest of the stack.