I'm using zmq version 4.2.2. My program crashes because of a call to zmq_abort() which calls abort(). According to stack trace, if I understand correctly, zmq_abort() is called from src/socket_poller.cpp:54. However, that line is the beginning of the function definition:
zmq::socket_poller_t::~socket_poller_t ()
The function does not have direct calls to zmq_abort() or any assert macros that would call it. There are not many asserts or any direct calls to zmq_abort() in the whole file either. However, other lines in the stack trace seem to match the source code in github:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/socket_poller.cpp#L54
How does execution end up in zmq_abort()?
Beginning of stack trace:
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12cc9d4700 (LWP 23680))]
(gdb) where
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f12ce123415 in __GI_abort () at abort.c:90
#2 0x00007f12ce8db9c9 in zmq::zmq_abort (errmsg_=<optimized out>) at ../zeromq-4.2.2/src/err.cpp:87
#3 0x00007f12ce918cbe in zmq::socket_poller_t::~socket_poller_t (this=0x7f12c8004150,
__in_chrg=<optimized out>) at ../zeromq-4.2.2/src/socket_poller.cpp:54
#4 0x00007f12ce91793a in zmq_poller_destroy (poller_p_=0x7f12cc9d2af8)
at ../zeromq-4.2.2/src/zmq.cpp:1236
#5 0x00007f12ce917e14 in zmq_poller_poll (timeout_=<optimized out>, nitems_=2, items_=0x1)
at ../zeromq-4.2.2/src/zmq.cpp:854
#6 zmq_poll (items_=items_#entry=0x7f12cc9d2c20, nitems_=nitems_#entry=2, timeout_=timeout_#entry=5000)
at ../zeromq-4.2.2/src/zmq.cpp:866
zmq_abort() was called from an assertation macro in signaler_t's destructor:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/signaler.cpp#L143. The signaler_t object is a member of socket_poller_t. I don't know for sure why the call to the destructor is not shown in the stack trace.
I was trying not to ask (directly) what was wrong with my code because it was infeasible to provide a code sample, but I'll mention that it turned out to be that a file descriptor was erroneously closed twice in another thread. Between the two close operations, zmq_poll() created a socket_poller_t object. signaler_t's constructor opened an eventfd, which was the same fd (number) that had been closed earlier. Then, the other thread closed the same fd again, leading the destructor to get EBADF on close() and callzmq_abort().
Related
I have a small to medium size application which combines Fortran and C++. The main is written in Fortran, but one module is in c++. This module returns pointers to class objects which are stored on the Fortran size. During the creation on one of these pointers the system is throwing the following error:
malloc(): memory corruption
Thread 1 "bc_test" received signal SIGABRT, Aborted.
__GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory
(gdb) bt
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff4a60801 in __GI_abort () at abort.c:79
#2 0x00007ffff4aa9897 in __libc_message (action=action#entry=do_abort,
fmt=fmt#entry=0x7ffff4bd6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff4ab090a in malloc_printerr (
str=str#entry=0x7ffff4bd4e0e "malloc(): memory corruption") at malloc.c:5350
#4 0x00007ffff4ab4994 in _int_malloc (av=av#entry=0x7ffff4e0bc40 <main_arena>,
bytes=bytes#entry=44) at malloc.c:3738
#5 0x00007ffff4ab72ed in __GI___libc_malloc (bytes=44) at malloc.c:3065
#6 0x00007ffff50bc298 in operator new(unsigned long) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000555555578967 in My_Class::My_Class(this=0x7fffffffd4e0, n=11)
at /home/.../my_class.cpp:20
Using gdb I have found that the error is thrown during a call to new. More specifically during a call to new within the constructor of an object being created via new (a basic new call works as expected). The line throwing the error is the following:
int* test = new int[n];
in this case n is an integer with n=11.
I don't think that the problem is due to a lack of memory as I have only allocated 2 small class instances and a few basic variables at this point. I also believe this would throw a different error if this were the problem.
Unfortunately I haven't managed to create a MWE. I've now run out of ideas of how to fix this problem. What can cause this error? How can it be debugged beyond finding the line throwing the error?
Other stack overflow results concerning "malloc(): memory corruption" errors are due to accessing unallocated memory however this isn't the case here as it is the allocation call itself which is throwing the error.
Memory corruption errors do not always manifest themselves in the place where the error was committed. As a result the gdb backtrace is often useless for finding the error. Instead a memory analysis/debugging tool such as Valgrind should be used.
I am writing a program which use both OS threads and user threads (fibers, I have written this user Threading program with context switching through assembly language). The problem is that the program sometimes ends with a segmentation fault but other times it doesn't.
The problem is due to a function getting called, with invalid arguments, which shouldn't get called. I think gdb backtrace isn't giving the proper information. Here is the output of my gdb program
#0 0x0000000000000000 in ?? ()
#1 0x0000555555555613 in thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
#2 0x000055555555c791 in start_thread () at contextSwitch2.s:57
#3 0x0000000000000000 in ?? ()
fn is the function I want to run as a user thread, arg is the argument passed to that function.
I have a function Spawn in my user threading library code which push the two arguments (fn and arg) and the pointer to start_thread on the stack and thus start_thread, an assembly function, gets called which call the c++ function thread_entry to call the function fn with arguments arg.
I am not expecting a call to start_thread or thread_entry at the point of error so I am not sure how start_thread gets called. Even if it gets called then Spawn() should have called start_thread as it is the only function which calls start_thread. But Spawn is not shown in gdb backtrace.
Some online posts have mentioned the possibility of stack corruption or something similar the result of error and they have prescribed the use of "record btrace pt". I have spent considerable time setting up intel btrace pt support in the kernel/gdb but I was unable to set it up so I am not going through that route.
Here is a link to my code with compilation instructions:
https://github.com/smartWaqar/userThreading
I set a breakpoint on thread_entry, and observed:
...
[Thread 0x7ffff7477700 (LWP 203995) exited]
parentId: 1
OST 1 Hello A0 on CPU 2
current_thread_num 0 next_thread_num 1
After Thread Exit
After changeOSThread
OST 1 Hello C1 on CPU 2 ---------------
Before changeOSThread
**************** In changeOSThread **************
current_thread_num 1 next_thread_num 2
Thread 3 "a.out" hit Breakpoint 1, thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
243 fn(arg) ;
(gdb) bt
#0 thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
#1 0x000055555555c181 in start_thread () at context.s:57
#2 0x0000000000000000 in ?? ()
Conclusions:
GDB is giving you correct crash stack trace.
You do in fact call thread_entry with fn==0, which of course promptly crashes.
There is something racy going on, as this doesn't happen every time.
Even if it gets called then Spawn() should have called start_thread as it is the only function which calls start_thread
I've observed the following "call" to strart_thread:
Thread 2 "a.out" hit Breakpoint 1, start_thread () at context.s:53
53 push %rbp
(gdb) bt
#0 start_thread () at context.s:53
#1 0x0000555555555e4f in changeOSThread (parentId=<error reading variable>) at t.cc:196
#2 0x0000000000000000 in ?? ()
So I think your mental model of who calls start_thread is wrong.
This is a bit too much code for me to look at. If you want additional help, please reduce the test case to bare minimum.
I have a backtrace with something I haven't seen before. See frame 2 in these threads:
Thread 31 (process 8752):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0b139 in sigprocmask () from /lib/libc.so.6
#2 0x00b0c7a2 in abort () from /lib/libc.so.6
#3 0x00752aa0 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x00750505 in ?? () from /usr/lib/libstdc++.so.6
#5 0x00750542 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x00750c65 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x00299c63 in ApplicationFunction()
Thread 1 (process 8749):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0ad80 in raise () from /lib/libc.so.6
#2 0x00b0c691 in abort () from /lib/libc.so.6
#3 0x00b4324b in __libc_message () from /lib/libc.so.6
#4 0x00b495b6 in malloc_consolidate () from /lib/libc.so.6
#5 0x00b4b3bd in _int_malloc () from /lib/libc.so.6
#6 0x00b4d3ab in malloc () from /lib/libc.so.6
#7 0x08147f03 in AnotherApplicationFunction ()
When opening it with gdb and getting backtrace it gives me thread 1. Later I saw the weird state that thread 31 is in. This thread is from the library that we had problems with so I'd believe the crash is caused by it.
So what does it mean? Two threads simultaneously doing something illegal? Or it's one of them, causing somehow abort() in the other one?
The OS is Linux Red Hat Enterprise 5.3, it's a multiprocessor server.
It is hard to be sure, but my first suspicion upon seeing these stack traces would be a memory corruption (possibly a buffer overrun on the heap). If that's the case, then the corruption is probably the root cause of both threads ending up in abort.
Can you valgrind your app?
Looks like it could be heap corruption, detected by malloc in thread 1, causing or caused by the error in thread 31.
Some broken piece of code overwriting a.o. the vtable in thread 31 could easily cause this.
It's possible that the reason thread 31 aborted is because it trashed the application heap in some way. Then when the main thread tried to allocate memory the heap data structure was in a bad state, causing the allocation to fail and abort the application again.
So my understanding of both pthread_exit and pthread_cancel is that they both cause an exception-like thing called a "forced unwind" to be thrown out of the relevant stack frame in the target thread. This can be caught in order to do thread-specific clean-up, but must be re-thrown or else we get an implicit abort() at the end of the catch block that didn't re-throw.
In the case of pthread_cancel, that happens either immediately on receipt of the associated signal, or the next entry into a cancellation point, or when the signal is next unblocked, depending on the thread's cancellation state and type.
In the case of pthread_exit, the calling thread immediately undergoes a forced unwind.
Fine. This "exception" is a normal part of the process of killing a thread. So why, even when I re-throw it, is it causing std::terminate() to be called, aborting my whole application?
Note that I'm catching and re-throwing the exception a couple times.
Note also that I'm calling pthread_exit out of my SIGTERM signal handler. This works fine in my toy test code, compiled with g++ 4.3.2, which has a thread run signal(SIGTERM, handler_that_calls_pthread_exit) and then sit in a tight while loop until it gets the TERM signal. But it doesn't work in the real application.
Relevant stack frames:
(gdb) where
#0 0x0000003425c30265 in raise () from /lib64/libc.so.6
#1 0x0000003425c31d10 in abort () from /lib64/libc.so.6
#2 0x00000000012b7740 in sv_bsd_terminate () at exception_handlers.cpp:38
#3 0x00002aef65983aa6 in __cxxabiv1::__terminate (handler=0x518)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:43
#4 0x00002aef65983ad3 in std::terminate ()
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:53
#5 0x00002aef65983a5a in __cxxabiv1::__gxx_personality_v0 (
version=<value optimized out>, actions=<value optimized out>,
exception_class=<value optimized out>, ue_header=0x645bcd80,
context=0x645bb940)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libstdc++-v3/libsupc++/eh_personality.cc:657
#6 0x00002aef6524d68c in _Unwind_ForcedUnwind_Phase2 (exc=0x645bcd80,
context=0x645bb940)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libgcc/../gcc/unwind.inc:180
#7 0x00002aef6524d723 in _Unwind_ForcedUnwind (exc=0x645bcd80,
stop=<value optimized out>, stop_argument=0x645bc1a0)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libgcc/../gcc/unwind.inc:212
#8 0x000000342640cf80 in __pthread_unwind () from /lib64/libpthread.so.0
#9 0x00000034264077a5 in pthread_exit () from /lib64/libpthread.so.0
#10 0x0000000000f0d959 in threadHandleTerm (sig=<value optimized out>)
at osiThreadLauncherLinux.cpp:46
#11 <signal handler called>
Thanks!
Eric
Note also that I'm calling
pthread_exit out of my SIGTERM signal
handler.
This is your problem. To quote from the POSIX specs (http://pubs.opengroup.org/onlinepubs/009695399/functions/signal.html):
If the signal occurs other than as the result of calling abort(), raise(), kill(), pthread_kill(), or sigqueue(), the behavior is undefined if the signal handler refers to any object with static storage duration other than by assigning a value to an object declared as volatile sig_atomic_t, or if the signal handler calls any function in the standard library other than one of the functions listed in Signal Concepts.
The list of permitted functions is given at http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03, and does not include pthread_exit(). Therefore your program is exhibiting undefined behaviour.
I can think of three choices:
Set a flag in the signal handler which is checked by the thread periodically, rather than trying to exit directly from the signal handler.
Use sigwait() to explicitly wait for the signal on an independent thread. This thread can then explicitly call pthread_cancel() on the thread you wish to exit.
Mask the signal, and call sigpending() periodically on the thread that is to be exited, and exit if the signal is pending.
I'm investigating a deadlock bug. I took a core with gcore, and found that one of my functions seems to have called itself - even though it does not make a recursive function call.
Here's a fragment of the stack from gdb:
Thread 18 (Thread 4035926944 (LWP 23449)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x005133de in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0x00510017 in _L_mutex_lock_182 () from /lib/tls/libpthread.so.0
#3 0x080d653c in ?? ()
#4 0xf7c59480 in ?? () from LIBFOO.so
#5 0x081944c0 in ?? ()
#6 0x081944b0 in ?? ()
#7 0xf08f3b38 in ?? ()
#8 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#9 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#10 0xf7c36006 in FOO::RequesterImpl::releaseObject ()
from LIBFOO.so
#11 0xf7e2afbf in BAR::BAZ::unsubscribe (this=0x80d0070, sSymbol=#0xf6ded018)
at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:176
...more stack
I've elided some of the names: FOO & BAR are namespaces.BAZ is a class.
The interesting part is #8 and #9, the call to Service::releaseObject(). This function does not call itself, nor does it call any function that calls it back... it is not recursive. Why then does it appear in the stack twice?
Is this an artefact created by the debugger, or could it be real?
You'll notice that the innermost call is waiting for a mutex - I think this could be my deadlock. Service::releaseObject() locks a mutex, so if it magically teleported back inside itself, then a deadlock most certainly could occur.
Some background:
This is compiled using g++ v3.4.6 on RHEL4. It's a 64-bit OS, but this is 32-bit code, compiled with -m32. It's optimised at -O3. I can't guarantee that the application code was compiled with exactly the same options as the LIBFOO code.
Class Service has no virtual functions, so there's no vtable. Class RequesterImpl inherits from a fully-virtual interface, so it does have a vtable.
Stacktraces are unreliable on x86 at any optimization level: -O1 and higher enable -fomit-frame-pointer.
The reason you get "bad" stack is that __lll_mutex_lock_wait has incorrect unwind descriptor (it is written in hand-coded assembly). I believe this was fixed somewhat recently (in 2008), but can't find the exact patch.
Once the GDB stack unwinder goes "off balance", it creates bogus frames (#2 through #8), but eventually stumbles on a frame which uses frame pointer, and produces correct stack trace for the rest of the stack.