C++ server crashes with abort() in _UTF8_init() on free() - c++

I'm having problems with C++ code loaded via dlopen() by a C++ CGI server. After a while, the program crashes unexpectedly, but consistently at memory management function call (such as free(), calloc(), etc.) and produces core dump similar to this:
#0 0x0000000806b252dc in kill () from /lib/libc.so.6
#1 0x0000000804a1861e in raise () from /lib/libpthread.so.2
#2 0x0000000806b2416d in abort () from /lib/libc.so.6
#3 0x0000000806abdb45 in _UTF8_init () from /lib/libc.so.6
#4 0x0000000806abdfcc in _UTF8_init () from /lib/libc.so.6
#5 0x0000000806abeb1d in _UTF8_init () from /lib/libc.so.6
... the rest of the stack
Has anyone seen something like this before?
What is _UTF8_init() and why would memory management functions call it?

That smells like a corrupted heap, likely due to a buffer overrun somewhere in your code. Try running your program with Valgrind and look for any errors or warnings it emits.

Related

Debugging malloc(): memory corruption on new

I have a small to medium size application which combines Fortran and C++. The main is written in Fortran, but one module is in c++. This module returns pointers to class objects which are stored on the Fortran size. During the creation on one of these pointers the system is throwing the following error:
malloc(): memory corruption
Thread 1 "bc_test" received signal SIGABRT, Aborted.
__GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory
(gdb) bt
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff4a60801 in __GI_abort () at abort.c:79
#2 0x00007ffff4aa9897 in __libc_message (action=action#entry=do_abort,
fmt=fmt#entry=0x7ffff4bd6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff4ab090a in malloc_printerr (
str=str#entry=0x7ffff4bd4e0e "malloc(): memory corruption") at malloc.c:5350
#4 0x00007ffff4ab4994 in _int_malloc (av=av#entry=0x7ffff4e0bc40 <main_arena>,
bytes=bytes#entry=44) at malloc.c:3738
#5 0x00007ffff4ab72ed in __GI___libc_malloc (bytes=44) at malloc.c:3065
#6 0x00007ffff50bc298 in operator new(unsigned long) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000555555578967 in My_Class::My_Class(this=0x7fffffffd4e0, n=11)
at /home/.../my_class.cpp:20
Using gdb I have found that the error is thrown during a call to new. More specifically during a call to new within the constructor of an object being created via new (a basic new call works as expected). The line throwing the error is the following:
int* test = new int[n];
in this case n is an integer with n=11.
I don't think that the problem is due to a lack of memory as I have only allocated 2 small class instances and a few basic variables at this point. I also believe this would throw a different error if this were the problem.
Unfortunately I haven't managed to create a MWE. I've now run out of ideas of how to fix this problem. What can cause this error? How can it be debugged beyond finding the line throwing the error?
Other stack overflow results concerning "malloc(): memory corruption" errors are due to accessing unallocated memory however this isn't the case here as it is the allocation call itself which is throwing the error.
Memory corruption errors do not always manifest themselves in the place where the error was committed. As a result the gdb backtrace is often useless for finding the error. Instead a memory analysis/debugging tool such as Valgrind should be used.

What are the possible reasons for POSIX SIGBUS?

My program recently crashed with the following stack;
Program terminated with signal 7, Bus error.
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
#1 0x00007f0f35f8042e in skgesigOSCrash () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#2 0x00007f0f36222ca9 in kpeDbgSignalHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#3 0x00007f0f35f8063e in skgesig_sigactionHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#4 <signal handler called>
What should I check in my code to avoid this? Or is this something Oracle should fix?
Main reasons you could get a bus error revolves around inaccessible memory. This could be due to many reasons:
Accessing through a deleted pointer.
Accessing through an uninitialized pointer.
Accessing through a NULL pointer.
Accessing the address which is not yours. It could be due to overflow errors.
Try adding the following to the $ORACLE_HOME/network/admin/*.ora file:
DIAG_ADR_ENABLED=OFF
DIAG_SIGHANDLER_ENABLED=FALSE
DIAG_DDE_ENABLED=FALSE
This sounds like an Oracle issue.
And also Oracle's libraries seem to be compiled by Intel compilers.

Custom memory manager with thread local storage

There is a custom memory manager in our program, all of our malloc/free calls are managed by the memory manager, but in the initial of the program getpwuid() will be call and in some customers' machine with nss_ldap activated it will call the malloc from libc not from our memory manager which leads to an error in our memory manager, the stack report from gdb is:
Breakpoint 2, 0x0000003df8cc6eb0 in brk () from /lib64/libc.so.6
0 0x0000003df8cc6eb0 in brk () from /lib64/libc.so.6
1 0x0000003df8cc6f72 in sbrk () from /lib64/libc.so.6
2 0x0000003df8c73d29 in __default_morecore () from /lib64/libc.so.6
3 0x0000003df8c70090 in _int_malloc () from /lib64/libc.so.6
4 0x0000003df8c70c9d in malloc () from /lib64/libc.so.6
5 0x0000003df880fc65 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
6 0x00002aaaae302a7c in _nss_ldap_inc_depth () from /lib64/libnss_ldap.so.2
7 0x00002aaaae2f91a4 in _nss_ldap_enter () from /lib64/libnss_ldap.so.2
8 0x00002aaaae2f942c in _nss_ldap_getbyname () from /lib64/libnss_ldap.so.2
9 0x00002aaaae2f9aa9 in _nss_ldap_getpwuid_r () from /lib64/libnss_ldap.so.2
10 0x0000003df8c947c5 in getpwuid_r##GLIBC_2.2.5 () from /lib64/libc.so.6
11 0x0000003df8c9412f in getpwuid () from /lib64/libc.so.6
12 0x0000000001414be3 in lc_username ()
I've traced the code of _nss_ldap_inc_depth(), it seems the __tls_get_addr() got call because the thread local storage is used, I've try to change the memory manager to shared library but the __tls_get_addr() still call the malloc from libc, how can I made it call our memory manager instead of libc's ??
You can use LD_PRELOAD to load your library before any other library (including glibc) and it will be linked instead, something like:
$ LD_PRELOAD=/path/to/library/libmymalloc.so /bin/myprog
There's a tutorial here that shows how it works, it even has an example interposed malloc
You can change your memory manager to use mmap instread of brk.
There can be only one user of brk in a process. So if you haven't replaced all calls to malloc and related functions (calloc, strdup and more), you must not use brk.
mmap, however, has no such problems. Your memory manager can use mmap, and malloc can still work in parallel.

Simultaneous abort() in two threads

I have a backtrace with something I haven't seen before. See frame 2 in these threads:
Thread 31 (process 8752):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0b139 in sigprocmask () from /lib/libc.so.6
#2 0x00b0c7a2 in abort () from /lib/libc.so.6
#3 0x00752aa0 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x00750505 in ?? () from /usr/lib/libstdc++.so.6
#5 0x00750542 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x00750c65 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x00299c63 in ApplicationFunction()
Thread 1 (process 8749):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0ad80 in raise () from /lib/libc.so.6
#2 0x00b0c691 in abort () from /lib/libc.so.6
#3 0x00b4324b in __libc_message () from /lib/libc.so.6
#4 0x00b495b6 in malloc_consolidate () from /lib/libc.so.6
#5 0x00b4b3bd in _int_malloc () from /lib/libc.so.6
#6 0x00b4d3ab in malloc () from /lib/libc.so.6
#7 0x08147f03 in AnotherApplicationFunction ()
When opening it with gdb and getting backtrace it gives me thread 1. Later I saw the weird state that thread 31 is in. This thread is from the library that we had problems with so I'd believe the crash is caused by it.
So what does it mean? Two threads simultaneously doing something illegal? Or it's one of them, causing somehow abort() in the other one?
The OS is Linux Red Hat Enterprise 5.3, it's a multiprocessor server.
It is hard to be sure, but my first suspicion upon seeing these stack traces would be a memory corruption (possibly a buffer overrun on the heap). If that's the case, then the corruption is probably the root cause of both threads ending up in abort.
Can you valgrind your app?
Looks like it could be heap corruption, detected by malloc in thread 1, causing or caused by the error in thread 31.
Some broken piece of code overwriting a.o. the vtable in thread 31 could easily cause this.
It's possible that the reason thread 31 aborted is because it trashed the application heap in some way. Then when the main thread tried to allocate memory the heap data structure was in a bad state, causing the allocation to fail and abort the application again.

Boost threads coring on startup

I have a program that brings up and tears down multiple threads throughout its life. Everything works great for awhile, but eventually, I get the following core dump stack trace.
#0 0x009887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x007617a5 in raise () from /lib/tls/libc.so.6
#2 0x00763209 in abort () from /lib/tls/libc.so.6
#3 0x003ec1bb in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x003e9ed1 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.6
#5 0x003e9f06 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x003ea04f in __cxa_throw () from /usr/lib/libstdc++.so.6
#7 0x00d5562b in boost::thread::start_thread () from /h/Program/bin/../lib/libboost_thread-gcc34-mt-1_39.so.1.39.0
At first, I was leaking threads, and figured the core was due to hitting some maximum limit of number of current threads, but now it seems that this problems occurs even when I don't. For reference, in the core above there were 13 active threads executing.
I did some searching to try and figure out why start_thread would core, but I didn't come across anything. Anyone have any ideas?
start_thread is throwing an uncaught exception, see which exceptions can start_thread throw and place a catch around it to see what is the problem.
What are the values carried by thread_resource_error? It looks like you can call native_error() to find out.
Since this is a wrapper around pthreads there are only a couple of possibilities - EAGAIN, EINVAL and EPERM. It looks as if boost has exceptions it would likely throw for EINVAL and EPERM - i.e. unsupported_thread_option() and thread_permission_error().
That pretty much leaves EAGAIN so I would double check that you really aren't exceeding the system limits on the number of threads. You are sure you are joining them, or if detached, they are really gone?