Debugging malloc(): memory corruption on new - c++

I have a small to medium size application which combines Fortran and C++. The main is written in Fortran, but one module is in c++. This module returns pointers to class objects which are stored on the Fortran size. During the creation on one of these pointers the system is throwing the following error:
malloc(): memory corruption
Thread 1 "bc_test" received signal SIGABRT, Aborted.
__GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory
(gdb) bt
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff4a60801 in __GI_abort () at abort.c:79
#2 0x00007ffff4aa9897 in __libc_message (action=action#entry=do_abort,
fmt=fmt#entry=0x7ffff4bd6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff4ab090a in malloc_printerr (
str=str#entry=0x7ffff4bd4e0e "malloc(): memory corruption") at malloc.c:5350
#4 0x00007ffff4ab4994 in _int_malloc (av=av#entry=0x7ffff4e0bc40 <main_arena>,
bytes=bytes#entry=44) at malloc.c:3738
#5 0x00007ffff4ab72ed in __GI___libc_malloc (bytes=44) at malloc.c:3065
#6 0x00007ffff50bc298 in operator new(unsigned long) ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000555555578967 in My_Class::My_Class(this=0x7fffffffd4e0, n=11)
at /home/.../my_class.cpp:20
Using gdb I have found that the error is thrown during a call to new. More specifically during a call to new within the constructor of an object being created via new (a basic new call works as expected). The line throwing the error is the following:
int* test = new int[n];
in this case n is an integer with n=11.
I don't think that the problem is due to a lack of memory as I have only allocated 2 small class instances and a few basic variables at this point. I also believe this would throw a different error if this were the problem.
Unfortunately I haven't managed to create a MWE. I've now run out of ideas of how to fix this problem. What can cause this error? How can it be debugged beyond finding the line throwing the error?
Other stack overflow results concerning "malloc(): memory corruption" errors are due to accessing unallocated memory however this isn't the case here as it is the allocation call itself which is throwing the error.

Memory corruption errors do not always manifest themselves in the place where the error was committed. As a result the gdb backtrace is often useless for finding the error. Instead a memory analysis/debugging tool such as Valgrind should be used.

Related

How calling zmq_poll() leads to calling zmq_abort()?

I'm using zmq version 4.2.2. My program crashes because of a call to zmq_abort() which calls abort(). According to stack trace, if I understand correctly, zmq_abort() is called from src/socket_poller.cpp:54. However, that line is the beginning of the function definition:
zmq::socket_poller_t::~socket_poller_t ()
The function does not have direct calls to zmq_abort() or any assert macros that would call it. There are not many asserts or any direct calls to zmq_abort() in the whole file either. However, other lines in the stack trace seem to match the source code in github:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/socket_poller.cpp#L54
How does execution end up in zmq_abort()?
Beginning of stack trace:
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12cc9d4700 (LWP 23680))]
(gdb) where
#0 __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f12ce123415 in __GI_abort () at abort.c:90
#2 0x00007f12ce8db9c9 in zmq::zmq_abort (errmsg_=<optimized out>) at ../zeromq-4.2.2/src/err.cpp:87
#3 0x00007f12ce918cbe in zmq::socket_poller_t::~socket_poller_t (this=0x7f12c8004150,
__in_chrg=<optimized out>) at ../zeromq-4.2.2/src/socket_poller.cpp:54
#4 0x00007f12ce91793a in zmq_poller_destroy (poller_p_=0x7f12cc9d2af8)
at ../zeromq-4.2.2/src/zmq.cpp:1236
#5 0x00007f12ce917e14 in zmq_poller_poll (timeout_=<optimized out>, nitems_=2, items_=0x1)
at ../zeromq-4.2.2/src/zmq.cpp:854
#6 zmq_poll (items_=items_#entry=0x7f12cc9d2c20, nitems_=nitems_#entry=2, timeout_=timeout_#entry=5000)
at ../zeromq-4.2.2/src/zmq.cpp:866
zmq_abort() was called from an assertation macro in signaler_t's destructor:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/signaler.cpp#L143. The signaler_t object is a member of socket_poller_t. I don't know for sure why the call to the destructor is not shown in the stack trace.
I was trying not to ask (directly) what was wrong with my code because it was infeasible to provide a code sample, but I'll mention that it turned out to be that a file descriptor was erroneously closed twice in another thread. Between the two close operations, zmq_poll() created a socket_poller_t object. signaler_t's constructor opened an eventfd, which was the same fd (number) that had been closed earlier. Then, the other thread closed the same fd again, leading the destructor to get EBADF on close() and callzmq_abort().

What are the possible reasons for POSIX SIGBUS?

My program recently crashed with the following stack;
Program terminated with signal 7, Bus error.
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
#1 0x00007f0f35f8042e in skgesigOSCrash () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#2 0x00007f0f36222ca9 in kpeDbgSignalHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#3 0x00007f0f35f8063e in skgesig_sigactionHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#4 <signal handler called>
What should I check in my code to avoid this? Or is this something Oracle should fix?
Main reasons you could get a bus error revolves around inaccessible memory. This could be due to many reasons:
Accessing through a deleted pointer.
Accessing through an uninitialized pointer.
Accessing through a NULL pointer.
Accessing the address which is not yours. It could be due to overflow errors.
Try adding the following to the $ORACLE_HOME/network/admin/*.ora file:
DIAG_ADR_ENABLED=OFF
DIAG_SIGHANDLER_ENABLED=FALSE
DIAG_DDE_ENABLED=FALSE
This sounds like an Oracle issue.
And also Oracle's libraries seem to be compiled by Intel compilers.

Pimpl + QSharedPointer - Destructor = Disaster

Yesterday I ran into misery which took me 24 hours of frustration. The problem boiled down to unexpected crashes occurring on random basis. To complicate things, debugging reports had absolutely random pattern as well. To complicate it even more, all debugging traces were leading to either random Qt sources or native DLLs, i.e. proving every time that the issue is rather not on my side.
Here you are a few examples of such lovely reports:
Program received signal SIGSEGV, Segmentation fault.
0x0000000077864324 in ntdll!RtlAppendStringToString () from C:\Windows\system32\ntdll.dll
(gdb) bt
#0 0x0000000077864324 in ntdll!RtlAppendStringToString () from C:\Windows\system32\ntdll.dll
#1 0x000000002efc0230 in ?? ()
#2 0x0000000002070005 in ?? ()
#3 0x000000002efc0000 in ?? ()
#4 0x000000007787969f in ntdll!RtlIsValidHandle () from C:\Windows\system32\ntdll.dll
#5 0x0000000000000000 in ?? ()
warning: HEAP: Free Heap block 307e5950 modified at 307e59c0 after it was freed
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000778bf0b2 in ntdll!ExpInterlockedPopEntrySListFault16 () from C:\Windows\system32\ntdll.dll
(gdb) bt
#0 0x00000000778bf0b2 in ntdll!ExpInterlockedPopEntrySListFault16 () from C:\Windows\system32\ntdll.dll
#1 0x000000007786fd34 in ntdll!RtlIsValidHandle () from C:\Windows\system32\ntdll.dll
#2 0x0000000077910d20 in ntdll!RtlGetLastNtStatus () from C:\Windows\system32\ntdll.dll
#3 0x00000000307e5950 in ?? ()
#4 0x00000000307e59c0 in ?? ()
#5 0x00000000ffffffff in ?? ()
#6 0x0000000000220f10 in ?? ()
#7 0x0000000077712d60 in WaitForMultipleObjectsEx () from C:\Windows\system32\kernel32.dll
#8 0x0000000000000000 in ?? ()
Program received signal SIGSEGV, Segmentation fault.
0x0000000000a9678a in QBasicAtomicInt::ref (this=0x8) at ../../include/QtCore/../../../qt-src/src/corelib/arch/qatomic_x86_64.h:121
121 : "memory");
(gdb) bt
#0 0x0000000000a9678a in QBasicAtomicInt::ref (this=0x8) at ../../include/QtCore/../../../qt-src/src/corelib/arch/qatomic_x86_64.h:121
#1 0x00000000009df08e in QVariant::QVariant (this=0x21e4d0, p=...) at d:/Distributions/qt-src/src/corelib/kernel/qvariant.cpp:1426
#2 0x0000000000b4dde9 in QList<QVariant>::value (this=0x323bd480, i=1) at ../../include/QtCore/../../../qt-src/src/corelib/tools/qlist.h:666
#3 0x00000000009ccff7 in QObject::property (this=0x3067e900,
name=0xa9d042a <QCDEStyle::drawPrimitive(QStyle::PrimitiveElement, QStyleOption const*, QPainter*, QWidget const*) const::pts5+650> "_q_stylerect")
at d:/Distributions/qt-src/src/corelib/kernel/qobject.cpp:3742
#4 0x0000000000000000 in ?? ()
As you can see this stuff is pretty nasty, it gives one no useful information. But, there was one thing I didn't pay attention to. It was a weird warning during compilation which is also hard to catch with an eye:
In file included from d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer.h:50:0,
from d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/QSharedPointer:1,
from ../../../../source/libraries/Project/sources/Method.hpp:4,
from ../../../../source/libraries/Project/sources/Slot.hpp:4,
from ../../../../source/libraries/Project/sources/Slot.cpp:1:
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h: In instantiation of 'static void QtSharedPointer::ExternalRefCount<T>::deref(QtSharedPointer::ExternalRefCount<T>::Data*, T*) [with T = Project::Method::Private; QtSharedPointer::ExternalRefCount<T>::Data = QtSharedPointer::ExternalRefCountData]':
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:336:11: required from 'void QtSharedPointer::ExternalRefCount<T>::deref() [with T = Project::Method::Private]'
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:401:38: required from 'QtSharedPointer::ExternalRefCount<T>::~ExternalRefCount() [with T = Project::Method::Private]'
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:466:7: required from here
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:342:21: warning: possible problem detected in invocation of delete operator: [enabled by default]
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:337:28: warning: 'value' has incomplete type [enabled by default]
Actually, I turned to this warning only as a last resort because in such a desperate pursuit to find a bug, the code was already infected with logging to death literally.
After reading it carefully, I recalled that, for instance, if one uses std::unique_ptr or std::scoped_ptr for Pimpl - one should certainly provide desctructor, otherwise the code won't even compile. However, I also remember that std::shared_ptr does not care about destructor and works fine without it. It was another reason why I didn't pay attention to this strange warning. Long story short, when I added destructor, this random crashing stopped. Looks like Qt's QSharedPointer has some design flaws compared to std::shared_ptr. I guess it would be better, if Qt developers transformed this warning into error because debugging marathons like that are simply not worth one's time, effort and nerves.
My questions are:
What's wrong with QSharedPointer? Why destructor is so vital?
Why crashing happened when there was no destructor? These objects (which are using Pimpl + QSharedPointer) are created on stack and no other objects have access to them after their death. However, crashing happened during some random period of time after their death.
Has anyone ran into issues like that before? Please, share your experience.
Are there other pitfalls
like that in Qt - ones that I must know about for sure to stay
safe in future?
Hopefully, these questions and my post in general will help others to avoid the hell I've been to for the past 24 hours.
The issue has been worked around in Qt 5, see https://codereview.qt-project.org/#change,26974
The compiler calling the wrong destructor or assuming a different memory layout probably lead to some kind of memory corruption. I'd say a compiler should give an error for this issue and not a warning.
You'll run into a similar issue with std::unique_ptr, which can also cause broken destructors if used with an incomplete type. The fix is pretty trivial, of course - I declare a constructor for the class, then define it in the implementation file as
MyClass::~MyClass() = default;
The reason that this is an issue for std::unique_ptr but not std::shared_ptr is that the destructor is part of the type of the former, but is a member of the latter.

C++ server crashes with abort() in _UTF8_init() on free()

I'm having problems with C++ code loaded via dlopen() by a C++ CGI server. After a while, the program crashes unexpectedly, but consistently at memory management function call (such as free(), calloc(), etc.) and produces core dump similar to this:
#0 0x0000000806b252dc in kill () from /lib/libc.so.6
#1 0x0000000804a1861e in raise () from /lib/libpthread.so.2
#2 0x0000000806b2416d in abort () from /lib/libc.so.6
#3 0x0000000806abdb45 in _UTF8_init () from /lib/libc.so.6
#4 0x0000000806abdfcc in _UTF8_init () from /lib/libc.so.6
#5 0x0000000806abeb1d in _UTF8_init () from /lib/libc.so.6
... the rest of the stack
Has anyone seen something like this before?
What is _UTF8_init() and why would memory management functions call it?
That smells like a corrupted heap, likely due to a buffer overrun somewhere in your code. Try running your program with Valgrind and look for any errors or warnings it emits.

Simultaneous abort() in two threads

I have a backtrace with something I haven't seen before. See frame 2 in these threads:
Thread 31 (process 8752):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0b139 in sigprocmask () from /lib/libc.so.6
#2 0x00b0c7a2 in abort () from /lib/libc.so.6
#3 0x00752aa0 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x00750505 in ?? () from /usr/lib/libstdc++.so.6
#5 0x00750542 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x00750c65 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x00299c63 in ApplicationFunction()
Thread 1 (process 8749):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0ad80 in raise () from /lib/libc.so.6
#2 0x00b0c691 in abort () from /lib/libc.so.6
#3 0x00b4324b in __libc_message () from /lib/libc.so.6
#4 0x00b495b6 in malloc_consolidate () from /lib/libc.so.6
#5 0x00b4b3bd in _int_malloc () from /lib/libc.so.6
#6 0x00b4d3ab in malloc () from /lib/libc.so.6
#7 0x08147f03 in AnotherApplicationFunction ()
When opening it with gdb and getting backtrace it gives me thread 1. Later I saw the weird state that thread 31 is in. This thread is from the library that we had problems with so I'd believe the crash is caused by it.
So what does it mean? Two threads simultaneously doing something illegal? Or it's one of them, causing somehow abort() in the other one?
The OS is Linux Red Hat Enterprise 5.3, it's a multiprocessor server.
It is hard to be sure, but my first suspicion upon seeing these stack traces would be a memory corruption (possibly a buffer overrun on the heap). If that's the case, then the corruption is probably the root cause of both threads ending up in abort.
Can you valgrind your app?
Looks like it could be heap corruption, detected by malloc in thread 1, causing or caused by the error in thread 31.
Some broken piece of code overwriting a.o. the vtable in thread 31 could easily cause this.
It's possible that the reason thread 31 aborted is because it trashed the application heap in some way. Then when the main thread tried to allocate memory the heap data structure was in a bad state, causing the allocation to fail and abort the application again.