C++ heap corruption and valgrind

I have a core on both the Solaris and Linux platforms and I don't see the problem.
On a linux platform, I have the following core:
(gdb) where
#0 0x001aa81b in do_lookup_x () from /lib/ld-linux.so.2
#1 0x001ab0da in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2 0x001afa05 in _dl_fixup () from /lib/ld-linux.so.2
#3 0x001b5c90 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x00275e4c in __gxx_personality_v0 () from /opt/gnatpro/lib/libstdc++.so.6
#5 0x00645cfe in _Unwind_RaiseException_Phase2 (exc=0x2a7b10, context=0xffd58434) at ../../../src/libgcc/../gcc/unwind.inc:67
#6 0x00646082 in _Unwind_RaiseException (exc=0x2a7b10) at ../../../src/libgcc/../gcc/unwind.inc:136
#7 0x0027628d in __cxa_throw () from /opt/gnatpro/lib/libstdc++.so.6
#8 0x00276e4f in operator new(unsigned int) () from /opt/gnatpro/lib/libstdc++.so.6
#9 0x08053737 in Receptor::receive (this=0x93c12d8, msj=...) at Receptor.cc:477
#10 0x08099666 in EventProcessor::run (this=0xffd75580) at EventProcessor.cc:437
#11 0x0809747d in SEventProcessor::run (this=0xffd75580) at SEventProcessor.cc:80
#12 0x08065564 in main (argc=1, argv=0xffd76734) at my_project.cc:20
On a Solaris platform I have another core:
$ pstack core.ultimo
core 'core.ultimo' of 9220: my_project_sun
----------------- lwp# 1 / thread# 1 --------------------
0006fa28 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Dend6kM_pk2_ (1010144, 1ce84, ffbd0df8, ffb7a18c, fffffff8, ffbedc7c) + 30
0005d580 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Esize6kM_I_ (1010144, 219, 1ce84, ffffffff, fffffff8, ffbedc7c) + 30
0005ab14 __1cTReceptorHreceive6MrnKMensaje__v_ (33e630, ffbede70, ffffffff, 33e634, 33e68c, 0) + 1d4
0015df78 __1cREventProcessorDrun6M_v_ (ffbede18, 33e630, dcc, 1, 33e730, 6e) + 350
00159a50 __1cWSEventProcessorDrun6M_v_ (da08000, 2302f7, 111de0c, 159980, ff1fa07c, cc) + 48
000b6acc main (1, ffbeef74, ffbeef7c, 250000, 0, 0) + 16c
00045e10 _start (0, 0, 0, 0, 0, 0) + 108
----------------- lwp# 2 / thread# 2 --------------------
...
The piece of code is:
...
msj2.tipo(UPDATE);
for(i = 0; i < distr.size(); ++i)
{
    distr[i]->insert(new Mensaje(msj2));   // <-- Receptor.cc:477
}
...
This core happens randomly; sometimes the process has been running for weeks before it occurs.
The size of the core is 4291407872 B.
I am running valgrind to see if the heap is corrupted, but so far I have not encountered problems such as "Invalid read" or "Invalid write".
Also, while running valgrind I have twice found the following message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have identified the lines of code involved, but could these errors lead to the core? I think I have seen these errors with valgrind before and they weren't as important as the ones that say "Invalid read/write".
If you have any idea how to solve this problem, it would be highly appreciated.

The core size is the clue. The largest 32-bit unsigned number is 4,294,967,295. Your core is quite close to that, indicating that the process ran out of memory. The most likely cause is a memory leak.
See my recent article Memory Leaks in C/C++
Valgrind will find the issue for you on Linux. You have to start it with the --leak-check option for this (e.g. valgrind --leak-check=full <your program>). It checks for leaks when the process exits gracefully, so you will need a way to shut the process down cleanly.
DTrace with dbx on Solaris will also likely work.

Also, while running valgrind I have twice found the following message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have identified the lines of code involved, but could these errors lead to the core?
Yes, that could result in a SIGSEGV, as it's quite likely undefined behavior. (I'm not going to say it's definitely undefined behavior without seeing the actual code, but it likely is.) It's not especially likely that it would cause a SIGSEGV, but then again the intermittent failure you're seeing doesn't happen all that often either. So you do need to fix that problem.
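For reference, the usual way to silence that particular warning is to make sure every byte of the argument passed to semctl() is initialised before the call. A minimal sketch, assuming the call passes the POSIX-style union semun (the wrapper name setSemValue is just an illustration, not from the original code):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <cstring>

// On Linux the caller must define this union itself; some systems already
// provide it in <sys/sem.h> (see semctl(2)).
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

int setSemValue(int semid, int value)
{
    union semun arg;
    std::memset(&arg, 0, sizeof(arg));   // no uninitialised padding bytes reach the kernel
    arg.val = value;
    return semctl(semid, 0, SETVAL, arg);
}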
In addition to valgrind, on Solaris you can also use libumem and watchmalloc to check for problems managing heap memory. See the man pages for umem_debug and watchmalloc to get started.
To use dbx on Solaris, you need to have Solaris Studio installed (it's free). Solaris Studio also offers a way to use the run-time memory checking of dbx without having to directly invoke the dbx debugger. See the man page for bcheck. The bcheck man page will be in the Solaris Studio installation directory tree, in the man directory.
And if it is a memory leak, you should be able to see the process address space growing over time.
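If you want to confirm that growth from inside the process itself (rather than with ps or pmap), a minimal Linux-only sketch is to read /proc/self/statm periodically and write the figure to your log; the function name and the logging target below are placeholders:

#include <fstream>
#include <iostream>
#include <unistd.h>

// Logs the process's total virtual size (first field of /proc/self/statm, in pages).
void logAddressSpace()
{
    std::ifstream statm("/proc/self/statm");
    long pages = 0;
    statm >> pages;
    long pageKiB = sysconf(_SC_PAGESIZE) / 1024;
    std::cerr << "VmSize: " << (pages * pageKiB) << " KiB" << std::endl;
}

Called once per main-loop iteration, a leak shows up as a steadily climbing number long before operator new starts throwing, as in the backtrace above.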

Related

How can I find out the root cause of a SIGTRAP core dump in GDB

My app crashes randomly (about once a day) and I have tried several ways to find out the reason, but with no luck.
With other core dumps or segmentation faults I can locate where the crash happens with gdb, but in this case gdb doesn't give me much of a hint.
I need some advice to continue debugging; please help.
GDB output when my app crashed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/greystone/myapp/myapp'.
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0 0x00007f5d3a435afb in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
[Current thread is 1 (Thread 0x7f5cea3d4700 (LWP 14353))]
(gdb) bt full
#0 0x00007f5d3a435afb in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#1 0x00007f5d3a435c6f in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#2 0x00007f5d3a472742 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#3 0x00007f5d3a42cab3 in g_main_context_new () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#4 0x00007f5d3f4894c9 in QEventDispatcherGlibPrivate::QEventDispatcherGlibPrivate(_GMainContext*) () from /opt/Qt5.9.2/5.9.2/gcc_64/lib/libQt5Core.so.5
No symbol table info available.
#5 0x00007f5d3f4895a1 in QEventDispatcherGlib::QEventDispatcherGlib(QObject*) () from /opt/Qt5.9.2/5.9.2/gcc_64/lib/libQt5Core.so.5
No symbol table info available.
#6 0x00007f5d3f266870 in ?? () from /opt/Qt5.9.2/5.9.2/gcc_64/lib/libQt5Core.so.5
No symbol table info available.
#7 0x00007f5d3f267758 in ?? () from /opt/Qt5.9.2/5.9.2/gcc_64/lib/libQt5Core.so.5
No symbol table info available.
#8 0x00007f5d3efa76ba in start_thread (arg=0x7f5cea3d4700) at pthread_create.c:333
__res =
pd = 0x7f5cea3d4700
now =
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140037043603200, 4399946704104667801, 0, 140033278038543, 8388608, 140037073195984, -4344262468029171047, -4344357617020880231}, mask_was_saved = 0}},
priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call =
pagesize_m1 =
sp =
freesize =
__PRETTY_FUNCTION__ = "start_thread"
#9 0x00007f5d3e43c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
Solutions I have tried
Search topics related to SIGTRAP
People said it happens in debug mode when a breakpoint is set somewhere in the code. However, my app is compiled in release mode without breakpoints.
Catch the signal with a handler and ignore SIGTRAP
No success; I can only ignore a SIGTRAP sent by "kill -5 pid". When the SIGTRAP occurs randomly at runtime, my app still crashes.
Fix memory leak in code
Initialize pointer with nullptr
Double check mysql C API race conditions
Double-check array delete operations and double-check assignments to indices outside array boundaries
Check signals and slots
My app is built on the Qt framework as a GUI application; I have checked many signals and slots but have no idea how they could be related to the SIGTRAP core dump.
Check exceptions for opencv
I use OpenCV for image processing tasks. I have checked the exception cases.
Shared memory
Memory shared between the main process and sub-processes was carefully checked.
Example code
There is a lot of code in my app, but because gdb doesn't tell me exactly where the crash happens, I don't know which code I should share. If you need it to check or to make a suggestion, please tell me which part of the code you would like to see. My app has the following parts:
MySQL C API (MySQL 5.7.29)
User interface (a lot of it) with the Qt framework 5.9.2
Image processing with OpenCV 2.4.9
Process flow in multiple threads with the Qt framework 5.9.2
If you have any ideas, please share some keywords so I can search for them and apply them to my app. Thanks for your help.
in this case gdb doesn't give me much of a hint
GDB tells you exactly what happened, you just didn't understand it.
What's happening is that some code in libglib called g_logv(..., G_LOG_FLAG_FATAL, ...), which eventually calls _g_log_abort(), which executes an int3 (debug breakpoint) instruction.
You should be able to (gdb) x/i 0x00007f5d3a435afb and see that instruction.
It looks like g_main_context_new() may have failed to allocate memory.
In any case, you should look in the application stderr logs for the reason libglib is terminating your program (effectively, libglib calls an equivalent of abort, because some precondition has failed).
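If that stderr output is getting lost (redirected to /dev/null, swallowed by a logging wrapper, etc.), one option is to install your own glib log handler so the fatal message is captured somewhere you control before the trap fires. g_log_set_default_handler() and the GLogFunc signature below are standard glib API, but wiring it into this particular application is an assumption on my part:

#include <glib.h>
#include <cstdio>

// Forward every glib log message to stderr before glib decides whether the
// message is fatal and raises SIGTRAP.
static void logToStderr(const gchar *domain, GLogLevelFlags level,
                        const gchar *message, gpointer /*user_data*/)
{
    std::fprintf(stderr, "[glib:%s] level=0x%x: %s\n",
                 domain ? domain : "(default)", (unsigned)level, message);
}

// Early in main(), before Qt creates its glib event dispatcher:
//     g_log_set_default_handler(logToStderr, nullptr);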

SIGABRT invoked via istringstream.getline()

The scenario: a file is read into an unsigned char buffer, the buffer is put into an istringstream, and the lines in it are iterated through.
istringstream data((char*)buffer);
char line[1024];
while (data.good()) {
    data.getline(line, 1024);
    [...]
}
if (data.rdstate() & (ios_base::badbit | ios_base::failbit))
    throw foobarException(...);
Initially, it was this foobarException which was being caught, which didn't say much because it was a very unlikely case -- the file is /proc/stat, the buffer is fine, and only the first few lines are actually iterated this way then the rest of the data is discarded (and the loop broken out of). In fact, that clause has never fired previously.1
I want to stress the point about only the first few lines being used, and that in debugging etc. the buffer obviously still has plenty of data left in it before the failure, so nothing is hitting EOF.
I stepped through with a debugger to check the buffer was being filled from the file appropriately and each iteration of getline() was getting what it should, right up until the mysterious point of failure -- although since this was a fatal error, there was not much more information to get at that point. I then changed the above code to trap and report the error in more detail:
istringstream data((char*)buffer);
data.exceptions(istringstream::failbit | istringstream::badbit);
char line[1024];
while (data.good()) {
try { data.getline(line, 1024); }
catch (istringstream::failure& ex) {
And suddenly things changed -- rather than catching and reporting an error, the process was dying through SIGABRT inside the try. A backtrace looks like this:
#0 0x00007ffff6b2fa28 in __GI_raise (sig=sig#entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007ffff6b3162a in __GI_abort () at abort.c:89
#2 0x00007ffff7464add in __gnu_cxx::__verbose_terminate_handler ()
at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007ffff7462936 in __cxxabiv1::__terminate (handler=<optimized out>)
at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007ffff7462981 in std::terminate ()
at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:57
#5 0x00007ffff7462b99 in __cxxabiv1::__cxa_throw (obj=obj#entry=0x6801a0,
tinfo=0x7ffff7749740 <typeinfo for std::ios_base::failure>,
dest=0x7ffff7472890 <std::ios_base::failure::~failure()>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:87
#6 0x00007ffff748b9a6 in std::__throw_ios_failure (
__s=__s#entry=0x7ffff7512427 "basic_ios::clear")
at ../../../../../libstdc++-v3/src/c++11/functexcept.cc:126
#7 0x00007ffff74c938a in std::basic_ios<char, std::char_traits<char> >::clear
(this=<optimized out>, __state=<optimized out>)
---Type <return> to continue, or q <return> to quit---
at /usr/src/debug/gcc-5.3.1-20160406/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_ios.tcc:48
#8 0x00007ffff747a74f in std::basic_ios<char, std::char_traits<char> >::setstate (__state=<optimized out>, this=<optimized out>)
at /usr/src/debug/gcc-5.3.1-20160406/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_ios.h:158
#9 std::istream::getline (this=0x7fffffffcd80, __s=0x7fffffffcd7f "",
__n=1024, __delim=<optimized out>)
at ../../../../../libstdc++-v3/src/c++98/istream.cc:106
#10 0x000000000041225a in SSMlinuxMetrics::cpuModule::getLevels (this=0x67e040)
at cpuModule.cpp:179
cpuModule.cpp:179 is the try { data.getline(line, 1024) }.
According to these couple of questions:
When does a process get SIGABRT (signal 6)?
What causes a SIGABRT fault?
It sounds like there are really only two possibilities here:
1. I've gone out of bounds somewhere and corrupted the istringstream instance.
2. There's a bug in the library.
Since #2 seems unlikely and I can't find a case for #1 -- e.g., when run under valgrind there are no errors before the abort:
==8886== Memcheck, a memory error detector
==8886== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8886== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==8886== Command: ./monitor_demo
==8886==
terminate called after throwing an instance of 'std::ios_base::failure'
what(): basic_ios::clear
==8886==
==8886== Process terminating with default action of signal 6 (SIGABRT)
==8886== at 0x5D89A28: raise (raise.c:55)
==8886== by 0x5D8B629: abort (abort.c:89)
And (of course) "it has been working fine up until now", I'm stumped.
Beyond squinting at code and trying to isolate paths until I find the problem or have an SSCCE demonstrating a bug, is there anything I'm ignorant of that might provide a quick solution?
1. The project is an incomplete one I've come back to after a few months, during which time I know glibc was upgraded on the system.
I believe strangeqargo's guess, that it was caused by an ABI incompatibility introduced by the libc upgrade, is correct. The system had been up for 8 days and the update had occurred during that time.
After a reboot, and with no changes whatsoever to the code, it compiles and runs without error. I tested on another system as well, with the same result.
Probably the moral is: if you notice glibc has been updated, reboot the system...

How to retain a stacktrace when a Cortex-M3 goes into hard fault?

Using the following setup:
Cortex-M3 based µC
gcc-arm cross toolchain
using C and C++
FreeRtos 7.5.3
Eclipse Luna
Segger Jlink with JLinkGDBServer
Code Confidence FreeRtos debug plugin
Using JLinkGDBServer and eclipse as debug frontend, I always have a nice stacktrace when stepping through my code. When using the Code Confidence freertos tools (eclipse plugin), I also see the stacktraces of all threads which are currently not running (without that plugin, I see just the stacktrace of the active thread). So far so good.
But now, when my application falls into a hard fault, the stacktrace is lost.
Well, I know the technique on how to find out the code address which causes the hardfault (as seen here).
But this is very poor information compared to full stacktrace.
Ok, sometimes when falling into a hard fault there is no way to retain a stacktrace, e.g. when the stack is corrupted by the faulty code. But if the stack is healthy, I think that getting a stacktrace should be possible (shouldn't it?).
I think the reason for losing the stacktrace in a hard fault is that the stack pointer is switched from PSP to MSP automatically by the Cortex-M3 architecture. One idea is now to (maybe) set the MSP to the previous PSP value (and maybe do some additional stack preparation?).
Any suggestions on how to do that or other approaches to retain a stacktrace when in hardfault?
Edit 2015-07-07, added more details.
I use this code to provoke a hardfault:
__attribute__((optimize("O0"))) static void checkHardfault() {
volatile uint32_t* varAtOddAddress = (uint32_t*)-1;
(*varAtOddAddress)++;
}
When stepping into checkHardfault(), my stacktrace looks good like this:
gdb-> backtrace
#0 checkHardfault () at Main.cxx:179
#1 0x100360f6 in GetOneEvent () at Main.cxx:185
#2 0x1003604e in executeMainLoop () at Main.cxx:121
#3 0x1001783a in vMainTask (pvParameters=0x0) at Main.cxx:408
#4 0x00000000 in ?? ()
When I run into the hardfault (at (*varAtOddAddress)++;) and find myself inside the HardFault_Handler(), the stacktrace is:
gdb-> backtrace
#0 HardFault_Handler () at Hardfault.c:312
#1 <signal handler called>
#2 0x10015f36 in prvPortStartFirstTask () at freertos/portable/GCC/ARM_CM3/port.c:224
#3 0x10015fd6 in xPortStartScheduler () at freertos/portable/GCC/ARM_CM3/port.c:301
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
The quickest way to get the debugger to give you the details of the state prior to the hard fault is to return the processor to the state prior to the hard fault.
In the debugger, write a script that takes the information from the various hardware registers and restore PC, LR, R0-R14 to the state just prior to causing the hard fault, then do your stack dump.
Of course, this isn't always helpful when you end up at the hard fault because of popping stuff off of a blown stack or stomping on stuff in memory. You generally tend to corrupt a bunch of the important registers, return back to some crazy spot in memory, and then execute whatever's there. You can end up hard faulting many thousands (millions?) of cycles after your real problem happens.
Consider using the following gdb macro to restore the register contents:
define hfstack
set $frame_ptr = (unsigned *)$sp
if $lr & 0x10
set $sp = $frame_ptr + (8 * 4)
else
set $sp = $frame_ptr + (26 * 4)
end
set $lr = $frame_ptr[5]
set $pc = $frame_ptr[6]
bt
end
document hfstack
set the correct stack context after a hard fault on Cortex M
end
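A complementary approach to the gdb macro is to grab the stacked exception frame in the handler itself, so the faulting PC and LR are available even without a debugger attached. This is only a sketch of the widely used Cortex-M pattern, shown as C (in a C++ translation unit hardfault_c would need extern "C" linkage); it replaces whatever HardFault_Handler your startup code currently provides:

#include <stdint.h>

// Bit 2 of EXC_RETURN (held in LR on handler entry) tells us whether the
// faulting context was using the process stack (PSP) or the main stack (MSP).
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst lr, #4        \n"
        "ite eq            \n"
        "mrseq r0, msp     \n"
        "mrsne r0, psp     \n"
        "b hardfault_c     \n"
    );
}

// The eight words pushed by hardware on exception entry.
void hardfault_c(uint32_t *frame)
{
    volatile uint32_t r0  = frame[0];
    volatile uint32_t r1  = frame[1];
    volatile uint32_t r2  = frame[2];
    volatile uint32_t r3  = frame[3];
    volatile uint32_t r12 = frame[4];
    volatile uint32_t lr  = frame[5];   /* return address of the interrupted code */
    volatile uint32_t pc  = frame[6];   /* the faulting instruction */
    volatile uint32_t psr = frame[7];
    (void)r0; (void)r1; (void)r2; (void)r3; (void)r12; (void)lr; (void)pc; (void)psr;
    for (;;) { }   /* park here; inspect 'frame' and the locals in gdb */
}

With this in place, breaking in hardfault_c (or using the hfstack macro above) gives you the program counter of the faulting instruction even though the automatic backtrace is gone.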

Crashing C++ application on production machine due to segmentation fault

We are facing a C++ application crash due to a segmentation fault on Red Hat Linux. We are using embedded Python in C++.
Please find below our limitations:
We don't have access to the production machine where the application crashes. The client sends us core dump files when the application crashes.
The problem is not reproducible on our test machine, which has exactly the same configuration as the production machine.
Sometimes the application crashes after 1 hour, 4 hours … 1 day or 1 week. We haven't found a time frame or any specific pattern in which the application crashes.
The application is complex, and embedded Python code is called from a lot of places within the application. We have done extensive code reviews but couldn't find the problem that way.
As per the stack trace in the core dump, it is crashing around a multiplication operation. We reviewed the code for such operations but couldn't find any place where one is performed; it may be that they are called through Python scripts executed by the embedded interpreter, which we don't control and can't review.
We can't use any profiling tool, like Valgrind, in the production environment.
We are using gdb on our local machine to analyze the core dumps. We can't run gdb on the production machine.
Please find below the efforts we have put in.
We have analyzed the logs and continuously replayed the requests coming towards our application in our test environment to try to reproduce the problem.
We are not finding a crash point in the logs; every time we get different logs. I think this is because memory is smashed somewhere else and the application crashes some time later.
We have checked the load on our application at every point and it never exceeded our application limit.
Memory utilization of our application is also normal.
We have profiled our application with the help of Valgrind on our test machine and removed the valgrind errors, but the application is still crashing.
I would appreciate any help in guiding us on how to proceed further to solve the problem.
Below are the version details:
Red Hat Linux Server 5.6 (Tikanga)
Python 2.6.2, GCC 4.1
Following is the stack trace I am getting (on my machine) from the core dump files they have shared. FYI, we don't have access to the production machine to run gdb on the core dump files.
#0 0x00000033c6678630 in ?? ()
#1 0x00002b59d0e9501e in PyString_FromFormatV (format=0x2b59d0f2ab00 "can't multiply sequence by non-int of type '%.200s'", vargs=0x46421f20) at Objects/stringobject.c:291
#2 0x00002b59d0ef1620 in PyErr_Format (exception=0x2b59d1170bc0, format=<value optimized out>) at Python/errors.c:548
#3 0x00002b59d0e4bf1c in PyNumber_Multiply (v=0x2aaaac080600, w=0x2b59d116a550) at Objects/abstract.c:1192
#4 0x00002b59d0ede326 in PyEval_EvalFrameEx (f=0x732b670, throwflag=<value optimized out>) at Python/ceval.c:1119
#5 0x00002b59d0ee2493 in call_function (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:3794
#6 PyEval_EvalFrameEx (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:2389
#7 0x00002b59d0ee2493 in call_function (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:3794
#8 PyEval_EvalFrameEx (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:2389
#9 0x00002b59d0ee2493 in call_function (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:3794
#10 PyEval_EvalFrameEx (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:2389
#11 0x00002b59d0ee2493 in call_function (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:3794
#12 PyEval_EvalFrameEx (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:2389
#13 0x00002b59d0ee2d9f in ?? () at Python/ceval.c:2968 from /usr/local/lib/libpython2.6.so.1.0
#14 0x0000000000000007 in ?? ()
#15 0x00002b59d0e83042 in lookdict_string (mp=<value optimized out>, key=0x46424dc0, hash=40722104) at Objects/dictobject.c:412
#16 0x00002aaab09d5458 in ?? ()
#17 0x00002aaab09d5458 in ?? ()
#18 0x00002aaab02a91f0 in ?? ()
#19 0x00002aaab0b2c3a0 in ?? ()
#20 0x0000000000000004 in ?? ()
#21 0x00000000026d5eb8 in ?? ()
#22 0x00002aaab0b2c3a0 in ?? ()
#23 0x00002aaab071e080 in ?? ()
#24 0x0000000046422bf0 in ?? ()
#25 0x0000000046424dc0 in ?? ()
#26 0x00000000026d5eb8 in ?? ()
#27 0x00002aaab0987710 in ?? ()
#28 0x00002b59d0ee2de2 in PyEval_EvalFrame (f=0x0) at Python/ceval.c:538
#29 0x0000000000000000 in ?? ()
You are almost certainly doing something bad with pointers in your C++ code, which can be very tough to debug.
Do not assume that the stack trace is relevant. It might be relevant, but pointer misuse can often lead to crashes some time later.
Build with full warnings on. The compiler can point out some non-obvious pointer misuse, such as returning a reference to a local.
Investigate your arrays. Try replacing arrays with std::vector (C++03) or std::array (C++11) so you can iterate using begin() and end() and you can index using at().
Investigate your pointers. Replace them with std::unique_ptr (C++11) or boost::scoped_ptr wherever you can (there should be no overhead in release builds). Replace the rest with shared_ptr or weak_ptr. Any that can't be replaced are probably the source of problematic logic.
Because of the very problems you're seeing, modern C++ allows almost all raw pointer usage to be removed entirely. Try it (a minimal sketch follows below).
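A minimal sketch of the kind of replacement meant above (Widget and the sizes are made-up placeholders; with the GCC 4.1 mentioned in the question you would reach for boost::scoped_ptr instead of std::unique_ptr):

#include <iostream>
#include <memory>
#include <stdexcept>
#include <vector>

struct Widget { int value; };

int main()
{
    // Raw array -> std::vector with checked access: an out-of-range index
    // becomes a catchable exception instead of silent memory corruption.
    std::vector<int> samples(10, 0);
    try {
        samples.at(42) = 1;
    } catch (const std::out_of_range& e) {
        std::cerr << "bad index: " << e.what() << '\n';
    }

    // Owning raw pointer -> std::unique_ptr: no delete to forget and no
    // chance of a double delete.
    std::unique_ptr<Widget> w(new Widget());
    w->value = 7;
    return 0;
}   // w is released here automatically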
First things first, compile both your binary and libpython with debug symbols and push it out. The stack trace will be much easier to follow.
The relevant argument to g++ is -g.
Suggestions:
As already suggested, provide a complete debug build
Provide a memory test tool and a CPU torture test
Load the debug symbols of the Python library when analyzing the core dump
The stacktrace shows something concerning eval(), so I guess you do dynamic code generation and evaluation/execution. If so, the actual error might be within this code or the arguments passed to it. Assertions at every interface to that code, plus code dumps, may help (a sketch of such boundary checks follows below).
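As a concrete example of assertions at the interface, every call from the C++ side into the embedded interpreter can be funnelled through a small checked wrapper. PyObject_CallObject, PyCallable_Check, PyTuple_Check and PyErr_Print are standard Python C API; the wrapper itself (checkedCall) is only an illustrative sketch, not part of the original application:

#include <Python.h>
#include <cassert>

// Validate everything handed to and received from the interpreter so a bad
// object fails loudly here, at the boundary, instead of deep inside ceval.c.
PyObject* checkedCall(PyObject* callable, PyObject* args)
{
    assert(callable != NULL && PyCallable_Check(callable));
    assert(args == NULL || PyTuple_Check(args));

    PyObject* result = PyObject_CallObject(callable, args);
    if (result == NULL) {
        PyErr_Print();   /* dump the Python traceback instead of crashing later */
    }
    return result;       /* caller owns the reference; may be NULL */
}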

infinite abort() in a backtrace of a C++ program core dump

I have a strange problem that I can't solve. Please help!
The program is a multithreaded C++ application that runs on an ARM Linux machine. Recently I began testing it in long runs, and sometimes it crashes after 1-2 days like so:
*** glibc detected *** /root/client/my_program: free(): invalid pointer: 0x002a9408 ***
When I open the core dump I see that the main thread seems to have a corrupt stack: all I can see is infinite abort() calls.
GNU gdb (GDB) 7.3
...
This GDB was configured as "--host=i686 --target=arm-linux".
[New LWP 706]
[New LWP 700]
[New LWP 702]
[New LWP 703]
[New LWP 704]
[New LWP 705]
Core was generated by `/root/client/my_program'.
Program terminated with signal 6, Aborted.
#0 0x001c44d4 in raise ()
(gdb) bt
#0 0x001c44d4 in raise ()
#1 0x001c47e0 in abort ()
#2 0x001c47e0 in abort ()
#3 0x001c47e0 in abort ()
#4 0x001c47e0 in abort ()
#5 0x001c47e0 in abort ()
#6 0x001c47e0 in abort ()
#7 0x001c47e0 in abort ()
#8 0x001c47e0 in abort ()
#9 0x001c47e0 in abort ()
#10 0x001c47e0 in abort ()
#11 0x001c47e0 in abort ()
And it goes on and on. I tried to get to the bottom of it by moving up the stack: frame 3000 or even more, but eventually the core dump runs out of frames and I still can't see why this happened.
When I examine the other threads everything seems normal there.
(gdb) info threads
Id Target Id Frame
6 LWP 705 0x00132f04 in nanosleep ()
5 LWP 704 0x001e7a70 in select ()
4 LWP 703 0x00132f04 in nanosleep ()
3 LWP 702 0x00132318 in sem_wait ()
2 LWP 700 0x00132f04 in nanosleep ()
* 1 LWP 706 0x001c44d4 in raise ()
(gdb) thread 5
[Switching to thread 5 (LWP 704)]
#0 0x001e7a70 in select ()
(gdb) bt
#0 0x001e7a70 in select ()
#1 0x00057ad4 in CSerialPort::read (this=0xbea7d98c, string_buffer=..., delimiter=..., timeout_ms=1000) at CSerialPort.cpp:202
#2 0x00070de4 in CScanner::readResponse (this=0xbea7d4cc, resp_recv=..., timeout=1000, delim=...) at PidScanner.cpp:657
#3 0x00071198 in CScanner::sendExpect (this=0xbea7d4cc, cmd=..., exp_str=..., rcv_str=..., timeout=1000) at PidScanner.cpp:604
#4 0x00071d48 in CScanner::pollPid (this=0xbea7d4cc, mode=1, pid=12, pid_str=...) at PidScanner.cpp:525
#5 0x00072ce0 in CScanner::poll1 (this=0xbea7d4cc)
#6 0x00074c78 in CScanner::Poll (this=0xbea7d4cc)
#7 0x00089edc in CThread5::Thread5Poll (this=0xbea7d360)
#8 0x0008c140 in CThread5::run (this=0xbea7d360)
#9 0x00088698 in CThread::threadFunc (p=0xbea7d360)
#10 0x0012e6a0 in start_thread ()
#11 0x001e90e8 in clone ()
#12 0x001e90e8 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(Class and function names are a bit weird because I changed them. :-)
So, thread #1 is where the stack is corrupt, and the backtrace of every other thread (2-6) shows
Backtrace stopped: previous frame identical to this frame (corrupt stack?).
That happens because threads 2-6 are created in thread #1.
The thing is that I can't run the program in gdb because it runs on an embedded system, and I can't use a remote gdb server. The only option is examining the core dumps, which don't occur very often.
Could you please suggest something that could move me forward with this? (Maybe there is something else I can extract from the core dump, or maybe there is some way to add hooks in the code to catch the abort() call.)
UPDATE: Basile Starynkevitch suggested using Valgrind, but it turns out it is only ported to ARMv7. I have an ARM926, which is ARMv5, so this won't work for me. There are some efforts to compile valgrind for ARMv5 though: Valgrind cross compilation for ARMv5tel, valgrind on the ARM9.
UPDATE 2: I couldn't make Electric Fence work with my program. The program uses C++ and pthreads. The version of Efence I got, 2.1.13, crashed in an arbitrary place after I started a thread and tried to do something more or less complicated (for example putting a value into an STL vector). I saw people mentioning some patches for Efence on the web but didn't have time to try them. I tried this on my Linux PC, not on the ARM, and other tools like valgrind or Dmalloc don't report any problems with the code. So, everyone using version 2.1.13 of Efence: be prepared to have problems with pthreads (or maybe pthreads + C++ + STL, I don't know).
My guess for the "infinite" aborts is that either abort() causes a loop (e.g. abort -> signal handler -> abort -> ...) or that gdb can't correctly interpret the frames on the stack.
In either case I would suggest manually checking out the stack of the problematic thread. If abort causes a loop, you should see a pattern or at least the return address of abort repeating every so often. Perhaps you can then more easily find the root of the problem by manually skipping large parts of the (repeating) stack.
Otherwise, you should find that there is no repeating pattern and hopefully the return address of the failing function somewhere on the stack. In the worst case such addresses are overwritten due to a buffer overflow or such, but perhaps then you can still get lucky and recognise what it is overwritten with.
One possibility here is that something in that thread has very, very badly smashed the stack by vastly overwriting an on-stack data structure, destroying all the needed data on the stack in the process. That makes postmortem debugging very unpleasant.
If you can reproduce the problem at will, the right thing to do is to run the thread under gdb and watch what is going on precisely at the moment when the stack gets nuked. This may, in turn, require some sort of careful search to determine where exactly the error is happening.
If you cannot reproduce the problem at will, the best I can suggest is very carefully looking for clues in the thread local storage for that thread to see if it hints at where the thread was executing before death hit.
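If reproducing under a debugger stays out of reach, one more option, in the spirit of the question's idea of hooking abort(), is a SIGABRT handler that dumps raw return addresses before the core is written. backtrace() and backtrace_symbols_fd() are standard glibc calls, but how useful they are on this ARMv5 target depends on unwind tables or frame pointers being available, so treat this as a best-effort sketch:

#include <csignal>
#include <execinfo.h>
#include <unistd.h>

// Print raw return addresses to stderr, then re-raise SIGABRT with the
// default action so the core dump is still produced.
static void onAbort(int sig)
{
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);   /* avoids malloc in the handler */

    signal(sig, SIG_DFL);
    raise(sig);
}

// Early in main():
//     signal(SIGABRT, onAbort);

The printed addresses can then be resolved offline with addr2line against the unstripped binary.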