The scenario is a file read into an unsigned char buffer, the buffer is put into an istringstream, and the lines in it iterated through.
istringstream data((char*)buffer);
char line[1024]
while (data.good()) {
data.getline(line, 1024);
[...]
}
if (
data.rdstate() & (ios_base::badbit | ios_base::failbit)
) throw foobarException (
Initially, it was this foobarException which was being caught, which didn't say much because it was a very unlikely case -- the file is /proc/stat, the buffer is fine, and only the first few lines are actually iterated this way then the rest of the data is discarded (and the loop broken out of). In fact, that clause has never fired previously.1
I want to stress the point about about only the first few lines being used, and that in debugging etc. the buffer obviously still has plenty of data left in it before the failure, so nothing is hitting EOF.
I stepped through with a debugger to check the buffer was being filled from the file appropriately and each iteration of getline() was getting what it should, right up until the mysterious point of failure -- although since this was a fatal error, there was not much more information to get at that point. I then changed the above code to trap and report the error in more detail:
istringstream data((char*)buffer);
data.exceptions(istringstream::failbit | istringstream::badbit);
char line[1024];
while (data.good()) {
try { data.getline(line, 1024); }
catch (istringstream::failure& ex) {
And suddenly things changed -- rather than catching and reporting an error, the process was dying through SIGABRT inside the try. A backtrace looks like this:
#0 0x00007ffff6b2fa28 in __GI_raise (sig=sig#entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007ffff6b3162a in __GI_abort () at abort.c:89
#2 0x00007ffff7464add in __gnu_cxx::__verbose_terminate_handler ()
at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007ffff7462936 in __cxxabiv1::__terminate (handler=<optimized out>)
at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007ffff7462981 in std::terminate ()
at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:57
#5 0x00007ffff7462b99 in __cxxabiv1::__cxa_throw (obj=obj#entry=0x6801a0,
tinfo=0x7ffff7749740 <typeinfo for std::ios_base::failure>,
dest=0x7ffff7472890 <std::ios_base::failure::~failure()>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:87
#6 0x00007ffff748b9a6 in std::__throw_ios_failure (
__s=__s#entry=0x7ffff7512427 "basic_ios::clear")
at ../../../../../libstdc++-v3/src/c++11/functexcept.cc:126
#7 0x00007ffff74c938a in std::basic_ios<char, std::char_traits<char> >::clear
(this=<optimized out>, __state=<optimized out>)
---Type <return> to continue, or q <return> to quit---
at /usr/src/debug/gcc-5.3.1-20160406/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_ios.tcc:48
#8 0x00007ffff747a74f in std::basic_ios<char, std::char_traits<char> >::setstate (__state=<optimized out>, this=<optimized out>)
at /usr/src/debug/gcc-5.3.1-20160406/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_ios.h:158
#9 std::istream::getline (this=0x7fffffffcd80, __s=0x7fffffffcd7f "",
__n=1024, __delim=<optimized out>)
at ../../../../../libstdc++-v3/src/c++98/istream.cc:106
#10 0x000000000041225a in SSMlinuxMetrics::cpuModule::getLevels (this=0x67e040)
at cpuModule.cpp:179
cpuModule.cpp:179 is the try { data.getline(line, 1024) }.
According to these couple of questions:
When does a process get SIGABRT (signal 6)?
What causes a SIGABRT fault?
It sounds like there are really only two possibilities here:
I've gone out of bounds somewhere and corrupted the istringstream instance.
There's a bug in the library.
Since 2 seems unlikely and I can't find a case for #1 -- e.g., run in valgrind there are no errors before the abort:
==8886== Memcheck, a memory error detector
==8886== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8886== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==8886== Command: ./monitor_demo
==8886==
terminate called after throwing an instance of 'std::ios_base::failure'
what(): basic_ios::clear
==8886==
==8886== Process terminating with default action of signal 6 (SIGABRT)
==8886== at 0x5D89A28: raise (raise.c:55)
==8886== by 0x5D8B629: abort (abort.c:89)
And (of course) "it has been working fine up until now", I'm stumped.
Beyond squinting at code and trying to isolate paths until I find the problem or have an SSCCE demonstrating a bug, is there anything I'm ignorant of that might provide a quick solution?
1. The project is an incomplete one I've come back to after a few months, during which time I know glibc was upgraded on the system.
I believe strangeqargo's guess, that it was because of an ABI compatibility introduced by the libc upgrade is correct. The system had been up for 8 days and the update had occurred during that time.
After a reboot, and with no changes what-so-ever to the code, it compiles and runs without error. I tested on another system as well, same result.
Probably the moral is if you notice glibc has been updated, reboot the system...
Related
I am getting lots of warnings from OpenSceneGraph and all look like this:
Warning:: Picked up error in TriangleIntersect
(-117448 -2.12751e+06 -519242, -120167 -2.17679e+06 -383117, -234607 -1.85755e+06 -431865)
(-nan, -nan, -nan)
And unfortunately I cannot trace the origin of them.
I tried to launch my program, then interrupt it with CTRL + C
and with bt print the back-trace. It gives me only some basic trace from application workflow like:
#0 0x00007fffe8116bf9 in __GI___poll (fds=0x5555584549f0, nfds=3, timeout=461) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fffe45395c9 in () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007fffe45396dc in g_main_context_iteration () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007ffff2a2897f in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#4 0x00007ffff29cd9fa in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#5 0x00007ffff29d6aa4 in QCoreApplication::exec() () at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
The warnings occur when entering:
void osgViewer::ViewerBase::frame()
I tried to enter it also but it escapes immediately, prints the warnings and continues the program flow. I suppose it triggers some actions maybe in other threads.
And here comes my question, is there a chance of getting to the origin/trace of those warning messages with GDB?
I am using OpenSceneGraph 3.4
The error is coming from this line:
https://github.com/jklimke/osg/blob/master/src/osgUtil/LineSegmentIntersector.cpp#L174
That code and error isn't even in the current trunk/head OSG source, so maybe your problem could be resolved by simply updating to the latest version of OSG.
I have a single threaded program that crashes consistently at certain points right after free() is called when running in non-debug mode.
When in debug mode however, debugger breaks on the line that calls free() even though there are no break points set. When I try to step to the next line again, debugger breaks again on the same line. Stepping once again resumes execution as normal. No crash, no segfault, nothing.
EDIT-1: Contrary to what I wrote above, crashes in non-debug mode
turns out to be inconsistent, which makes me think I am somehow
writing somewhere that I shouldn't. (Breaks in debug mode are
still consistent, though.)
Call stack at the breaks shows some windows library functions(I think) called after the function that calls free() statement. I have no idea how to interpret them. And consequently, I have no idea how to go about debugging in this situation.
I have provided the call stacks at break points below. Can someone point me in a direction where I can tackle the problem? What might be causing the breaks in debugger mode?
Program is run on Windows Vista, compiled with gcc 4.9.2, debugger used is gdb. Assume double release is not the case.(I use ::operator new and ::operator delete overloads that catch that. Situation described is the same without these overloads as well.)
Note that the crash(or the involuntary breaks in debugger) is consistent. Happens every time, in the same execution point.
Here is the call stack at the initial break:
(Note that free_wrapper() is the function that houses free() statement that causes the crash/breaks.)
#0 0x770186ff ntdll!DbgBreakPoint() (C:\Windows\system32\ntdll.dll:??)
#1 0x77082edb ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#2 0x7706b953 ntdll!RtlImageRvaToVa() (C:\Windows\system32\ntdll.dll:??)
#3 0x77052c4f ntdll!RtlQueryRegistryValues() (C:\Windows\system32\ntdll.dll:??)
#4 0x77083f3b ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#5 0x7704bcfd ntdll!EtwSendNotification() (C:\Windows\system32\ntdll.dll:??)
#6 0x770374d5 ntdll!RtlEnumerateGenericTableWithoutSplaying() (C:\Windows\system32\ntdll.dll:??)
#7 0x75829dc6 KERNEL32!HeapFree() (C:\Windows\system32\kernel32.dll:??)
#8 0x75a99c03 msvcrt!free() (C:\Windows\system32\msvcrt.dll:??)
#9 0x350000 ?? () (??:??)
--> #10 0x534020 free_wrapper(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\Unrelated\MemMgmt.cpp:282)
#11 0x407f74 operator delete(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1002)
#12 0x629a74 __gnu_cxx::new_allocator<char>::deallocate(this=0x22f718, __p=0x352af0 "\nÿÿÿÿÿÿº\r%") (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/ext/new_allocator.h:110)
#13 0x6c2257 std::allocator_traits<std::allocator<char> >::deallocate(__a=..., __p=0x352af0 "\nÿÿÿÿÿÿº\r%", __n=50) (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/bits/alloc_traits.h:383)
#14 0x611940 basic_CDataUnit<std::allocator<char> >::~basic_CDataUnit(this=0x22f714, __vtt_parm=0x781df4 <VTT for basic_CDataUnit_TDB<std::allocator<char> >+4>, __in_chrg=<optimized out>) (include/DataUnit/CDataUnit.h:112)
#15 0x61dfa1 basic_CDataUnit_TDB<std::allocator<char> >::~basic_CDataUnit_TDB(this=0x22f714, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) (include/DataUnit/CDataUnit_TDB.h:125)
#16 0x503898 CTblSegHandle::UpdateChainedRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:912)
#17 0x502fcc CTblSegHandle::UpdateRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:764)
#18 0x443272 UpdateRow(row_addr=..., new_data_unit=..., vColTypes=..., block_hnd=..., seg_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:910)
#19 0x443470 UpdateRow(row_addr=..., vColValues=..., vColTypes=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:935)
#20 0x4023e3 test_RowChaining() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:234)
#21 0x4081c6 main() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1034)
And here is the call stack when I step to the next line and debugger breaks one last time before resuming normal execution:
#0 0x770186ff ntdll!DbgBreakPoint() (C:\Windows\system32\ntdll.dll:??)
#1 0x77082edb ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#2 0x77052c7f ntdll!RtlQueryRegistryValues() (C:\Windows\system32\ntdll.dll:??)
#3 0x77083f3b ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#4 0x7704bcfd ntdll!EtwSendNotification() (C:\Windows\system32\ntdll.dll:??)
#5 0x770374d5 ntdll!RtlEnumerateGenericTableWithoutSplaying() (C:\Windows\system32\ntdll.dll:??)
#6 0x75829dc6 KERNEL32!HeapFree() (C:\Windows\system32\kernel32.dll:??)
#7 0x75a99c03 msvcrt!free() (C:\Windows\system32\msvcrt.dll:??)
#8 0x350000 ?? () (??:??)
--> #9 0x534020 free_wrapper(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\Unrelated\MemMgmt.cpp:282)
#10 0x407f74 operator delete(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1002)
#11 0x629a74 __gnu_cxx::new_allocator<char>::deallocate(this=0x22f718, __p=0x352af0 "\nÿÿÿÿÿÿº\r%") (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/ext/new_allocator.h:110)
#12 0x6c2257 std::allocator_traits<std::allocator<char> >::deallocate(__a=..., __p=0x352af0 "\nÿÿÿÿÿÿº\r%", __n=50) (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/bits/alloc_traits.h:383)
#13 0x611940 basic_CDataUnit<std::allocator<char> >::~basic_CDataUnit(this=0x22f714, __vtt_parm=0x781df4 <VTT for basic_CDataUnit_TDB<std::allocator<char> >+4>, __in_chrg=<optimized out>) (include/DataUnit/CDataUnit.h:112)
#14 0x61dfa1 basic_CDataUnit_TDB<std::allocator<char> >::~basic_CDataUnit_TDB(this=0x22f714, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) (include/DataUnit/CDataUnit_TDB.h:125)
#15 0x503898 CTblSegHandle::UpdateChainedRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:912)
#16 0x502fcc CTblSegHandle::UpdateRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:764)
#17 0x443272 UpdateRow(row_addr=..., new_data_unit=..., vColTypes=..., block_hnd=..., seg_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:910)
#18 0x443470 UpdateRow(row_addr=..., vColValues=..., vColTypes=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:935)
#19 0x4023e3 test_RowChaining() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:234)
#20 0x4081c6 main() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1034)
When I see a call stack that looks like yours the most common cause is heap corruption. A double free or attempting to free a pointer that was never allocated can have similar call stacks. Since you characterize the crash as inconsistent that makes heap corruption the more likely candidate. Double frees and freeing unallocated pointers tend to crash consistently in the same place. To hunt down issues like this I usually:
Install Debugging Tools for Windows
Open a command prompt with elevated privileges
Change directory to the directory that Debugging Tools for Windows is installed in.
Enable full page heap by running gflags.exe -p /enable applicationName.exe /full
Launch application with debugger attached and recreate the issue.
Disable full page heap for the application by running gflags.exe -p /disable applicationName.exe
Running the application with full page heap places an inaccessible page at the end of each allocation so that the program stops immediately if it accesses memory beyond the allocation. This is according to the page GFlags and PageHeap. If a buffer overflow is causing the heap corruption this setting should cause the debugger to break when the overflow occurs..
Make sure to disable page heap when you are done debugging. Running under full page heap can greatly increase memory pressure on an application by making every heap allocation consume an entire page.
You can use valgrind to check if there is any invalid read /write or any invalid free is there in your CODE.
valgrind -v --leak-check=full --show-reachable=yes --log-file=log_valgrind ./Process
log_valgrind will contains invalid read/write.
I have a core on both Solaris/Linux platforms and I don´t see the problem.
On a linux platform, I have the following core:
(gdb) where
#0 0x001aa81b in do_lookup_x () from /lib/ld-linux.so.2
#1 0x001ab0da in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2 0x001afa05 in _dl_fixup () from /lib/ld-linux.so.2
#3 0x001b5c90 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x00275e4c in __gxx_personality_v0 () from /opt/gnatpro/lib/libstdc++.so.6
#5 0x00645cfe in _Unwind_RaiseException_Phase2 (exc=0x2a7b10, context=0xffd58434) at ../../../src/libgcc/../gcc/unwind.inc:67
#6 0x00646082 in _Unwind_RaiseException (exc=0x2a7b10) at ../../../src/libgcc/../gcc/unwind.inc:136
#7 0x0027628d in __cxa_throw () from /opt/gnatpro/lib/libstdc++.so.6
#8 0x00276e4f in operator new(unsigned int) () from /opt/gnatpro/lib/libstdc++.so.6
#9 0x08053737 in Receptor::receive (this=0x93c12d8, msj=...) at Receptor.cc:477
#10 0x08099666 in EventProcessor::run (this=0xffd75580) at EventProcessor.cc:437
#11 0x0809747d in SEventProcessor::run (this=0xffd75580) at SEventProcessor.cc:80
#12 0x08065564 in main (argc=1, argv=0xffd76734) at my_project.cc:20
On a Solaris platform I have another core:
$ pstack core.ultimo
core 'core.ultimo' of 9220: my_project_sun
----------------- lwp# 1 / thread# 1 --------------------
0006fa28 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Dend6kM_pk2_ (1010144, 1ce84, ffbd0df8, ffb7a18c, fffffff8, ffbedc7c) + 30
0005d580 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Esize6kM_I_ (1010144, 219, 1ce84, ffffffff, fffffff8, ffbedc7c) + 30
0005ab14 __1cTReceptorHreceive6MrnKMensaje__v_ (33e630, ffbede70, ffffffff, 33e634, 33e68c, 0) + 1d4
0015df78 __1cREventProcessorDrun6M_v_ (ffbede18, 33e630, dcc, 1, 33e730, 6e) + 350
00159a50 __1cWSEventProcessorDrun6M_v_ (da08000, 2302f7, 111de0c, 159980, ff1fa07c, cc) + 48
000b6acc main (1, ffbeef74, ffbeef7c, 250000, 0, 0) + 16c
00045e10 _start (0, 0, 0, 0, 0, 0) + 108
----------------- lwp# 2 / thread# 2 --------------------
...
The piece of code is:
...
msj2.tipo(UPDATE);
for(i = 0; i < distr.size(); ++i)
{
distr[i]->insert(new Mensaje(msj2)); **--> Receptor.cc:477**
}
...
This core happens randomly, sometimes the process is running for weeks.
The size of the core is 4291407872 B.
I am running valgrind to see if the heap is corrupted but by now I have not encountered problems as "Invalid read", "Invalid write" ...
Also, when I was running valgrind I have found twice the following message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have detected the lines of code but could these errors lead to the core? I think that I have seen these errors with valgrind before and they weren´t as important and the ones that say "Invalid read/write".
If you have any idea how to solve this problem, it would be highly appreciated.
The core size is the clue. The largest 32-bit unsigned number is 4,294,967,295. Your core is quite close to that indicating that the process is out of memory. The most likely cause is a memory leak.
See my recent article Memory Leaks in C/C++
Valgrind will find the issue for you on Linux. You have to start it with the --leak-check option for this. It will check for leaks when the process exits gracefully so you will need a way to shut the process down.
Dtrace with dbx on Solaris will also likely work.
Also, when I was running valgrind I have found twice the following
message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have detected the lines of code but could these errors lead to
the core?
Yes, that could result in a SIGSEGV, as it's quite likely undefined behavior. (I'm not going to say it's definitely undefined behavior without seeing the actual code - but it likely is.) It's not likely that doing that can cause a SIGSEGV, but then again the intermittent failure you're seeing doesn't happen all that often. So you do need to fix that problem.
In addition to valgrind, on Solaris you can also use libumem and watchmalloc to check for problems managing heap memory. See the man pages for umem_debug and watchmalloc to get started.
To use dbx on Solaris, you need to have Solaris Studio installed (it's free). Solaris Studio also offers a way to use the run-time memory checking of dbx without having to directly invoke the dbx debugger. See the man page for bcheck. The bcheck man page will be in the Solaris Studio installation directory tree, in the man directory.
And if it is a memory leak, you should be able to see the process address space growing over time.
We are facing C++ application crash issue due to segmentation fault on RED hat Linux. We are using embedded python in C++.
Please find below my limitation
Don’t I have access to production machine where application crashes. Client send us core dump files when application crashes.
Problem is not reproducible on our test machine which has exactly same configuration as production machine.
Sometime application crashes after 1 hour, 4 hour ….1 day or 1 week. We haven’t get time frame or any specific pattern in which application crashes.
Application is complex and embedded python code is used from lot of places from within application. We have done extensive code reviews but couldn’t find the fix by doing code review.
As per stack trace in core dump, it is crashing around multiplication operation, reviewed code for such operation in code we haven’t get any code where such operation is performed. Might be such operations are called through python scripts executed from embedded python on which we don’t have control or we can’t review it.
We can’t use any profiling tool on production environment like Valgrind.
We are using gdb on our local machine to analyze core dump. We can’t run gdb on production machine.
Please find below the efforts we have putted in.
We have analyzed logs and continuously fired request that coming towards our application on our test environment to reproduce the problem.
We are not getting crash point in logs. Every time we get different logs. I think this is due to; Memory is smashed somewhere else and application crashes after sometime.
We have checked load at any point on our application and it is never exceeded our application limit.
Memory utilization of our application is also normal.
We have profiled our application with help of Valgrind in our test machine and removed valgrind errors but application is still crashing.
I appreciate any help to guide us to proceed further to solve the problem.
Below is the version details
Red hat linux server 5.6 (Tikanga)
Python 2.6.2 GCC 4.1
Following is the stack trace I am getting from the core dump files they have shared (on my machine). FYI, We don’t have access to production machine to run gdb on core dump files.
0 0x00000033c6678630 in ?? ()
1 0x00002b59d0e9501e in PyString_FromFormatV (format=0x2b59d0f2ab00 "can't multiply sequence by non-int of type '%.200s'", vargs=0x46421f20) at Objects/stringobject.c:291
2 0x00002b59d0ef1620 in PyErr_Format (exception=0x2b59d1170bc0, format=<value optimized out>) at Python/errors.c:548
3 0x00002b59d0e4bf1c in PyNumber_Multiply (v=0x2aaaac080600, w=0x2b59d116a550) at Objects/abstract.c:1192
4 0x00002b59d0ede326 in PyEval_EvalFrameEx (f=0x732b670, throwflag=<value optimized out>) at Python/ceval.c:1119
5 0x00002b59d0ee2493 in call_function (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:3794
6 PyEval_EvalFrameEx (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:2389
7 0x00002b59d0ee2493 in call_function (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:3794
8 PyEval_EvalFrameEx (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:2389
9 0x00002b59d0ee2493 in call_function (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:3794
10 PyEval_EvalFrameEx (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:2389
11 0x00002b59d0ee2493 in call_function (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:3794
12 PyEval_EvalFrameEx (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:2389
13 0x00002b59d0ee2d9f in ?? () at Python/ceval.c:2968 from /usr/local/lib/libpython2.6.so.1.0
14 0x0000000000000007 in ?? ()
15 0x00002b59d0e83042 in lookdict_string (mp=<value optimized out>, key=0x46424dc0, hash=40722104) at Objects/dictobject.c:412
16 0x00002aaab09d5458 in ?? ()
17 0x00002aaab09d5458 in ?? ()
18 0x00002aaab02a91f0 in ?? ()
19 0x00002aaab0b2c3a0 in ?? ()
20 0x0000000000000004 in ?? ()
21 0x00000000026d5eb8 in ?? ()
22 0x00002aaab0b2c3a0 in ?? ()
23 0x00002aaab071e080 in ?? ()
24 0x0000000046422bf0 in ?? ()
25 0x0000000046424dc0 in ?? ()
26 0x00000000026d5eb8 in ?? ()
27 0x00002aaab0987710 in ?? ()
28 0x00002b59d0ee2de2 in PyEval_EvalFrame (f=0x0) at Python/ceval.c:538
29 0x0000000000000000 in ?? ()
You are almost certainly doing something bad with pointers in your C++ code, which can be very tough to debug.
Do not assume that the stack trace is relevant. It might be relevant, but pointer misuse can often lead to crashes some time later
Build with full warnings on. The compiler can point out some non-obvious pointer misuse, such as returning a reference to a local.
Investigate your arrays. Try replacing arrays with std::vector (C++03) or std::array (C++11) so you can iterate using begin() and end() and you can index using at().
Investigate your pointers. Replace them with std::unique_ptr(C++11) or boost::scoped_ptr wherever you can (there should be no overhead in release builds). Replace the rest with shared_ptr or weak_ptr. Any that can't be replaced are probably the source of problematic logic.
Because of the very problems you're seeing, modern C++ allows almost all raw pointer usage to be removed entirely. Try it.
First things first, compile both your binary and libpython with debug symbols and push it out. The stack trace will be much easier to follow.
The relevant argument to g++ is -g.
Suggestions:
As already suggested, provide a complete debug build
Provide a memory test tool and a CPU torture test
Load debug symbols of python library when analyzing the core dump
The stacktrace shows something concerning eval(), so I guess you do dynamic code generation and evaluation/execution. If so, within this code, or passed arguments, there might be the actual error. Assertions at any interface to the code and code dumps may help.
I have a strange problem that I can't solve. Please help!
The program is a multithreaded c++ application that runs on ARM Linux machine. Recently I began testing it for the long runs and sometimes it crashes after 1-2 days like so:
*** glibc detected ** /root/client/my_program: free(): invalid pointer: 0x002a9408 ***
When I open core dump I see that the main thread it seems has a corrupt stack: all I can see is infinite abort() calls.
GNU gdb (GDB) 7.3
...
This GDB was configured as "--host=i686 --target=arm-linux".
[New LWP 706]
[New LWP 700]
[New LWP 702]
[New LWP 703]
[New LWP 704]
[New LWP 705]
Core was generated by `/root/client/my_program'.
Program terminated with signal 6, Aborted.
#0 0x001c44d4 in raise ()
(gdb) bt
#0 0x001c44d4 in raise ()
#1 0x001c47e0 in abort ()
#2 0x001c47e0 in abort ()
#3 0x001c47e0 in abort ()
#4 0x001c47e0 in abort ()
#5 0x001c47e0 in abort ()
#6 0x001c47e0 in abort ()
#7 0x001c47e0 in abort ()
#8 0x001c47e0 in abort ()
#9 0x001c47e0 in abort ()
#10 0x001c47e0 in abort ()
#11 0x001c47e0 in abort ()
And it goes on and on. I tried to get to the bottom of it by moving up the stack: frame 3000 or even more, but eventually core dump runs out of frames and I still can't see why this has happened.
When I examine the other threads everything seems normal there.
(gdb) info threads
Id Target Id Frame
6 LWP 705 0x00132f04 in nanosleep ()
5 LWP 704 0x001e7a70 in select ()
4 LWP 703 0x00132f04 in nanosleep ()
3 LWP 702 0x00132318 in sem_wait ()
2 LWP 700 0x00132f04 in nanosleep ()
* 1 LWP 706 0x001c44d4 in raise ()
(gdb) thread 5
[Switching to thread 5 (LWP 704)]
#0 0x001e7a70 in select ()
(gdb) bt
#0 0x001e7a70 in select ()
#1 0x00057ad4 in CSerialPort::read (this=0xbea7d98c, string_buffer=..., delimiter=..., timeout_ms=1000) at CSerialPort.cpp:202
#2 0x00070de4 in CScanner::readResponse (this=0xbea7d4cc, resp_recv=..., timeout=1000, delim=...) at PidScanner.cpp:657
#3 0x00071198 in CScanner::sendExpect (this=0xbea7d4cc, cmd=..., exp_str=..., rcv_str=..., timeout=1000) at PidScanner.cpp:604
#4 0x00071d48 in CScanner::pollPid (this=0xbea7d4cc, mode=1, pid=12, pid_str=...) at PidScanner.cpp:525
#5 0x00072ce0 in CScanner::poll1 (this=0xbea7d4cc)
#6 0x00074c78 in CScanner::Poll (this=0xbea7d4cc)
#7 0x00089edc in CThread5::Thread5Poll (this=0xbea7d360)
#8 0x0008c140 in CThread5::run (this=0xbea7d360)
#9 0x00088698 in CThread::threadFunc (p=0xbea7d360)
#10 0x0012e6a0 in start_thread ()
#11 0x001e90e8 in clone ()
#12 0x001e90e8 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(Classes and functions names are a bit wierd because I changed them -:)
So, thread #1 is where the stack is corrupt, backtrace of every other (2-6) shows
Backtrace stopped: previous frame identical to this frame (corrupt stack?).
It happends because threads 2-6 are created in the thread #1.
The thing is that I can't run the program in gdb because it runs on an embedded system. I can't use remote gdb server. The only option is examining core dumps that occur not very often.
Could you please suggest something that could move me forward with this? (Maybe something else I can extract from the core dump or maybe somehow to make some hooks in the code to catch abort() call).
UPDATE: Basile Starynkevitch suggested to use Valgrind, but turns out it's ported only for ARMv7. I have ARM 926 which is ARMv5, so this won't work for me. There are some efforts to compile valgrind for ARMv5 though: Valgrind cross compilation for ARMv5tel, valgrind on the ARM9
UPDATE 2: Couldn't make Electric Fence work with my program. The program uses C++ and pthreads. The version of Efence I got, 2.1.13 crashed in a arbitrary place after I start a thread and try to do something more or less complicated (for example to put a value into an STL vector). I saw people mentioning some patches for Efence on the web but didn't have time to try them. I tried this on my Linux PC, not on the ARM, and other tools like valgrind or Dmalloc don't report any problems with the code. So, everyone using version 2.1.13 of efence be prepared to have problems with pthreads (or maybe pthread + C++ + STL, don't know).
My guess for the "infinite' aborts is that either abort() causes a loop (e.g. abort -> signal handler -> abort -> ...) or that gdb can't correctly interpret the frames on the stack.
In either case I would suggest manually checking out the stack of the problematic thread. If abort causes a loop, you should see a pattern or at least the return address of abort repeating every so often. Perhaps you can then more easily find the root of the problem by manually skipping large parts of the (repeating) stack.
Otherwise, you should find that there is no repeating pattern and hopefully the return address of the failing function somewhere on the stack. In the worst case such addresses are overwritten due to a buffer overflow or such, but perhaps then you can still get lucky and recognise what it is overwritten with.
One possibility here is that something in that thread has very, very badly smashed the stack by vastly overwriting an on-stack data structure, destroying all the needed data on the stack in the process. That makes postmortem debugging very unpleasant.
If you can reproduce the problem at will, the right thing to do is to run the thread under gdb and watch what is going on precisely at the moment when the the stack gets nuked. This may, in turn, require some sort of careful search to determine where exactly the error is happening.
If you cannot reproduce the problem at will, the best I can suggest is very carefully looking for clues in the thread local storage for that thread to see if it hints at where the thread was executing before death hit.