I have a very strange problem with my code at runtime when I'm running it in parallel with MPI:
*** glibc detected *** ./QuadTreeConstruction: munmap_chunk(): invalid pointer: 0x0000000001fbf180 ***
======= Backtrace: =========
/lib/libc.so.6(+0x776d6)[0x7f38763156d6]
./QuadTreeConstruction(_ZN9__gnu_cxx13new_allocatorImE10deallocateEPmm+0x20)[0x423f04]
./QuadTreeConstruction(_ZNSt13_Bvector_baseISaIbEE13_M_deallocateEv+0x50)[0x423e72]
./QuadTreeConstruction(_ZNSt13_Bvector_baseISaIbEED2Ev+0x1b)[0x423c79]
./QuadTreeConstruction(_ZNSt6vectorIbSaIbEED1Ev+0x18)[0x4237d2]
./QuadTreeConstruction(_Z22findLocalandGhostCellsRK8QuadTreeRK6ArrayVIiES5_iRS3_S6_+0x849)[0x41dbbd]
./QuadTreeConstruction(main+0xa32)[0x41ca49]
/lib/libc.so.6(__libc_start_main+0xfe)[0x7f38762bcd8e]
./QuadTreeConstruction[0x41b029]
My code is Valgrind-clean up to some errors that are internal to the MPI library (I use OpenMPI and I have seen them many times before, but they have never been an issue; see http://www.open-mpi.org/faq/?category=debugging#valgrind_clean). I do not have any problem when I am running in serial.
I have been able to track the problem down to a SIGABRT (raised via abort()) using gdb, and here is the stack when the code breaks:
0 raise raise.c 64 0x7f4bd8655ba5
1 abort abort.c 92 0x7f4bd86596b0
2 __libc_message libc_fatal.c 189 0x7f4bd868f65b
3 malloc_printerr malloc.c 6283 0x7f4bd86996d6
4 __gnu_cxx::new_allocator<unsigned long>::deallocate new_allocator.h 95 0x423f04
5 std::_Bvector_base<std::allocator<bool> >::_M_deallocate stl_bvector.h 444 0x423e72
6 std::_Bvector_base<std::allocator<bool> >::~_Bvector_base stl_bvector.h 430 0x423c79
7 std::vector<bool, std::allocator<bool> >::~vector stl_bvector.h 547 0x4237d2
8 findLocalandGhostCells mpi_partition.cpp 249 0x41dbbd
9 main mpi_partition.cpp 111 0x41ca49
This sounds like memory corruption, but I have absolutely no clue what is causing it. Basically, the code breaks inside a function that looks something like this:
void findLocalandGhostCells(){
    std::vector<bool> foo(fooSize, false);
    // do stuff with foo; nothing crazy -- I promise
    return;
}
Does anyone have any idea what I should do now? :(
If you are quite sure that the vector operation itself is correct and non-crazy, try tracing the members of your vector step by step. It's possible that some of your other operations corrupted the memory blocks belonging to the vector, for example a memcpy that invaded the vector's memory.
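For illustration, here is a minimal, intentionally broken sketch of that failure mode (the buffer and sizes are hypothetical, not taken from the question): a memcpy overruns a small local buffer and may trample the bookkeeping of a neighbouring std::vector, and the resulting abort only surfaces later, in the vector's destructor, much like the backtrace above.

#include <cstring>
#include <vector>

int main() {
    std::vector<bool> foo(1024, false);  // an innocent vector, like the one in the question

    char buffer[16];
    const char msg[] = "this string is far longer than sixteen bytes";

    // Undefined behaviour: writes past 'buffer' and may smash whatever lives
    // next to it (possibly foo's internal pointers).
    std::memcpy(buffer, msg, sizeof(msg));

    return 0;  // if foo's bookkeeping was hit, ~vector aborts here with an "invalid pointer"
}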
Related
Ours is a PowerPC-based embedded system running Linux. We are encountering a random SIGILL crash which is seen across a wide variety of applications. The root cause of the crash is the zeroing out of the instruction to be executed, which indicates corruption of the text segment residing in memory. As the text segment is loaded read-only, the applications cannot corrupt it themselves, so I am suspecting some common subsystem (DMA?) is causing this corruption. Since the problem takes days to reproduce (a crash due to SIGILL), it is getting difficult to investigate. So, to begin with, I want to be able to know if and when the text segment of any application has been corrupted.
I have looked at the stack trace, and all the pointers and registers are proper.
Do you have any suggestions on how I can go about this?
Some Info:
Linux 3.12.19-rt30 #1 SMP Fri Mar 11 01:31:24 IST 2016 ppc64 GNU/Linux
(gdb) bt
0 0x10457dc0 in xxx
Disassembly output:
=> 0x10457dc0 <+80>: mr r1,r11
0x10457dc4 <+84>: blr
Instruction expected at address 0x10457dc0: 0x7d615b78
Instruction found after catching SIGILL 0x10457dc0: 0x00000000
(gdb) maintenance info sections
0x10006c60->0x106cecac at 0x00006c60: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
Expected (from the application binary):
(gdb) x /32 0x10457da0
0x10457da0 : 0x913e0000 0x4bff4f5d 0x397f0020 0x800b0004
0x10457db0 : 0x83abfff4 0x83cbfff8 0x7c0803a6 0x83ebfffc
0x10457dc0 : 0x7d615b78 0x4e800020 0x7c7d1b78 0x7fc3f378
0x10457dd0 : 0x4bcd8be5 0x7fa3eb78 0x4857e109 0x9421fff0
Actual (after handling SIGILL and dumping nearby memory locations):
Faulting instruction address: 0x10457dc0
0x10457da0 : 0x913E0000
0x10457db0 : 0x83ABFFF4
=> 0x10457dc0 : 0x00000000
0x10457dd0 : 0x4BCD8BE5
0x10457de0 : 0x93E1000C
Edit:
One lead that we have is that the corruption always occurs at an offset that ends with 0xdc0.
For example:
Faulting instruction address: 0x10653dc0 << printed by our application after catching SIGILL
Faulting instruction address: 0x1000ddc0 << printed by our application after catching SIGILL
flash_erase[8557]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
nandwrite[8561]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
awk[16002]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
getStats[20670]: unhandled signal 4 at 0fecfdc0 nip 0fecfdc0 lr 0fecfdbc code 30001
expr[27923]: unhandled signal 4 at 0fe74dc0 nip 0fe74dc0 lr 0fe74dc0 code 30001
Edit 2: Another lead is that the corruption always occurs at physical frame number 0x00a4d. I suppose that with a PAGE_SIZE of 4096 this translates to a physical address of 0x00A4DDC0. We are suspecting a couple of our kernel drivers and investigating further. Is there any better idea (like putting a hardware watchpoint) that could be more efficient? How about KASAN, as suggested below?
Any help is appreciated. Thanks.
1.) The text segment is read-only, but the permissions could be changed by mprotect; you can check that if you think it is possible (a sketch of such a check follows this list).
2.) If it is a kernel problem:
Run kernel with KASAN and KUBSAN (undefined behaviour) sanitizers
Focus on drivers code not included in mainline
The hint here is that the corruption is a single zeroed word. Maybe I'm wrong, but that suggests DMA is not to blame; it looks like some kind of invalid store.
3.) Hardware. I think your problem looks like a hardware problem (a RAM issue).
You can try to decrease the RAM frequency in the bootloader.
Check whether the problem reproduces on stable mainline software; that is how you can prove whether this is the case.
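For point 1.), a minimal sketch of such a check on Linux (assuming a /proc filesystem; the function below is hypothetical, not taken from the reporter's system): scan /proc/self/maps for the mapping that contains a code address and print its permission flags, so an unexpected 'w' on the text segment would stand out. The same parsing works on /proc/<pid>/maps for another process.

#include <cstdio>
#include <fstream>
#include <string>

// Print the permissions of the /proc/self/maps entry containing 'addr'.
void report_mapping(const void* addr) {
    std::ifstream maps("/proc/self/maps");
    std::string line;
    const unsigned long target = (unsigned long)addr;
    while (std::getline(maps, line)) {
        unsigned long start = 0, end = 0;
        char perms[5] = {0};
        // Each line starts with "start-end perms ..."
        if (std::sscanf(line.c_str(), "%lx-%lx %4s", &start, &end, perms) == 3
            && target >= start && target < end) {
            std::printf("%p lies in %lx-%lx with permissions %s\n", addr, start, end, perms);
            return;
        }
    }
    std::printf("%p not found in /proc/self/maps\n", addr);
}

int main() {
    report_mapping((const void*)&report_mapping);  // an address inside the text segment
    return 0;
}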
I have a core dump on both Solaris and Linux platforms and I can't see the problem.
On a Linux platform, I have the following core:
(gdb) where
#0 0x001aa81b in do_lookup_x () from /lib/ld-linux.so.2
#1 0x001ab0da in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2 0x001afa05 in _dl_fixup () from /lib/ld-linux.so.2
#3 0x001b5c90 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x00275e4c in __gxx_personality_v0 () from /opt/gnatpro/lib/libstdc++.so.6
#5 0x00645cfe in _Unwind_RaiseException_Phase2 (exc=0x2a7b10, context=0xffd58434) at ../../../src/libgcc/../gcc/unwind.inc:67
#6 0x00646082 in _Unwind_RaiseException (exc=0x2a7b10) at ../../../src/libgcc/../gcc/unwind.inc:136
#7 0x0027628d in __cxa_throw () from /opt/gnatpro/lib/libstdc++.so.6
#8 0x00276e4f in operator new(unsigned int) () from /opt/gnatpro/lib/libstdc++.so.6
#9 0x08053737 in Receptor::receive (this=0x93c12d8, msj=...) at Receptor.cc:477
#10 0x08099666 in EventProcessor::run (this=0xffd75580) at EventProcessor.cc:437
#11 0x0809747d in SEventProcessor::run (this=0xffd75580) at SEventProcessor.cc:80
#12 0x08065564 in main (argc=1, argv=0xffd76734) at my_project.cc:20
On a Solaris platform I have another core:
$ pstack core.ultimo
core 'core.ultimo' of 9220: my_project_sun
----------------- lwp# 1 / thread# 1 --------------------
0006fa28 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Dend6kM_pk2_ (1010144, 1ce84, ffbd0df8, ffb7a18c, fffffff8, ffbedc7c) + 30
0005d580 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Esize6kM_I_ (1010144, 219, 1ce84, ffffffff, fffffff8, ffbedc7c) + 30
0005ab14 __1cTReceptorHreceive6MrnKMensaje__v_ (33e630, ffbede70, ffffffff, 33e634, 33e68c, 0) + 1d4
0015df78 __1cREventProcessorDrun6M_v_ (ffbede18, 33e630, dcc, 1, 33e730, 6e) + 350
00159a50 __1cWSEventProcessorDrun6M_v_ (da08000, 2302f7, 111de0c, 159980, ff1fa07c, cc) + 48
000b6acc main (1, ffbeef74, ffbeef7c, 250000, 0, 0) + 16c
00045e10 _start (0, 0, 0, 0, 0, 0) + 108
----------------- lwp# 2 / thread# 2 --------------------
...
The piece of code is:
...
msj2.tipo(UPDATE);
for(i = 0; i < distr.size(); ++i)
{
    distr[i]->insert(new Mensaje(msj2));   // --> Receptor.cc:477
}
...
This core happens randomly; sometimes the process runs for weeks before it does.
The size of the core is 4291407872 B.
I am running Valgrind to see if the heap is corrupted, but so far I have not encountered problems such as "Invalid read" or "Invalid write".
Also, when I was running Valgrind I found the following message twice:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
I have identified the lines of code responsible, but could these errors lead to the core? I think I have seen these errors with Valgrind before and they weren't as important as the ones that say "Invalid read/write".
If you have any idea how to solve this problem, it would be highly appreciated.
The core size is the clue. The largest 32-bit unsigned number is 4,294,967,295. Your core is quite close to that, indicating that the process ran out of memory. The most likely cause is a memory leak.
See my recent article Memory Leaks in C/C++
Valgrind will find the issue for you on Linux; you have to start it with the --leak-check option. It checks for leaks when the process exits gracefully, so you will need a way to shut the process down.
Dtrace with dbx on Solaris will also likely work.
Also, when I was running Valgrind I found the following message twice:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
I have identified the lines of code responsible, but could these errors lead to the core?
Yes, that could result in a SIGSEGV, as it's quite likely undefined behavior. (I'm not going to say it's definitely undefined behavior without seeing the actual code, but it likely is.) It's not likely that doing that will cause a SIGSEGV, but then again the intermittent failure you're seeing doesn't happen all that often either. So you do need to fix that problem.
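A minimal sketch of a fully initialised semctl call on Linux (the semaphore set here is hypothetical, not the asker's code); Valgrind's "points to uninitialised byte(s)" complaint typically means the union semun argument reached the kernel with uninitialised bytes:

#include <cstdio>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

// On Linux the caller must define this union itself (see semctl(2)).
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

int main() {
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1) { std::perror("semget"); return 1; }

    union semun arg = {};        // zero-initialise so no uninitialised bytes are passed
    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) == -1)
        std::perror("semctl SETVAL");

    semctl(semid, 0, IPC_RMID);  // remove the semaphore set again
    return 0;
}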
In addition to valgrind, on Solaris you can also use libumem and watchmalloc to check for problems managing heap memory. See the man pages for umem_debug and watchmalloc to get started.
To use dbx on Solaris, you need to have Solaris Studio installed (it's free). Solaris Studio also offers a way to use the run-time memory checking of dbx without having to directly invoke the dbx debugger. See the man page for bcheck. The bcheck man page will be in the Solaris Studio installation directory tree, in the man directory.
And if it is a memory leak, you should be able to see the process address space growing over time.
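On Linux, a crude sketch of watching that growth from inside the process is to sample /proc/self/statm periodically (its first field is the total program size in pages); on Solaris you would watch pmap or ps output instead. This is only an illustration, not something from the original answer:

#include <fstream>
#include <iostream>
#include <unistd.h>

int main() {
    for (int i = 0; i < 10; ++i) {               // sample a few times for the demo
        std::ifstream statm("/proc/self/statm");
        long vm_pages = 0;
        statm >> vm_pages;                        // first field: total program size in pages
        std::cout << "virtual size: " << vm_pages << " pages" << std::endl;
        sleep(5);                                 // a leaking process shows a steadily rising number
    }
    return 0;
}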
I have a problem debugging a Qt 5.4 application on Windows 7. It gives the following error in a message box (without debug mode, the application runs without any problem):
The inferior stopped because it triggered an exception.
Stopped in thread 18 by: Exception at 0x7718c42d, code: 0xe06d7363: C++ exception, flags=0x1 (execution cannot be continued) (first chance).
call stack:
0 RaiseException KERNELBASE 0x7718c42d
1 DllGetClassObject vsfilter 0x3a6aca71
2 DirectVobSub vsfilter 0x3a64fa11
3 vsfilter 0x3a632c0b
4 DllGetClassObject vsfilter 0x3a6c90bc
5 vsfilter 0x3a636da0
6 CSTInnerUnknown::AddRef sync.cxx 225 0x753deddf
7 DllGetClassObject vsfilter 0x3a6aab2f
8 CRWLock::AcquireReaderLock rwlock_ole32.cxx 3099 0x753da471
9 DllGetClassObject vsfilter 0x3a6aab2f
10 DllGetClassObject vsfilter 0x3a6ce866
11 DllGetClassObject vsfilter 0x3a6cbd4b
12 DispatchMessageWorker USER32 0x759b77c4
13 wcscpy_s USER32 0x75a1a61d
14 _RtlUserThreadStart ntdll 0x77919f45
It might be a little bit late, but I had the same error.
Since I don't see your code, I can't say exactly what your problem was, but I can say what solved mine:
I had an OpenCV Gaussian filter running which was removing noise from my picture. It had a size of 3x3. As soon as I adjusted the size to 10x10 for testing, I got the same exception.
So maybe you also have a too-big filter or something. To say exactly, I would need more example code.
But anyway, maybe this helps someone else!
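The answer above doesn't show its code, so here is a hedged sketch of the kind of mistake it describes, assuming the filter was cv::GaussianBlur, whose kernel width and height must be positive and odd: the 3x3 call works, while the 10x10 call makes OpenCV throw a cv::Exception much like the one reported.

#include <opencv2/imgproc.hpp>
#include <iostream>

int main() {
    cv::Mat src(100, 100, CV_8UC1, cv::Scalar(128));
    cv::Mat dst;

    cv::GaussianBlur(src, dst, cv::Size(3, 3), 0);       // fine: odd kernel dimensions

    try {
        cv::GaussianBlur(src, dst, cv::Size(10, 10), 0); // even dimensions: throws
    } catch (const cv::Exception& e) {
        std::cerr << "OpenCV exception: " << e.what() << std::endl;
    }
    return 0;
}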
We are facing a C++ application crash caused by a segmentation fault on Red Hat Linux. We are using embedded Python in C++.
Please find below my limitations:
We don't have access to the production machine where the application crashes. The client sends us core dump files when the application crashes.
The problem is not reproducible on our test machine, which has exactly the same configuration as the production machine.
Sometimes the application crashes after 1 hour, 4 hours … 1 day or 1 week. We haven't found a time frame or any specific pattern in which the application crashes.
The application is complex and embedded Python code is used in a lot of places within the application. We have done extensive code reviews but couldn't find the fix through code review.
According to the stack trace in the core dump, it is crashing around a multiplication operation. We reviewed the code for such operations but haven't found any place where one is performed. It may be that such operations are called through Python scripts executed by the embedded interpreter, over which we have no control and which we cannot review.
We can't use any profiling tool, such as Valgrind, in the production environment.
We are using gdb on our local machine to analyze the core dump. We can't run gdb on the production machine.
Please find below the efforts we have put in.
We have analyzed the logs and continuously replayed the requests coming towards our application in our test environment to reproduce the problem.
We are not getting the crash point in the logs; every time we get different logs. I think this is because memory is smashed somewhere else and the application crashes some time later.
We have checked the load on our application at every point and it has never exceeded our application's limit.
The memory utilization of our application is also normal.
We have profiled our application with the help of Valgrind on our test machine and removed the Valgrind errors, but the application is still crashing.
I would appreciate any help to guide us on how to proceed further to solve the problem.
Below are the version details:
Red Hat Linux Server 5.6 (Tikanga)
Python 2.6.2, GCC 4.1
Following is the stack trace I am getting from the core dump files they have shared (on my machine). FYI, we don't have access to the production machine to run gdb on the core dump files.
0 0x00000033c6678630 in ?? ()
1 0x00002b59d0e9501e in PyString_FromFormatV (format=0x2b59d0f2ab00 "can't multiply sequence by non-int of type '%.200s'", vargs=0x46421f20) at Objects/stringobject.c:291
2 0x00002b59d0ef1620 in PyErr_Format (exception=0x2b59d1170bc0, format=<value optimized out>) at Python/errors.c:548
3 0x00002b59d0e4bf1c in PyNumber_Multiply (v=0x2aaaac080600, w=0x2b59d116a550) at Objects/abstract.c:1192
4 0x00002b59d0ede326 in PyEval_EvalFrameEx (f=0x732b670, throwflag=<value optimized out>) at Python/ceval.c:1119
5 0x00002b59d0ee2493 in call_function (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:3794
6 PyEval_EvalFrameEx (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:2389
7 0x00002b59d0ee2493 in call_function (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:3794
8 PyEval_EvalFrameEx (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:2389
9 0x00002b59d0ee2493 in call_function (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:3794
10 PyEval_EvalFrameEx (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:2389
11 0x00002b59d0ee2493 in call_function (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:3794
12 PyEval_EvalFrameEx (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:2389
13 0x00002b59d0ee2d9f in ?? () at Python/ceval.c:2968 from /usr/local/lib/libpython2.6.so.1.0
14 0x0000000000000007 in ?? ()
15 0x00002b59d0e83042 in lookdict_string (mp=<value optimized out>, key=0x46424dc0, hash=40722104) at Objects/dictobject.c:412
16 0x00002aaab09d5458 in ?? ()
17 0x00002aaab09d5458 in ?? ()
18 0x00002aaab02a91f0 in ?? ()
19 0x00002aaab0b2c3a0 in ?? ()
20 0x0000000000000004 in ?? ()
21 0x00000000026d5eb8 in ?? ()
22 0x00002aaab0b2c3a0 in ?? ()
23 0x00002aaab071e080 in ?? ()
24 0x0000000046422bf0 in ?? ()
25 0x0000000046424dc0 in ?? ()
26 0x00000000026d5eb8 in ?? ()
27 0x00002aaab0987710 in ?? ()
28 0x00002b59d0ee2de2 in PyEval_EvalFrame (f=0x0) at Python/ceval.c:538
29 0x0000000000000000 in ?? ()
You are almost certainly doing something bad with pointers in your C++ code, which can be very tough to debug.
Do not assume that the stack trace is relevant. It might be relevant, but pointer misuse can often lead to crashes some time later.
Build with full warnings on. The compiler can point out some non-obvious pointer misuse, such as returning a reference to a local.
Investigate your arrays. Try replacing arrays with std::vector (C++03) or std::array (C++11) so you can iterate using begin() and end() and you can index using at().
Investigate your pointers. Replace them with std::unique_ptr (C++11) or boost::scoped_ptr wherever you can (there should be no overhead in release builds). Replace the rest with shared_ptr or weak_ptr. Any that can't be replaced are probably the source of the problematic logic.
Because of the very problems you're seeing, modern C++ allows almost all raw pointer usage to be removed entirely. Try it.
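To make the two "investigate" points concrete, here is a small hypothetical sketch (none of these names come from the asker's code): at() turns a silent out-of-bounds write into a catchable std::out_of_range, and std::unique_ptr removes manual delete and the chance of a double free.

#include <iostream>
#include <memory>
#include <stdexcept>
#include <vector>

struct Widget { int value; };   // hypothetical type standing in for the real classes

int main() {
    // Raw array -> std::vector: at() checks the index instead of corrupting memory.
    std::vector<int> data(4, 0);
    try {
        data.at(7) = 42;        // data[7] on a raw array would silently smash memory
    } catch (const std::out_of_range& e) {
        std::cerr << "caught: " << e.what() << std::endl;
    }

    // Raw owning pointer -> std::unique_ptr: released automatically, no double delete.
    std::unique_ptr<Widget> w(new Widget());
    w->value = 1;
    std::cout << "widget value: " << w->value << std::endl;

    return 0;   // w is freed here without any explicit delete
}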
First things first, compile both your binary and libpython with debug symbols and push it out. The stack trace will be much easier to follow.
The relevant argument to g++ is -g.
Suggestions:
As already suggested, provide a complete debug build
Provide a memory test tool and a CPU torture test
Load debug symbols of python library when analyzing the core dump
The stack trace shows something involving eval(), so I guess you do dynamic code generation and evaluation/execution. If so, the actual error might be within that code or in the arguments passed to it. Assertions at every interface to that code, plus code dumps, may help; a sketch of such checks follows.
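A minimal sketch of such checks at the C++/Python boundary (the module and function names here are hypothetical, not from the application): every C-API call that can fail is verified and the Python error is reported, instead of letting a NULL PyObject* or a pending error propagate into later C++ code.

#include <Python.h>
#include <cstdio>

int main() {
    Py_Initialize();

    PyObject* module = PyImport_ImportModule("mymodule");        // hypothetical module name
    if (!module) { PyErr_Print(); Py_Finalize(); return 1; }

    PyObject* func = PyObject_GetAttrString(module, "compute");  // hypothetical function name
    if (!func || !PyCallable_Check(func)) {
        if (PyErr_Occurred()) PyErr_Print();
        Py_XDECREF(func);
        Py_DECREF(module);
        Py_Finalize();
        return 1;
    }

    PyObject* result = PyObject_CallFunction(func, (char*)"i", 42);
    if (!result) {
        PyErr_Print();   // e.g. "can't multiply sequence by non-int of type ..."
    } else {
        Py_DECREF(result);
    }

    Py_DECREF(func);
    Py_DECREF(module);
    Py_Finalize();
    return 0;
}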
I have noticed pretty bizarre behaviour from gdb when debugging a straight block of code.
I ran gdb normally with the following commands:
gdb ./exe
break main
run
next
then [enter] a few times.
What I got as a result was:
35 world.generations(generations);
(gdb)
36 world.popSize(100);
(gdb)
37 world.eliteSize(5);
(gdb)
41 world.setEvaluationFnc( eval );
(gdb)
37 world.eliteSize(5);
(gdb)
39 world.pXOver(0.9);
(gdb)
38 world.pMut(0.9);
(gdb)
41 world.setEvaluationFnc( eval );
(gdb)
There is absolutely no reason to run over those lines twice. I do not understand this behaviour. The code looks as follows:
(gdb) list 39
34 SimpleGA<MySpecimen> world;
35 world.generations(generations);
36 world.popSize(100);
37 world.eliteSize(5);
38 world.pMut(0.9);
39 world.pXOver(0.9);
40
41 world.setEvaluationFnc( eval );
42
43 world.setErrorSink(stderrSink);
I am not sure whether I should disregard this or whether there is something wicked going on in my code. The app uses OpenMP and is compiled to use it; however, info threads says there is only one thread running. Also, everything seems to give proper results, and even if those lines were executed twice there should be no problem, as they are mostly plain setters.
Has anyone seen something like this, or does anyone have any hints on where to investigate? I failed on my own =).
Thanks for hints,
luk32.
Most likely the compiler is rearranging the code. I suppose the "new" order still works correctly?
If possible, try to debug with optimizations turned off; that increases the likelihood of the executable staying closer to the source code.