Segmentation fault in __pthread_getspecific called from libcuda.so.1 - c++

Problem: Segmentation fault (SIGSEGV, signal 11)
Brief program description:
- high-performance GPU (CUDA) server handling requests from remote clients
- each incoming request spawns a thread that performs calculations on multiple GPUs (serially, not in parallel) and sends a result back to the client; this usually takes anywhere between 10-200 ms, as each request consists of tens or hundreds of kernel calls
- request handler threads have exclusive access to the GPUs, meaning that if one thread is running something on GPU1, all others have to wait until it is done
- compiled with -arch=sm_35 -code=compute_35
- using CUDA 5.0
- I'm not using any CUDA atomics explicitly or any in-kernel synchronization barriers, though I'm using Thrust (various functions) and cudaDeviceSynchronize(), obviously
- Nvidia driver: NVIDIA dlloader X Driver 313.30 Wed Mar 27 15:33:21 PDT 2013
OS and HW info:
- Linux lub1 3.5.0-23-generic #35~precise1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
- GPUs: 4x GeForce GTX TITAN
- 32 GB RAM
- MB: ASUS MAXIMUS V EXTREME
- CPU: i7-3770K
Crash information:
Crash occurs "randomly" after a couple of thousand requests are handled (sometimes sooner, sometimes later). Stack traces from some of the crashes look like this:
#0 0x00007f8a5b18fd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007f8a5a0c0cf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f8a59ff7b30 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f8a59fcc34a in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f8a5ab253e7 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5 0x00007f8a5ab484fa in cudaGetDevice () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6 0x000000000046c2a6 in thrust::detail::backend::cuda::arch::device_properties() ()
#0 0x00007ff03ba35d91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007ff03a966cf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007ff03aa24f8b in ?? () from /usr/lib/libcuda.so.1
#3 0x00007ff03b3e411c in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#4 0x00007ff03b3dd4b3 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5 0x00007ff03b3d18e0 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6 0x00007ff03b3fc4d9 in cudaMemset () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7 0x0000000000448177 in libgbase::cudaGenericDatabase::cudaCountIndividual(unsigned int, ...
#0 0x00007f01db6d6153 in ?? () from /usr/lib/libcuda.so.1
#1 0x00007f01db6db7e4 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f01db6dbc30 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f01db6dbec2 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f01db6c6c58 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007f01db6c7b49 in ?? () from /usr/lib/libcuda.so.1
#6 0x00007f01db6bdc22 in ?? () from /usr/lib/libcuda.so.1
#7 0x00007f01db5f0df7 in ?? () from /usr/lib/libcuda.so.1
#8 0x00007f01db5f4e0d in ?? () from /usr/lib/libcuda.so.1
#9 0x00007f01db5dbcea in ?? () from /usr/lib/libcuda.so.1
#10 0x00007f01dc11e0aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#11 0x00007f01dc1466dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#12 0x0000000000472373 in thrust::detail::backend::cuda::detail::b40c_thrust::BaseRadixSortingEnactor
#0 0x00007f397533dd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007f397426ecf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f397427baec in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f39741a9840 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f39741add08 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007f3974194cea in ?? () from /usr/lib/libcuda.so.1
#6 0x00007f3974cd70aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7 0x00007f3974cff6dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#8 0x000000000046bf26 in thrust::detail::backend::cuda::detail::checked_cudaMemcpy(void*
As you can see, it usually ends up in __pthread_getspecific, called from libcuda.so, or somewhere in the library itself. As far as I remember, there has been just one case where it did not crash but instead hung in a strange way: the program was able to respond to my requests if they did not involve any GPU computation (statistics etc.), but otherwise I never got a reply. Also, nvidia-smi -L did not work; it just hung there until I rebooted the computer. It looked to me like some sort of GPU deadlock. This might be a completely different issue than this one, though.
Does anyone have a clue where the problem might be or what could cause this?
Updates:
Some additional analysis:
cuda-memcheck does not print any error messages.
valgrind's leak check does print quite a few messages like those below (there are hundreds like them):
==2464== 16 bytes in 1 blocks are definitely lost in loss record 6 of 725
==2464== at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464== by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464== by 0x56B859D: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464== by 0x5050C82: __nptl_deallocate_tsd (pthread_create.c:156)
==2464== by 0x5050EA7: start_thread (pthread_create.c:315)
==2464== by 0x6DDBCBC: clone (clone.S:112)
==2464==
==2464== 16 bytes in 1 blocks are definitely lost in loss record 7 of 725
==2464== at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464== by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464== by 0x56B86D8: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464== by 0x5677E0F: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464== by 0x400F90D: _dl_fini (dl-fini.c:254)
==2464== by 0x6D23900: __run_exit_handlers (exit.c:78)
==2464== by 0x6D23984: exit (exit.c:100)
==2464== by 0x6D09773: (below main) (libc-start.c:258)
==2464== 408 bytes in 3 blocks are possibly lost in loss record 222 of 725
==2464== at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464== by 0x5A89B98: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x5A8A1F2: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x5A8A3FF: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x5B02E34: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x5AFFAA5: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x5AAF009: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x5A7A6D3: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x59B205C: ??? (in /usr/lib/libcuda.so.313.30)
==2464== by 0x5984544: cuInit (in /usr/lib/libcuda.so.313.30)
==2464== by 0x568983B: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464== by 0x5689967: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
More information:
I have tried running on fewer cards (3, as that is the minimum the program needs) and the crash still occurs.
The above is not true: I had misconfigured the application and it used all four cards. Re-running the experiments with really just 3 cards seems to resolve the problem; it has now been running for several hours under heavy load without crashes. I will let it run a bit longer and then try a different subset of 3 cards, both to verify this and to test whether the problem is related to one particular card.
I monitored GPU temperature during the test runs and there does not seem to be anything wrong. The cards reach about 78-80 °C under highest load, with the fans at about 56%, and this holds until the crash happens (after several minutes); that does not seem too high to me.
One thing I have been thinking about is the way the requests are handled: there are quite a lot of cudaSetDevice calls, since each request spawns a new thread (I'm using the mongoose library) and this thread then switches between cards by calling cudaSetDevice(id) with the appropriate device id. The switching can happen multiple times during one request, and I am not using any streams (so it all goes to the default (0) stream, IIRC). Can this somehow be related to the crashes occurring in pthread_getspecific?
I have also tried upgrading to the latest driver (beta, 319.12), but that didn't help.

If you can identify 3 cards that work, try cycling the 4th card in place of one of the 3, and see if you get the failures again. This is just standard troubleshooting I think. If you can identify a single card that, when included in a group of 3, still elicits the issue, then that card is suspect.
But, my suggestion to run with fewer cards was also based on the idea that it may reduce the overall load on the PSU. Even at 1500W, you may not have enough juice. So if you cycle the 4th card in, in place of one of the 3 (i.e. still keep only 3 cards in the system or configure your app to use 3) and you get no failures, the problem may be due to overall power draw with 4 cards.
Note that the power consumption of the GTX Titan at full load can be on the order of 250W or possibly more. So it might seem that your 1500W PSU should be fine, but it may come down to a careful analysis of how much DC power is available on each rail, and how the motherboard and PSU harness is distributing the 12V DC rails to each GPU.
So if reducing to 3GPUs seems to fix the problem no matter which 3 you use, my guess is that your PSU is not up to the task. Not all 1500W is available from a single DC rail. The 12V "rail" is actually composed of several different 12V rails, each of which delivers a certain portion of the overall 1500W. So even though you may not be pulling 1500W, you can still overload a single rail, depending on how the GPU power is connected to the rails.
I agree that temperatures in the 80C range should be fine, but that indicates (approximately) a fully loaded GPU, so if you're seeing that on all 4 GPUs at once, then you are pulling a heavy load.

Related

Core dump in zmq library in a multi-threaded application with optimized binary

This core dump in the zmq library happened in the field (not reproducible yet) with an optimized binary.
#0 0x00007f44a00801f7 in raise () from /lib64/libc.so.6
#1 0x00007f44a00818e8 in abort () from /lib64/libc.so.6
#2 0x00007f44a1f74759 in zmq::zmq_abort(char const*) () from /lib64/libzmq.so.5
#3 0x00007f44a1fa410d in zmq::tcp_write(int, void const*, unsigned long) () from /lib64/libzmq.so.5
#4 0x00007f44a1f9f417 in zmq::stream_engine_t::out_event() () from /lib64/libzmq.so.5
#5 0x00007f44a1f7437a in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#6 0x00007f44a1fa83a6 in thread_routine () from /lib64/libzmq.so.5
#7 0x00007f44a1b2ce25 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f44a014334d in clone () from /lib64/libc.so.6
While I am analyzing my application code, hoping to find some misuse of zmq (probably use of the same zmq socket by 2 different threads, or some other memory corruption), I would like to know what else I can get from this core dump.
For a start, I can see a total of 102 threads running at dump time. Many of them are in epoll_wait:
#0 0x00007f44a0143923 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f44a1f74309 in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#2 0x00007f44a1fa83a6 in thread_routine () from /lib64/libzmq.so.5
#3 0x00007f44a1b2ce25 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f44a014334d in clone () from /lib64/libc.so.6
The other threads, pointing at application code, do not look suspicious yet.
The errno printed is 14 = EFAULT (Bad address).
Can I get anything from the disassembly? I have not done much debugging at the disassembly level in the past, but in this situation, if it can give any clue, I can jump in.
Any (other) advice/pointers would also be highly appreciated.
Thanks.

Unexpected behavior of custom hardware design based on i.MX6 & MT41K256M16TW-107 IT:P

I'm new to custom hardware design and I'm going to scale up my custom hardware, which is functioning well on a few boards. I need some help with deciding how to proceed from prototypes to production given the state of the prototypes.
The hardware is based on the i.MX6Q processor and MT41K256M16TW-107 IT:P memory. It is most similar to the nitrogen6_max development board.
I'm having trouble with my hardware that is really difficult to figure out, as some boards work really well and some do not (of 7 production units, 4 boards function really well; one board gets segmentation faults and kernel panics while running a Linux application). When I do memory calibration on the bad boards, the results look just like those of the good boards.
The segmentation fault points to some memory issue. I captured a core dump and backtraced it with Linux GDB:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 gcoHARDWARE_QuerySamplerBase (Hardware=0x22193dc, Hardware#entry=0x0,
VertexCount=0x7ef95370, VertexCount#entry=0x7ef95368, VertexBase=0x40000,
FragmentCount=FragmentCount#entry=0x2217814, FragmentBase=0x0) at
gc_hal_user_hardware_query.c:6020
6020 gc_hal_user_hardware_query.c: No such file or directory.
[Current thread is 1 (Thread 0x76feb010 (LWP 697))]
(gdb) bt
#0 gcoHARDWARE_QuerySamplerBase (Hardware=0x22193dc, Hardware#entry=0x0,
VertexCount=0x7ef95370, VertexCount#entry=0x7ef95368, VertexBase=0x40000,
FragmentCount=FragmentCount#entry=0x2217814, FragmentBase=0x0) at
gc_hal_user_hardware_query.c:6020
#1 0x765d20e8 in gcoHAL_QuerySamplerBase (Hal=<optimized out>,
VertexCount=VertexCount#entry=0x7ef95368, VertexBase=<optimized out>,
FragmentCount=FragmentCount#entry=0x2217814,
FragmentBase=0x0) at gc_hal_user_query.c:692
#2 0x681e31ec in gcChipRecompileEvaluateKeyStates (chipCtx=0x0,
gc=0x7ef95380) at src/chip/gc_chip_state.c:2115
#3 gcChipValidateRecompileState (gc=0x7ef95380, gc#entry=0x21bd96c,
chipCtx=0x0, chipCtx#entry=0x2217814) at src/chip/gc_chip_state.c:2634
#4 0x681c6da8 in __glChipDrawValidateState (gc=0x21bd96c) at
src/chip/gc_chip_draw.c:5217
#5 0x68195688 in __glDrawValidateState (gc=0x21bd96c) at
src/glcore/gc_es_draw.c:585
#6 __glDrawPrimitive (gc=0x21bd96c, mode=<optimized out>) at
src/glcore/gc_es_draw.c:943
#7 0x68171048 in glDrawArrays (mode=4, first=6, count=6) at
src/glcore/gc_es_api.c:399
#8 0x76c9ac72 in CEGUI::OpenGL3GeometryBuffer::draw() const () from
/usr/lib/libCEGUIOpenGLRenderer-0.so.2
#9 0x76dd1aee in CEGUI::RenderQueue::draw() const () from
/usr/lib/libCEGUIBase-0.so.2
#10 0x76e317d8 in CEGUI::RenderingSurface::draw(CEGUI::RenderQueue const&,
CEGUI::RenderQueueEventArgs&) () from /usr/lib/libCEGUIBase-0.so.2
#11 0x76e31838 in CEGUI::RenderingSurface::drawContent() () from
/usr/lib/libCEGUIBase-0.so.2
#12 0x76e36d30 in CEGUI::GUIContext::drawContent() () from
/usr/lib/libCEGUIBase-0.so.2
#13 0x76e31710 in CEGUI::RenderingSurface::draw() () from
/usr/lib/libCEGUIBase-0.so.2
#14 0x001bf79c in tengri::gui::cegui::System::Impl::draw (this=0x2374f08) at
codebase/src/gui/cegui/system.cpp:107
#15 tengri::gui::cegui::System::draw (this=this#entry=0x2374e74) at
codebase/src/gui/cegui/system.cpp:212
#16 0x000b151e in falcon::osd::view::MainWindowBase::Impl::preNativeUpdate
(this=0x2374e10) at codebase/src/osd/view/MainWindow.cpp:51
#17 falcon::osd::view::MainWindowBase::preNativeUpdate
(this=this#entry=0x209fe30) at codebase/src/osd/view/MainWindow.cpp:91
#18 0x000c4686 in falcon::osd::view::FBMainWindow::update (this=0x209fe00)
at
codebase/include/falcon/osd/view/FBMainWindow.h:56
#19 falcon::osd::view::App::Impl::execute (this=0x209fdb0) at
codebase/src/osd/view/app_view_osd_falcon.cpp:139
#20 falcon::osd::view::App::execute (this=<optimized out>) at
codebase/src/osd/view/app_view_osd_falcon.cpp:176
#21 0x000475f6 in falcon::osd::App::execute (this=this#entry=0x7ef95c84) at
codebase/src/osd/app_osd_falcon.cpp:75
#22 0x00047598 in main () at codebase/src/main.cpp:5
(gdb) Quit
Here I have attached NXP tool calibration results for 2 good boards and 1 bad board (the one getting segmentation faults). Click on the following links.
Board 1
Board 2
Board 3
I did a stress test using stressapptest; it ran overnight, produced no faults, and passed.
Of the above 3 boards, Board 1 and Board 2 work really well, while Board 3 gets kernel panics running the same application. Can you help me figure out any clue from the results for these 3 boards?
I did a production run of 50 units 6 months ago and only 30 worked properly, but that was with Alliance Memory AS4C256M16D3A-12BCN. So could this be an issue with the design? If it is an issue with the DDR layout or the whole design, why do some boards work really well?
Could this be a manufacturing issue? Then how could it happen within the same production run, with some boards working and some not?
Does stressapptest stress the power supply as well? Do you know any Linux app that also stresses power?
I don't have much experience with mass production, but I would like to move forward after learning from and correcting these issues. I would be thankful for a prompt reply.

Calling free() in C++ triggers ntdll!DbgBreakPoint() in debug but crashes in release

I have a single threaded program that crashes consistently at certain points right after free() is called when running in non-debug mode.
When in debug mode, however, the debugger breaks on the line that calls free() even though no breakpoints are set. When I try to step to the next line, the debugger breaks again on the same line. Stepping once more resumes execution as normal. No crash, no segfault, nothing.
EDIT-1: Contrary to what I wrote above, the crashes in non-debug mode
turn out to be inconsistent, which makes me think I am somehow
writing somewhere I shouldn't. (The breaks in debug mode are
still consistent, though.)
The call stack at the breaks shows some Windows library functions (I think) called after the function containing the free() statement. I have no idea how to interpret them, and consequently no idea how to go about debugging this situation.
I have provided the call stacks at the breakpoints below. Can someone point me in a direction to tackle the problem? What might be causing the breaks in the debugger?
The program runs on Windows Vista, compiled with gcc 4.9.2; the debugger is gdb. Assume a double free is not the case (I use ::operator new and ::operator delete overloads that catch that; the situation described is the same without these overloads as well).
Note that the crash (or the involuntary breaks in the debugger) is consistent: it happens every time, at the same execution point.
Here is the call stack at the initial break:
(Note that free_wrapper() is the function containing the free() call that causes the crash/breaks.)
#0 0x770186ff ntdll!DbgBreakPoint() (C:\Windows\system32\ntdll.dll:??)
#1 0x77082edb ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#2 0x7706b953 ntdll!RtlImageRvaToVa() (C:\Windows\system32\ntdll.dll:??)
#3 0x77052c4f ntdll!RtlQueryRegistryValues() (C:\Windows\system32\ntdll.dll:??)
#4 0x77083f3b ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#5 0x7704bcfd ntdll!EtwSendNotification() (C:\Windows\system32\ntdll.dll:??)
#6 0x770374d5 ntdll!RtlEnumerateGenericTableWithoutSplaying() (C:\Windows\system32\ntdll.dll:??)
#7 0x75829dc6 KERNEL32!HeapFree() (C:\Windows\system32\kernel32.dll:??)
#8 0x75a99c03 msvcrt!free() (C:\Windows\system32\msvcrt.dll:??)
#9 0x350000 ?? () (??:??)
--> #10 0x534020 free_wrapper(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\Unrelated\MemMgmt.cpp:282)
#11 0x407f74 operator delete(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1002)
#12 0x629a74 __gnu_cxx::new_allocator<char>::deallocate(this=0x22f718, __p=0x352af0 "\nÿÿÿÿÿÿº\r%") (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/ext/new_allocator.h:110)
#13 0x6c2257 std::allocator_traits<std::allocator<char> >::deallocate(__a=..., __p=0x352af0 "\nÿÿÿÿÿÿº\r%", __n=50) (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/bits/alloc_traits.h:383)
#14 0x611940 basic_CDataUnit<std::allocator<char> >::~basic_CDataUnit(this=0x22f714, __vtt_parm=0x781df4 <VTT for basic_CDataUnit_TDB<std::allocator<char> >+4>, __in_chrg=<optimized out>) (include/DataUnit/CDataUnit.h:112)
#15 0x61dfa1 basic_CDataUnit_TDB<std::allocator<char> >::~basic_CDataUnit_TDB(this=0x22f714, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) (include/DataUnit/CDataUnit_TDB.h:125)
#16 0x503898 CTblSegHandle::UpdateChainedRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:912)
#17 0x502fcc CTblSegHandle::UpdateRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:764)
#18 0x443272 UpdateRow(row_addr=..., new_data_unit=..., vColTypes=..., block_hnd=..., seg_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:910)
#19 0x443470 UpdateRow(row_addr=..., vColValues=..., vColTypes=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:935)
#20 0x4023e3 test_RowChaining() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:234)
#21 0x4081c6 main() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1034)
And here is the call stack when I step to the next line and debugger breaks one last time before resuming normal execution:
#0 0x770186ff ntdll!DbgBreakPoint() (C:\Windows\system32\ntdll.dll:??)
#1 0x77082edb ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#2 0x77052c7f ntdll!RtlQueryRegistryValues() (C:\Windows\system32\ntdll.dll:??)
#3 0x77083f3b ntdll!RtlpNtMakeTemporaryKey() (C:\Windows\system32\ntdll.dll:??)
#4 0x7704bcfd ntdll!EtwSendNotification() (C:\Windows\system32\ntdll.dll:??)
#5 0x770374d5 ntdll!RtlEnumerateGenericTableWithoutSplaying() (C:\Windows\system32\ntdll.dll:??)
#6 0x75829dc6 KERNEL32!HeapFree() (C:\Windows\system32\kernel32.dll:??)
#7 0x75a99c03 msvcrt!free() (C:\Windows\system32\msvcrt.dll:??)
#8 0x350000 ?? () (??:??)
--> #9 0x534020 free_wrapper(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\Unrelated\MemMgmt.cpp:282)
#10 0x407f74 operator delete(pv=0x352af0) (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1002)
#11 0x629a74 __gnu_cxx::new_allocator<char>::deallocate(this=0x22f718, __p=0x352af0 "\nÿÿÿÿÿÿº\r%") (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/ext/new_allocator.h:110)
#12 0x6c2257 std::allocator_traits<std::allocator<char> >::deallocate(__a=..., __p=0x352af0 "\nÿÿÿÿÿÿº\r%", __n=50) (C:/Program Files/CodeBlocks/MinGW/lib/gcc/mingw32/4.9.2/include/c++/bits/alloc_traits.h:383)
#13 0x611940 basic_CDataUnit<std::allocator<char> >::~basic_CDataUnit(this=0x22f714, __vtt_parm=0x781df4 <VTT for basic_CDataUnit_TDB<std::allocator<char> >+4>, __in_chrg=<optimized out>) (include/DataUnit/CDataUnit.h:112)
#14 0x61dfa1 basic_CDataUnit_TDB<std::allocator<char> >::~basic_CDataUnit_TDB(this=0x22f714, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) (include/DataUnit/CDataUnit_TDB.h:125)
#15 0x503898 CTblSegHandle::UpdateChainedRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:912)
#16 0x502fcc CTblSegHandle::UpdateRowData(this=0x353cf8, new_row_data=..., old_row_fetch_res=..., vColTypes=..., block_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\SegHandles\CTblSegHandle.cpp:764)
#17 0x443272 UpdateRow(row_addr=..., new_data_unit=..., vColTypes=..., block_hnd=..., seg_hnd=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:910)
#18 0x443470 UpdateRow(row_addr=..., vColValues=..., vColTypes=...) (C:\dm\bin\codes\CodeBlocks\ProjTemp\src\DbUtilities.cpp:935)
#19 0x4023e3 test_RowChaining() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:234)
#20 0x4081c6 main() (C:\dm\bin\codes\CodeBlocks\ProjTemp\main.cpp:1034)
When I see a call stack that looks like yours, the most common cause is heap corruption. A double free, or an attempt to free a pointer that was never allocated, can produce similar call stacks. Since you characterize the crash as inconsistent, heap corruption is the more likely candidate; double frees and freeing unallocated pointers tend to crash consistently in the same place. To hunt down issues like this I usually:
Install Debugging Tools for Windows
Open a command prompt with elevated privileges
Change directory to the directory that Debugging Tools for Windows is installed in.
Enable full page heap by running gflags.exe -p /enable applicationName.exe /full
Launch application with debugger attached and recreate the issue.
Disable full page heap for the application by running gflags.exe -p /disable applicationName.exe
Running the application with full page heap places an inaccessible page at the end of each allocation, so the program stops immediately if it accesses memory beyond the allocation (see the GFlags and PageHeap page). If a buffer overflow is causing the heap corruption, this setting should make the debugger break when the overflow occurs.
Make sure to disable page heap when you are done debugging: running under full page heap can greatly increase memory pressure on an application by making every heap allocation consume an entire page.
You can also use valgrind to check whether there is any invalid read/write or invalid free in your code:
valgrind -v --leak-check=full --show-reachable=yes --log-file=log_valgrind ./Process
log_valgrind will contain any invalid reads/writes.

C++ heap corruption and valgrind

I have a core on both Solaris/Linux platforms and I don't see the problem.
On a linux platform, I have the following core:
(gdb) where
#0 0x001aa81b in do_lookup_x () from /lib/ld-linux.so.2
#1 0x001ab0da in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2 0x001afa05 in _dl_fixup () from /lib/ld-linux.so.2
#3 0x001b5c90 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x00275e4c in __gxx_personality_v0 () from /opt/gnatpro/lib/libstdc++.so.6
#5 0x00645cfe in _Unwind_RaiseException_Phase2 (exc=0x2a7b10, context=0xffd58434) at ../../../src/libgcc/../gcc/unwind.inc:67
#6 0x00646082 in _Unwind_RaiseException (exc=0x2a7b10) at ../../../src/libgcc/../gcc/unwind.inc:136
#7 0x0027628d in __cxa_throw () from /opt/gnatpro/lib/libstdc++.so.6
#8 0x00276e4f in operator new(unsigned int) () from /opt/gnatpro/lib/libstdc++.so.6
#9 0x08053737 in Receptor::receive (this=0x93c12d8, msj=...) at Receptor.cc:477
#10 0x08099666 in EventProcessor::run (this=0xffd75580) at EventProcessor.cc:437
#11 0x0809747d in SEventProcessor::run (this=0xffd75580) at SEventProcessor.cc:80
#12 0x08065564 in main (argc=1, argv=0xffd76734) at my_project.cc:20
On a Solaris platform I have another core:
$ pstack core.ultimo
core 'core.ultimo' of 9220: my_project_sun
----------------- lwp# 1 / thread# 1 --------------------
0006fa28 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Dend6kM_pk2_ (1010144, 1ce84, ffbd0df8, ffb7a18c, fffffff8, ffbedc7c) + 30
0005d580 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Esize6kM_I_ (1010144, 219, 1ce84, ffffffff, fffffff8, ffbedc7c) + 30
0005ab14 __1cTReceptorHreceive6MrnKMensaje__v_ (33e630, ffbede70, ffffffff, 33e634, 33e68c, 0) + 1d4
0015df78 __1cREventProcessorDrun6M_v_ (ffbede18, 33e630, dcc, 1, 33e730, 6e) + 350
00159a50 __1cWSEventProcessorDrun6M_v_ (da08000, 2302f7, 111de0c, 159980, ff1fa07c, cc) + 48
000b6acc main (1, ffbeef74, ffbeef7c, 250000, 0, 0) + 16c
00045e10 _start (0, 0, 0, 0, 0, 0) + 108
----------------- lwp# 2 / thread# 2 --------------------
...
The piece of code is:
...
msj2.tipo(UPDATE);
for (i = 0; i < distr.size(); ++i)
{
    distr[i]->insert(new Mensaje(msj2));   // --> Receptor.cc:477
}
...
This core happens randomly; sometimes the process runs for weeks. The size of the core is 4291407872 B.
I am running valgrind to see if the heap is corrupted, but so far I have not encountered problems such as "Invalid read" or "Invalid write".
Also, while running valgrind I have twice seen the following message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have identified the lines of code, but could these errors lead to the core? I think I have seen these valgrind errors before and they didn't seem as important as the ones that say "Invalid read/write".
If you have any idea how to solve this problem, it would be highly appreciated.
The core size is the clue. The largest 32-bit unsigned number is 4,294,967,295, and your core is quite close to that, indicating that the process ran out of memory. The most likely cause is a memory leak.
See my recent article Memory Leaks in C/C++
Valgrind will find the issue for you on Linux. You have to start it with the --leak-check option; it checks for leaks when the process exits gracefully, so you will need a way to shut the process down.
Dtrace with dbx on Solaris will also likely work.
Also, when I was running valgrind I have found twice the following
message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have detected the lines of code but could these errors lead to
the core?
Yes, that could result in a SIGSEGV, as it is quite likely undefined behavior. (I'm not going to say it is definitely undefined behavior without seeing the actual code, but it likely is.) A SIGSEGV directly from it is not likely, but then again the intermittent failure you're seeing doesn't happen all that often either. So you do need to fix that problem.
In addition to valgrind, on Solaris you can also use libumem and watchmalloc to check for problems managing heap memory. See the man pages for umem_debug and watchmalloc to get started.
To use dbx on Solaris, you need to have Solaris Studio installed (it's free). Solaris Studio also offers a way to use the run-time memory checking of dbx without having to directly invoke the dbx debugger. See the man page for bcheck. The bcheck man page will be in the Solaris Studio installation directory tree, in the man directory.
And if it is a memory leak, you should be able to see the process address space growing over time.

SIGSEGV on program exit with boost::log

Some time ago we split our big project, which used almost entirely static libraries, into many projects with dynamic libraries.
Since then we have been seeing problems on shutdown.
Sometimes the process would not terminate. With gdb I found that a segfault occurs during object destruction, but the process is blocked in futex_wait.
I've since improved the code: global objects are now created in functions instead of as global static data. That reduced the problem; it doesn't happen in my development environment anymore.
However, in the test environment (rarely) and in the production environment (often), processes still get stuck on shutdown, so we need to restart the container manually or rely on some kind of health check.
We are trying to reproduce this situation in a standalone docker container running under Kubernetes, with the process running under circusd, and we see the following:
#0 malloc_consolidate (av=0xf47fc400 <main_arena>) at malloc.c:4151
#1 0xf46ff1ab in _int_free (av=0xf47fc400 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:4057
#2 0xf48c6e68 in operator delete(void*) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#3 0xf52d173d in std::_Deque_base<boost::log::v2_mt_posix::record_view, std::allocator<boost::log::v2_mt_posix::record_view> >::~_Deque_base() () from /usr/local/lib/liblog.so.0
#4 0xf52d18b3 in std::deque<boost::log::v2_mt_posix::record_view, std::allocator<boost::log::v2_mt_posix::record_view> >::~deque() () from /usr/local/lib/liblog.so.0
#5 0xf52d1940 in boost::log::v2_mt_posix::sinks::bounded_fifo_queue<4000u, boost::log::v2_mt_posix::sinks::drop_on_overflow>::~bounded_fifo_queue() () from /usr/local/lib/liblog.so.0
#6 0xf52d462e in boost::log::v2_mt_posix::sinks::asynchronous_sink<cout_sink, boost::log::v2_mt_posix::sinks::bounded_fifo_queue<4000u, boost::log::v2_mt_posix::sinks::drop_on_overflow>
>::~asynchronous_sink() () from /usr/local/lib/liblog.so.0
#7 0xf52d47f4 in asynchronous_sink<cout_sink>::~asynchronous_sink() () from /usr/local/lib/liblog.so.0
#8 0xf52c199a in boost::detail::sp_counted_impl_pd<asynchronous_sink<cout_sink>*, boost::detail::sp_ms_deleter<asynchronous_sink<cout_sink> >
>::dispose() () from /usr/local/lib/liblog.so.0
#9 0xf51f3e7b in boost::log::v2_mt_posix::core::~core() () from /usr/lib/libboost_log.so.1.58.0
#10 0xf51f6529 in boost::detail::sp_counted_impl_p<boost::log::v2_mt_posix::core>::dispose() () from /usr/lib/libboost_log.so.1.58.0
#11 0xf51f6160 in boost::shared_ptr<boost::log::v2_mt_posix::core>::~shared_ptr() () from /usr/lib/libboost_log.so.1.58.0
#12 0xf46bcfb3 in __cxa_finalize (d=0xf526fa88) at cxa_finalize.c:56
#13 0xf51eaab3 in ?? () from /usr/lib/libboost_log.so.1.58.0
#14 0xf7769e2c in _dl_fini () at dl-fini.c:252
#15 0xf46bcc21 in __run_exit_handlers (status=status#entry=0, listp=0xf47fc3a4 <__exit_funcs>, run_list_atexit=run_list_atexit#entry=true) at exit.c:82
#16 0xf46bcc7d in __GI_exit (status=0) at exit.c:104
#17 0xf46a572b in __libc_start_main (main=0x8060dc0, argc=5, argv=0xffdd1514, init=0x8088090, fini=0x8088100, rtld_fini=0xf7769c50 <_dl_fini>, stack_end=0xffdd150c) at libc-start.c:321
#18 0x080630cc in ?? ()
I have no idea how to progress from here. What is happening? Why do we get the segfault in the boost::log::core destructor in this environment?
Does anyone have advice on how to track this down, perhaps based on experience?