C++ debugging: Terminated with SIGABRT - c++

I am writing a C++ program that runs on a cluster of machines, all of which talk to each other over TCP sockets. The program crashes randomly on one of the machines. I analyzed the core dump with gdb; the output follows:
$ gdb executable dump
Core was generated by `/home/user/experiments/files/executable 2 /home/user/'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fb76a084c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0 0x00007fb76a084c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fb76a088028 in __GI_abort () at abort.c:89
#2 0x00007fb76a0c12a4 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fb76a1cd113 "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007fb76a158bbc in __GI___fortify_fail (msg=<optimized out>, msg@entry=0x7fb76a1cd0aa "buffer overflow detected") at fortify_fail.c:38
#4 0x00007fb76a157a90 in __GI___chk_fail () at chk_fail.c:28
#5 0x00007fb76a158b07 in __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25
#6 0x000000000040a918 in LocalSenderPort::run() ()
#7 0x000000000040ae70 in LocalSenderPort::LocalSenderPort(unsigned int, std::string, std::vector<std::string, std::allocator<std::string> >, char*) ()
#8 0x00000000004033d5 in main ()
Any suggestions on what I should look at? How should I proceed? Any help is really appreciated.
I am not sharing code right now, as it is a large codebase spread across many files, but I can share it if needed.

This error: __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25 means that your program violated a precondition of one of the FD_* macros.
The source of __fdelt_chk is quite simple, and there are only two conditions under which it fails: you pass in a negative file descriptor, or you pass in a file descriptor greater than 1023.
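If you want to catch the bad call in your own code instead of deep inside glibc, here is a minimal sketch of the same precondition check (the helper name is mine, not from your program):

#include <sys/select.h>
#include <cassert>

// Same precondition the fortified FD_SET enforces: 0 <= fd < FD_SETSIZE (1024 on Linux).
static void checked_fd_set(int fd, fd_set *set)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_SET(fd, set);
}

Replacing direct FD_SET calls with this makes the assertion fire at the call site, so the backtrace shows which descriptor went out of range.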
In this day and age, using select and/or FD_SET in any program that can have more than 1024 simultaneous connections (which Linux easily allows) can only end in tears. Use epoll instead.
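For comparison, here is a minimal epoll sketch (error handling trimmed, names illustrative) that waits for one socket to become readable without any FD_SETSIZE ceiling:

#include <sys/epoll.h>
#include <unistd.h>

// Wait up to timeout_ms for sockfd to become readable; works for any descriptor value.
bool wait_readable(int sockfd, int timeout_ms)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return false;

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = sockfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev) < 0) {
        close(epfd);
        return false;
    }

    epoll_event ready{};
    int n = epoll_wait(epfd, &ready, 1, timeout_ms);
    close(epfd);
    return n == 1;
}

In a real server you would create the epoll instance once, register every connection with epoll_ctl, and call epoll_wait in the main loop rather than per call.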

Related

Unexpected behavior of custom hardware design based on i.MX6 & MT41K256M16TW-107 IT:P

I'm new to custom hardware design, and I'm going to scale up my custom hardware, which is functioning well on a few boards. I need some help with making decisions about the prototypes and about scaling up given their current state.
This hardware is based on the i.MX6Q processor and MT41K256M16TW-107 IT:P memory, and it is most similar to the nitrogen6_max development board.
I'm having trouble with my hardware that is really difficult to figure out, as some boards work really well and some do not (out of 7 production units, 4 boards function really well, while one board gets segmentation faults and kernel panics while running a Linux application). When I run memory calibration on the bad boards, the results look just like those of the good boards.
The segmentation fault points to some memory issue; I took a core dump and backtraced it with GDB on Linux:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 gcoHARDWARE_QuerySamplerBase (Hardware=0x22193dc, Hardware@entry=0x0,
VertexCount=0x7ef95370, VertexCount@entry=0x7ef95368, VertexBase=0x40000,
FragmentCount=FragmentCount@entry=0x2217814, FragmentBase=0x0) at
gc_hal_user_hardware_query.c:6020
6020 gc_hal_user_hardware_query.c: No such file or directory.
[Current thread is 1 (Thread 0x76feb010 (LWP 697))]
(gdb) bt
#0 gcoHARDWARE_QuerySamplerBase (Hardware=0x22193dc, Hardware@entry=0x0,
VertexCount=0x7ef95370, VertexCount@entry=0x7ef95368, VertexBase=0x40000,
FragmentCount=FragmentCount@entry=0x2217814, FragmentBase=0x0) at
gc_hal_user_hardware_query.c:6020
#1 0x765d20e8 in gcoHAL_QuerySamplerBase (Hal=<optimized out>,
VertexCount=VertexCount@entry=0x7ef95368, VertexBase=<optimized out>,
FragmentCount=FragmentCount@entry=0x2217814,
FragmentBase=0x0) at gc_hal_user_query.c:692
#2 0x681e31ec in gcChipRecompileEvaluateKeyStates (chipCtx=0x0,
gc=0x7ef95380) at src/chip/gc_chip_state.c:2115
#3 gcChipValidateRecompileState (gc=0x7ef95380, gc@entry=0x21bd96c,
chipCtx=0x0, chipCtx@entry=0x2217814) at src/chip/gc_chip_state.c:2634
#4 0x681c6da8 in __glChipDrawValidateState (gc=0x21bd96c) at
src/chip/gc_chip_draw.c:5217
#5 0x68195688 in __glDrawValidateState (gc=0x21bd96c) at
src/glcore/gc_es_draw.c:585
#6 __glDrawPrimitive (gc=0x21bd96c, mode=<optimized out>) at
src/glcore/gc_es_draw.c:943
#7 0x68171048 in glDrawArrays (mode=4, first=6, count=6) at
src/glcore/gc_es_api.c:399
#8 0x76c9ac72 in CEGUI::OpenGL3GeometryBuffer::draw() const () from
/usr/lib/libCEGUIOpenGLRenderer-0.so.2
#9 0x76dd1aee in CEGUI::RenderQueue::draw() const () from
/usr/lib/libCEGUIBase-0.so.2
#10 0x76e317d8 in CEGUI::RenderingSurface::draw(CEGUI::RenderQueue const&,
CEGUI::RenderQueueEventArgs&) () from /usr/lib/libCEGUIBase-0.so.2
#11 0x76e31838 in CEGUI::RenderingSurface::drawContent() () from
/usr/lib/libCEGUIBase-0.so.2
#12 0x76e36d30 in CEGUI::GUIContext::drawContent() () from
/usr/lib/libCEGUIBase-0.so.2
#13 0x76e31710 in CEGUI::RenderingSurface::draw() () from
/usr/lib/libCEGUIBase-0.so.2
#14 0x001bf79c in tengri::gui::cegui::System::Impl::draw (this=0x2374f08) at
codebase/src/gui/cegui/system.cpp:107
#15 tengri::gui::cegui::System::draw (this=this@entry=0x2374e74) at
codebase/src/gui/cegui/system.cpp:212
#16 0x000b151e in falcon::osd::view::MainWindowBase::Impl::preNativeUpdate
(this=0x2374e10) at codebase/src/osd/view/MainWindow.cpp:51
#17 falcon::osd::view::MainWindowBase::preNativeUpdate
(this=this@entry=0x209fe30) at codebase/src/osd/view/MainWindow.cpp:91
#18 0x000c4686 in falcon::osd::view::FBMainWindow::update (this=0x209fe00)
at
codebase/include/falcon/osd/view/FBMainWindow.h:56
#19 falcon::osd::view::App::Impl::execute (this=0x209fdb0) at
codebase/src/osd/view/app_view_osd_falcon.cpp:139
#20 falcon::osd::view::App::execute (this=<optimized out>) at
codebase/src/osd/view/app_view_osd_falcon.cpp:176
#21 0x000475f6 in falcon::osd::App::execute (this=this@entry=0x7ef95c84) at
codebase/src/osd/app_osd_falcon.cpp:75
#22 0x00047598 in main () at codebase/src/main.cpp:5
(gdb) Quit
I have attached the NXP tool calibration results for two good boards and one bad board (the one getting segmentation faults). Click the following links.
Board 1
Board 2
Board 3
I did a stress test using stressapptest; it ran overnight, but I didn't get any fault and the test passed.
Of the above three boards, Board 1 and Board 2 work really well, while Board 3 gets kernel panics running the same application. Can you help me find any clue in the results from these three boards?
I did a production run of 50 units 6 months ago and only 30 worked properly, but that was with Alliance Memory AS4C256M16D3A-12BCN. So could this be a design issue? If it is an issue with the DDR layout or the overall design, why do some boards work really well?
Could this be a manufacturing issue? Then how could it happen within the same production run, with some boards working and some not?
Does stressapptest stress the power supply as well? Do you know of any Linux tool that can stress power too?
I don't have much experience with mass production, but I would like to move forward after learning from and correcting these issues. I would be thankful for a prompt reply.

My program crashes when calling vkCmdBindDescriptorSets

My program runs well when I open only one model file, but when I try to open multiple files (each with its own Vulkan instance and thread), it sometimes crashes at this point. I checked the arguments of the function, but they seem fine.
The GDB backtrace is here:
Thread 83 "VulkanRenderer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffebfdff700 (LWP 50908)]
0x00007fffe35b7053 in ?? () from /usr/lib/nvidia-375/libnvidia-glcore.so.375.39
(gdb) bt
#0 0x00007fffe35b7053 in ?? () from /usr/lib/nvidia-375/libnvidia-glcore.so.375.39
#1 0x00007fffe35e1a7e in ?? () from /usr/lib/nvidia-375/libnvidia-glcore.so.375.39
#2 0x00007fffe35e3102 in ?? () from /usr/lib/nvidia-375/libnvidia-glcore.so.375.39
#3 0x00007ffff78ca4ed in VulkanCommandBuffer::SetDescriptorSet(vk::PipelineBindPoint, VulkanPipelineLayout*, unsigned int, unsigned int, VulkanDescriptorSet**, unsigned int, unsigned int*) () from
How can I fix this crash bug?
Are the commands being sent to the same queue or to different queues? Also, where is the output going? Is it the same window for both instances?

Reading OpenEXRs sequentially from a Pipe

I am trying to read a stream of EXRs from one pipe, process them, and write the results into a different pipe. In this case they are named pipes, but they could just as well be stdin and stdout.
My problem occurs when the pipe runs dry. OpenEXR doesn't like trying to read nothing and crashes with the following stack trace.
(gdb) run in.exr out.exr
Starting program: /Users/jon/Library/Developer/Xcode/DerivedData/compressor-abhdftqzleulxsfkpidvcazfowwo/Build/Products/Debug/compressor in.exr out.exr
Reading symbols for shared libraries +++++++++......................................................................................................... done
Reading symbols for shared libraries ............ done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
terminate called throwing an exception
Program received signal SIGABRT, Aborted.
0x00007fff90957ce2 in __pthread_kill ()
(gdb) backtrace
#0 0x00007fff90957ce2 in __pthread_kill ()
#1 0x00007fff866f27d2 in pthread_kill ()
#2 0x00007fff866e3a7a in abort ()
#3 0x00007fff8643c7bc in abort_message ()
#4 0x00007fff86439fcf in default_terminate ()
#5 0x00007fff844d61cd in _objc_terminate ()
#6 0x00007fff8643a001 in safe_handler_caller ()
#7 0x00007fff86439fed in unexpected_defaults_to_terminate ()
#8 0x00007fff8643a040 in __cxxabiv1::__unexpected ()
#9 0x00007fff8643aefe in __cxa_call_unexpected ()
#10 0x0000000100008cfb in exr::ReadEXR (pixelBuffer=@0x7fff5fbfee00, is=@0x7fff5fbfeef8) at /Users/jon/Development/compressor/compressor/exr.cpp:47
#11 0x0000000100001c39 in main (argc=4, argv=0x7fff5fbffaa8) at /Users/jon/Development/compressor/compressor/main.cpp:79
I would really like OpenEXR to block the thread until more data becomes available, but some way of manually checking whether there is more data would also do, so long as it was reasonably robust.
Thanks.
The solution to this problem is indeed to extend Imf::IStream and implement it so that it blocks when the input pipe runs dry.
For this specific problem a few considerations need to be made, such as the fact that pipes aren't seekable and do not know their position; these can be worked around, however.
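Below is a minimal sketch of such a stream, assuming OpenEXR 2.x, where Imf::IStream uses Imf::Int64 for tellg/seekg (3.x uses uint64_t). The class name and the choice to read from a raw POSIX file descriptor are mine, not from the original code. The key points: read() loops until all requested bytes have arrived, blocking while the pipe is dry, and the position is tracked manually because a pipe cannot report it.

#include <ImfIO.h>
#include <Iex.h>
#include <unistd.h>
#include <cerrno>

class PipeIStream : public Imf::IStream
{
  public:
    PipeIStream (int fd, const char name[] = "pipe")
        : Imf::IStream (name), _fd (fd), _pos (0) {}

    bool read (char c[], int n) override
    {
        int done = 0;
        while (done < n)
        {
            ssize_t r = ::read (_fd, c + done, n - done);        // blocks until data arrives
            if (r > 0)
                done += static_cast<int> (r);
            else if (r == 0)
                throw Iex::InputExc ("Unexpected end of pipe.");  // writer closed its end
            else if (errno != EINTR)
                throw Iex::IoExc ("read() from pipe failed.");
        }
        _pos += n;
        return true;                                              // assume more data may follow
    }

    Imf::Int64 tellg () override { return _pos; }

    void seekg (Imf::Int64 pos) override
    {
        if (pos < _pos)
            throw Iex::IoExc ("Cannot seek backwards on a pipe.");
        char scratch[4096];                                       // emulate forward seeks by
        while (_pos < pos)                                        // reading and discarding bytes
        {
            Imf::Int64 want = pos - _pos;
            read (scratch, want > 4096 ? 4096 : static_cast<int> (want));
        }
    }

  private:
    int        _fd;
    Imf::Int64 _pos;
};

The usual caveat, hedged because it depends on how the file was written: forward-only seeking is enough only when the reader asks for chunks in the order they appear in the file, which is the common case for scanline EXRs written top to bottom.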

infinite abort() in a backrace of a c++ program core dump

I have a strange problem that I can't solve. Please help!
The program is a multithreaded C++ application that runs on an ARM Linux machine. Recently I began testing it in long runs, and sometimes it crashes after 1-2 days like so:
*** glibc detected *** /root/client/my_program: free(): invalid pointer: 0x002a9408 ***
When I open the core dump I see that the main thread seems to have a corrupt stack: all I can see is an endless chain of abort() calls.
GNU gdb (GDB) 7.3
...
This GDB was configured as "--host=i686 --target=arm-linux".
[New LWP 706]
[New LWP 700]
[New LWP 702]
[New LWP 703]
[New LWP 704]
[New LWP 705]
Core was generated by `/root/client/my_program'.
Program terminated with signal 6, Aborted.
#0 0x001c44d4 in raise ()
(gdb) bt
#0 0x001c44d4 in raise ()
#1 0x001c47e0 in abort ()
#2 0x001c47e0 in abort ()
#3 0x001c47e0 in abort ()
#4 0x001c47e0 in abort ()
#5 0x001c47e0 in abort ()
#6 0x001c47e0 in abort ()
#7 0x001c47e0 in abort ()
#8 0x001c47e0 in abort ()
#9 0x001c47e0 in abort ()
#10 0x001c47e0 in abort ()
#11 0x001c47e0 in abort ()
And it goes on and on. I tried to get to the bottom of it by moving up the stack (frame 3000 or even higher), but eventually the core dump runs out of frames and I still can't see why this happened.
When I examine the other threads everything seems normal there.
(gdb) info threads
Id Target Id Frame
6 LWP 705 0x00132f04 in nanosleep ()
5 LWP 704 0x001e7a70 in select ()
4 LWP 703 0x00132f04 in nanosleep ()
3 LWP 702 0x00132318 in sem_wait ()
2 LWP 700 0x00132f04 in nanosleep ()
* 1 LWP 706 0x001c44d4 in raise ()
(gdb) thread 5
[Switching to thread 5 (LWP 704)]
#0 0x001e7a70 in select ()
(gdb) bt
#0 0x001e7a70 in select ()
#1 0x00057ad4 in CSerialPort::read (this=0xbea7d98c, string_buffer=..., delimiter=..., timeout_ms=1000) at CSerialPort.cpp:202
#2 0x00070de4 in CScanner::readResponse (this=0xbea7d4cc, resp_recv=..., timeout=1000, delim=...) at PidScanner.cpp:657
#3 0x00071198 in CScanner::sendExpect (this=0xbea7d4cc, cmd=..., exp_str=..., rcv_str=..., timeout=1000) at PidScanner.cpp:604
#4 0x00071d48 in CScanner::pollPid (this=0xbea7d4cc, mode=1, pid=12, pid_str=...) at PidScanner.cpp:525
#5 0x00072ce0 in CScanner::poll1 (this=0xbea7d4cc)
#6 0x00074c78 in CScanner::Poll (this=0xbea7d4cc)
#7 0x00089edc in CThread5::Thread5Poll (this=0xbea7d360)
#8 0x0008c140 in CThread5::run (this=0xbea7d360)
#9 0x00088698 in CThread::threadFunc (p=0xbea7d360)
#10 0x0012e6a0 in start_thread ()
#11 0x001e90e8 in clone ()
#12 0x001e90e8 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(Class and function names are a bit weird because I renamed them. :-)
So thread #1 is where the stack is corrupt, and the backtrace of every other thread (2-6) ends with
Backtrace stopped: previous frame identical to this frame (corrupt stack?).
That happens because threads 2-6 are created in thread #1.
The thing is that I can't run the program under gdb because it runs on an embedded system, and I can't use a remote gdb server. The only option is examining the core dumps, which do not occur very often.
Could you please suggest something that could move me forward with this? (Maybe something else I can extract from the core dump, or perhaps some hooks I could add in the code to catch the abort() call.)
UPDATE: Basile Starynkevitch suggested using Valgrind, but it turns out it is only ported to ARMv7. I have an ARM926, which is ARMv5, so this won't work for me. There are some efforts to compile Valgrind for ARMv5, though: Valgrind cross compilation for ARMv5tel, valgrind on the ARM9.
UPDATE 2: I couldn't make Electric Fence work with my program. The program uses C++ and pthreads. The version of Efence I got, 2.1.13, crashed in an arbitrary place after I started a thread and tried to do something more or less complicated (for example, putting a value into an STL vector). I saw people mentioning some patches for Efence on the web but didn't have time to try them. I tried this on my Linux PC, not on the ARM, and other tools like Valgrind or Dmalloc don't report any problems with the code. So everyone using version 2.1.13 of Efence: be prepared for problems with pthreads (or maybe pthreads + C++ + STL, I don't know).
My guess for the "infinite" aborts is that either abort() causes a loop (e.g. abort -> signal handler -> abort -> ...) or gdb can't correctly interpret the frames on the stack.
In either case I would suggest manually checking out the stack of the problematic thread. If abort causes a loop, you should see a pattern or at least the return address of abort repeating every so often. Perhaps you can then more easily find the root of the problem by manually skipping large parts of the (repeating) stack.
Otherwise, you should find that there is no repeating pattern and, hopefully, the return address of the failing function somewhere on the stack. In the worst case such addresses have been overwritten by a buffer overflow or similar, but perhaps you can still get lucky and recognise what they were overwritten with.
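A rough way to do that inspection on the core dump (the word count is arbitrary; the address is the abort return address that repeats in your backtrace):

(gdb) thread 1
(gdb) x/512a $sp
(gdb) info symbol 0x001c47e0

x/512a dumps the raw stack words as addresses; feeding anything that looks like a code address to info symbol tells you which function it points into, so a repeating abort address, or a return address into your own code, should stand out.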
One possibility here is that something in that thread has very, very badly smashed the stack by vastly overwriting an on-stack data structure, destroying all the needed data on the stack in the process. That makes postmortem debugging very unpleasant.
If you can reproduce the problem at will, the right thing to do is to run the thread under gdb and watch what is going on precisely at the moment when the stack gets nuked. This may, in turn, require some sort of careful search to determine where exactly the error is happening.
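A sketch of that, assuming you can get the process under gdb and pick a stack variable near the region that gets trashed (the variable name and address are purely illustrative):

(gdb) start                      # or attach to the running process
(gdb) print &some_local          # a variable close to where the corruption shows up
(gdb) watch *(long *) 0x...      # substitute the printed address
(gdb) continue                   # gdb stops at the instruction that overwrites that word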
If you cannot reproduce the problem at will, the best I can suggest is very carefully looking for clues in the thread local storage for that thread to see if it hints at where the thread was executing before death hit.

Mac: I get SIGABRT but the call stack is useless

I'm coding a game for Mac in C++, I'm getting a SIGABRT, and the console prints the following:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
what(): bad lexical cast: source type value could not be interpreted as target
Program received signal: “SIGABRT”.
So, I'm doing a bad lexical_cast. But the problem is that I can't tell where, because the call stack is as follows:
#0 0x7fff85fb629a in mach_msg_trap
#1 0x7fff85fb690d in mach_msg
#2 0x7fff81f58932 in __CFRunLoopRun
#3 0x7fff81f57dbf in CFRunLoopRunSpecific
#4 0x7fff88dba7ee in RunCurrentEventLoopInMode
#5 0x7fff88dba5f3 in ReceiveNextEventCommon
#6 0x7fff88dba4ac in BlockUntilNextEventMatchingListInMode
#7 0x7fff84f85e64 in _DPSNextEvent
#8 0x7fff84f857a9 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
#9 0x7fff84f4b48b in -[NSApplication run]
#10 0x7fff84f441a8 in NSApplicationMain
#11 0x1000ef759 in os_gameMainLoop at main-osx.mm:22
#12 0x10009a97d in main at words.cpp:18
That's not the right stack.
What's mach_msg_trap?
Why am I getting this call stack?
Do I have any way to get a good call stack on the crash?
Thanks!
The debugger stopped in the wrong thread. Try t a a bt in GDB to see backtraces for all the threads.
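Spelled out, that abbreviation is:

(gdb) thread apply all backtrace

which runs backtrace in every thread, so you can find the one that actually threw the boost::bad_lexical_cast rather than the event-loop thread shown above.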
mach_msg_trap is where threads park while they are waiting for a message to come in. So, you are looking at a thread that isn't running. Mach is the name of the message-passing interface on OS X.