I have an ncurses app that crashes with the following, sometimes instantly after launch, sometimes only after some fiddling.
malloc: *** error for object 0x100300400: double free
Program received signal SIGABRT, Aborted
(gdb) where
#0 0x00007fff846a7426 in read ()
#1 0x00007fff83f3d775 in _nc_wgetch ()
#2 0x00007fff83f3de3f in wgetch ()
(and so on into my code)
Does anyone have suggestions for likely things to pursue?
It looks like you are using glibc, likely on an x86_64 Linux system.
The tool to use for any kind of heap corruption on Linux/x86_64 is Valgrind. It will usually point you straight at the first invalid free, along with the stacks where the block was allocated and previously freed, so there is no point in guessing where the problem might be (it could be anywhere).
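For an ncurses program it helps to send Valgrind's report to a file so it doesn't fight with the curses UI. A typical invocation (the program name here is assumed) looks like:

valgrind --log-file=valgrind.out ./your_app

Memcheck's default checks already report an invalid or double free together with the stack of the bad free, the earlier free of the same block, and its allocation, which is usually enough to find the offending code.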
I'm wondering if there is a way to catch every segmentation fault / core dump and print its call stack. Catching all signals looks like a feasible approach, but I'm not sure how well it works: in my experience it did not always give the outcome I wanted, and sometimes it simply failed to catch the core dump, so maybe I did something wrong.
The reason I ask is that I usually debug a very complicated system, where many segmentation faults are hard to reproduce, and it is not practical to run it with gdb line by line. So if I could catch every segmentation fault and print a call stack or other information, that would be a great help in debugging.
I'm wondering if there is a way to catch every segmentation fault ...
Certainly. This can be achieved by registering a signal handler using std::signal.
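A minimal sketch (the handler name and signal choice are illustrative); remember that only async-signal-safe functions may be called from inside the handler:

#include <csignal>
#include <cstdlib>
#include <unistd.h>

extern "C" void on_fatal_signal(int)
{
    /* write() is async-signal-safe; printf and iostreams are not */
    const char msg[] = "caught a fatal signal\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE);
}

int main()
{
    std::signal(SIGSEGV, on_fatal_signal);
    /* ... rest of the program ... */
}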
... and print its call stack?
This is quite a lot trickier. There is no standard way to inspect the call stack in C++. Linux has several ways, but unfortunately the most conventional one, the glibc backtrace function, is not async-signal-safe, so there is no guarantee that it will work in a signal handler.
A simpler approach is to not do this in a signal handler, but instead configure Linux to generate a core dump. You'll get much more information out of that.
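A minimal way to set that up (the program name is assumed; where the core file ends up is distribution-specific):

ulimit -c unlimited                   # allow core files in this shell
./myprogram                           # crashes and leaves a core file
cat /proc/sys/kernel/core_pattern     # shows where/how cores are written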
... core dump, and print its call stack?
Certainly. You can use a debugger:
gdb /path/to/executable /path/to/corefile
(gdb) bt
and not practical to run it with gdb line by line.
Then run it in gdb but not line by line. Simply run the program within gdb until it segfaults, then print the backtrace.
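Such a session, with the program name assumed, would look roughly like:

gdb ./myprogram
(gdb) run
...                       # exercise the program normally until it crashes
Program received signal SIGSEGV, Segmentation fault.
(gdb) bt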
While using GDB to debug a program, I have been having issues with the program stopping while in debug mode. When I do a backtrace, I find that it's deep within a proprietary third party library call stack and I am looking to find out why exactly the program has stopped. I am still a GDB beginner, so I am unsure how to do this. Looking at the backtrace I noticed __cxa_throw () from /usr/lib64/libstdc++.so.6, so I am presuming that an exception of some sort was thrown, but I would like to know how to get more information about it, if possible.
Try using the backtrace command, which will show how your program got into the state it is in.
How do I find out why gdb has stopped
GDB usually tells you right away, e.g.
Program received signal SIGABRT, Aborted.
0x00007ffff7750425 in __GI_raise (sig=<optimized out>)
The program stopped because it received a signal.
I find that it's deep within a proprietary third party library call stack and I am looking to find out why exactly the program has stopped.
It has stopped exactly for the reason GDB told you about.
Looking at the backtrace I noticed that __cxa_throw() from /usr/lib64/libstdc++.so.6 so I am presuming that an exception of some sort was thrown but I would like to know how to get more information about it.
Presence of __cxa_throw does indicate that an exception has been thrown (and presence of std::terminate() indicates that it was an uncaught exception).
Without debug info for the third-party library, your choices for finding the cause are limited:
You can read the documentation for this library and double-check that you have not violated any preconditions that it requires
You can disassemble the routine that called __cxa_throw and figure out exactly why that routine was called (see the GDB sketch below).
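A sketch of the second option in GDB (the catchpoint and frame numbers will of course differ in your program): stop at the throw itself, then inspect and disassemble the frame that called __cxa_throw:

(gdb) catch throw
Catchpoint 1 (throw)
(gdb) run
...
(gdb) bt            # the frame just above __cxa_throw is the library routine
(gdb) frame 1
(gdb) disassemble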
I can't get into the specifics, for a variety of reasons, but here's the essential architecture of what I'm working with:
I have a C++ framework, which uses C++ object files built by me to execute a dynamic simulation.
The C++ libraries call, among other things, a shared (.so) library, written in Ada.
As best as I can tell, the Ada library (which is a large collection of nontrivial code) is generating exceptions on fringe cases, but I'm having trouble isolating the function that is generating the exception.
Here's what I'm using:
CentOS 4.8 (Final)
gcc 3.4.6 (w/ gnat)
gdb 6.3.0.0-1.162.el4rh
This is the error I get under normal execution:
terminate called without an active exception
raised PROGRAM_ERROR : unhandled signal
I can get gdb to catch the exception as soon as it returns to the C++, but I can't get it to catch inside the Ada code. I've made sure to compile everything with -g, but that doesn't seem to help the problem.
When I try to catch/break on the signal/exception in gdb (which politely tells me Catch of signal not yet implemented), I get this:
[Thread debugging using libthread_db enabled]
[New thread -1208371520 (LWP 14568)]
terminate called without an active exception
Program received signal SIGABRT, Aborted.
[Switching to thread -1208371520 (LWP 14568)]
0x001327a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
I believe the terminate called [...] line is from the framework. When I try to capture that break, then run a backtrace (bt), I get something like this:
#0 0x001327a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00661825 in raise () from /lib/tls/libc.so.6
#2 0x00663289 in abort () from /lib/tls/libc.so.6
#3 0x0061123e in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x0060eed1 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.6
#5 0x0060ef06 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x0060f0a3 in __cxa_rethrow () from /usr/lib/libstdc++.so.6
#7 0x001fe526 in cpputil::ExceptionBase::Rethrow (scope=@0xbfe67470) at ExceptionBase.cpp:140
At that point, it's into the framework code.
I've read several guides and tutorials and man pages online, but I'm at a bit of a loss. I'm hoping that someone here can help get me pointed in the right direction.
It sounds like you're able to compile the Ada source code. Assuming that's the case, in the subprogram(s) that are being called, and during whose execution the exceptions are being raised, add an exception handler at the end that dumps the exception information:
exception
   when E : others =>
      Ada.Text_IO.Put_Line (Ada.Exceptions.Exception_Information (E));
      raise;
You'll also need to add a 'with' of Ada.Exceptions to the package, and of Ada.Text_IO if that isn't already present.
I'm not sure exactly what you'll get out of that version of GNAT, but it's probably the invocation addresses, which you can then decode using addr2line.
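For example (the library path and address here are purely illustrative), each address can be turned back into a function, file, and line:

addr2line -f -C -e /path/to/your_ada_lib.so 0x001fe526

Note that for a shared library you may need the address relative to the library's load address rather than the raw runtime address.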
Could you start the C++ framework from an Ada main? If so, and you can propagate the exceptions through the C++ framework to the Ada main, its last chance handler ought to give you a pretty good report with exception, source file and line where it occurred, and a stack dump for addr2line. My experience with these is that the debugger usually isn't needed after that.
I could be off beam here because I haven't used a GNAT anywhere near as old as yours...
As the question states, is it useful to always collect a software-based backtrace (like using libc backtrace, http://www.gnu.org/software/libc/manual/html_node/Backtraces.html) in all error functions and signal handlers?
Wouldn't it be very helpful for debugging a wide variety of bugs, like memory and concurrency bugs? I guess it would not hurt normal performance either, since it would only be triggered on error paths.
As the question states, is it useful to always collect a software-based backtrace
Yes, it is generally very useful to have a crash stack trace when:
your code runs in your own environment, and you are not worried about the stack trace revealing any secrets.
the crash handler does not further corrupt the core dump, does not hang, etc.
like using libc backtrace
glibc backtrace calls calloc under certain conditions, and is not safe in a crash handler. It can cause both the hang and the further corruption mentioned above. Writing a crash handler that will reliably print a stack trace in an async-signal-safe manner is quite non-trivial.
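For illustration only, here is the usual partial mitigation (the names are illustrative, and this is still not strictly async-signal-safe): backtrace_symbols_fd writes directly to a file descriptor without allocating, and calling backtrace once at startup forces glibc to load libgcc before any crash happens:

#include <csignal>
#include <execinfo.h>
#include <unistd.h>

extern "C" void crash_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    /* backtrace_symbols_fd writes straight to the fd and does not malloc,
       unlike backtrace_symbols */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    /* restore the default action and re-raise so a core dump is still produced */
    std::signal(sig, SIG_DFL);
    std::raise(sig);
}

int main()
{
    void *dummy[2];
    backtrace(dummy, 2);   /* force libgcc to load now, not inside a handler */
    std::signal(SIGSEGV, crash_handler);
    std::signal(SIGABRT, crash_handler);
    /* ... rest of the program ... */
}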
why then do error functions in "standard" applications not call backtrace?
Consider cat /no/such/file. Currently it produces:
cat: /no/such/file: No such file or directory
which is all you really need to know. Making this print anything else is useless. If you had many such files, and cat printed a full stack trace for each, you'd get many pages of error output, and that would only make finding the real problem harder.
For fatal signal handlers (e.g. SIGSEGV) the answer is that most "standard" applications don't actually handle such signals, and simply use the default action, which produces a core dump.
But if they did catch the signal, calling backtrace, backtrace_symbols, or backtrace_symbols_fd from the signal handler would be equally unsafe and could deadlock, which is much worse than simply dumping core. Consider what happens if you have a long-running script with 1000 commands in it. You start it, and a week later discover that it made no progress because the second command crashed and then deadlocked while trying to print the crash stack trace.
I'm building a program in C++, using SDL, and am occasionally receiving this error:
*** glibc detected *** ./assistant: double free or corruption (!prev)
It's difficult to replicate, so I can't find exactly what's causing it, but I just added a second thread to the program, and neither thread run on its own seems to cause the error.
The threads don't share any variables, though they both run the functions SDL_BlitSurface and SDL_Flip. Could running these concurrently throw up such an error, or am I barking up the wrong tree?
If this is the cause, should I simply throw a mutex around all SDL calls?
Are you running with the MALLOC_CHECK_ environment variable set? This turns on memory checks in glibc, and I've had problems with it before because of a race condition in the glibc memory-checking code (http://sourceware.org/bugzilla/show_bug.cgi?id=10282), which made it put out messages like this spuriously. Try running under Valgrind and see if it reports any issues.
Turns out that it was being caused by the threads not terminating correctly. Instead of terminating them from main, I allowed them to return when they saw that main had finished running (through a global 'running' variable), and the problem disappeared.
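A minimal sketch of that pattern, using std::thread and std::atomic as stand-ins for the SDL threads and the plain global used in the actual program:

#include <atomic>
#include <thread>

std::atomic<bool> running{true};

void worker()
{
    while (running.load()) {
        /* ... per-frame work, e.g. blitting and flipping ... */
    }
    /* returning here lets the thread shut down cleanly instead of
       being terminated from main */
}

int main()
{
    std::thread t(worker);
    /* ... main loop ... */
    running.store(false);   /* tell the worker to finish */
    t.join();               /* wait for it to return before exiting */
}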