Today I realized, to my horror, that my C++ simulation program had crashed after running for 12 days, just a few lines before its end, leaving me with nothing but a (truncated) core dump.
Analysis of the core dump with gdb revealed that the
Program terminated with signal SIGBUS, Bus error.
and that the crash occurred at the following line of my code:
seconds = std::difftime(stopTime, startTime); // seconds is of type double
The variables stopTime and startTime are of type std::time_t and I was able to extract their values at crash time from the core dump:
startTime: 1426863332
stopTime: 1427977226
The stack trace above the difftime-call looks like this:
#0 0x.. in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1 0x.. in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
I wrote a small program to reproduce the error, but without success. Just calling std::difftime(stopTime, startTime) with the above values does not cause a SIGBUS crash. Of course, I don't want that to happen again. I have successfully executed the same program several times before (although with different arguments) with comparable execution times. What could cause this problem and how can I prevent it in the future?
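For reference, the standalone check was essentially just this (a minimal sketch using the values recovered from the core dump); it runs to completion without any signal:
#include <ctime>
#include <iostream>

int main()
{
    std::time_t startTime = 1426863332;
    std::time_t stopTime  = 1427977226;
    double seconds = std::difftime(stopTime, startTime);
    std::cout << "elapsed: " << seconds << " s\n";
    return 0;
}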
Here is some additional system information.
GCC: (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]
Linux Kernel: 3.11.10-25-desktop, x86_64
C++ standard library: 6.0.18
Edit
Here is some more context. First, the complete stack trace (ellipsis [..] mine):
#0 0x00007f309a4a5bca in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1 0x00007f309a4ac195 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#2 0x0000000000465453 in CStopwatch::getTime (this=0x7fff0db48c60, delimiterHourMinuteSecondsBy="") at [..] CStopwatch.cpp:86
#3 0x00000000004652a9 in CStopwatch::stop (this=0x7fff0db48c60) at [..] CStopwatch.cpp:51
#4 0x0000000000479a0c in main (argc=33, argv=0x7fff0db499c8) at [..] coherent_ofdm_tsync_mse.cpp:998
The problem occurs in an object of class CStopwatch which is created at the beginning of the program. The stopwatch is started in main() at the very top. After the simulation is finished, the function CStopwatch::stop( ) is called.
The constructor of the stopwatch class:
/*
 * Initialize start and stop time on construction
 */
CStopwatch::CStopwatch()
{
    this->startTime = std::time_t( 0 );
    this->stopTime = std::time_t( 0 );
    this->isRunning = false;
}
The function CStopwatch::stop( )
/*
 * Stop the timer and return the elapsed time as a string
 */
std::string CStopwatch::stop( )
{
    if ( this->isRunning ) {
        this->stopTime = std::time( 0 );
    }
    this->isRunning = false;
    return getTime( );
}
The function CStopwatch::getTime()
/*
 * Return the elapsed time as a string
 */
std::string CStopwatch::getTime( std::string delimiterHourMinuteSecondsBy )
{
    std::ostringstream timeString;
    // ...some string init
    // time in seconds
    double seconds;
    if ( this->isRunning ) {
        // return instantaneous time
        seconds = std::difftime(time(0), startTime);
    } else {
        // return stopped time
        seconds = std::difftime(stopTime, startTime); // <-- line where the
                                                      //     program crashed
    }
    // ..convert seconds into a string
    return timeString.str( );
}
At the beginning of the program CStopwatch::start( ) is called
/*
 * Start the timer; if the watch is already running, this is effectively a reset
 */
void CStopwatch::start( )
{
    this->startTime = std::time( 0 );
    this->isRunning = true;
}
There are only a few reasons that a program may receive SIGBUS on Linux. Several are listed in answers to this question.
Look in /var/log/messages around the time of the crash, it is likely that you'll find that there was a disk failure, or some other cause for kernel unhappiness.
Another (unlikely) possibility is that someone updated libstdc++.so.6 while your program was running, and has done so incorrectly (by writing over the existing file, rather than removing it and creating a new file in its place).
It looks like std::difftime is being lazily loaded on its first access; if some of the runtime linker's internal state had been damaged elsewhere in your program, it could cause this.
Note that _dl_runtime_resolve would have to complete before the std::difftime call can begin, so the error is unlikely to be with your time values. You can easily verify by opening the core file in gdb:
(gdb) frame 2 # this is CStopwatch::getTime
(gdb) print this
(gdb) print *this
etc. etc.
If gdb is able to read and resolve the address, and the values look sane, that definitely didn't cause a SIGBUS at runtime. Alternatively, it's possible your stack is smashed, for example while _dl_fixup is preparing the trampoline jump rather than just handling the relocation; we can't be certain without looking at the code, but we can check the stack itself:
(gdb) print $rsp
(gdb) x/16xb $rsp-16 # print the top 16 bytes of the stack
The easy workaround to try is setting the LD_BIND_NOW environment variable, forcing symbol resolution at startup. This only masks the problem, though: some memory is still getting damaged somewhere, and we're merely suppressing the symptom.
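For example (assuming a bash-like shell; the arguments are whatever you normally pass):
LD_BIND_NOW=1 ./coherent_ofdm_tsync_mse <your usual arguments>
Any non-empty value makes the dynamic linker resolve all symbols at startup instead of on first use.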
As for fixing the problem properly - even if short runs don't exhibit the error, it's possible some memory damage is occurring but is asymptomatic. Try running a shorter simulation under valgrind and fix all warnings and errors unless you're certain they're benign.
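For example, a sketch of such a check (memcheck is valgrind's default tool; pick arguments that keep the run short):
valgrind --leak-check=full --track-origins=yes ./coherent_ofdm_tsync_mse <short-run arguments>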
Impossible to tell without further context, but:
this could be null or corrupt
startTime could be a null reference
stopTime could be a null reference
I was going to suggest you set a breakpoint on the line and print out stopTime and startTime, but you've already nearly done that by looking at the core file.
It looks as if something is going wrong linking the function in. Might it be that you are compiling against a different set of headers from the standard library you are linking to?
It may just be memory related:
if this is deeply nested, you might simply have a stack overflow.
if this is the first time it's being called, perhaps it is trying to allocate memory for the library, load it in, and link it, and that failed due to hitting a memory limit
If this code path is called many many times, and never crashes elsewhere, maybe it's time to run memtest86 overnight.
Related
Hello my fellow Developers,
I am trying to analyse a stack dump from a C++ program. In this particular program we install signal handlers, and inside the signal handler we have code something like the following:
/* arr is a void* buffer with 'size' entries; backtrace() returns how many it filled */
res = backtrace(arr, size);
if (res > 0)
{
    btrace = backtrace_symbols(arr, res);
}
for (int i = 0; i < res; ++i)   /* only the first 'res' entries are valid */
{
    log_error(" trace: %s", btrace[i]);
}
I am able to get traces, but I cannot find the addresses I am expecting. I am now wondering whether other threads might have filled the stack-trace buffer (it is an embedded device and we have a limited buffer size). If so, how can I identify or filter the stack traces per thread, or is there an option to restrict the stack trace to the thread that caught the segmentation fault?
Note: I am able to identify two addresses from my executable in the stack trace; the most recent one is the address of the signal handler itself and the second one is from another thread.
I tried googling for resources on how backtrace works in a multi-threaded environment, but was not able to find much info, so I am asking here.
Thanks
Using the following setup:
Cortex-M3 based µC
gcc-arm cross toolchain
using C and C++
FreeRtos 7.5.3
Eclipse Luna
Segger Jlink with JLinkGDBServer
Code Confidence FreeRtos debug plugin
Using JLinkGDBServer and eclipse as debug frontend, I always have a nice stacktrace when stepping through my code. When using the Code Confidence freertos tools (eclipse plugin), I also see the stacktraces of all threads which are currently not running (without that plugin, I see just the stacktrace of the active thread). So far so good.
But now, when my application falls into a hard fault, the stack trace is lost.
Well, I know the technique on how to find out the code address which causes the hardfault (as seen here).
But this is very poor information compared to a full stack trace.
OK, sometimes there is no way to retain a stack trace after a hard fault, e.g. when the stack is corrupted by the faulty code. But if the stack is healthy, I think that getting a stack trace should be possible (shouldn't it?).
I think the reason for losing the stack trace in the hard fault is that the stack pointer is switched from PSP to MSP automatically by the Cortex-M3 architecture. One idea would be to set the MSP to the previous PSP value (and maybe do some additional stack preparation?).
Any suggestions on how to do that, or other approaches to retain a stack trace in a hard fault?
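For reference, the technique linked above boils down to something like the following (just a sketch, not my actual handler; the helper name is arbitrary). On exception entry the Cortex-M3 pushes r0-r3, r12, lr, pc and xPSR onto whichever stack was active, so the faulting pc/lr can be read from that frame:
#include <stdint.h>

/* Hypothetical helper: receives a pointer to the exception stack frame. */
void hardfault_handler_c(uint32_t *stacked)
{
    volatile uint32_t stacked_lr = stacked[5];  /* return address           */
    volatile uint32_t stacked_pc = stacked[6];  /* instruction that faulted */
    (void)stacked_lr;
    (void)stacked_pc;
    for (;;) { }                                /* park here for the debugger */
}

__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        " tst lr, #4            \n"   /* EXC_RETURN bit 2: which stack was active? */
        " ite eq                \n"
        " mrseq r0, msp         \n"   /* 0 -> main stack (MSP)    */
        " mrsne r0, psp         \n"   /* 1 -> process stack (PSP) */
        " b hardfault_handler_c \n"
    );
}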
Edit 2015-07-07, added more details.
I use this code to provoke a hard fault:
__attribute__((optimize("O0"))) static void checkHardfault() {
    volatile uint32_t* varAtOddAddress = (uint32_t*)-1;
    (*varAtOddAddress)++;
}
When stepping into checkHardfault(), my stack trace looks fine, like this:
gdb-> backtrace
#0 checkHardfault () at Main.cxx:179
#1 0x100360f6 in GetOneEvent () at Main.cxx:185
#2 0x1003604e in executeMainLoop () at Main.cxx:121
#3 0x1001783a in vMainTask (pvParameters=0x0) at Main.cxx:408
#4 0x00000000 in ?? ()
When I run into the hard fault (at (*varAtOddAddress)++;) and find myself inside HardFault_Handler(), the stack trace is:
gdb-> backtrace
#0 HardFault_Handler () at Hardfault.c:312
#1 <signal handler called>
#2 0x10015f36 in prvPortStartFirstTask () at freertos/portable/GCC/ARM_CM3/port.c:224
#3 0x10015fd6 in xPortStartScheduler () at freertos/portable/GCC/ARM_CM3/port.c:301
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
The quickest way to get the debugger to give you the details of the state prior to the hard fault is to return the processor to the state prior to the hard fault.
In the debugger, write a script that takes the information from the various hardware registers and restores PC, LR, and R0-R14 to the state just prior to the hard fault, then do your stack dump.
Of course, this isn't always helpful when you end up at the hard fault because of popping stuff off of a blown stack or stomping on stuff in memory. You generally tend to corrupt a bunch of the important registers, return back to some crazy spot in memory, and then execute whatever's there. You can end up hard faulting many thousands (millions?) of cycles after your real problem happens.
Consider using the following gdb macro to restore the register contents:
define hfstack
    set $frame_ptr = (unsigned *)$sp
    if $lr & 0x10
        set $sp = $frame_ptr + (8 * 4)
    else
        set $sp = $frame_ptr + (26 * 4)
    end
    set $lr = $frame_ptr[5]
    set $pc = $frame_ptr[6]
    bt
end

document hfstack
set the correct stack context after a hard fault on Cortex M
end
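To use it (the file name here is just an example), halt the target inside HardFault_Handler(), source the macro and invoke it:
(gdb) source hfstack.gdb
(gdb) hfstack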
I'm currently writing a multi-threaded, highly efficient and scalable algorithm. Because I have to guess a parameter for the code and I'm not sure how the calculation performs on a specific data set, I would like to watch a variable. The test only works with a real-world, huge data set. It is possible to analyze the collected data after profiling. Imagine the following simple code example (the real code can contain multiple watch points):
// function gets called by loops in multiple threads
void payload(data_t* data, double threshold) {
    double value = calc(data);
    // here I want to watch the value
    if (value < threshold) {
        doSomething(data);
    } else {
        doSomethingElse(data);
    }
}
I thought about the following approaches:
Using cout or other system outputs
Use a binary output (file, network)
Set a breakpoint via gdb/lldb
Use variable watching + logging via gdb/lldb
I'm not happy with these options because: to use 1. and 2. I have to change the code, but this is a debugging/evaluating task. Furthermore, 1. requires locking and 1.+2. require I/O operations, which heavily slow down the entire code and make testing with real data nearly impossible. 3. is also too slow. To use 4., I have to know the variable's address; because it's not a global variable and the threads are created by a dynamic scheduler, this would require breaking + stepping for each thread.
So my conclusion is that I need a profiler/debugger that works at machine-code level and dumps/logs/watches the variable without double->string conversion and is highly efficient. To sum it up in other words: I would like to profile the internal state of my algorithm without heavy slow-down and without deep modifications. Does anybody know a tool that can do this?
OK, this took some time, but now I'm able to present a solution to my problem. It's called tracepoints. Instead of breaking into the program every time, it's more lightweight and (ideally) doesn't change performance/timing too much. It does not require code changes. Here is an explanation of how to use them with gdb:
Make sure you compiled your program with debugging symbols (using the -g flag). Now, start the gdb server and provide a network port (e.g. 10000) and the program arguments:
gdbserver :10000 ./program --parameters you --want --to use
Now, switch to a second console and start gdb (program parameters are not required here):
gdb ./program
All following commands are entered in the gdb command line interface. So let's connect to the server:
target remote :10000
After you get the connection confirmation, use trace or ftrace to set a tracepoint at a specific source location (try ftrace first; it should be faster, but it doesn't work on all platforms):
trace source.c:127
This should create tracepoint #1. Now you can set up an action for this tracepoint. Here I want to collect the data from myVariable:
action 1
collect myVariable
end
If you expect a lot of data or want to use the data later (after a restart), you can set a binary trace file:
tsave trace.bin
Now, start tracing and run the program:
tstart
continue
You can wait for the program to exit, or interrupt it using CTRL-C (still in the gdb console, not on the server side). Then tell gdb that you want to stop tracing:
tstop
Now comes the tricky part, and I'm not really happy with the following code because it's really slow:
set pagination off
set logging file trace.txt
tfind start
while ($trace_frame != -1)
set logging on
printf "%f\n", myVariable
set logging off
tfind
end
This dumps all the variable data to a text file. You can add some filtering or preparation here. Now you're done and you can exit gdb. This will also shut down the server:
quit
For detailed documentation especially for explanation of filtering and more advanced tracepoint positions, you can visit the following document: http://sourceware.org/gdb/onlinedocs/gdb/Tracepoints.html
To isolate trace-file writing from your program execution, you can use cgroups or another network-connected computer. When using another computer, you have to add the host to the port information (e.g. 192.168.1.37:10000). To load a binary trace file later, just start gdb as shown above (without the server) and change the target:
gdb ./program
target tfile trace.bin
You can set a hardware watchpoint using the gdb debugger. For example, if you have a variable
bool b;
and you want to be notified every time its value changes (by any thread), you would declare a watchpoint like this:
(gdb) watch *(bool*)0x7fffffffe344
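Note that the address is specific to one particular run; if the variable is currently in scope, you can look its address up first and then watch it as shown above, for example:
(gdb) print &b
(gdb) watch *(bool*)0x7fffffffe344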
example:
root@comp:~# gdb prog
GNU gdb (GDB) 7.5-ubuntu
Copyright ...
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /dist/Debug/GNU-Linux-x86/cppapp_socket5_ipaddresses...done.
(gdb) watch *(bool*)0x7fffffffe344
Hardware watchpoint 1: *(bool*)0x7fffffffe344
(gdb) start
Temporary breakpoint 2 at 0x40079f: file main.cpp, line 26.
Starting program: /dist/Debug/GNU-Linux-x86/cppapp_socket5_ipaddresses
Hardware watchpoint 1: *(bool*)0x7fffffffe344
Old value = true
New value = false
main () at main.cpp:50
50 if (strcmp(mask, "255.0.0.0") != 0) {
(gdb) c
Continuing.
Hardware watchpoint 1: *(bool*)0x7fffffffe344
Old value = false
New value = true
main () at main.cpp:41
41 if (ifa ->ifa_addr->sa_family == AF_INET) { // check it is IP4
(gdb) c
Continuing.
mask:255.255.255.0
eth0 IP Address 192.168.1.5
[Inferior 1 (process 18146) exited normally]
(gdb) q
I have a program that uses the gloox library to connect to an XMPP server. The connection always succeeds if I run the program directly. However, the program has high CPU usage, so I turned to valgrind for help. But if I run the program with valgrind (--tool=callgrind), the connection always times out. I have to admit I'm new to valgrind, but why is this happening?
Valgrind does a number of transformations of the executed code, making it run 10-50 times slower than natively, so it is likely that the connection simply times out. You can run Valgrind with the profiled program under strace to locate this connection by its error codes.
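For example (a sketch; the program name and arguments are placeholders):
strace -f -o strace.log valgrind --tool=callgrind ./program --your-args
Then look in strace.log for the failing connect()/recv()/poll() calls and the error codes they return.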
If your original problem is high CPU usage with gloox, I'm almost sure that your program polls every 10 milliseconds for new XMPP messages.
Run your program with recv(-1) instead of recv(10) for example.
http://camaya.net/glooxlist/dev/msg01191.html
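For illustration, assuming the usual gloox setup (the JID and password below are placeholders), the receive loop then looks roughly like this:
#include <gloox/client.h>

int main()
{
    gloox::JID jid( "user@example.org/resource" );   // placeholder account
    gloox::Client client( jid, "password" );

    if ( client.connect( false ) )                   // non-blocking connect; drive it with recv()
    {
        gloox::ConnectionError err = gloox::ConnNoError;
        while ( err == gloox::ConnNoError )
            err = client.recv( -1 );                 // block until data arrives instead of polling
    }
    return 0;
}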
After running into a similar problem and doing some extra debugging, it came down to a problem when parsing an XMPP XML stanza.
In our case, the problem was with the XPath parser, which uses a util.h function int2string that in turn uses long2string.
Under normal execution
int len = (int)( log( (double)( 10 ) ) / log( (double) 10 ) ) + 1;
gives 2, but gives 1 under valgrind and breaks everything.
We replaced the function
static inline const std::string int2string( int value )
{
    return long2string( value );
}
by
#include <sstream>

static inline const std::string int2string( int value )
{
    /* ADDON */
    // When we call long2string, it does weird cmath log stuff and, with floating-point
    // precision, the result may differ from one environment to another, e.g. under valgrind.
    std::ostringstream s;
    s << value;
    return s.str();
    /* ADDON */
    //return long2string( value );
}
I am having a problem getting my stack trace output to stderr or dumped to a log file. I am running the code on Kubuntu 10.04 with the gcc compiler (4.4.3). The issue is that in normal running mode (without gdb), the program does not output anything except 'Segmentation fault'. I want to output the backtrace as in the print statements below. When I run gdb with my application, it reaches the printf/fprintf/(function call) statement and then crashes with the following:
669 {
(gdb)
670 printf("Testing for stability.\n");
(gdb)
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff68b1f45 in puts () from /lib/libc.so.6
The strange thing is that if the crash happens in a function within the same file, it works fine and prints the output properly. But if the program crashes in a function outside this file, it does not print any output.
So no printf or file dumping statement or function call gets processed. I am using the following sample code:
#include <execinfo.h>   /* backtrace, backtrace_symbols */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

void bt_sighandler(int sig, siginfo_t *info, void *secret) {

    void *trace[16];
    char **messages = (char **)NULL;
    int i, trace_size = 0;
    ucontext_t *uc = (ucontext_t *)secret;

    /* Do something useful with siginfo_t */
    if (sig == SIGSEGV)
        printf("Got signal %d, faulty address is %p, "
               "from %p\n", sig, info->si_addr,
               uc->uc_mcontext.gregs[0]);
    else
        printf("Got signal %d\n", sig);

    trace_size = backtrace(trace, 16);
    /* overwrite sigaction with caller's address */
    trace[1] = (void *) uc->uc_mcontext.gregs[0];
    messages = backtrace_symbols(trace, trace_size);
    /* skip first stack frame (points here) */
    printf("[bt] Execution path:\n");
    for (i = 1; i < trace_size; ++i)
        printf("[bt] %s\n", messages[i]);

    exit(0);
}

int main() {
    /* Install our signal handler */
    struct sigaction sa;
    sa.sa_sigaction = (void *)bt_sighandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGUSR1, &sa, NULL);

    /* Do something; func_b() is defined elsewhere and triggers the crash */
    printf("%d\n", func_b());
}
Thanks in advance for any help.
Unfortunately you just can't reliably do much of anything in a SIGSEGV handler. Think about it this way: your program has a serious error and its state (including system-level state such as the heap) is inconsistent.
In such a case, you can't expect the OS to magically fix up the heap and other internals it needs in order to be able to execute arbitrary code within your signal handler.
If the SEGV happens in your own code, the good solution is to use the core and fix the root problem. If the crash happens in other code, say via a shared library, I'd suggest isolating that code in an entirely separate binary and communicating between the two binaries. Then, if the library crashes, your main program does not.
You are supposed to do very little in a signal handler, in principle only access variables of type sig_atomic_t and volatile data.
Doing I/O is definitely out of the question. See this page for gcc:
http://www.gnu.org/s/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy
Try using simpler functions, such as strcat() and write().
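For illustration, a minimal sketch of that idea (not the original poster's code): write() is async-signal-safe, and backtrace_symbols_fd() writes the symbolized trace straight to a file descriptor without calling malloc(), unlike backtrace_symbols():
#include <execinfo.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void bt_sighandler_safe(int sig)
{
    void *trace[16];
    int trace_size = backtrace(trace, 16);

    /* write() and backtrace_symbols_fd() avoid the heap, so they are far
       safer inside a signal handler than printf(). */
    const char msg[] = "[bt] Execution path:\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    backtrace_symbols_fd(trace, trace_size, STDERR_FILENO);

    _exit(1);
}

int main(void)
{
    signal(SIGSEGV, bt_sighandler_safe);
    *(volatile int *)0 = 0;   /* force a crash for demonstration */
    return 0;
}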
Is there a reason you can't use valgrind?
When the application crashes, Linux creates a core dump with the state of the application at the time of the crash. The core file can be examined using gdb.
If no core file is created try changing core file size with
ulimit -c unlimited
in the same shell and before the program is started.
The name of the core file is usually core.PID where PID is the pid of the program.
The core file is usually placed somewhere in /tmp or the directory where the program was started.
A lot more info on core files is available on the man page for core. Use
man core
to read the man page.
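For example (file names are placeholders), load the core file together with the binary that produced it and print the backtrace:
gdb ./myprogram core.12345
(gdb) bt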
I managed to get it partially working. I was actually running the application with 'sudo'. Running it in user mode gives me the call stack. However, running in user mode disables hardware acceleration (NVIDIA graphics drivers). To resolve that, I added myself to the 'video' group so that I have access to /dev/nvidia0 and /dev/nvidiactl. However, once I have that access, the stack is no longer generated. The stack only appears when I am in user mode and hardware acceleration is disabled. But I can't run my application without hardware acceleration (some important functionality would be disabled). Please let me know if anyone has any idea.
Thanks.