How to track down a SIGFPE/Arithmetic exception

How to track down a SIGFPE/Arithmetic exception - c++

I have a C++ application cross-compiled for Linux running on an ARM CortexA9 processor which is crashing with a SIGFPE/Arithmetic exception. Initially I thought that it's because of some optimizations introduced by the -O3 flag of gcc but then I built it in debug mode and it still crashes.
I debugged the application with gdb which catches the exception but unfortunately the operation triggering exception seems to also trash the stack so I cannot get any detailed information about the place in my code which causes that to happen. The only detail I could finally get was the operation triggering the exception(from the following piece of stack trace):
3 raise() 0x402720ac
2 __aeabi_uldivmod() 0x400bb0b8
1 __divsi3() 0x400b9880
The __aeabi_uldivmod() is performing an unsigned long long division and reminder so I tried the brute force approach and searched my code for places that might use that operation but without much success as it proved to be a daunting task. Also I tried to check for potential divisions by zero but again the code base it's pretty large and checking every division operation it's a cumbersome and somewhat dumb approach. So there must be a smarter way to figure out what's happening.
Are there any techniques to track down the causes of such exceptions when the debugger cannot do much to help?
UPDATE: After crunching on hex numbers, dumping memory and doing stack forensics(thanks Crashworks) I came across this gem in the ARM Compiler documentation(even though I'm not using the ARM Ltd. compiler):
Integer division-by-zero errors can be trapped and identified by
re-implementing the appropriate C library helper functions. The
default behavior when division by zero occurs is that when the signal
function is used, or
__rt_raise() or __aeabi_idiv0() are re-implemented, __aeabi_idiv0() is
called. Otherwise, the division function returns zero.
__aeabi_idiv0() raises SIGFPE with an additional argument, DIVBYZERO.
So I put a breakpoint at __aeabi_idiv0(_aeabi_ldiv0) et Voila!, I had my complete stack trace before being completely trashed. Thanks everybody for their very informative answers!
Disclaimer: the "winning" answer was chosen solely and subjectively taking into account the weight of its suggestions into my debugging efforts, because more than one was informative and really helpful.

My first suggestion would be to open a memory window looking at the region around your stack pointer, and go digging through it to see if you can find uncorrupted stack frames nearby that might give you a clue as to where the crash was. Usually stack-trashes only burn a couple of the stack frames, so if you look upwards a few hundred bytes, you can get past the damaged area and get a general sense of where the code was. You can even look down the stack, on the assumption that the dead function might have called some other function before it died, and thus there might be an old frame still in memory pointing back at the current IP.
In the comments, I linked some presentation slides that illustrate the technique on a PowerPC — look at around #73-86 for a case study in a similar botched-stack crash. Obviously your ARM's stack frames will be laid out differently, but the general principle holds.

(Using the basic idea from Fedor Skrynnikov, but with compiler help instead)
Compile your code with -pg. This will insert calls to mcount and mcountleave() in every function. Do not link against the GCC profiling lib, but provide your own. The only thing you want to do in your mcount and mcountleave() is to keep a copy of the current stack, so just copy the top 128 bytes or so of the stack to a fixed buffer. Both the stack and the buffer will be in cache all the time so it's fairly cheap.

You can implement special guards in functions that can cause the exception. Guard is a simple class, in constractor of this class you put the name of the file and line (_FILE_, _LINE_) into file/array/whatever. The main condition is that this storage should be the same for all instances of this class(kind of stack). In the destructor you remove this line. To make it works you need to put the creation of this guard on the first line of each function and to create it only on stack. When you will be out of current block deconstructor will be called. So in the moment of your exception you will know from this improvised callstack which function is causing a problem.
Ofcaurse you may put creation of this class under debug condition

Enable generation of core files, and open the core file with the debuger

Since it uses raise() to raise the exception, I would expect that signal() should be able to catch it. Is this not the case?
Alternatively, you can set a conditional breakpoint at __aeabi_uldivmod to break when divisor (r1) is 0.

Related

Is ref-copying a compiler optimization, and can I avoid it?

I dislike pointers, and generally try to write as much code as I can using refs instead.
I've written a very rudimentary "vertical layout" system for a small Win32 app. Most of the Layout methods look like this:
void Control::DoLayout(int availableWidth, int &consumedYAmt)
{
textYPosition = consumedYAmt;
consumedYAmt += measureText(font, availableWidth);
}
They are looped through like so:
int innerYValue = 0;
foreach(control in controls) {
control->DoLayout(availableWidth, innerYValue);
}
int heightOfControl = innerYValue;
It's not drawing its content here, just calculating exactly how much space this control will require (usually it's adding padding too, etc). This has worked great for me.......in debug mode.
I found that in Release mode, I could suddenly see tangible, loggable issues where, when I'm looping through controls and calling DoLayout(), the consumedYAmt variable actually stays at 0 in the outside loop. The most annoying part is that if I put in breakpoints and walk through the code line by line, this stops happening and parts of it are properly updated by the inside "add" methods.
I'm kind of thinking about whether this would be some compiler optimization where they think I'm simply adding the ref flag to ints as a way to optimize memory; or if there's any possibility this actually works in a way different from how it seems.
I would give a minimum reproducible example, but I wasn't able to do so with a tiny commandline app. I get the sense that if this is an optimization, it only kicks in for larger code blocks and indirections.
EDIT: Again sorry for generally low information, but I'm now getting hints that this might be some kind of linker issue. I skipped one part of the inheritance model in my pseudocode: The calling class actually calls "Layout()", which is a non-virtual function on the root definition of the class. This function performs some implementation-neutral logic, and then calls DoLayout() with the same arguments. However, I'm now noticing that if I try adding a breakpoint to Layout(), Visual Studio claims that "The breakpoint will not be hit. No executable code of the debugger's target code type is associated with this line." I am able to add breakpoints to certain other lines, but I'm beginning to notice weird stepping logic where it refuses to go inside certain functions, like Layout. Already tried completely clearing the build folders and rebuilding. I'm going to have to keep looking, since I have to admit this isn't a lot to go on.
Also, random addition: The "controls" list is a vector containing shared_ptr objects. I hadn't suspected the looping mechanism previously but now I'm looking more closely.

"the consumedYAmt variable actually stays at 0"
The behavior you describe is typical for a specific optimization that's more due to the CPU than the compiler. I suspect you're logging consumedYAmt from another thread. The updates to consumedYAmt simply don't make it to that other thread.
This is legal for the CPU, because the C++ compiler didn't put in memory fences. And the CPU compiler didn't put in fences because the variable isn't atomic.
In a small program without threads, this simply doesn't show up, nor does it show in debug mode.

Written by OP
Okay, eventually figured this one out. As simple as the issue was, pinning it down became difficult because of Release mode's debugger seemingly acting in inconsistent ways. When I changed tactic to adding Logging statements in lots of places, I found that my Control class had an "mShowing" variable that was uninitialized in its constructor. In debug mode, it apparently retained uninitialized memory which I guess made it "true" - but in release mode, my best analysis is that memory protections made it default to "false", which as it turns out skipped the main body of the DoLayout method most of the time.
Since through the process, responders were low on information to work with (certainly could've been easier if I posted a longer example), I instead simply upvoted each comment that mentioned uninitialized variables.

My code crashes on delete this

I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, only once in a while. About 95% of the time the code runs perfectly fine but sometimes this seems to be corrupted somehow and crash.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
long nRefCount = InterlockedDecrement(&m_nRefCount);
if(nRefCount == 0)
{
delete this;
}
return nRefCount;
}
The object has been created with a new, it is not in any kind of array.

The most obvious answer is : don't delete this.
If you insists on doing that, then use common ways of finding bugs :
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)

It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in case of overridden operator new, or multiple copies of the C++ runtime, the particular new that matches delete found in the current scope)

Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement have me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try to load code from internet to print out the current stack for each log. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code, but here, Google is your friend, but the thing to remember is that you can't have a __try/__except handdler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "I happens only 5% of the time" symptom could be caused by different code path executions, or the fact you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on 32-byte boundaries. Is there #pragma pack() instructions in your code? Does your projet file change the default alignment (I assume you're working on Visual Studio)?
For the volatile part, InterlockedDecrement seem to generate a Read/Write memory barrier, so the volatile part should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).

"this" pointer getting corrupted in stack trace

I have seen this thread. My case is slightly different and I'm struggling to figure out how "this" pointer is getting corrupted.
I'm using the Qt 4.6.2 framework, using their QTreeView with my own model. The backtrace I get (86 frames long, with a lot of recursion, that's why I haven't pasted the whole thing in, it's in this pastebin only involves their code.
It finally segfaults on some assembler in QBasicAtomicInt::deref, but it's obvious that it has died further down, evidenced by these three frames:
#15 0x01420fd3 in QFrame::event (this=0x942bba0, e=0xbf8eb624) at widgets/qframe.cpp:557
#16 0x014bb382 in QAbstractScrollArea::viewportEvent (this=0x4, e=0x93f9240) at widgets/qabstractscrollarea.cpp:1036
#17 0x0156fbd7 in QAbstractItemView::viewportEvent (this=0x942bba0, event=0xbf8eb624) at itemviews/qabstractitemview.cpp:1610
In frame 17, this is 0x942bb0. In frame 16, this should be the same, as in frame 17 it's calling its ancestor's implementation of the same method. However this becomes 0x4.
Interestingly enough in frame 15 (again, frame 16 has called its ancestor's implementation of the same function), the 'this' pointer is restored to 0x942bba0.
If you looked at the pastebin of the full backtrace, you might see some 'value optimized out'. I had the application compiled with optimization on; I now have gcc set to -g3 -O0 so when it happens next time I might have something more. But of course now I can't make it crash -- it is a fairly difficult bug to make happen (but very important to fix nonetheless) so I don't think that's too suspicious.
Given the optimizations, is that this pointer=0x4 unusual or definitely wrong? What is odd is that there's no real code in any of these viewportEvent frames -- they simply do a switch on the event's type, it falls through the switch statement, and it returns its ancestor's implementation.
Valgrind doesn't seem to be throwing up any issues, although I haven't made it crash in Valgrind yet.
Has anybody seen this behaviour before? What could be causing it?

I have seen this sort of thing before when debugging optimized builds and it has never been an indication of what the real bug is for me.
It is easier to first think about a local variable. In a non-optimized build, everything has its designated place in memory and must be stored after every line of code. This is so the debugger can find it. In an optimized build, values can live in registers without being written to memory. This is a major part of the improved performance of an optimized build. The debugger doesn't understand this and will always look at memory, so you will often see the wrong value.
The same can happen with parameters. If the optimizer decides to pass a parameter in a register, the debugger is still going to look at the stackframe. More specifically, at the location where the parameter would be according to the rules of the calling convention.
The fact that the next frame of the stack has the value properly restored indicates that the generated instructions are dealing with the this parameter correctly, but the debugger just doesn't know where to look for it.

Debugging a memory error with GDB and C++

I'm running my C++ program in gdb. I'm not real experienced with gdb, but I'm getting messages like:
warning: HEAP[test.exe]:
warning: Heap block at 064EA560 modified at 064EA569 past requested size of 1
How can I track down where this is happening at? Viewing the memory doesn't give me any clues.
Thanks!

So you're busting your heap. Here's a nice GDB tutorial to keep in mind.
My normal practice is to set a break in known good part of the code. Once it gets there step through until you error out. Normally you can determine the problem that way.
Because you're getting a heap error I'd assume it has to do with something you're putting on the heap so pay special attention to variables (I think you can use print in GDB to determine it's memory address and that may be able to sync you with where your erroring out). You should also remember that entering functions and returning from functions play with the heap so they may be where your problem lies (especially if you messed your heap before returning from a function).

You can probably use a feature called a "watch point". This is like a break point but the debugger stops when the memory is modified.
I gave a rough idea on how to use this in an answer to a different question.

If you can use other tools, I highly recommend trying out Valgrind. It is an instrumentation framework, that can run your code in a manner that allows it to, typically, stop at the exact instruction that causes the error. Heap errors are usually easy to find, this way.

One thing you can try, as this is the same sort of thing as the standard libc, with the MALLOC_CHECK_ envronment variable configured (man libc).
If you keep from exiting gdb (if your application quit's, just use "r" to re-run it), you can setup a memory breakpoint at that address, "hbreak 0x64EA569", also use "help hbreak" to configure condition's or other breakpoitn enable/disable options to prevent excessively entering that breakpoint....
You can just configure a log file, set log ... setup a stack trace on every break, "display/bt -4", then hit r, and just hold down the enter key and let it scroll by
"or use c ## to continue x times... etc..", eventually you will see that same assertion, then you will now have (due to the display/bt) a stacktrace which you can corolate to what code was modifying on that address...

I had similar problem when I was trying to realloc array of pointers to my structures, but instead I was reallocating as array of ints (because I got the code from tutorial and forgot to change it). The compiler wasnt correcting me because it cannot be checked whats in size argument.
My variable was:
itemsetList_t ** iteration_isets;
So in realloc instead of having:
iteration_isets = realloc(iteration_isets, sizeof(itemsetList_t *) * max_elem);
I had:
iteration_isets = realloc(iteration_isets, sizeof(int) * max_elem);
And this caused my heap problem.

What can modify the frame pointer?

I have a very strange bug cropping up right now in a fairly massive C++ application at work (massive in terms of CPU and RAM usage as well as code length - in excess of 100,000 lines). This is running on a dual-core Sun Solaris 10 machine. The program subscribes to stock price feeds and displays them on "pages" configured by the user (a page is a window construct customized by the user - the program allows the user to configure such pages). This program used to work without issue until one of the underlying libraries became multi-threaded. The parts of the program affected by this have been changed accordingly. On to my problem.
Roughly once in every three executions the program will segfault on startup. This is not necessarily a hard rule - sometimes it'll crash three times in a row then work five times in a row. It's the segfault that's interesting (read: painful). It may manifest itself in a number of ways, but most commonly what will happen is function A calls function B and upon entering function B the frame pointer will suddenly be set to 0x000002. Function A:
result_type emit(typename type_trait<T_arg1>::take _A_a1) const
{ return emitter_type::emit(impl_, _A_a1); }
This is a simple signal implementation. impl_ and _A_a1 are well-defined within their frame at the crash. On actual execution of that instruction, we end up at program counter 0x000002.
This doesn't always happen on that function. In fact it happens in quite a few places, but this is one of the simpler cases that doesn't leave that much room for error. Sometimes what will happen is a stack-allocated variable will suddenly be sitting on junk memory (always on 0x000002) for no reason whatsoever. Other times, that same code will run just fine. So, my question is, what can mangle the stack so badly? What can actually change the value of the frame pointer? I've certainly never heard of such a thing. About the only thing I can think of is writing out of bounds on an array, but I've built it with a stack protector which should come up with any instances of that happening. I'm also well within the bounds of my stack here. I also don't see how another thread could overwrite the variable on the stack of the first thread since each thread has it's own stack (this is all pthreads). I've tried building this on a linux machine and while I don't get segfaults there, roughly one out of three times it will freeze up on me.

Stack corruption, 99.9% definitely.
The smells you should be looking carefully for are:-
Use of 'C' arrays
Use of 'C' strcpy-style functions
memcpy
malloc and free
thread-safety of anything using pointers
Uninitialised POD variables.
Pointer Arithmetic
Functions trying to return local variables by reference

I had that exact problem today and was knee-deep in gdb mud and debugging for a straight hour before occurred to me that I simply wrote over array boundaries (where I didn't expect it the least) of a C array.
So, if possible, use vectors instead because any decend STL implementation will give good compiler messages if you try that in debug mode (whereas C arrays punish you with segfaults).

I'm not sure what you're calling a "frame pointer", as you say:
On actual execution of that
instruction, we end up at program
counter 0x000002
Which makes it sound like the return address is being corrupted. The frame pointer is a pointer that points to the location on the stack of the current function call's context. It may well point to the return address (this is an implementation detail), but the frame pointer itself is not the return address.
I don't think there's enough information here to really give you a good answer, but some things that might be culprits are:
incorrect calling convention. If you're calling a function using a calling convention different from how the function was compiled, the stack may become corrupted.
RAM hit. Anything writing through a bad pointer can cause garbage to end up on the stack. I'm not familiar with Solaris, but most thread implementations have the threads in the same process address space, so any thread can access any other thread's stack. One way a thread can get a pointer into another thread's stack is if the address of a local variable is passed to an API that ultimately deals with the pointer on a different thread. unless you synchronize things properly, this will end up with the pointer accessing invalid data. Given that you're dealing with a "simple signal implementation", it seems like it's possible that one thread is sending a signal to another. Maybe one of the parameters in that signal has a pointer to a local?

There's some confusion here between stack overflow and stack corruption.
Stack Overflow is a very specific issue cause by try to use using more stack than the operating system has allocated to your thread. The three normal causes are like this.
void foo()
{
foo(); // endless recursion - whoops!
}
void foo2()
{
char myBuffer[A_VERY_BIG_NUMBER]; // The stack can't hold that much.
}
class bigObj
{
char myBuffer[A_VERY_BIG_NUMBER];
}
void foo2( bigObj big1) // pass by value of a big object - whoops!
{
}
In embedded systems, thread stack size may be measured in bytes and even a simple calling sequence can cause problems. By default on windows, each thread gets 1 Meg of stack, so causing stack overflow is much less of a common problem. Unless you have endless recursion, stack overflows can always be mitigated by increasing the stack size, even though this usually is NOT the best answer.
Stack Corruption simply means writing outside the bounds of the current stack frame, thus potentially corrupting other data - or return addresses on the stack.
At it's simplest:-
void foo()
{
char message[10];
message[10] = '!'; // whoops! beyond end of array
}

That sounds like a stack overflow problem - something is writing beyond the bounds of an array and trampling over the stack frame (and probably the return address too) on the stack. There's a large literature on the subject. "The Shell Programmer's Guide" (2nd Edition) has SPARC examples that may help you.

With C++ unitialized variables and race conditions are likely suspects for intermittent crashes.

Is it possible to run the thing through Valgrind? Perhaps Sun provides a similar tool. Intel VTune (Actually I was thinking of Thread Checker) also has some very nice tools for thread debugging and such.
If your employer can spring for the cost of the more expensive tools, they can really make these sorts of problems a lot easier to solve.

It's not hard to mangle the frame pointer - if you look at the disassembly of a routine you will see that it is pushed at the start of a routine and pulled at the end - so if anything overwrites the stack it can get lost. The stack pointer is where the stack is currently at - and the frame pointer is where it started at (for the current routine).
Firstly I would verify that all of the libraries and related objects have been rebuilt clean and all of the compiler options are consistent - I've had a similar problem before (Solaris 2.5) that was caused by an object file that hadn't been rebuilt.
It sounds exactly like an overwrite - and putting guard blocks around memory isn't going to help if it is simply a bad offset.
After each core dump examine the core file to learn as much as you can about the similarities between the faults. Then try to identify what is getting overwritten. As I remember the frame pointer is the last stack pointer - so anything logically before the frame pointer shouldn't be modified in the current stack frame - so maybe record this and copy it elsewhere and compare upon return.

Is something meaning to assign a value of 2 to a variable but instead is assigning its address to 2?
The other details are lost on me but "2" is the recurring theme in your problem description. ;)

I would second that this definitely sounds like a stack corruption due to out of bound array or buffer writing. Stack protector would be good as long as the writing is sequential, not random.

I second the notion that it is likely stack corruption. I'll add that the switch to a multi-threaded library makes me suspicious that what has happened is a lurking bug has been exposed. Possibly the sequencing the buffer overflow was occurring on unused memory. Now it's hitting another thread's stack. There are many other possible scenarios.
Sorry if that doesn't give much of a hint at how to find it.

I tried Valgrind on it, but unfortunately it doesn't detect stack errors:
"In addition to the performance penalty an important limitation of Valgrind is its inability to detect bounds errors in the use of static or stack allocated data."
I tend to agree that this is a stack overflow problem. The tricky thing is tracking it down. Like I said, there's over 100,000 lines of code to this thing (including custom libraries developed in-house - some of it going as far back as 1992) so if anyone has any good tricks for catching that sort of thing, I'd be grateful. There's arrays being worked on all over the place and the app uses OI for its GUI (if you haven't heard of OI, be grateful) so just looking for a logical fallacy is a mammoth task and my time is short.
Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
No one asked this, but I built with gcc-4.2. Also, I can guarantee ABI safety here so that's also not the issue. As for the "garbage at the end of the stack" on the RAM hit, the fact that it is universally 2 (though in different places in the code) makes me doubt that as garbage tends to be random.

It is impossible to know, but here are some hints that I can come up with.
In pthreads you must allocate the stack and pass it to the thread. Did you allocate enough? There is no automatic stack growth like in a single threaded process.
If you are sure that you don't corrupt the stack by writing past stack allocated data check for rouge pointers (mostly uninitialized pointers).
One of the threads could overwrite some data that others depend on (check your data synchronisation).
Debugging is usually not very helpful here. I would try to create lots of log output (traces for entry and exit of every function/method call) and then analyze the log.
The fact that the error manifest itself differently on Linux may help. What thread mapping are you using on Solaris? Make sure you map every thread to it's own LWP to ease the debugging.

Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
If you pass anything on the stack by reference or by address, this would most certainly happen if another thread tried to use it after the first thread returned from a function.
You might be able to repro this by forcing the app onto a single processor. I don't know how you do that with Sparc.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js