How to find out where floating point exception occured? - c++

I am currently using a large computational package written in c++, which I have downloaded from github and compiled myself as I want to use it for some work I am doing.
The code works well for most purposes. Unfortunately, I have found that for certain inputs the code gives the error: Floating point exception (core dumped)
Now, I am a beginner at c++ and I have had no luck trying to browse through the many scripts that make up the code. My question is therefore: Is there a simple way to get a c++ code to output which line and which script the error occurred? Being used to Python, this is where I would always start, but unfortunately the compiled code does not return any more details about the error. Do I need to compile it in a form of debugging mode to get it to do so?

Yes, you should build the program in debug mode and run it through a debugger. It'll "break" when the error happens and tell you exactly what line of code triggers it. Furthermore, you can examine the values of variables in that stack frame and lower to diagnose the cause of the problem.
In fact, while developing, you should be doing this anyway.
It is impossible to give general steps as to how to do this, but if you're using an IDE (Visual Studio, Xcode) this should automatically happen; if you're using GCC on the command line, research GDB; if you're using Clang on the command line, research LLDB.
Speaking generally, though, a Floating-Point Exception (not a C++ exception!) is usually, and perhaps confusingly, triggered by an integer division by zero. Though, there are other reasons it can occur. You'll know more once you're debugging.

Related

Debugging run time error of armadillo package

I am using armadillo c++ library for developing a Rcpp package. However I am finding debugging any run time errors from armadillo extremely cumbersome. Currently I have to insert printout after every line to fish out line at which errors are coming. Armadillo just throw errors like:
error: subtraction: incompatible matrix dimensions: 756x1 and 26x1
and it does not tell any information about line number. Using gdb is also not particularly helpful because the error might be coming after many iterations. Is there any better to get line number of where the error is occurring.
I do not know about the Rcpp integration, but I debug my Armadillo code using gdb.
Just make sure that you never catch any exceptions like std::logic_error in your code. If you then run the program from within gdb, it will abort once the error occurs and just by typing bt you get a nice back trace that shows you which line is to blame. And then you also easily inspect variable values etc. in that stack frame.
You do not need to step through the code to make use of the advantages of the debugger.
If Rcpp does not allow avoiding to catch that exception, you should always be able to write a simple C++ test program for your code that does not block the debugger.

CPPUTest debugging - C++

I am learning TDD and using CppUTest in eclipse.
Is there any way to debug my code getting a nagging segmentation fault.
Thanks
I don't know anything special in CppUTest or Eclipse to help you, but some generic segfault debugging ideas seem appropriate here:
Add flushing print statements (e.g. printf(...) + fflush(stdout) or fprintf(stderr, ...)) to your code and see what gets printed. Do this in a binary search fashion with just a few prints at a time until you narrow down exactly where it is crashing. This sounds old fashioned but is extremely effective. Here is a guide I found googling that talks about this well-known technique: http://www.floccinaucinihilipilification.net/blog/2011/3/24/debugging-via-binary-search.html
Compile your code with debugging symbols and run it in a debugger. When you hit your segfault, ask for a backtrace and see if you can figure out what happened. When doing this it can be especially helpful to use a graphical debugger.
Run your code with a debugging tool like a debug malloc library or something from the valgrind suite. This may catch problems that are root causes of your segfaults but aren't occuring at the exact place where the segfault is generated (e.g. double frees, out of bound array access clobbering pointers used later, etc).
It would be helpful if you could add some code to your question, to give us a better idea of what you are up against. Not knowing any of the details, I would suggest the following:
Add -vto your executable's arguments in the Debug dialog. This will print the names of your test cases as they are executed. The last name that prints is likely the test where the segmentation fault occurs.
Put a breakpoint in that test case, where you call your code under test
Step into your code until the segfault occurs.
Trace back the value that caused the segfault (most often, a dangling pointer) and find out, why it was NULL or uninitialized.

Irreproducible runtime errors - general approach?

I'm facing a problem that is so mysterious, that I don't even know how to formulate this question... I cannot even post any piece of code.
I develop a big project on my own, started from scratch. It's nearly release time, but I can't get rid of some annoying error. My program writes an output file from time to time and during that I get either:
std::string out_of_range error
std::string length_error
just lots of nonsense on output
Worth noting that those errors appear very rarely and can never be reproduced, even with the same input. Memcheck shows no memory violation, even on runs where errors were previously noted. Cppcheck has no complains as well. I use STL and pthreads intensively, but without the latter one errors also happen.
I tried both newest g++ and icpc. I am running on some version of Ubuntu, but I don't believe that's the reason.
I would appreciate any help from you, guys, on how to tackle such problems.
Thanks in advance.
Enable coredumps (ulimit -c or setrlimit()), get a core and start gdb'ing. Or, if you can, make a setup where you always run under gdb, so that when the error eventually happen you have some information available.
The symptoms hint at a memory corruption.
If I had to guess, I'd say that something is corrupting the internal state of the std::string object that you're writing out. Does the string object live on the stack? Have you eliminated stack smashing as a possible cause (that wouldn't be detectable by valgrind)?
I would also suggest running your executable under a debugger, set up in such a way that it would trigger a breakpoint whenever the problem happens. This would allow you to examine the state of your process at that point, which might be helpful in figuring out what's going on.
gdb and valgrind are very useful tools for debugging errors like this. valgrind is especially powerful for identifying memory access problems and memory leaks.
I encountered strange optimization bugs in gcc (like a ++i being assembled to i++ in rare circumstances). You could try declaring some critical variables volatile but if valgrind doesn't find anything, chances are low. And of course it's like shooting in the dark...
If you can at least detect that something is wrong in a certain run from inside the program, like detecting nonsensical output, you could then call an empty "gotNonsense()" function that you can break into with gdb.
If you cannot determine where exactly in the code does your program crash, one way to find that place would be using a debug output. Debug output is good way of debugging bugs that cannot be reproduced, because you will get more information about the bug the next time it happens, without the need to actively reproduce it. I recommend using some logging lib for that, boost provides one, for example.
You are using STL intensively, so you can try to run your program with libstdc++ in debug mode. It will do extra checks on iterators, containers and algorithms. To use the libstdc++ debug mode, compile your application with the compiler flag -D_GLIBCXX_DEBUG

Why does my debugger sometimes freak out and do things like not line up with my code?

When I'm using my debugger (in my particular case, it was QT Creator together with GDB that inspired this) on my C++ code, sometimes even after calling make clean followed by make the debugger seems to freak out.
Sometimes it will seem to be lined up with another piece of code's line numbers, and will jump around. Sometimes this is is off by one line, sometimes this is totally off and it'll jump around erratically.
Other times, it'll freak out by stepping into things I didn't ask it to step into, like while stepping over a function call, it might step into the string initialization routine that is part of it.
When I get seg faults, sometimes it's able to tell me where it happened perfectly, and other times it's not even able to display question marks for which functions called the code and from where, and all I see is assembly, even while running the exact same code repeatedly.
I can't seem to figure out a pattern to what causes these failures, and sometimes my debugger is perfectly well behaved.
What are the theoretical reasons behind these debugger freak outs, and what are the concrete steps I can take to prevent them?
There's 3 very common reasons
You're debugging optimized code. This rarely works - optimized code can be reordered/inlined/precomputed/etc. to the point there's no chance whatsoever to map it back to the source code.
You're not debugging, for whatever reason, the binary matching the current source code.
You've invoked undefined behavior somewhere - if whatever stuff your code did, it has messed around with the scaffolding the debugger needs to keep its sanity. This is what usually happens when you get a segfault and you can't get a sane stack trace, you've overwritten/messed with the information(e.g. stack pointers) the debugger needs to do its job.
And probably hundreds more - of the stuff I personally encounter is: debugging multithreaded code; depending on gcc/gdb versions and various other things - there's been quite a handful debugger bugs.
One possible reason is that debuggers are as buggy as any other program!
But the most common reason for a debugger not showing the right source location is that the compiler optimized the code in some way, so there is no simple correspondence between the source code and the executable code. A common optimization that confuses debuggers is inlining, and C++ is very prone to it.
For example, your string initialization routine was probably inlined into the function call, so as far as the debugger was concerned, there was just one function that happened to start with some string initialization code.
If you're tracking down an algorithm bug (as opposed to a coding bug that produces undefined behavior, or a concurrency bug), turning the optimization level down will help you track the bug, because the debugger will have a simpler view of the code.
I have the same question like yours, and I cannot solve it yet. But I have came out one problem solution which is to install a virtual machine and install Unix system in it. And debug it in Linux system. Perhaps it will work.
I have found out the reason, you should rebuild the project every time you changed your code, or the Qt will just run the old version of the code.

Finding division by zero in a big project

Recently, our big project began crashing on unhandled division by zero. No recent code seems to contain any likely elements so it may be new data sets affecting old code. The problem is the code base is pretty big, and running on an embedded device with no comfortable debug access (debug is done by a lot of printf()s over serial console, there is no gdb for the device and even if there was, the binary compiled with debug symbols wouldn't fit).
The most viable way would likely be to find all the division operations (they are relatively infrequent), and analyze code surrounding each of them to see if any of the divisor variables was left unguarded.
The question is then either how to find all division operations in a big (~200 files, some big) C++ project, or, if you have a better idea how to locate the error, please give them.
extra info: project runs on embedded ARM9, a small custom Linux distro, crosscompiled with Cygwin/Windows crosstools, IDE is Eclipse but there's also Cygwin with all the respective goodies. Thing is the project is very hardware-specific, and the crashes occur only when running at full capacity, all the essential interconnected modules active. Restricted "fault mode" where only bare bones are active doesn't create them.
I think the most direct step, would be to try to catch the unhandled exception and generate a dump or printf stack information or similar.
Take a look at this question or just search in google for info relating to exception catching in your particular environment.
By the way, I think that the division could happen as a result of a call to an external library, so it's not 100% sure that you'll find the culprit just by greping your code.
If I remember right, the ARM9 doesn't have hardware divide so it's going to be implemented in a function call the compiler makes whenever it has to perform a division.
See if your toolset implements the divide by zero handling in the same way as ARM's toolset does (it's likely that it does something at least similar). If so, you can install a handler that gets called when the problem occurs and you can printf() registers and stack so that you can determine where the problem is occurring. A possible similar alternative is that your small Linux distro is throwing a signal you can catch.
I'm not sure how you're getting your information that a divide by zero is occurring, but if it's because the runtime is spitting out a message to that effect, you always have the option of finding out where that is handled in the runtime, and replacing it with your own more informative message. However, I'd guess that there's a more 'architected' way to get your code to run (a signal handler or ARM's technique).
Finding all of the divisions shouldn't be hard with a custom grep search. You can easily distinguish that usage from other usages of the / and % character in C++.
Also, if you know what you are dividing, you could globally overload the / and % operator to have a __FILE__ and __LINE__ informing assertion. If using a makefile, it shouldn't be hard to include the custom operator code in all the linked files without touching the code.
You should use this as an excuse to invest in improving the debug-ability of your device - for both this problem and future issues. Even if you can't get live debugging, you should be able to find a way to generate and save off core dumps for post-mortem debugging (pinpointing the source or any unhandled exception immediately).
PC-Lint might help, it's like Findbugs for C++. It is a commercial product but there is a 30 money back guarantee.
Handle the exception.
Usually the exception will be handed a structure that contains the address that caused the exception and other information. You will probably have to become familiar with the microcontroller's datasheet or RTOS manual.
Use the -save-temps for gcc and find the relevant assembly for division in the generated .s file. If you're lucky it will be something fairly distinctive, possibly even a function call. If it's a function call you can use weak linking to override it with your own checked version. Otherwise locating the divisions in the assembly should give you a very good idea where they are in the C/C++ code and you can instrument them directly.
usually you could modify/override the divide-by-zero exception handler if you have access to the exception handler routines.
in case of ARM, the division is done by a library routine. and there are mechanisms to inform the user-code, when a divide by zero occurs.
see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4061.html
i would suggest to provide a __rt_raise() as said in the page above.
__rt_raise(2,2) will get called when the divide routine detects a divide by zero.
so you can print the LR register.
and then use addr2line to crossref it against the source line
The only way to find those conditions is the usual:
try to reproduce the problem without looking at the source (as the bug already happened you should have info on the part of the program that is affected)
if found, check the source for this point and fix it, otherwise:
2.1. grep for each / not followed by a * or / (grep "/[^/*]" i think)
2.2. find the conditions for which the code is executed and reproduce it
The exception already has the address location of the offending divide by zero code. The CPU saves register contents when a exception occurs including the PC(program counter). Your OS should pass this information along (I assumes that is how you know it is divide by zero). Print the address and go look in your code. If you can print a stack trace it would be even easier to solve.
Another option would be to check the differences in your version control software between the last know working version and the first non working version. This should give you a limmited change set within which to search for the problem.