Recently, our big project began crashing on unhandled division by zero. No recent code seems to contain any likely elements so it may be new data sets affecting old code. The problem is the code base is pretty big, and running on an embedded device with no comfortable debug access (debug is done by a lot of printf()s over serial console, there is no gdb for the device and even if there was, the binary compiled with debug symbols wouldn't fit).
The most viable way would likely be to find all the division operations (they are relatively infrequent), and analyze code surrounding each of them to see if any of the divisor variables was left unguarded.
The question is then either how to find all division operations in a big (~200 files, some big) C++ project, or, if you have a better idea how to locate the error, please share it.
extra info: project runs on embedded ARM9, a small custom Linux distro, crosscompiled with Cygwin/Windows crosstools, IDE is Eclipse but there's also Cygwin with all the respective goodies. Thing is the project is very hardware-specific, and the crashes occur only when running at full capacity, all the essential interconnected modules active. Restricted "fault mode" where only bare bones are active doesn't create them.
I think the most direct step would be to try to catch the unhandled exception and generate a dump, or printf stack information or similar.
Take a look at this question or just search in google for info relating to exception catching in your particular environment.
By the way, I think that the division could happen as a result of a call to an external library, so it's not 100% certain that you'll find the culprit just by grepping your code.
If I remember right, the ARM9 doesn't have hardware divide so it's going to be implemented in a function call the compiler makes whenever it has to perform a division.
See if your toolset implements the divide by zero handling in the same way as ARM's toolset does (it's likely that it does something at least similar). If so, you can install a handler that gets called when the problem occurs and you can printf() registers and stack so that you can determine where the problem is occurring. A possible similar alternative is that your small Linux distro is throwing a signal you can catch.
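As a rough sketch of the "signal you can catch" option, the following installs a SIGFPE handler with POSIX sigaction() on Linux. The names on_sigfpe, install_fpe_handler, g_fpe_seen and g_exit_on_fpe are made up for illustration. It assumes the runtime reports the fault as SIGFPE; on an ARM core without hardware divide the signal may be raised by the division library routine itself, in which case si_addr may not point at the offending division and a stack dump would still be needed.

```cpp
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Flags shared with the handler (hypothetical names).
static volatile sig_atomic_t g_fpe_seen = 0;
static volatile sig_atomic_t g_exit_on_fpe = 1;  // keep 1 in production

// Reports the signal details over stderr (e.g. the serial console).
// snprintf is not formally async-signal-safe; it is used here only to
// keep the sketch short.
static void on_sigfpe(int, siginfo_t* info, void*) {
    g_fpe_seen = 1;
    char buf[96];
    int n = snprintf(buf, sizeof buf, "SIGFPE si_code=%d si_addr=%p\n",
                     info->si_code, info->si_addr);
    if (n > static_cast<int>(sizeof buf) - 1) n = sizeof buf - 1;
    if (n > 0) {
        ssize_t r = write(STDERR_FILENO, buf, static_cast<size_t>(n));
        (void)r;
    }
    // Returning from a hardware-generated fault would re-run the faulting
    // instruction, so terminate unless a test has disabled this.
    if (g_exit_on_fpe) _exit(1);
}

// Returns true if the handler was installed.
bool install_fpe_handler() {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, nullptr) == 0;
}
```

Call install_fpe_handler() early in main(). The printed address can then be cross-referenced against the binary with a tool such as addr2line.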
I'm not sure how you're getting your information that a divide by zero is occurring, but if it's because the runtime is spitting out a message to that effect, you always have the option of finding out where that is handled in the runtime, and replacing it with your own more informative message. However, I'd guess that there's a more 'architected' way to get your code to run (a signal handler or ARM's technique).
Finding all of the divisions shouldn't be hard with a custom grep search. You can easily distinguish that usage from other usages of the / and % characters in C++.
Also, if you know what you are dividing, you could overload the / and % operators to add an assertion that reports __FILE__ and __LINE__. Note that C++ only allows operator overloading when at least one operand is a class or enumeration type, so for plain int divisors you would need a wrapper type or a checked-division macro instead. If using a makefile, it shouldn't be hard to include the custom operator code in all the linked files without touching the code.
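Since operators on built-in types such as int cannot be overloaded, one way to get the same __FILE__/__LINE__ reporting is a small helper plus a macro. This is only a sketch: checked_div and CHECKED_DIV are hypothetical names, and returning a fallback value instead of aborting is an assumption about what the application can tolerate.

```cpp
#include <cstdio>

// Hypothetical helper: divides only when the divisor is non-zero,
// otherwise logs the call site and returns a fallback value.
// Both operands must have the same type T.
template <typename T>
T checked_div(T num, T den, const char* file, int line, T fallback = T()) {
    if (den == T()) {
        std::fprintf(stderr, "division by zero at %s:%d\n", file, line);
        return fallback;
    }
    return num / den;
}

// The macro captures the location of the call, not of the helper.
#define CHECKED_DIV(a, b) checked_div((a), (b), __FILE__, __LINE__)
```

Suspicious divisions can then be rewritten one at a time, for example `rate = CHECKED_DIV(total, count);`, and the serial console will show which one fired.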
You should use this as an excuse to invest in improving the debug-ability of your device, for both this problem and future issues. Even if you can't get live debugging, you should be able to find a way to generate and save off core dumps for post-mortem debugging (pinpointing the source of any unhandled exception immediately).
PC-Lint might help, it's like Findbugs for C++. It is a commercial product but there is a 30-day money-back guarantee.
Handle the exception.
Usually the exception will be handed a structure that contains the address that caused the exception and other information. You will probably have to become familiar with the microcontroller's datasheet or RTOS manual.
Use the -save-temps for gcc and find the relevant assembly for division in the generated .s file. If you're lucky it will be something fairly distinctive, possibly even a function call. If it's a function call you can use weak linking to override it with your own checked version. Otherwise locating the divisions in the assembly should give you a very good idea where they are in the C/C++ code and you can instrument them directly.
Usually you can modify or override the divide-by-zero exception handler if you have access to the exception handler routines.
In the case of ARM, division is done by a library routine, and there are mechanisms to inform the user code when a divide by zero occurs.
see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4061.html
I would suggest providing a __rt_raise() as described on the page above.
__rt_raise(2,2) will get called when the divide routine detects a divide by zero.
You can then print the LR register,
and use addr2line to cross-reference it against the source line.
The only way to find those conditions is the usual:
1. try to reproduce the problem without looking at the source (as the bug already happened you should have info on the part of the program that is affected)
2. if found, check the source for this point and fix it, otherwise:
2.1. grep for each / not followed by a * or / (grep "/[^/*]" I think)
2.2. find the conditions for which the code is executed and reproduce it
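The grep pattern in step 2.1 still matches the second slash of a "// comment", so a slightly stricter filter can cut down the noise. The function below is a heuristic sketch with a hypothetical name (looks_like_division). It works line by line, does not track the body of multi-line block comments, and will still report '/' or '%' inside string literals, #include paths and printf format strings.

```cpp
#include <cstddef>
#include <string>

// Returns true if the line contains a '/' or '%' that is not part of a
// "//", "/*" or "*/" comment marker.
bool looks_like_division(const std::string& line) {
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == '%') return true;
        if (c != '/') continue;
        const char prev = i > 0 ? line[i - 1] : '\0';
        const char next = i + 1 < line.size() ? line[i + 1] : '\0';
        if (next == '/') break;            // rest of the line is a comment
        if (next == '*') { ++i; continue; } // skip the "/*" marker
        if (prev == '*') continue;          // closing "*/" marker
        return true;                        // plain '/' or "/="
    }
    return false;
}
```

Feeding every source line through this and printing the file name and line number of each hit gives a candidate list to review by hand.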
The exception already has the address location of the offending divide by zero code. The CPU saves register contents when an exception occurs, including the PC (program counter). Your OS should pass this information along (I assume that is how you know it is divide by zero). Print the address and go look in your code. If you can print a stack trace it would be even easier to solve.
Another option would be to check the differences in your version control software between the last known working version and the first non-working version. This should give you a limited change set within which to search for the problem.
We have a custom application we use built around VB/C++ code.
This code will run for days, weeks, months, without this throwing up an exception error.
I'm trying to learn more about how this error is thrown, and how to interpret (if you can) the error listed when an exception is thrown. I've googled some information and read the Microsoft provided error description, but I'm still stuck with the task of trouble shooting something that occurs once in a blue moon. There is no known set of interactions with the software that causes this and appears to happen randomly.
Is the first exception the root cause? Is it all the way down the stack call? Can anyone provide any insight on how to read these codes so I could interpret where I actually need to look.
Any information or guidance on reading the exception or making any use of it, and then troubleshooting it, would be helpful. The text below is copied from the Windows log when the event was thrown.
Thanks in advance for any help.
Application: Epic.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.AccessViolationException
at MemMap.ComBuf.IsCharAvailable(Int32)
at HMI.frmPmacStat.RefreshTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)
at System.Timers.Timer.MyTimerCallback(System.Object)
at System.Threading.TimerQueueTimer.CallCallbackInContext(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.TimerQueueTimer.CallCallback()
at System.Threading.TimerQueueTimer.Fire()
at System.Threading.TimerQueue.FireQueuedTimerCompletion(System.Object)
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
There are exceptions that are thrown by the C++ runtime environment, as a result of executing a throw expression, and there are other types of errors caused by the operating system or hardware trapping your instruction. Invalid access to memory is generally not thrown by code in C++, but is a side effect of evaluating an expression that tries to access memory at an invalid address, resulting in the OS signaling the process, usually killing it. Because it's outside C++, it tends to be platform specific, but typical errors are:
reading a null pointer
using a pointer to an object that has been deleted
going outside an array's valid range of elements
using invalidated iterators into STL containers
Generally speaking, you can test for null and array bounds at runtime to detect the problem before it happens. Using a dangling pointer is more difficult to track down, because the time between the delete and the mis-use of that pointer can be long, and it can be difficult to find why it happened without a memory debugger, such as valgrind. Using smart pointers instead of raw pointers can help mitigate the problems of mis-managing memory, and can help avoid such errors.
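A minimal sketch of the runtime checks mentioned above, using hypothetical helper names (safe_get, deref_or). It assumes a fallback value is an acceptable response to a bad index or an empty pointer; logging or asserting at that point would be equally valid.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Bounds-checked read: returns fallback instead of reading past the end.
int safe_get(const std::vector<int>& v, std::size_t i, int fallback) {
    return i < v.size() ? v[i] : fallback;
}

// Null-checked read through a smart pointer. std::unique_ptr owns the
// object, so it cannot be deleted twice or leaked by accident.
int deref_or(const std::unique_ptr<int>& p, int fallback) {
    return p ? *p : fallback;
}
```

Neither check detects a dangling raw pointer, which is why a memory debugger is still needed for that class of bug.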
Invalid iterators are a subset of the general dangling pointer problem, but are common enough to be worth mentioning as their own category. Understanding your containers and which operations invalidate their iterators is crucial, and some implementations can be compiled in "debug mode", which can help detect use of invalidated iterators.
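To illustrate the iterator case: erasing from a std::vector invalidates the iterator at the erased position and everything after it, so a loop must continue from the iterator that erase() returns. The function name remove_evens is just an example.

```cpp
#include <vector>

// Removes even values in place without using an invalidated iterator.
void remove_evens(std::vector<int>& v) {
    for (std::vector<int>::iterator it = v.begin(); it != v.end(); ) {
        if (*it % 2 == 0) {
            it = v.erase(it);  // erase() returns the next valid iterator
        } else {
            ++it;
        }
    }
}
```

Writing `v.erase(it); ++it;` instead would use `it` after it has been invalidated, which is undefined behavior and may only crash occasionally.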
As others have noted, this type of error is tricky to identify without digging into the code and running tests (automated or manual). The more pieces of the system you can pull out and still reproduce it, the better. Divide and conquer is your friend here.
Beyond that, it all depends how important it is for you to resolve this and how much effort you're willing to put in. There are at least three classes of tools that can help with such intermittent problems:
Application monitors that track potential errors as your application runs. These tend to slow your program significantly (10x or more slowdown). Examples include:
Microsoft's Application Verifier
Open-source and cross-platform Dr. Memory
Google's Crashpad. Unlike the previous two programs, this one requires instrumenting your code. It is also (allegedly -- haven't tried it) easier to use with helpers like Backtrace's commercial integration for analyzing Crashpad output
Google's Sanitizers - free and some are built into gcc and clang. There's also a Windows port of Address Sanitizer, but a cursory look suggests it may be a little bit of a second-class citizen.
If you can also run and reproduce it on Linux, you could use valgrind; rr (see this CppCast ep), a free record-and-replay tool that works with gdb, so you can record a session that crashed and then step through it to see what went wrong; and/or UndoDB and friends from Undo software, a more sophisticated, commercial product similar to rr.
Static analysis of the code. This is a set of tools that looks for common errors in your code. It generally has a low signal-to-noise ratio, so there are a lot of minor things to dig through if you run it on a large, existing project (best to start a project using these things from the beginning if possible). That said, many of the warnings are invaluable. Examples:
Most compilers have a subset of this functionality built in. If you're using Visual Studio, add /W4 /WX to your compilation flags for the C++ code to force maximum warnings, then fix all the warnings. For gcc and clang, add `-Wall -Wpedantic -Werror` to enforce no warnings.
PVS-Studio (commercial)
PC-Lint (commercial)
If you can instrument the code to write log messages, something like Debugview++ may be of assistance.
Things get harder if you have multithreading going on, which it looks like you do, because the non-determinism gets harder to track, there are new classes of possible errors that are introduced, and some of the above tools won't work well (e.g., I think rr is single-threaded only). Beyond a full IDE like Visual Studio, you'd need to go with something like Intel's Inspector (formerly Intel Thread Checker), or on Linux, Valgrind's Helgrind and DRD and ThreadSanitizer (in the sanitizers above, but also Linux only AFAIK). But hopefully this list gives you a place to start.
Every so often I (re)compile some C (or C++) file I am working on -- which by the way succeeds without any warnings -- and then I execute my program only to realize that nothing has changed since my previous compilation. To keep things simple, let's assume that I added an instruction to my source to print out some debugging information onto the screen, so that I have a visual evidence of trouble: indeed, I compile, execute, and unexpectedly nothing is printed onto the screen.
This happened to me once when I had buggy code (I ran out of the bounds of a static array). Of course, if your code has some kind of hidden bug (What are all the common undefined behaviours that a C++ programmer should know about?) the compiled code can be pretty much anything.
This happened to me twice when I used some ridiculously slow network hard drive which -- I guess -- simply did not update my executable file after compilation, and I kept running and running the old version, despite the updated source. I just speculate here, and feel free to correct me if such a phenomenon is impossible, but I suspect it had something to do with certain processes waiting for IO.
Well, such things could of course happen (and they indeed do), when you execute an old version in the wrong directory (that is: you execute something similar, but actually completely unrelated to your source).
It is happening again, and it annoys me enough to ask: how do you make sure that your executable matches the source you are working on? Should I compare the date strings of the source and the executable in the main function? Should I delete the executable prior to compilation? I guess people might do something similar by means of version control.
Note: I was warned that this might be a subjective topic likely doomed to be closed.
Just use good ol' version control.
In the simple case you can just add any visible version ID to the code and check it (hash, revision ID, timestamp).
If your project has a lot of dependent files and you suspect that a version older than the latest ended up in the build, you can (in addition to good makefile rules, obviously) also track the version of every file used for the build (this is VCS-dependent, but not a heavy trick).
Check the timestamp of your executable. That should give you a hint regarding whether or not it is recent/up-to-date.
Alternatively, calculate a checksum for your executable and display it on startup, then you have a clue that if the csum is the same the executable was not updated.
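One low-tech way to get a visible version ID is to print the compile date and time at startup. A sketch, with build_stamp as a made-up name; note that __DATE__ and __TIME__ are only refreshed when the file containing them is recompiled, so this file should be rebuilt on every build (or the stamp replaced with a revision ID injected by the build system).

```cpp
#include <string>

// __DATE__ ("Mmm dd yyyy") and __TIME__ ("hh:mm:ss") are expanded by the
// preprocessor when this translation unit is compiled.
std::string build_stamp() {
    return std::string(__DATE__) + " " + __TIME__;
}
```

Printing build_stamp() as the first line of main() makes it obvious when the program being run is older than the last compile.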
First of all, thank you for taking the time to view my question and help. I noticed that a lot of questioners here show little or no appreciation, but I'm sincerely appreciative for the help and the community here :)
I wrote a C++ plugin (comprised of hundreds of source files) for an application I do not have the source code for (it's a video game). In other words, I only have the source code for my plugin, but not the game. Now, somewhere in those thousands of lines in my plugin, something causes the game engine to throw (probably an access violation) and I don't know where. By the time the debugger breaks, the stack is corrupted and all I get are hex addresses for DLLs I do not have the source for (but the exception occurs in my DLL for sure). I tried everything... I just can't seem to find where the exception occurs. Sometimes the debugger points to a "memory relocation" function (which I never used in my plugin), sometimes it points to the engine's GameFrame(), and other times it points to a damage callback (all these are just different member functions of a class).
I tried practically everything... I googled for hours trying to find out how to use other debuggers like WinDbg and Microsoft Application Verifier. I tried to comment out one or the other, or both, where the debugger points, but it still crashes. I even inserted OUTPUT("The name of the last executed function is: %s", __FUNCTION__) into EVERY function in my application hoping to painstakingly catch the last function but it seems any kind of I/O prevents the exception from occurring for some reason... And 10 minutes of debugging and the crash happens at some random last executed function.
I can't find out where this access violation is happening or where some temporary object is removed to cause these bad pointers (I check every pointer before using it), but damn, I'm reaching my limit's end here.
So, how does one debug the impossible... a random crash with a crappy debugger call stack? Thanks in advance for your patient and kind help!
My suggestion: try different debuggers (non MS), they catch different things.
My experience: in a program for which I had source code and full debugging symbols, the stack was corrupted. Neither VS nor WinDbg could help, but OllyDbg annotated a non-string variable with the value "r for pattern.", which showed that I had overwritten that variable with part of a string buffer. OllyDbg also has an option to walk the stack the hard way (not using dbghelp.dll).
From my experience, the old adage "Prevention is better than cure" is very relevant. It is best to prevent the bugs from creeping in, by following good software development practices (unit tests, regressions, code review, etc.) than to work it out later once the bugs show up.
Of course, the real world is not perfect, and bugs do show up. To debug memory corruption, you have some nice tools like valgrind, which at least narrow down the problem sections for you to take a closer look at. Debugging a complex program is not easy, and if your debugger throws up, it requires a lot of persistence on your part. One technique I find useful is to selectively enable or disable certain modules, to narrow down which module has the problem.
Sometimes you need to use "referential transparency" to unload some modules. To give you a stripped down example, consider:
int foo = factorial(3);
If I suspect there's a problem in this code (and the debugger crashes before I can see the call stack), I have to try by removing this code, and see if the problem persists. However, foo may be used later, so I cannot just remove it. Instead I can replace it with int foo = 6; and continue.
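The substitution is only safe if the function is pure, so that its result depends on nothing but its arguments. Assuming a factorial like the hypothetical one below, the compiler itself can confirm that 6 is the right replacement value:

```cpp
// Hypothetical pure implementation; the real one may differ.
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Checked at compile time: replacing factorial(3) with 6 changes nothing.
static_assert(factorial(3) == 6, "literal must match the call it replaces");
```

If the crash disappears after the call is replaced with the literal, the suspect is the function body; if it persists, that module has been ruled out.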
Another important point is to always maintain a trace file, where your code keeps logging what it is doing. When a program crashes, the trace file can often help narrow down the problem. Of course, you disable the tracing by default, so that it doesn't cause a performance bottleneck.
I know that E&C is a controversial subject and some say that it encourages a wrong approach to debugging, but still - I think we can agree that there are numerous cases when it is clearly useful - experimenting with different values of some constants, redesigning GUI parameters on-the-fly to find a good look... You name it.
My question is: Are we ever going to have E&C on GDB? I understand that it is a platform-specific feature and needs some serious cooperation with the compiler, the debugger and the OS (MSVC has this one easy as the compiler and debugger always come in one package), but... It still should be doable. I've even heard something about Apple having it implemented in their version of GCC [citation needed]. And I'd say it is indeed feasible.
Knowing all the hype about MSVC's E&C (my experience says it's the first thing MSVC users mention when asked "why not switch to Eclipse and gcc/gdb"), I'm seriously surprised that after quite some years GCC/GDB still doesn't have such feature. Are there any good reasons for that? Is someone working on it as we speak?
It is a surprisingly non-trivial amount of work, encompassing many design decisions and feature tradeoffs. Consider: you are debugging. The debuggee is suspended. Its image in memory contains the object code of the source, and the binary layout of objects, the heap, the stacks. The debugger is inspecting its memory image. It has loaded debug information about the symbols, types, address mappings, pc (ip) to source correspondences. It displays the call stack, data values.
Now you want to allow a particular set of possible edits to the code and/or data, without stopping the debuggee and restarting. The simplest might be to change one line of code to another. Perhaps you recompile that file or just that function or just that line. Now you have to patch the debuggee image to execute that new line of code the next time you step over it or otherwise run through it. How does that work under the hood? What happens if the code is larger than the line of code it replaced? How does it interact with compiler optimizations? Perhaps you can only do this on a specially compiled for EnC debugging target. Perhaps you will constrain possible sites it is legal to EnC. Consider: what happens if you edit a line of code in a function suspended down in the call stack. When the code returns there does it run the original version of the function or the version with your line changed? If the original version, where does that source come from?
Can you add or remove locals? What does that do to the call stack of suspended frames? Of the current function?
Can you change function signatures? Add fields to / remove fields from objects? What about existing instances? What about pending destructors or finalizers? Etc.
There are many, many functionality details to attend to in order to make any kind of usable EnC work. Then there are many cross-tool integration issues that must be solved to provide the infrastructure to power EnC. In particular, it helps to have some kind of repository of debug information that can make available the before- and after-edit debug information and object code to the debugger. For C++, the incrementally updatable debug information in PDBs helps. Incremental linking may help too.
Looking from the MS ecosystem over into the GCC ecosystem, it is easy to imagine the complexity and integration issues across GDB/GCC/binutils, the myriad of targets, some needed EnC specific target abstractions, and the "nice to have but inessential" nature of EnC, are why it has not appeared yet in GDB/GCC.
Happy hacking!
(p.s. It is instructive and inspiring to look at what the Smalltalk-80 interactive programming environment could do. In St80 there was no concept of "restart" -- the image and its object memory were always live, if you edited any aspect of a class you still had to keep running. In such environments object versioning was not a hypothetical.)
I'm not familiar with MSVC's E&C, but GDB has some of the things you've mentioned:
http://sourceware.org/gdb/current/onlinedocs/gdb/Altering.html#Altering
17. Altering Execution
Once you think you have found an error in your program, you might want to find out for certain whether correcting the apparent error would lead to correct results in the rest of the run. You can find the answer by experiment, using the gdb features for altering execution of the program.
For example, you can store new values into variables or memory locations, give your program a signal, restart it at a different address, or even return prematurely from a function.
Assignment: Assignment to variables
Jumping: Continuing at a different address
Signaling: Giving your program a signal
Returning: Returning from a function
Calling: Calling your program's functions
Patching: Patching your program
Compiling and Injecting Code: Compiling and injecting code in GDB
This is a pretty good reference to the old Apple implementation of "fix and continue". It also references other working implementations.
http://sources.redhat.com/ml/gdb/2003-06/msg00500.html
Here is a snippet:
Fix and continue is a feature implemented by many other debuggers,
which we added to our gdb for this release. Sun Workshop, SGI ProDev
WorkShop, Microsoft's Visual Studio, HP's wdb, and Sun's Hotspot Java
VM all provide this feature in one way or another. I based our
implementation on the HP wdb Fix and Continue feature, which they
added a few years back. Although my final implementation follows the
general outlines of the approach they took, there is almost no shared
code between them. Some of this is because of the architectual
differences (both the processor and the ABI), but even more of it is
due to implementation design differences.
Note that this capability may have been removed in a later version of their toolchain.
UPDATE: Dec-21-2012
There is a GDB Roadmap PDF presentation that includes a slide describing "Fix and Continue" among other bullet points. The presentation is dated July-9-2012 so maybe there is hope to have this added at some point. The presentation was part of the GNU Tools Cauldron 2012.
Also, I get it that adding E&C to GDB or anywhere in Linux land is a tough chore with all the different components.
But I don't see E&C as controversial. I remember using it in VB5 and VB6 and it was probably there before that. Also it's been in Office VBA since way back. And it's been in Visual Studio since VS2005. VS2003 was the only one that didn't have it and I remember devs howling about it. They intended to add it back anyway and they did with VS2005 and it's been there since. It works with C#, VB, and also C and C++. It's been in MS core tools for 20+ years, almost continuous (counting VB when it was standalone), and subtracting VS2003. But you could still say they had it in Office VBA during the VS2003 period ;)
And JetBrains recently added it to their C# tool Rider. They bragged about it (rightly so imo) in their Rider blog.
When I'm using my debugger (in my particular case, it was QT Creator together with GDB that inspired this) on my C++ code, sometimes even after calling make clean followed by make the debugger seems to freak out.
Sometimes it will seem to be lined up with another piece of code's line numbers, and will jump around. Sometimes this is off by one line, sometimes this is totally off and it'll jump around erratically.
Other times, it'll freak out by stepping into things I didn't ask it to step into, like while stepping over a function call, it might step into the string initialization routine that is part of it.
When I get seg faults, sometimes it's able to tell me where it happened perfectly, and other times it's not even able to display question marks for which functions called the code and from where, and all I see is assembly, even while running the exact same code repeatedly.
I can't seem to figure out a pattern to what causes these failures, and sometimes my debugger is perfectly well behaved.
What are the theoretical reasons behind these debugger freak outs, and what are the concrete steps I can take to prevent them?
There are three very common reasons:
You're debugging optimized code. This rarely works - optimized code can be reordered/inlined/precomputed/etc. to the point there's no chance whatsoever to map it back to the source code.
You're not debugging, for whatever reason, the binary matching the current source code.
You've invoked undefined behavior somewhere, and whatever your code did has messed with the scaffolding the debugger needs to keep its sanity. This is what usually happens when you get a segfault and can't get a sane stack trace: you've overwritten or corrupted the information (e.g. stack pointers) the debugger needs to do its job.
And probably hundreds more. Among the things I personally encounter: debugging multithreaded code, and, depending on gcc/gdb versions and various other things, quite a handful of debugger bugs.
One possible reason is that debuggers are as buggy as any other program!
But the most common reason for a debugger not showing the right source location is that the compiler optimized the code in some way, so there is no simple correspondence between the source code and the executable code. A common optimization that confuses debuggers is inlining, and C++ is very prone to it.
For example, your string initialization routine was probably inlined into the function call, so as far as the debugger was concerned, there was just one function that happened to start with some string initialization code.
If you're tracking down an algorithm bug (as opposed to a coding bug that produces undefined behavior, or a concurrency bug), turning the optimization level down will help you track the bug, because the debugger will have a simpler view of the code.
I have the same problem as you, and I haven't solved it yet. One workaround I have come up with is to install a virtual machine, install a Unix system in it, and debug there under Linux. Perhaps that will work.
I have found the reason: you should rebuild the project every time you change your code, or Qt will just run the old version of the code.