First of all, thank you for taking the time to view my question and help. I noticed that a lot of questioners here show little or no appreciation, but I'm sincerely appreciative for the help and the community here :)
I wrote a C++ plugin (compromised of hundreds of source files) for an application I do not have the source code for (it's a video game). In other words, I only have the source code for my plugin, but not the game. Now, somewhere in those thousands of lines in my plugin, something causes the game engine to throw (probably an access violation) and I don't know where. By the time the debugger breaks, the stack is corrupted and all I get are hex addresses for DLLs I do not have the source for (but the exception occurs in my DLL for sure). I tried everything... I just can't seem to find where the exception occurs. Sometimes the debugger points to a "memory relocation" function (which I never used in my plugin), sometimes it points to the engine's GameFrame(), and other times it points to a damage callback (all these are just different member functions of a class).
I tried practically everything... I googled for hours trying to find out how to use other debuggers like WinDbg and Microsoft Application Verifier. I tried to comment out one or the other, or both, where the debugger points, but it still crashes. I even inserted OUTPUT("The name of the last executed function is: %s", __FUNCTION__) into EVERY function in my application hoping to painstakingly catch the last function but it seems any kind of I/O prevents the exception from occurring for some reason... And 10 minutes of debugging and the crash happens at some random last executed function.
I can't find out where this access violation is happening or where some temporary object is removed to cause these bad pointers (I check every pointer before using it), but damn, I'm reaching my limit's end here.
So, how does one debug the impossible... a random crash with a crappy debugger call stack? Thanks in advance for your patient and kind help!
My suggestion: try different debuggers (non MS), they catch different things.
My experience: a program I have source code and full debugging symbols corrupt the stack, VS nor WinDbg can help but Ollydbg comments a non-string var with the value "r for pattern.", so I had overwrote some string buffer onto this var. Also Ollydbg have option to walk the stack the hard way (not using dbghelp.dll)
From my experience, the old adage "Prevention is better than cure" is very relevant. It is best to prevent the bugs from creeping in, by following good software development practices (unit tests, regressions, code review, etc.) than to work it out later once the bugs show up.
Of course, real world is not perfect, and bugs do show up. To debug memory corruption, you have some nice tools like valgrind, which at least narrow down the problem sections for you to take a closer look at. Debugging a complex program is not easy, and if your debugger throws up, it requires a lot of persistence on your part. One technique I find useful is to selectively enable or disable certain modules, to narrow down the module has the problem.
Sometimes you need to use "referential transparency" to unload some modules. To give you a stripped down example, consider:
int foo = factorial(3);
If I suspect there's a problem in this code (and the debugger crashes before I can see the call stack), I have to try by removing this code, and see if the problem persists. However, foo may be used later, so I cannot just remove it. Instead I can replace it with int foo = 6; and continue.
Another important point is to always maintain a trace file, where your code keeps logging what it is doing. When a program crashes, the trace file can often help narrow down the problem. Of course, you disable the tracing by default, so that it doesn't cause a performance bottleneck.
Related
I am learning TDD and using CppUTest in eclipse.
Is there any way to debug my code getting a nagging segmentation fault.
Thanks
I don't know anything special in CppUTest or Eclipse to help you, but some generic segfault debugging ideas seem appropriate here:
Add flushing print statements (e.g. printf(...) + fflush(stdout) or fprintf(stderr, ...)) to your code and see what gets printed. Do this in a binary search fashion with just a few prints at a time until you narrow down exactly where it is crashing. This sounds old fashioned but is extremely effective. Here is a guide I found googling that talks about this well-known technique: http://www.floccinaucinihilipilification.net/blog/2011/3/24/debugging-via-binary-search.html
Compile your code with debugging symbols and run it in a debugger. When you hit your segfault, ask for a backtrace and see if you can figure out what happened. When doing this it can be especially helpful to use a graphical debugger.
Run your code with a debugging tool like a debug malloc library or something from the valgrind suite. This may catch problems that are root causes of your segfaults but aren't occuring at the exact place where the segfault is generated (e.g. double frees, out of bound array access clobbering pointers used later, etc).
It would be helpful if you could add some code to your question, to give us a better idea of what you are up against. Not knowing any of the details, I would suggest the following:
Add -vto your executable's arguments in the Debug dialog. This will print the names of your test cases as they are executed. The last name that prints is likely the test where the segmentation fault occurs.
Put a breakpoint in that test case, where you call your code under test
Step into your code until the segfault occurs.
Trace back the value that caused the segfault (most often, a dangling pointer) and find out, why it was NULL or uninitialized.
I'm facing a problem that is so mysterious, that I don't even know how to formulate this question... I cannot even post any piece of code.
I develop a big project on my own, started from scratch. It's nearly release time, but I can't get rid of some annoying error. My program writes an output file from time to time and during that I get either:
std::string out_of_range error
std::string length_error
just lots of nonsense on output
Worth noting that those errors appear very rarely and can never be reproduced, even with the same input. Memcheck shows no memory violation, even on runs where errors were previously noted. Cppcheck has no complains as well. I use STL and pthreads intensively, but without the latter one errors also happen.
I tried both newest g++ and icpc. I am running on some version of Ubuntu, but I don't believe that's the reason.
I would appreciate any help from you, guys, on how to tackle such problems.
Thanks in advance.
Enable coredumps (ulimit -c or setrlimit()), get a core and start gdb'ing. Or, if you can, make a setup where you always run under gdb, so that when the error eventually happen you have some information available.
The symptoms hint at a memory corruption.
If I had to guess, I'd say that something is corrupting the internal state of the std::string object that you're writing out. Does the string object live on the stack? Have you eliminated stack smashing as a possible cause (that wouldn't be detectable by valgrind)?
I would also suggest running your executable under a debugger, set up in such a way that it would trigger a breakpoint whenever the problem happens. This would allow you to examine the state of your process at that point, which might be helpful in figuring out what's going on.
gdb and valgrind are very useful tools for debugging errors like this. valgrind is especially powerful for identifying memory access problems and memory leaks.
I encountered strange optimization bugs in gcc (like a ++i being assembled to i++ in rare circumstances). You could try declaring some critical variables volatile but if valgrind doesn't find anything, chances are low. And of course it's like shooting in the dark...
If you can at least detect that something is wrong in a certain run from inside the program, like detecting nonsensical output, you could then call an empty "gotNonsense()" function that you can break into with gdb.
If you cannot determine where exactly in the code does your program crash, one way to find that place would be using a debug output. Debug output is good way of debugging bugs that cannot be reproduced, because you will get more information about the bug the next time it happens, without the need to actively reproduce it. I recommend using some logging lib for that, boost provides one, for example.
You are using STL intensively, so you can try to run your program with libstdc++ in debug mode. It will do extra checks on iterators, containers and algorithms. To use the libstdc++ debug mode, compile your application with the compiler flag -D_GLIBCXX_DEBUG
When I'm using my debugger (in my particular case, it was QT Creator together with GDB that inspired this) on my C++ code, sometimes even after calling make clean followed by make the debugger seems to freak out.
Sometimes it will seem to be lined up with another piece of code's line numbers, and will jump around. Sometimes this is is off by one line, sometimes this is totally off and it'll jump around erratically.
Other times, it'll freak out by stepping into things I didn't ask it to step into, like while stepping over a function call, it might step into the string initialization routine that is part of it.
When I get seg faults, sometimes it's able to tell me where it happened perfectly, and other times it's not even able to display question marks for which functions called the code and from where, and all I see is assembly, even while running the exact same code repeatedly.
I can't seem to figure out a pattern to what causes these failures, and sometimes my debugger is perfectly well behaved.
What are the theoretical reasons behind these debugger freak outs, and what are the concrete steps I can take to prevent them?
There's 3 very common reasons
You're debugging optimized code. This rarely works - optimized code can be reordered/inlined/precomputed/etc. to the point there's no chance whatsoever to map it back to the source code.
You're not debugging, for whatever reason, the binary matching the current source code.
You've invoked undefined behavior somewhere - if whatever stuff your code did, it has messed around with the scaffolding the debugger needs to keep its sanity. This is what usually happens when you get a segfault and you can't get a sane stack trace, you've overwritten/messed with the information(e.g. stack pointers) the debugger needs to do its job.
And probably hundreds more - of the stuff I personally encounter is: debugging multithreaded code; depending on gcc/gdb versions and various other things - there's been quite a handful debugger bugs.
One possible reason is that debuggers are as buggy as any other program!
But the most common reason for a debugger not showing the right source location is that the compiler optimized the code in some way, so there is no simple correspondence between the source code and the executable code. A common optimization that confuses debuggers is inlining, and C++ is very prone to it.
For example, your string initialization routine was probably inlined into the function call, so as far as the debugger was concerned, there was just one function that happened to start with some string initialization code.
If you're tracking down an algorithm bug (as opposed to a coding bug that produces undefined behavior, or a concurrency bug), turning the optimization level down will help you track the bug, because the debugger will have a simpler view of the code.
I have the same question like yours, and I cannot solve it yet. But I have came out one problem solution which is to install a virtual machine and install Unix system in it. And debug it in Linux system. Perhaps it will work.
I have found out the reason, you should rebuild the project every time you changed your code, or the Qt will just run the old version of the code.
I have a fairly complex (approx 200,000 lines of C++ code) application that has decided to crash, although it crashes a little differently on a couple of different systems. The trick is that it doesn't crash or trap out in debugger. It only crashes when the application .EXE is run independently (either the debug EXE or the release EXE - both behave the same way). When it crashes in the debug EXE, and I get it to start debugging, the call stack is buried down into the windows/MFC part of things, and isn't reflecting any of my code. Perhaps I'm seeing a stack corruption of some sort, but I'm just not sure at the moment. My question is more general - it's about tools and techniques.
I'm an old programmer (C and assembly language days), and a relative newcomer (couple/few years) to C++ and Visual Studio (2003 for this projecT).
Are there tricks or techniques anyone's had success with in tracking down crashing issues when you cannot make the software crash in a debugger session? Stuff like permission issues, for example?
The only thing I've thought of is to start plugging in debug/status messages to a logfile, but that's a long, hard way to go. Been there, done that. Any better suggestions? Am I missing some tools that would help? Is VS 2008 better for this kind of thing?
Thanks for any guidance. Some very smart people here (you know who you are!).
cheers.
lint.
C/C++ Free alternative to Lint?
I've not done C++ professionally for over 10 years, but back in the day I used Rational PurifyPlus, which will be a good start, as is BoundsChecker (if it still exists!) These products find out of bounds accesses, corrupted memory, corrupted stack and other problems that can go undetected until "boom" and then you have no idea where you are.
I would try these first. If that fails, then you can start typing in logging statements.
If the debugger mitigates the crash, this can be for these reasons:
memory corruption: under a debug build memory is allocated with space before an after, so rogue writes may not corrupt under a debug session
timing and multi-threading: the debugger alters timing of threads and can make tricky multi-threaded problems hard to nail down.
If it's memory corruption, a memory tracking/diagnostic tool (I used to use BoundsChecker to great effect in the good old days of C++) may help you to locate and fix the cause in minutes, where any other technique coud take days or even months.
For other cases, you've suggested another approach yourself: a sometimes labour-intensive but very effective approach to getting a "real" stack trace is to simply use printf - a vastly underrated debugging tool available in every environment. If you have a rough idea you can straddle the crash area with only a few log messages to narrow down the location, and then add more as you home in on the problem area. This can often unearth enough clues that you can isolate the cause of the crash in a few minutes, even though it can seem like a lot of work and perhaps a hopeless cause before you start.
edit:
Also, if you have the application under source control, then get a historical version from when you think it was working, and then do a binary chop between that date and "now" to isolate when the issue began to occur. This can often narrow down a bug to the precise checkin that introduced the bug, and if you're lucky it will point you at a few lines of code. (If you're unlucky the bug won't be so easily repeatable, or you'll narrow it down to a 500-file checkin where a major refactoring or similar took place)
Get the debugging tool kit from MS ( http://www.microsoft.com/whdc/devtools/debugging/default.mspx ).
Set adplus up for crash mode monitoring ( http://www.microsoft.com/whdc/devtools/debugging/default.mspx ).
This should get you a crash dump when the app crashes. Load the dump up in WindDbg from the debugging toolkit and analyze using that. It is a painful, but very powerful, process to anaylyze out-of-debugger crashes.
There are quite a few resources around for using WinDbg - a good book on general Windows unmanaged debugging and the tools in the debugging kits is: http://www.amazon.com/Advanced-Windows-Debugging-ebook/dp/B000XPNUMW
I couldn't recommend more the blog of Mark Rusinovich. Absolutely brilliant guy from whom you can learn a whole bunch of debugging techniques for windows and many more. Especially try read some of the "The Case of" series! Amazing stuff!
For example take a look at this case he had investigated - a crash of IE. He shows how to capture the stack of the failing thread and many more interesting stuff. His main tools are windows debugging tools and also his sysinternals tools!
Enough said. Go read it!
Also I would recommend the book: Windows Internals 5. Again by Mark and company.
Might be that you have a too big object on the stack...
Explainations (from comments):
I gives this answer because that's the only case I've seen that a debuger (VS or CodeWarrior) couldn't catch and seeemed mysterious. Most of the time, that was the big application object that was defined on the stack in the main() function, and having members not allocated on the heap. Just calling new to instantiate the object fixed the obscure problem. Didn't need to get a specific tool for that in the end.
My experience is that sometime indeed program launched by the debugger (release or debug mode) don't crash as they do when launch on their own.
But I don't recall a case when the very same program launch on it's own, and then attached and continued through a debugger don't reproduce the crash.
An other and better approach if the crash doesn't always happens, would be to be able to produce a minidump (equivalent of unix coredump) and do a postmortem analysis,
there are plenty of tools on windows to do that, for example look at:
http://www.codeproject.com/KB/debug/postmortemdebug_standalone1.aspx?df=100&forumid=3419&exp=0&select=1114393
(perhaps someone may have a better link that this one).
Recently, our big project began crashing on unhandled division by zero. No recent code seems to contain any likely elements so it may be new data sets affecting old code. The problem is the code base is pretty big, and running on an embedded device with no comfortable debug access (debug is done by a lot of printf()s over serial console, there is no gdb for the device and even if there was, the binary compiled with debug symbols wouldn't fit).
The most viable way would likely be to find all the division operations (they are relatively infrequent), and analyze code surrounding each of them to see if any of the divisor variables was left unguarded.
The question is then either how to find all division operations in a big (~200 files, some big) C++ project, or, if you have a better idea how to locate the error, please give them.
extra info: project runs on embedded ARM9, a small custom Linux distro, crosscompiled with Cygwin/Windows crosstools, IDE is Eclipse but there's also Cygwin with all the respective goodies. Thing is the project is very hardware-specific, and the crashes occur only when running at full capacity, all the essential interconnected modules active. Restricted "fault mode" where only bare bones are active doesn't create them.
I think the most direct step, would be to try to catch the unhandled exception and generate a dump or printf stack information or similar.
Take a look at this question or just search in google for info relating to exception catching in your particular environment.
By the way, I think that the division could happen as a result of a call to an external library, so it's not 100% sure that you'll find the culprit just by greping your code.
If I remember right, the ARM9 doesn't have hardware divide so it's going to be implemented in a function call the compiler makes whenever it has to perform a division.
See if your toolset implements the divide by zero handling in the same way as ARM's toolset does (it's likely that it does something at least similar). If so, you can install a handler that gets called when the problem occurs and you can printf() registers and stack so that you can determine where the problem is occurring. A possible similar alternative is that your small Linux distro is throwing a signal you can catch.
I'm not sure how you're getting your information that a divide by zero is occurring, but if it's because the runtime is spitting out a message to that effect, you always have the option of finding out where that is handled in the runtime, and replacing it with your own more informative message. However, I'd guess that there's a more 'architected' way to get your code to run (a signal handler or ARM's technique).
Finding all of the divisions shouldn't be hard with a custom grep search. You can easily distinguish that usage from other usages of the / and % character in C++.
Also, if you know what you are dividing, you could globally overload the / and % operator to have a __FILE__ and __LINE__ informing assertion. If using a makefile, it shouldn't be hard to include the custom operator code in all the linked files without touching the code.
You should use this as an excuse to invest in improving the debug-ability of your device - for both this problem and future issues. Even if you can't get live debugging, you should be able to find a way to generate and save off core dumps for post-mortem debugging (pinpointing the source or any unhandled exception immediately).
PC-Lint might help, it's like Findbugs for C++. It is a commercial product but there is a 30 money back guarantee.
Handle the exception.
Usually the exception will be handed a structure that contains the address that caused the exception and other information. You will probably have to become familiar with the microcontroller's datasheet or RTOS manual.
Use the -save-temps for gcc and find the relevant assembly for division in the generated .s file. If you're lucky it will be something fairly distinctive, possibly even a function call. If it's a function call you can use weak linking to override it with your own checked version. Otherwise locating the divisions in the assembly should give you a very good idea where they are in the C/C++ code and you can instrument them directly.
usually you could modify/override the divide-by-zero exception handler if you have access to the exception handler routines.
in case of ARM, the division is done by a library routine. and there are mechanisms to inform the user-code, when a divide by zero occurs.
see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4061.html
i would suggest to provide a __rt_raise() as said in the page above.
__rt_raise(2,2) will get called when the divide routine detects a divide by zero.
so you can print the LR register.
and then use addr2line to crossref it against the source line
The only way to find those conditions is the usual:
try to reproduce the problem without looking at the source (as the bug already happened you should have info on the part of the program that is affected)
if found, check the source for this point and fix it, otherwise:
2.1. grep for each / not followed by a * or / (grep "/[^/*]" i think)
2.2. find the conditions for which the code is executed and reproduce it
The exception already has the address location of the offending divide by zero code. The CPU saves register contents when a exception occurs including the PC(program counter). Your OS should pass this information along (I assumes that is how you know it is divide by zero). Print the address and go look in your code. If you can print a stack trace it would be even easier to solve.
Another option would be to check the differences in your version control software between the last know working version and the first non working version. This should give you a limmited change set within which to search for the problem.