Irreproducible runtime errors - general approach?

Irreproducible runtime errors - general approach? - c++

I'm facing a problem that is so mysterious, that I don't even know how to formulate this question... I cannot even post any piece of code.
I develop a big project on my own, started from scratch. It's nearly release time, but I can't get rid of some annoying error. My program writes an output file from time to time and during that I get either:
std::string out_of_range error
std::string length_error
just lots of nonsense on output
Worth noting that those errors appear very rarely and can never be reproduced, even with the same input. Memcheck shows no memory violation, even on runs where errors were previously noted. Cppcheck has no complains as well. I use STL and pthreads intensively, but without the latter one errors also happen.
I tried both newest g++ and icpc. I am running on some version of Ubuntu, but I don't believe that's the reason.
I would appreciate any help from you, guys, on how to tackle such problems.
Thanks in advance.

Enable coredumps (ulimit -c or setrlimit()), get a core and start gdb'ing. Or, if you can, make a setup where you always run under gdb, so that when the error eventually happen you have some information available.

The symptoms hint at a memory corruption.
If I had to guess, I'd say that something is corrupting the internal state of the std::string object that you're writing out. Does the string object live on the stack? Have you eliminated stack smashing as a possible cause (that wouldn't be detectable by valgrind)?
I would also suggest running your executable under a debugger, set up in such a way that it would trigger a breakpoint whenever the problem happens. This would allow you to examine the state of your process at that point, which might be helpful in figuring out what's going on.

gdb and valgrind are very useful tools for debugging errors like this. valgrind is especially powerful for identifying memory access problems and memory leaks.

I encountered strange optimization bugs in gcc (like a ++i being assembled to i++ in rare circumstances). You could try declaring some critical variables volatile but if valgrind doesn't find anything, chances are low. And of course it's like shooting in the dark...
If you can at least detect that something is wrong in a certain run from inside the program, like detecting nonsensical output, you could then call an empty "gotNonsense()" function that you can break into with gdb.

If you cannot determine where exactly in the code does your program crash, one way to find that place would be using a debug output. Debug output is good way of debugging bugs that cannot be reproduced, because you will get more information about the bug the next time it happens, without the need to actively reproduce it. I recommend using some logging lib for that, boost provides one, for example.

You are using STL intensively, so you can try to run your program with libstdc++ in debug mode. It will do extra checks on iterators, containers and algorithms. To use the libstdc++ debug mode, compile your application with the compiler flag -D_GLIBCXX_DEBUG

Related

Segmentation when NOT running on debug

as the title mentions, I have a problem where one executable of a big project that gives a segmentation fault when it runs but is compiled normally and not with debug.
We are working on linux SUSE servers and code is mostly C++. Through bt in gdb, I have been able to see where exactly the problem occurs, which brings me to the question. The file is an auto-generated one which has not been changed for years. The difference now is that we have updated a third party component, gSOAP. Before updating the third party version it worked normally on both debug and not.
With debug flags, the problem disappears magically (for newbies like me).
I am sorry but its not possible to include a lot of code, only the line that is:
/*------------------------------------------------------------.
| yynewstate -- Push a new state, which is found in yystate. |
`------------------------------------------------------------*/
yynewstate:
/* In all cases, when you get here, the value and location stacks
have just been pushed. So pushing a state here evens the stacks. */
yyssp++;
yysetstate:
*yyssp = yystate; <------------------ THIS LINE
So, any help would appreciated. I actually dont understand why this problem rises and what steps I should take to solve it.
EDIT, I dont expect you to solve this particular case for me, as in more to help me understand why in programming this could occur, my case in this code is just an example.

First, please realize that you're using C++, not Java or any other language where the running of your program is always predictable, even runtime issues are predictable.
In C++, things are not predictable as in those languages. Just because your original program hasn't changed for years does not mean the program was error-free. That's how C++ works -- you think you have an error-free program, and it is not really error-free.
From your code, the exception is because yyssp is pointing to something it shouldn't be pointing to, and dereferencing this pointer causes the exception. That is the only thing that could be concluded from the code you posted. Why the pointer is pointing to where it is? We don't know, that is what you need to discover by debugging.
As to why things run differently in debug and release -- again, a bug like this allows a program to run in an unpredictable way. Add or remove code, run it on another machine, run it with differing compiler options, maybe even run it next week, and it might behave differently.
One thing you should not do -- if you make a totally irrelevant code change and magically your program works, do not claim the problem is fixed or resolved. No -- the problem is not fixed -- you've either masked it, or the bug is moved to another part of your code, hidden from you. Every fix that entails things like this must be reasoned as to why the fix addresses the problem.
Too many times, a naive programmer thinks that moving things around, adding or removing lines, and bingo, things work, that becomes the fix. Don't fall into that trap.

someone in my team found a temporary solution for this,
it was the optimization flags that this library is build with.
The default for our build was -O2 while on debug this changes.
Building the library with -O0 (changing the makefile) provides a temporary solution.

Is it possible to bypass or catch a "Segmentation fault"

I'm using an external library (xqilla) for c++ that ends with a segmentation fault for some uri's, and some not. I'm a bit new to the whole C world, I'm guessing it's not possible to catch this as if it was an exception, but I need to ask if it's possible. Any other solution would of course also be welcomed.
So is there an alternative to try catch for a "Segmentation fault" error?

If you run your program in a debugger, it will tell you what instruction caused the bad memory access in question, so that you can fix it.
Alternatively, you can add a signal handler via signal(2) or sigaction(2), but your ability to debug in that way will probably be pretty difficult. The state of your program after such a fault is probably unpredictable.

If you are receiving a segmentation fault with a third party library, you should first narrow the scope of the problem and make sure that the bug is in your program and not the library. In the event of it being the fault of the library, you could save a lot of wasted effort by reporting the bug to the maintainer or searching the mailing list, documentation, etc.
Once you've gotten past this and decide to debug your program, capturing the "segmentation fault" is not what you want to do. The behavior is undefined at this point1. You should compile with -g (if on GCC or Clang) to generate debugging information and run it with a debugger. There are several tools that can help you catch and fix bugs:
GCC and Clang's warning options, -Wall -Wextra -pedantic
LLVM's sanitizer, which exist in both GCC and Clang. In particular, you can try out -fsanitize=undefined,address,leak2
GNU's debugger, gdb
Valgrind
You will save a lot of effort by following the normal route of fixing your program instead of a backhanded way.
1) Source: Segmentation fault handling
2) Don't run valgrind and the sanitizer at the same time. Some sanitizers are mutually exclusive as well.

I'm not familiar with xquilla specifically, but a "Segmentation fault" formally indicates that the program attempted to access a memory address that hasn't been allocated to it. With extremely rare exceptions (e.g. emulation of a different computer altogether) this indicates a catastrophic bug in the program. It is likely that, by careful manipulation of the input, the program can be made to misbehave in an arbitrarily malicious fashion rather than just crashing.
Your best option is to scrap this library and find one that does the same job but with fewer bugs.
If that's not an option, your second-best bet is to isolate the library in a separate process, running in a "sandbox" which prevents it from damaging anything when it crashes or is taken over by malware. The rest of your application would then detect the crash, clean up, and move on. Unfortunately, writing such a sandbox is Hard, and I don't know of any off-the-shelf code you can use. Good luck!

Compiler optimization makes program crash

I'm writing a program in C++/Qt which contains a graph file parser. I use g++ to compile the project.
While developing, I am constantly comparing the performance of my low level parser layer between different compiler flags regarding optimization and debug information, plus Qt's debug flag (turning on/off qDebug() and Q_ASSERT()).
Now I'm facing a problem where the only correctly functioning build is the one without any optimization. All other versions, even with -O1, seem to work in another way. They crash due to unsatisfied assertions, which are satisfied when compiled without a -O... flag. The code doesn't produce any compiler warning, even with -Wall.
I am very sure that there is a bug in my program, which seems to be only harmful with optimization being enabled. The problem is: I can't find it even when debugging the program. The parser seems to read wrong data from the file. When I run some simple test cases, they run perfectly. When I run a bigger test case (a route calculation on a graph read directly from a file), there is an incorrect read in the file which I can't explain.
Where should I start tracking down the problem of this undefined behavior? Which optimization methods are possibly involved within this different behavior? (I could enable all flags one after the other, but I don't know that much compiler flags but -O... and I know that there are a lot of them, so this would need a very long time.) As soon as I know which type the bug is of, I am sure I find it sooner or later.
You can help me a lot if you can tell me which compiler optimization methods are possible candidates for such problems.

There are a few classes of bugs that commonly arise in optimized builds, that often don't arise in debug builds.
Un-initialized variables. The compiler can catch some but not all. Look at all your constructors, look at global variables. etc. Particularly look for uninitialized pointers. In a debug build memory is reset to zero, but in a release build it isn't.
Use of temporaries that have gone out of scope. For example when you return a reference to a local temporary in a function. These often work in debug builds because the stack is padded out more. The temporaries tend to survive on the stack a little longer.
array overruns writing of temporaries. For example if you create an array as a temporary in a function and then write one element beyond the end. Again, the stack will have extra space in debug ( for debugging information ) and your overrun won't hit program data.

There are optimizations you can disable from the optimized build to help make debugging the optimized version easier.
-g -O1 -fno-inline -fno-loop-optimize -fno-if-conversion -fno-if-conversion2 \
-fno-delayed-branch
This should make stepping through your code in the debugger a little easier to follow.
Another suggestion is that if the assertions you have do not give you enough information about what is causing the problem, you should consider adding more assertions. If you are afraid of performance issues, or assertion clutter, you can wrap them in a macro. This allows you to distinguish the debugging assertions from the ones you originally added, so they can be disabled or removed from your code later.

1) Use valgrind on the broken version. (For that matter, try valgrind on the working version, maybe you'll get lucky.)
2) Build the system with "-O1 -g" and step through your program with gdb. At the crash, what variable has an incorrect value? Re-run your program and note when that variable is written to (or when it isn't and should have been.)

Why does my debugger sometimes freak out and do things like not line up with my code?

When I'm using my debugger (in my particular case, it was QT Creator together with GDB that inspired this) on my C++ code, sometimes even after calling make clean followed by make the debugger seems to freak out.
Sometimes it will seem to be lined up with another piece of code's line numbers, and will jump around. Sometimes this is is off by one line, sometimes this is totally off and it'll jump around erratically.
Other times, it'll freak out by stepping into things I didn't ask it to step into, like while stepping over a function call, it might step into the string initialization routine that is part of it.
When I get seg faults, sometimes it's able to tell me where it happened perfectly, and other times it's not even able to display question marks for which functions called the code and from where, and all I see is assembly, even while running the exact same code repeatedly.
I can't seem to figure out a pattern to what causes these failures, and sometimes my debugger is perfectly well behaved.
What are the theoretical reasons behind these debugger freak outs, and what are the concrete steps I can take to prevent them?

There's 3 very common reasons
You're debugging optimized code. This rarely works - optimized code can be reordered/inlined/precomputed/etc. to the point there's no chance whatsoever to map it back to the source code.
You're not debugging, for whatever reason, the binary matching the current source code.
You've invoked undefined behavior somewhere - if whatever stuff your code did, it has messed around with the scaffolding the debugger needs to keep its sanity. This is what usually happens when you get a segfault and you can't get a sane stack trace, you've overwritten/messed with the information(e.g. stack pointers) the debugger needs to do its job.
And probably hundreds more - of the stuff I personally encounter is: debugging multithreaded code; depending on gcc/gdb versions and various other things - there's been quite a handful debugger bugs.

One possible reason is that debuggers are as buggy as any other program!
But the most common reason for a debugger not showing the right source location is that the compiler optimized the code in some way, so there is no simple correspondence between the source code and the executable code. A common optimization that confuses debuggers is inlining, and C++ is very prone to it.
For example, your string initialization routine was probably inlined into the function call, so as far as the debugger was concerned, there was just one function that happened to start with some string initialization code.
If you're tracking down an algorithm bug (as opposed to a coding bug that produces undefined behavior, or a concurrency bug), turning the optimization level down will help you track the bug, because the debugger will have a simpler view of the code.

I have the same question like yours, and I cannot solve it yet. But I have came out one problem solution which is to install a virtual machine and install Unix system in it. And debug it in Linux system. Perhaps it will work.
I have found out the reason, you should rebuild the project every time you changed your code, or the Qt will just run the old version of the code.

Complement to valgrind?

I have been working for the last few weeks trying to track down a really difficult bug that crashes my application. First, the application was crashing on the assign of a std::string, then during the free of a local variable.
After careful inspection of the code, there was no reason for it to crash at these locations; however, it always crashed while trying to free an invalid pointer (i.e. a pointer that pointed to invalid memory). And I have no idea why this pointer was not pointing to the right location.
I suspect that the issue has to do with a memory corruption problem or pointer corruption problem of some sort. The problem is that I can't visually track it down....yet. I have no idea where to start looking in the code, and there are thousands of lines of code to go through so this does not seem like a realistic approach to the problem.
So in comes Valgrind...
A tool that I have depended upon many a time to find issues within the code that may lead to a crash of this type. However, this time it has come up empty handed! I do not see any errors in valgrind when the problem occurs and so hence the reason for me asking this question.
Are there any other applications that can complement valgrind and help find issues in the code that may cause a crash mentioned above?
Thanks!

I assume you're using valgrind's memcheck tool, which is what it is famous for. Since you are using valgrind already you might also try running your program through valgrind --tool=exp-sgcheck (formerly exp-ptrcheck), which is an experimental tool that is designed to catch certain types of errors that memcheck will miss, including access checks for stack and global arrays, and use of pointers that happen to point to a valid object but not the object that was intended. It does this by using a completely different mechanism, essentially tracking each pointer into memory rather than tracking the memory itself, and through use of heuristics.
Be aware that the tool is experimental, but you may find that it catches something significant. Currently it does not yet support OS X or non-Intel processors.

In my experience, coverity and purify have founds such kind of errors than valgrind didn't (in fact all found problems which weren't seen by the others).
But sometimes no tool give an hint and you have to dig more, add instrumentation, play with breakpoints on "modify memory at address", try to simply the testcase which fails and so on to find out the root cause. That's can be very painful.

My experience is that often this sort of problem is caused by a heap overflow. Electric Fence is a relatively simple allocation debugging tool I like to use. Its main use is as a dynamic analysis tool to check for heap overflows, a complement to "-fstack-protector-all" which checks for stack overflows.
More links to efence stuff.

Is it possible some stack corruption is occurring? If so, try enabling stack canaries with the -fstack-protector-all option, assuming you are using g++.
Other than that, have you cranked up warning flags to help identify suspicious code?

In my opinion, using a debugger with "reverse debugging" capabilities could help.
You would be able to step back in time and hopefully find out what was the real source of the problem.
Here are a couple of links:
http://www.gnu.org/software/gdb/news/reversible.html
http://undo-software.com/ (which apparently is free for non-commercial applications)

You didn't specify the platform, but I can recommend Gimpel PC-lint as an excellent static analysis tool (don't be fooled by the name!). They also offer FlexeLint for other platforms, but I have no personal experience of that product.

Have you tried using lint, flexlint or cppcheck. These may help identify a problem.
If you know what area of memory is being corrupted have you tried marking this memory as protected. This may mask your problem and not help at all but if it still crashes the point at where the memory is modified will help resolve your problem.

If valgrind can identify the bad pointer being passed to free(), you could try running the program under DDD, which can set a hardware watchpoing on the memory location and halt the program when it is getting a bad value. If the pointer is getting changed a lot you may have to write some code around malloc and free to keep track of which values are good and bad.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js