I've got an error that isn't consistently reproducible, where free() is called on an invalid heap pointer. Reducing this problem to a "minimal" reproduction is fundamentally not possible with the code in question (once I've done that, it's solved). I'm unable to spot any obvious problems, such as a path where calloc is never called, or a double free, etc.
I believe Valgrind would solve this, except that the performance impact would be too extreme (these are client->server calls with timeouts, and operations that are expensive to begin with: more than 4 seconds in some cases).
This leaves me with -fsanitize=address, I believe? My experience so far with it has been... not great.
What I've got is two static libs and an executable that links with them. I've turned on -fsanitize=address for all three. With the flag enabled, the code exits cleanly (exit code 1) under the debugger during a very thoroughly tested and correct init routine, in the middle of a 256-byte memcpy into a 16 MB heap allocation.
Can anyone with practical experience using -fsanitize=address give me tips on where the problem may lie? I'm using gcc/ld under CMake, and the code is (fundamentally) C compiled as C++. Switching to Clang is probably an option if that might improve things.
Typical compile command for a file:
"command": "/usr/bin/c++ -I. -I/home/redacted -fpermissive -g -g3 -fasynchronous-unwind-tables -fsanitize=address
-std=gnu++11 -o core/CMakeFiles/nginx_core.dir/src/core/nginx.cpp.o -c /home/redacted.cpp",
I'm just going to leave this here for future searchers having problems with -fsanitize=address. tl;dr: it worked. I had two fundamental problems:
ASan was outputting thorough error information about why it was exiting, but that output was being swallowed by nginx and, in our customized version of it, redirected to an obscure log file. I'm not sure why I wasn't getting a debug break under gdb, but nevertheless it was detecting a legitimate error. Key piece of info: setting a breakpoint on __asan_report_error halts the program before it exits, so you can inspect your various frames.
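For example, a minimal gdb session (the binary name here is just a placeholder):

    $ gdb ./app
    (gdb) break __asan_report_error
    (gdb) run
    ... ASan hits the breakpoint instead of exiting ...
    (gdb) bt    # inspect the frames that triggered the report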
While the initialization routine is correct and heavily tested, as I mentioned, it does require its client to correctly allocate a (non-trivial) configuration structure. In this case, the structure was allocated 1 byte short, causing a 1-byte overread.
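For future searchers, a hypothetical sketch of that bug pattern (the struct and sizes are made up, not the real configuration structure):

    #include <cstdlib>
    #include <cstring>

    struct config {            // stand-in for the real configuration structure
        char data[32];
    };

    int main() {
        // Bug pattern: the client allocates one byte less than required.
        config *cfg = static_cast<config *>(std::calloc(1, sizeof(config) - 1));
        char buf[sizeof(config)];
        std::memcpy(buf, cfg, sizeof(config));  // 1-byte heap overread; ASan reports it
        std::free(cfg);
        return 0;
    }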
What is a good way to debug stack corruption? In a program of mine, the address of the this pointer sometimes changes after a method that does a shutdown on a file descriptor returns. I've debugged the program for hours but cannot find the problem.
What is a good method to find out what changes the address of the this pointer? When I manually add a watch on the this pointer, the error does not occur. The error still occurs when I strip my code down as much as possible. I tried Valgrind, but it does not find any early stack corruption.
I managed to detect when the error occurs by compiling the code in 64-bit mode: the address of this changed from 0xxxxxxx to 0x1000000xxxxxxx. By checking the address of this in the methods where the error occurs, I found out when the address changes (see the first paragraph).
Is there any other way to find out the cause of this problem?
You might want to give AddressSanitizer a shot. It is available in GCC 4.8:
AddressSanitizer, a fast memory error detector, has been added and can be enabled via -fsanitize=address. Memory access instructions will be instrumented to detect heap-, stack-, and global-buffer overflow as well as use-after-free bugs. To get nicer stack traces, use -fno-omit-frame-pointer. AddressSanitizer is available on IA-32/x86-64/x32/PowerPC/PowerPC64 GNU/Linux and on x86-64 Darwin.
In GCC (but apparently not clang), you need to specify -fsanitize=address in both the compiler flags and linker flags, as described in this related answer.
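For example, with plain command lines (file names hypothetical):

    g++ -fsanitize=address -fno-omit-frame-pointer -g -c foo.cpp -o foo.o
    g++ -fsanitize=address foo.o -o app    # the flag must appear at link time too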
This question may seem slightly open-ended, but it's been troubling me for some time, so I thought I would post it here in the hope of discussion or advice.
I am a physics PhD student running a fairly large computation with a reasonably complex Fortran program. The program involves a large number of particles (~1000) that interact with each other through local potentials and move according to overdamped Langevin dynamics.
Recently the program has begun to behave quite strangely. I'm not sure what changed, but different things seem to happen when the program is run with the same input parameters. Sometimes the program runs to completion. Other times it produces a segfault, at varying points within the computation. Occasionally it seems to simply grind to a halt without producing any errors, and on a couple of occasions it has caused my computer to display warnings about running out of program memory.
The thing that confuses me is why the program should behave differently for the same input. I'm really just hoping for suggestions about what might be going on here. Currently my only idea is some kind of memory management problem. The computer I'm running on is a 2013 iMac with 8 GB of RAM, a 2.7 GHz quad-core i5 processor, and OS X Mavericks. Not the most powerful in the world, but I'm fairly sure I've run bigger computations on it without these problems.
A segfault indicates that either your program is running out of memory or your program has an error. The most common errors in Fortran that cause segfaults are array subscript errors and disagreement between the arguments in a call and the procedure's dummy arguments. For the first, turn on your compiler's option for run-time subscript checking. For the second, place all procedures into a module (or modules) and use that module (or modules); this enables the compiler to check argument consistency.
What compiler are you using?
UPDATE: If you are using gfortran, try these compiler options: -O2 -fimplicit-none -Wall -Wline-truncation -Wcharacter-truncation -Wsurprising -Waliasing -Wimplicit-interface -Wunused-parameter -fwhole-file -fcheck=all -std=f2008 -pedantic -fbacktrace
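A minimal invocation with the checking options might look like this (source file name hypothetical):

    gfortran -g -fbacktrace -fcheck=all -o sim sim.f90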
I'm in a very weird situation where my code works on my desktop but crashes on a remote cluster. I've spent countless hours checking my source code for errors, running it in a debugger to catch whatever breaks the code, and looking for memory leaks under Valgrind (which turned out to be clean, at least under gcc).
Eventually, what I have found out so far is that the same source code produces identical results on both machines as long as I use the same compiler (gcc 4.4.5). The problem is that I want to use the Intel compiler on the remote cluster for better performance, along with some prebuilt libraries that were built with it. Besides, I'm still worried that gcc may be glossing over some memory issues that the Intel compiler exposes.
What does this mean for my code?
It probably means you are relying on undefined, unspecified or implementation-defined behavior.
Maybe you forgot to initialize a variable, or you access an array beyond its valid bounds, or you have expressions like a[i] = b[i++] in your code... the possibilities are practically infinite.
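A tiny made-up example of the kind of code that can "work" on one machine and crash on another:

    #include <cstdio>

    int main() {
        int sum;                       // uninitialized: reading it is undefined
        int data[4] = {1, 2, 3, 4};
        for (int i = 0; i <= 4; ++i)   // off-by-one: data[4] is out of bounds
            sum += data[i];
        std::printf("%d\n", sum);      // result depends on compiler and luck
        return 0;
    }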
Does the crash produce a core file? If backtraces (the equivalent of gdb's 'bt' command) from multiple core dumps are consistent, then you can begin selectively putting in printf statements and working backwards up the list of functions in the stack trace.
If no memory leaks are detected, then the heap is probably okay. That leaves the stack as a potential problem area. It looks like you may have an uninitialized variable that is smashing the stack.
Try compiling your app with '-fstack-protector' included in your gcc/g++ compile command arguments.
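For example (file name hypothetical; -fstack-protector-all instruments every function rather than a heuristic subset):

    g++ -g -fstack-protector -o app main.cpp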
I wrote a C++ command-line program with MS VC++ 2010 and GCC 4.2.1 (for Mac OS X 10.6, 64-bit, in Eclipse).
The program works well under GCC on OS X and most of the time under Windows, but sometimes it silently freezes: the command-line cursor keeps blinking, but the program refuses to continue working.
The following configurations work well:
GCC with the 'Release' and 'Debug' configurations.
VC++ with the 'Debug' configuration.
The error only occurs with the 'VC++ Release' configuration, under both Win 7 32-bit and 64-bit. Unfortunately, that is the configuration my customer wants to work with ;-(
I've already checked my program high and low and fixed all memory leaks, but the error still occurs. Do you have any ideas how I can find it?
Use logging to narrow down which part of the code the program is executing when it crashes. Keep adding logging until you've narrowed it down enough to see the issue.
Enable debug information in the release build (both compiler and linker); many variables won't show up correctly, but it should at least give you sensible backtrace (unless the freeze is due to stack smashing or stack overflow), which is usually enough if you keep functions short and doing just one thing.
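With the Visual C++ command-line tools, that corresponds roughly to the following (a sketch; the file name is hypothetical): /Zi makes the compiler emit debug information into a PDB, and /DEBUG makes the linker keep it.

    cl /O2 /Zi app.cpp /link /DEBUG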
Memory leaks can't cause freezes, but other forms of memory misuse are quite likely to. In my experience, overrunning a buffer often causes a freeze when that buffer is freed, as the free function walks the corrupted block chains. Also watch for any other kind of undefined behaviour; there is a lot of it in C/C++, and it usually behaves as you expect in debug builds and completely randomly when optimized.
Try building and running the program under the DUMA library to check for buffer overruns. Be warned, though, that:
It requires a lot of memory, easily a thousand times more, so you can only test simple cases.
Microsoft headers tend to abuse their internal allocation functions and mismatch them, e.g. pairing a regular malloc with the internal __debug_free (or the other way around). So you might hit a few cases that you'll have to carefully work around by including those system headers in the DUMA one before it redefines the functions.
Try building the program for Linux and running it under Valgrind. That will check for more problems in addition to buffer overruns, and it won't use nearly as much memory (only about twice the normal amount), although it is slower, roughly 20 times.
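A typical Memcheck invocation (binary name hypothetical):

    valgrind --tool=memcheck --leak-check=full ./app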
Debug builds usually initialize all allocated memory (MSVC fills it with 0xCD in the debug configuration). Maybe you have some uninitialized values in your classes; with the GCC configurations and the MSVC Debug configuration they happen to get a "lucky" value, but in MSVC Release they don't.
Here are the rest of the magic numbers used by MSVC.
So look for uninitialized variables, attributes and allocated memory blocks.
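For instance, a made-up sketch of such an uninitialized member:

    #include <cstdio>

    struct Parser {
        bool done;                     // bug: never initialized
        Parser() {}
    };

    int main() {
        Parser *p = new Parser;
        // The MSVC debug heap fills fresh allocations with 0xCD, so 'done'
        // reads as nonzero ("true") and the loop is skipped; in Release the
        // value is arbitrary and the loop may never terminate.
        while (!p->done) { /* ... */ break; }  // 'break' keeps the sketch finite
        std::printf("done=%d\n", p->done);
        delete p;
        return 0;
    }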
Thank you all, especially Cody Gray and MikMik, I found it!
As some of you recommended, I told VS to generate debug information and disabled optimizations in the release configuration. Then I started the program and paused it, or alternatively attached remotely to the running process. This helped me find the region where the error was.
The causes were infinite loops, produced by reads beyond the boundaries of an array and by a missing exclusion of an invalid case; both led to stopping conditions that could never be reached at runtime. The esoteric part came from the fact that my program uses some randomized values.
That's life...
Some things in GDB (actually used through the DDD GUI) confuse me when debugging my own C++ code:
1) Why is there no backtrace available after a HEAP ERROR crash?
2) Why does gdb sometimes stop AFTER the breakpoint rather than AT the breakpoint?
3) Sometimes stepping through commented-out lines causes execution of some instructions (gdb busy)??
Any explanations greatly appreciated,
Petr
1) I'm not sure about heap errors specifically, but if you ran out of memory, for example, gdb might not be able to process the backtrace properly. Also, if heap corruption caused a pointer to clobber part of your application's stack, that would make the backtrace unavailable.
2) If you have optimization enabled, it's quite possible for this to happen. The compiler can reorder statements, and the underlying assembly on which the breakpoint was placed may correspond to a later line of code. Don't use optimization when trying to debug such things.
3) This could be caused either by the source code not having been rebuilt before execution (so the binary differs from the actual source), or possibly by optimization settings again.
A few possible explanations:
1) Why is there no backtrace available after a HEAP ERROR crash?
If the program is generating a core dump file you can run GDB as follows: "gdb program -c corefile" and get a backtrace.
2) Why does gdb sometimes stop AFTER the breakpoint rather than AT the breakpoint?
Breakpoints are generally placed on statements, so watch out for that. The problem here could also be caused by a mismatch between the binary and the code you're using.
3) Sometimes stepping through commented lines causes execution of some instructions (gdb busy)??
Again, see #2.
2) Why does gdb sometimes stop AFTER the breakpoint rather than AT the breakpoint?
Do you have optimization enabled during your compilation? If so, the compiler may be doing non-trivial rearrangements of your code... This could conceivably address your number 3 as well.
With g++ use -O0 or no -O at all to turn optimization off.
I'm unclear on what your number 1 is asking.
Regarding the breakpoint and comment/instruction behavior: are you compiling with optimization enabled (e.g., -O3)? GDB can handle that, but the behavior you're seeing sometimes occurs when debugging optimized code, especially code compiled with aggressive optimizations.
Heap checks are probably done after main returns; try 'set backtrace past-main' in GDB. If it's crashing, the process is gone, so you need to load the core file into the debugger (gdb prog core); see the example below.
Optimized code, see #dmckee's answer
Same as 2.
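For the core-file workflow in #1, a typical session looks like:

    $ gdb prog core
    (gdb) set backtrace past-main on
    (gdb) bt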