What is a good way to debug stack value corruption? In a program of mine, the address of the this pointer sometimes gets changed after a method that does a shutdown on a file descriptor returns. I debugged the program for hours but I cannot find the problem.
What is a good method to find out what changes the address of the this pointer? When I manually add a watch on the this pointer, the error does not occur. The error still occurs when I strip down my code as much as possible. I tried Valgrind but it does not find any early stack corruption.
I managed to detect when the error occurs. I compiled the code in 64-bit mode, and the address of this changed from 0xxxxxxx to 0x1000000xxxxxxx. I check the address of this in the methods where the error occurs, which is how I found out when the address changes (see the first paragraph).
Is there any other way to find out the cause of this problem?
You might want to give a shot to address-sanitizer. It is available in gcc 4.8:
AddressSanitizer, a fast memory error detector, has been added and
can be enabled via -fsanitize=address. Memory access instructions will
be instrumented to detect heap-, stack-, and global-buffer overflow as
well as use-after-free bugs. To get nicer stacktraces, use
-fno-omit-frame-pointer. The AddressSanitizer is available on IA-32/x86-64/x32/PowerPC/PowerPC64 GNU/Linux and on x86-64 Darwin.
In GCC (but apparently not clang), you need to specify -fsanitize=address in both the compiler flags and linker flags, as described in this related answer.
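As a rough illustration of the kind of bug these flags catch, here is a minimal sketch. The helper fill_and_sum and the file name demo.cpp are made up for this example and have nothing to do with the asker's code. Calling it with a count larger than 8 writes past a local array, which an instrumented build should report as a stack-buffer-overflow at the faulting write instead of letting a neighbouring stack value change silently.

```cpp
// Possible build line (demo.cpp is a hypothetical file name):
//   g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer demo.cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: writes `count` copies of `value` into a local 8-byte
// stack buffer and returns the sum of the buffer contents.
// Passing count > 8 overruns the buffer; that is the error class
// AddressSanitizer reports as "stack-buffer-overflow".
int fill_and_sum(std::size_t count, unsigned char value) {
    unsigned char buffer[8] = {};
    for (std::size_t i = 0; i < count; ++i) {
        buffer[i] = value;  // out of bounds once i reaches 8
    }
    int sum = 0;
    for (unsigned char byte : buffer) {
        sum += byte;
    }
    return sum;
}

// In-bounds use: behaves the same with or without the sanitizer.
void in_bounds_demo() {
    assert(fill_and_sum(8, 1) == 8);
}
```

Changing the call to fill_and_sum(9, 1) in an instrumented build is a quick way to confirm the sanitizer is actually active before pointing it at the real program.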
I've built some C++ code that uses OpenACC and compiled it with the PGI compiler for use on the Tesla GPU.
Compilation succeeds without any warnings.
I run the program and get two errors:
call to cuStreamSynchronize returned error 717: Invalid address space
call to cuMemFreeHost returned error 717: Invalid address space
The internet doesn't seem to know much about this, other than to suggest enabling unified memory so that the problem is automatically swept under the rug. I'm not into that kind of solution.
How do I go about debugging this?
With C++ code running only on the CPU, I'd fire up gdb, do a backtrace, and say, "Ah ha!"
But now I have code living on the CPU and the GPU and data flowing between the two. I don't even know what tools to use.
A fallback is to start commenting out lines until the problem goes away, but that seems suboptimal too.
You can use "cuda-gdb" to debug the device code or use "cuda-memcheck" to check for memory errors.
Though I'm not sure either will help here. The error is indicating that the device code is issuing an instruction using an address from the wrong memory space. For example, using a shared memory pointer with an instruction that expects a global memory pointer.
I have not seen this error before, nor do I see any previous bug reports for it, so I can only theorize as to the cause. One possibility is a shared memory variable (a scalar or array in a "private" clause, or a "cache" directive) that is passed from an outer gang loop to a vector routine. In this case, the vector routine may be accessing the variable as if it were in global memory.
Whatever the cause, it's most likely a compiler error. If possible, please post or send to PGI Customer Service (trs@pgroup.com) a reproducing example and I'll get it to our compiler engineers for investigation.
I can also try to get you a work-around once I better understand the cause. Though in the meantime you can try compiling with "-ta=tesla:nollvm,keepgpu". "nollvm" will cause the compiler to generate an intermediary CUDA C version of the OpenACC kernels as opposed to the default LLVM device code generator. "keepgpu" will keep the intermediary ".gpu" file which you can inspect.
There are some helpful environment variables that aid in debugging. Any combination can be enabled:
export PGI_ACC_TIME=1 #Profile time usage
export PGI_ACC_NOTIFY=1 #Set to values 0-3 where 3 is the most detailed
export PGI_ACC_DEBUG=1 #Extra debugging info
When I run my program it will occasionally crash and give me this error:
"glibc detected /pathtoexecutable: free(): invalid next size (fast)"
The backtrace leads to a member function that just calls a vector's push_back function -
void Path::add(Position p) {path.push_back(p);}
I have tried googling the error and the very large majority of the problems are people allocating too little memory. But how could that be happening on an std::vector<>.push_back? What can I check for? Any help is appreciated.
You're probably doing an invalid write somewhere and trashing the control information kept by glibc for bookkeeping. Thus, when it tries to free things it detects abnormal conditions (unreasonable sizes to free).
What's worst about this kind of thing is that the problem doesn't manifest itself at the point where you made the real mistake so it can be quite hard to catch (it's quite common to be an off-by-one error).
Your best bet is to use a memory debugger. valgrind could be a start (since you mentioned glibc). Look for "invalid write of size..." before the glibc message.
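To make the off-by-one case concrete, here is a small sketch. The helper duplicate is hypothetical and is not taken from the asker's Path class. It shows the correct allocation size, with a comment marking where the classic mistake goes. With the mistaken size, the terminating byte is written just past the end of the block, where it can land on the allocator's bookkeeping, and Valgrind would typically flag it as an "Invalid write of size 1" at the memcpy, long before free() complains.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper: copies a C string into a freshly allocated heap buffer.
std::unique_ptr<char[]> duplicate(const char* text) {
    assert(text != nullptr);
    const std::size_t length = std::strlen(text);
    // Off-by-one variant: allocating only `length` bytes here would make the
    // memcpy below write the terminating '\0' one byte past the block.
    std::unique_ptr<char[]> copy(new char[length + 1]);
    std::memcpy(copy.get(), text, length + 1);  // copies the terminator too
    return copy;
}
```

The crash site (push_back in the question) is then only the first place that happens to touch the damaged block, not the place that damaged it.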
As @cniculat said, try valgrind.
Other tools you can try:
GCC STL debug support. If the problem is incorrect usage of an STL container, then compiling with
-D_GLIBCXX_DEBUG and -D_GLIBCXX_DEBUG_PEDANTIC may reveal it. If a problem is detected, the program is aborted by an assertion, so you'll get an error message on the console.
Yet another option is Google's tcmalloc. It overrides malloc()/free(). Just link your application against tcmalloc, and it can detect most memory usage problems.
STL debug support and tcmalloc can be used routinely in debug builds. This way you can work as usual, while these tools assert in the "background" if there's an error.
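A minimal sketch of what the libstdc++ debug mode checks, assuming GCC with libstdc++. The helper element and the file name demo.cpp are invented for this illustration. In a normal build, an out-of-range index through operator[] is undefined behaviour and may go unnoticed; with the debug macros defined, the same call should abort with a diagnostic naming the container and the bad index.

```cpp
// Possible build line (demo.cpp is a hypothetical file name):
//   g++ -g -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_PEDANTIC demo.cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: unchecked element access through operator[].
// With _GLIBCXX_DEBUG defined, libstdc++ verifies that index < size()
// and aborts if it is not; without it, no check is performed.
int element(const std::vector<int>& values, std::size_t index) {
    return values[index];
}

// In-range use behaves identically in both kinds of build.
void in_range_demo() {
    const std::vector<int> values{10, 20, 30};
    assert(element(values, 2) == 30);
}
```

Note that the debug mode changes the layout of standard containers, so every translation unit and library that exchanges containers with this code generally needs to be built the same way.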
I am writing a library that has too much code to post here.
My problem is a segmentation fault, which Valgrind reports as:
Jump to the invalid address stated on the next line
at 0x72612F656D6F682F: ???
at [...] (stack call)
Thanks to this question, I guess it is because I have a stack corruption somewhere.
My question is: how to find it?
I tried using GDB, but the segmentation fault does not appear in the same place. GDB tells me it is on the first line of a function, while Valgrind tells me it is the call to this function that causes the segmentation fault.
If the problem is repeatable, you can use a technique similar to this answer to set a watchpoint on the location of the return address, and have GDB stop on the instruction immediately following the one that corrupts it.
Since this is from years ago, you've probably figured out your bug. But for anyone who might stumble upon this, I would strongly encourage you to look into the "sanitizers".
If you're running Memcheck, you can probably run AddressSanitizer, which exists in both clang and gcc. AddressSanitizer can often detect stack corruption issues better than Memcheck. (Besides stack corruption, AddressSanitizer can detect many different types of addressing bugs).
However, if you scroll back in your Memcheck log, you might see Conditional jump or move depends on uninitialised value(s), in which case you're using an uninitialized variable, which is often harder to debug. For this, you can try MemorySanitizer (currently clang and Linux only, https://clang.llvm.org/docs/MemorySanitizer.html). In particular, look at the origin tracking options. This provides better origin tracking than Memcheck for uses of uninitialized variables. Do note, however, that MemorySanitizer is not trivial to set up, as it generally requires all external libraries to be built with (MemorySanitizer) instrumentation.
I wrote a C++ CLI program with MS VC++ 2010 and GCC 4.2.1 (for Mac OS X 10.6 64 bit, in Eclipse).
The program works well under GCC+OS X and most times under Windows. But sometimes it silently freezes. The command line cursor keeps blinking, but the program refuses to continue working.
The following configurations work well:
GCC with 'Release' and 'Debug' configuration.
VC++ with 'Debug' configuration
The error only occurs in the VC++ 'Release' configuration, under Win 7 32 bit and 64 bit. Unfortunately this is the configuration my customer wants to work with ;-(
I already checked my program high and low and fixed all memory leaks. But this error still occurs. Do you have any ideas how I can find the error?
Use logging to narrow down which part of the code the program is executing when it freezes. Keep adding log statements until you have narrowed it down enough to see the issue.
Enable debug information in the release build (both compiler and linker); many variables won't show up correctly, but it should at least give you sensible backtrace (unless the freeze is due to stack smashing or stack overflow), which is usually enough if you keep functions short and doing just one thing.
Memory leaks can't cause freezes. Other forms of memory misuse, however, are quite likely to. In my experience, overrunning a buffer often causes a freeze when that buffer is freed, as the free function follows the corrupted block chains. Also watch for any other kind of Undefined Behaviour. There is a lot of it in C/C++, and it usually behaves as you expect in debug and completely randomly when optimized.
Try building and running the program under DUMA library to check for buffer overruns. Be warned though that:
It requires a lot of memory, easily a thousand times more than normal. So you can only test simple cases.
Microsoft headers tend to abuse their internal allocation functions and mismatch e.g. regular malloc and internal __debug_free (or the other way round). So you might get a few cases that you'll have to carefully work around by including those system headers into the duma one before it redefines the functions.
Try building the program for Linux and running it under Valgrind. That will check for more problems in addition to buffer overruns and won't use that much memory (only about twice the normal amount, but it is slower, approximately 20 times).
Debug versions usually initialize all allocated memory (MSVC fills them with 0xCD with the debug configuration). Maybe you have some uninitialized values in your classes, with the GCC configurations and MSVC Debug configuration it gets a "lucky" value, but in MSVC Release it doesn't.
Here are the rest of the magic numbers used by MSVC.
So look for uninitialized variables, attributes and allocated memory blocks.
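As a sketch of the fix for this class of bug, the struct below (Walker is a made-up name, not from the asker's program) gives every member a default initializer. Without the initializers, steps and finished would hold indeterminate values after a plain declaration, so a loop that waits on finished could behave differently depending on whatever the build left in memory.

```cpp
#include <cassert>

// Hypothetical class for illustration only.
// The default member initializers guarantee a known starting state in
// every configuration, instead of relying on a debug fill pattern.
struct Walker {
    int steps = 0;
    bool finished = false;

    // Counts one step and marks the walk finished once `limit` is reached.
    void advance(int limit) {
        assert(limit >= 0);
        if (++steps >= limit) {
            finished = true;
        }
    }
};
```

Compiler warnings can help find the remaining cases: GCC and clang have -Wuninitialized, and MSVC has warning C4700 for uninitialized local variables, though neither catches everything.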
Thank you all, especially Cody Gray and MikMik, I found it!
As some of you recommended, I told VS to generate debug information and disabled optimizations in the release configuration. Then I started the program and paused it. Alternatively, I attached remotely to the running process. This helped me find the region where the error was.
The causes were infinite loops, triggered by reads beyond the boundaries of an array and a missing exclusion of an invalid case. Both led to unreachable stopping conditions at runtime. The esoteric part came from the fact that my program uses some randomized values.
That's life...