Debugging an "Invalid address space" error - c++

I've built some C++ code that uses OpenACC and compiled it with the PGI compiler for use on the Tesla GPU.
Compilation succeeds without any warnings.
I run the program and get two errors:
call to cuStreamSynchronize returned error 717: Invalid address space
call to cuMemFreeHost returned error 717: Invalid address space
The internet doesn't seem to know much about this, other than to suggest enabling unified memory so that the problem is automatically swept under the rug. I'm not into that kind of solution.
How do I go about debugging this?
With C++ code running only on the CPU, I'd fire up gdb, do a backtrace, and say, "Ah ha!"
But now I have code living on the CPU and the GPU and data flowing between the two. I don't even know what tools to use.
A fallback is to start commenting out lines until the problem goes away, but that seems suboptimal too.

You can use "cuda-gdb" to debug the device code or use "cuda-memcheck" to check for memory errors.
Though I'm not sure either will help here. The error is indicating that the device code is issuing an instruction using an address from the wrong memory space. For example, using a shared memory pointer with an instruction that expects a global memory pointer.
I have not seen this error before, nor do I see any previous bug reports for it, so I can only theorize as to the cause. One possibility is that you have a shared memory variable (a scalar or array in a "private" clause, or in a "cache" directive) that's passed from an outer gang loop to a vector routine. In this case, the vector routine may be accessing the variable as if it's in global memory.
Most likely, whatever the cause, it's a compiler error. If possible, please post a reproducing example or send one to PGI Customer Service (trs@pgroup.com), and I'll get it to our compiler engineers for investigation.
I can also try to get you a work-around once I better understand the cause. Though in the meantime you can try compiling with "-ta=tesla:nollvm,keepgpu". "nollvm" will cause the compiler to generate an intermediary CUDA C version of the OpenACC kernels as opposed to the default LLVM device code generator. "keepgpu" will keep the intermediary ".gpu" file which you can inspect.
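As a rough illustration of the pattern theorized above (a "private" array on a gang loop handed to a vector routine), a minimal sketch might look like the following. The names scale_tile, scale_all and kTile are hypothetical, and this has not been verified to trigger error 717 with any PGI release; it only shows the shape of code worth looking for when building a reproducing example. Without OpenACC enabled, the pragmas are ignored and the code runs serially on the host.

```cpp
#include <cstddef>

constexpr int kTile = 32;  // hypothetical tile size, chosen only for illustration

// Vector-level device routine. It receives `scratch` as a plain pointer and
// has no way to know which memory space the caller placed it in.
#pragma acc routine vector
void scale_tile(const float* in, float* scratch, float* out, float factor) {
    #pragma acc loop vector
    for (int j = 0; j < kTile; ++j) {
        scratch[j] = in[j] * factor;  // write through the private array
        out[j] = scratch[j];
    }
}

// Gang loop with a gang-private array. A compiler may choose to place
// `scratch` in shared memory, which is the situation described above.
void scale_all(const float* in, float* out, int tiles, float factor) {
    float scratch[kTile];
    #pragma acc parallel loop gang private(scratch) \
        copyin(in[0:tiles * kTile]) copyout(out[0:tiles * kTile])
    for (int t = 0; t < tiles; ++t) {
        scale_tile(in + t * kTile, scratch, out + t * kTile, factor);
    }
}
```

If a kernel in the real program has this shape, temporarily inlining the routine body into the gang loop is one way to check whether the error follows the routine call.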

There are some helpful environment variables that aid in debugging. Any combination can be enabled:
export PGI_ACC_TIME=1 #Profile time usage
export PGI_ACC_NOTIFY=1 #Set to values 0-3 where 3 is the most detailed
export PGI_ACC_DEBUG=1 #Extra debugging info

Related

Debugging CUDA MMU Fault

In my code I repeatedly get memory access errors, and I cannot find the reason why this would happen.
What is an MMU error on CUDA in the first place, and how can I debug where it's coming from? Currently it happens when defining a lambda function, but when I rewrite the code it happens at some other place, so the behaviour is quite unpredictable, and I don't even know how to start debugging this.
The MMU fault you are referring to is presumably an Xid 31 error as described here.
The most common reason for this in my experience is a CUDA code defect (code written by CUDA user, i.e. GPU kernel/device code) that results in an error occurring during the execution of a GPU kernel. Such issues, in my experience, are nearly always capturable/localizable using cuda-memcheck. (You can also use a debugger as described in the link above).
For these cases, the best method to begin the debug, IMO, is to start using the method described here. It is essentially what is being referred to in the document I linked above. Using that method, cuda-memcheck is generally able to localize the error to a specific line of source code for you. Thereafter you have additional debug avenues you can pursue, using in-kernel printf and/or a debugger, as described.
If cuda-memcheck does not report any issues, but the Xid 31 error is logged in your system logs each time you run a particular app, then as indicated in the first linked document, this is not really end-user debuggable (and should be a rare occurrence) and the only recourse at that point is to file a bug at developer.nvidia.com, using the general method described here.

What does it mean when the same source code gives different answers under two different compilers?

I'm in a very weird situation where my code works on my desktop but crashes on a remote cluster. I've spent countless hours checking my source code for errors, running it in a debugger to catch what breaks the code, and looking for memory leaks under valgrind (which turned out to be clean -- at least under gcc).
What I have found out so far is that the same source code produces identical results on both machines as long as I'm using the same compiler (gcc 4.4.5). The problem is that I want to use the Intel compiler on the remote cluster for better performance, and also for some prebuilt libraries that were built with Intel. Besides, I'm still worried that maybe gcc is neglecting some memory issues that the Intel compiler catches.
What does this mean for my code?
It probably means you are relying on undefined, unspecified or implementation-defined behavior.
Maybe you forgot to initialize a variable, or you access an array beyond its valid bounds, or you have expressions like a[i] = b[i++] in your code... the possibilities are practically infinite.
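To make those three examples concrete, here is a small sketch of the well-defined counterparts; the function names copy_values and checked_read are made up for illustration. It initializes everything before use, keeps the increment in its own statement instead of writing a[i] = b[i++] (undefined before C++17 and still easy to misread afterwards), and uses std::vector::at(), which throws std::out_of_range rather than silently reading past the end.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copies src element by element. The increment is a separate statement, so
// the read and the write of `i` are unambiguously ordered.
std::vector<int> copy_values(const std::vector<int>& src) {
    std::vector<int> dst(src.size(), 0);  // every element is initialized
    std::size_t i = 0;                     // initialized before first use
    while (i < src.size()) {
        dst[i] = src[i];
        ++i;
    }
    return dst;
}

// Bounds-checked read: at() throws std::out_of_range for an invalid index
// instead of invoking undefined behavior like operator[] would.
int checked_read(const std::vector<int>& values, std::size_t index) {
    return values.at(index);
}
```

Swapping operator[] for at() in a suspect region is a cheap experiment: if an exception appears under one compiler, the out-of-bounds access was there under both.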
Does the crash result in a core file? If the backtraces (the equivalent of the gdb 'bt' command) from multiple core dumps are consistent, then you can start putting in printf statements selectively and work backwards up the list of functions in the stack trace.
If there are no memory leaks detected, then heap is probably okay. That leaves the stack as a potential problem area. It looks like you may have an uninitialized variable that is smashing the stack.
Try compiling your app with '-fstack-protector' included in your gcc/g++ compile command arguments.

Dwarf Error: Cannot find DIE

I am having a lot of trouble debugging a segmentation fault in a C++ project in XCode 4.
I only get a segfault when I built with the "LLVM 2.0" compiler option and use -O3 optimization. From what I understand, there are limited debugging options when one is using optimization, but here is the debug output I get after I run in Xcode with gdb turned on:
warning: Got an error handling event: "Dwarf Error: Cannot find DIE at 0x3be2 referenced from DIE at 0x11d [in module /Users/imran/Library/Developer/Xcode/DerivedData/cgo-hczcifktgscxjigfphieegbpxxsq/Build/Products/Debug/cgo]".
No memory available to program now: unsafe to call malloc
I can't get gdb to give me any useful info after that (like a trace), but I'm not sure I really know how to use it properly. When I try to use the "LLDB" debugger Xcode just crashes (which has been a common theme since I started using it).
My program is deterministic, but when I try to isolate the problem with print statements the behavior will change. For example if I add cout << "hello"; at one point the segfault goes away. Other print statements cause my program to segfault in a different iteration of its main loop. And naturally when I put in enough print statements to supposedly pinpoint the offending code, the segfault seems to occur after one line but before the next (i.e. nowhere).
I am using pointers and dynamic memory allocation, which is likely the cause of the problem, but since I can't narrow down the block of code causing the error I don't know what code to show here.
I tried profiling with the "Leaks" tool in Instruments, but it didn't find any leaks.
Any advice? I am very inexperienced with debugging so anything would help, really.
EDIT: Solved. Given certain inputs, my program would try to read past the end of an array.
I don't think there's enough information that I can help you with the DWARF issue. I am not familiar enough with that toolchain to know how robust it is.
Your crashing symptoms however smell a lot like heap corruption. I don't know what allocator OSX uses by default, but common optimizations store metadata inline with objects and/or thread the freelist through empty objects, which makes them very sensitive to buffer overflows on the heap. Freeing an object twice or using a dangling pointer (a pointer that has been freed but whose space may now be in use by another allocation) can also cause seemingly nondeterministic and hard to track errors, since the layout of the heap is likely to change between runs. Print statements also use the allocator, which means changing the print statements can change when and where the problem will appear.
A tool that you may find helpful in determining if this is a heap problem or something unrelated is a heap replacement called DieHard by my advisor (http://prisms.cs.umass.edu/emery/index.php?page=download-diehard). I believe it will build on OSX, and you can link it into your program using LD_PRELOAD=/path/to/libdiehard.so to replace the default allocator at runtime. Its sole purpose is to resist memory errors and heap corruption, so if your application actually runs with it, that's probably where you need to look.
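Independently of swapping the allocator, dangling pointers of the kind described above can sometimes be made detectable in the code itself. The following is only a sketch under the assumption that the objects in question can be owned by std::shared_ptr; Node and read_or are hypothetical names. A std::weak_ptr observer reports that its target is gone instead of reading freed memory.

```cpp
#include <memory>

struct Node {
    int value = 0;
};

// Returns the node's value, or `fallback` if the node has already been
// destroyed. lock() yields an empty shared_ptr once the owner is gone.
int read_or(const std::weak_ptr<Node>& observer, int fallback) {
    if (auto alive = observer.lock()) {
        return alive->value;
    }
    return fallback;
}
```

With a raw pointer the second read after the owner is released would be a use-after-free; here it takes the fallback path, which is easy to log or assert on.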

Compiling with the /O2 flag makes the program throw an access violation

I know this may be a once-in-a-lifetime question, but I'm stuck and can't think of what might be causing it. I've written some C++ code (around 500 lines, in separate classes and files) using Visual Studio. When I compile it without optimization (/Od) it works fine, but when I compile it using the Release configuration (/O2 for optimization) the program crashes with an access violation. After some debugging I found that the value of this changes inside one of the member functions, but I can't see any direct use of the pointer in the call stack where it changes. Can anyone suggest what could make this happen only when optimization is enabled?
I don't know whether this helps, but when I compile with optimization I can see an assembly instruction, pop ebp, added at the end of my first function call. I don't know what it does, but whatever it is, this is where the this pointer changes.
Something new I found while debugging with the disassembler: there are 13 push instructions and only 10 pop instructions in the function that is causing the problem (the problem is caused by the last pop, just before the ret instruction). Is that okay or not? (I'm counting all push and pop instructions in the functions that are called, too.)
The reason you're seeing different behavior with and without optimizations is that your code (unintentionally) relies on undefined behavior. It just so happens to work if the compiler lays out data in one way, and breaks if the compiler lays it out differently.
In other words, you have a bug.
It may be in your already tested code, or it may be in how you use that code. In any case, as @Nim said in the comments, check wherever you allocate and free memory. Check that your classes follow the rule of three. Verify that you don't have a buffer overrun somewhere. And perhaps try compiling it with different compilers as well. Use static analysis tools (MSVC has /analyze, Clang has --analyze) and, on Linux, Valgrind may be a good bet.
But don't assume that it is a compiler bug. Those do occur, sure, but they're not commonly the source of such errors. In nearly every case, it is a latent bug in the developer's own code. Just because it doesn't trigger every time, with every compiler flag, doesn't mean it doesn't exist, or that it's the compiler's fault.
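For reference, a class that follows the rule of three mentioned above defines its destructor, copy constructor and copy assignment operator together whenever it owns a raw resource. The Buffer class below is a hypothetical minimal sketch, not code from the question:

```cpp
#include <algorithm>
#include <cstddef>

class Buffer {
public:
    explicit Buffer(std::size_t n) : size_(n), data_(new int[n]()) {}

    // 1. Destructor releases the owned array.
    ~Buffer() { delete[] data_; }

    // 2. Copy constructor makes a deep copy instead of sharing the pointer.
    Buffer(const Buffer& other)
        : size_(other.size_), data_(new int[other.size_]) {
        std::copy(other.data_, other.data_ + other.size_, data_);
    }

    // 3. Copy assignment allocates first, then frees the old array, and
    //    guards against self-assignment.
    Buffer& operator=(const Buffer& other) {
        if (this != &other) {
            int* fresh = new int[other.size_];
            std::copy(other.data_, other.data_ + other.size_, fresh);
            delete[] data_;
            data_ = fresh;
            size_ = other.size_;
        }
        return *this;
    }

    int& operator[](std::size_t i) { return data_[i]; }
    const int& operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    int* data_;
};
```

If any one of the three is missing, the compiler-generated version copies the raw pointer, two objects end up deleting the same array, and the resulting double free is exactly the kind of bug that may only show up at /O2.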
That a this pointer suddenly changes value leads me to believe this is related to heap corruption. On the other hand, since the behavior depends on whether the code is optimized, it might just as well be related to the stack. One of the things the optimizer does is remove variables on the stack that are never accessed.
This means that when you compile without optimization there are more variables on the stack, so the memory layout is somewhat different and the stack in effect has more slack space, which can have a huge impact on how the software reacts to, for example, an overflow on the stack.
If there are local variables that are never used, the program doesn't care if you corrupt their memory. It's only when you corrupt memory that you actually use that it becomes a problem.
There are different warning levels (four, if I'm not mistaken) that you can tell the compiler to use. If you use the highest one and also treat warnings as errors (/W4 together with /WX in MSVC), a warning will halt the compilation process. This way you can notice local variables that will be removed when the code is optimized, which can move you closer to the real problem. Start your search around these areas of the code.
I also suggest that you cut away code and test, just to narrow down where the problematic code is located, and gradually dig down closer to the problem. When you have no information you must start from the beginning (the main loop of the program) and try to isolate and rule out the portions of the code that are working OK. "If I comment out this function call, then it doesn't crash" might give you a hint :)

C++: Where to start when my application crashes at random places?

I'm developing a game and when I do a specific action in the game, it crashes.
So I went debugging and I saw my application crashed at simple C++ statements like if, return, ... Each time when I re-run, it crashes randomly at one of 3 lines and it never succeeds.
line 1:
if (dynamic) { ... } // dynamic is a bool member of my class
line 2:
return m_Fixture; // a line of the Box2D physical engine. m_Fixture is a pointer.
line 3:
return m_Density; // The body of a simple getter for an integer.
I get no errors from the app nor the OS...
Are there hints, tips or tricks to debug more efficient and get known what is going on?
That's why I love Java...
Thanks
Random crashes like this are usually caused by stack corruption, since these are branching instructions and thus are sensitive to the condition of the stack. These are somewhat hard to track down, but you should run valgrind and examine the call stack on each crash to try and identify common functions that might be the root cause of the error.
Are there hints, tips or tricks to debug more efficient and get known what is going on?
Run game in debugger, on the point of crash, check values of all arguments. Either using visual studio watch window or using gdb. Using "call stack" check parent routines, try to think what could go wrong.
In suspicious routines (those potentially related to the crash), consider dumping all arguments to stderr (if you're using libsdl or are on *nix-like systems), or write a logfile, or send duplicates of all error messages using OutputDebugString (on Windows). This will make them visible in the "output" window in Visual Studio or in the debugger. You can also write "traces" (log("function %s was called", __FUNCTION__)).
If you can't debug immediately, produce core dumps on crash. On Windows this can be done using MiniDumpWriteDump; on Linux it is controlled by system configuration (for example, the core file size limit set with ulimit -c). Core dumps can be loaded into a debugger. I'm not sure if VS Express can deal with them on Windows, but you can still debug them using WinDbg.
If the crash happens within a class, check the this pointer. It could be invalid or null.
If the bug is truly evil (elusive stack corruption in a multithreaded app that leads to a delayed crash), write a custom memory manager that overrides new/delete, provides an alternative to malloc (if your app for some reason uses it, which is possible), AND locks all unused memory using VirtualProtect (Windows) or an OS-specific alternative. In this case every potentially dangerous operation will crash the app instantly, which will allow you to debug the problem (if you have a Just-In-Time debugger) and immediately find the dangerous routine. I prefer such a "custom memory manager" to BoundsChecker and the like, since in my experience it was more useful. As an alternative you could try valgrind, which is available on Linux only. Note that if your app allocates memory very frequently, you'll need a large amount of RAM in order to be able to lock every unused memory block (because in order to be locked, a block must be PAGE_SIZE bytes big).
In areas where you need sanity check either use ASSERT, or (IMO better solution) write a routine that will crash the application (by throwing an std::exception with a meaningful message) if some condition isn't met.
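A sanity-check routine of the kind described above could be as small as the following sketch. The names require and density_of are invented for illustration, and std::runtime_error is used as one concrete exception type carrying a message:

```cpp
#include <stdexcept>
#include <string>

// Throws with a readable message when a condition that must hold does not.
// Unlike assert(), this stays active in release builds.
inline void require(bool condition, const std::string& message) {
    if (!condition) {
        throw std::runtime_error("check failed: " + message);
    }
}

// Example use: validate a pointer before dereferencing it.
int density_of(const int* density) {
    require(density != nullptr, "density pointer is null");
    return *density;
}
```

If nothing catches the exception, the program terminates at the failed check instead of limping on and crashing somewhere unrelated later.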
If you've identified a problematic routine, walk through it using debugger's step into/step over. Watch the arguments.
If you've identified a problematic routine, but can't directly debug it for whatever reason, after every statement within that routine, dump all variables into stderr or logfile (fprintf or iostreams - your choice). Then analyze outputs and think how it could have happened. Make sure to flush logfile after every write, or you might miss the data right before the crash.
In general you should be happy that app crashes somewhere. Crash means a bug you can quickly find using debugger and exterminate. Bugs that don't crash the program are much more difficult (example of truly complex bug: given 100000 values of input, after few hundreds of manipulations with values, among thousands of outputs, app produces 1 absolutely incorrect result, which shouldn't have happened at all)
That's why I love Java...
Excuse me, if you can't deal with language, it is entirely your fault. If you can't handle the tool, either pick another one or improve your skill. It is possible to make game in java, by the way.
These are mostly due to stack corruption, but heap corruption can also affect programs in this way.
Stack corruption occurs most of the time because of off-by-one errors.
Heap corruption occurs because new/delete is not handled carefully, for example a double delete.
Basically, the overflow/corruption overwrites something important, and then much later, when the program tries to use it, it crashes.
I generally like to take a second to step back and think through the code, trying to catch any logic errors.
You might try commenting out different parts of the code and seeing if it affects how the program is compiled.
Besides those two things you could try using a debugger like Visual Studio or Eclipse etc...
Lastly you could try to post your code and the error you are getting on a website with a community that knows programming and could help you work through the error (read: stackoverflow)
Crashes / seg faults usually happen when your program accesses a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location).
There are many memory analyzer tools, for example I use Valgrind which is really great in telling what the issue is (not only the line number, but also what's causing the crash).
There are no simple C++ statements. An if is only as simple as the condition you evaluate. A return is only as simple as the expression you return.
You should use a debugger and/or post some of the crashing code. Can't be of much use with "my app crashed" as information.
I had problems like this before. I was trying to refresh the GUI from different threads.
If the if statements involve dereferencing pointers, you're almost certainly corrupting the stack (this explains why an innocent return 0 would crash...)
This can happen, for instance, by going out of bounds in an array (you should be using std::vector!), trying to strcpy a char[]-based string missing the ending '\0' (you should be using std::string!), passing a bad size to memcpy (you should be using copy-constructors!), etc.
Try to figure out a way to reproduce it reliably, then place a watch on the corrupted pointer. Run through the code line-by-line until you find the very line that corrupts the pointer.
Look at the disassembly. Almost any C/C++ debugger will be happy to show you the machine code and the registers where the program crashed. The registers include the Instruction Pointer (EIP or RIP on x86/x64) which is where the program was when it stopped. The other registers usually have memory addresses or data. If the memory address is 0 or a bad pointer, there is your problem.
Then you just have to work backward to find out how it got that way. Hardware breakpoints on memory changes are very helpful here.
On a Linux/BSD/Mac, using GDB's scripting features can help a lot here. You can script things so that after the breakpoint is hit 20 times it enables a hardware watch on the address of array element 17. Etc.
You can also write debugging into your program. Use the assert() function. Everywhere!
Use assert to check the arguments to every function. Use assert to check the state of every object before you exit the function. In a game, assert that the player is on the map, that the player has health between 0 and 100, assert everything that you can think of. For complicated objects write verify() or validate() functions into the object itself that checks everything about it and then call those from an assert().
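As a sketch of the validate()-plus-assert pattern just described, using the player example from the text (the Player struct, its fields and apply_damage are assumptions made up for illustration):

```cpp
#include <cassert>

struct Player {
    int x = 0;
    int y = 0;
    int health = 100;

    // Checks every invariant of the object in one place.
    bool validate(int map_width, int map_height) const {
        return x >= 0 && x < map_width &&
               y >= 0 && y < map_height &&
               health >= 0 && health <= 100;
    }
};

// Asserts the arguments and the object state on entry and again on exit.
void apply_damage(Player& player, int amount, int map_width, int map_height) {
    assert(amount >= 0);
    assert(player.validate(map_width, map_height));

    player.health -= amount;
    if (player.health < 0) {
        player.health = 0;
    }

    assert(player.validate(map_width, map_height));
}
```

Because assert() compiles away when NDEBUG is defined, these checks cost nothing in a release build, which is also why they cannot replace error handling there.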
Another way to write in debugging is to have the program raise a signal (for example with raise()) on Linux, or use asm int 3 on Windows, to break into the debugger from the program. Then you can write temporary code into the program to check whether it is on iteration 1117321 of the main loop. That can be useful if the bug always happens at 1117322. The program will execute much faster this way than with a debugger breakpoint.
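A minimal sketch of that iteration-triggered break follows. It assumes SIGTRAP is available (POSIX systems) and substitutes the MSVC intrinsic __debugbreak() for the inline asm int 3 mentioned above; run_main_loop and its callback parameter are hypothetical, with the callback there only so the trigger can be exercised without a debugger attached.

```cpp
#include <csignal>
#include <cstdlib>
#include <functional>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Stops in the debugger if one is attached. Without a debugger, the default
// action for SIGTRAP terminates the process.
inline void break_into_debugger() {
#if defined(_MSC_VER)
    __debugbreak();
#elif defined(SIGTRAP)
    std::raise(SIGTRAP);
#else
    std::abort();  // fallback when no trap mechanism is known
#endif
}

// Runs `iterations` steps of a stand-in main loop and fires `on_break`
// exactly when the loop reaches iteration `break_at`.
long run_main_loop(long iterations, long break_at,
                   const std::function<void()>& on_break = break_into_debugger) {
    long work = 0;
    for (long i = 0; i < iterations; ++i) {
        if (i == break_at) {
            on_break();  // cheap integer compare on every other iteration
        }
        work += i;  // placeholder for the real per-iteration work
    }
    return work;
}
```

Calling run_main_loop(2000000, 1117321) under gdb would stop one iteration before the failing one, after which single-stepping takes over.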
some tips :
- run your application under a debugger, with the symbol files (PDB) together.
- How to set Visual Studio as the default post-mortem debugger?
- set default debugger for WinDbg Just-in-time Debugging
- check memory allocations Overriding new and delete, and Overriding malloc and free
One other trick: turn off code optimization and see if the crash points make more sense. Optimization is allowed to float little bits of your code to surprising places; mapping that back to source code lines can be less than perfect.
Check pointers. At a guess, you're dereferencing a null pointer.
I've seen 'random' crashes when there is a reference to a deleted object. As the memory is not necessarily overwritten right away, in many cases you don't notice it and the program works correctly, and then it crashes after the memory has been reused and is no longer valid.
JUST FOR DEBUGGING PURPOSES, try commenting out some suspicious 'deletes'. Then, if it doesn't crash anymore, there you are.
use the GNU Debugger
Refactoring.
Scan all the code, make it clearer if not clear at first read, try to understand what you wrote and immediately fix what seems incorrect.
You'll certainly discover the problem(s) this way and fix a lot of other problems too.