I'm debugging my CUDA 4.0/Thrust-based image reconstruction code on my Ubuntu 10.10 64-bit system, and I've been trying to figure out how to debug a run-time error in which my output images appear as random "noise." There is no random number generator output in my code, so I expect the output to be consistent between runs, even if it's wrong. However, it's not...
I was just wondering if anyone has a general procedure for debugging CUDA runtime errors such as these. I'm not using any shared memory in my CUDA kernels. I've taken pains to avoid any race conditions involving global memory, but I could have missed something.
I've tried using GPU Ocelot, but it has problems recognizing some of my CUDA and CUSPARSE function calls.
Also, my code generally works. It's just when I change this one setting that I get these non-deterministic results. I've checked all code associated with that setting, but I can't figure out what I'm doing wrong. If I can distill it to something that I can post here, I might do that, but at this point it's too complicated to post here.
Are you sure all of your kernels have proper block-size/remainder handling? The one place we have seen non-deterministic results was when data elements at the end of the array were not being processed.
Our kernels were originally intended for data that was known to be an integer multiple of 256 elements, so we used a block size of 256 and a simple division to get the number of blocks. When the data was later changed to be of any length, the leftover elements (up to 255 of them) never got processed, and those spots in the output then contained random data.
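As a minimal sketch of that fix (the kernel and variable names here are assumptions, not the poster's actual code): round the grid size up with a ceiling division and guard the tail inside the kernel, so the last partial block neither skips elements nor writes out of bounds.

// illustrative kernel; the operation itself is a placeholder
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // without this guard the last partial block runs off the end
        data[i] *= factor;
}

void launchScale(float* d_data, int n, float factor)
{
    const int blockSize = 256;
    const int numBlocks = (n + blockSize - 1) / blockSize;   // ceiling division, not n / blockSize
    scaleKernel<<<numBlocks, blockSize>>>(d_data, n, factor);
}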
Related
I have a C++ program that processes images and tracks objects in them, using OpenCV. For the most part it works well; however, the results I get are inconsistent. That is, approximately 10% of the time I get slightly different output values and I cannot figure out why. I do not have any calls to random; I have run valgrind to look for uninitialized memory; I have run clang's static analysis tools on it. No luck. The inconsistent runs have one of several different outputs, so they are not completely random.
Is there a tool that will show me where two runs diverge? If I run gprof or maybe cflow, can I compare them and see what was different? Is there some other tool or process I can use?
Edit: Thank you for the feedback. I believe that it is due to threading and a race condition; the suggestion was very helpful. I am currently using advice from: Ways to Find a Race Condition
Answering my own question, so that someone can see it. It was not a race condition:
The underlying problem was that we were using HOG descriptors with the incorrect parameters.
As it says in the documentation (https://docs.opencv.org/3.4.7/d5/d33/structcv_1_1HOGDescriptor.html), when you call cv::HOGDescriptor::compute and pass in a win_stride, it has to be a multiple of the block stride. Similarly, the block stride must be a multiple of the cell size. We did not set these properly. The end result is that sometimes (about 10% of the time) memory was being overwritten or otherwise corrupted. It didn't throw an error, and the results were almost correct all the time, but they were subtly different.
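For illustration, here is a hedged sketch of a parameter set that satisfies those constraints (the sizes are the common OpenCV defaults, not necessarily what our code used):

#include <opencv2/objdetect.hpp>
#include <vector>

std::vector<float> computeHog(const cv::Mat& gray)
{
    // illustrative values; the point is the divisibility relationships
    cv::Size winSize(64, 128);
    cv::Size blockSize(16, 16);
    cv::Size blockStride(8, 8);   // must be a multiple of cellSize
    cv::Size cellSize(8, 8);
    int nbins = 9;

    cv::HOGDescriptor hog(winSize, blockSize, blockStride, cellSize, nbins);

    std::vector<float> descriptors;
    cv::Size winStride(8, 8);     // must be a multiple of blockStride
    hog.compute(gray, descriptors, winStride);
    return descriptors;
}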
I have C++ code that generates random 3D network structures. It works well: if I run it manually (from the terminal) a couple of times, I get different structures each time, as expected.
However, if I create a small loop that launches it 10 successive times, it produces the exact same structure 10 times, which is not normal. If I add a sleep(1) line at the end of the code, it works again, so I guess it has something to do with C++ releasing the memory (I am absolutely not an expert, so I could be completely wrong).
The problem is that, by adding the sleep(1) command, it takes much more time to run (10x more). This is of course not an issue for 10 runs, but the aim is to make thousands of them.
Is there a way to force C++ to release the memory at the end of the code?
C++ does not release memory automatically at all (except for code in destructors) so that is not the case.
But random number generators are usually seeded from a system clock counter (I may be wrong here).
In Pascal you have to call the randomize procedure to initialize the random generator with a seed. Without it, the random number generator produces the same results on each run, which is very much like your situation.
In C++ there is the srand function, which is typically seeded with the current time, as in the example here: http://en.cppreference.com/w/cpp/numeric/random/rand
I don't know how you seed your random generator, but if you do it with a time that has one-second resolution, and your code is fast enough to do 10 runs within one second, that would be the cause. It would also explain why a one-second delay fixes the situation.
If that's the case, you can try a time function with higher resolution. Also, the C++11 standard library has a much more powerful random module (as do the Boost libraries, if you don't have C++11). Documentation is here: http://www.cplusplus.com/reference/random/
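A minimal sketch of both suggestions (assuming the structure generator just needs a seed; the names here are illustrative):

#include <chrono>
#include <cstdlib>
#include <random>

int main()
{
    // Option 1: keep rand(), but seed it from a high-resolution clock so two
    // launches within the same second still get different sequences.
    auto ticks = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    std::srand(static_cast<unsigned>(ticks));

    // Option 2 (C++11): a dedicated engine seeded from std::random_device.
    std::mt19937 engine{std::random_device{}()};
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double sample = dist(engine);   // use this wherever the structure needs randomness
    (void)sample;
    return 0;
}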
I'm writing a simple video encoder that compresses YUV420p video. I noticed that the resulting file always looks slightly different when I regenerate it from the same input file with the same compression settings. No big changes, it's usually just a few bits that suffer from a "cosmic ray bit flip" effect.
At no point in my program do I use randomized values, so the resulting output should always be the same. I suspect that my program performs reads/writes outside its allocated memory, which would explain the randomness of the data.
Aside from regular debugging practices, are there special tools/tricks to help me detect the cause of these shenanigans?
If you are on Windows, you can try Application Verifier (AppVerifier).
use debug variants of your containers
use your containers' interfaces rather than raw memory
MallocDebug
MallocLogging
valgrind
MallocScribble
In some cases, you can perform an operation twice when you expect equal results and verify that the results match (note that you should not compare memory directly, but use the interfaces); see the sketch below.
If you know the algorithms employed well enough, you could provide an input whose output you can analyze for errors (e.g. "if the input is a black frame, the output should also be a black frame").
Not sure which platform you are developing on; the tools mentioned are for 'nixes and OS X (also a 'nix, but...).
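Here is a minimal sketch of the run-it-twice check, assuming a hypothetical encodeFrame() entry point (a stand-in for whatever interface your encoder actually exposes):

#include <cassert>
#include <cstdint>
#include <vector>

std::vector<uint8_t> encodeFrame(const std::vector<uint8_t>& yuv);   // assumed encoder API

void checkDeterminism(const std::vector<uint8_t>& yuv)
{
    std::vector<uint8_t> out1 = encodeFrame(yuv);
    std::vector<uint8_t> out2 = encodeFrame(yuv);   // same input, same settings

    assert(out1.size() == out2.size());
    for (size_t i = 0; i < out1.size(); ++i)
        assert(out1[i] == out2[i]);   // the first mismatch points at the corrupted region
}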
You can produce one output record and then produce another. During compression, compare each bit of the second output against the same bit of the first. I don't know your compressor, but if you have several stages of compression, you can save the intermediate results of each stage from the first run; when compressing the second run, compare the bits at each stage against the corresponding records from the first output. That way you can find the exact place that corrupts bits.
I am rewriting some rendering C code in C++. The old C code basically computes everything it needs and renders it at each frame. The new C++ code instead pre-computes what it needs and stores that as a linked list.
Now, actual rendering operations are translations, colour changes and calls to GL lists.
While executing the operations in the linked list should be pretty straightforward, it would appear that the resulting method call takes longer than the old version (which computes everything each time - I have of course made sure that the new version isn't recomputing).
The weird thing? It executes fewer OpenGL operations than the old version. But it gets weirder. When I added counters for each type of operation, and a good old printf at the end of the method, it got faster - both gprof and manual measurements confirm this.
I also bothered to take a look at the assembly code generated by G++ in both cases (with and without trace), and there is no major change (which was my initial suspicion) - the only differences are a few more stack words allocated for counters, increasing said counters, and preparing for printf followed by a jump to it.
Also, this holds true with both -O2 and -O3. I am using gcc 4.4.5 and gprof 2.20.51 on Ubuntu Maverick.
I guess my question is: what's happening? What am I doing wrong? Is something throwing off both my measurements and gprof?
By spending time in printf, you may be avoiding stalls in your next OpenGL call.
Without more information, it is difficult to know what is happening here, but here are a few hints:
Are you sure the OpenGL calls are the same? You can use a tool to compare the calls issued. Make sure no state change was introduced by the possibly different order in which things are done.
Have you tried to use a profiler at runtime? If you have many objects, the simple fact of chasing pointers while looping over the list could introduce cache misses.
Have you identified a particular bottleneck, either on the CPU side or GPU side?
Here is my own guess at what could be going wrong. The calls you send to your GPU take some time to complete: the previous code, by mixing CPU operations and GPU calls, made the CPU and GPU work in parallel; by contrast, the new code first makes the CPU compute many things while the GPU is idling, then feeds the GPU all the work to be done while the CPU has nothing left to do.
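One rough way to test that guess (a sketch; walkListAndIssueGLCalls() is an assumed name for your existing rendering path): time the CPU submission separately from a glFinish() that drains the GPU queue.

#include <GL/gl.h>
#include <chrono>
#include <cstdio>

void walkListAndIssueGLCalls();   // placeholder for the list-walking render method

void renderFrameTimed()
{
    auto t0 = std::chrono::steady_clock::now();
    walkListAndIssueGLCalls();                 // CPU side: walk the list, issue GL calls
    auto t1 = std::chrono::steady_clock::now();
    glFinish();                                // block until the GPU has finished everything queued
    auto t2 = std::chrono::steady_clock::now();

    std::printf("CPU submit: %lld us, GPU drain: %lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
}

If the drain time dominates in the new code but not in the old one, the GPU-idling hypothesis above is probably right.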
I'm currently working on a C++ program in Windows XP that processes large sets of data. Our largest input file causes the program to terminate unexpectedly with no sort of error message. Interestingly, when the program is run from our IDE (Code::Blocks), the file is processed without any such issues.
As the data is being processed, it's placed into a tree structure. After we finish our computations, the data is moved into a C++ STL vector before being sent off to be rendered in OpenGL.
I was hoping to gain some insight into what might be causing this crash. I've already checked out another post which I can't post a link to since I'm a new user. The issue in the post was quite similar to mine and resulted from an out of bounds index to an array. However, I'm quite sure no such out-of-bounds error is occurring.
I'm wondering if, perhaps, the size of the data set is leading to issues when allocating space for the vector. The systems I've been testing the program on should, in theory, have adequate memory to handle the data (2GB of RAM with the data set taking up approx. 1GB). Of course, if memory serves, the STL vectors simply double their allocated space when their capacity is reached.
Thanks, Eric
The fact that the code works within the IDE (presumably running within a debugger?), but not standalone suggests to me that it might be an initialisation issue.
Compile with the warning level set to max.
Then check all your warnings. I would guess it is an uninitialized variable (that in debug mode is being initialized to NULL/0).
Personally, I set up my templates so that warnings are always at max and are treated as errors, so that compilation fails.
You'd probably find it helpful to configure the OS to create a crash dump (perhaps still by using the Windows tool called "Dr. Watson"), to which you can then attach a debugger after the program has crashed (assuming that it is crashing).
You should also trap the various ways in which a program might exit semi-gracefully without a crash dump: atexit, set_unexpected, set_terminate and maybe others.
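Here is a rough sketch of wiring up those hooks so the program at least logs something before it dies (the handler names are just placeholders):

#include <cstdio>
#include <cstdlib>
#include <exception>

void onExit()      { std::fprintf(stderr, "exit() was called\n"); }
void onTerminate() { std::fprintf(stderr, "terminate(): unhandled exception?\n"); std::abort(); }

int main()
{
    std::atexit(onExit);
    std::set_terminate(onTerminate);

    // ... load the data set, build the tree, copy into the vector ...
    return 0;
}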
What does your memory model look like? Are you banging up against an index limit (i.e. the maximum value of an int)?
As it turns out, our hardware is reaching its limit. The program was hitting the system's memory limit and failing miserably. We couldn't even see the error statements being produced until I hooked cerr into a file from the command line (thanks starko). Thanks for all the helpful suggestions!
Sounds like your program is throwing an exception that you are not catching. The boost test framework has some exception handlers that could be a quick way to localise the exception location.
Are there indices in the tree structure that could overflow? Are you using indexes into the vector that are beyond the current size of the vector?
std::vector<int> v;   // empty vector
v.push_back(1);
v.push_back(2);
v[0] = 123;           // OK
v[1] = 456;           // OK
v[2] = 789;           // Uh oh, index 2 is outside the vector (its size is only 2)
How large is your largest input set? Do you end up allocating size*size elements? If so, is your largest input set larger than 65536 elements (65536 * 65536 is 2^32, which overflows a 32-bit int)?
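To illustrate the overflow being hinted at (the numbers here are just examples):

#include <cstdint>
#include <cstdio>

int main()
{
    int32_t n = 70000;                               // the element count itself fits easily in an int
    // int32_t bad = n * n;                          // ~4.9e9: overflows a 32-bit int (undefined behaviour)
    int64_t elements = static_cast<int64_t>(n) * n;  // widen before multiplying

    std::printf("n*n = %lld elements\n", (long long)elements);
    return 0;
}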
I agree: the most likely reason the program works fine in the IDE but not standalone is that the debugger is wiping memory to 0 or placing guard regions around allocated memory.
Failing anything else, is it possible to reduce the size of your data set until you find exactly the size that works, and a slightly larger example that fails? That might be informative.
I'd recommend trying to determine approximately which line of your code causes the crash.
Since this only happens outside your IDE, you can use OutputDebugString to output the current position, and watch it with DebugView.
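A small sketch of that idea (the trace() helper and the message text are my own, not part of any API): sprinkle markers through the code and watch them appear in DebugView while the standalone build runs.

#include <windows.h>
#include <string>

void trace(const char* where)
{
    std::string msg = std::string("myapp: reached ") + where + "\n";
    OutputDebugStringA(msg.c_str());   // shows up in DebugView even without a debugger attached
}

// Usage, scattered through the suspect code paths:
//   trace("finished building tree");
//   trace("copying into vector");
//   trace("handing data to OpenGL");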
Really, the behavior of a program compiled for debug can be completely different inside and outside the IDE; a different set of runtime libraries may be used when the program is loaded from the IDE.
Recently I was bitten by a timing bug in my code: when debugging from the IDE the timing was always good and the bug was not observed, but in release mode, bam, the bug was there. This kind of bug is a real PITA to debug.