What follows is the part of my kernel that does not behave properly, then an explanation of what I've found while debugging.
__global__ void Mangler(float *matrix, int *map)
{
    __shared__ signed int localMap[N];

    if (0 == threadIdx.x)
    {
        for (int i = 0; i < N; i++)
            localMap[i] = -1;
    }
    __syncthreads();

    int fn = ...; // a lot of code goes into this number, skipped for clarity
    int rnumber = threadIdx.x;

    int X = atomicCAS(&localMap[fn], -1, rnumber); // Spot of bother 1
    if (X == -1) // Spot of bother 2
    {
        // some code
    }
    else
    {
        // other code
    }
}
I've found in the documentation that atomicCAS(*address, compare, value) basically returns (and saves to the given address) the result of (old == compare ? value : old), where old is the value at the address before executing the function.
Going with this, I believe that executing int X = atomicCAS(&localMap[fn], -1, rnumber); should have two possible outcomes (according to the NVIDIA CUDA C Programming Guide):
if localMap[fn] == -1 then X should have a value of rnumber and localMap[fn] should have a value of rnumber. This does not happen.
if localMap[fn] != -1 then X should be set to the value of localMap[fn] and said value should be left intact.
What happens instead, as debugging with NSight has shown me, is that X is being assigned -1, while localMap[fn] is being assigned the value of rnumber. I do not understand that, but as you can see in my code, I've changed the if to catch this situation.
Which brings me to spot of bother number 2: though NSight shows the value of X as -1, the if {} block is being completely skipped (no breakpoints within it are ever hit) and execution jumps straight to the else.
My questions:
Do I misunderstand atomicCAS completely? (Edit: yes, I did.)
What could cause an if that should evaluate as true to jump straight into else in device code?
I'm using NVidia CUDA 5.5, Visual Studio 2012 x64 on Windows 8, NVidia Nsight Monitor Visual Studio Edition 3.1. The GPU for the machine is NVidia GeForce GTX 550 Ti.
I've tried changing the syntax to if(X!=-1); the true branch of the if is still not being executed.
From the docs, atomicCAS returns the old value. That means both outcomes in your list are wrong: X will always be set to the old value of localMap[fn], regardless of what that value was. What is set according to the comparison with -1 is the new value of localMap[fn]: when the old value is -1, it is set to rnumber; otherwise it is left intact.
So the behaviour you see with the values of X, rnumber and localMap is as expected.
I cannot help with your second problem, as I don't use NSight and don't know how it works. According to your code, your true branch should be evaluated. But be careful: your false branch will be evaluated as well, since the code is multithreaded and some threads can have the condition evaluate to true while others have it evaluate to false. My guess/assumption is that you must somehow tell the debugger which thread/warp/block you want to inspect, and you were looking at one where the condition was false.
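To illustrate these semantics, here is a minimal sketch (hypothetical kernel and buffer names, not the asker's code) of the usual claim-a-slot pattern: the thread that wins the race is exactly the one that sees the old value, -1, come back from atomicCAS.

__global__ void ClaimSlot(int *slot, int *winners)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // atomicCAS returns the value that was at *slot *before* the call.
    // Exactly one thread observes -1 and thereby knows it claimed the slot;
    // every other thread observes the winner's id instead.
    int old = atomicCAS(slot, -1, tid);
    if (old == -1)
        winners[tid] = 1; // this thread won; *slot now holds tid
    else
        winners[tid] = 0; // slot was already claimed by thread 'old'
}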
I'm using:
GeForce GTX 1080 Ti, which has compute capability 6.1.
OpenCV 3.2 (built for VS2013, x64, Release and Debug configurations separately).
CUDA 8.0.
Visual Studio 2013, Release and Debug configurations of the x64 platform.
My purpose is to process part of the entire input image.
The image part is specified by an upper-left coordinate plus a width and a height.
Problem description:
An "invalid configuration argument" CUDA error is raised only when I run the Release output in standalone mode (without debugging) via the Visual Studio DEBUG menu (Ctrl+F5).
If I run the same Release executable via the VS Debug menu (F5), the error isn't raised.
Also, when I run the output of the Debug configuration generated from the same application code, both F5 and Ctrl+F5 work properly and the error isn't raised.
Here is my code:
struct sRect
{
    unsigned int m_StartRow;
    unsigned int m_StartCol;
    unsigned int m_SizeRows;
    unsigned int m_SizeCols;
};

__global__ void CleanNoisePreparation(unsigned char *SrcImage, size_t iStep, const sRect ImageSlice)
{
    int iXPos = threadIdx.x + blockIdx.x * blockDim.x;
    int iYPos = threadIdx.y + blockIdx.y * blockDim.y;

    if (!(iXPos < ImageSlice.m_SizeCols && iYPos < ImageSlice.m_SizeRows))
        return;

    /* If the pixel value is less than or equal to 127, set it to black (0); otherwise set it to white (255) */
    SrcImage[iYPos * iStep + iXPos] = (SrcImage[iYPos * iStep + iXPos] <= (unsigned char)127) ? ((unsigned char)0) : ((unsigned char)255);
}
void PerformCleanNoisePreparationOnGPU(cv::cuda::GpuMat &Image,
                                       const sRect &ImageSlice,
                                       const dim3 &dimGrid,
                                       const dim3 &dimBlock,
                                       const cudaStream_t &Stream)
{
    /* Calculate the required start address based on the requested image slice */
    unsigned char *pImageData = (unsigned char *)(Image.data + ImageSlice.m_StartRow * Image.step + ImageSlice.m_StartCol);

    CleanNoisePreparation<<<dimGrid, dimBlock, 0, Stream>>>(pImageData, Image.step, ImageSlice);
    CUDA(cudaGetLastError());
}
int main()
{
    sRect ResSliceParams;
    ResSliceParams.m_StartRow = 0;
    ResSliceParams.m_StartCol = 4854;
    ResSliceParams.m_SizeRows = 7096;
    ResSliceParams.m_SizeCols = 5146;

    cv::cuda::GpuMat MyFrame(cv::Size(10000, 7096), CV_8U);
    // Image step size is 10240

    dim3 dimBlock(32, 32, 1);
    dim3 dimGrid(161, 222, 1);

    cudaStream_t cudaStream;
    cudaStreamCreateWithFlags(&cudaStream, cudaStreamNonBlocking);

    PerformCleanNoisePreparationOnGPU(MyFrame,
                                      ResSliceParams,
                                      dimGrid,
                                      dimBlock,
                                      cudaStream);
}
The error is also raised when:
The kernel is totally empty (all lines commented out).
The kernel input list is empty.
The default stream is used instead of a specific stream.
The source of the problem was found.
Because the problem appeared only when executing my application under Release without debugging, I could only use print commands to learn the variable values and the actual flow of the code.
This way I found that dimGrid.y was mistakenly set to a negative value in this execution mode only; under all other execution modes it was the positive value I expected.
Due to this negative value, CUDA raised the "invalid configuration argument" error.
More details:
I have code that calculates the required dimGrid values based on the input image resolution and on whether the image is portrait or landscape.
I use a bool class member to hold this indication and pass its value to several subclasses, via the member initialization list of the main class that contains them all as members.
It turned out that only in Release-without-debugging mode was the bool false instead of true (true represents landscape mode) in the scope of the subclasses, in contrast to its value in the scope of the main class.
I verified that the bool was initialized to true (in the member initialization list) before being passed to the subclass constructors, but because class members are initialized in their declaration order rather than in the order they appear in the initialization list, an uninitialized value was passed to them.
On my system, only in Release-without-debugging mode did an uninitialized bool get the value 0; in all other execution modes it got a positive value.
When an if condition is evaluated on an uninitialized bool, 0 translates to false, but any nonzero value translates to true.
This caused a wrong calculation of the dimGrid values.
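A minimal sketch of the initialization-order pitfall described above (hypothetical class names, not the original code): members are initialized in declaration order, not in the order they appear in the member initialization list, so m_Sub is constructed before m_IsLandscape has been set and reads an indeterminate value.

struct Sub
{
    bool m_Flag;
    explicit Sub(bool flag) : m_Flag(flag) {}
};

struct Main
{
    Sub  m_Sub;          // declared first, so constructed first
    bool m_IsLandscape;  // declared second, so initialized second

    // The list below *looks* like m_IsLandscape is set before m_Sub, but
    // initialization follows declaration order: m_Sub(m_IsLandscape) runs
    // first and reads an uninitialized bool (undefined behavior).
    Main() : m_IsLandscape(true), m_Sub(m_IsLandscape) {}
};

Compilers can flag this (for example, GCC and Clang with -Wreorder); the fix is to declare m_IsLandscape before m_Sub, or to pass the value explicitly instead of reading another member.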
I need an implementation of upper_bound as described in the STL for my Metal compute kernel. Since there is nothing like it in the Metal standard library, I essentially copied it from <algorithm> into my shader file like so:
static device float* upper_bound( device float* first, device float* last, float val)
{
    ptrdiff_t count = last - first;
    while( count > 0){
        device float* it = first;
        ptrdiff_t step = count/2;
        it += step;
        if( !(val < *it)){
            first = ++it;
            count -= step + 1;
        }else count = step;
    }
    return first;
}
I created a simple kernel to test it like so:
kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    device float* where = upper_bound( input, input + 5, 3.1);
    output[0] = where - input;
}
For this test, the input size and search value are hardcoded; I also hardcoded a 5-element input buffer on the framework side, as you'll see below. I expect this kernel to return the index of the first input element greater than 3.1.
It doesn't work. In fact, output[0] is never written; I preloaded the buffer with a magic number to see whether it gets overwritten, and it doesn't. After waitUntilCompleted, commandBuffer.error looks like this:
Error Domain = MTLCommandBufferErrorDomain
Code = 1
NSLocalizedDescription = "IOAcceleratorFamily returned error code 3"
What does error code 3 mean? Did my kernel get killed before it had a chance to finish?
Further, I tried just a linear search version of upper_bound like so:
static device float* upper_bound2( device float* first, device float* last, float val)
{
    while( first < last && *first <= val)
        ++first;
    return first;
}
This one works (sort of). I have the same problem with a binary-search lower_bound copied from <algorithm>: the naive linear version works (sort of), but the binary-search version doesn't. BTW, I tested my STL-copied versions from straight C code (with device removed, obviously) and they work fine outside of shader-land. Please tell me I'm doing something wrong and this is not a Metal compiler bug.
Now about that "sort of" above: the linear search versions work on a 5s and a mini 2 (both A7s), returning index 3 in the example above, but on a 6+ (A8) the result is the right answer + 2^31. What the heck! Same exact code. Note that on the framework side I use uint32_t and on the shader side I use uint, which are the same thing. Note also that every pointer subtraction (ptrdiff_t is a signed 8-byte type) yields a small non-negative value. Why is the 6+ setting that high-order bit? And of course, why don't my real binary search versions work?
Here is the framework side stuff:
id<MTLFunction> upperBoundTestKernel = [_library newFunctionWithName: @"upper_bound_test"];
id<MTLComputePipelineState> upperBoundTestPipelineState = [_device
    newComputePipelineStateWithFunction: upperBoundTestKernel
    error: &err];

float sortedNumbers[] = {1., 2., 3., 4., 5.};
id<MTLBuffer> testInputBuffer = [_device
    newBufferWithBytes: (const void *)sortedNumbers
    length: sizeof(sortedNumbers)
    options: MTLResourceCPUCacheModeDefaultCache];
id<MTLBuffer> testOutputBuffer = [_device
    newBufferWithLength: sizeof(uint32_t)
    options: MTLResourceCPUCacheModeDefaultCache];
*(uint32_t *)testOutputBuffer.contents = 42; // magic number better get clobbered

id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState: upperBoundTestPipelineState];
[commandEncoder setBuffer: testInputBuffer offset: 0 atIndex: 0];
[commandEncoder setBuffer: testOutputBuffer offset: 0 atIndex: 1];
[commandEncoder
    dispatchThreadgroups: MTLSizeMake(1, 1, 1)
    threadsPerThreadgroup: MTLSizeMake(1, 1, 1)];
[commandEncoder endEncoding];

[commandBuffer commit];
[commandBuffer waitUntilCompleted];
uint32_t answer = *(uint32_t *)testOutputBuffer.contents;
Well, I've found a solution/work-around. I guessed it was a pointer-aliasing problem since first and last pointed into the same buffer. So I changed them to offsets from a single pointer variable. Here's a re-written upper_bound2:
static uint upper_bound2( device float* input, uint first, uint last, float val)
{
    while( first < last && input[first] <= val)
        ++first;
    return first;
}
And a re-written test kernel:
kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    output[0] = upper_bound2( input, 0, 5, 3.1);
}
This worked, completely. That is, not only did it fix the "sort-of" problem for the linear search, but a similarly rewritten binary search worked too. I don't want to believe this, though. The Metal shading language is supposed to be a subset of C++, yet standard pointer semantics don't work? Can I really not compare or subtract pointers?
Anyway, I don't recall seeing any docs saying there can be no pointer aliasing, or what declaration incantation would help me here. Any more help?
[UPDATE]
For the record, as pointed out by "slime" on Apple's dev forum:
https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/func-var-qual/func-var-qual.html#//apple_ref/doc/uid/TP40014364-CH4-SW3
"Buffers (device and constant) specified as argument values to a graphics or kernel function cannot be aliased—that is, a buffer passed as an argument value cannot overlap another buffer passed to a separate argument of the same graphics or kernel function."
But it's also worth noting that upper_bound() is not a kernel function and upper_bound_test() is not passed aliased arguments. What upper_bound_test() does do is create a local temporary that points into the same buffer as one of its arguments. Perhaps the docs should say what it means, something like: "No pointer aliasing to device and constant memory in any function is allowed including rvalues." I don't actually know if this is too strong.
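For completeness, here is what the similarly rewritten binary search mentioned above might look like; this is my reconstruction under the same offsets-instead-of-pointers workaround, not the poster's exact code.

static uint upper_bound_idx( device float* input, uint first, uint last, float val)
{
    uint count = last - first;
    while( count > 0){
        uint step = count/2;
        uint it = first + step;
        // same logic as the <algorithm> version, but indexing a single base pointer
        if( !(val < input[it])){
            first = it + 1;
            count -= step + 1;
        }else count = step;
    }
    return first;
}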
I have an if statement that is currently never executed; however, if I print something to the screen inside it, the program takes over ten times longer to run than if I merely declare a variable there. From a bit of research online, this seems to be some kind of branch prediction issue. Is there anything I can do to improve the program's speed?
Basically, both myTest and myTest_new return the same thing, except one is a macro and one is a function. I am just monitoring the time it takes bitTest to execute: it runs in 3 seconds with just a declaration in the if statement, but takes over a minute when a Serial.print is in the if statement, even though neither branch body is ever executed.
void bitTest()
{
    int count = 0;
    Serial1.println("New Test");
    int lastint = 0;
    Serial1.println("int");
    for (int index = -2147483647; index <= 2147483647; index += 1000) {
        if (index <= 0 && lastint > 0) {
            break;
        }
        lastint = index;
        for (int num = 0; num <= 31; num++) {
            ++1000;
            int vcr1 = myTest(index, num);
            int vcr2 = myTest_new(index, num);
            if (vcr1 != vcr2) {
                Serial1.println("Test"); // leave this println() and it takes 300 seconds for the test to run
                //int x = 0;
            }
        } // for (num)
    } // for (index)
    Serial1.print("count = ");
    Serial1.println(count);
    return;
}
It is much less likely to be caused by branch prediction (branch prediction shouldn't be influenced by what you do inside your code) than by the fact that
{
int x = 0;
}
simply does nothing, because the scope of x ends at }, so the compiler simply ditches the whole if clause, including the check. Note that this is only possible because the expression that the if checks has no side effects, and neither does the block that would get executed.
By the way, the code you showed would usually be "compiled away" directly, because the compiler can determine at compile time whether the if clause could ever be executed, unless you explicitly tell it to omit such safe optimizations. Hence, I somewhat doubt your "10 times as slow" measurement: either the code you're showing isn't the actual example demonstrating this, or you should turn on compiler optimization before doing performance comparisons.
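To see the effect described above, one can give the branch an observable side effect that the compiler is not allowed to discard. A minimal sketch (my own illustration, using a volatile variable):

volatile int sink = 0;  // volatile: every store is an observable side effect

void branch_kept(int vcr1, int vcr2)
{
    if (vcr1 != vcr2) {
        sink = 1;  // cannot be optimized away, so the comparison must stay
    }
    // With a plain { int x = 0; } body instead, the whole if, including
    // the comparison, can legally be removed by the compiler.
}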
The reason why your program takes forever is that it's buggy:
for (int index = -2147483647; index <= 2147483647; index+=1000) {
Put simply: since index <= 2147483647 can never be false for a 32-bit int, the loop can only end when index wraps around near the maximum integer value (signed overflow, which is itself undefined behavior). There's no "correct" way for your program to terminate; hence you invented your strange lastint > 0 check.
Now, fix up the loop. (I mean, you're really just using every 1000th value, so why not simply loop index from 0 to 2 * 2147483?)
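A minimal sketch of a wrap-around-free rewrite (my own version, keeping the original range and stride by iterating a wider counter):

// Walk every 1000th int from -2147483647 upward without signed overflow:
for (long long k = -2147483647LL; k <= 2147483647LL; k += 1000) {
    int index = (int)k;  // k always fits in an int here, so no wrap-around
    // ... compare myTest(index, num) with myTest_new(index, num) here ...
}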
++1000;
should be illegal in C, because you cannot increment a constant numeral. This is very much a WTF.
All in all, your program is a mess. Rewrite it, and debug a clean, well-defined version of it.
I appear to have coded a class that travels backwards in time. Allow me to explain:
I have a function, OrthogonalCamera::project(), that sets a matrix to a certain value. I then print out the value of that matrix, as such.
cam.project();
std::cout << "My Projection Matrix: " << std::endl << ProjectionMatrix::getMatrix() << std::endl;
cam.project() pushes a matrix onto ProjectionMatrix's stack (I am using the std::stack container), and ProjectionMatrix::getMatrix() just returns the stack's top element. If I run just this code, I get the following output:
2 0 0 0
0 7.7957 0 0
0 0 -0.001 0
-1 -1 -0.998 1
But if I run the code with these two lines after the std::cout call
float *foo = new float[16];
Mat4 fooMatrix = foo;
Then I get this output:
2 0 0 0
0 -2 0 0
0 0 -0.001 0
-1 1 -0.998 1
My question is the following: what could I possibly be doing such that code executed after I print a value changes the value being printed?
Some of the functions I'm using:
static void load(Mat4 &set)
{
    if (ProjectionMatrix::matrices.size() > 0)
        ProjectionMatrix::matrices.pop();
    ProjectionMatrix::matrices.push(set);
}

static Mat4 &getMatrix()
{
    return ProjectionMatrix::matrices.top();
}
and
void OrthogonalCamera::project()
{
    Mat4 orthProjection = { { 2.0f / (this->r - this->l), 0, 0, -1 * ((this->r + this->l) / (this->r - this->l)) },
                            { 0, 2.0f / (this->t - this->b), 0, -1 * ((this->t + this->b) / (this->t - this->b)) },
                            { 0, 0, -2.0f / (this->farClip - this->nearClip), -1 * ((this->farClip + this->nearClip) / (this->farClip - this->nearClip)) },
                            { 0, 0, 0, 1 } }; // this is apparently the projection matrix for an orthographic projection
    orthProjection = orthProjection.transpose();
    ProjectionMatrix::load(orthProjection);
}
EDIT: whoever formatted my code, thank you. I'm not really too good with the formatting here, and it looks much nicer now :)
FURTHER EDIT: I have verified that the initialization of fooMatrix is running after I call std::cout.
UPTEENTH EDIT: Here is the function that initializes fooMatrix:
typedef Matrix<float, 4, 4> Mat4;

template<typename T, unsigned int rows, unsigned int cols>
Matrix<T, rows, cols>::Matrix(T *set)
{
    this->matrixData = new T*[rows];
    for (unsigned int i = 0; i < rows; i++)
    {
        this->matrixData[i] = new T[cols];
    }

    unsigned int counter = 0; // because I was too lazy to use set[(i * cols) + j]
    for (unsigned int i = 0; i < rows; i++)
    {
        for (unsigned int j = 0; j < cols; j++)
        {
            this->matrixData[i][j] = set[counter];
            counter++;
        }
    }
}
g64th EDIT: This isn't just an output problem. I actually have to use the value of the matrix elsewhere, and its value aligns with the described behaviours (whether or not I print it).
TREE 3rd EDIT: Running it through the debugger gave me yet another value:
-7.559 0 0 0
0 -2 0 0
0 0 -0.001 0
1 1 -0.998 1
a(g64, g64)th EDIT: the problem does not exist when compiling on Linux, only on Windows with MinGW. Could it be a compiler bug? That would make me sad.
FINAL EDIT: It works now. I don't know what I did, but it works. I've made sure I was using an up-to-date build that didn't have the code that ensures causality still functions, and it works. Thank you for helping me figure this out, Stack Overflow community. As always, you've been helpful and tolerant of my slowness. I'll be hypervigilant for any undefined behaviours or pointer screw-ups that can cause this unpredictability.
You're not writing your program instruction by instruction. You are describing its behavior to a C++ compiler, which then tries to express the same in machine code.
The compiler is allowed to reorder your code, as long as the observable behavior does not change.
In other words, the compiler is almost certainly reordering your code. So why does the observable behavior change?
Because your code exhibits undefined behavior.
Again, you are writing C++ code. C++ is a standard, a specification saying what the "meaning" of your code is. You're working under a contract that "As long as I, the programmer, write code that can be interpreted according to the C++ standard, then you, the compiler, will generate an executable whose behavior matches that of my source code".
If your code does anything not specified in this standard, then you have violated this contract. You have fed the compiler code whose behavior can not be interpreted according to the C++ standard. And then all bets are off. The compiler trusted you. It believed that you would fulfill the contract. It analyzed your code and generated an executable based on the assumption that you would write code that had a well-defined meaning. You did not, so the compiler was working under a false assumption. And then anything it builds on top of that assumption is also invalid.
Garbage in, garbage out. :)
Sadly, there's no easy way to pinpoint the error. You can carefully study every piece of your code, or you can try stepping through the offending code in the debugger. Or break into the debugger at the point where the "wrong" value is seen, and study the disassembly and how you got there.
It's a pain, but that's undefined behavior for you. :)
Analysis tools may help: Valgrind on Linux; in Visual Studio, the /analyze switch (available depending on your version); Clang has a similar option built in.
What is your compiler? If you are compiling with gcc, try turning on thorough and verbose warnings. If you are using Visual Studio, set your warnings to /W4 and treat all warnings as errors.
Once you have done that and it still compiles and the bug still exists, run the program through Valgrind. It is likely that at some earlier point in your program you read or write past the end of some array, and that stray write is overwriting what you're trying to print. Then, when you put more things on the stack, the out-of-bounds access lands in a completely different location in memory, so you overwrite something else instead. Valgrind was made to catch exactly that kind of bug.
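A minimal sketch of the kind of bug being described (my own illustration, not the poster's code), which Valgrind's memcheck reports as an invalid write:

#include <iostream>

int main()
{
    float *matrix = new float[16]();  // valid indices are 0..15
    matrix[16] = -2.0f;               // off-by-one write lands in whatever
                                      // happens to sit after the array
    std::cout << matrix[5] << std::endl;
    delete[] matrix;
    // Running this under "valgrind ./a.out" flags the store as
    // "Invalid write of size 4" with a stack trace to this line.
}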
I am trying to implement the quickhull algorithm (for convex hulls) in parallel in CUDA. It works correctly for input_size <= 1 million. When I try 10 million points, the program crashes. My graphics card has 1982 MB of memory, and all the data structures in my algorithm collectively require no more than 600 MB for this input size, which is less than 50% of the available space.
By commenting out lines of my kernels, I found that the crash occurs when I try to access an array element, even though the index of the element I am trying to access is not out of bounds (double-checked). The following is the kernel code where it crashes.
for (unsigned int i = old_setIndex; i < old_setIndex + old_setS[tid]; i++)
{
    int pI = old_set[i];
    if (pI <= -1 || pI > pts.size())
    {
        printf("Thread %d: i = %d, pI = %d\n", tid, i, pI);
        continue;
    }
    p = pts[pI];
    double d = distance(A, B, p);
    if (d > dist)
    {
        dist = d;
        furthestPoint = i;
        fpi = pI;
    }
}
//fpi = old_set[furthestPoint];
//printf("Thread %d: Furthestpoint = %d\n", tid, furthestPoint);
My code crashes when I uncomment the statements (the array access and the printf) after the for loop. I am unable to explain the error, as furthestPoint is always within the bounds of the old_set array. old_setS stores the sizes of the smaller arrays that each thread operates on. It crashes even if I just try to print the value of furthestPoint (the last line) without the array access statement above it.
There's no problem with the above code for input sizes <= 1 million. Am I overflowing some buffer on the device in the case of 10 million points?
Please help me in finding the source of the crash.
There is no out of bounds memory access in your code (or at least not one which is causing the symptoms you are seeing).
What is happening is that your kernel is being killed by the display driver because it is taking too long to execute on your display GPU. All CUDA platform display drivers include a time limit for any operation on the GPU. This exists to prevent the display from freezing long enough that either the OS kernel panics or the user panics and thinks the machine has crashed. On the Windows platform you are using, the time limit is about 2 seconds.
What has partly misled you into thinking the source of the problem is array addressing is that commenting out code makes the problem disappear. But what really happens there is an artifact of compiler optimization: when you comment out a global memory write, the compiler recognizes that the calculations which lead to the stored value are unused, and it removes all of that code from the assembly it emits (google "nvcc dead code removal" for more information). That makes the code run much faster and brings it under the display driver's time limit.
For workarounds, see this recent Stack Overflow question and answer.
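As a general illustration (my own sketch, not necessarily the workaround the linked answer describes), one common approach is to split the work into several shorter kernel launches so that no single launch comes close to the watchdog limit:

#include <algorithm>

// Hypothetical per-chunk kernel standing in for the real per-point work.
__global__ void ProcessChunk(float *data, int start, int count)
{
    int i = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < start + count)
        data[i] *= 2.0f;
}

void ProcessAll(float *d_data, int n)
{
    const int chunk = 1 << 20;  // tune so one launch stays well under ~2 s
    for (int start = 0; start < n; start += chunk) {
        int count = std::min(chunk, n - start);
        ProcessChunk<<<(count + 255) / 256, 256>>>(d_data, start, count);
        cudaDeviceSynchronize();  // yield to the driver between launches
    }
}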