Problem: CUDA Naive sum reduction, but not desired result [duplicate] - c++

Hello, I want to find the sum of the elements of an array using CUDA.
__global__ void countZeros(int *d_A, int *B)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    B[0] = B[0] + d_A[index];
}
So in the end, B[0] is supposed to contain the sum of all the elements, but I noticed that B[0] is zeroed out every time, so in the end it contains only the last element.
Why does B[0] become zero every time?

All of the threads are writing to B[0], and some may be attempting to write simultaneously. This line of code:
B[0] = B[0] + d_A[index];
requires a read and a write of B[0]. If multiple threads are doing this at the same time, you will get strange results.
You can make a simple fix by doing this:
atomicAdd(B, d_A[index]);
and you should get sensible results (assuming there are no errors elsewhere in the code you haven't shown). Be sure to initialize B[0] to a known value before calling this kernel.
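Putting the pieces together, here is a minimal sketch of the fixed kernel; the kernel name sumKernel and the bounds parameter n are additions for illustration, so the grid may over-cover the array:

// Sketch: d_B[0] must be zeroed (e.g. with cudaMemset) before the launch.
// The n parameter is an added bounds check, not in the original code.
__global__ void sumKernel(const int *d_A, int *d_B, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        atomicAdd(d_B, d_A[index]);  // safe read-modify-write of d_B[0]
}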
If you want to do this efficiently, however, you should study the CUDA parallel reduction sample or just use CUB.
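For orientation, a block-level shared-memory tree reduction looks roughly like the following sketch; the names blockReduce and d_partial are illustrative, not the sample's actual code:

// Each block produces one partial sum; combine the partials in a
// second pass (or with atomicAdd). BLOCK must match blockDim.x and
// be a power of two.
template <int BLOCK>
__global__ void blockReduce(const int *d_A, int *d_partial, int n)
{
    __shared__ int s[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * BLOCK + tid;
    s[tid] = (i < n) ? d_A[i] : 0;      // load one element per thread
    __syncthreads();
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride)               // pairwise-sum the two halves
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        d_partial[blockIdx.x] = s[0];   // one partial sum per block
}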
And be sure to use proper CUDA error checking any time you are having trouble with CUDA code.
So if you still can't get sensible results, please instrument your code with proper CUDA error checking before asking "I made this change but it still doesn't work, why?" I can't tell you why, because this snippet is the only code you've shown.
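As an illustration, a common error-checking pattern looks something like this (a sketch; the macro name CUDA_CHECK is arbitrary):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(1);                                                 \
        }                                                            \
    } while (0)

// usage: CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));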

Related

Understanding the use of memset in CUDA device code

I have a linear int array arr in CUDA global memory. I want to set sub-arrays of arr to defined values. The start index of each sub-array is given by the starts array, while the length of each sub-array is given by the counts array.
What I want to do is set sub-array i, which starts at starts[i] and spans counts[i] elements, to the value starts[i]. That is, the operation is:
arr[starts[i]: starts[i]+counts[i]] = starts[i]
I thought of using memset() in the kernel to set the values. However, the values are not being written correctly (the array elements are being assigned seemingly random values). The code I am using is:
#include <stdlib.h>

__global__ void kern(int *starts, int *counts, int *arr, int *numels)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= numels[0])
        return;
    const int val = starts[idx];
    memset(&arr[val], val, sizeof(arr[0]) * counts[idx]);
    __syncthreads();
}
Please note that numels[0] contains the number of elements in the starts array.
I have checked the code with cuda-memcheck but didn't get any errors. I am using PyCUDA, if that's relevant. I am probably misunderstanding the usage of memset here, as I am still learning CUDA.
Can you please suggest a way to correct this, or another efficient way of doing this operation?
P.S.: I know that thrust::fill() could probably do this well, but since I am learning CUDA, I would like to know how to do this without using external libraries.
The memset and memcpy implementations in CUDA device code emit simple, serial, byte-wise operations (and note that memset can't set anything other than byte values, which is probably contributing to the problem you see if the values you are trying to set are larger than 8 bits).
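To see why byte-wise filling breaks this, consider what memset does with an int value; a minimal host-side illustration:

#include <cstdio>
#include <cstring>

int main()
{
    int x;
    memset(&x, 5, sizeof(x));  // writes the byte 0x05 into all 4 bytes
    printf("%d\n", x);         // prints 84215045 (0x05050505), not 5
    return 0;
}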
You could replace the memset call with something like this:
const int val = starts[idx];
//memset(&arr[val], val, sizeof(arr[0])*counts[idx]);
for (int i = 0; i < counts[idx]; i++)
    arr[val + i] = val;
The performance of that code will probably be better than the built-in memset.
Note also that the __syncthreads() call at the end of your kernel is both unnecessary and a potential source of deadlock, and should be removed. See here for more information.
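For completeness, the whole corrected kernel would then look something like this (a sketch combining the two fixes above):

__global__ void kern(int *starts, int *counts, int *arr, int *numels)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= numels[0])
        return;
    const int val = starts[idx];
    for (int i = 0; i < counts[idx]; i++)  // per-thread fill, full int writes
        arr[val + i] = val;
    // no trailing __syncthreads()
}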

How to store bool result of a CUDA kernel function

Assume that we have 2^10 CUDA cores and 2^20 data points. I want a kernel that will process these points and provide true/false for each of them, so I will end up with 2^20 bits. Example:
bool f(int x) { return x % 2 ? true : false; }

void kernel(int* input, byte* output)  // pseudocode sketching the alternatives
{
    tidx = thread.x ...
    output[tidx] = f(input[tidx]);
    ...or...
    sharedarr[tidx] = f(input[tidx]);
    sync()
    output[blockidx] = reduce(sharedarr);
    ...or...
    atomic_result |= f(input[tidx]) << tidx;
    sync(..)
    output[blockidx] = atomic_result;
}
Thrust/CUDA has some algorithms, such as "partition" and "transform", that provide similar alternatives.
My question is: when I write the relevant CUDA kernel with a predicate that produces the corresponding bool result,
should I use one byte for each result and directly store the results in the output array, performing the calculation in one step and the reduction/partitioning in a later step?
should I compact the output in shared memory, using one byte for 8 threads, and then at the end write the result from shared memory to the output array?
should I use atomic variables?
What's the best way to write such a kernel, and what is the most logical data structure for keeping the results? Is it better to use more memory and simply do more writes to main memory, instead of trying to compact the result before writing it back to the result memory area?
There is no tradeoff between speed and data size when using the __ballot() warp-voting intrinsic to efficiently pack the results.
Assuming that you can redefine output to be of uint32_t type, and that your block size is a multiple of the warp size (32), you can simply store the packed output using
output[tidx / warpSize] = __ballot(f(input[tidx]));
Note that this makes all threads of the warp try to store the result of __ballot() to the same location. Only one of those stores will succeed, but as their results are all identical, it does not matter which one does.
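A minimal self-contained sketch of such a kernel follows; note that on CUDA 9 and later the intrinsic is __ballot_sync with an explicit participation mask, and the kernel and predicate names here are illustrative:

#include <cstdint>

__global__ void packPredicate(const int *input, uint32_t *output, int n)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    bool pred = (tidx < n) && (input[tidx] % 2 != 0);  // the predicate f()
    uint32_t bits = __ballot_sync(0xFFFFFFFFu, pred);  // one bit per lane
    if (tidx < n)
        output[tidx / 32] = bits;  // all lanes store the same packed word
}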

Can reducing loop times in C++ codes help increase the speed?

I give the following example to illustrate my question:
#include <iostream>

void fun(int i, float *pt)
{
    // do something based on i
    std::cout << *(pt + i) << std::endl;
}
const unsigned int LOOP = 2000000007;

void fun_without_optimization()
{
    float *example = new float[LOOP];
    for (unsigned int i = 0; i < LOOP; i++)
    {
        fun(i, example);
    }
    delete[] example;
}
void fun_with_optimization()
{
    float *example = new float[LOOP];
    unsigned int unit_loop = LOOP / 10;
    unsigned int left_loop = LOOP % 10;
    float *pt = example;
    for (unsigned int i = 0; i < unit_loop; i++)
    {
        fun(0, pt);
        fun(1, pt);
        fun(2, pt);
        fun(3, pt);
        fun(4, pt);
        fun(5, pt);
        fun(6, pt);
        fun(7, pt);
        fun(8, pt);
        fun(9, pt);
        pt = pt + 10;
    }
    delete[] example;
}
As far as I understand, fun_without_optimization() and fun_with_optimization() should perform the same work. The only argument I can see for why the second function is better than the first is that the pointer calculation in fun becomes simpler. Are there any other arguments why the second function is better?
Unrolling a loop in which I/O is performed is like moving the landing strip for a B747 flight from London an inch eastward at JFK.
Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?
Manually unrolling a loop is error-prone, as is clearly illustrated by your code: you forgot to process the tail left_loop (see the sketch after this list for unrolling with proper tail handling).
Compilers have been doing this optimization for you for at least a couple of decades.
How do you know the optimal number of iterations to put in the unrolled loop? Do you target a specific cache size and calculate the length of the assembly instructions in bytes? The compiler might.
Messing with the otherwise clean loop can prevent other optimizations, like the use of SIMD.
The bottom line is: if you know something that your compiler doesn't (a specific pattern in the run-time data, details of the targeted execution environment, etc.), and you know what you are doing, you can try manual loop unrolling. But even then - profile.
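For illustration, here is a minimal sketch of manual unrolling done with proper tail handling and multiple accumulators (a hypothetical example, not the questioner's code):

#include <cstddef>

float sum_unrolled(const float *a, std::size_t n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // main body, unrolled by 4
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; ++i)            // tail: the part the question forgot
        s += a[i];
    return s;
}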
The technique you describe is called loop unrolling. It can potentially increase performance, since the time spent evaluating the control structures (updating the loop variable and checking the termination condition) becomes smaller. However, decent compilers can do this for you, and the maintainability of the code decreases if it is done manually.
This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number of DALU (most commonly 4) and ALU (most commonly 2) units the architecture supports, and on the level of "parallelization" the code allows, multiple instructions can be executed in one cycle.
So this code:
for (int i = 0; i < n; i++) // n multiple of 4, for simplicity
    a += temp;              // just a random instruction
will actually execute faster on such a parallel architecture if rewritten like:
for (int i = 0; i < n; i += 4)
{
    temp0 = temp0 + temp1; // reads and additions can be executed in parallel
    temp1 = temp2 + temp3;
    a = temp0 + temp1 + a;
}
There is a limit to how much you can parallelize your code, a limit imposed by the physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.
It does not stop there: the code you want to optimize has to be a contiguous block of code, meaning no jumps (no function calls, no change-of-flow instructions), for maximum efficiency.
Writing your code like:
for (unsigned int i = 0; i < unit_loop; i++)
{
    fun(0, pt);
    fun(1, pt);
    fun(2, pt);
    fun(3, pt);
    fun(4, pt);
    fun(5, pt);
    fun(6, pt);
    fun(7, pt);
    fun(8, pt);
    fun(9, pt);
    pt = pt + 10;
}
would not do much unless the compiler inlines the function calls; and it looks like too many instructions anyway...
On a different note: while it's true that you ALWAYS have to work with the compiler when optimizing your code, you should NEVER rely only on it when you want to get the maximum optimization out of your code. Remember, the compiler handles 'the general case' while you are usually interested in a particular situation - that's why some compilers have special directives to help with the optimization process.

Different ways to access array's element

As far as I know, there are two ways to access an array's elements in C++:
int array[5]; //If we have an array of 5 integers
1) Using square brackets
array[i]
2) Using pointers
*(array+i)
My university's teacher forces me to use the *(array+i) method, telling me that "it's more optimized".
So, can you please explain: is there any real difference between them? Does the second method have any advantages over the first one?
Is one option more optimized than the other?
Well, let's look in practice at the assembler code generated by MSVC 2013 (NON-OPTIMIZED debug mode):
; 21 : array[i] = 8;
mov eax, DWORD PTR _i$[ebp]
mov DWORD PTR _array$[ebp+eax*4], 8
; 22 : *(array + i) = 8;
mov eax, DWORD PTR _i$[ebp]
mov DWORD PTR _array$[ebp+eax*4], 8
Well, with the best will in the world, I cannot see any difference in the generated code.
By the way, someone recently wrote on SO: premature optimization is the root of all evil. Your teacher should know that!
Does one have an advantage over the other?
Clearly, option 1 has the advantage of being intuitive and readable. Option 2 quickly becomes UNREADABLE in mathematical applications.
Example 1: distance of a 2D mathematical vector implemented as an array.
double v[2] = { 2.0, 1.0 };
// option 1:
double d1 = sqrt(v[0] * v[0] + v[1] * v[1]);
//option 2:
double d2 = sqrt(*v**v + *(v + 1)**(v + 1));
In fact the second option is really misleading due to the **, because you have to read the formula carefully to understand whether it's a double dereference or a multiplication by a dereferenced pointer. Not to speak of people who might be misled by other languages like Ada, in which ** means "power".
Example 2: calculation of the determinant of a 2x2 matrix
double m[2][2] = { { 1.0, 2.0 }, { 3.0, 4.0 } };
// option 1
double dt1 = m[0][0] * m[1][1] - m[1][0] * m[0][1];
// option 2
double *x = reinterpret_cast<double*>(m);
double dt2 = *x **(x+2*1+1) - *(x+2*1) * *(x+1);
With multidimensional arrays, option 2 is a nightmare. Note that:
I've used a temporary one-dimensional pointer x to be able to use the formula. Using m here would have caused misleading compilation error messages.
you have to know the precise layout of your object, and you have to introduce the size of the first dimension into every formula!
Imagine that later on you want to increase the number of elements in your 2D array. You'll have to rewrite everything!
Semantic gap
What your teacher is missing here is that the operator [] has a meaning that is well understood by both the compiler and the reader. It's an abstraction that does not depend on how your data structure is actually implemented.
Suppose you have an array and a very simple piece of code:
int w[10] {0};
... // put something in w
int sum = 0;
for (int i = 0; i < 10; i++)
    sum += w[i];
Later you decide to use a std::vector instead of an array, because you've learnt that it's much more flexible and powerful. All you have to do is change the definition (and initialisation) of w:
vector<int> w(10, 0);
The rest of your code will work, because the semantics of [] are the same for the two data structures. I'll let you imagine what would have happened if you'd used your teacher's advice...
"My university's teacher forces me to use *(array+i) method, telling me that "it's more optimized"."
What are they telling you please? If you didn't got something completely wrong with this statement1, ask them for a proof regarding the generated assembler code (#Christophe was giving one in his answer here). I don't believe they could give you such, when looking in deeper.
You can easily check this yourself using, e.g., the -S option of GCC to produce the assembler code, and compare the results achieved with one version or the other.
Any decent, modern C++ compiler will produce exactly the same assembler code for both of these statements, as long as they refer to C++ fundamental types.
"Does the second method have any advantages over the first one?"
No. If anything, the opposite, because of the less intuitive readability of the code.
1) For class/struct types there could be overloads of T& operator[](int index) that do things behind the scenes, but if so, *(array+i) should be implemented (via the corresponding operator overloads) to behave consistently.
My university's teacher forces me to use *(array+i) method, telling me that "it's more optimized".
Your teacher is absolutely wrong.
The standard defines array[i] to be equivalent to *(array+i), and there is no reason for a compiler to treat them otherwise. They are the same. Neither will be "more optimized" than the other.
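A small illustration of that equivalence, including the curious inverted form it permits:

#include <cassert>

int main()
{
    int array[5] = {10, 20, 30, 40, 50};
    int i = 2;
    // array[i] is defined as *(array + i), and addition commutes,
    // so even the odd-looking i[array] names the same element.
    assert(array[i] == *(array + i));
    assert(array[i] == i[array]);
    return 0;
}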
The only reasons to recommend one over the other are convention and readability, and in those competitions array[i] wins.
I wonder what else your teacher is getting wrong? :(

10-millisecond C++ execution time

I am trying to find out the exact execution time of a for loop with 2e6 iterations.
The following code runs in about 10 ms after being compiled with g++.
People told me that this is due to optimization done automatically by the C++ compiler, so the measured execution time is meaningless. In other words, since there is no output call such as printf or cout << for the variables a, b, and c, the optimizer makes the for loop do nothing, and that is why the program finishes in only 10 ms. Is that right? Why do they say the timing result for the for loop is meaningless?
Please advise.
int main() {
    int max = 2e6;
    int a, b, c;
    // CODE YOU WANT TO TIME
    int start = getMilliCount();
    for (int i = 0; i < max; i++) {
        a = 1234 + 5678 + i;
        b = 1234 * 5678 + i;
        c = 1234 / 2 + i;
    }
    int milliSecondsElapsed = getMilliSpan(start);
    printf("\n\nElapsed time = %u milliseconds %d\n", milliSecondsElapsed, max);
    return 0;
}
The run-time is absolutely not meaningless. It proves at least one important point: the optimizer is smarter than it is given credit for, and it is able to deduce that the loop has no side effects, so it cuts the loop out.
So even if the profiling result only proves this one thing, it does have meaning.
To address what you want:
"I am trying to find out the exact execution time of a for loop with 2e6 iterations."
The execution time of a for loop with 2e6 iterations can be 0 if there are no observable effects, or very large if there are. That's why you usually profile actual code using dedicated tools.
The compiler can change the program in any way that does not change anything observable, i.e. all outputs etc. must be exactly the same as the outputs of the un-optimized code. In your example, the compiler may notice that the values of a, b and c after the loop are never used and the loop does nothing else, so it might as well remove the loop from your program.
It could also observe that the values of the variables depend directly on max and just skip all but the last iteration.
In both cases, the result would not depend on max. It still is not meaningless; it just means that you underestimate your compiler.
Edit:
I tested this scenario with g++ -O2; the loop gets completely removed and does not run at all.
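If the goal is to actually time the loop body, one common way to keep the optimizer from removing it is to make the results observable. A minimal sketch; volatile is one of several options, along with printing the values or using compiler-specific barriers:

#include <cstdio>

int main()
{
    const int max = 2000000;           // 2e6 iterations
    volatile int a = 0, b = 0, c = 0;  // volatile forces the stores to happen
    for (int i = 0; i < max; i++) {
        a = 1234 + 5678 + i;
        b = 1234 * 5678 + i;
        c = 1234 / 2 + i;
    }
    printf("%d %d %d\n", a, b, c);     // using the results also prevents DCE
    return 0;
}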