CUDA triple nested for loop assignment - C++

I'm trying to convert C++ code into CUDA code, and I have the following triple nested for loop that fills an array for later OpenGL rendering (I'm simply creating an array of coordinate vertices):
for (int z = 0; z < 263; z++) {
    for (int y = 0; y < 170; y++) {
        for (int x = 0; x < 170; x++) {
            g_vertex_buffer_data_3[i]   = (float)x + 0.5f;
            g_vertex_buffer_data_3[i+1] = (float)y + 0.5f;
            g_vertex_buffer_data_3[i+2] = -(float)z + 0.5f;
            i += 3;
        }
    }
}
I would like these operations to run faster, so I'll use CUDA for some of them, like the one listed above. I want to create one block for each iteration of the outermost loop and, since the two inner loops make 170 * 170 = 28900 iterations in total, assign one thread to each innermost-loop iteration. I converted the C++ code into this (it's just a small program I made to understand how to use CUDA):
__global__ void mykernel(int k, float *buffer) {
    int idz = blockIdx.x;
    int idx = threadIdx.x;
    int idy = threadIdx.y;
    buffer[k]   = idx + 0.5;
    buffer[k+1] = idy + 0.5;
    buffer[k+2] = idz + 0.5;
    k += 3;
}
int main(void) {
    int dim = 3 * 170 * 170 * 263;
    float* g_vertex_buffer_data_2 = new float[dim];
    float* g_vertex_buffer_data_3;
    int i = 0;
    HANDLE_ERROR(cudaMalloc((void**)&g_vertex_buffer_data_3, sizeof(float)*dim));
    dim3 dimBlock(170, 170);
    dim3 dimGrid(263);
    mykernel<<<dimGrid, dimBlock>>>(i, g_vertex_buffer_data_3);
    HANDLE_ERROR(cudaMemcpy(&g_vertex_buffer_data_2, g_vertex_buffer_data_3, sizeof(float)*dim, cudaMemcpyDeviceToHost));
    for (int j = 0; j < 100; j++) {
        printf("g_vertex_buffer_data_2[%d]=%f\n", j, g_vertex_buffer_data_2[j]);
    }
    cudaFree(g_vertex_buffer_data_3);
    return 0;
}
When I try to launch it I get a segmentation fault. Do you know what I am doing wrong?
I think the problem is that threadIdx.x and threadIdx.y grow at the same time, while I would like threadIdx.x to be the inner index and threadIdx.y the outer one.

There is a lot wrong here, but the source of the segfault is this:
cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
You either want
cudaMemcpy(&g_vertex_buffer_data_2[0],g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
or
cudaMemcpy(g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
Once you fix that, you will notice that the kernel never actually launches, failing with an invalid launch error. This is because a block size of (170,170) is illegal: CUDA has a limit of 1024 threads per block on all current hardware.
There might well be other problems in your code. I stopped looking after I found these two.
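To address the indexing concern as well, here is a minimal sketch (my own, not part of the original answer) of one way to restructure the kernel within the 1024-threads-per-block limit: flatten the whole 263 x 170 x 170 domain onto a 1D grid, give each thread one (x, y, z) cell, and derive the buffer offset from the thread's own index instead of a shared k counter:
__global__ void fill_vertices(float *buffer) {
    const int NX = 170, NY = 170, NZ = 263;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per cell
    if (idx >= NX * NY * NZ) return;
    int x = idx % NX;                 // fastest-varying, like the innermost loop
    int y = (idx / NX) % NY;
    int z = idx / (NX * NY);
    buffer[3*idx]   =  (float)x + 0.5f;
    buffer[3*idx+1] =  (float)y + 0.5f;
    buffer[3*idx+2] = -(float)z + 0.5f;
}
// launched with, for example, 256 threads per block:
// int total = 170 * 170 * 263;
// fill_vertices<<<(total + 255) / 256, 256>>>(g_vertex_buffer_data_3);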

Related

CUDA - separating cpu code from cuda code

I was looking to use system functions (such as rand()) within the CUDA kernel. However, ideally this would just run on the CPU. Can I separate the files (.cu and .cpp) while still making use of GPU matrix addition? For example, something along these lines:
in main.cpp:
int main(){
    std::vector<int> myVec;
    srand(time(NULL));
    for (int i = 0; i < 1024; i++){
        myVec.push_back(rand() % 26);
    }
    selfSquare(myVec, 1024);
}
and in cudaFuncs.cu:
__global__ void selfSquare_cu(int *arr, int n){
    int i = threadIdx.x;
    if (i < n){
        arr[i] = arr[i] * arr[i];
    }
}
void selfSquare(std::vector<int> arr, int n){
    int *cuArr;
    cudaMallocManaged(&cuArr, n * sizeof(int));
    for (int i = 0; i < n; i++){
        cuArr[i] = arr[i];
    }
    selfSquare_cu<<<1, n>>>(cuArr, n);
}
What are best practices surrounding situations like these? Would it be a better idea to use curand and write everything in the kernel? It looks to me like in the above example there is an extra step in taking the vector and copying it into the managed CUDA memory.
In this case the only thing you need is to have the array initialised with random values. Each value of the array can be initialised independently.
In your code the CPU is involved in the initialization and in transferring the data to the device and back to the host.
In your case, do you really need the CPU to initialize the data, only to move all those values to the GPU afterwards?
The best approach is to allocate some device memory and then initialize the values using a kernel.
This will save time because:
the elements are initialized in parallel, and
no memory transfer is required from the host to the device.
As a rule of thumb, always avoid communication between host and device if possible.
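For illustration only, here is a minimal sketch of that approach using the cuRAND device API; the kernel name, seed handling and launch configuration are my own choices, not something from the question:
#include <curand_kernel.h>

// Each thread seeds its own generator, draws one value in [0, 25]
// (mirroring rand() % 26) and squares it in place, so the host never
// initializes or copies the data.
__global__ void initAndSquare(int *arr, int n, unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState state;
        curand_init(seed, i, 0, &state);   // per-thread RNG state
        int v = curand(&state) % 26;
        arr[i] = v * v;
    }
}

// usage sketch:
// int *cuArr;
// cudaMalloc(&cuArr, 1024 * sizeof(int));
// initAndSquare<<<(1024 + 255) / 256, 256>>>(cuArr, 1024, 1234ULL);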

Count values from array CUDA

I have an array of float values, namely life, and I want to count the number of entries with a value greater than 0 in CUDA.
On the CPU, the code would look like this:
int numParticles = 0;
for (int i = 0; i < MAX_PARTICLES; i++){
    if (life[i] > 0){
        numParticles++;
    }
}
Now in CUDA, I've tried something like this:
__global__ void update(float* life, int* numParticles){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (life[idx] > 0){
        (*numParticles)++;
    }
}

//life is a filled device pointer
int launchCount(float* life)
{
    int numParticles = 0;
    int* numParticles_d = 0;
    cudaMalloc((void**)&numParticles_d, sizeof(int));
    update<<<MAX_PARTICLES/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(life, numParticles_d);
    cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "numParticles: " << numParticles << std::endl;
}
But for some reason the CUDA attempt always returns 0 for numParticles. How come?
This:
if (life[idx]>0){
(*numParticles)++;
}
is a memory race (a read-after-write hazard): multiple threads simultaneously attempt to read and write numParticles, and the CUDA execution model guarantees nothing about the ordering of simultaneous transactions.
You could make this work by using atomic memory transactions, for example:
if (life[idx]>0){
atomicAdd(numParticles, 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction-type calculation, and then sum the block-local sums atomically, on the host, or in a second kernel.
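A minimal sketch of that block-local idea (assuming a 1D launch that covers MAX_PARTICLES threads; the kernel name is illustrative):
__global__ void countAlive(const float* life, int n, int* numParticles)
{
    __shared__ int blockCount;                 // per-block partial count
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    if (idx < n && life[idx] > 0)
        atomicAdd(&blockCount, 1);             // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(numParticles, blockCount);   // one global atomic per block
}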
Your code is actually launching MAX_PARTICLES threads, and multiple thread blocks are executing (*numParticles)++; concurrently. It is a race condition, so you get the result 0, or, if you are lucky, sometimes a little bigger than 0.
Since your attempt amounts to summing up life[i]>0 ? 1 : 0 for all i, you could follow the CUDA parallel reduction pattern to implement your kernel, or use a Thrust reduction to simplify your life.
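For example, a Thrust-based sketch (assuming life is a device pointer holding MAX_PARTICLES floats) could count the positive entries directly:
#include <thrust/count.h>
#include <thrust/device_ptr.h>

struct is_alive {
    __host__ __device__ bool operator()(float v) const { return v > 0.0f; }
};

// usage sketch:
// thrust::device_ptr<float> d_life(life);
// int numParticles = thrust::count_if(d_life, d_life + MAX_PARTICLES, is_alive());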

OpenMP double for loop array with stored results

I've spent time going over other posts but I still can't get this simple program to go.
#include <iostream>
#include <cmath>
#include <omp.h>
using namespace std;

int main()
{
    int threadnum = 4; //want manual control
    int steps = 100000, cumulative = 0, counter;
    int a, b, c;
    float dum1, dum2, dum3;
    float pos[10000][3] = {0};
    float non = 0;
    //RNG declared
    #pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction(+: non, cumulative) num_threads(threadnum)
    {
        for (int dummy = 0; dummy < (10000/threadnum); dummy++)
        {
            dum1 = 0, dum2 = 0, dum3 = 0;
            a = 0, b = 0, c = 0;
            for (counter = 0; counter < steps; counter++)
            {
                dum1 = somefunct1() + rand();
                dum2 = somefunct2() + rand();
                dum3 = somefunct3(dum1, dum2, ...);
                a += somefunct4(dum1, dum2, dum3, ...);
                b += somefunct5(dum1, dum2, dum3, ...);
                c += somefunct6(dum1, dum2, dum3, ...);
                cumulative++; //count number of loops executed
            }
            pos[dummy][0] = a; //saves results of second loop to array
            pos[dummy][1] = b;
            pos[dummy][2] = c;
            non += pos[dummy][0]; //holds the summed a values
        }
    }
}
I've cut down the program to get it to fit here. I've tried a lot of changes, and much of the time the inner loop simply does not execute the correct number of times, so I get cumulative equal to something like 32,532,849 instead of 1 billion. Scaling is about 2x for the code above but should be much higher.
I want the code to simply split up the first, 10000-iteration for loop so that each thread runs a certain number of iterations in parallel (if this could be dynamic that would be nice) and saves the results of each iteration of the second for loop to the results array. The second for loop has dependencies and cannot be split up. Currently the order of the 'dummy' iterations does not matter (pos[345] can be swapped with pos[3456] as long as all three indices are swapped), but I will have to modify it later so that it does matter.
The numerous variables and initializations in the inner loop are confusing me terribly. There are a lot of random calls and functions/math functions in the inner loop - is there overhead here that is causing a problem? I'm using GCC 4.9.2 on Windows.
Any help would be greatly appreciated.
Edit: finally fixed. I moved the RNG declaration inside the parallel region (before the first for loop). Now I get 3.75x scaling with 4 threads and 5.72x scaling with 8 threads (hyperthreads). Not perfect, but I will take it. I still think there is an issue with thread locking and syncing.
......
float non = 0;
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction(+: non, cumulative) num_threads(threadnum)
{
    //RNG declared
    #pragma omp for
    for (int dummy = 0; dummy < (10000/threadnum); dummy++)
    {
        ....
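As a rough sketch of the structure the edit is aiming for (the somefunct* calls are replaced by placeholder arithmetic, and the per-thread std::mt19937 engine is my own substitution for the unspecified RNG), with the outer loop split dynamically across threads:
#include <omp.h>
#include <random>

void run(float pos[][3], int steps)
{
    float non = 0;
    long long cumulative = 0;
    #pragma omp parallel reduction(+: non, cumulative)
    {
        // per-thread RNG: declared inside the parallel region so the
        // threads never share (and never lock on) a single generator
        std::mt19937 rng(omp_get_thread_num() + 1);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);

        #pragma omp for schedule(dynamic)
        for (int dummy = 0; dummy < 10000; dummy++)
        {
            float a = 0, b = 0, c = 0;
            for (int counter = 0; counter < steps; counter++)
            {
                float dum1 = dist(rng);   // placeholder for somefunct1()+rand()
                float dum2 = dist(rng);   // placeholder for somefunct2()+rand()
                a += dum1;
                b += dum2;
                c += dum1 * dum2;
                cumulative++;
            }
            pos[dummy][0] = a;
            pos[dummy][1] = b;
            pos[dummy][2] = c;
            non += pos[dummy][0];
        }
    }
}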

Possibly negative indices in a CUDA thread block?

I have a quite simple 1D CUDA kernel doing an inclusive sum; that is, if we have an input 1D array
[ x_0, x_1, x_2, ..., x_{n-1} ]
the output would be
[ x_0, x_0+x_1, x_0+x_1+x_2, ..., x_0+x_1+...+x_{n-1} ].
The kernel shown below does not completely finish this job; rather, it finishes the job within each block. Anyway, my question is not about how to completely implement the inclusive sum, but about what I think is a possible negative-indexing error during the thread calculation.
__global__ void parallel_scan_inefficient(float* input, float* output){
    // num_threads and max_i are globally defined
    __shared__ float temp[num_threads];
    int i = blockIdx.x*blockDim.x + threadIdx.x; //global index
    if (i < max_i)
    {
        temp[threadIdx.x] = input[i];
    }
    for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2)
    {
        __syncthreads();
        temp[threadIdx.x] += temp[threadIdx.x - stride];
    }
    output[i] = temp[threadIdx.x];
}
This piece of program is in fact from Hwu & Kirk's textbook "Programming Massively Parallel Processors", Chapter 9, p. 203.
So as you can see in the for-loop
for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2)
{
    __syncthreads();
    temp[threadIdx.x] += temp[threadIdx.x - stride];
}
since "threadIdx.x" starts from 0 for each block, but "stride" starts from 1. Wouldn't we see for example temp[-1] for the first element in a block ? Also after one iteration, "stride" then becomes 2 and we will see temp[-2] for threadIdx.x=0 ?
This doesn't quite make sense to me, though CUDA compiler doesn't report any errors - I ran cuda-memcheck for this kernel and it is still fine. Also the results are right (of course it is right for each block, as I said this kernel only partially finishes the inclusive sum)
I reckon I might make a very stupid mistake but I just couldn't spot it. Any light would be much appreciated. Many thanks.
If you have a code like this:
for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2)
{
    __syncthreads();
    temp[threadIdx.x] += temp[threadIdx.x - stride];
}
Then for the thread where threadIdx.x == 0 the for loop is skipped entirely, because the condition stride <= threadIdx.x is false from the start. More generally, that condition guarantees stride never exceeds threadIdx.x, so threadIdx.x - stride is never negative for any thread. Try running the following code in main:
for (unsigned int stride = 1; stride <= 0; stride *= 2)
{
    cout << "I am running" << endl;
}
And you'll see there is nothing in the console.

CUDA shared memory programming is not working

All:
I am learning how shared memory accelerates GPU programs. I am using the code below to calculate, for each element, its squared value plus the square of the average of its left and right neighbors.
The code runs; however, the result is not as expected.
The first 10 results printed out are 0,1,2,3,4,5,6,7,8,9, while I am expecting 25,2,8,18,32,50,72,98,128,162.
The code is as follows, with reference to here:
Would you please tell me which part is wrong? Your help is very much appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>

const int N = 1024;

__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[N];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp*tmp + myblock[tid]*myblock[tid];

    // write the result back to global memory
    data[tid] = myblock[tid];
    __syncthreads();
}

int main (){
    char key;
    float *a;
    float *dev_a;

    a = (float*)malloc(N*sizeof(float));
    cudaMalloc((void**)&dev_a, N*sizeof(float));

    for (int i=0; i<N; i++){
        a[i] = i;
    }

    cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    compute_it<<<N,1>>>(dev_a);
    cudaMemcpy(a, dev_a, N*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i=0; i<10; i++){
        std::cout << a[i] << ",";
    }

    std::cin >> key;
    free(a);
    free(dev_a);
    return 0;
}
One of the most immediate problems in your kernel code is this:
data[tid] = myblock[tid];
I think you probably meant this:
data[tid] = tmp;
In addition, you're launching 1024 blocks of one thread each. This isn't a particularly effective way to use the GPU and it means that your tid variable in every threadblock is 0 (and only 0, since there is only one thread per threadblock.)
There are many problems with this approach, but one immediate problem will be encountered here:
tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<31?tid+1:0]) * 0.5f;
Since tid is always zero, no other values in your shared memory array (myblock) ever get populated, so the logic in this line cannot be sensible: when tid is zero you select myblock[N-1] for the first term in the assignment to tmp, but myblock[1023] is never populated with anything.
It seems that you don't understand the various CUDA hierarchies:
a grid is all of the threads associated with a kernel launch;
a grid is composed of threadblocks;
each threadblock is a group of threads working together on a single SM;
the shared memory resource is a per-SM resource, not a device-wide resource;
__syncthreads() also operates on a threadblock basis (not device-wide);
threadIdx.x is a built-in variable that provides a unique thread ID for all threads within a threadblock, but not globally across the grid.
Instead you should break your problem into groups of reasonably sized threadblocks (i.e. more than one thread). Each threadblock will then be able to behave in a fashion that is roughly as you have outlined. You will then need to special-case the behavior at the starting point and ending point (in your data) of each threadblock.
You're also not doing proper CUDA error checking, which is recommended, especially any time you're having trouble with CUDA code.
If you make the change I indicated first in your kernel code, and reverse the order of your block and grid kernel launch parameters:
compute_it<<<1,N>>>(dev_a);
As indicated by Kristof, you will get something that comes close to what you want, I think. However you will not be able to conveniently scale that beyond N=1024 without other changes to your code.
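To illustrate the scaling point, here is my own rough sketch (not part of the original answer) of a multi-block version: each block stages its tile in shared memory, threads at the tile edges read the missing neighbor straight from global memory, and the result goes to a separate output array to avoid cross-block races. Names and the block size are illustrative.
__global__ void compute_it_tiled(const float *in, float *out, int n)
{
    extern __shared__ float tile[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    if (gid < n)
        tile[tid] = in[gid];
    __syncthreads();

    if (gid < n) {
        // use shared memory when the neighbor lives in this tile,
        // otherwise read global memory, wrapping at the array ends
        float left  = (tid > 0) ? tile[tid - 1]
                                : in[(gid == 0) ? n - 1 : gid - 1];
        float right = (tid < blockDim.x - 1 && gid + 1 < n) ? tile[tid + 1]
                                : in[(gid + 1 < n) ? gid + 1 : 0];
        float tmp = (left + right) * 0.5f;
        out[gid] = tmp * tmp + tile[tid] * tile[tid];
    }
}

// launch sketch:
// int threads = 256;
// compute_it_tiled<<<(N + threads - 1) / threads, threads,
//                    threads * sizeof(float)>>>(dev_in, dev_out, N);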
This line of code is also not correct:
free (dev_a);
Since dev_a was allocated on the device using cudaMalloc you should free it like this:
cudaFree (dev_a);
Since you have only one thread per block, your tid will always be 0.
Try launching the kernel this way:
compute_it<<<1,N>>>(dev_a);
instead of
compute_it<<<N,1>>>(dev_a);