OpenCL code not working on larger datasets - c++

I am trying to write a sorting function and a summation function in OpenCL/C++. However, while both functions work fine on smaller datasets, neither works on a dataset of any notable length. The dataset I'm trying to use is about 2 million entries long, but the functions stop working consistently at about 500 entries. Any help on why this is would be appreciated. OpenCL code below.
EDIT: Only the code fully relevant to the sum is now shown (as per request).
kernel void sum(global const double* A, global double* B) {
int id = get_global_id(0);
int N = get_global_size(0);
B[id] = A[id];
barrier(CLK_GLOBAL_MEM_FENCE);
for (int i = 1; i < N/2; i *= 2) { //i is a stride
if (!(id % (i * 2)) && ((id + i) < N))
B[id] += B[id + i];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
And the C++ code:
std::vector<double> temps(100000, 1);
// Load functions
cl::Kernel kernel_sum = cl::Kernel(program, "sum");
// Set up variables
size_t elements = temps.size();
size_t size = temps.size() * sizeof(double);
size_t workgroup_size = 10;
size_t padding_size = elements % workgroup_size;
// Sum
if (padding_size) {
std::vector<double> temps_padding(workgroup_size - padding_size, 0);
temps.insert(temps.end(), temps_padding.begin(), temps_padding.end());
}
std::vector<double> temps_sum(elements);
size_t output_size = temps_sum.size() * sizeof(double);
cl::Buffer sum_buffer_1(context, CL_MEM_READ_ONLY, size);
cl::Buffer sum_buffer_2(context, CL_MEM_READ_WRITE, output_size);
queue.enqueueWriteBuffer(sum_buffer_1, CL_TRUE, 0, size, &temps[0]);
queue.enqueueFillBuffer(sum_buffer_2, 0, 0, output_size);
kernel_sum.setArg(0, sum_buffer_1);
kernel_sum.setArg(1, sum_buffer_2);
queue.enqueueNDRangeKernel(kernel_sum, cl::NullRange, cl::NDRange(elements), cl::NDRange(workgroup_size));
queue.enqueueReadBuffer(sum_buffer_2, CL_TRUE, 0, output_size, &temps_sum[0]);
double summed = temps_sum[0];
std::cout << "SUMMED: " << summed << std::endl;
I have tried looking around everywhere but I'm completely stuck.

You're trying to use barriers for synchronisation across work groups. This won't work: a barrier only synchronises the work-items within a single work group.
Work groups don't run in a well-defined order relative to one another, so you can only use this sort of reduction algorithm within a work group. You will probably need to use a second kernel pass to combine results from individual work groups, or do this part on the host CPU. (Or modify your algorithm to use atomics in some way, etc.)
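The two-stage idea can be sketched on the host in plain C++. This is only an illustration, not OpenCL code: the function names partial_sums and combine are hypothetical, and an ordinary loop stands in for the per-work-group reduction that a kernel would do with local memory and barriers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stage 1: one partial sum per "work group". On a device, each group would
// reduce its own slice independently; no group needs to wait for another.
std::vector<double> partial_sums(const std::vector<double>& data, std::size_t group_size) {
    assert(group_size > 0);
    std::vector<double> partial;
    for (std::size_t start = 0; start < data.size(); start += group_size) {
        const std::size_t end = std::min(start + group_size, data.size());
        partial.push_back(std::accumulate(data.begin() + start, data.begin() + end, 0.0));
    }
    return partial;
}

// Stage 2: combine the per-group results. This corresponds to the second
// kernel pass, or to a final sum on the host CPU.
double combine(const std::vector<double>& partial) {
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling combine(partial_sums(temps, 10)) on the question's vector of 100000 ones gives 100000, because the only cross-group step happens after every group has finished.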

Related

Kernels Synchronisation

I'm new to CUDA programming and I'm implementing the classical Floyd APSP algorithm. This algorithm consists of 3 nested loops, and all the code inside the two inner loops can be executed in parallel.
As main parts of my code, here is the kernel code:
__global__ void dfloyd(double *dM, size_t k, size_t n)
{
unsigned int x = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int y = threadIdx.y + blockIdx.y * blockDim.y;
unsigned int index = y * n + x;
double d;
if (x < n && y < n)
{
d=dM[x+k*n] + dM[k+y*n];
if (d<dM[index])
dM[index]=d;
}
}
and here is the part from the main function where the kernels are launched (for readability I omitted error handling code):
double *dM;
cudaMalloc((void **)&dM, sizeof_M);
cudaMemcpy(dM, hM, sizeof_M, cudaMemcpyHostToDevice);
int dimx = 32;
int dimy = 32;
dim3 block(dimx, dimy);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
for (size_t k=0; k<n; k++)
{
dfloyd<<<grid, block>>>(dM, k, n);
cudaDeviceSynchronize();
}
cudaMemcpy(hM, dM, sizeof_M, cudaMemcpyDeviceToHost);
[For the understanding, dM is referring to the distance matrix stored in the device side and hM in the host side and n is referring to the number of nodes.]
Kernels inside the k-loop have to be executed serially, which is why I call cudaDeviceSynchronize() after each kernel launch.
However, I notice that moving this synchronisation call outside the loop leads to the same result.
Now, my question. Do the two following pieces of code
for (size_t k=0; k<n; k++)
{
dfloyd<<<grid, block>>>(dM, k, n);
cudaDeviceSynchronize();
}
and
for (size_t k=0; k<n; k++)
{
dfloyd<<<grid, block>>>(dM, k, n);
}
cudaDeviceSynchronize();
are equivalent?
They are not equivalent but will give the same results. The first one will make the host wait after each kernel call until the kernel has returned, while the other one will make it wait only once.
Maybe the confusing part is why it works: in CUDA, two consecutive kernel launches on the same stream (in your case, the default stream) are guaranteed to be executed serially.
Performance wise, it is advised to use the second version, as synchronisation with the host adds overhead.
Edit: in that specific case, you do not even need to call cudaDeviceSynchronize() because the cudaMemcpy will synchronize.
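To see why the ordering matters at all, here is a CPU-side reference sketch in plain C++ (not CUDA). The names floyd_step and floyd_all are hypothetical; floyd_step plays the role of one dfloyd launch and uses the same indexing as the kernel in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "kernel launch": relax every (x, y) pair through intermediate node k.
// The two inner loops are the part the GPU runs in parallel.
void floyd_step(std::vector<double>& dM, std::size_t k, std::size_t n) {
    assert(dM.size() == n * n && k < n);
    for (std::size_t y = 0; y < n; ++y) {
        for (std::size_t x = 0; x < n; ++x) {
            const double d = dM[x + k * n] + dM[k + y * n];
            if (d < dM[y * n + x]) dM[y * n + x] = d;
        }
    }
}

// The k loop is strictly ordered: step k reads values written by earlier steps.
void floyd_all(std::vector<double>& dM, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) floyd_step(dM, k, n);
}
```

Each step depends on the matrix left behind by the previous one, which is exactly the dependency that in-order execution on a single stream preserves without any host-side synchronisation inside the loop.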

How to speed up this GSL code for selecting a submatrix?

I wrote a very simple function in GSL, to select a submatrix from an existing matrix in a struct.
EDIT: I had timed VERY INCORRECTLY and didn't notice the changed number of zeros in front. Still, I hope this can be sped up.
For 100x100 submatrices of a 10000x10000 matrix, it takes 1.2E-5 seconds. So, repeating that 1E4 times, takes 50 times longer than I need to diagonalise the 100x100 matrix.
EDIT:
I realise, it happens even if I comment out everything except return(0);
Thus, I theorize, it must be something about struct TOWER. This is how TOWER looks:
struct TOWER
{
int array_level[TOWERSIZE];
int array_window[TOWERSIZE];
gsl_matrix *matrix_ordered_covariance;
gsl_matrix *matrix_peano_covariance;
double array_angle_tw[XISTEP];
double array_correl_tw[XISTEP];
gsl_interp_accel *acc_correl; // interpolating for correlation
gsl_spline *spline_correl;
double array_all_eigenvalues[TOWERSIZE]; //contains all eiv. of whole matrix
std::vector< std::vector<double> > cropped_peano_covariance, peano_mask;
};
Below comes my function!
/* --- --- */
int monolevelsubmatrix(int i, int j, struct TOWER *tower, gsl_matrix *result) //relying on spline!! //must addd auto vanishing
{
int firstrow, firstcol,mu,nu,a,b;
double aux, correl;
firstrow = helix*i;
firstcol = helix*j;
gsl_matrix_view Xi = gsl_matrix_submatrix (tower ->matrix_ordered_covariance, firstrow, firstcol, helix, helix);
gsl_matrix_memcpy (result, &(Xi.matrix));
return(0);
}
/* --- --- */
The problem is almost certainly gsl_matrix_memcpy. The source for that is in copy_source.c, with:
const size_t src_tda = src->tda ;
const size_t dest_tda = dest->tda ;
size_t i, j;
for (i = 0; i < src_size1 ; i++)
{
for (j = 0; j < MULTIPLICITY * src_size2; j++)
{
dest->data[MULTIPLICITY * dest_tda * i + j]
= src->data[MULTIPLICITY * src_tda * i + j];
}
}
This would be quite slow. Note that gsl_matrix_memcpy reports an error through GSL_ERROR if the matrices are different sizes, so it's very likely the copy could be served by a CRT memcpy on the data members of dest and src.
The loop is slow because each cell is dereferenced through the dest and src structs to reach the data member, and THEN indexed.
You could choose to write a replacement for the library, or write your own personal version of this matrix copy, with something like (untested suggestion code here):
size_t cellsize = sizeof( src->data[0] ); // just pseudocode here
memcpy( dest->data, src->data, cellsize * src_size1 * src_size2 * MULTIPLICITY );
Note that MULTIPLICITY is a define, usually 1 or 2, probably depends on library configuration - might not apply to your usage (if it's 1 )
Now, important caveat....if the source matrix is a subview, then you have to go by rows...that is, a loop of rows in i where crt's memcpy is limited to rows at a time, not the entire matrix as I show above.
In other words, you do have to account for the source matrix geometry from which the subview was taken...that's probably why they index each cell (makes it simple).
If, however, you KNOW the geometry, you can very likely optimize this WAY above the performance you're seeing.
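A minimal sketch of the row-at-a-time variant, in plain C++. MatrixView is a hypothetical stand-in that mirrors only the size1, size2, tda and data fields of gsl_matrix, so the example compiles without GSL; it assumes real double matrices (MULTIPLICITY of 1), and copy_rows is an illustrative name, not a GSL function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the gsl_matrix fields used here:
// size1 rows, size2 columns, tda = number of elements between row starts.
struct MatrixView {
    std::size_t size1;
    std::size_t size2;
    std::size_t tda;
    double* data;
};

// Copy one row per memcpy: each row is contiguous in memory even when the
// view as a whole is not (tda larger than size2 for a submatrix).
void copy_rows(MatrixView dest, MatrixView src) {
    assert(dest.size1 == src.size1 && dest.size2 == src.size2);
    for (std::size_t i = 0; i < src.size1; ++i) {
        std::memcpy(dest.data + i * dest.tda,
                    src.data + i * src.tda,
                    src.size2 * sizeof(double));
    }
}
```

For a 100x100 view of a 10000x10000 matrix this issues 100 memcpy calls of 800 bytes each, with src.tda equal to 10000 and dest.tda equal to 100.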
If all you did was take out the src/dest dereference, you'd see SOME performance gain, as in:
const size_t src_tda = src->tda ;
const size_t dest_tda = dest->tda ;
size_t i, j;
double * dest_data = dest->data; // pseudocode here (gsl_matrix stores doubles)
double * src_data = src->data; // pseudocode here
for (i = 0; i < src_size1 ; i++)
{
for (j = 0; j < MULTIPLICITY * src_size2; j++)
{
dest_data[MULTIPLICITY * dest_tda * i + j]
= src_data[MULTIPLICITY * src_tda * i + j];
}
}
We'd HOPE the compiler recognized that anyway, but...sometimes...

Wrong results with CUDA threads writing on private locations in global memory

EDIT 3:
I need each thread to write and read a private location in global memory. Below I post a working code showing my problem. In the following, I'll list the main variables and structures involved.
Variables:
srcArr_h (host) --> srcArr_d (device) : array of random floats in the range [0, COLORLEVELS] with dimensions given by ARRDIM
auxD (device) : array of dimension ARRDIM * ARRDIM holding the final result in device
auxH (host) : array of dimension ARRDIM * ARRDIM holding the final result in host
c_glob_d (device) : array that reserves a private location of COLORLEVELS floats for each thread, with size given by num_threads * COLORLEVELS
idx (device) : identification number of current thread
My problem: in the kernel, I update c_glob[idx] for each value ic (ic ∈ [0, COLORLEVELS]), i.e. c_glob[idx][ic]. I use c_glob[idx][COLORLEVELS] to compute the final result g0 stored in auxD. My final results are wrong: the values copied to auxH are at least one order of magnitude bigger than expected, or are even weird numbers suggesting that my operation is overflowing.
Help: what am I doing wrong? How can I make each thread write and read its own private location in global memory? Right now I'm debugging with ARRDIM = 512, but my goal is to make it work for ARRDIM ~ 10^4 (thus creating a c_glob array for 10^4*10^4 threads). I guess I will have issues with the total number of threads allowed per run, so I was wondering if you could suggest any other solution to my problem.
Thank you.
#include <string>
#include <stdint.h>
#include <iostream>
#include <stdio.h>
#include "cuPrintf.cu"
using namespace std;
#define ARRDIM 512
#define COLORLEVELS 4
__global__ void gpuKernel
(
float *sa, float *aux,
size_t memPitchAux, int w,
float *c_glob
)
{
float sc_loc[COLORLEVELS];
float g0=0.0f;
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
int idx = tidy * memPitchAux/4 + tidx;
for(int ic=0; ic<COLORLEVELS; ic++)
{
sc_loc[ic] = ((float)(ic*ic));
}
for(int is=0; is<COLORLEVELS; is++)
{
int ic = fabs(sa[tidy*w +tidx]);
c_glob[tidy * COLORLEVELS + tidx + ic] += 1.0f;
}
for(int ic=0; ic<COLORLEVELS; ic++)
{
g0 += c_glob[tidy * COLORLEVELS + tidx + ic]*sc_loc[ic];
}
aux[idx] = g0;
}
int main(int argc, char* argv[])
{
/*
* array src host and device
*/
int heightSrc = ARRDIM;
int widthSrc = ARRDIM;
cudaSetDevice(0);
float *srcArr_h, *srcArr_d;
size_t nBytesSrcArr = sizeof(float)*heightSrc * widthSrc;
srcArr_h = (float *)malloc(nBytesSrcArr); // Allocate array on host
cudaMalloc((void **) &srcArr_d, nBytesSrcArr); // Allocate array on device
cudaMemset((void*)srcArr_d,0,nBytesSrcArr); // set to zero
int totArrElm = heightSrc*widthSrc;
for(int ic=0; ic<totArrElm; ic++)
{
srcArr_h[ic] = (float)(rand() % COLORLEVELS);
}
cudaMemcpy( srcArr_d, srcArr_h,nBytesSrcArr,cudaMemcpyHostToDevice);
/*
* auxiliary buffer auxD to save final results
*/
float *auxD;
size_t auxDPitch;
cudaMallocPitch((void**)&auxD,&auxDPitch,widthSrc*sizeof(float),heightSrc);
cudaMemset2D(auxD, auxDPitch, 0, widthSrc*sizeof(float), heightSrc);
/*
* auxiliary buffer auxH allocation + initialization on host
*/
size_t auxHPitch;
auxHPitch = widthSrc*sizeof(float);
float *auxH = (float *) malloc(heightSrc*auxHPitch);
/*
* kernel launch specs
*/
int thpb_x = 16;
int thpb_y = 16;
int blpg_x = (int) widthSrc/thpb_x;
int blpg_y = (int) heightSrc/thpb_y;
int num_threads = blpg_x * thpb_x + blpg_y * thpb_y;
/*
* c_glob: array that reserves a private location of COLORLEVELS floats for each thread
*/
int cglob_w = COLORLEVELS;
int cglob_h = num_threads;
float *c_glob_d;
size_t c_globDPitch;
cudaMallocPitch((void**)&c_glob_d,&c_globDPitch,cglob_w*sizeof(float),cglob_h);
cudaMemset2D(c_glob_d, c_globDPitch, 0, cglob_w*sizeof(float), cglob_h);
/*
* kernel launch
*/
dim3 dimBlock(thpb_x,thpb_y, 1);
dim3 dimGrid(blpg_x,blpg_y,1);
gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d,auxD, auxDPitch, widthSrc, c_glob_d);
cudaThreadSynchronize();
cudaMemcpy2D(auxH,auxHPitch,
auxD,auxDPitch,
auxHPitch, heightSrc,
cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
float min = auxH[0];
float max = auxH[0];
float f;
string str;
for(int i=0; i<widthSrc*heightSrc; i++)
{
if(min > auxH[i])
min = auxH[i];
if(max < auxH[i])
max = auxH[i];
}
cudaFree(srcArr_d);
cudaFree(auxD);
cudaFree(c_glob_d);
}
You showed neither the whole code nor a reduced version of it that reproduces your problem. Therefore, it has not been possible to run tests and verify the possible solution below.
I think you have spotted the source of the problem: multiple threads are trying to write to the same memory locations in parallel. This is a situation that leads to race conditions. For an example, see the fourth slide of the presentation "CUDA C: race conditions, atomics, locks, mutex, and warps".
Race conditions have a brute-force solution: atomic functions. They are described in Section B.12 of the CUDA C Programming Guide. So you can try to fix your problem by changing the line
c[ic] += 1.0f;
to
atomicAdd(&c[ic], 1.0f);
You will pay for this fix with performance: atomic operations serialize the conflicting updates to avoid race conditions.
I have mentioned that atomic functions are a brute-force solution to your problem because it can be that, by properly rethinking the implementation, you can find a way to avoid them. But this is not possible to say as of now due to the very few details you provided.
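As one illustration of what rethinking the layout could look like, the sketch below checks in plain C++ (no CUDA) whether an index expression gives every thread its own slots. shared_index reproduces the expression from the question's kernel; private_index is only an assumed alternative layout with COLORLEVELS consecutive slots per thread, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int COLORLEVELS = 4;

// Index expression used in the question's kernel.
std::size_t shared_index(int tidx, int tidy, int ic) {
    return static_cast<std::size_t>(tidy * COLORLEVELS + tidx + ic);
}

// Assumed alternative: COLORLEVELS consecutive slots per thread of a w-wide grid.
std::size_t private_index(int tidx, int tidy, int w, int ic) {
    return (static_cast<std::size_t>(tidy) * w + tidx) * COLORLEVELS + ic;
}

// Number of slots that two or more different threads of a w x h grid map to.
std::size_t slots_shared_between_threads(int w, int h, bool use_private_layout) {
    assert(w > 0 && h > 0);
    const std::size_t total = static_cast<std::size_t>(w) * h * COLORLEVELS;
    std::vector<int> owner(total, -1);       // first thread seen for each slot
    std::vector<bool> shared(total, false);  // slot touched by a second thread
    for (int tidy = 0; tidy < h; ++tidy)
        for (int tidx = 0; tidx < w; ++tidx)
            for (int ic = 0; ic < COLORLEVELS; ++ic) {
                const int thread = tidy * w + tidx;
                const std::size_t slot = use_private_layout
                    ? private_index(tidx, tidy, w, ic)
                    : shared_index(tidx, tidy, ic);
                if (owner[slot] == -1) owner[slot] = thread;
                else if (owner[slot] != thread) shared[slot] = true;
            }
    std::size_t count = 0;
    for (bool s : shared) if (s) ++count;
    return count;
}
```

With the question's expression, thread (tidx=1, tidy=0) at ic=0 and thread (tidx=0, tidy=0) at ic=1 both land on slot 1, so some slots are shared; with the assumed private layout no slot is shared and no atomics would be needed for these updates.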

Access vector of pointers to other vectors on a GPU

This is a follow-up to a question I had. At the moment, in a CPU version of some code, I have many things that look like the following:
for(int i =0;i<N;i++){
dgemm(A[i], B[i],C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N','T');
}
where A[i] will be a 2D matrix of some size.
I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so I need the linear algebra operations in CULA), so for example:
for(int i =0;i<N;i++){
status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}
but I would like to store my B's on the GPU in advance at the start of the program as they don't change, so I need to have a vector that contains pointers to the set of vectors that make up my B's.
I currently have the following code that compiles:
double **GlobalFVecs_d;
double **GlobalFPVecs_d;
extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){
cudaError_t err;
GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
checkCudaError(err);
for(int i =0; i < numpulsars;i++){
err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
checkCudaError(err);
err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
checkCudaError(err);
}
err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
checkCudaError(err);
}
but if i now try and access it with:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid;//((G + dimBlock.x - 1) / dimBlock.x,(N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x=(numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;
for(int i =0; i < numpulsars; i++){
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}
it seg faults here. Is this not how to get at the data?
The kernel function that I'm calling is just:
__global__ void CopyPPFNF(double *FNF_d, double *PPFNF_d, int numpulsars, int numcoeff, int thispulsar) {
// Each thread computes one element of C
// by accumulating results into Cvalue
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int subrow=row-thispulsar*numcoeff;
int subcol=row-thispulsar*numcoeff;
__syncthreads();
if(row >= (thispulsar+1)*numcoeff || col >= (thispulsar+1)*numcoeff) return;
if(row < thispulsar*numcoeff || col < thispulsar*numcoeff) return;
FNF_d[row * numpulsars*numcoeff + col] += PPFNF_d[subrow*numcoeff+subcol];
}
What am I not doing right? Note that eventually I would also like to do as in the first example, calling CULA functions on each GlobalFVecs_d[i], but for now not even this works.
Do you think this is the best way to go about doing this? If it were possible to just pass CULA functions a slice of a large contiguous vector I could do that too, but I don't know if it supports that.
Cheers
Lindley
Change this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
to this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);
and I believe it will work.
Your methodology of handling pointers is mostly correct. However, when you put GlobalFVecs_d[i] in the parameter list, you are forcing the kernel setup code (running on the host) to take GlobalFVecs_d (a device pointer, created with cudaMalloc), add an appropriately scaled i to the pointer value, and then dereference the resultant pointer to retrieve the value to pass as a parameter to the kernel. But we are not allowed to dereference device pointers in host code.
However, because your methodology was mostly correct, you have a convenient parallel array of the same pointers that resides on the host. This array (GlobalFPVecs_d) is something that we are allowed to dereference into, in host code, to retrieve the resultant device pointer, to pass to the kernel.
It's an interesting bug because normally kernels do not seg fault (although they may throw an error), so a seg fault on a kernel invocation line is unusual. But in this case, the seg fault is occurring in the kernel setup code, not the kernel itself.

Is it possible to run the sum computation in parallel in OpenCL?

I am a newbie in OpenCL. However, I understand the C/C++ basics and the OOP.
My question is as follows: is it somehow possible to run the sum computation task in parallel? Is it theoretically possible? Below I will describe what I've tried to do:
The task is, for example:
double* values = new double[1000]; //let's pretend it has some random values inside
double sum = 0.0;
for(int i = 0; i < 1000; i++) {
sum += values[i];
}
What I tried to do in OpenCL kernel (and I feel it is wrong because perhaps it accesses the same "sum" variable from different threads/tasks at the same time):
__kernel void calculate2dim(__global float* vectors1dim,
__global float output,
const unsigned int count) {
int i = get_global_id(0);
output += vectors1dim[i];
}
This code is wrong. I would highly appreciate it if anyone could tell me whether it is theoretically possible to run such tasks in parallel and, if it is, how!
If you want to sum the values of your array in a parallel fashion, you should reduce contention and make sure there are no data dependencies across threads.
Data dependencies will cause threads to have to wait for each other, creating contention, which is what you want to avoid to get true parallelization.
One way you could do that is to split your array into N arrays, each containing some subsection of your original array, and then calling your OpenCL kernel function with each different array.
At the end, when all kernels have done the hard work, you can just sum up the results of each array into one. This operation can easily be done by the CPU.
The key is to not have any dependencies between the calculations done in each kernel, so you have to split your data and processing accordingly.
I don't know if your data has any actual dependencies from your question, but that is for you to figure out.
The piece of code I've provided for reference should do the job.
E.g. you have N elements, and the size of your workgroup is WS = 64. I assume that N is a multiple of 2*WS (this is important: one workgroup calculates the sum of 2*WS elements). Then you need to run the kernel specifying:
globalSizeX = 2*WS*(N/(2*WS));
As a result, the sum array will hold partial sums of 2*WS elements each (e.g. sum[1] will contain the sum of the elements whose indices are from 2*WS to 4*WS-1).
If your globalSizeX is 2*WS or less (which means that you have only one workgroup), then you are done. Just use sum[0] as the result.
If not, you need to repeat the procedure, this time using the sum array as the input array and writing the output to another array (create 2 arrays and ping-pong between them). And so on until you have only one workgroup.
Search also for the Hillis-Steele and Blelloch parallel scan algorithms.
This article could be useful as well
Here is the actual example:
#define WS 64 // work-group size from the text above; assumed to match the local size used at launch

__kernel void par_sum(__global const unsigned int* input, __global unsigned int* sum)
{
    int li = get_local_id(0);
    int groupId = get_group_id(0);
    // local arrays need a compile-time size, hence the WS define
    __local unsigned int our_h[2 * WS];
    // each work-item loads two elements of this group's slice into local memory
    our_h[2*li + 0] = input[2*WS*groupId + 2*li + 0];
    our_h[2*li + 1] = input[2*WS*groupId + 2*li + 1];
    // sweep up: pairwise sums, doubling the width on every step
    int width = 2;
    int num_el = 2*WS/width;
    int wby2 = width>>1;
    for(int i = 2*WS>>1; i>0; i>>=1)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(li < num_el)
        {
            int idx = width*(li+1) - 1;
            our_h[idx] = our_h[idx] + our_h[(idx - wby2)];
        }
        width<<=1;
        wby2 = width>>1;
        num_el>>=1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // the last element now holds the total for this work-group
    if(0 == li)
        sum[groupId] = our_h[2*WS-1]; // save sum
}
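The repeat-until-one-workgroup procedure described above can be sketched on the host in plain C++. This is a simulation under stated assumptions, not OpenCL host code: reduce_pass is a hypothetical stand-in for one launch of the kernel, reduce_all is a hypothetical driver, and the zero padding is an added convenience so that the input length does not have to be a multiple of 2*WS.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for one kernel pass: each "work group" sums 2*ws consecutive
// elements of in and writes a single value to out.
void reduce_pass(const std::vector<unsigned int>& in,
                 std::vector<unsigned int>& out, std::size_t ws) {
    assert(ws > 0 && in.size() % (2 * ws) == 0);
    out.assign(in.size() / (2 * ws), 0u);
    for (std::size_t g = 0; g < out.size(); ++g)
        for (std::size_t j = 0; j < 2 * ws; ++j)
            out[g] += in[g * 2 * ws + j];
}

// Ping-pong between two buffers until a single value remains.
unsigned int reduce_all(std::vector<unsigned int> a, std::size_t ws) {
    assert(ws > 0);
    if (a.empty()) return 0u;
    std::vector<unsigned int> b;
    do {
        // Pad with zeros up to a multiple of 2*ws (an assumption of this sketch).
        const std::size_t chunk = 2 * ws;
        a.resize((a.size() + chunk - 1) / chunk * chunk, 0u);
        reduce_pass(a, b, ws);
        std::swap(a, b);  // the output of this pass is the input of the next
    } while (a.size() > 1);
    return a[0];
}
```

With ws = 64 and 1000 input values, the first pass works on 1024 padded elements and produces 8 partial sums, and the second pass reduces those 8 (padded to 128) to the final value.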