work group barrier not working

work group barrier not working - c++

as a simple test to see whether or not the OpenCL 2.0 functions work for me, I've written a small kernel that calls work_group_barrier. However, for the life if me, I can't figure out why the kernel becomes invalid.
Considering that the kernel will be valid if "barrier" is used, and work_group_barrier is just a renamed version of barrier, this doesn't make sense.
The kernel in question:
#pragma OPENCL EXTENSION cl_amd_printf : enable
#pragma OPENCL EXTENSION cl_khr_subgroups : enable
//pragmas go here
#define TRUE 1
#define FALSE 0
__kernel void my_dumb_test(
__global float *in0,
__global float *in1,
__global float *out
){
int global_num = get_global_id(0);
int local_num = get_local_id(0);
int local_size = get_local_size(0);
int global_size = get_global_size(0);
int group_id = get_group_id(0);
int group_num = get_num_groups(0);
local int a;
int b = 2;
//a = work_group_broadcast(b, local_num);
//uint sub_group_size = get_sub_group_size();
//printf("in0[%d]: %f\n", global_num, in0[global_num]);
//printf("max sub group size: %d\n", sub_group_size);
//work_group_barrier(CLK_GLOBAL_MEM_FENCE);
//barrier(CLK_GLOBAL_MEM_FENCE);
printf("global id: %d local id: %d group id: %d num groups %d\n", global_num, local_num, group_id, group_num);
}
The funny thing is, the host-side OpenCL 2.0 functions work. Using clCreateCommandQueueWithProperties returns with a success. In the older versions of OpenCL, this function existed as clCreateCommandQueue. CL_DEVICE_VERSION pings back OpenCL 2.0. I'm running with the AMD Radeon R9 290X 4GB GDDR5, running Ubuntu 14.04, with the latest drivers, and the AMD-APP-SDK 3.0 beta.
Any help is appreciated.

I have found the solution to my problem.
clBuildProgram will default to the highest version of OpenCL C compiler 1.X if the option "-cl-std=CL2.0" is not specified in the options portion of the clBuildProgram API call.
The OpenCL C compiler is for the device-side kernel code, and is separate from the Host-side compiliation. One must manually specify OpenCL 2.0 if they choose to use it.

Related

Cuda Error (209): cudaLaunchKernel returned cudaErrorNoKernelImageForDevice

Operating System: CentOS 7
Cuda Toolkit Version: 11.0
Nvidia Driver and GPU Info:
NVIDIA-SMI 450.51.05
Driver Version: 450.51.05
CUDA Version: 11.0
GPU: Quadro M2000M
screenshot of nvidia-smi details
I'm very new to cuda programming so any guidance is extremely appreciated. I have a very simple cuda c++ program that computes the sum of two arrays in unified memory on the GPU. However, it appears that the kernel fails to launch due to a cudaErrorNoKernelImageForDevice error. The code is below:
using namespace std;
#include <iostream>
#include <math.h>
#include <cuda_runtime_api.h>
__global__
void add(int n, float *x, float*y){
for (int i = 0; i < n; i++)
y[i] = x[i] + y[i];
}
int main() {
cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
int N = 1<<20;
float *x, *y;
cudaMallocManaged((void**)&x, N*sizeof(float));
cudaMallocManaged((void**)&y, N*sizeof(float));
for(int i = 0; i < N; i++){
x[i] = 1.0f;
y[i] = 2.0f;
}
add<<<1, 1>>>(N, x, y);
cudaGetLastError();
/**
* This indicates that there is no kernel image available that is suitable
* for the device. This can occur when a user specifies code generation
* options for a particular CUDA source file that do not include the
* corresponding device configuration.
*
* cudaErrorNoKernelImageForDevice = 209,
*/
cudaDeviceSynchronize();
float maxError = 0.0f;
for (int i = 0; i < N; i++){
maxError = fmax(maxError, fabs(y[i]-3.0f));
}
cudaFree(x);
cudaFree(y);
return 0;
}

The error here comes about due to the fact that a CUDA kernel must be compiled in a way that the resulting code (PTX, or SASS) is compatible with the GPU that it is being run on. This is a topic with a lot of nuance, so please refer to questions like this (and the links there) for additional background.
The GPU architecture when we want to be precise is referred to as the compute capability. You can discover the compute capability of your GPU either with a google search or by running the deviceQuery CUDA sample code. The compute capability is expressed as (major).(minor) so something like compute capability 5.2, or 7.0, etc.
When compiling code, it's necessary to specify a compute capability (or if not, a default compute capability will be implied). If you specify the compute capability when compiling in a way that matches your GPU, everything should be fine. However newer/higher compute capability code will generally not run on older/lower compute capability GPUs. In that case, you will see errors like what you describe:
cudaErrorNoKernelImageForDevice
209
"no binary for GPU"
or similar. You may also see no explicit error at all if you are not doing proper CUDA error checking. The solution is to match the compute capability specified at compile time with the GPU you intend to run on. The method to do this will vary depending on the toolchain/IDE you are using. For basic nvcc command line usage:
nvcc -arch=sm_XY ...
will specify a compute capability of X.Y
For Eclipse/Nsight Eclipse/Nsight Visual Studio, the compute capability can be specified in the project properties. Depending on the tool it may be expressed as switch values (e.g. compute_XY, sm_XY) or it may be expressed numerically as X.Y

weird CUDA kernel result on old display driver

I'm writing a CUDA program that to be run on thousands of different GPUs, those machine would have different version of display driver installed, I cannot force them to update to the latest driver. Actually most code runs fine on those 'old' machine, but fails with some particular code:
Here's the problem:
#include <stdio.h>
#include <cuda.h>
#include <cuda_profiler_api.h>
__global__
void test()
{
unsigned i = 64;
unsigned j = 192;
int k = 7;
for(j = 1 << (k - 1); i &j; j >>= 1)
i ^= j;
i ^= j;
printf("i,j,k: %d,%d,%d\n", i,j,k);
// i,j,k: 32,32, 7 (correct)
// i,j,k: 0, 64, 7 (wrong)
}
int main() {
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
test<<<1,1>>>();
}
The code prints 32,32,7 as result on GPU with latest driver, which is the correct result. But on old driver(lower than CUDA 6.5) it prints 0,64,7 .
I'm looking for any workaround for this.
Envoronment:
Developing: Win7-32bit, VS2013, CUDA 6.5
Corrent Result on: WinXP-32bit(and Win7-32bit), GTX-650(latest driver)
Wrong Result on: WinXP-32bit + GTX-750-Ti(old driver), WinXP-32bit + GTX-750(old driver)

There is no workaround. The runtime API is versioned and the minimum driver version requirement is non-negotiable.
Your only two choices are to develop using the lowest common denominator toolkit version that supports the driver being used, or switch to the driver API.

Got a very slow solution: use local memory rather than register variable.
just add volatile keyword before i,j
volatile unsigned i = 64;
volatile unsigned j = 192;

cudaOccupancyMaxActiveBlocksPerMultiprocessor is undefined

I am trying to learn cuda and use it in an efficient way. And I have found a code from nvidia's website, which tells that we can learn what should be the block size that we should use for the device's most efficient usage. The code is as follows :
#include <iostream>
// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
d[idx] = a[idx] * b[idx];
}
// Host code
int main()
{
int numBlocks; // Occupancy in terms of active blocks
int blockSize = 32;
// These variables are used to convert occupancy to warps
int device;
cudaDeviceProp prop;
int activeWarps;
int maxWarps;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
MyKernel,
blockSize,
0);
activeWarps = numBlocks * blockSize / prop.warpSize;
maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;
return 0;
}
However, when I compiled it, there is the following error :
Compile line :
nvcc ben_deneme2.cu -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o my
Error :
ben_deneme2.cu(25): error: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000623d_00000000-8_ben_deneme2.cpp1.ii".
Should I include a library for this, though I could not find a library name for this on the internet? Or am I doing something else wrong?
Thanks in advance

The cudaOccupancyMaxActiveBlocksPerMultiprocessorfunction is included in CUDA 6.5. You have not access to that function if you have a previous version of CUDA installed, for example, it will not work for CUDA 5.5.
If you want to use that function you must update your CUDA version at least to to 6.5.
People using older versions usually use the Cuda Occupancy Calculator.
One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. -- CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

Using "cuFFT Device Callbacks"

This is my first question, so I'll try to be as detailed as possible. I'm working on implementing noise reduction algorithm in CUDA 6.5. My code is based on this Matlab implementation: http://pastebin.com/HLVq48C1.
I'd love to use new cuFFT Device Callbacks feature, but I'm stuck on cufftXtSetCallback. Every time my cufftResult is CUFFT_NOT_IMPLEMENTED (14). Even example provided by nVidia fails the same way...
My device callback testing code:
__device__ void noiseStampCallback(void *dataOut,
size_t offset,
cufftComplex element,
void *callerInfo,
void *sharedPointer) {
element.x = offset;
element.y = 2;
((cufftComplex*)dataOut)[offset] = element;
}
__device__ cufftCallbackStoreC noiseStampCallbackPtr = noiseStampCallback;
CUDA part of my code:
cufftHandle forwardFFTPlan;//RtC
//find how many windows there are
int batch = targetFile->getNbrOfNoiseWindows();
size_t worksize;
cufftCreate(&forwardFFTPlan);
cufftMakePlan1d(forwardFFTPlan, WINDOW, CUFFT_R2C, batch, &worksize); //WINDOW = 2048
//host memory, allocate
float *h_wave;
cufftComplex *h_complex_waveSpec;
unsigned int m_num_real_elems = batch*WINDOW*2;
h_wave = (float*)malloc(m_num_real_elems * sizeof(float));
h_complex_waveSpec = (cufftComplex*)malloc((m_num_real_elems/2+1)*sizeof(cufftComplex));
//init
memset(h_wave, 0, sizeof(float) * m_num_real_elems); //last window won't probably be full of file data, so fill memory with 0
memset(h_complex_waveSpec, 0, sizeof(cufftComplex) * (m_num_real_elems/2+1));
targetFile->getNoiseFile(h_wave); //fill h_wave with samples from sound file
//device memory, allocate, copy from host
float *d_wave;
cufftComplex *d_complex_waveSpec;
cudaMalloc((void**)&d_wave, m_num_real_elems * sizeof(float));
cudaMalloc((void**)&d_complex_waveSpec, (m_num_real_elems/2+1) * sizeof(cufftComplex));
cudaMemcpy(d_wave, h_wave, m_num_real_elems * sizeof(float), cudaMemcpyHostToDevice);
//prepare callback
cufftCallbackStoreC hostNoiseStampCallbackPtr;
cudaMemcpyFromSymbol(&hostNoiseStampCallbackPtr,
noiseStampCallbackPtr,
sizeof(hostNoiseStampCallbackPtr));
cufftResult status = cufftXtSetCallback(forwardFFTPlan,
(void **)&hostNoiseStampCallbackPtr,
CUFFT_CB_ST_COMPLEX,
NULL);
//always return status 14 - CUFFT_NOT_IMPLEMENTED
//run forward plan
cufftResult result = cufftExecR2C(forwardFFTPlan, d_wave, d_complex_waveSpec);
//result seems to be okay without cufftXtSetCallback
I'm aware that I'm just a beginner in CUDA. My question is:
How can I call cufftXtSetCallback properly or what is a cause of this error?

Referring to the documentation:
The callback API is available in the statically linked cuFFT library only, and only on 64 bit LINUX operating systems. Use of this API requires a current license. Free evaluation licenses are available for registered developers until 6/30/2015. To learn more please visit the cuFFT developer page.
I think you are getting the not implemented error because either you are not on a Linux 64 bit platform, or you are not explicitly linking against the CUFFT static library. The Makefile in the cufft callback sample will give the correct method to link.
Even if you fix that issue, you will likely run into a CUFFT_LICENSE_ERROR unless you have gotten one of the evaluation licenses.
Note that there are various device limitations as well for linking to the cufft static library. It should be possible to build a statically linked CUFFT application that will run on cc 2.0 and greater devices.

A new (2019) possibility are cuFFT device extensions (cuFFTDX). Being part of the Math Library Early Access they are device FFT functions, which can be inlined into user kernels.
Announcement of cuFFTDX:
https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9240-cuda-new-features-and-beyond.pdf
Math Library Early Access:
https://developer.nvidia.com/cuda-math-library-early-access-program-page
Example Code:
https://github.com/mnicely/cufft_examples

OpenCL inside Visual Studio - can we compile one exe that would use all possible CPUs OpenCL can get an all OpenCL supporting platforms?

So I mean compiling code like:
//*******************************************************************
// Demo OpenCL application to compute a simple vector addition
// computation between 2 arrays on the GPU
// ******************************************************************
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
// OpenCL source code
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
"{",
" // Index of the elements to add \n",
" unsigned int n = get_global_id(0);",
" // Sum the n’th element of vectors a and b and store in c \n",
" c[n] = a[n] + b[n];",
"}"
};
// Some interesting data for the vectors
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
// Number of elements in the vectors to be added
#define SIZE 2048
// Main function
// *********************************************************************
int main(int argc, char **argv)
{
// Two integer source vectors in Host memory
int HostVector1[SIZE], HostVector2[SIZE];
// Initialize with some interesting repeating data
for(int c = 0; c < SIZE; c++)
{
HostVector1[c] = InitialData1[c%20];
HostVector2[c] = InitialData2[c%20];
}
// Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
NULL, NULL, NULL);
// Get the list of GPU devices associated with this context
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);
// Create a command-queue on the first GPU device
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext,
GPUDevices[0], 0, NULL);
// Allocate GPU memory for source vectors AND initialize from CPU memory
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL);
// Allocate output memory on GPU
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
sizeof(int) * SIZE, NULL, NULL);
// Create OpenCL program with source code
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7,
OpenCLSource, NULL, NULL);
// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
// Create a handle to the compiled OpenCL function (Kernel)
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
// In the next step we associate the GPU memory with the Kernel arguments
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem),(void*)&GPUOutputVector);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
// Launch the Kernel on the GPU
size_t WorkSize[1] = {SIZE}; // one dimensional Range
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
WorkSize, NULL, 0, NULL, NULL);
// Copy the output in GPU memory back to CPU memory
int HostOutputVector[SIZE];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
// Cleanup
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseContext(GPUContext);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
// Print out the results
for (int Rows = 0; Rows < (SIZE/20); Rows++, printf("\n")){
for(int c = 0; c <20; c++){
printf("%c",(char)HostOutputVector[Rows * 20 + c]);
}
}
return 0;
}
into exe with VS (in my case 08) can we be sure that any where we run it it would use maximum of PC computing power? If not how to do that? (btw how to make it work with AMD cards and other non CUDA gpus? (meaning one programm for all PCs))

This code should run on any Windows PC with OpenCL capable GPU drivers.
However, to improve portability and possibly performance, you should be more careful about a few things.
You require a working OpenCL implementation. This won't work at all if no OpenCL implementation is available - a DLL is missing in this case.
You require a GPU implementation of OpenCL. It's a better idea to query all OpenCL platforms and devices and prefer GPU devices
You simply use the first available device, which might not be ideal. It's a better idea to determine the most powerful device via heuristics or to give the user a choice.
Furthermore, you are not using clCreateContextFromType correctly. You need to explicitly specify a platform ID (you can query IDs with clGetPlatformIDs) on ICD-enabled OpenCL implementations. All of today's OpenCL implementations use the ICD wrapper. (Well, except Apple's)

By writing your program for OpenCL, it will run on any computer that supports your executable format and has an OpenCL driver. Might be GPU-accelerated and might not, and won't be quite as fast as hardware-specific machine code, but should be fairly portable. If by portable you mean x86 CPU architecture and Windows OS and only the video card changes. For true portability, you'll need to distribute either source code or bytecode that's JIT compiled -- not C or C++ binaries.

In principle OpenCL code should run anywhere. But there are multiple versions of its implementation, 1.1 and 1.2 (not yet on Nvidia) with 2.0 spec released and due sometime. There are API incompatibilities between the OpenCL versions. AMD have provided a way for their OpenCL 1.2 implementation to use the 1.1 API calls.
There are important caveats noted, by other authors, above, OpenCL ICD http://www.khronos.org/registry/cl/extensions/khr/cl_khr_icd.txt supports running
on different Vendor platforms but since there is no binary portability for compiled kernel code, you'll need to provide the kernel source code.
In the general case, check for the OpenCL Platform Version and Profile. Remember there can easily be multiple platforms and devices on a system. Then make a determination as to what the minimum platform version and profile available you can run on. If your code relies on specific device extensions, like double precision floating point, then you need to check for their presence on the device.
Use
clGetPlatformInfo( ... )
to get the platform profile and version.
Use
clGetDeviceInfo( ... )
to get the device specific extensions

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js