cudaOccupancyMaxActiveBlocksPerMultiprocessor is undefined - c++

I am trying to learn CUDA and use it efficiently. I found some code on NVIDIA's website that shows how to determine the block size that makes the most efficient use of the device. The code is as follows:
#include <iostream>
// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}
// Host code
int main()
{
    int numBlocks;       // Occupancy in terms of active blocks
    int blockSize = 32;

    // These variables are used to convert occupancy to warps
    int device;
    cudaDeviceProp prop;
    int activeWarps;
    int maxWarps;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        MyKernel,
        blockSize,
        0);

    activeWarps = numBlocks * blockSize / prop.warpSize;
    maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

    return 0;
}
However, when I compile it, I get the following error:
Compile line :
nvcc ben_deneme2.cu -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o my
Error :
ben_deneme2.cu(25): error: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000623d_00000000-8_ben_deneme2.cpp1.ii".
Should I include a library for this? I could not find a library name for it anywhere on the internet. Or am I doing something else wrong?
Thanks in advance

The cudaOccupancyMaxActiveBlocksPerMultiprocessor function was introduced in CUDA 6.5. You do not have access to that function if you have an earlier version of CUDA installed; for example, it will not work with CUDA 5.5.
If you want to use that function, you must update your CUDA installation to at least 6.5.
People using older versions usually use the CUDA Occupancy Calculator instead.
One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. -- CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
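Since CUDA 6.5 there is also cudaOccupancyMaxPotentialBlockSize, which suggests a launch configuration for you. Here is a minimal sketch reusing MyKernel from the question; the array size and allocations are assumptions for illustration:
// Minimal sketch (CUDA 6.5+): let the runtime suggest a block size for MyKernel.
int main()
{
    int arraySize = 1 << 20;        // example size, assumed for illustration
    int *d, *a, *b;
    cudaMalloc(&d, arraySize * sizeof(int));
    cudaMalloc(&a, arraySize * sizeof(int));
    cudaMalloc(&b, arraySize * sizeof(int));

    int minGridSize;  // minimum grid size needed to reach maximum occupancy
    int blockSize;    // block size suggested by the occupancy API
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);

    // Round the grid up so every element gets a thread.
    int gridSize = (arraySize + blockSize - 1) / blockSize;
    MyKernel<<<gridSize, blockSize>>>(d, a, b);
    cudaDeviceSynchronize();
    return 0;
}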

Related

cudaMemcpyAsync() not synchronizing after second kernel call

My goal is to set a host variable, passed by reference, from inside a CUDA kernel:
// nvcc test_cudaMemcpyAsync.cu -rdc=true
#include <iostream>

__global__ void setHostVar(double& host_var) {
    double const var = 2.0;
    cudaMemcpyAsync(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
    // identifier "cudaMemcpy" is undefined in device code
    // cudaMemcpy(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
}
int main() {
    double host_var = 1.0;
    setHostVar<<<1, 1>>>(host_var);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << host_var << std::endl;

    setHostVar<<<1, 1>>>(host_var);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << host_var << std::endl;

    return 0;
}
Compile and run:
$ nvcc test_cudaMemcpyAsync.cu -rdc=true
$ ./a.out
Output:
host_var = 1
host_var = 1
I can understand the first output line host_var = 1, given the asynchronous kernel launch combined with the asynchronous call to cudaMemcpyAsync(). However, I would have thought that the second kernel call executes after the prior async calls complete, yet host_var remains unchanged.
Questions
What is incorrect about my expectations?
What is the best/better way to set a host variable passed by reference/pointer into a kernel?
Version
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
What is incorrect about my expectations?
If we ignore managed memory and host-pinned memory (i.e. if we focus on typical host memory, such as what you are using here), it's a fundamental principle in CUDA that device code cannot touch/modify/access host memory (except on Power9 processor platforms). A direct extension of this is that you cannot (with those provisos) pass a reference to a CUDA kernel and expect to do anything useful with it.
If you really want to pass a variable by reference, it will be necessary to use either managed memory or host-pinned memory. Both require particular allocators, so in practice you end up working with pointers rather than references.
In any event, unless you are on a Power9 platform, there is no way to pass a reference to host-based stack memory to a CUDA kernel and use it, sensibly.
If you'd like to see sensible usage of memory between host and device, study any of the CUDA sample codes.
What is the best/better way to set a host variable passed by reference/pointer into a kernel?
The closest thing that I would recommend to what you have shown here would look like this (using a host-pinned allocator):
$ cat t14.cu
#include <iostream>
__global__ void setHostVar(double *host_var) {
    double const var = 2.0;
    *host_var = var;
}
int main() {
    double *host_var_ptr;
    cudaHostAlloc(&host_var_ptr, sizeof(double), cudaHostAllocDefault);
    *host_var_ptr = 1.0;

    setHostVar<<<1, 1>>>(host_var_ptr);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << *host_var_ptr << std::endl;

    setHostVar<<<1, 1>>>(host_var_ptr);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << *host_var_ptr << std::endl;

    return 0;
}
$ nvcc -o t14 t14.cu
$ cuda-memcheck ./t14
========= CUDA-MEMCHECK
host_var = 2
host_var = 2
========= ERROR SUMMARY: 0 errors
$
Although that may not adhere exactly to your request.
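For completeness, here is a minimal sketch of the managed-memory alternative mentioned above; it behaves the same way, but uses cudaMallocManaged instead of a pinned allocation:
#include <iostream>

__global__ void setHostVar(double *host_var) {
    *host_var = 2.0; // plain store; the managed allocation is visible to host and device
}

int main() {
    double *var;
    cudaMallocManaged(&var, sizeof(double)); // accessible from both host and device
    *var = 1.0;
    setHostVar<<<1, 1>>>(var);
    cudaDeviceSynchronize(); // required before the host reads the result
    std::cout << "host_var = " << *var << std::endl;
    return 0;
}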
You may also be confused about how asynchronous is used in CUDA. Without trying to cover every aspect of the topic, CUDA kernels are launched asynchronously, meaning the CPU thread does not wait for the CUDA kernel to finish before proceeding. However cudaDeviceSynchronize() forces all previously issued work to that device to be complete before the CPU thread is allowed to proceed. That includes the kernel and anything involved with the kernel, such as data copying (however you do it) issued from kernel/device code. So we expect kernel activity to be complete/coherent after such a call.

cuda <<<X,X>>> gives expected an expression error

I am trying to compile and run the following program called test.cu:
#include <iostream>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// Kernel function to add the elements of two arrays
__global__
void add(int n, float* x, float* y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;
    float* x, * y;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }

    // Run kernel on 1M elements on the GPU
    add <<<1, 256>>> (N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    for (int i = 0; i < 10; i++)
        std::cout << y[i] << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}
I am using Visual Studio Community 2019, and it marks the "add <<<1, 256>>> (N, x, y);" line with an "expected an expression" error. It somehow compiles without errors anyway, but when I run the .exe file it outputs a bunch of "1"s instead of the expected "3"s.
I also tried compiling with "nvcc test.cu", but initially it said "nvcc fatal : Cannot find compiler 'cl.exe' in PATH", so I added "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\Hostx64\x64" to PATH, and now compiling with nvcc gives the same error as compiling with Visual Studio.
In both cases the program never enters the "add" function.
I am pretty sure the code is right and the problem has something to do with the installation. I have already tried reinstalling the CUDA toolkit and repairing Visual Studio, but that didn't work.
The kernel.cu example that appears when starting a new CUDA project in Visual Studio also didn't work; when run, it output "No kernel image available for execution on the device".
How can I solve this?
nvcc version if that helps:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:35_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28845127_0
Visual Studio provides IntelliSense for C++. In the C++ language, the proper parsing of angle brackets is troublesome: < is both less-than and the opening of a template argument list, and << is a shift. So the guys at NVIDIA chose about the worst possible delimiter, <<<>>>, which makes it difficult for IntelliSense to work properly. The way to get full IntelliSense in CUDA is to switch from the Runtime API to the Driver API. The C++ is then just C++, and the CUDA is still (sort of) C++, with no <<<>>> badness for the language parsing to work around.
You could take a look at the difference between the matrixMul and matrixMulDrv samples. The <<<>>> syntax is handled by the compiler essentially by spitting out code that makes the equivalent Driver API calls. You'll link against cuda.lib rather than cudart.lib, and may have to deal with a "mixed mode" program if you use CUDA-RT-only libraries. You could refer to this link for more information.
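To give a flavor of what that looks like, here is a rough, minimal sketch of launching the add kernel through the Driver API. The file name "add.ptx" is a placeholder (you could produce it with something like nvcc -ptx test.cu -o add.ptx), the kernel is assumed to be declared extern "C" so its name is not C++-mangled, error checking is omitted, and you link against cuda.lib (or -lcuda on Linux):
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    // Load the compiled device code and look up the kernel by (unmangled) name.
    CUmodule mod;   cuModuleLoad(&mod, "add.ptx");
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "add");

    int n = 1 << 20;
    CUdeviceptr x, y;
    cuMemAlloc(&x, n * sizeof(float));
    cuMemAlloc(&y, n * sizeof(float));

    // Equivalent of add<<<1, 256>>>(n, x, y): parameters are passed by address.
    void *args[] = { &n, &x, &y };
    cuLaunchKernel(fn, 1, 1, 1,    // grid dimensions
                   256, 1, 1,      // block dimensions
                   0, nullptr,     // shared memory bytes, stream
                   args, nullptr);
    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}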
Also, this link tells how to add Intellisense for CUDA in VS.

Cuda Error (209): cudaLaunchKernel returned cudaErrorNoKernelImageForDevice

Operating System: CentOS 7
Cuda Toolkit Version: 11.0
Nvidia Driver and GPU Info:
NVIDIA-SMI 450.51.05
Driver Version: 450.51.05
CUDA Version: 11.0
GPU: Quadro M2000M
screenshot of nvidia-smi details
I'm very new to CUDA programming, so any guidance is extremely appreciated. I have a very simple CUDA C++ program that computes the sum of two arrays in unified memory on the GPU. However, it appears that the kernel fails to launch due to a cudaErrorNoKernelImageForDevice error. The code is below:
#include <iostream>
#include <math.h>
#include <cuda_runtime_api.h>

using namespace std;

__global__
void add(int n, float *x, float *y){
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
int main() {
    cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!

    int N = 1<<20;
    float *x, *y;
    cudaMallocManaged((void**)&x, N*sizeof(float));
    cudaMallocManaged((void**)&y, N*sizeof(float));

    for(int i = 0; i < N; i++){
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    add<<<1, 1>>>(N, x, y);
    cudaGetLastError();
    /**
     * This indicates that there is no kernel image available that is suitable
     * for the device. This can occur when a user specifies code generation
     * options for a particular CUDA source file that do not include the
     * corresponding device configuration.
     *
     * cudaErrorNoKernelImageForDevice = 209,
     */
    cudaDeviceSynchronize();

    float maxError = 0.0f;
    for (int i = 0; i < N; i++){
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    }

    cudaFree(x);
    cudaFree(y);
    return 0;
}
The error here comes about due to the fact that a CUDA kernel must be compiled in a way that the resulting code (PTX, or SASS) is compatible with the GPU that it is being run on. This is a topic with a lot of nuance, so please refer to questions like this (and the links there) for additional background.
The GPU architecture when we want to be precise is referred to as the compute capability. You can discover the compute capability of your GPU either with a google search or by running the deviceQuery CUDA sample code. The compute capability is expressed as (major).(minor) so something like compute capability 5.2, or 7.0, etc.
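You can also query it programmatically; a minimal sketch (compiled as a .cu file with nvcc):
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    printf("compute capability %d.%d\n", prop.major, prop.minor);
    return 0;
}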
When compiling code, it's necessary to specify a compute capability (or if not, a default compute capability will be implied). If you specify the compute capability when compiling in a way that matches your GPU, everything should be fine. However newer/higher compute capability code will generally not run on older/lower compute capability GPUs. In that case, you will see errors like what you describe:
cudaErrorNoKernelImageForDevice (209): "no binary for GPU"
or similar. You may also see no explicit error at all if you are not doing proper CUDA error checking. The solution is to match the compute capability specified at compile time with the GPU you intend to run on. The method to do this will vary depending on the toolchain/IDE you are using. For basic nvcc command line usage:
nvcc -arch=sm_XY ...
will specify a compute capability of X.Y
For Eclipse/Nsight Eclipse/Nsight Visual Studio, the compute capability can be specified in the project properties. Depending on the tool it may be expressed as switch values (e.g. compute_XY, sm_XY) or it may be expressed numerically as X.Y
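Applied to this question: the Quadro M2000M is, I believe, a compute capability 5.0 (Maxwell) device, so the basic command line would look something like this (file names here are placeholders):
nvcc -arch=sm_50 -o add add.cu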

Cuda kernel to compute squares of integers in an array

I am learning some basic CUDA programming. I am trying to initialize an array on the Host with host_a[i] = i. This array consists of N = 128 integers. I am launching a kernel with 1 block and 128 threads per block, in which I want to square the integer at index i.
My questions are:
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
The expected output of my program is a space-separated list of the squares of the integers:
1 4 9 16 ...
What's wrong with my code, given that it outputs 1 2 3 4 5 ... instead?
Code:
#include <iostream>
#include <numeric>
#include <stdio.h>  // for printf
#include <stdlib.h>
#include <cuda.h>

const int N = 128;

__global__ void f(int *dev_a) {
    unsigned int tid = threadIdx.x;
    if(tid < N) {
        dev_a[tid] = tid * tid;
    }
}
int main(void) {
    int host_a[N];
    int *dev_a;
    cudaMalloc((void**)&dev_a, N * sizeof(int));

    for(int i = 0 ; i < N ; i++) {
        host_a[i] = i;
    }

    cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);
    f<<<1, N>>>(dev_a);
    cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);

    for(int i = 0 ; i < N ; i++) {
        printf("%d ", host_a[i]);
    }
}
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
You can use printf in device code (as long as you #include <stdio.h>) on any compute capability 2.0 or higher GPU. Since CUDA 7 and CUDA 7.5 only support those types of GPUs, if you are using CUDA 7 or CUDA 7.5 (successfully) then you can use printf in device code.
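A minimal sketch of device-side printf, for illustration (the kernel name is hypothetical):
#include <stdio.h>

__global__ void hello() {
    printf("hello from thread %d\n", threadIdx.x); // printed from the device
}

int main() {
    hello<<<1, 4>>>();
    cudaDeviceSynchronize(); // also flushes the device-side printf buffer
    return 0;
}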
What's wrong with my code?
As identified in the comments, there is nothing "wrong" with your code, if run on a properly set up machine. To address your previous question "How do I come to know whether the kernel gets launched or not?", the best approach in my opinion is to use proper cuda error checking, which has numerous benefits besides just telling you whether your kernel launched or not. In this case it would also give a clue as to the failure being an improper CUDA setup on your machine. You can also run CUDA codes with cuda-memcheck as a quick test as to whether any runtime errors are occurring.
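Here is a minimal sketch of that kind of error checking, using a kernel like the one in the question; the macro name is just a common convention, not part of the CUDA API:
#include <stdio.h>
#include <stdlib.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s)\n", msg, cudaGetErrorString(__err)); \
            exit(1); \
        } \
    } while (0)

__global__ void f(int *dev_a) { dev_a[threadIdx.x] = threadIdx.x * threadIdx.x; }

int main() {
    int *dev_a;
    cudaMalloc(&dev_a, 128 * sizeof(int));
    cudaCheckErrors("cudaMalloc failed");
    f<<<1, 128>>>(dev_a);
    cudaCheckErrors("kernel launch failed");     // catches launch/configuration errors
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel execution failed");  // catches errors during execution
    return 0;
}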

Using "cuFFT Device Callbacks"

This is my first question, so I'll try to be as detailed as possible. I'm working on implementing a noise reduction algorithm in CUDA 6.5. My code is based on this Matlab implementation: http://pastebin.com/HLVq48C1.
I'd love to use the new cuFFT Device Callbacks feature, but I'm stuck on cufftXtSetCallback. Every time, my cufftResult is CUFFT_NOT_IMPLEMENTED (14). Even the example provided by NVIDIA fails the same way...
My device callback testing code:
__device__ void noiseStampCallback(void *dataOut,
                                   size_t offset,
                                   cufftComplex element,
                                   void *callerInfo,
                                   void *sharedPointer) {
    element.x = offset;
    element.y = 2;
    ((cufftComplex*)dataOut)[offset] = element;
}
__device__ cufftCallbackStoreC noiseStampCallbackPtr = noiseStampCallback;
CUDA part of my code:
cufftHandle forwardFFTPlan;//RtC
//find how many windows there are
int batch = targetFile->getNbrOfNoiseWindows();
size_t worksize;
cufftCreate(&forwardFFTPlan);
cufftMakePlan1d(forwardFFTPlan, WINDOW, CUFFT_R2C, batch, &worksize); //WINDOW = 2048
//host memory, allocate
float *h_wave;
cufftComplex *h_complex_waveSpec;
unsigned int m_num_real_elems = batch*WINDOW*2;
h_wave = (float*)malloc(m_num_real_elems * sizeof(float));
h_complex_waveSpec = (cufftComplex*)malloc((m_num_real_elems/2+1)*sizeof(cufftComplex));
//init
memset(h_wave, 0, sizeof(float) * m_num_real_elems); //last window won't probably be full of file data, so fill memory with 0
memset(h_complex_waveSpec, 0, sizeof(cufftComplex) * (m_num_real_elems/2+1));
targetFile->getNoiseFile(h_wave); //fill h_wave with samples from sound file
//device memory, allocate, copy from host
float *d_wave;
cufftComplex *d_complex_waveSpec;
cudaMalloc((void**)&d_wave, m_num_real_elems * sizeof(float));
cudaMalloc((void**)&d_complex_waveSpec, (m_num_real_elems/2+1) * sizeof(cufftComplex));
cudaMemcpy(d_wave, h_wave, m_num_real_elems * sizeof(float), cudaMemcpyHostToDevice);
//prepare callback
cufftCallbackStoreC hostNoiseStampCallbackPtr;
cudaMemcpyFromSymbol(&hostNoiseStampCallbackPtr,
                     noiseStampCallbackPtr,
                     sizeof(hostNoiseStampCallbackPtr));
cufftResult status = cufftXtSetCallback(forwardFFTPlan,
                                        (void **)&hostNoiseStampCallbackPtr,
                                        CUFFT_CB_ST_COMPLEX,
                                        NULL);
//always returns status 14 - CUFFT_NOT_IMPLEMENTED
//run forward plan
cufftResult result = cufftExecR2C(forwardFFTPlan, d_wave, d_complex_waveSpec);
//result seems to be okay without cufftXtSetCallback
I'm aware that I'm just a beginner in CUDA. My question is:
How can I call cufftXtSetCallback properly, and what is the cause of this error?
Referring to the documentation:
The callback API is available in the statically linked cuFFT library only, and only on 64 bit LINUX operating systems. Use of this API requires a current license. Free evaluation licenses are available for registered developers until 6/30/2015. To learn more please visit the cuFFT developer page.
I think you are getting the not implemented error because either you are not on a Linux 64 bit platform, or you are not explicitly linking against the CUFFT static library. The Makefile in the cufft callback sample will give the correct method to link.
Even if you fix that issue, you will likely run into a CUFFT_LICENSE_ERROR unless you have gotten one of the evaluation licenses.
Note that there are various device limitations as well for linking to the cufft static library. It should be possible to build a statically linked CUFFT application that will run on cc 2.0 and greater devices.
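For reference, a typical build for a statically linked callback application looks something like this; the architecture flag and file names are placeholders, and exact flags may vary by toolkit version:
nvcc -dc -arch=sm_35 -o noise.o -c noise.cu     # relocatable device code, required for callbacks
nvcc -arch=sm_35 -o noise noise.o -lcufft_static -lculibos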
A newer (2019) possibility is cuFFT device extensions (cuFFTDx). As part of the Math Library Early Access program, these are device-side FFT functions that can be inlined into user kernels.
Announcement of cuFFTDX:
https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9240-cuda-new-features-and-beyond.pdf
Math Library Early Access:
https://developer.nvidia.com/cuda-math-library-early-access-program-page
Example Code:
https://github.com/mnicely/cufft_examples