My goal is to set a host variable that is passed by reference into a CUDA kernel:
// nvcc test_cudaMemcpyAsync.cu -rdc=true
#include <iostream>

__global__ void setHostVar(double& host_var) {
    double const var = 2.0;
    cudaMemcpyAsync(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
    // identifier "cudaMemcpy" is undefined in device code
    // cudaMemcpy(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
}

int main() {
    double host_var = 1.0;

    setHostVar<<<1, 1>>>(host_var);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << host_var << std::endl;

    setHostVar<<<1, 1>>>(host_var);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << host_var << std::endl;

    return 0;
}
Compile and run:
$ nvcc test_cudaMemcpyAsync.cu -rdc=true
$ ./a.out
Output:
host_var = 1
host_var = 1
I can understand the first output line host_var = 1, given the asynchronous kernel launch in addition to the asynchronous call to cudaMemcpyAsync(). However, I would have thought that the second kernel call executes after the prior async calls complete, yet host_var remains unchanged.
Questions
What is incorrect about my expectations?
What is the best/better way to set a host variable passed by reference/pointer into a kernel?
Version
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
What is incorrect about my expectations?
If we ignore managed memory and host-pinned memory (i.e. if we focus on typical host memory, such as what you are using here), it's a fundamental principle in CUDA that device code cannot touch/modify/access host memory (except on Power9 processor platforms). A direct extension of this is that you cannot (with those provisos) pass a reference to a CUDA kernel and expect to do anything useful with it.
If you really want to pass a variable by reference, it will be necessary to use either managed memory or host-pinned memory. Both require particular allocators, and those allocators hand you a pointer, so in practice you end up passing a pointer rather than a C++ reference.
In any event, unless you are on a Power9 platform, there is no way to pass a reference to host-based stack memory to a CUDA kernel and use it, sensibly.
If you'd like to see sensible usage of memory between host and device, study any of the CUDA sample codes.
What is the best/better way to set a host variable passed by reference/pointer into a kernel?
The closest thing that I would recommend to what you have shown here would look like this (using a host-pinned allocator):
$ cat t14.cu
#include <iostream>

__global__ void setHostVar(double *host_var) {
    double const var = 2.0;
    *host_var = var;
}

int main() {
    double *host_var_ptr;
    cudaHostAlloc(&host_var_ptr, sizeof(double), cudaHostAllocDefault);
    *host_var_ptr = 1.0;

    setHostVar<<<1, 1>>>(host_var_ptr);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << *host_var_ptr << std::endl;

    setHostVar<<<1, 1>>>(host_var_ptr);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << *host_var_ptr << std::endl;

    return 0;
}
$ nvcc -o t14 t14.cu
$ cuda-memcheck ./t14
========= CUDA-MEMCHECK
host_var = 2
host_var = 2
========= ERROR SUMMARY: 0 errors
$
Although that may not adhere exactly to your request.
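If you prefer managed memory over the pinned allocator, a minimal sketch of the same idea (my variant, not from the original answer) would be:

#include <iostream>

__global__ void setVar(double *var) {
    *var = 2.0;  // device writes directly through the managed pointer
}

int main() {
    double *var;
    cudaMallocManaged(&var, sizeof(double));  // visible to host and device
    *var = 1.0;
    setVar<<<1, 1>>>(var);
    cudaDeviceSynchronize();  // required before the host reads the result
    std::cout << "var = " << *var << std::endl;
    cudaFree(var);
    return 0;
}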
You may also be confused about how "asynchronous" is used in CUDA. Without trying to cover every aspect of the topic: CUDA kernels are launched asynchronously, meaning the CPU thread does not wait for the kernel to finish before proceeding. However, cudaDeviceSynchronize() forces all previously issued work to that device to complete before the CPU thread is allowed to proceed. That includes the kernel and anything involved with the kernel, such as data copying (however you do it) issued from kernel/device code. So we expect kernel activity to be complete/coherent after such a call.
Operating System: CentOS 7
Cuda Toolkit Version: 11.0
Nvidia Driver and GPU Info:
NVIDIA-SMI 450.51.05
Driver Version: 450.51.05
CUDA Version: 11.0
GPU: Quadro M2000M
[screenshot of nvidia-smi details]
I'm very new to CUDA programming, so any guidance is extremely appreciated. I have a very simple CUDA C++ program that computes the sum of two arrays in unified memory on the GPU. However, it appears that the kernel fails to launch due to a cudaErrorNoKernelImageForDevice error. The code is below:
#include <iostream>
#include <math.h>
#include <cuda_runtime_api.h>

using namespace std;

__global__
void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main() {
    cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!

    int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged((void**)&x, N*sizeof(float));
    cudaMallocManaged((void**)&y, N*sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    add<<<1, 1>>>(N, x, y);
    cudaGetLastError();
    /**
     * This indicates that there is no kernel image available that is suitable
     * for the device. This can occur when a user specifies code generation
     * options for a particular CUDA source file that do not include the
     * corresponding device configuration.
     *
     * cudaErrorNoKernelImageForDevice = 209,
     */
    cudaDeviceSynchronize();

    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    }

    cudaFree(x);
    cudaFree(y);
    return 0;
}
The error arises because a CUDA kernel must be compiled so that the resulting code (PTX or SASS) is compatible with the GPU it runs on. This is a topic with a lot of nuance, so please refer to questions like this one (and the links there) for additional background.
The GPU architecture, when we want to be precise, is referred to as the compute capability. You can discover the compute capability of your GPU either with a Google search or by running the deviceQuery CUDA sample code. The compute capability is expressed as (major).(minor), so something like compute capability 5.2, or 7.0, etc.
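If you prefer to query the compute capability programmatically rather than look it up, here is a minimal sketch using the runtime API (my addition, not part of the original answer):

#include <cstdio>

int main() {
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);                  // currently selected device
    cudaGetDeviceProperties(&prop, device);  // fills in the major/minor fields
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}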
When compiling code, it's necessary to specify a compute capability (and if you don't, a default compute capability will be implied). If the compute capability you specify at compile time matches your GPU, everything should be fine. However, newer/higher compute capability code will generally not run on older/lower compute capability GPUs. In that case, you will see errors like the one you describe:
cudaErrorNoKernelImageForDevice = 209, "no binary for GPU"
or similar. You may also see no explicit error at all if you are not doing proper CUDA error checking. The solution is to match the compute capability specified at compile time with the GPU you intend to run on. The method to do this will vary depending on the toolchain/IDE you are using. For basic nvcc command line usage:
nvcc -arch=sm_XY ...
will specify a compute capability of X.Y.
For Eclipse/Nsight Eclipse/Nsight Visual Studio, the compute capability can be specified in the project properties. Depending on the tool it may be expressed as switch values (e.g. compute_XY, sm_XY) or it may be expressed numerically as X.Y
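As an aside, the Quadro M2000M in the question above is a Maxwell-generation GPU with compute capability 5.0, so compiling with nvcc -arch=sm_50 should produce a matching kernel image. Independent of the architecture flags, rigorous CUDA error checking makes this class of failure visible immediately; a minimal sketch of one common pattern (my addition, not part of the original answer):

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; report and abort on the first failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage after a kernel launch (kernel and arguments as in the question):
//   add<<<1, 1>>>(N, x, y);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors such as 209
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution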
When I run the following piece of code, the reported total amount of memory on my GPU changes (according to cudaMemGetInfo, anyway). This behavior is not mentioned in the documentation for cudaMemGetInfo, which says that total should contain the total amount of memory on the device that can be allocated (which cannot change without putting a different GPU in my system, right?). Can somebody explain why this is happening? It does not seem to happen when I don't call cudaMallocManaged.
#include <iostream>

void printStats()
{
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    std::cout << "free: " << free << "\ntotal: " << total << std::endl;
}

int main(void)
{
    // Before memory allocation
    printStats();

    int N = 1;
    float *x, *y;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // After memory allocation.
    printStats();

    cudaFree(x);
    cudaFree(y);

    // After freeing the memory.
    printStats();

    return 0;
}
result:
free: 94383273356630
total: 20
free: 5661994326
total: 4
free: 140729276827856
total: 94383273355680
It turns out that when your CUDA driver version is insufficient for your CUDA runtime version, this sort of undefined behavior happens. For anyone else running into this problem, I recommend checking the output of cudaGetLastError(); that's how I discovered what the problem was.
I fixed it by downgrading my CUDA version to 10.1, since that was the newest version my driver supports. (You can check your CUDA and driver versions using the nvidia-smi tool.)
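To illustrate that suggestion, a minimal sketch (my addition) that would have surfaced the mismatch right after the first allocation:

#include <cstdio>

int main() {
    float *x = nullptr;
    cudaMallocManaged(&x, sizeof(float));
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        // A driver too old for the runtime typically reports
        // cudaErrorInsufficientDriver here.
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(x);
    return 0;
}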
I ran the code below to check the performance difference between GPU and CPU usage. I am calculating the average time for the cv::cvtColor() function. I make four function calls:

just_mat() (without using OpenCL, for a Mat object)
just_umat() (without using OpenCL, for a UMat object)
opencl_mat() (using OpenCL, for a Mat object)
opencl_umat() (using OpenCL, for a UMat object)

for both CPU and GPU.
I did not find a huge performance difference between GPU and CPU usage.
int main(int argc, char* argv[])
{
    loc = argv[1];
    just_mat(loc);  // calling function without OpenCL
    just_umat(loc); // calling function without OpenCL

    cv::ocl::Context context;
    std::vector<cv::ocl::PlatformInfo> platforms;
    cv::ocl::getPlatfomsInfo(platforms);
    for (size_t i = 0; i < platforms.size(); i++)
    {
        // Access to platform
        const cv::ocl::PlatformInfo* platform = &platforms[i];

        // Platform name
        std::cout << "Platform Name: " << platform->name().c_str() << "\n" << endl;

        // Access device within platform
        cv::ocl::Device current_device;
        for (int j = 0; j < platform->deviceNumber(); j++)
        {
            // Access device
            platform->getDevice(current_device, j);
            int deviceType = current_device.type();
            cout << "Device name: " << current_device.name() << endl;
            if (deviceType == 2)
                cout << context.ndevices() << " CPU devices are detected." << std::endl;
            if (deviceType == 4)
                cout << context.ndevices() << " GPU devices are detected." << std::endl;
            cout << "===============================================" << endl << endl;

            switch (deviceType)
            {
            case (1 << 1):
                cout << "CPU device\n";
                if (context.create(deviceType))
                    opencl_mat(loc); // with OpenCL, Mat
                break;
            case (1 << 2):
                cout << "GPU device\n";
                if (context.create(deviceType))
                    opencl_mat(loc); // with OpenCL, UMat
                break;
            }
            cin.ignore(1);
        }
    }
    return 0;
}
int just_mat(string loc);   // average time taken for cvtColor() without using OpenCL, Mat object
int just_umat(string loc);  // average time taken for cvtColor() without using OpenCL, UMat object
int opencl_mat(string loc); // ocl::setUseOpenCL(true); then time cvtColor(), Mat object
int opencl_umat(string loc);// ocl::setUseOpenCL(true); then time cvtColor(), UMat object
The output (in milliseconds) for the above code is:
| Device (GPU) | With OpenCL, Mat | With OpenCL, UMat |
|--------------|------------------|-------------------|
| Carrizo      | 7.69052          | 0.247069          |
| Island       | 7.12455          | 0.233345          |

| Device (CPU) | With OpenCL, Mat | With OpenCL, UMat |
|--------------|------------------|-------------------|
| AMD          | 6.76169          | 0.231103          |

| Device (CPU) | Without OpenCL, Mat | Without OpenCL, UMat |
|--------------|---------------------|----------------------|
| AMD          | 7.15959             | 0.246138             |
In the code, using a Mat object always runs on the CPU and using a UMat object always runs on the GPU, irrespective of ocl::setUseOpenCL(true/false).
Can anybody explain the reason for all the output time variation?
One more question: I didn't use any OpenCL-specific .dll with the .exe file, and yet the GPU was used without any error. While building OpenCV with CMake I checked WITH_OPENCL; did this build all the required OpenCL functionality into opencv_World310.dll?
In the code, using a Mat object always runs on the CPU and using a UMat object always runs on the GPU, irrespective of ocl::setUseOpenCL(true/false).
I'm sorry, because I'm not sure if this is a question or a statement... in either case it's partially true. In 3.0, for UMat, if you don't have a dedicated GPU then OpenCV just runs everything on the CPU. If you specifically ask for Mat, you get it on the CPU. And in your case you have directed both to run on each of your GPUs/CPU by selecting each specifically (more on "choosing a CPU" below)... read this:
Few design choices support the new architecture:

A unified abstraction cv::UMat that enables the same APIs to be implemented using CPU or OpenCL code, without a requirement to call OpenCL-accelerated version explicitly. These functions use an OpenCL-enabled GPU if exists in the system, and automatically switch to CPU operation otherwise.

The UMat abstraction enables functions to be called asynchronously. Unlike the cv::Mat of the OpenCV version 2.x, access to the underlying data for the cv::UMat is performed through a method of class, and not through its data member. Such an approach enables the implementation to explicitly wait for GPU completion only when CPU code absolutely needs the result.

The UMat implementation makes use of CPU-GPU shared physical memory available on Intel SoCs, including allocations that come from pointers passed into OpenCV.
I think there also might be a misunderstanding about "using OpenCL". When you use a UMat, you are specifically trying to use the GPU. And, I'll plead some ignorance here: as a result, I believe that CV is using some of the CL library to make that happen automatically... as an aside, in 2.x we had cv::ocl to specifically/manually do this, so be careful if you are using that 2.x legacy code in 3.x. There are reasons to do it, but they are not always straightforward. But, back on topic, when you say,
with OpenCL UMat
you are potentially being redundant. The CL code you have in your snippet is basically finding out what equipment is installed, how many devices there are, what their names are, and choosing which one to use... I'd have to dig through the way it is instantiated, but perhaps when you make it a UMat it automatically sets OpenCL to true? (link) That would definitely support the data you presented. You could probably test that idea by checking the state reported by cv::ocl::useOpenCL() after you set it to false and then use a UMat.
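A minimal sketch of that test (my addition; assumes OpenCV 3.x with the core and imgproc modules available):

#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <opencv2/imgproc.hpp>
#include <iostream>

int main() {
    cv::ocl::setUseOpenCL(false);                // explicitly disable OpenCL
    cv::UMat src(480, 640, CV_8UC3, cv::Scalar(0, 0, 255)), dst;
    cv::cvtColor(src, dst, cv::COLOR_BGR2GRAY);  // exercise the UMat path
    std::cout << "OpenCL in use after UMat op: " << std::boolalpha
              << cv::ocl::useOpenCL() << std::endl;
    return 0;
}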
Finally, I'm guessing your CPU has a built-in GPU. So it is running parallel processing with OpenCL and not paying a time penalty to travel to the separate/dedicated GPU and back, hence your perceived performance increase over the GPUs (since it is not technically the CPU running it)... only when you are specifically using the Mat is the CPU alone being used.
As for your last question, I'm not sure... this is my speculation: the OpenCL architecture exists on the GPU; when you install CV with CL, you are installing the link between the two libraries and the associated header files. I'm not sure which .dll files you need to make that magic happen.
I am trying to learn CUDA and use it in an efficient way. I found code on NVIDIA's website which shows how to determine the block size we should use for the most efficient usage of the device. The code is as follows:
#include <iostream>

// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

// Host code
int main()
{
    int numBlocks;      // Occupancy in terms of active blocks
    int blockSize = 32;

    // These variables are used to convert occupancy to warps
    int device;
    cudaDeviceProp prop;
    int activeWarps;
    int maxWarps;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        MyKernel,
        blockSize,
        0);

    activeWarps = numBlocks * blockSize / prop.warpSize;
    maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

    return 0;
}
However, when I compile it, I get the following error.
Compile line:
nvcc ben_deneme2.cu -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o my
Error:
ben_deneme2.cu(25): error: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000623d_00000000-8_ben_deneme2.cpp1.ii".
Should I include a library for this, though I could not find a library name for this on the internet? Or am I doing something else wrong?
Thanks in advance
The cudaOccupancyMaxActiveBlocksPerMultiprocessor function was introduced in CUDA 6.5. You do not have access to that function if you have an earlier version of CUDA installed; for example, it will not work with CUDA 5.5.
If you want to use that function, you must update your CUDA version to at least 6.5.
People using older versions usually use the CUDA Occupancy Calculator.
One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. -- CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
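Once you are on CUDA 6.5 or newer, a related convenience worth knowing is cudaOccupancyMaxPotentialBlockSize, which suggests a block size directly. A minimal sketch (my addition, reusing MyKernel and its arguments from the question's code):

    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);

    int N = 1 << 20;  // hypothetical problem size
    int gridSize = (N + blockSize - 1) / blockSize;  // round the grid up
    MyKernel<<<gridSize, blockSize>>>(d, a, b);      // d, a, b allocated elsewhere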
I just started to play with Boost.Compute. To see how much speedup it can bring us, I wrote a simple program:
#include <iostream>
#include <vector>
#include <algorithm>
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>

namespace compute = boost::compute;

int main()
{
    // generate random data on the host
    std::vector<float> host_vector(16000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    BOOST_FOREACH (auto const& platform, compute::system::platforms())
    {
        std::cout << "====================" << platform.name() << "====================\n";
        BOOST_FOREACH (auto const& device, platform.devices())
        {
            std::cout << "device: " << device.name() << std::endl;
            compute::context context(device);
            compute::command_queue queue(context, device);
            compute::vector<float> device_vector(host_vector.size(), context);

            // copy data from the host to the device
            compute::copy(
                host_vector.begin(), host_vector.end(), device_vector.begin(), queue
            );

            auto start = boost::chrono::high_resolution_clock::now();
            compute::transform(device_vector.begin(),
                               device_vector.end(),
                               device_vector.begin(),
                               compute::sqrt<float>(), queue);
            auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
            auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);

            std::cout << "ans: " << ans << std::endl;
            std::cout << "time: " << duration.count() << " ms" << std::endl;
            std::cout << "-------------------\n";
        }
    }

    std::cout << "====================plain====================\n";
    auto start = boost::chrono::high_resolution_clock::now();
    std::transform(host_vector.begin(),
                   host_vector.end(),
                   host_vector.begin(),
                   [](float v){ return std::sqrt(v); });
    auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
    auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
    std::cout << "ans: " << ans << std::endl;
    std::cout << "time: " << duration.count() << " ms" << std::endl;

    return 0;
}
And here's the sample output on my machine (win7 64-bit):
====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms
My question is: why is the plain (non-opencl) version faster?
As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you're being limited by kernel compilation time and transfer time to the GPU).
To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first one as that will be far greater because it includes the time to compile and store the kernels).
Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm which performs both the transform and reduction with a single kernel. The code would look like this:
float ans = 0;
compute::transform_reduce(
device_vector.begin(),
device_vector.end(),
&ans,
compute::sqrt<float>(),
compute::plus<float>(),
queue
);
std::cout << "ans: " << ans << std::endl;
You can also compile code using Boost.Compute with -DBOOST_COMPUTE_USE_OFFLINE_CACHE, which enables the offline kernel cache (this requires linking with boost_filesystem). Then the kernels you use will be stored in your file system and only be compiled the very first time you run your application (NVIDIA on Linux already does this by default).
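For example (my assumption about a typical Linux build, not from the original answer), a compile line with the cache enabled might look like:

g++ -DBOOST_COMPUTE_USE_OFFLINE_CACHE main.cpp -lOpenCL -lboost_filesystem -lboost_system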
I can see one possible reason for the big difference. Compare the CPU and the GPU data flow:

CPU                  GPU
                     copy data to GPU
                     set up compute code
calculate sqrt       calculate sqrt
sum                  sum
                     copy data from GPU
Given this, it appears that the Intel chip is just a bit rubbish at general compute, while the NVIDIA is probably suffering from the extra data copying and the setup needed for the GPU to do the calculation.
You should try the same program with a much more complex operation; sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.
In your example, moving the lambda into the accumulate would be faster (one pass over memory vs. two passes).
You're getting bad results because you're measuring time incorrectly.
An OpenCL device has its own time counters, which aren't related to host counters. Every OpenCL task has 4 states whose timestamps can be queried (from the Khronos web site):
CL_PROFILING_COMMAND_QUEUED, when the command identified by event is enqueued in a command-queue by the host
CL_PROFILING_COMMAND_SUBMIT, when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
CL_PROFILING_COMMAND_START, when the command identified by event starts execution on the device.
CL_PROFILING_COMMAND_END, when the command identified by event has finished execution on the device.
Take into account that these timers are device-side. So, to measure kernel and command-queue performance, you can query them; in your case, the last two timestamps are the ones you need.
In your sample code, you're measuring host time, which includes data transfer time (as Skizz said) plus all the time spent on command-queue maintenance.
So, to learn the actual kernel performance, you need either to get hold of the cl_event for the kernel launch (I have no idea how to do that in boost::compute) and query that event for the profiling counters, or to make your kernel really huge and complicated so that it hides all the overheads.
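For reference, here is a minimal sketch of that query in the raw OpenCL C API (my addition; queue, kernel, and global_size are hypothetical names, and the command queue must have been created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>

// ... after platform/device/queue/kernel setup ...
cl_event evt;
size_t global_size = 16000;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                       NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
double kernel_ms = (end - start) * 1e-6;  // timestamps are in nanoseconds
clReleaseEvent(evt);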