Why is the CUDA device count zero? - C++

I am writing a simple program where I try to get the device count.
#include <cuda.h>
#include <iostream>
int main(){
    CUcontext cudaContext;
    int deviceCount = 0;
    CUresult result = cuDeviceGetCount(&deviceCount);
    std::cout << "device count = " << deviceCount << std::endl;
}
Compile command: g++ test.cpp -lcuda
When I try to get the device count, I get zero even though I have a GPU. Or is it supposed to be zero?

You are using the CUDA driver API here.
A driver API program should start with cuInit(0);. If you don't do that, subsequent driver API calls will typically return error codes such as CUDA_ERROR_NOT_INITIALIZED.
You may want to study some of the CUDA driver API sample codes, such as vectorAddDrv.
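For illustration, a minimal sketch of the program with cuInit() and basic error checking added (the error-handling style here is just one option):
#include <cuda.h>
#include <iostream>
int main(){
    // The driver API must be initialized before any other driver API call.
    CUresult result = cuInit(0);
    if (result != CUDA_SUCCESS) {
        std::cout << "cuInit failed with error " << result << std::endl;
        return 1;
    }
    int deviceCount = 0;
    result = cuDeviceGetCount(&deviceCount);
    if (result != CUDA_SUCCESS) {
        std::cout << "cuDeviceGetCount failed with error " << result << std::endl;
        return 1;
    }
    std::cout << "device count = " << deviceCount << std::endl;
    return 0;
}
Compile the same way: g++ test.cpp -lcuda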

Related

cudaMemcpyAsync() not synchronizing after second kernel call

My goal is to set a host variable passed by reference into a cuda kernel:
// nvcc test_cudaMemcpyAsync.cu -rdc=true
#include <iostream>
__global__ void setHostVar(double& host_var) {
    double const var = 2.0;
    cudaMemcpyAsync(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
    // identifier "cudaMemcpy" is undefined in device code
    // cudaMemcpy(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
}
int main() {
    double host_var = 1.0;
    setHostVar<<<1, 1>>>(host_var);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << host_var << std::endl;
    setHostVar<<<1, 1>>>(host_var);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << host_var << std::endl;
    return 0;
}
Compile and run:
$ nvcc test_cudaMemcpyAsync.cu -rdc=true
$ ./a.out
Output:
host_var = 1
host_var = 1
The first output line host_var = 1 I can understand, given the asynchronous kernel call in addition to the asynchronous call to cudaMemcpyAsync(). However, I would have thought that the second kernel call is executed after the prior async calls complete, yet host_var remains unchanged.
Questions
What is incorrect about my expectations?
What is the best/better way to set a host variable passed by reference/pointer into a kernel?
Version
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
What is incorrect about my expectations?
If we ignore managed memory and host-pinned memory (i.e. if we focus on typical host memory, such as what you are using here), it's a fundamental principle in CUDA that device code cannot touch/modify/access host memory (except on Power9 processor platforms). A direct extension of this is that you cannot (with those provisos) pass a reference to a CUDA kernel and expect to do anything useful with it.
If you really want to pass a variable by reference it will be necessary to use either managed memory or host-pinned memory. These require particular allocators, so in practice you end up working with the pointers those allocators return rather than with references to ordinary stack variables.
In any event, unless you are on a Power9 platform, there is no way to pass a reference to host-based stack memory to a CUDA kernel and use it, sensibly.
If you'd like to see sensible usage of memory between host and device, study any of the CUDA sample codes.
What is the best/better way to set a host variable passed by reference/pointer into a kernel?
The closest thing that I would recommend to what you have shown here would look like this (using a host-pinned allocator):
$ cat t14.cu
#include <iostream>
__global__ void setHostVar(double *host_var) {
    double const var = 2.0;
    *host_var = var;
}
int main() {
    double *host_var_ptr;
    cudaHostAlloc(&host_var_ptr, sizeof(double), cudaHostAllocDefault);
    *host_var_ptr = 1.0;
    setHostVar<<<1, 1>>>(host_var_ptr);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << *host_var_ptr << std::endl;
    setHostVar<<<1, 1>>>(host_var_ptr);
    cudaDeviceSynchronize();
    std::cout << "host_var = " << *host_var_ptr << std::endl;
    return 0;
}
$ nvcc -o t14 t14.cu
$ cuda-memcheck ./t14
========= CUDA-MEMCHECK
host_var = 2
host_var = 2
========= ERROR SUMMARY: 0 errors
$
Although that may not adhere exactly to your request.
You may also be confused about how asynchronous is used in CUDA. Without trying to cover every aspect of the topic, CUDA kernels are launched asynchronously, meaning the CPU thread does not wait for the CUDA kernel to finish before proceeding. However cudaDeviceSynchronize() forces all previously issued work to that device to be complete before the CPU thread is allowed to proceed. That includes the kernel and anything involved with the kernel, such as data copying (however you do it) issued from kernel/device code. So we expect kernel activity to be complete/coherent after such a call.
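If you want something closer to the by-reference style in the question, a managed-memory variant is another option; a minimal sketch (assuming a GPU and driver that support managed memory):
#include <iostream>
__global__ void setVar(double *var) {
    *var = 2.0;
}
int main() {
    double *var;
    cudaMallocManaged(&var, sizeof(double));  // accessible from both host and device
    *var = 1.0;
    setVar<<<1, 1>>>(var);
    cudaDeviceSynchronize();                  // wait for the kernel before reading on the host
    std::cout << "var = " << *var << std::endl;
    cudaFree(var);
    return 0;
}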

OpenMP 4.5 won't offload to GPU with target directive

I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below, it shows me that it can see the 8 GPUs, but when I try to offload it says that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>
#define n 10000
#define m 10000
using namespace std;
int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();
    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;
    for (int iter = 0; iter < iter_max; iter++){
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j){
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++){
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }
    if (target[0]){
        cout << "not on GPU" << endl;
    } else {
        cout << "On GPU" << endl;
    }
    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time, target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with Nvidia's PGI compiler. I find PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free, and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
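As a rough sketch (the device id and the map clauses here are assumptions; the variable names are taken from the code in the question), the offloaded region might look like this:
    // Sketch only: device 0 and the map clauses are assumptions, not tested on the asker's cluster
    #pragma omp target device(0) map(tofrom: err, target[0:1], A, Anew)
    {
        err = 0.0;
        #pragma omp parallel for reduction(max:err)
        for (int j = 1; j < n-1; ++j){
            target[0] = omp_is_initial_device();
            for (int i = 1; i < m-1; i++){
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                err = fmax(err, fabs(Anew[j][i] - A[j][i]));
            }
        }
    }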

Running OpenCL under Windows 10

I want to run an OpenCL application under Windows 10 using my GTX 970 graphics card, but the following code doesn't work =(
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <CL/cl.h>
#include <vector>
#include <fstream>
#include <iostream>
#include <iomanip>
int main() {
    std::vector<cl::Platform> platforms;
    std::vector<cl::Device> devices;
    try
    {
        cl::Platform::get(&platforms);
        std::cout << platforms.size() << std::endl;
        for (cl_uint i = 0; i < platforms.size(); ++i)
        {
            platforms[i].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        }
        std::cout << devices.size() << std::endl;
    }
    catch (cl::Error e) {
        std::cout << std::endl << e.what() << " : " << e.err() << std::endl;
    }
    return 0;
}
It gives me error code -1. I am using Visual Studio 2015 Community Edition to launch it, with the NVIDIA CUDA SDK v8.0 installed and the paths configured, so the compiler and linker know about the SDK.
Can someone please explain what's wrong with this snippet?
Thanks in advance!
EDIT: Can someone also explain why, when I try to debug this code, it fails when getting the platform ID, but when I run it without debugging it prints that I have 2 platforms (my GPU card and an integrated GPU)?
Your iGPU is probably Intel (I assume you have a GTX 970 + Intel CPU combo for gaming), which also has some experimental OpenCL 2.1 platform support that can give an error for an OpenCL 1.2 app at device picking or platform picking (I had a similar problem).
You should check the error codes returned by OpenCL API calls. Those give better information about what happened.
For example, my system has two Intel platforms, one being an experimental 2.1 platform for the CPU only and one being a normal 1.2 platform for both the GPU and CPU.
To check that, query the platform version and compare the 7th and 9th characters of the returned string against '1' and '2' for 1.2, or '2' and '0' for 2.0. This should eliminate the experimental 2.1 platform, which gives '2' at the 7th char and '1' at the 9th char (where indexing starts at 0, of course).
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetPlatformInfo.html
CL_PLATFORM_VERSION
Check for both numpad number keycodes and left-hand-side number keycodes.
Nvidia must have 1.2 support already.
If I'm right, you may query for CPU devices and get 2 from Intel and 1 from Nvidia (if it has any) in return.
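For illustration, a minimal sketch of that version check using the C++ wrapper (the filtering logic here is an assumption; adapt it to the platforms on your machine):
#include <CL/cl.hpp>
#include <iostream>
#include <string>
#include <vector>
int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (cl::Platform& p : platforms) {
        // CL_PLATFORM_VERSION returns a string such as "OpenCL 1.2 CUDA 8.0.0"
        std::string version = p.getInfo<CL_PLATFORM_VERSION>();
        std::cout << p.getInfo<CL_PLATFORM_NAME>() << " : " << version << std::endl;
        // chars 7 and 9 (0-based) hold the major and minor version digits
        if (version.size() > 9 && version[7] == '1' && version[9] == '2') {
            std::cout << "  -> OpenCL 1.2 platform, safe for a 1.2 app" << std::endl;
        }
    }
    return 0;
}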

Weird OpenCL calls side effect on C++ for loop performance

I'm working on a C++ project using OpenCL. I'm using the CPU as an OpenCL device with the Intel OpenCL runtime.
I noticed a weird side effect in calling OpenCL functions. Here is a simple test:
#include <iostream>
#include <cstdio>
#include <vector>
#include <CL/cl.hpp>
int main(int argc, char* argv[])
{
    /*
    cl_int status;
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
    cl::Context context(devices);
    cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
    status = queue.finish();
    printf("Status: %d\n", status);
    */
    int ch;
    int b = 0;
    int sum = 0;
    FILE* f1;
    f1 = fopen(argv[1], "r");
    while((ch = fgetc(f1)) != EOF)
    {
        sum += ch;
        b++;
        if(b % 1000000 == 0)
            printf("Char %d read\n", b);
    }
    printf("Sum: %d\n", sum);
}
It's a simple loop that reads a file char by char and adds them so the compiler doesn't try to optimize it out.
My system is a Core i7-4770K with a 2TB HDD and 16GB of DDR3, running Ubuntu 14.10. The program above, with a 100MB file as input, takes around 770ms. This is consistent with my HDD speed. So far so good.
If you now invert the comments and run only the OpenCL calls region, it takes around 200ms. Again, so far, so good.
But if you uncomment everything, the program takes more than 2000ms. I would expect 770ms + 200ms, but it is 2000ms. You can even notice an increased delay between the output messages in the loop. The two regions (OpenCL calls and reading chars) are supposed to be independent.
I don't understand why using OpenCL interferes with a simple C++ for loop performance. It's not a simple OpenCL initialization delay.
I'm compiling this example with:
g++ weird.cpp -O2 -lOpenCL -o weird
I also tried Clang++, but the same thing happens.
This was an interesting one. It's because getc switches to its thread-safe version at the point when the queue is instantiated, so the time increase is the grab/release cycle of the locks. I'm not sure why/how this occurs, but that is the decisive point on the AMD OpenCL SDK with Intel CPUs. I was quite amazed that I got essentially the same times as the OP.
https://software.intel.com/en-us/forums/topic/337984
You can try a remedy for this specific problem by just changing getc to getc_unlocked.
It brought it back down to 930 ms for me; the remaining increase over 750 ms is mainly spent in the platform and context creation lines.
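For reference, the change amounts to swapping the read call in the loop; a sketch (getc_unlocked is POSIX, so this assumes a Linux/glibc environment):
    // Same loop as in the question, but with the non-locking getc_unlocked
    while((ch = getc_unlocked(f1)) != EOF)
    {
        sum += ch;
        b++;
        if(b % 1000000 == 0)
            printf("Char %d read\n", b);
    }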
I believe that the effect is caused by the OpenCL objects still being in scope, and therefore not being deleted before the for loop. While they remain alive they may be affecting the other computation. For example, running the example as you gave it yields the following times on my system (g++ 4.2.1 with O2 on Mac OS X):
CL: 0.012s
Loop: 14.447s
Both: 14.874s
But putting the OpenCL code into its own anonymous scope, thereby automatically calling the destructors before the loop, seems to get rid of the problem. Using the code:
#include <iostream>
#include <cstdio>
#include <vector>
#include "cl.hpp"
int main(int argc, char* argv[])
{
    {
        cl_int status;
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        std::vector<cl::Device> devices;
        platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
        cl::Context context(devices);
        cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
        status = queue.finish();
        printf("Status: %d\n", status);
    }
    int ch;
    int b = 0;
    int sum = 0;
    FILE* f1;
    f1 = fopen(argv[1], "r");
    while((ch = fgetc(f1)) != EOF)
    {
        sum += ch;
        b++;
        if(b % 1000000 == 0)
            printf("Char %d read\n", b);
    }
    printf("Sum: %d\n", sum);
}
I get the timings:
CL: 0.012s
Loop: 14.635s
Both: 14.648s
Which seems to add linearly. The effect is pretty small compared to other effects on the system, such as CPU load from other processes, but it seems to be gone when adding the anonymous scope. I'll do some profiling and add it as an edit if it produces anything of interest.

VS program crashes in debug but not release mode?

I am running the following program in VS 2012 to try out the Thrust function find:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/find.h>
#include <thrust/device_vector.h>
#include <stdio.h>
int main() {
thrust::device_vector<char> input(4);
input[0] = 'a';
input[1] = 'b';
input[2] = 'c';
input[3] = 'd';
thrust::device_vector<char>::iterator iter;
iter = thrust::find(input.begin(), input.end(), 'a');
std::cout << "Index of a = " << iter - input.begin() << std::endl;
return 0;
}
This is a modified version of a code example taken from http://docs.thrust.googlecode.com/hg/group__searching.html#ga99c7a59cef5b9f4cdbc70f37b2e221be
When I run this in Debug mode, my program crashes and I get the error Debug Error! ... R6010 - abort() has been called. However, running this in Release mode I just get my expected output Index of a = 0.
I know that the crash happens because of the line that calls the find function.
What might cause this to happen?
There are a few similar questions, e.g. here.
To quote a comment : "Thrust is known to not compile and run correctly when built for debugging"
And from the docs: "nvcc does not support device debugging Thrust code. Thrust functions compiled with device debugging enabled (e.g., nvcc -G, nvcc --device-debug 0, etc.) will likely crash."
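As a practical workaround, you can keep device debug info off for the Thrust translation unit even in a Debug build; for example, on the command line (the file name here is just an assumption):
nvcc -O2 -lineinfo thrust_find.cu -o thrust_find    # use -lineinfo instead of -G so Thrust code is not built with device debugging
In Visual Studio, the equivalent is typically turning off the "Generate GPU Debug Information" (-G) setting under the CUDA C/C++ project properties for that file.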