I'm writing a CUDA program that to be run on thousands of different GPUs, those machine would have different version of display driver installed, I cannot force them to update to the latest driver. Actually most code runs fine on those 'old' machine, but fails with some particular code:
Here's the problem:
#include <stdio.h>
#include <cuda.h>
#include <cuda_profiler_api.h>
__global__
void test()
{
unsigned i = 64;
unsigned j = 192;
int k = 7;
for(j = 1 << (k - 1); i &j; j >>= 1)
i ^= j;
i ^= j;
printf("i,j,k: %d,%d,%d\n", i,j,k);
// i,j,k: 32,32, 7 (correct)
// i,j,k: 0, 64, 7 (wrong)
}
int main() {
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
test<<<1,1>>>();
}
The code prints 32,32,7 as result on GPU with latest driver, which is the correct result. But on old driver(lower than CUDA 6.5) it prints 0,64,7 .
I'm looking for any workaround for this.
Envoronment:
Developing: Win7-32bit, VS2013, CUDA 6.5
Corrent Result on: WinXP-32bit(and Win7-32bit), GTX-650(latest driver)
Wrong Result on: WinXP-32bit + GTX-750-Ti(old driver), WinXP-32bit + GTX-750(old driver)
There is no workaround. The runtime API is versioned and the minimum driver version requirement is non-negotiable.
Your only two choices are to develop using the lowest common denominator toolkit version that supports the driver being used, or switch to the driver API.
Got a very slow solution: use local memory rather than register variable.
just add volatile keyword before i,j
volatile unsigned i = 64;
volatile unsigned j = 192;
Related
I am trying to compile and run the following program called test.cu:
#include <iostream>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
// Kernel function to add the elements of two arrays
__global__
void add(int n, float* x, float* y)
{
int index = threadIdx.x;
int stride = blockDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
int main(void)
{
int N = 1 << 20;
float* x, * y;
// Allocate Unified Memory – accessible from CPU or GPU
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 2.0f;
y[i] = 1.0f;
}
// Run kernel on 1M elements on the GPU
add <<<1, 256>>> (N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Check for errors (all values should be 3.0f)
for (int i = 0; i < 10; i++)
std::cout << y[i] << std::endl;
// Free memory
cudaFree(x);
cudaFree(y);
return 0;
}
I am using visual studio comunity 2019 and it marks the "add <<<1, 256>>> (N, x, y);" line as having an expected an expression error. I tried compiling it and somehow it compiles without mistakes, but when running the .exe file it outputs a bunch of "1" instead of the expected "3".
I also tried compiling using "nvcc test.cu", but initially it said "nvcc fatal : Cannot find compiler 'cl.exe' in PATH", so i added "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\Hostx64\x64" to path and now compiling with nvcc gives the same mistake as compiling with visual studio.
In both cases the program never enter the "add" function.
I am pretty sure the code is right and the problem has something to do with the installation, but i already tried reinstalling cuda toolkit and repairing MCVS, but it didn't work.
The kernel.cu exemple that appears when starting a new project with cuda in visual studio also didn't work. When running it outputted "No kernel image available for execution on the device".
How can is solve this?
nvcc version if that helps:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:35_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28845127_0
Visual Studio provides IntelliSense for C++. In the C++ language, the proper parsing of angle brackets is troublesome. You've got < as less than and for templates, and << as shift. So, the fact is that the guys at NVIDIA choose the worst possible delimiter <<<>>>. This makes Intellisense difficult to work properly. The way to get full IntelliSense in CUDA is to switch from the Runtime API to the Driver API. The C++ is just C++, and the CUDA is still (sort of) C++, there is no <<<>>> badness for the language parsing to have to work around.
You could take a look at the difference between matrixMul and matrixMulDrv. The <<<>>> syntax is handled by the compiler essentially just spitting out code that calls the Driver API calls. You'll link to cuda.lib not cudart.lib, and may have to deal with a "mixed mode" program if you use CUDA-RT only libraries. You could refer to this link for more information.
Also, this link tells how to add Intellisense for CUDA in VS.
Operating System: CentOS 7
Cuda Toolkit Version: 11.0
Nvidia Driver and GPU Info:
NVIDIA-SMI 450.51.05
Driver Version: 450.51.05
CUDA Version: 11.0
GPU: Quadro M2000M
screenshot of nvidia-smi details
I'm very new to cuda programming so any guidance is extremely appreciated. I have a very simple cuda c++ program that computes the sum of two arrays in unified memory on the GPU. However, it appears that the kernel fails to launch due to a cudaErrorNoKernelImageForDevice error. The code is below:
using namespace std;
#include <iostream>
#include <math.h>
#include <cuda_runtime_api.h>
__global__
void add(int n, float *x, float*y){
for (int i = 0; i < n; i++)
y[i] = x[i] + y[i];
}
int main() {
cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
int N = 1<<20;
float *x, *y;
cudaMallocManaged((void**)&x, N*sizeof(float));
cudaMallocManaged((void**)&y, N*sizeof(float));
for(int i = 0; i < N; i++){
x[i] = 1.0f;
y[i] = 2.0f;
}
add<<<1, 1>>>(N, x, y);
cudaGetLastError();
/**
* This indicates that there is no kernel image available that is suitable
* for the device. This can occur when a user specifies code generation
* options for a particular CUDA source file that do not include the
* corresponding device configuration.
*
* cudaErrorNoKernelImageForDevice = 209,
*/
cudaDeviceSynchronize();
float maxError = 0.0f;
for (int i = 0; i < N; i++){
maxError = fmax(maxError, fabs(y[i]-3.0f));
}
cudaFree(x);
cudaFree(y);
return 0;
}
The error here comes about due to the fact that a CUDA kernel must be compiled in a way that the resulting code (PTX, or SASS) is compatible with the GPU that it is being run on. This is a topic with a lot of nuance, so please refer to questions like this (and the links there) for additional background.
The GPU architecture when we want to be precise is referred to as the compute capability. You can discover the compute capability of your GPU either with a google search or by running the deviceQuery CUDA sample code. The compute capability is expressed as (major).(minor) so something like compute capability 5.2, or 7.0, etc.
When compiling code, it's necessary to specify a compute capability (or if not, a default compute capability will be implied). If you specify the compute capability when compiling in a way that matches your GPU, everything should be fine. However newer/higher compute capability code will generally not run on older/lower compute capability GPUs. In that case, you will see errors like what you describe:
cudaErrorNoKernelImageForDevice
209
"no binary for GPU"
or similar. You may also see no explicit error at all if you are not doing proper CUDA error checking. The solution is to match the compute capability specified at compile time with the GPU you intend to run on. The method to do this will vary depending on the toolchain/IDE you are using. For basic nvcc command line usage:
nvcc -arch=sm_XY ...
will specify a compute capability of X.Y
For Eclipse/Nsight Eclipse/Nsight Visual Studio, the compute capability can be specified in the project properties. Depending on the tool it may be expressed as switch values (e.g. compute_XY, sm_XY) or it may be expressed numerically as X.Y
After learning from a previous question that my VS 2017 C++ AMP project was basically sound, that the error messages while correct were masking the real problem, and that the issue was certain lines of code, I rewrote the code as below. By commenting out various lines at a time, I learned that
extent<2> e(M,N);
index<2> idx(0,0);
will build and execute, that code like
array_view<int, 2> c(e, vC);
for (idx[0] = 0; idx[0] < e[0]; idx[0]++)
will build but will throw an exception if run, and that code like
c[idx] = a[idx] + b[idx];
will not even build. Note that I have not as yet invoked any parallel functions.
This leads me to ask: does Concurrency Runtime or C++ AMP require that GPU hardware be installed to build and/or execute properly?
My machine has two multi-core CPU processors, but the GPU hardware hasn't been installed yet. Still, I thought I would be able to use the the parallelism constructs to take advantage of the processors I do have.
#include "pch.h"
#include <iostream>
#include "amp.h"
#include <vector>
using namespace Concurrency;
int main() {
const int M = 1024; const int N = 1024; //row, col for vector
std::vector<int> vA(M*N); std::vector<int> vB(M*N); //vectors to add
std::vector<int> vC(M*N); //vector for result
for (int i = 0; i < M; i++) { vA[i] = i; } //populate vectors
for (int j = N - 1; j >= 0; j--) { vB[j] = j; }
extent<2> e(M, N); //uses AMP constructs but
index<2> idx(0, 0); //no parallel functions invoked
array_view<int, 2> a(e, vA), b(e, vB);
array_view<int, 2> c(e, vC);
for (idx[0] = 0; idx[0] < e[0]; idx[0]++) {
for (idx[1] = 0; idx[1] < e[1]; idx[1]++) {
c[idx] = a[idx] + b[idx];
c(idx[0], idx[1]) = a(idx[0], idx[1]) + b(idx[0], idx[1]);
}
}
}
No, GPU hardware is not required. After starting a successfully compiled program, without GPU hardware, the system created a "software" GPU as shown in the output when debugging.
'Amp2.exe' (Win32): Loaded 'C:\Windows\SysWOW64\d3d11ref.dll'. [...]
GPU Device Created.
I used the available GPU diagnostics tool to look at performance.
I am learning some basic CUDA programming. I am trying to initialize an array on the Host with host_a[i] = i. This array consists of N = 128 integers. I am launching a kernel with 1 block and 128 threads per block, in which I want to square the integer at index i.
My questions are:
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
The expected output for my program is a space-separated list of squares of integers -
1 4 9 16 ... .
What's wrong with my code, since it outputs 1 2 3 4 5 ...
Code:
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include <cuda.h>
const int N = 128;
__global__ void f(int *dev_a) {
unsigned int tid = threadIdx.x;
if(tid < N) {
dev_a[tid] = tid * tid;
}
}
int main(void) {
int host_a[N];
int *dev_a;
cudaMalloc((void**)&dev_a, N * sizeof(int));
for(int i = 0 ; i < N ; i++) {
host_a[i] = i;
}
cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);
f<<<1, N>>>(dev_a);
cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
for(int i = 0 ; i < N ; i++) {
printf("%d ", host_a[i]);
}
}
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
You can use printf in device code (as long as you #include <stdio.h>) on any compute capability 2.0 or higher GPU. Since CUDA 7 and CUDA 7.5 only support those types of GPUs, if you are using CUDA 7 or CUDA 7.5 (successfully) then you can use printf in device code.
What's wrong with my code?
As identified in the comments, there is nothing "wrong" with your code, if run on a properly set up machine. To address your previous question "How do I come to know whether the kernel gets launched or not?", the best approach in my opinion is to use proper cuda error checking, which has numerous benefits besides just telling you whether your kernel launched or not. In this case it would also give a clue as to the failure being an improper CUDA setup on your machine. You can also run CUDA codes with cuda-memcheck as a quick test as to whether any runtime errors are occurring.
Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>
int irand(int top) {
return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}
template<typename T>
class Vector {
T *vec;
const int sz;
public:
Vector(int n) : sz(n) {
vec = new T[sz];
}
~Vector() {
delete [] vec;
}
int size() const { return sz; }
const T & operator [] (int i) const { return vec[i]; }
T & operator [] (int i) { return vec[i]; }
};
int main() {
const int tmax = 20000; // increase this to make it run longer
const int m = 10000;
Vector<int> vec(150);
for (int i=0; i < vec.size(); ++i)
vec[i] = 0;
// main loop
for (int t=0; t < tmax; ++t)
for (int j=0; j < m; ++j) {
int s = irand(100) + 1;
vec[s] += 1;
}
return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
static int c = 0;
return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point was to show that no system libraries are being called explictly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or the compiler generates calls to something else?
class ran1 {
static const int table_len = 32;
static const int int_max = (1u << 31) - 1;
int idum;
int next;
int *shuffle_table;
void propagate() {
const int int_quo = 1277731;
int k = idum/int_quo;
idum = 16807*(idum - k*int_quo) - 2836*k;
if (idum < 0)
idum += int_max;
}
public:
ran1() {
shuffle_table = new int[table_len];
seedrand(54321);
}
~ran1() {
delete [] shuffle_table;
}
void seedrand(int seed) {
idum = seed;
for (int i = table_len-1; i >= 0; i--) {
propagate();
shuffle_table[i] = idum;
}
next = idum;
}
double frand() {
int i = next/(1 + (int_max-1)/table_len);
next = shuffle_table[i];
propagate();
shuffle_table[i] = idum;
return next/(int_max + 1.0);
}
} rng;
int irand(int top) {
return int(std::floor(rng.frand() * top));
}
edit: It turned out that the culprit was floor() and not rand() as I suspected - see
the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
edit I've had a look at the WINE source code, and here is its implementation of rand():
/*********************************************************************
* rand (MSVCRT.#)
*/
int CDECL MSVCRT_rand(void)
{
thread_data_t *data = msvcrt_get_thread_data();
/* this is the algorithm used by MSVC, according to
* http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
data->random_seed = data->random_seed * 214013 + 2531011;
return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions
of a Windows computer by providing alternative implementations of the
DLLs that Windows programs call,[citation needed] and a process to
substitute for the Windows NT kernel. This method of duplication
differs from other methods that might also be considered emulation,
where Windows programs run in a virtual machine.[2] Wine is
predominantly written using black-box testing reverse-engineering, to
avoid copyright issues.
This implies that the developers of wine could replace an api call with anything at all to as long as the end result was the same as you would get with a native windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.