nvcc cuda from command prompt not using gpu - c++

Trying to run a CUDA program from the command prompt using nvcc, but the GPU code does not seem to run as expected. The exact same code builds and runs in Visual Studio and produces the expected output.
nvcc -arch=sm_60 -std=c++11 -o test.exe test.cu
test.exe
Environment:
Windows 10,
NVIDIA Quadro K4200,
CUDA 10.2
Source Code
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iostream>

/* this is the vector addition kernel.
   :inputs: n -> size of vector, integer
            a -> constant multiple, float
            x -> input 'vector', constant pointer to float
            y -> input and output 'vector', pointer to float */
__global__ void saxpy(int n, float a, const float x[], float y[])
{
    int id = threadIdx.x + blockDim.x*blockIdx.x; /* performing that for loop */
    // check that id does not exceed the size of the array
    if (id < n) {
        y[id] += a*x[id];
    }
}

int main()
{
    int N = 256;
    // create device pointers
    float *d_x, *d_y;
    const float a = 2.0f;
    // allocate and initialize memory on host
    std::vector<float> x(N, 1.f);
    std::vector<float> y(N, 1.f);
    // allocate our memory on GPU
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));
    // memory transfer!
    cudaMemcpy(d_x, x.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    // Launch the kernel! In this configuration there is 1 block with 256 threads.
    // Use gridDim = int((N-1)/256) + 1 in general
    saxpy<<<1, 256>>>(N, a, d_x, d_y);
    // transferring memory back!
    cudaMemcpy(y.data(), d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << y[0] << std::endl;
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
Output
1
Expected Output
3
Things I tried
When I first tried to compile with nvcc, I got the same error as discussed here:
Cuda compilation error: class template has already been defined
So I tried the suggested solution
"now: D:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\bin\Hostx64\x64"
and now it compiles and runs, but the output is not as expected.

"Also, -arch=sm_60 is an incorrect arch specification for a Quadro K4200. It should be -arch=sm_30" by Robert Crovella

Related

nvprof - Warning: No profile data collected

On attempting to use nvprof to profile my program, I receive the following output with no other information:
<program output>
======== Warning: No profile data collected.
The code used follows this classic first CUDA program. I have had nvprof work on my system before; however, I recently had to re-install CUDA.
I have attempted to follow the suggestions in this post, which were to include cudaDeviceReset() and cudaProfilerStart()/cudaProfilerStop() and to pass extra profiling flags (nvprof --unified-memory-profiling off), without luck.
This NVIDIA developer forum post seems to run into a similar error; however, the suggestions there pointed to needing a compiler other than nvcc because of an OpenACC library I do not use.
System Specifications
System: Windows 11 x64 using WSL2
CPU: i7 8750H
GPU: GTX 1050 Ti
CUDA Version: 11.8
For completeness, I have included my program code, though I imagine it has more to do with my system:
Compiling:
nvcc add.cu -o add_cuda
Profiling:
nvprof ./add_cuda
add.cu:
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20; // 1M elements
    cudaProfilerStart();
    // Allocate Unified Memory -- accessible from CPU or GPU
    float *x, *y;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    cudaDeviceReset();
    cudaProfilerStop();
    return 0;
}
How can I resolve this to get actual profiling information using nvprof?
As per the documentation, there is currently no profiling support in CUDA for WSL. This is why no profiling data is collected when you run nvprof.
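Not from the answer above, but as a hedged workaround sketch: if kernel timing is all that is needed, CUDA events should still work under WSL and can be placed around the launch in add.cu.

    // Hypothetical timing added around the launch in add.cu (my addition, not the asker's code);
    // reports elapsed kernel time even though nvprof collects nothing under WSL.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    add<<<1, 1>>>(N, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::cout << "add kernel took " << ms << " ms" << std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);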

CUDA kernel returns nothing

I'm using CUDA Toolkit 8 with Visual Studio Community 2015. When I try the simple vector addition from NVIDIA's PDF manual (minus the error checking, since I don't have the *.h files for it), it always comes back with undefined values, which means the output array was never filled. When I pre-fill it with 0's, that's all I get at the end.
Others have had this problem, and some people say it's caused by compiling for the wrong compute capability. However, I am using an NVIDIA GTX 750 Ti, which is supposed to be Compute Capability 5.0. I have tried compiling for Compute Capability 2.0 (the minimum for my SDK) and 5.0.
I also cannot make any of the precompiled examples work, such as vectoradd.exe, which says "Failed to allocate device vector A (error code initialization error)!", and oceanfft.exe, which says "Error unable to find GLSL vertex and fragment shaders!", which doesn't make sense because GLSL and fragment shading are very basic features.
My driver version is 361.43 and other apps such as Blender Cycles in CUDA mode and Stellarium work perfectly.
Here is the code that should work:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#include <algorithm>
#define N 10
__global__ void add(int *a, int *b, int *c) {
int tid = blockIdx.x; // handle the data at this index
if (tid < N)
c[tid] = a[tid] + b[tid];
}
int main(void) {
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));
// fill the arrays 'a' and 'b' on the CPU
for (int i = 0; i<N; i++) {
a[i] = -i;
b[i] = i * i;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy(dev_a, a, N * sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int),cudaMemcpyHostToDevice);
add << <N, 1 >> >(dev_a, dev_b, dev_c);
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy(c, dev_c, N * sizeof(int),cudaMemcpyDeviceToHost);
// display the results
for (int i = 0; i<N; i++) {
printf("%d + %d = %d\n", a[i], b[i], c[i]);
}
// free the memory allocated on the GPU
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
I'm trying to develop CUDA apps so any help would be greatly appreciated.
This was apparently caused by using an incompatible driver version with the CUDA 8 toolkit. Installing the driver distributed with the version 8 toolkit solved the problem.
[Answer assembled from comments and added as a community wiki entry to get the question off the unanswered queue for the CUDA tag]
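For completeness, a sketch of the error checking the question left out (the macro below is an illustrative assumption, not part of the original post). With the incompatible driver, it would have reported the "initialization error" at the first cudaMalloc instead of leaving c undefined.

#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

// Illustrative helper (assumption): print the CUDA error string and abort
// whenever a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage in the program above, e.g.:
// CUDA_CHECK(cudaMalloc((void**)&dev_a, N * sizeof(int)));
// add<<<N, 1>>>(dev_a, dev_b, dev_c);
// CUDA_CHECK(cudaGetLastError());
// CUDA_CHECK(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));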

Calculate matrix determinants with cublas device API

I am trying to evaluate a scalar function f(x), where x is a k-dimensional vector (i.e. f: R^k -> R). During the evaluation, I have to perform many matrix operations: inversion, multiplication, and finding matrix determinants and traces for matrices of moderate size (most of them less than 30x30). Now I want to evaluate the function at many different values of x at the same time by using different threads on the GPU, which is why I need the device API.
I have written the following code to test calculating matrix determinants with the cublas device API, cublasSgetrfBatched, where I first find the LU decomposition of the matrix and then calculate the product of all the diagonal elements of the U matrix. I have done this both on a GPU thread and on the CPU, using the result returned by cublas. But the result from the GPU does not make any sense, while the result on the CPU is correct. I have used cuda-memcheck but found no errors. Could someone help shed some light on this issue? Many thanks.
cat test2.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

__host__ __device__ unsigned int IDX(unsigned int i, unsigned int j, unsigned int ld) { return j*ld + i; }

#define PERR(call) \
    if (call) {\
        fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\
                cudaGetErrorString(cudaGetLastError()));\
        exit(1);\
    }

#define ERRCHECK \
    if (cudaPeekAtLastError()) { \
        fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\
                cudaGetErrorString(cudaGetLastError()));\
        exit(1);\
    }

__device__ float
det_kernel(float *a_copy, unsigned int *n, cublasHandle_t *hdl)
{
    int *info = (int *)malloc(sizeof(int)); info[0] = 0;
    int batch = 1; int *p = (int *)malloc(*n*sizeof(int));
    float **a = (float **)malloc(sizeof(float *));
    *a = a_copy;
    cublasStatus_t status = cublasSgetrfBatched(*hdl, *n, a, *n, p, info, batch);
    unsigned int i1;
    float res = 1;
    for (i1 = 0; i1 < (*n); ++i1) res *= a_copy[IDX(i1, i1, *n)];
    return res;
}

__global__ void runtest(float *a_i, unsigned int n)
{
    cublasHandle_t hdl; cublasCreate_v2(&hdl);
    printf("det on GPU:%f\n", det_kernel(a_i, &n, &hdl));
    cublasDestroy_v2(hdl);
}

int
main(int argc, char **argv)
{
    float a[] = {
        1, 2, 3,
        0, 4, 5,
        1, 0, 0};
    cudaSetDevice(1); // GTX 780 Ti on my machine, 0 for GTX 1080
    unsigned int n = 3, nn = n*n;
    printf("a is \n");
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; j++) printf("%f, ", a[IDX(i, j, n)]);
        printf("\n");
    }
    float *a_d;
    PERR(cudaMalloc((void **)&a_d, nn*sizeof(float)));
    PERR(cudaMemcpy(a_d, a, nn*sizeof(float), cudaMemcpyHostToDevice));
    runtest<<<1, 1>>>(a_d, n);
    cudaDeviceSynchronize();
    ERRCHECK;
    PERR(cudaMemcpy(a, a_d, nn*sizeof(float), cudaMemcpyDeviceToHost));
    float res = 1;
    for (int i = 0; i < n; ++i) res *= a[IDX(i, i, n)];
    printf("det on CPU:%f\n", res);
}
nvcc -arch=sm_35 -rdc=true -o test test2.cu -lcublas_device -lcudadevrt
./test
a is
1.000000, 0.000000, 1.000000,
2.000000, 4.000000, 0.000000,
3.000000, 5.000000, 0.000000,
det on GPU:0.000000
det on CPU:-2.000000
cublas device calls are asynchronous.
That means that they return control to the calling thread before the cublas call is finished.
If you want the calling thread to be able to process the results directly (as you are doing here to compute res), you must force a synchronization to wait for the results, before beginning computation.
You don't see this in the host side computation, because there is implicit synchronization of any device activity (including cublas device dynamic parallelism), before the parent kernel terminates.
So if you add a synchronization after the device cublas call, like this:
cublasStatus_t status=cublasSgetrfBatched(*hdl, *n, a, *n, p, info, batch);
cudaDeviceSynchronize(); // add this line
I think you'll see a match between the device computation and the host computation, as you expect.

C++ class dll with CUDA member?

I have a C++ class-based dll. I'd like to convert some of the class members to CUDA based operation.
I am using VS2012, Windows 7, CUDA 6.5, sm_20.
Say the original SuperProjector.h file looks like this:
class __declspec(dllexport) SuperProjector
{
public:
    SuperProjector() {};
    ~SuperProjector() {};
    void sumVectors(float* c, float* a, float* b, int N);
};
and the original sumVectors() function in SuperProjector.cpp:
void SuperProjector::sumVectors(float* c, float* a, float* b, int N)
{
    for (int n = 0; n < N; n++)
        c[n] = a[n] + b[n];
}
I am stuck on how I should convert sumVector() to CUDA. Specifically:
I read some posts saying that adding the __global__ or __device__ keywords in front of class members will work, but do I then need to change the suffix of the cpp file to .cu?
I also tried to create a CUDA project from the beginning, but it seems VS2012 does not give me the option of creating a dll once I choose to create a CUDA project.
I am very confused about the best way to convert some of the members of this C++ class-based dll into CUDA kernel functions. I would appreciate it if anyone could offer some ideas, or better yet, some very simple examples.
Create a CUDA project, let's call it cudaSuperProjector, and add two files: cudaSuperProjector.cu and cudaSuperProjector.h.
cudaSuperProjector.h
class __declspec(dllexport) cudaSuperProjector {
public:
    cudaSuperProjector() { }
    ~cudaSuperProjector() { }
    void sumVectors(float* c, float* a, float* b, int N);
};
cudaSuperProjector.cu
#include <stdio.h>

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include "cudaSuperProjector.h"

__global__ void addKernel(float *c, const float *a, const float *b) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(float *c, const float *a, const float *b, unsigned int size) {
    float *dev_a = 0;
    float *dev_b = 0;
    float *dev_c = 0;
    cudaError_t cudaStatus;
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(float));
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(float));
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(float));
    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(float), cudaMemcpyHostToDevice);
    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
    // Check for any errors launching the kernel.
    cudaStatus = cudaGetLastError();
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(float), cudaMemcpyDeviceToHost);
    return cudaStatus;
}

void cudaSuperProjector::sumVectors(float* c, float* a, float* b, int N) {
    cudaError_t cudaStatus = addWithCuda(c, a, b, N);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSuperProjector::sumVectors failed!");
    }
}
Note: In properties of file cudaSuperProjector.cu Item Type should be CUDA C/C++.
Go to the project properties and, under General, set Configuration Type to Dynamic Library (.dll). Now everything needed to create the library is ready. Compile this project, and in the output folder you will find cudaSuperProjector.dll and cudaSuperProjector.lib. Create a directory cudaSuperProjector\lib and copy cudaSuperProjector.dll and cudaSuperProjector.lib there. Also create cudaSuperProjector\include and copy cudaSuperProjector.h into it.
Create another Visual C++ project, let's call it SuperProjector. Add file SuperProjector.cpp to the project.
SuperProjector.cpp
#include <stdio.h>

#include "cudaSuperProjector/cudaSuperProjector.h"

int main(int argc, char** argv) {
    float a[6] = { 0, 1, 2, 3, 4, 5 };
    float b[6] = { 1, 2, 3, 4, 5, 6 };
    float c[6] = { };
    cudaSuperProjector csp;
    csp.sumVectors(c, a, b, 6);
    printf("c = {%f, %f, %f, %f, %f, %f}\n",
           c[0], c[1], c[2], c[3], c[4], c[5]);
    return 0;
}
In the project properties, add the path to the dll and lib files under VC++ Directories -> Library Directories (for example D:\cudaSuperProjector\lib;), and add the path to the header under VC++ Directories -> Include Directories (for example D:\cudaSuperProjector\include;). Then go to Linker -> Input and add cudaSuperProjector.lib;.
Now your project should compile fine, but when you run it, it will show you the error
The program can't start because cudaSuperProjector.dll is missing from
your computer. Try reinstalling the program to fix this problem.
You need to copy cudaSuperProjector.dll to the output folder of the project, so it will be under the same folder as SuperProjector.exe. You can do it manually or add
copy D:\cudaSuperProjector\lib\cudaSuperProjector.dll $(SolutionDir)$(Configuration)\
in Build Events -> Post-Build Events -> Command Line,
where $(SolutionDir)$(Configuration)\ is output path for solution (see Configuration Properties -> General -> Output Directory).

Why is my CUDA kernel returning old values?

Kind of almost at the point of ripping my hair out over this issue.
I have a CUDA kernel that does some math on data stored in a 3D array. While testing it, I used to assign some (non-zero) values to the array and observe the results. I have since commented out those lines, but the result is still the same. It is as if it is completely ignoring the fact that I'm doing a memset to 0.
The code works correctly when I step through it in Debug... But not in Release! My guess is I have a memory leak from this matrix.
I allocate this array as:
cudaExtent m_extent = make_cudaExtent(sizeof(float)*matdim.x, matdim.y, matdim.z); // width, height, depth
cudaPitchedPtr m_device;
cudaMalloc3D(&m_device, m_extent);
cudaMemset3D(m_device, 0, m_extent);
I call the kernel in a loop like this:
for (int iter = 0; iter < gpu_iterations; iter++)
{
    PF_iteration_kernel<<<grids,threads>>>(m_device, m_extent, matdim);
    cudaDeviceSynchronize();
}
After which I release the m_device pitched pointer:
cudaFree(m_device.ptr);
matdim is just matrix dimensions held by a dim3.
Within the kernel I do the following (well, I commented everything functional out...):
__global__ void PF_iteration_kernel(cudaPitchedPtr mPtr, cudaExtent mExt, dim3 matrix_dimensions)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // Find location within the pitched memory
    char *m = (char*)mPtr.ptr;
    int sof = sizeof(float);
    size_t pitch = mPtr.pitch;
    size_t slice_pitch = pitch*mExt.height;
    char* m_addroff = m + y * pitch + x * sof;
    printf("m(%d,%d) is %f \n", x, y, *m_addroff); // display the slice
    *m_addroff = 0; // WILL THIS RESET IT?!
    __syncthreads();
}
That should be just showing 0s, but it displays my old values (25, 26, 27, 28, etc).
I have cleaned and re-cleaned and re-built everything several times. I have relaunched the IDE.
My IDE is Visual Studio 2010 with Nsight 4.6 (CUDA 7.0).
I am on Windows 7 x64
Consider this
char* m_addroff = m + y * pitch + x * sof;
printf("m(%d,%d) is %f \n", x, y, *m_addroff);
The compiler will see a char and promote it to an int pushed onto the stack, not the float promoted to double that the format requires.
The compiler does not convert the arguments to fit the format spec, though some compilers will examine the format string and warn of problems.
I suggest you cast the argument. I risk guessing and failing, but something like this:
printf("m(%d,%d) is %f \n", x, y, *(float*)m_addroff);
Here is a simple example.
#include <stdio.h>

int main()
{
    char car[4] = {0};
    char *cptr = car;
    printf("Hello %f\n", *(float*)cptr);
    return 0;
}
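One closing sketch, my addition rather than part of the answer above: the same type mismatch affects the store in the kernel, since *m_addroff = 0; writes only a single byte of the float element. Writing through a cast pointer is presumably what was intended.

// Hypothetical fix for the kernel line in the question (assumption, not from the answer):
// store a float so the whole 4-byte element is reset, not just its first byte.
*(float*)m_addroff = 0.0f;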