nvprof - Warning: No profile data collected - c++

On attempting to use nvprof to profile my program, I receive the following output with no other information:
<program output>
======== Warning: No profile data collected.
The code used follows this classic first CUDA program. I have had nvprof work on my system before; however, I recently had to re-install CUDA.
I have attempted to follow the suggestions in this post, which were to include cudaDeviceReset() and cudaProfilerStart()/cudaProfilerStop() calls and to pass extra profiling flags (nvprof --unified-memory-profiling off), all without luck.
This NVIDIA developer forum post runs into a similar error; however, the suggestions there seemed to point towards needing a different compiler than nvcc because of an OpenACC library that I do not use.
System Specifications
System: Windows 11 x64 using WSL2
CPU: i7 8750H
GPU: GTX 1050 Ti
CUDA Version: 11.8
For completeness, I have included my program code, though I imagine the problem has more to do with my system:
Compiling:
nvcc add.cu -o add_cuda
Profiling:
nvprof ./add_cuda
add.cu:
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20; // 1M elements

    cudaProfilerStart();

    // Allocate Unified Memory -- accessible from CPU or GPU
    float *x, *y;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    cudaDeviceReset();
    cudaProfilerStop();
    return 0;
}
How can I resolve this to get actual profiling information using nvprof?

As per the documentation, there is currently no profiling support in CUDA for WSL. This is why there is no profiling data collected when you are using nvprof.

Related

I followed a CUDA tutorial but my GPU computation time is much longer than my CPU time?

I followed the tutorial on this page but my results are terrible. The measured times (in microseconds) are as follows:
CPU: 569
GPU: 11160
Here is my code. What is going wrong? I can't see why this code is so slow.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <chrono>
#include <iostream>
#include <math.h>
#include <stdio.h>
__global__ void addCUDA(int n, float* x, float* y)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
void add(int n, float* x, float* y)
{
for (int i = 0; i < n; i++)
y[i] = x[i] + y[i];
}
int main()
{
int N = 1 << 20;
float* x = new float[N];
float* y = new float[N];
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
auto t1 = std::chrono::high_resolution_clock::now();
add(N, x, y);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
std::cout << duration << std::endl;
delete[] x;
delete[] y;
float* u,
float* v;
cudaMallocManaged(&u, N * sizeof(float));
cudaMallocManaged(&v, N * sizeof(float));
for (int i = 0; i < N; i++) {
u[i] = 1.0f;
v[i] = 2.0f;
}
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(u, N * sizeof(float), device, NULL);
cudaMemPrefetchAsync(v, N * sizeof(float), device, NULL);
auto t3 = std::chrono::high_resolution_clock::now();
addCUDA<<<numBlocks, blockSize>>> (N, u, v);
cudaDeviceSynchronize();
auto t4 = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(t4 - t3).count();
maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(v[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
std::cout << duration << std::endl;
cudaFree(u);
cudaFree(v);
return 0;
}
For such a trivial operation (a single + on each element), it takes far more time to send the buffers from the host to the GPU and to retrieve the result from the GPU back to the host than to perform the actual computation.
Even if the API is very comfortable and makes buffer accesses look easy and almost magic, the data still has to travel through the PCI-Express bus...
The transfer is asynchronous here, but the computation has to wait for it to complete before actually starting; an asynchronous transfer is only interesting if you have something else to do in the meantime, for example organising the various stages of a complex computation as a pipeline (a minimal sketch of this idea follows this paragraph).
If you try another problem that requires much more computation, the buffer transfers will be amortised.
Moreover, two arrays of 1<<20 floats require only 8 MB and fit in the cache memory of a modern CPU.
So, after the initialisation of these two arrays, they are probably already hot in the cache and cheap to access for the CPU computation.
Because the computation is a perfectly regular loop, a decent optimising compiler will use SIMD instructions, the CPU won't mispredict branches, and the data will be prefetched nicely through the various cache levels; all of this greatly increases CPU efficiency for this kind of computation.
It's not so easy to outperform a modern CPU with a GPU.
It really depends on the size and the complexity of the problem (and on the specific properties of these two pieces of hardware, of course).
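To make the pipeline idea above concrete, here is a minimal sketch. It is my own illustration, not code from the question or the article: it assumes N is a multiple of the chunk count, reuses the addCUDA() kernel from the question, and overlaps the prefetch of one chunk with the kernel working on another by using two CUDA streams.
#include <cuda_runtime.h>

__global__ void addCUDA(int n, float* x, float* y)   // same grid-stride kernel as in the question
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

void pipelinedAdd(float* u, float* v, int N, int device)
{
    const int chunks = 4;
    const int chunk  = N / chunks;                   // assumes N % chunks == 0
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        size_t off = (size_t)c * chunk;
        // Start moving this chunk to the GPU; the other stream may still be computing its own chunk.
        cudaMemPrefetchAsync(u + off, chunk * sizeof(float), device, st);
        cudaMemPrefetchAsync(v + off, chunk * sizeof(float), device, st);
        addCUDA<<<(chunk + 255) / 256, 256, 0, st>>>(chunk, u + off, v + off);
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}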
EDIT
As discussed in the comments, the timing method used in the cited article and the one shown in the question are very different.
In the article, nvprof uses internal counters in the GPU to measure the time spent actively computing the addCUDA() function (add() in the article), without counting the time it takes to bring the two source buffers over from the host or to send the resulting buffer back to the host.
Of course it's fast! On most modern hardware (CPU or GPU), most of the time is spent accessing/transferring data rather than computing. If we measured only the time our CPU spends performing the additions, ignoring the time spent fetching/writing data from/to cache/memory, it would not be very long either!
(Note that the CPU code in the article is not even compiled with optimisation turned on; do such timings have any meaning?)
In the code shown in the question, the timing method is quite different but much more relevant in my opinion.
The two calls to std::chrono::high_resolution_clock::now() actually consider the time spent doing all the work: sending the two source buffers, computing on them and fetching the resulting buffer.
It's the only duration that matters after all!
This way, it is fair to compare this duration to the one we obtain (with a similar method) when timing the CPU.
The fact that cudaMemPrefetchAsync() is used can be misleading because we could think that the transfer of the source buffers is excluded from the timings: it is not, and that's why we find the result disappointing compared to the article.
We launch the timer right after these two calls in order to measure the time spent in the computation, but the computation has to wait for these transfers to complete before actually starting (I would even have started the timer before these two calls).
Moreover, the call to cudaDeviceSynchronize() before stopping the timer waits for the transfer of the resulting buffer to complete in order to actually make the result available to the host.
If we used cudaDeviceSynchronize() before starting the timer, we could have excluded the two initial transfers from the timing, but what's the point of such a timing?
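For illustration, here is a minimal sketch of that kernel-only timing (my addition, not part of the original answer); it reuses the names from the code in the question and is roughly what would reproduce the article's number:
// Make sure the prefetches (and anything else still pending) have finished,
// so that the timed region contains only the kernel itself.
cudaDeviceSynchronize();

auto t3 = std::chrono::high_resolution_clock::now();
addCUDA<<<numBlocks, blockSize>>>(N, u, v);
cudaDeviceSynchronize();                              // wait for the kernel to finish
auto t4 = std::chrono::high_resolution_clock::now();
auto kernelOnly = std::chrono::duration_cast<std::chrono::microseconds>(t4 - t3).count();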
In conclusion, I think the timing method you used in your question is much better than the one promoted in the article since you can really compare the benefit you obtain (or not!) from one technology over the other.
For information, on my computers, with full optimisation turned on, your code gives these results:
CPU:  809 µs on Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
GPU: 1160 µs on NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
CPU:  157 µs on Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
GPU: 1158 µs on NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
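(The answer does not show the exact build commands; a typical optimised build might look like the one below. The file name is hypothetical, and nvcc's -O3 applies to the host-code compilation, which is what matters for the CPU loop; device code is already optimised by default.)
nvcc -O3 main.cu -o main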

nvcc cuda from command prompt not using gpu

Trying to run a CUDA program from command prompt using nvcc, but it seems like GPU code is not running as expected. The exact same code runs successfully on Visual Studio and outputs the expected output.
nvcc -arch=sm_60 -std=c++11 test.cu -o test.exe
test.exe
Environment:
Windows 10,
NVIDIA Quadro k4200,
CUDA 10.2
Source Code
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iostream>

/* this is the vector addition kernel.
   :inputs: n -> Size of vector, integer
            a -> constant multiple, float
            x -> input 'vector', constant pointer to float
            y -> input and output 'vector', pointer to float */
__global__ void saxpy(int n, float a, const float x[], float y[])
{
    int id = threadIdx.x + blockDim.x*blockIdx.x; /* Performing that for loop */
    // check to see if id is greater than size of array
    if(id < n){
        y[id] += a*x[id];
    }
}

int main()
{
    int N = 256;

    //create pointers and device
    float *d_x, *d_y;
    const float a = 2.0f;

    //allocate and initializing memory on host
    std::vector<float> x(N, 1.f);
    std::vector<float> y(N, 1.f);

    //allocate our memory on GPU
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));

    //Memory Transfer!
    cudaMemcpy(d_x, x.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), N*sizeof(float), cudaMemcpyHostToDevice);

    //Launch the Kernel! In this configuration there is 1 block with 256 threads
    //Use gridDim = int((N-1)/256) in general
    saxpy<<<1, 256>>>(N, a, d_x, d_y);

    //Transfering Memory back!
    cudaMemcpy(y.data(), d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << y[0] << std::endl;

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
Output
1
Expected Output
3
Things I tried
When I first tried to compile with nvcc, it had the same error as discussed here.
Cuda compilation error: class template has already been defined
So I tried the suggested solution
"now: D:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\bin\Hostx64\x64"
and now it compiles and runs but the output is not as expected.
"Also, -arch=sm_60 is an incorrect arch specification for a Quadro K4200. It should be -arch=sm_30" by Robert Crovella

OpenCL GPU Programming with Intel HD Graphics 4000

I have been trying to implement a simple parallel algorithm using the OpenCL C++ bindings (version 1.2).
Roughly, here is the C code (no OpenCL):
typedef struct coord{
    double _x;
    double _y;
    double _z;
} __coord;

typedef struct node{
    __coord _coord;
    double _dist;
} __node;

double input[3] = {-1.0, -2, 3.5};
// nodeVector1D is a 1-D random array of struct __node
// nodeVectorSize is the size of the above array (> 1,000)
double d = 0.0;
for(int i = 0; i < nodeVectorSize; i++){
    __node n = nodeVector1D[i];
    d += (input[0] - n._coord._x)*(input[0] - n._coord._x);
    d += (input[1] - n._coord._y)*(input[1] - n._coord._y);
    d += (input[2] - n._coord._z)*(input[2] - n._coord._z);
    n._dist = d;
}
I use a MacBook Pro 13" Late 2013, running Mac OS X Lion.
OpenCL only detects the CPU.
The CPU, an Intel Ivy Bridge i5 at 2.6 GHz, has an integrated GPU with 1 GB of memory at 1.6 GHz (Intel HD Graphics 4000).
The maximum detected work group size is 1024.
When I run the flat code above (with 1024 nodes), it takes around 17 microseconds.
When I run its parallel version using the OpenCL C++ library, it takes 10 times as long, around 87 microseconds
(excluding the program creation, buffer allocation and writing).
What am I doing wrong here?
NB: the OpenCL kernel for this algorithm is obvious to guess, but I can post it if needed.
Thanks in advance.
EDIT N#1: THE KERNEL CODE
__kernel void _computeDist(
__global void* nodeVector1D,
const unsigned int nodeVectorSize,
const unsigned int itemsize,
__global const double* input){
double d = 0.;
int i,c;
double* n;
i = get_global_id(0);
if (i >= nodeVectorSize) return;
n = (double*)(nodeVector1D + i*itemsize);
for (c=0; c<3;c++){
d += (input[c] - n[c])*(input[c] - n[c]);
}
n[3] = d;
}
Sorry for the void pointer arithmetic, but it works (no seg fault).
I can also post the OpenCL initialization routine, but I think it's all over the Internet. However, I will post it, if someone asks.
@pmdj: As I said above, OpenCL recognizes my CPU, otherwise I wouldn't have been able to run the tests and get the performance results presented above.
@pmdj: OpenCL kernel code, to my knowledge, is always written in C. However, I tagged C++ because (as I said above) I'm using the OpenCL C++ bindings.
I finally found the issue.
The problem was that OpenCL on Mac OS X reports a wrong maximum device work group size of 1024.
I tested various work group sizes and ended up with a 200% performance gain when using a work group size of 128 work items per group.
Here is a clearer benchmark picture (IGPU stands for Integrated GPU; X-axis: array size, Y-axis: duration in microseconds).
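For reference, a minimal sketch of how the fix looks with the OpenCL C++ bindings; the queue and kernel objects and the helper function are assumptions, and the only point is the explicit local size of 128 instead of relying on the reported maximum:
#include <CL/cl.hpp>   // OpenCL 1.2 C++ bindings

// Assumes a command queue and the _computeDist kernel have already been set up as usual.
void runComputeDist(cl::CommandQueue& queue, cl::Kernel& kernel, size_t nodeVectorSize)
{
    const size_t local  = 128;                                            // 128 work items per group
    const size_t global = ((nodeVectorSize + local - 1) / local) * local; // round up; the kernel bounds-checks i

    queue.enqueueNDRangeKernel(kernel,
                               cl::NullRange,        // no offset
                               cl::NDRange(global),  // global size
                               cl::NDRange(local));  // explicit local size
    queue.finish();
}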

CUDA kernel returns nothing

I'm using CUDA Toolkit 8 with Visual Studio Community 2015. When I try simple vector addition from NVidia's PDF manual (minus error checking which I don't have the *.h's for) it always comes back as undefined values, which means the output array was never filled. When I pre-fill it with 0's, that's all I get at the end.
Others have had this problem and some people are saying it's caused by compiling for the wrong compute capability. However, I am using an NVidia GTX 750 Ti, which is supposed to be Compute Capability 5. I have tried compiling for Compute Capability 2.0 (the minimum for my SDK) and 5.0.
I also cannot make any of the precompiled examples work, such as vectoradd.exe which says, "Failed to allocate device vector A (error code initialization error)!" And oceanfft.exe says, "Error unable to find GLSL vertex and fragment shaders!" which doesn't make sense because GLSL and fragment shading are very basic features.
My driver version is 361.43 and other apps such as Blender Cycles in CUDA mode and Stellarium work perfectly.
Here is the code that should work:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#include <algorithm>
#define N 10
__global__ void add(int *a, int *b, int *c) {
int tid = blockIdx.x; // handle the data at this index
if (tid < N)
c[tid] = a[tid] + b[tid];
}
int main(void) {
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));
// fill the arrays 'a' and 'b' on the CPU
for (int i = 0; i<N; i++) {
a[i] = -i;
b[i] = i * i;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy(dev_a, a, N * sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int),cudaMemcpyHostToDevice);
add << <N, 1 >> >(dev_a, dev_b, dev_c);
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy(c, dev_c, N * sizeof(int),cudaMemcpyDeviceToHost);
// display the results
for (int i = 0; i<N; i++) {
printf("%d + %d = %d\n", a[i], b[i], c[i]);
}
// free the memory allocated on the GPU
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
I'm trying to develop CUDA apps so any help would be greatly appreciated.
This was apparently caused by using an incompatible driver version with the CUDA 8 toolkit. Installing the driver distributed with the version 8 toolkit solved the problem.
[Answer assembled from comments and added as a community wiki entry to get the question off the unanswered queue for the CUDA tag]
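As an aside (my addition, not part of the community-wiki answer): the error checking that the question leaves out is exactly what makes this kind of failure visible at the call site. A minimal stand-in for the omitted helper might look like this:
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

// Abort with a readable message on any CUDA runtime error.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMalloc((void**)&dev_a, N * sizeof(int)));
//   add<<<N, 1>>>(dev_a, dev_b, dev_c);
//   CUDA_CHECK(cudaGetLastError());        // reports launch failures
//   CUDA_CHECK(cudaDeviceSynchronize());   // reports errors during kernel execution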

Why can't I get a working 2-D FFT under Visual Studio 2013 using FFTW or AMPFFT?

I've been working with 2D FFTs in a project of mine, and have been unable to get correct results using two different FFT libraries. At first I assumed I was using them wrong, but upon comparing against MATLAB and Linux GCC reference implementations, it now seems something sinister is going on with my compiler (MSVC 2013 express).
My test case is as follows:
256x256 complex to real IFFT, with a single bin at 255 (0,255 for X,Y notation) set to 10000.
Using AMPFFT, I get the following 2D transform:
And with FFTW, I get the following 2D transform:
As you can see, the AMPFFT version is sort of "almost" correct, but has a weird per-sample banding in it, and the FFTW version is just all over the place and out to lunch.
I took the output of the two test versions and compared them to MATLAB (technically Octave, which uses FFTW under the hood). I also ran the same test case for FFTW under Linux with GCC. Here is a slice from that set of tests of the 127th row (the row number technically doesn't matter, since with my choice of bins all rows should be identical):
In this example, the Octave and Linux implementations represent the correct result and follow the red line (Octave is plotted in black, Linux in red, and they agree completely). The FFTW output under MSVC is plotted in blue, and the AMPFFT output in magenta. As you can see, the AMPFFT version again seems almost close but has a weird high-frequency ripple in it, while the FFTW output under MSVC is just a mess, with a weird "packeted" look to it.
At this stage I can only point the finger at Visual Studio, but I have no idea what's going on or how to fix it.
Here are my two test programs:
FFTW test:
//fftwtest.cpp
//2 dimensional complex-to-real inverse FFT test.
//Produces a 256 x 256 real-valued matrix that is loadable by octave/MATLAB
#include <fstream>
#include <iostream>
#include <complex>
#include <fftw3.h>

int main(int argc, char** argv)
{
    int FFTSIZE = 256;
    std::complex<double>* cpxArray;
    std::complex<double>* fftOut;
    // cpxArray = new std::complex<double>[FFTSIZE * FFTSIZE];
    // fftOut = new double[FFTSIZE * FFTSIZE];
    fftOut = (std::complex<double>*)fftw_alloc_complex(FFTSIZE * FFTSIZE);
    cpxArray = (std::complex<double>*)fftw_alloc_complex(FFTSIZE * FFTSIZE);

    for(int i = 0; i < FFTSIZE * FFTSIZE; i++) cpxArray[i] = 0;
    cpxArray[255] = std::complex<double>(10000, 0);

    fftw_plan p = fftw_plan_dft_2d(FFTSIZE, FFTSIZE,
                                   (fftw_complex*)cpxArray, (fftw_complex*)fftOut,
                                   FFTW_BACKWARD, FFTW_DESTROY_INPUT | FFTW_ESTIMATE);
    fftw_execute(p);

    std::ofstream debugDump("debugdumpFFTW.txt");
    for(int j = 0; j < FFTSIZE; j++)
    {
        for(int i = 0; i < FFTSIZE; i++)
        {
            debugDump << " " << fftOut[j * FFTSIZE + i].real();
        }
        debugDump << std::endl;
    }
    debugDump.close();
}
AMPFFT test:
//ampffttest.cpp
//2 dimensional complex-to-real inverse FFT test.
//Produces a 256 x 256 real-valued matrix that is loadable by octave/MATLAB
#include <amp_fft.h>
#include <fstream>
#include <iostream>

int main(int argc, char** argv)
{
    int FFTSIZE = 256;
    std::complex<float>* cpxArray;
    float* fftOut;
    cpxArray = new std::complex<float>[FFTSIZE * FFTSIZE];
    fftOut = new float[FFTSIZE * FFTSIZE];

    for(size_t i = 0; i < FFTSIZE * FFTSIZE; i++) cpxArray[i] = 0;
    cpxArray[255] = std::complex<float>(10000, 0);

    concurrency::extent<2> e(FFTSIZE, FFTSIZE);
    std::cout << "E[0]: " << e[0] << " E[1]: " << e[1] << std::endl;

    fft<float, 2> m_fft(e);
    concurrency::array<float, 2> outpArray(concurrency::extent<2>(FFTSIZE, FFTSIZE));
    concurrency::array<std::complex<float>, 2> inpArray(concurrency::extent<2>(FFTSIZE, FFTSIZE), cpxArray);
    m_fft.inverse_transform(inpArray, outpArray);

    std::vector<float> outVec = outpArray;
    std::copy(outVec.begin(), outVec.end(), fftOut);

    std::ofstream debugDump("debugdump.txt");
    for(int j = 0; j < FFTSIZE; j++)
    {
        for(int i = 0; i < FFTSIZE; i++)
        {
            debugDump << " " << fftOut[j * FFTSIZE + i];
        }
        debugDump << std::endl;
    }
}
Both of these were compiled with stock settings on MSVC 2013 for a win32 console app, and the FFTW test was also run under Centos 6.4 with GCC 4.4.7. Both FFTW tests use FFTW version 3.3.4, and as an aside both complex to real and complex to complex plans were tested (with identical results).
Does anyone have even the slightest clue on Visual Studio compiler settings I could try to fix this problem?
Looking at the MSVC FFTW output in blue, it appears to be a number of sines multiplied. That is to say, there's a sine with period 64 or so, and a sine with period 4, and possibly yet another one with a similar frequency to produce a beat.
That basically means the MSVC version had at least two non-zero inputs. I suspect the cause is the typecasting, as you write a fftw_complex object through a std::complex<double> type.
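One way to test that suspicion (a sketch of mine, not from the original answer) is to fill the input through FFTW's own fftw_complex type, which is simply double[2], so no std::complex cast is involved at all:
#include <fftw3.h>
#include <cstring>

int main()
{
    const int FFTSIZE = 256;

    // fftw_complex is double[2]: [0] is the real part, [1] the imaginary part.
    fftw_complex* in  = fftw_alloc_complex(FFTSIZE * FFTSIZE);
    fftw_complex* out = fftw_alloc_complex(FFTSIZE * FFTSIZE);

    std::memset(in, 0, sizeof(fftw_complex) * FFTSIZE * FFTSIZE);
    in[255][0] = 10000.0;   // same single bin as the original test case
    in[255][1] = 0.0;

    fftw_plan p = fftw_plan_dft_2d(FFTSIZE, FFTSIZE, in, out,
                                   FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(p);

    // ... dump out[j * FFTSIZE + i][0] (the real part) exactly as in the original program ...

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}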