I am trying to compile and run the following program called test.cu:
#include <iostream>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
// Kernel function to add the elements of two arrays
__global__
void add(int n, float* x, float* y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
int main(void)
{
    int N = 1 << 20;
    float* x, * y;
    // Allocate Unified Memory - accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }
    // Run kernel on 1M elements on the GPU
    add <<<1, 256>>> (N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    for (int i = 0; i < 10; i++)
        std::cout << y[i] << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    return 0;
}
I am using Visual Studio Community 2019, and it marks the "add <<<1, 256>>> (N, x, y);" line with an "expected an expression" error. The project nevertheless compiles without errors, but when I run the .exe it outputs a bunch of "1"s instead of the expected "3"s.
I also tried compiling with "nvcc test.cu". Initially it failed with "nvcc fatal : Cannot find compiler 'cl.exe' in PATH", so I added "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\Hostx64\x64" to PATH, and now compiling with nvcc gives the same result as compiling with Visual Studio.
In both cases the program never enters the "add" kernel.
I am fairly sure the code is right and the problem has something to do with the installation, but I have already tried reinstalling the CUDA toolkit and repairing Visual Studio, and it didn't help.
The kernel.cu example that Visual Studio generates when starting a new CUDA project also didn't work: running it output "No kernel image available for execution on the device".
How can I solve this?
The nvcc version, if that helps:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:35_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28845127_0
Visual Studio provides IntelliSense for C++. In C++, proper parsing of angle brackets is troublesome: < serves both as less-than and for templates, and << as the shift operator. So the fact is that the folks at NVIDIA chose the worst possible delimiter with <<<>>>, which makes it difficult for IntelliSense to work properly. The way to get full IntelliSense in CUDA is to switch from the Runtime API to the Driver API. The C++ is just C++, and the CUDA is still (sort of) C++; there is no <<<>>> badness for the language parser to have to work around.
You could take a look at the difference between the matrixMul and matrixMulDrv samples. The <<<>>> syntax is handled by the compiler essentially by emitting code that makes the equivalent Driver API calls. You'll link against cuda.lib instead of cudart.lib, and you may have to deal with a "mixed mode" program if you use CUDA-RT-only libraries. You could refer to this link for more information.
Also, this link describes how to add IntelliSense for CUDA in VS.
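For reference, here is a minimal sketch of the same launch through the Driver API. It assumes the kernel from the question has been compiled separately to PTX, e.g. with "nvcc -ptx test.cu" producing test.ptx; the file name and the mangled symbol name are illustrative, and CUresult error checking is omitted for brevity:
#include <cuda.h>    // Driver API; link against cuda.lib instead of cudart.lib
#include <cstdio>
int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);
    // Load the PTX and look up the kernel; declare the kernel extern "C"
    // if you'd rather look it up by the unmangled name "add"
    CUmodule mod;
    cuModuleLoad(&mod, "test.ptx");
    CUfunction add;
    cuModuleGetFunction(&add, mod, "_Z3addiPfS_");   // add(int, float*, float*)
    int n = 1 << 20;
    CUdeviceptr x, y;
    cuMemAllocManaged(&x, n * sizeof(float), CU_MEM_ATTACH_GLOBAL);
    cuMemAllocManaged(&y, n * sizeof(float), CU_MEM_ATTACH_GLOBAL);
    for (int i = 0; i < n; i++) {
        ((float*)x)[i] = 2.0f;
        ((float*)y)[i] = 1.0f;
    }
    // Equivalent of add <<<1, 256>>> (n, x, y), with no <<<>>> to upset the parser
    void* args[] = { &n, &x, &y };
    cuLaunchKernel(add, 1, 1, 1,     // grid dimensions
                        256, 1, 1,   // block dimensions
                        0, nullptr, args, nullptr);
    cuCtxSynchronize();
    printf("y[0] = %f\n", ((float*)y)[0]);   // expect 3.0
    cuMemFree(x);
    cuMemFree(y);
    cuCtxDestroy(ctx);
    return 0;
}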
Related
I am currently working with OpenMP offloading using LLVM/clang-16 (built from the GitHub repository). Using clang's built-in profiling tools (environment variables such as LIBOMPTARGET_PROFILE=profile.json and LIBOMPTARGET_INFO), I was able to confirm that my code is executed on my GPU, but when I try to profile the code using nvprof or ncu (from the NVIDIA Nsight tool suite), I get an error/warning stating that the profiler did not detect any kernel launches:
> ncu ./saxpy
Time of kernel: 0.000004
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
This is my test code:
#include <iostream>
#include <omp.h>
#include <cstdlib>
#include <cstdio>   // for printf
void saxpy(float a, float* x, float* y, int sz) {
    double t = 0.0;
    double tb, te;
    tb = omp_get_wtime();
    // Note: a combined "distribute parallel for" construct must be followed
    // directly by the loop, so there is no extra brace block around it
    #pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
    for (int i = 0; i < sz; i++) {
        y[i] = a * x[i] + y[i];
    }
    te = omp_get_wtime();
    t = te - tb;
    printf("Time of kernel: %lf\n", t);
}
int main() {
    auto x = (float*) malloc(1000 * sizeof(float));
    auto y = (float*) calloc(1000, sizeof(float));
    for (int i = 0; i < 1000; i++) {
        x[i] = i;
    }
    saxpy(42, x, y, 1000);
    return 0;
}
Compiled using the following command:
> clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda main.cpp -o saxpy --cuda-path=/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/10.2 --offload-arch=sm_61 -fopenmp-offload-mandatory
What do I need to do to enable profiling? I have seen others use ncu on clang-compiled OpenMP offloading code without additional steps, but maybe I am completely missing something.
By looking at the debug output generated when the program is executed with LIBOMPTARGET_DEBUG=1, and after receiving help from other forums, I was able to fix this issue: the program cannot find the necessary files of the OpenMP CUDA runtime library whenever it is started through ncu (or nsys).
A workaround is to add the path to those libraries to the LD_LIBRARY_PATH environment variable (e.g. export LD_LIBRARY_PATH=/opt/llvm/lib:$LD_LIBRARY_PATH).
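Putting the workaround together with the --target-processes hint from the warning above, the invocation looks something like this (the /opt/llvm/lib path is an example; use whatever prefix your clang-16 build installed the libomptarget libraries under):
export LD_LIBRARY_PATH=/opt/llvm/lib:$LD_LIBRARY_PATH
ncu --target-processes all ./saxpy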
NVIDIA is now aware of this problem and is "looking into why that is the case".
I am using Visual Studio 2015 in 32-bit mode to compile a DLL (VSDll) which calls functions written in another DLL (a MATLAB function exported as a DLL by the MATLAB C++ compiler). This VSDll is then called by an .exe file. I can successfully compile the code into debug and release DLLs, and I can successfully run the release DLL from the .exe file; it runs without any problem. But when I try to run the code in debug mode in Visual Studio, I get the error shown in the attached "Error message" screenshot.
The attached "Stack frame and registers" figure has the register information and the stack frame where the error occurs; the stack trace information is attached as "Stack trace".
This might be a silly error, but I am new to C++ and I don't have a very deep understanding of memory heaps and stacks. I tried enabling/disabling various debug-mode settings suggested in other answers that have worked for others, but nothing works in my case. I was using the Visual Studio Professional edition before and could debug without this error; recently I had to change to the Visual Studio Community edition, and since then I have this problem, but only when I set breakpoints in my code and debug it. Could this be the problem? Another thing I have noticed: I am using Visual Studio to compile various MATLAB functions as DLLs and use them to build customized DLLs to run in TRNExe. Each of them works fine in release mode, but in debug mode it is always the same boost_log-vc110-mt-1_49.dll that breaks, and always at the same memory register, 0x7e37a348.
Can anyone please help me solve this error? I appreciate any ideas or suggestions regarding what the problem could be.
Please also see the attached "Warning message" screenshot. Could this be the source of the problem?
I have produced a minimal reproducible example below, although it is still quite a few lines of code. One complication is that I am co-simulating between TRNSYS and MATLAB: TRNSYS has its components written as Type DLLs (Fortran or C++) and prescribes a fixed structure for them. I export the MATLAB function as a DLL, call it at the required points in the TRNSYS Type, and compile the whole thing as a DLL to run in TRNSYS Simulation Studio. So the C++ code links the MATLAB DLL against TRNDll, the TRNSYS kernel library, and I don't know how to reduce that to a minimal reproducible example. In any case, I have compiled a MATLAB function that adds two numbers, called it in Type201.cpp (respecting the TRNSYS structure), and compiled that into a DLL to run in TRNSYS Studio. It works there, but when I run Type201.cpp in debug mode I still get exactly the same error as before, at the exact same stack frame and register, in boost_log-vc110-mt-1_49.dll.
The Type201.cpp code is below:
extern "C" __declspec(dllexport) void TYPE201(void)
{
double a;
double b;
double Timestep, Time, StartTime, StopTime;
int index, CurrentUnit, CurrentType;
mxArray* x_ptr ;
mxArray* y_ptr ;
mxArray* z_ptr = NULL;
double* output = NULL;
//Get the Global Trnsys Simulation Variables
Time = getSimulationTime();
Timestep = getSimulationTimeStep();
CurrentUnit = getCurrentUnit();
CurrentType = getCurrentType();
StartTime = getSimulationStartTime();
StopTime = getSimulationStopTime();
//Set the Version Number for This Type
if (getIsVersionSigningTime())
{
int v = 17;
setTypeVersion(&v);
return;
}
//Do All of the Last Call Manipulations Here
if (getIsLastCallofSimulation())
{
addlibTerminate();
mclTerminateApplication();
return;
}
//Perform Any "End of Timestep" Manipulations That May Be Required
//Do All of the "Very First Call of the Simulation Manipulations" Here
if (getIsFirstCallofSimulation())
{
//Tell the TRNSYS Engine How This Type Works
int npar = 2;
int nin = 2;
int nder = 0;
int nout = 1;
int mode = 1;
int staticStore = 0;
int dynamicStore =1;
setNumberofParameters(&npar);
setNumberofInputs(&nin);
setNumberofDerivatives(&nder);
setNumberofOutputs(&nout);
setIterationMode(&mode);
setNumberStoredVariables(&staticStore, &dynamicStore);
return;
}
//Do All of the "Start Time" Manipulations Here - There Are No Iterations at the Intial Time
if (getIsStartTime())
{
index = 1; a = getInputValue(&index);
index = 2; b = getInputValue(&index);
if (!mclInitializeApplication(NULL, 0))
{
//fprintf(stderr, "Could not initialize the application.\n");
exit(1);
}
if (!addlibInitialize())
{
//fprintf(stderr, "Could not initialize the library.\n");
exit(1);
}
//Read in the Values of the Inputs from the Input File
return;
}
if (getIsEndOfTimestep())
{
return;
}
//---------------------------------------------------------------------------------------------------------------------- -
//ReRead the Parameters if Another Unit of This Type Has Been Called Last
//Get the Current Inputs to the Model
index = 1;
a = getInputValue(&index);
index = 2;
b = getInputValue(&index);
int noutput = getNumberOfOutputs();
//Create an mxArray to input into mlfAdd
x_ptr = mxCreateDoubleMatrix(1, 1, mxREAL);
y_ptr = mxCreateDoubleMatrix(1, 1, mxREAL);
memcpy(mxGetPr(x_ptr), &a, 1 * sizeof(double));
memcpy(mxGetPr(y_ptr), &b, 1 * sizeof(double));
mlfAdd(1, &z_ptr, y_ptr, x_ptr);
output = mxGetPr(z_ptr);
index = 1;
setOutputValue(&index, output);
return;
}
Then this is the MATLAB function that adds the two numbers:
function [s] = add(a,b)
    s = a + b;
end
I compiled this into a DLL via the command line using
mcc -B csharedlib:addlib add.m
This is likely because some std and Boost classes have different layouts in release and debug builds. In debug builds, additional members are inserted via macros to enable a better debugging experience, such as iterator debugging. So, as documented by Microsoft, it is undefined behavior to mix code linked against different C++ runtimes. The MATLAB DLL is built in release mode and linked against the release C++ runtime, so it shouldn't be used by code linked against the debug libraries!
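If you still need to step through your own code, one common workaround (a sketch, not guaranteed for every setup) is to build a configuration that keeps debug symbols but links against the release C++ runtime, so the object layout matches the MATLAB-built DLL:
Project Properties -> C/C++ -> Code Generation -> Runtime Library:
    Multi-threaded DLL (/MD)    (instead of Multi-threaded Debug DLL, /MDd)
Project Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions:
    _ITERATOR_DEBUG_LEVEL=0     (makes the std container layout explicit and release-compatible)
You lose checked iterators and some heap diagnostics, but breakpoints and stepping still work as long as optimization is left off.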
Operating System: CentOS 7
Cuda Toolkit Version: 11.0
Nvidia Driver and GPU Info:
NVIDIA-SMI 450.51.05
Driver Version: 450.51.05
CUDA Version: 11.0
GPU: Quadro M2000M
(screenshot of nvidia-smi details attached)
I'm very new to CUDA programming, so any guidance is extremely appreciated. I have a very simple CUDA C++ program that computes the sum of two arrays in unified memory on the GPU. However, it appears that the kernel fails to launch due to a cudaErrorNoKernelImageForDevice error. The code is below:
#include <iostream>
#include <math.h>
#include <cuda_runtime_api.h>
using namespace std;
__global__
void add(int n, float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
int main() {
    cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
    int N = 1 << 20;
    float* x, * y;
    cudaMallocManaged((void**)&x, N * sizeof(float));
    cudaMallocManaged((void**)&y, N * sizeof(float));
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    add<<<1, 1>>>(N, x, y);
    cudaGetLastError();
    /**
     * This indicates that there is no kernel image available that is suitable
     * for the device. This can occur when a user specifies code generation
     * options for a particular CUDA source file that do not include the
     * corresponding device configuration.
     *
     * cudaErrorNoKernelImageForDevice = 209,
     */
    cudaDeviceSynchronize();
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    }
    cudaFree(x);
    cudaFree(y);
    return 0;
}
The error here arises because a CUDA kernel must be compiled in such a way that the resulting code (PTX or SASS) is compatible with the GPU it is run on. This is a topic with a lot of nuance, so please refer to questions like this (and the links there) for additional background.
The GPU architecture, when we want to be precise, is referred to as the compute capability. You can discover the compute capability of your GPU either with a Google search or by running the deviceQuery CUDA sample code. The compute capability is expressed as (major).(minor), so something like compute capability 5.2, or 7.0, etc.
When compiling code, it's necessary to specify a compute capability (if you don't, a default compute capability is implied). If the compute capability you specify when compiling matches your GPU, everything should be fine. However, newer/higher compute capability code will generally not run on older/lower compute capability GPUs. In that case, you will see errors like the one you describe:
cudaErrorNoKernelImageForDevice = 209 ("no binary for GPU")
or similar. You may also see no explicit error at all if you are not doing proper CUDA error checking. The solution is to match the compute capability specified at compile time with the GPU you intend to run on. The method to do this will vary depending on the toolchain/IDE you are using. For basic nvcc command line usage:
nvcc -arch=sm_XY ...
will specify a compute capability of X.Y
For Eclipse, Nsight Eclipse Edition, or Nsight Visual Studio Edition, the compute capability can be specified in the project properties. Depending on the tool, it may be expressed as switch values (e.g. compute_XY, sm_XY) or numerically as X.Y.
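For example, the Quadro M2000M from the question is a Maxwell-generation part (compute capability 5.0, if I recall correctly; deviceQuery will confirm), so something like
nvcc -arch=sm_50 test.cu -o test
should produce a matching kernel image. Separately, note that the code above calls cudaGetLastError() without inspecting the result. A minimal checking pattern (just a sketch) that makes this class of failure visible instead of silent:
add<<<1, 1>>>(N, x, y);
cudaError_t err = cudaGetLastError();   // picks up launch failures such as 209
if (err != cudaSuccess)
    cout << "launch failed: " << cudaGetErrorString(err) << endl;
err = cudaDeviceSynchronize();          // picks up errors during kernel execution
if (err != cudaSuccess)
    cout << "kernel failed: " << cudaGetErrorString(err) << endl;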
I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below, it shows me that it can see the 8 GPUs, but when I try to offload, it says the code is still running on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>
#define n 10000
#define m 10000
using namespace std;
int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();
    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;
    for (int iter = 0; iter < iter_max; iter++) {
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j) {
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }
    if (target[0]) {
        cout << "not on GPU" << endl;
    } else {
        cout << "On GPU" << endl;
    }
    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018, it just didn't work. At that time, target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with NVIDIA's PGI compiler. I find PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free, and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC 7 and get hold of GCC 8 or GCC 9. GPU offloading is a fast-moving area, and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) clause in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
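A minimal probe (a sketch; it assumes the 8 reported GPUs are devices 0 through 7, with the host as the initial device) would be:
#include <omp.h>
#include <cstdio>
int main() {
    int onHost = 1;
    // Request device 0 explicitly; map the flag back so the host can read it
    #pragma omp target device(0) map(tofrom: onHost)
    {
        onHost = omp_is_initial_device();
    }
    printf(onHost ? "not on GPU\n" : "On GPU\n");
    return 0;
}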
I am trying to implement a DFT using the FFTW library (link to FFTW documentation).
All the libraries have been correctly linked, and the project builds just fine. However, the program stops running the moment any function from the FFTW library is called.
#include <iostream>
#include <fftw3.h>
using namespace std;
int main() {
    int vectorSize = 100;
    cout << vectorSize << endl;
    fftw_complex vec[vectorSize], vecOut[vectorSize];
    for (int i = 0; i < vectorSize; i++) {
        vec[i][0] = i;
        vec[i][1] = 1;
    }
    // Call to function to create an FFT plan
    fftw_plan plan = fftw_plan_dft_1d(vectorSize, vec, vecOut, FFTW_FORWARD, FFTW_ESTIMATE);
    cout << "test" << endl;
    return 0;
}
If I comment out the line where the fftw_plan is instantiated, the code outputs 100 and "test" as expected. There are no issues in the build, as far as I can tell, and I haven't been able to find any post describing a similar problem.
I am running this in Eclipse, using MinGW and the 32-bit version of the pre-compiled binary available for Windows (download link).
Any help would be really appreciated :)
FFTW requires input/output arrays to be 16-byte aligned so that its SIMD code paths can be used. When you declare the arrays on the stack, this alignment can't be guaranteed, so you need to call fftw_malloc (or a similar function) to allocate them. Also, your code only creates the plan but never executes it, so no FFT is actually carried out on the input data.
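A corrected sketch along those lines (fftw_malloc for aligned allocation, plus the missing fftw_execute; with FFTW_ESTIMATE the planner does not overwrite the input, but with FFTW_MEASURE you would want to create the plan before filling the arrays):
#include <iostream>
#include <fftw3.h>
int main() {
    int vectorSize = 100;
    // fftw_malloc guarantees the alignment FFTW's SIMD code paths require
    fftw_complex* vec = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * vectorSize);
    fftw_complex* vecOut = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * vectorSize);
    for (int i = 0; i < vectorSize; i++) {
        vec[i][0] = i;  // real part
        vec[i][1] = 1;  // imaginary part
    }
    fftw_plan plan = fftw_plan_dft_1d(vectorSize, vec, vecOut, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan);  // actually perform the DFT
    std::cout << "vecOut[0] = " << vecOut[0][0] << " + " << vecOut[0][1] << "i" << std::endl;
    fftw_destroy_plan(plan);
    fftw_free(vec);
    fftw_free(vecOut);
    return 0;
}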