CUDA hello_world not running

CUDA hello_world not running - c++

I apologize if this problem has been addressed before, but I've done some searching and so far I've come up empty handed. I'm trying to compile a cuda version of Hello World, slightly modified from here. My code is:
// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010
#include <stdio.h>
#include <iostream>
const int N = 16;
const int blocksize = 16;
__global__
void hello(char *a, int *b)
{
a[threadIdx.x] += b[threadIdx.x];
}
int main()
{
char a[N] = "Hello \0\0\0\0\0\0";
int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
char *ad;
int *bd;
const int csize = N*sizeof(char);
const int isize = N*sizeof(int);
printf("%s", a);
cudaMalloc( (void**)&ad, csize );
cudaMalloc( (void**)&bd, isize );
cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );
dim3 dimBlock( blocksize, 1 );
dim3 dimGrid( 1, 1 );
int runtime_version = -1;
auto error_type_runtime = cudaRuntimeGetVersion(&runtime_version);
int driver_version = -1;
auto error_type_driver = cudaDriverGetVersion(&driver_version);
std::cout << "Blocksize: " << blocksize << std::endl;
std::cout << "NumBlocks: " << (N + blocksize - 1)/blocksize << std::endl;
std::cout << "Runtime API: " << runtime_version << std::endl;
std::cout << "cudaRuntimeGetVersion error type: " << error_type_runtime << std::endl;
std::cout << "Driver API: " << driver_version << std::endl;
std::cout << "cudaRuntimeGetVersion error type: " << error_type_driver << std::endl;
hello<<<(N + blocksize - 1)/blocksize, dimBlock>>>(ad, bd);
cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
cudaFree( ad );
cudaFree( bd );
printf("%s\n", a);
return EXIT_SUCCESS;
}
But I get:
$ nvcc cuda_hello_world.cu -arch=sm_20 --std=c++11
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ ./a.out
Hello Blocksize: 16
NumBlocks: 1
Runtime API: -1
cudaRuntimeGetVersion error type: 35
Driver API: 0
cudaRuntimeGetVersion error type: 0
Hello
I looked up cuda error 35, which is ' indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library,' but after running
$/usr/bin/nvidia-smi
I get NVIDIA-SMI 375.82 Driver Version: 375.82 which is from Jul 24, 2017, and
$nvcc --version
yields:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
so it looks like the correct libraries/drivers are installed, but nvcc can't find them. If I build with -v I get:
$ nvcc cuda_hello_world.cu -arch=sm_20 --std=c++11 -v
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda-8.0/bin
#$ _THERE_=/usr/local/cuda-8.0/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda-8.0/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda-8.0/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda-8.0/bin/../lib:
#$ PATH=/usr/local/cuda-8.0/bin/../open64/bin:/usr/local/cuda-8.0/bin/../nvvm/bin:/usr/local/cuda-8.0/bin:/home/michael/bin:/home/michael/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games/usr/local/games:/snap/bin:/usr/local/cuda-8.0/bin/:/usr/local/MATLAB/R2016b/bin/
#$ INCLUDES="-I/usr/local/cuda-8.0/bin/../targets/x86_64-linux/include"
#$ LIBRARIES= "-L/usr/local/cuda-8.0/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda-8.0/bin/../targets/x86_64-linux/lib"
Am I making a stupid mistake by not including the correct libraries, or is something totally different going on here?

In case anyone else has this problem, I was able to solve it. Turns out that simply updating/upgrading everything (including the nvidia drivers/libraries) fixed the problem.

Related

MAP_FIXED_NOREPLACE not supported on Ubuntu 20

When I run this code I get "mmap: Operation not supported", according to mmap man
that's because one of the flags is invalid (validated by MAP_SHARED_VALIDATE). The "bad" flag is MAP_FIXED_NOREPLACE
#include <fcntl.h>
#include <errno.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char** argv)
{
int fd_addr = open("test", O_CREAT | O_RDWR);
if (fd_addr == -1) {
std::cout << "open: " << strerror(errno) << "\n";
return EXIT_FAILURE;
}
if (ftruncate(fd_addr, 100) == -1) {
std::cout << "ftruncate: " << strerror(errno) << "\n";
return EXIT_FAILURE;
}
auto mem =mmap((void*)0x7f4b1618a000, 1, PROT_READ | PROT_WRITE, MAP_FIXED_NOREPLACE | MAP_SHARED_VALIDATE | MAP_LOCKED, fd_addr, 0);
if (mem == MAP_FAILED) {
std::cout << "mmap: " << strerror(errno) << "\n";
return EXIT_FAILURE;
}
}
Any sane ideas what that could be and how to figure out what the problem is? I use Ubuntu 20 (kernel 5.4.0-131-generic) and g++-11 (with glibc 2.31)
g++-11 (Ubuntu 11.1.0-1ubuntu1~20.04) 11.1.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
Replacing MAP_FIXED_NOREPLACE with MAP_FIXED works just fine.
Compiling like that:
g++-11 -g0 -Ofast -DNDEBUG -Wall -Werror --std=c++2a -march=native -flto -fno-rtti main.cpp -pthread -lrt

The only way to answer your question is to look at the source.
Taking an excerpt from do_mmap:
switch (flags & MAP_TYPE) {
case MAP_SHARED:
/*
* Force use of MAP_SHARED_VALIDATE with non-legacy
* flags. E.g. MAP_SYNC is dangerous to use with
* MAP_SHARED as you don't know which consistency model
* you will get. We silently ignore unsupported flags
* with MAP_SHARED to preserve backward compatibility.
*/
flags &= LEGACY_MAP_MASK;
fallthrough;
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
and comparing your flags against those in LEGACY_MAP_MASK,
it becomes evident that MAP_FIXED_NOREPLACE is not part of LEGACY_MAP_MASK, which returns the "operation not supported" error you report.
In this case, the MAP_FIXED_NOREPLACE bits result in an extra check up front and otherwise behaves as MAP_FIXED.
Long story short: in this situation you can replace MAP_SHARED_VALIDATE with MAP_SHARED without loss of functionality.
Doing what you suggested (MAP_FIXED_NOREPLACE to MAP_FIXED) has the potential of accidentally creating overlapping memory mappings.

Unresolved extern function error with template default parameter in CUDA9.2 and above

I am working with some c++/CUDA code that makes significant use of templates for both classes and functions. We have mostly been using CUDA 9.0 and 9.1, where everything compiles and runs fine. However, compilation fails on newer versions of CUDA (specifically 9.2 and 10).
After further investigation, it seems that trying to compile exactly the same code with CUDA version 9.2.88 and above will fail, whereas with CUDA version 8 through 9.1.85 the code compiles and runs correctly.
A minimal example of the problematic code can be written as follows:
#include <iostream>
template<typename Pt>
using Link_force = void(Pt* x, Pt* y);
template<typename Pt>
__device__ void linear_force(Pt* x, Pt* y)
{
*x += *y;
}
template<typename Pt, Link_force<Pt> force>
__global__ void link(Pt* x, Pt* y)
{
force(x, y);
}
template<typename Pt = float, Link_force<Pt> force = linear_force<Pt>>
void apply_forces(Pt* x, Pt* y)
{
link<Pt, force><<<1, 1, 0>>>(x, y);
}
int main(int argc, const char* argv[])
{
float *x, *y;
cudaMallocManaged(&x, sizeof(float));
cudaMallocManaged(&y, sizeof(float));
*x = 0.0f;
*y = 42.0f;
std::cout << "Pre :: x = " << *x << ", y = " << *y << '\n';
apply_forces(x, y);
cudaDeviceSynchronize();
std::cout << "Post :: x = " << *x << ", y = " << *y << '\n';
return 0;
}
If I compile with nvcc, as below, the eventual result is an error from ptxas:
$ nvcc --verbose -std=c++11 -arch=sm_61 minimal_example.cu
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda-9.2/bin
#$ _THERE_=/usr/local/cuda-9.2/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_SIZE_=64
#$ TOP=/usr/local/cuda-9.2/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda-9.2/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda-9.2/bin/../lib:/usr/local/cuda-9.2/lib64:
#$ PATH=/usr/local/cuda-9.2/bin/../nvvm/bin:/usr/local/cuda-9.2/bin:/usr/local/cuda-9.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
#$ INCLUDES="-I/usr/local/cuda-9.2/bin/..//include"
#$ LIBRARIES= "-L/usr/local/cuda-9.2/bin/..//lib64/stubs" "-L/usr/local/cuda-9.2/bin/..//lib64"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -std=c++11 -D__CUDA_ARCH__=610 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda-9.2/bin/..//include" -D"__CUDACC_VER_BUILD__=148" -D"__CUDACC_VER_MINOR__=2" -D"__CUDACC_VER_MAJOR__=9" -include "cuda_runtime.h" -m64 "minimal_example.cu" > "/tmp/tmpxft_0000119e_00000000-8_minimal_example.cpp1.ii"
#$ cicc --c++11 --gnu_version=70300 --allow_managed -arch compute_61 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0000119e_00000000-2_minimal_example.fatbin.c" -tused -nvvmir-library "/usr/local/cuda-9.2/bin/../nvvm/libdevice/libdevice.10.bc" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0000119e_00000000-3_minimal_example.module_id" --orig_src_file_name "minimal_example.cu" --gen_c_file_name "/tmp/tmpxft_0000119e_00000000-5_minimal_example.cudafe1.c" --stub_file_name "/tmp/tmpxft_0000119e_00000000-5_minimal_example.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0000119e_00000000-5_minimal_example.cudafe1.gpu" "/tmp/tmpxft_0000119e_00000000-8_minimal_example.cpp1.ii" -o "/tmp/tmpxft_0000119e_00000000-5_minimal_example.ptx"
#$ ptxas -arch=sm_61 -m64 "/tmp/tmpxft_0000119e_00000000-5_minimal_example.ptx" -o "/tmp/tmpxft_0000119e_00000000-9_minimal_example.sm_61.cubin"
ptxas fatal : Unresolved extern function '_Z12linear_forceIfEvPT_S1_'
# --error 0xff --
As far as I can tell, the error only occurs when using the default template parameter Link_force<Pt> force = linear_force<Pt> in the template definition for apply_forces. For example, explicitly specifying the template parameters in main
apply_forces<float, linear_force>(x, y);
where we call apply_forces will result in everything compiling and running correctly, as does defining the template parameters explicitly in any other way.
Is it likely that this is a problem with the nvcc toolchain? I didn't spot any changes in the CUDA release notes that would be a likely culprit, so I'm a bit stumped.
Since this was working with older versions of nvcc, and now is not, I don't understand whether this is in fact an illegitimate use of template default parameters? (perhaps specifically when combined with CUDA functions?)

This is a bug in CUDA 9.2 and 10.0 and a fix is being worked on. Thanks for pointing it out.
One possible workaround as you've already pointed out would be to revert to CUDA 9.1
Another possible workaround is to repeat the offending template instantiation in the body of the function (e.g. in a discarded statement). This has no impact on performance, it just forces the compiler to emit code for that function:
template<typename Pt = float, Link_force<Pt> force = linear_force<Pt>>
void apply_forces(Pt* x, Pt* y)
{
(void)linear_force<Pt>; // add this
link<Pt, force><<<1, 1, 0>>>(x, y);
}
I don't have further information on when a fix will be available, but it will be in a future CUDA release.

Can't run any available OpenCL sample with recent API CL/cl2.hpp

I am trying to get started with OpenCL. After installing it, I have found a little oddity compared to almost all online tutorials, there was no cl.hpp header, only cl2.hpp. I have learned that it is a new version. A new version of API with little to no tutorials available.
The tutorials I have found failed to compile. This one (http://github.khronos.org/OpenCL-CLHPP/) for example failed to compile because of an undefined variable and if I avoided it, it reported that I had no OpenCL 2.0 devices. I was able to get this one past the device check (http://simpleopencl.blogspot.cz/2013/06/tutorial-simple-start-with-opencl-and-c.html), but it crashes when I try to create context (device was found).
Code I am trying:
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if(all_platforms.size()==0){
std::cout<<" No platforms found. Check OpenCL installation!\n";
exit(1);
}
cl::Platform default_platform=all_platforms[0];
std::cout << "Using platform: "<<default_platform.getInfo<CL_PLATFORM_NAME>()<<"\n";
//get default device of the default platform
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if(all_devices.size()==0){
std::cout<<" No devices found. Check OpenCL installation!\n";
exit(1);
}
cl::Device default_device=all_devices[0];
std::cout<< "Using device: "<<default_device.getInfo<CL_DEVICE_NAME>()<<"\n";
cl::Context context({default_device}); // null pointer read here
cl::Program::Sources sources;
// kernel calculates for each element C=A+B
std::string kernel_code=
" void kernel simple_add(global const int* A, global const int* B, global int* C){ "
" C[get_global_id(0)]=A[get_global_id(0)]+B[get_global_id(0)]; "
" } ";
sources.push_back({kernel_code.c_str(),kernel_code.length()});
cl::Program program(context,sources);
if(program.build({default_device})!=CL_SUCCESS){
std::cout<<" Error building: "<<program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<<"\n";
exit(1);
}
// create buffers on the device
cl::Buffer buffer_A(context,CL_MEM_READ_WRITE,sizeof(int)*10);
cl::Buffer buffer_B(context,CL_MEM_READ_WRITE,sizeof(int)*10);
cl::Buffer buffer_C(context,CL_MEM_READ_WRITE,sizeof(int)*10);
int A[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int B[] = {0, 1, 2, 0, 1, 2, 0, 1, 2, 0};
//create queue to which we will push commands for the device.
cl::CommandQueue queue(context,default_device);
//write arrays A and B to the device
queue.enqueueWriteBuffer(buffer_A,CL_TRUE,0,sizeof(int)*10,A);
queue.enqueueWriteBuffer(buffer_B,CL_TRUE,0,sizeof(int)*10,B);
//run the kernel
cl::KernelFunctor simple_add(cl::Kernel(program,"simple_add"),queue,cl::NullRange,cl::NDRange(10),cl::NullRange);
simple_add(buffer_A,buffer_B,buffer_C);
//alternative way to run the kernel
cl::Kernel kernel_add=cl::Kernel(program,"simple_add");
kernel_add.setArg(0,buffer_A);
kernel_add.setArg(1,buffer_B);
kernel_add.setArg(2,buffer_C);
queue.enqueueNDRangeKernel(kernel_add,cl::NullRange,cl::NDRange(10),cl::NullRange);
queue.finish();
int C[10];
//read result C from the device to array C
queue.enqueueReadBuffer(buffer_C,CL_TRUE,0,sizeof(int)*10,C);
std::cout<<" result: \n";
for(int i=0;i<10;i++){
std::cout<<C[i]<<" ";
}
I am using Intel Gen OCL Driver on Intel(R) HD Graphics IvyBridge M GT2, on Ubuntu 16.04.
Any idea what am I doing wrong?

If the driver doesn't support CL2.0, you need to "lower the expectation" of cl2.hpp by setting the minimum and target versions of CL to the relevant version (e.g. 1.2):
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
That way, the code inside cl2.hpp will compile for a CL1.2 environment.

malloc issue with a OpenCL code - Huge size of mach_vm_map on OS X

I have an issue about the porting of a OpenCL code from Linux (where it's working) to Mac OS X 10.9.5.
At the part of this code where I am using malloc, when I launch executable, I get the following error :
OpenCLSimu(13400,0x7fff7da7c310) malloc: *** mach_vm_map(size=1556840295209897984) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
As you can see, the requested memory is huge : 1556840295209897984 bytes, so the allocation fails.
Here's the routine for allocation part (NumBodies is 30720 in my case) :
int OpenCLSimu::setup()
{
// Make sure numParticles is multiple of group size
numBodies = (cl_int)(((size_t) getNumParticles()
< groupSize) ? groupSize : getNumParticles());
initPos = (cl_double*) malloc(numBodies * sizeof(cl_double4));
CHECK_ALLOCATION(initPos, "Failed to allocate host memory. (initPos)");
initVel = (cl_double*) malloc(numBodies * sizeof(cl_double4));
CHECK_ALLOCATION(initVel, "Failed to allocate host memory. (initVel)");
pos = (cl_double*) malloc(numBodies * sizeof(cl_double4));
CHECK_ALLOCATION(pos, "Failed to allocate host memory. (pos)");
vel= (cl_double*) malloc(numBodies * sizeof(cl_double4));
CHECK_ALLOCATION(vel, "Failed to allocate host memory. (vel)");
return NBODY_SUCCESS;
}
I don't know if there is a relation but I've found out on https://bugs.openjdk.java.net/browse/JDK-8043507 (with Java language) that on OS X, we have to specify uint32_t type for size.
Maybe this issue comes from the clang compiler that I use for compilation.
CC = /usr/bin/clang
CXX = /usr/bin/clang++
DEFINES = -DQT_NO_DEBUG -DQT_OPENGL_LIB -DQT_GUI_LIB -DQT_CORE_LIB -DQT_SHARED
CFLAGS = -pipe -O2 -arch x86_64 -Xarch_x86_64 -mmacosx-version-min=10.9 -Wall -W $(DEFINES)
CXXFLAGS = -pipe -O2 -arch x86_64 -Xarch_x86_64 -mmacosx-version-min=10.9 -Wall -W $(DEFINES)
I tried also to set numBodies to 3072 in order to see the huge size of mach_vm_map and I get :
malloc: * mach_vm_map(size=868306322687266816) failed (error code=3)
* error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
I noticed these sizes are always changing for different executions.
Finally, I had for the Linux version for pos and vel arrays into the above routine :
pos = (cl_double*)memalign(16, numBodies * sizeof(cl_double4));
vel = (cl_double*)memalign(16, numBodies * sizeof(cl_double4));
instead of malloc using :
pos = (cl_double*) malloc(numBodies * sizeof(cl_double4));
vel= (cl_double*) malloc(numBodies * sizeof(cl_double4));
I have seen that on OS X, the data were aligned by default on 16 byte boundary, that's why I replace memalign by malloc for MacOS version
If someone had a clue, this would be nice.
Thanks in advance.
UPDATE :
The error occurs between the "cout << size of source =" << sourceSize << endl" and the "cout << "status =" << status << endl", so it fails on the clCreateProgramWithSource method :
// create a CL program using the kernel source
const char *kernelName = "Simu_Kernels.cl";
FILE *fp = fopen(kernelName, "r");
if (!fp) {
fprintf(stderr, "Failed to load kernel.\n");
exit(1);
}
char *source = (char*)malloc(10000);
int sourceSize = fread( source, 1, 10000, fp);
fclose(fp);
cout << "size of source =" << sourceSize << endl;
// Create a program from the kernel source
program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
//program = clCreateProgramWithSource(context, 1, (const char **)&source, NULL, &status);
cout << "status =" << status << endl;
cout << "current_device =" << current_device<< endl;
At the execution, I get :
Selected Platform Vendor : Apple
Device 0 : Iris Pro Device ID is 0x1024500
Device 1 : GeForce GT 750M Device ID is 0x1022700
size of source =2026
OpenCLSimu(15802,0x7fff7da7c310) malloc: *** mach_vm_map(size=59606861803950080) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
status =-6
and status = -6 corresponds to a CL_OUT_OF_HOST_MEMORY
I make you notice that I have 2 GPU units on my macbook (Iris Pro Device and GeForce GT 750M). I have the same error for both devices.

Try to create program as follows:
static char* Read_Source_File(
const char *filename,
size_t *file_len)
{
long int
size = 0,
res = 0;
char *src = NULL;
FILE *file = fopen(filename, "rb");
if (!file) return NULL;
if (fseek(file, 0, SEEK_END))
{
fclose(file);
return NULL;
}
size = ftell(file);
*file_len = size;
if (size == 0)
{
fclose(file);
return NULL;
}
rewind(file);
src = (char *)calloc(size + 1, sizeof(char));
if (!src)
{
src = NULL;
fclose(file);
return src;
}
res = fread(src, 1, sizeof(char) * size, file);
if (res != sizeof(char) * size)
{
fclose(file);
free(src);
return (char*)NULL;
}
src[size] = '\0'; /* NULL terminated */
fclose(file);
return src;
}
size_t file_len;
char *source = Read_Source_File("path_to_kernel.cl", &file_len);
if(source){
program = clCreateProgramWithSource(context, 1, (const char **)&src_file, NULL, &status);
}

cuModuleLoadDataEx ignores all options

This question is similar to cuModuleLoadDataEx options but I would like to bring the topic up again and in addition provide more information.
When loading a PTX string with the NV driver via cuModuleLoadDataEx it seems to ignore all options all together. I provide full working examples so that anyone interested can directly and with no effort reproduce this. First a small PTX kernel (save this as small.ptx) then the C++ program that loads the PTX kernel.
.version 3.1
.target sm_20, texmode_independent
.address_size 64
.entry main()
{
ret;
}
main.cc
#include<cstdlib>
#include<iostream>
#include<fstream>
#include<sstream>
#include<string>
#include<map>
#include "cuda.h"
int main(int argc,char *argv[])
{
CUdevice cuDevice;
CUcontext cuContext;
CUfunction func;
CUresult ret;
CUmodule cuModule;
cuInit(0);
std::cout << "trying to get device 0\n";
ret = cuDeviceGet(&cuDevice, 0);
if (ret != CUDA_SUCCESS) { exit(1);}
std::cout << "trying to create a context\n";
ret = cuCtxCreate(&cuContext, 0, cuDevice);
if (ret != CUDA_SUCCESS) { exit(1);}
std::cout << "loading PTX string from file " << argv[1] << "\n";
std::ifstream ptxfile( argv[1] );
std::stringstream buffer;
buffer << ptxfile.rdbuf();
ptxfile.close();
std::string ptx_kernel = buffer.str();
std::cout << "Loading PTX kernel with driver\n" << ptx_kernel;
const unsigned int jitNumOptions = 3;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
void **jitOptVals = new void*[jitNumOptions];
// set up size of compilation log buffer
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
int jitLogBufferSize = 1024*1024;
jitOptVals[0] = (void *)&jitLogBufferSize;
// set up pointer to the compilation log buffer
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char *jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;
// set up wall clock time
jitOptions[2] = CU_JIT_WALL_TIME;
float jitTime = -2.0;
jitOptVals[2] = &jitTime;
ret = cuModuleLoadDataEx( &cuModule , ptx_kernel.c_str() , jitNumOptions, jitOptions, (void **)jitOptVals );
if (ret != CUDA_SUCCESS) { exit(1);}
std::cout << "walltime: " << jitTime << "\n";
std::cout << std::string(jitLogBuffer) << "\n";
}
Build (assuming CUDA is installed under /usr/local/cuda, I use CUDA 5.0):
g++ -I/usr/local/cuda/include -L/usr/local/cuda/lib64/ main.cc -o main -lcuda
If someone is able to extract any sensible information from the compilation process that would be great! The documentation of CUDA driver API where cuModuleLoadDataEx is explained (and which options it is supposed to accept) http://docs.nvidia.com/cuda/cuda-driver-api/index.html
If I run this, the log is empty and jitTime wasn't even touched by the NV driver:
./main small.ptx
trying to get device 0
trying to create a context
loading PTX string from file empty.ptx
Loading PTX kernel with driver
.version 3.1
.target sm_20, texmode_independent
.address_size 64
.entry main()
{
ret;
}
walltime: -2
EDIT:
I managed to get the JIT compile time. However it seems that the driver expects an array of 32bit values as OptVals. Not as stated in the manual as an array of pointers (void *) which are on my system 64 bits. So, this works:
const unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
int *jitOptVals = new int[jitNumOptions];
jitOptions[0] = CU_JIT_WALL_TIME;
// here the call to cuModuleLoadDataEx
std::cout << "walltime: " << (float)jitOptions[0] << "\n";
I believe that it is not possible to do the same with an array of void *. The following code does not work:
const unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
void **jitOptVals = new void*[jitNumOptions];
jitOptions[0] = CU_JIT_WALL_TIME;
// here the call to cuModuleLoadDataEx
// here I also would have a problem casting a 64 bit void * to a float (32 bit)
EDIT
Looking at the JIT compilation time jitOptVals[0] was misleading. As mentioned in the comments, the JIT compiler caches previous translations and won't update the JIT compile time if it finds a cached compilation. Since I was looking whether this value has changed or not I assumed that the call ignores the options all together. Which it doesn't. It's works fine.

Your jitOptVals should not contain pointers to your values, instead cast the values to void*:
// set up size of compilation log buffer
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
int jitLogBufferSize = 1024*1024;
jitOptVals[0] = (void *)jitLogBufferSize;
// set up pointer to the compilation log buffer
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char *jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;
// set up wall clock time
jitOptions[2] = CU_JIT_WALL_TIME;
float jitTime = -2.0;
//Keep jitOptVals[2] empty as it only an Output value:
//jitOptVals[2] = (void*)jitTime;
and after cuModuleLoadDataEx, you get your jitTime like jitTime = (float)jitOptions[2];

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

CUDA hello_world not running - c++

In case anyone else has this problem, I was able to solve it. Turns out that simply updating/upgrading everything (including the nvidia drivers/libraries) fixed the problem.

Related

MAP_FIXED_NOREPLACE not supported on Ubuntu 20

Unresolved extern function error with template default parameter in CUDA9.2 and above

Can't run any available OpenCL sample with recent API CL/cl2.hpp

malloc issue with a OpenCL code - Huge size of mach_vm_map on OS X

cuModuleLoadDataEx ignores all options

Categories

Resources