I have a program with a variety of kernels. In production these kernels run on a GPU and require JIT (just-in-time) compilation because we use specialisation constants. For testing we run on the CPU, where we would like AOT (ahead-of-time) compilation to save time when running the tests.
So we have a very simple executable:
#include <sycl/sycl.hpp>

int main()
{
    auto device = sycl::device{sycl::gpu_selector_v}; // Note that we are selecting the GPU here!
    auto queue = sycl::queue{device};
    queue
        .submit(
            [](sycl::handler& cgh)
            {
                sycl::stream out(1024, 256, cgh);
                cgh.parallel_for<class HELLO_WORLD>(
                    sycl::range<1>{5},
                    [=](sycl::id<1> id) { out << "Hello #" << id.get(0) << "\n"; }
                );
            }
        )
        .wait();
    return 0;
}
That is built through cmake with:
set(CMAKE_CXX_COMPILER "icpx")
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
set(SYCL_COMPILER_FLAGS "-fclang-abi-compat=7 -fsycl -sycl-std=2020 -fp-model=precise")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${SYCL_COMPILER_FLAGS}")
set(SYCL_LINK_FLAGS "-fsycl ")
set(CMAKE_CXX_LINK_FLAGS "${CMAKE_CXX_LINK_FLAGS} ${SYCL_LINK_FLAGS}")
add_executable(mix_jit_aot
examples/mix_jit_aot.cpp
)
This compiles and runs just fine on the device:
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) UHD Graphics [0x9bc4] 3.0 [22.28.23726.1]
However, if we add AOT compilation for a different device, say a CPU:
set(SYCL_AOT_COMPILE_FLAGS -fsycl-targets=spir64_x86_64)
target_compile_options(mix_jit_aot PUBLIC
${SYCL_AOT_COMPILE_FLAGS}
)
set(SYCL_AOT_LINK_FLAGS ${SYCL_AOT_COMPILE_FLAGS} -Xsycl-target-backend=spir64_x86_64 "-march avx2")
target_link_options(mix_jit_aot PUBLIC
${SYCL_AOT_LINK_FLAGS}
)
It compiles, and will run if I select the CPU device (i.e. auto device = sycl::device{sycl::cpu_selector_v};). However, if I use the GPU, it crashes with:
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)
Aborted (core dumped)
Is it possible to compile AOT for a single device, but use JIT compilation for everything else?
When compiling for an Intel GPU, the documentation here says to use -fsycl-targets=spir64_gen with the appropriate -Xs flag value.
Your current command only uses the AOT flag for Intel CPUs.
You should be able to combine all of these flags to have the binary compiled for both targets (CPU and GPU).
Also, this:
set(SYCL_AOT_COMPILE_FLAGS -fsycl-targets=spir64_x86_64)
should be:
set(SYCL_AOT_COMPILE_FLAGS -fsycl-targets=spir64,spir64_x86_64)
since spir64 is the JIT target.
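Putting both suggestions together, the CMake fragment might look like the sketch below. Note this is untested, and the `-device tgl` value passed to the spir64_gen backend is only an example device name; you would substitute the identifier for your actual GPU.

```cmake
# Sketch (untested): AOT for the CPU and an Intel GPU, plus spir64 so that
# any other device still falls back to JIT compilation.
set(SYCL_AOT_COMPILE_FLAGS -fsycl-targets=spir64,spir64_x86_64,spir64_gen)
target_compile_options(mix_jit_aot PUBLIC
    ${SYCL_AOT_COMPILE_FLAGS}
)
set(SYCL_AOT_LINK_FLAGS ${SYCL_AOT_COMPILE_FLAGS}
    -Xsycl-target-backend=spir64_x86_64 "-march avx2"
    -Xsycl-target-backend=spir64_gen "-device tgl"   # example device; adjust
)
target_link_options(mix_jit_aot PUBLIC
    ${SYCL_AOT_LINK_FLAGS}
)
```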
Related
I keep getting an "invalid device function" on my kernel launch.
Google turns up a plethora of instances of this error; however, all of them seem to be related to a mismatch of the SASS/PTX code embedded in the binary.
The way I understand how it works is:
SASS code can only be executed by a GPU with the exact same SM version
PTX code is forward-compatible, i.e. any newer GPU will be able to run it (however, the driver needs to JIT-compile it)
I need to specify what I want to target by passing suitable -gencode flags to nvcc: -gencode arch=compute_30,code=sm_30 will create SASS targeting SM 3.0, while -gencode arch=compute_60,code=compute_60 will create PTX code
To use CUDA with static and shared libraries, I need to compile for position-independent code and enable separable compilation
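If the points above are right, an nvcc invocation that covers both cases would combine a SASS entry and a PTX entry in one fat binary. This is illustrative only; the arch values and file names are examples:

```shell
# Illustrative only: embeds exact SASS for SM 6.1 devices, plus PTX
# (compute_61) that the driver can JIT-compile for newer architectures.
nvcc -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_61,code=compute_61 \
     -o app main.cu
```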
What I did now is:
Confirmed that I have SM 6.1 for my Titan Xp
Forced nvcc to generate compatible code:
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_30,code=compute_30")
confirmed this gets compiled into my object file with cuobjdump:
./cuobjdump /mnt/linuxdata/campvis-nx/build/bin/libcuda-interop-cuda.a
member /mnt/linuxdata/campvis-nx/build/bin/libcuda-interop-cuda.a:test.cu.o:
Fatbin ptx code:
================
arch = sm_61
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
ptxasOptions = --compile-only
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
Fatbin ptx code:
================
arch = sm_30
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
ptxasOptions = --compile-only
Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
member /mnt/linuxdata/campvis-nx/build/bin/libcuda-interop-cuda.a:mocs_compilation.cpp.o:
realized that only parts of it (the SASS part?) are linked into my shared library (why??):
./cuobjdump /mnt/linuxdata/campvis-nx/build/bin/libcampvis-modules.so
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
I even tried compiling all SM versions from here into the same binary, still with the same result.
It seems that, according to this example, embedding PTX is more work than simply enabling its compilation in CMake, so for now I would be happy with a SASS-only version.
Did I misunderstand any of the information above?
Are there other possible reasons for an "invalid device function" error?
I can post the code if it helps but I feel this is more of a build system problem..
Ultimately, as expected, this was due to a build system setup problem.
TLDR version:
I managed to fix it by changing the library with my CUDA code from STATIC to SHARED.
To fix it, I first used the automatic architecture detection from CMake's FindCUDA (which seems to have created SM 6.1 flags, so I was at least right there):
cuda_select_nvcc_arch_flags(ARCH_FLAGS Auto)
list(APPEND CUDA_NVCC_FLAGS ${ARCH_FLAGS})
The application I am integrating this into is modularized using shared libraries. I was unable to include the .cu files in the new module directly because nvcc did not like some of the compilation flags. Therefore, my intention was to create a separate static library containing only the CUDA code, which would then be linked into the shared module. However, it seems that this does not properly include the device code in the shared library (possibly because static libraries are linked with a "normal" C++ linker?).
Ultimately, this is the code I ended up using:
add_library(cuda-interop SHARED [c++ only code])
file(GLOB cuda_SOURCES "modules/cudainterop/cuda/*.cu")
# the library that only has the cuda code
add_library(cuda-interop-cuda SHARED ${cuda_SOURCES})
set_target_properties(cuda-interop-cuda PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
set_target_properties(cuda-interop-cuda PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_link_libraries(cuda-interop PRIVATE cuda-interop-cuda)
I'm debugging a crash in my OpenCL application. I attempted to use ASan to pin down where the problem originates, but then I discovered that when I recompile with ASan enabled, my application cannot find any OpenCL devices. Simply adding -fsanitize=address to the compiler options made my program unable to use OpenCL.
With further testing, I am certain ASan is the reason.
Why is this happening? How can I use asan with OpenCL?
An MVCE:
#include <CL/cl.hpp>

#include <iostream>
#include <vector>

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    if (platforms.empty())
        std::cout << "Compiled with ASan\n";
    else
        std::cout << "Compiled normally\n";
}
cl::Platform::get returns CL_SUCCESS but an empty list of platforms.
Some information about my setup:
GPU: GTX 780Ti
Driver: 418.56
OpenCL SDK: Nvidia OpenCL / POCL 1.3 with CPU and CUDA backend
Compiler: GCC 8.2.1
OS: Arch Linux (Kernel 5.0.7 x64)
The NVIDIA driver is known to conflict with ASAN. It attempts to mmap(2) memory into a fixed virtual memory range within the process, which coincides with ASAN's write-protected shadow gap region. Given that ASAN reserves about 20TB of virtual address space on startup, such conflicts are not unlikely with other programs or drivers, too.
ASAN recognizes certain flags that may be set in the ASAN_OPTIONS environment variable. To resolve the shadow gap range conflict, set the protect_shadow_gap option to 0. For example, assuming a POSIX-like shell, you may run your program like
$ ASAN_OPTIONS=protect_shadow_gap=0 ./mandelbrot
A writable shadow gap incurs an additional performance cost under ASAN, since an unprotected gap requires shadowing of its own. This is why it's not recommended to set this option globally (e.g., in your shell startup script); enable it only for the programs that actually require it.
I'm nearly certain this is the root cause of your issue. I am using ASAN with CUDA programs, and always need to set this option. The failure reported by CUDA without it is very similar: cudaErrorNoDevice error when I attempt to select a device.
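If the affected program is part of a CTest suite, one way to scope the option to just the tests that need it (rather than setting it globally) is the test's ENVIRONMENT property. This is an untested sketch; `mandelbrot` is the example binary name from above:

```cmake
# Sketch: set ASAN_OPTIONS only for the one test that needs the writable
# shadow gap, leaving the rest of the suite fully protected.
add_test(NAME mandelbrot_test COMMAND mandelbrot)
set_tests_properties(mandelbrot_test PROPERTIES
    ENVIRONMENT "ASAN_OPTIONS=protect_shadow_gap=0")
```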
I am new to the Thrust library and am trying to use it in my project. Here is a very simple code example. It compiles without any problem; however, when I try to run it, it gives me this error:
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what(): std::bad_alloc: unknown error
along with a warning:
nvlink warning : SM Arch ('sm_20') not found in ...
The project can be reproduced by using the following two files.
test.cu
#include <thrust/device_vector.h>

int main() {
    thrust::device_vector<int> x;
    x.resize(10);
}
CMakeLists.txt
cmake_minimum_required(VERSION 2.8.9)
project(test_project)
find_package(CUDA QUIET REQUIRED)
list(APPEND CUDA_NVCC_FLAGS "-std=c++11;-arch=compute_52")
set(CUDA_SEPARABLE_COMPILATION ON)
cuda_add_executable("cuda_test" "test.cu")
After some testing, it is clear that if the line set(CUDA_SEPARABLE_COMPILATION ON) is removed, the program runs without problems. But I really need separable compilation activated for my project.
Any help or hint would be appreciated.
UPDATE:
As requested by @RobertCrovella, here is some more info.
The CUDA version is 7.5, newly installed on Ubuntu 14.04 with a GTX 980. I did not update the Thrust library after that.
The following is the actual command generated by cmake by using "make VERBOSE=1".
CMake script with separable compilation
CMake script without separable compilation
UPDATE 2:
The same error is confirmed by @merelyMark. Since both the code and the CMakeLists file are extremely simple, is it possible that this is a bug in Thrust / CUDA? [EDIT] No.
UPDATE 3:
As pointed out by @RobertCrovella, the Thrust library works fine with the proper compile commands. Now the question is: how can I generate those commands from CMakeLists?
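One way to sidestep the legacy FindCUDA wrapper entirely is CMake's first-class CUDA language support, where separable compilation and the target architecture are per-target properties. This is a sketch assuming a much newer CMake than the 2.8.9 in the question (native CUDA support arrived in 3.8; CUDA_ARCHITECTURES in 3.18), so it is not a drop-in fix for the original setup:

```cmake
# Sketch using native CUDA language support (CMake >= 3.18).
cmake_minimum_required(VERSION 3.18)
project(test_project LANGUAGES CXX CUDA)

add_executable(cuda_test test.cu)
set_target_properties(cuda_test PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON   # replaces set(CUDA_SEPARABLE_COMPILATION ON)
    CUDA_ARCHITECTURES 52)          # SASS + PTX handled by CMake
```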
Apologies in advance, I don't have enough points to add a comment, but I can confirm the behavior on my rig. This compiles properly on my machine with an E5-1650 v3 and a Quadro M4000 with CUDA 7.5 and Ubuntu 14.04.3. I get one linker warning:
nvlink warning : SM Arch ('sm_20') not found in ...
I can confirm the behavior when I run it:
./cuda_test
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what(): std::bad_alloc: unknown error
Aborted (core dumped)
I agree with @RobertCrovella, I'm not really sure what you're trying to accomplish here.
Here's my VERBOSE output for separable compilation.
Here's my VERBOSE output without separable compilation.
I use the CMake GUI tool to configure my CUDA project in VS2013.
CMakeLists.txt is as below:
project(CUDA_PART)
# required cmake version
cmake_minimum_required(VERSION 3.0)
include_directories(${CUDA_PART_SOURCE_DIR}/common)
# packages
find_package(CUDA REQUIRED)
# nvcc flags
set(CUDA_NVCC_FLAGS -gencode arch=compute_20,code=sm_20;-G;-g)
set(CUDA_VERBOSE_BUILD ON)
#FILE(GLOB SOURCES "*.cu" "*.cpp" "*.c" "*.h")
CUDA_ADD_EXECUTABLE(CUDA_PART hist_gpu_shmem_atomics.cu)
The .cu file is from the CUDA by Example source code: hist_gpu_shmem_atomics.cu
There are two problems:
After the line histo_kernel<<<blocks * 2, 256>>>(dev_buffer, SIZE, dev_histo); an "invalid device function" error occurs.
When I use the CUDA debugging tool to debug, it cannot trigger breakpoints in the device code.
But when I create a project with the same code using the CUDA project template in Visual Studio 2013, it works correctly!
So, is there something wrong in the CMakeLists.txt ?
OS: Win7 64-bit; GPU: GTX 960; CUDA: 7.5; VS: 2013 (and also 2010)
When I set the "Code Generation" option in VS2013 as follows:
The CUDA_NVCC_FLAGS turn out to be -gencode=arch=compute_20,code=\"sm_20,compute_20\"
It equals to:
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_20,code=compute_20
So, I guess it will generate two versions of machine code: the first (SASS) with both virtual and real architectures, and the second (PTX) with only a virtual architecture. Since my GTX 960 is a cc 5.2 device, it chooses the second one (PTX) and converts it to suitable SASS.
This is a problem:
set(CUDA_NVCC_FLAGS -gencode arch=compute_20,code=sm_20;-G;-g)
Those flags will cause nvcc to generate SASS code (only) for a cc 2.0 device (only). Such cc2.0 SASS code will not run on your cc5.2 device (GTX960). "Invalid device function" is exactly the error you would get when trying to launch a kernel in such a scenario. Since the kernel will never launch, trying to hit breakpoints in device code won't work.
I'm not a CMake expert, so there might be other, more sensible approaches, but one possible way to try to fix this might be:
set(CUDA_NVCC_FLAGS -gencode arch=compute_52,code=sm_52;-G;-g)
which should generate code for your cc5.2 device. There are undoubtedly other possible settings here, you may want to read this or the nvcc manual for more background on compile options to target specific devices.
Also note that -G generates device debug code, which is fine if that is what you want. However it will generally run slower than code compiled without that switch. If you want to debug, however, that switch is necessary.
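If the same binary also has to run on other GPU generations, one option is to list several -gencode pairs and finish with a PTX entry for forward compatibility. The extra architectures below are examples only, not a recommendation for any particular device:

```cmake
# Example only: SASS for cc 5.2 and cc 6.1 devices, plus PTX for cc 6.1 so
# newer GPUs can JIT-compile it. -G/-g kept for device debugging as above.
set(CUDA_NVCC_FLAGS
    -gencode arch=compute_52,code=sm_52;
    -gencode arch=compute_61,code=sm_61;
    -gencode arch=compute_61,code=compute_61;
    -G;-g)
```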
I want to make a program which will be distributed to customers, so I want to protect my kernel code from hackers (someone told me that the AMD driver somehow puts the kernel source inside the binary, so a hacker can log the kernel with an AMD device).
As I'm not experienced with VexCL yet, what is the proper compile line to distribute just the binaries?
For example, with CUDA I can type: nvcc -gencode arch=compute_10,code=sm_10 myfile.cu -o myexec
What is the equivalent in VexCL?
Also, does VexCL work on Mac OS? Which IDE? (This is a future task, as I have no prior experience with Mac OS.)
My previous experience with OpenCL was using the STDCL library (but it is buggy on Windows, and has no Mac support).
I am the developer of VexCL, and I have also replied to your question here.
VexCL generates OpenCL/CUDA kernels for the expressions you use in your code at runtime. Moreover, it allows the user to dump the generated kernel sources to the standard output stream. For example, if you save the following to a hello.cpp file:
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::Env);
    vex::vector<double> x(ctx, 1024);
    vex::vector<double> y(ctx, 1024);

    y = 2 * sin(M_PI * x) + 1;
}
then compile it with
g++ -o hello hello.cpp -std=c++11 -I/path/to/vexcl -lOpenCL -lboost_system
then set VEXCL_SHOW_KERNELS=1 and run the compiled binary:
$ export VEXCL_SHOW_KERNELS=1
$ ./hello
you will see the kernel that was generated for the expression y = 2 * sin(M_PI * x) + 1:
#if defined(cl_khr_fp64)
# pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
# pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif
kernel void vexcl_vector_kernel
(
    ulong n,
    global double * prm_1,
    int prm_2,
    double prm_3,
    global double * prm_4,
    int prm_5
)
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0))
    {
        prm_1[idx] = ( ( prm_2 * sin( ( prm_3 * prm_4[idx] ) ) ) + prm_5 );
    }
}
VexCL also allows caching of the compiled kernel binaries (in the $HOME/.vexcl folder by default), and it saves the source code together with the cache.
On the one hand, the sources that you see, being automatically generated, are not very human-friendly. On the other hand, they are still more convenient to read than, e.g., a disassembled binary. I am afraid there is nothing you can do to keep the sources away from 'hackers' except perhaps modify the VexCL source code to suit your needs. The MIT license allows you to do that, and, if you are ready to do this, I could provide you with some guidance.
Mind you, the NVIDIA OpenCL driver does its own caching, and it also stores the kernel sources together with the cached binaries (in the $HOME/.nv/ComputeCache folder). I don't know if it is possible to alter this behavior, so 'hackers' could still get the kernel sources from there. I don't know whether AMD does something similar, but maybe that is what your source meant by "log the kernel with an AMD device".
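If you want to stop the NVIDIA driver from writing that cache at all, the documented CUDA_CACHE_DISABLE environment variable may help; note this disables JIT caching entirely, so every run pays the compilation cost again:

```shell
# Disable the NVIDIA JIT compile cache (and the cached sources) for one run.
CUDA_CACHE_DISABLE=1 ./hello
```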
Regarding MacOS compatibility, I don't have a MacOS machine to do my own testing, but I have had reports that VexCL does work there. I am not sure which IDE was used.