I am trying to compile CUDA with clang, but the code I am trying to compile depends on a specific nvcc flag (-default-stream per-thread). How can I tell clang to pass the flag to nvcc?
For example, I can compile with nvcc and everything works fine:
nvcc -default-stream per-thread *.cu -o app
But when I compile with clang, the program does not behave correctly because I cannot pass the default-stream flag:
clang++ --cuda-gpu-arch=sm_35 -L/usr/local/cuda/lib64 *.cu -o app -lcudart_static -ldl -lrt -pthread
How do I get clang to pass flags to nvcc?
It looks like it may not be possible.
nvcc behind the scenes calls either clang or gcc with some custom generated flags, then calls ptxas and a few other tools to create the binary.
e.g.
nvcc -default-stream per-thread foo.cu
# Behind the scenes
gcc -custom-nvcc-generated-flags -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 foo.cu -o foo.ptx
ptxas foo.ptx -o foo.cubin
When compiling CUDA with clang, clang compiles directly to ptx and then calls ptxas:
clang++ foo.cu -o app -lcudart_static -ldl -lrt -pthread
# Behind the scenes
clang++ -triple nvptx64-nvidia-cuda foo.cu -o foo.ptx
ptxas foo.ptx -o foo.cubin
clang never actually calls nvcc. It just targets ptx and calls the ptx assembler.
Unless you know what custom backend flags will be produced by nvcc and manually include them when calling clang, I'm not sure you can automatically pass an nvcc flag from clang.
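That said, the nvcc trace above shows which macro it defines for --default-stream per-thread, so one thing you could try is passing that define to clang yourself. This is an untested sketch, not a guaranteed equivalent of nvcc's behavior:
# Untested: manually define the macro that nvcc sets for --default-stream per-thread
clang++ --cuda-gpu-arch=sm_35 -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 \
    -L/usr/local/cuda/lib64 *.cu -o app -lcudart_static -ldl -lrt -pthread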
If you are using clang-specific features only for the host side and don't actually need them for the device side, you're probably looking for this:
https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/
As @Increasingly-Idiotic points out, I believe clang does not "call" nvcc internally, hence I don't think you can pass arguments to it.
I maintain the C++-flavored CUDA API wrappers library. The library's current commit is relatively well-tested, with some example programs and quite a few users. However, sometime very recently (I can't say exactly when), and without my committing anything new, I now get NVCC warnings during the "dlink" phase of my example programs, e.g.:
/path/to/nvcc /path/to/cuda-api-wrappers/examples/modified_cuda_samples/vectorAdd/vectorAdd.cu -dc -o /path/to/cuda-api-wrappers/CMakeFiles/vectorAdd.dir/examples/modified_cuda_samples/vectorAdd/./vectorAdd_generated_vectorAdd.cu.o -ccbin /opt/gcc-5.4.0/bin/gcc -m64 -gencode arch=compute_52,code=compute_52 --std=c++11 -Xcompiler -Wall -O3 -DNDEBUG -DNVCC -I/path/to/cuda/include -I/path/to/cuda-api-wrappers/src
/path/to/nvcc -gencode arch=compute_52,code=compute_52 --std=c++11 -Xcompiler -Wall -O3 -DNDEBUG -m64 -ccbin /opt/gcc-5.4.0/bin/gcc -dlink /export/path/to/cuda-api-wrappers/CMakeFiles/vectorAdd.dir/examples/modified_cuda_samples/vectorAdd/./vectorAdd_generated_vectorAdd.cu.o /path/to/cuda/lib64/libcublas_device.a -o /export/path/to/cuda-api-wrappers/CMakeFiles/vectorAdd.dir/./vectorAdd_intermediate_link.o
#O#ptxas info : 'device-function-maxrregcount' is a BETA feature
#O#ptxas info : 'device-function-maxrregcount' is a BETA feature
#O#ptxas info : 'device-function-maxrregcount' is a BETA feature
... this repeats many times ...
but the dlink phase does conclude. This is already strange, since I haven't explicitly used any beta features.
/opt/gcc-5.4.0/bin/g++ -Wall -Wpedantic -O2 -DNDEBUG -L/path/to/cuda/lib64 -rdynamic CMakeFiles/vectorAdd.dir/examples/modified_cuda_samples/vectorAdd/vectorAdd_generated_vectorAdd.cu.o CMakeFiles/vectorAdd.dir/vectorAdd_intermediate_link.o -o examples/bin/vectorAdd lib/libcuda-api-wrappers.a -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -ldl -lrt -lnvToolsExt -Wl,-Bstatic -lcudadevrt -Wl,-Bdynamic
CMakeFiles/vectorAdd.dir/vectorAdd_intermediate_link.o: In function `__cudaRegisterLinkedBinary_25_cublas_compute_70_cpp1_ii_f0559976':
link.stub:(.text+0xe0): undefined reference to `__fatbinwrap_25_cublas_compute_70_cpp1_ii_f0559976'
CMakeFiles/vectorAdd.dir/vectorAdd_intermediate_link.o: In function `__cudaRegisterLinkedBinary_25_xerbla_compute_70_cpp1_ii_cd7f3ad3':
link.stub:(.text+0x190): undefined reference to `__fatbinwrap_25_xerbla_compute_70_cpp1_ii_cd7f3ad3'
CMakeFiles/vectorAdd.dir/vectorAdd_intermediate_link.o: In function `__cudaRegisterLinkedBinary_23_nrm2_compute_70_cpp1_ii_8edbce95':
link.stub:(.text+0x240): undefined reference to `__fatbinwrap_23_nrm2_compute_70_cpp1_ii_8edbce95'
... more undefined reference errors here ...
My question: Why would this happen and how do I circumvent/avoid/resolve it?
Notes:
I'm using separable compilation
I'm getting these specific errors with CUDA 9.1 and an SM 5.2 device (no 7.0).
The CMakeLists.txt is here.
I'm obviously clearing CMakeCache.txt before building.
This has happened to me on both GNU/Linux Mint 18.3 and Fedora 26. On the first machine some apt-get dist-upgrades have been done, and GCC is now up to version 5.5.0, in case that matters. On the second machine there has really been no change that I'm aware of: same compiler and same CUDA version.
A partial answer / workaround:
This issue only seems to occur when libcublas is involved. If I remove /path/to/cuda/lib64/libcublas_device.a from the -dlink phase command line, all warnings and errors go away (including from later stages). In fact, my wrapper library is oblivious to cuBLAS; I'm not sure why CMake is adding it, since it's not in $CUDA_LIBRARIES. A sketch of this workaround follows after the link below. See also:
Why does CMake force the use of libcublas with separable compilation?
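As a concrete sketch of that workaround (note: CUDA_cublas_device_LIBRARY is my assumption of the FindCUDA cache variable involved; verify against your FindCUDA.cmake before relying on it):
# Device-link step with libcublas_device.a removed, as described above:
/path/to/nvcc -gencode arch=compute_52,code=compute_52 --std=c++11 -Xcompiler -Wall -O3 -DNDEBUG -m64 \
    -ccbin /opt/gcc-5.4.0/bin/gcc -dlink \
    /export/path/to/cuda-api-wrappers/CMakeFiles/vectorAdd.dir/examples/modified_cuda_samples/vectorAdd/./vectorAdd_generated_vectorAdd.cu.o \
    -o /export/path/to/cuda-api-wrappers/CMakeFiles/vectorAdd.dir/./vectorAdd_intermediate_link.o
# Possible CMake-side override (again, an assumption - check the variable name in your FindCUDA.cmake):
cmake -DCUDA_cublas_device_LIBRARY="" /path/to/cuda-api-wrappers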
I am using MinGW 64-bit with Cygwin.
I know that if I compile using
x86_64-w64-mingw32-g++.exe -std=c++11 hello.cpp
the output .exe does not run unless the directory containing libstdc++ and the other runtime libraries is listed in the Path environment variable.
An alternative is to link statically
x86_64-w64-mingw32-g++.exe -std=c++11 hello.cpp -static-libgcc -Wl,-Bstatic -lstdc++ -lpthread
Since I want a single .exe that I can easily copy to different machines, the second solution is better for me. My only problem is that, since I link statically, even for a simple hello-world program the size of the executable rises to more than 10 MB. So my question is: is it possible to statically link only the library parts that are actually used by the program?
The binutils linker on Windows, ld, does not properly support the --gc-sections argument, which, in combination with the compiler flags -ffunction-sections and -fdata-sections, allows the linker to throw away blocks of unused code.
You are straight out of luck. The only thing you can do is the usual: strip the executable (by running the strip command on it after it is linked) and compile your code optimising for size with -Os.
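For example (a sketch; hello.exe is just the output name I chose here, and if the cross-prefixed strip tool isn't installed, plain strip should also work on the result):
# Optimize for size, link statically as before, then strip symbols from the result
x86_64-w64-mingw32-g++.exe -std=c++11 -Os hello.cpp -static-libgcc -Wl,-Bstatic -lstdc++ -lpthread -o hello.exe
x86_64-w64-mingw32-strip.exe hello.exe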
Keeping in mind that these options do not work on Windows (which, for the purposes of this answer, includes the Cygwin platform), this is how you would generally do it elsewhere:
g++ -c -Os -ffunction-sections -fdata-sections some_file.cpp -o some_file.o
g++ -c -Os -ffunction-sections -fdata-sections main.cpp -o main.o
g++ -Wl,--gc-sections main.o some_file.o -o my_executable
I'm new to nvcc and I've seen a library where compilation is done with option -O3, for g++ and nvcc.
CC=g++
CFLAGS=--std=c++11 -O3
NVCC=nvcc
NVCCFLAGS=--std=c++11 -arch sm_20 -O3
What is -O3 doing?
It's optimization level 3, basically a shortcut for several other options related to speed optimization (see the links below and the quick check after them).
I can't find any documentation on it.
... it is one of the best known options:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#options-for-altering-compiler-linker-behavior
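If you want to see exactly which individual optimizations a given level enables in your GCC version, you can also ask the compiler itself:
# List the optimization flags enabled (or disabled) at -O3 for this g++
g++ -Q -O3 --help=optimizers
# Compare against -O2 to see what -O3 adds
g++ -Q -O2 --help=optimizers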
I am working on a project where I need to generate just the bitcode using clang, run some optimization passes using opt and then create an executable and measure its hardware counters.
I am able to link through clang directly using:
clang -g -O0 -w -I/opt/apps/papi/5.3.0/include -Wl,-rpath,$PAPI_LIB -L$PAPI_LIB \
-lpapi /scratch/02681/user/papi_helper.c prog.c -o a.out
However now I want to link it after using the front end of clang and applying optimization passes using opt.
I am trying the following way:
clang -g -O0 -w -c -emit-llvm -I/opt/apps/papi/5.3.0/include -Wl,-rpath,$PAPI_LIB -L$PAPI_LIB \
-lpapi /scratch/02681/user/papi_helper.c prog.c -o prog.o
llvm-link prog.o papi_helper.o -o prog-link.o
// run optimization passes
opt -licm prog-link.o -o prog-opt.o
llc -filetype=obj prog-opt.o -o prog-exec.o
clang prog-exec.o
After going through the above process I get the following error:
undefined reference to `PAPI_event_code_to_name'
It's not able to resolve the PAPI functions. Thanks in advance for any help.
Clearly, you need to add -lpapi to the last clang invocation. How else would the linker know about libpapi?
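For example, the final link step could look like this (a sketch reusing the PAPI flags from your original working command; adjust the paths to your setup):
# Resolve the PAPI symbols at the final link
clang prog-exec.o -L$PAPI_LIB -Wl,-rpath,$PAPI_LIB -lpapi -o a.out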
We are trying to implement a JIT compiler whose performance is supposed to be the same as compiling with clang -O4. Is there a place where I could easily get the list of optimization passes invoked by clang when -O4 is specified?
As far as I know, -O4 means the same thing as -O3 plus enabled LTO (Link Time Optimization).
See the following code fragments:
Tools.cpp // Manually translate -O to -O2 and -O4 to -O3;
Driver.cpp // Check for -O4.
Also see here:
You can produce bitcode files from clang using -emit-llvm or -flto, or the -O4 flag which is synonymous with -O3 -flto.
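As a rough sketch of what that equivalence meant in practice (on an older clang where -O4 still implied LTO, and assuming a linker with LTO support, e.g. ld64 or gold with the LLVM plugin):
# On such versions, these two invocations are effectively the same
clang -O4 foo.c -o foo
clang -O3 -flto foo.c -o foo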
For optimizations used with -O3 flag see this PassManagerBuilder.cpp file (look for OptLevel variable - it will have value 3).
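If you want the concrete list of passes, one trick that works with the legacy pass manager in older LLVM releases (treat it as a sketch to verify on your version) is to ask opt to print the pass arguments for an empty module:
# Print the pass pipeline that opt runs at -O3
llvm-as < /dev/null | opt -O3 -disable-output -debug-pass=Arguments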
Note that as of LLVM version 5.1, -O4 no longer implies link-time optimization. If you want that, you need to pass -flto. See the Xcode 5 Release Notes.