We are trying to implement a JIT compiler whose performance is supposed to match that of clang with -O4. Is there a place where I could easily get the list of optimization passes invoked by clang when -O4 is specified?
As far as I know, -O4 means the same thing as -O3 plus LTO (Link Time Optimization) enabled.
See the following code fragments:
Tools.cpp // Manually translate -O to -O2 and -O4 to -O3;
Driver.cpp // Check for -O4.
Also see here:
You can produce bitcode files from clang using -emit-llvm or -flto, or the -O4 flag which is synonymous with -O3 -flto.
For the optimizations used with the -O3 flag, see the PassManagerBuilder.cpp file (look for the OptLevel variable; it will have the value 3).
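You can also ask the tools directly. On releases that still use the legacy pass manager, this classic one-liner feeds an empty module to opt and prints the pass list selected by -O3 (newer LLVM versions expose the equivalent information through opt's -print-pipeline-passes option instead):
llvm-as < /dev/null | opt -O3 -disable-output -debug-pass=Arguments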
Note that as of LLVM version 5.1, -O4 no longer implies link time optimization. If you want that, you need to pass -flto explicitly. See the Xcode 5 Release Notes.
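So on those toolchains the old -O4 behaviour has to be spelled out, for example (foo.c is just a placeholder source file):
clang -O3 -flto foo.c -o foo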
I am trying to compile CUDA code with clang, but the code depends on a specific nvcc flag (-default-stream per-thread). How can I tell clang to pass this flag to nvcc?
For example, I can compile with nvcc and everything works fine:
nvcc -default-stream per-thread *.cu -o app
But when I compile with clang, the program does not behave correctly because I cannot pass the default-stream flag:
clang++ --cuda-gpu-arch=sm_35 -L/usr/local/cuda/lib64 *.cu -o app -lcudart_static -ldl -lrt -pthread
How do I get clang to pass flags to nvcc?
It looks like it may not be possible.
Behind the scenes, nvcc calls the host compiler (clang or gcc) with some custom generated flags, then calls ptxas and some other tools to create the binary.
e.g.
nvcc -default-stream per-thread foo.cu
# Behind the scenes (schematic; the real pipeline has more steps)
gcc -custom-nvcc-generated-flag -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 foo.cu -o foo.ptx
ptxas foo.ptx -o foo.cubin
When compiling CUDA with clang, clang compiles directly to PTX and then calls ptxas:
clang++ foo.cu -o app -lcudart_static -ldl -lrt -pthread
# Behind the scenes
clang++ -triple nvptx64-nvidia-cuda foo.cu -o foo.ptx
ptxas foo.ptx -o foo.cubin
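You can confirm this yourself by asking the driver to print the sub-commands it would run instead of executing them (foo.cu is a placeholder here):
clang++ --cuda-gpu-arch=sm_35 foo.cu -c -###
The output should list the device-side and host-side clang -cc1 invocations plus the ptxas and fatbinary steps.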
clang never actually calls nvcc. It just targets ptx and calls the ptx assembler.
Unless you know what custom backend flags will be produced by nvcc and manually include them when calling clang, I'm not sure you can automatically pass an nvcc flag from clang.
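That said, for this particular flag there may be a workaround: the CUDA documentation states that --default-stream per-thread is equivalent to defining CUDA_API_PER_THREAD_DEFAULT_STREAM before any CUDA header is included, so you could try defining the macro yourself when compiling with clang (untested here; whether it fully matches nvcc's behaviour may depend on your CUDA version):
clang++ --cuda-gpu-arch=sm_35 -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 -L/usr/local/cuda/lib64 *.cu -o app -lcudart_static -ldl -lrt -pthread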
If you are using clang-specific features only on the host side and don't actually need clang for the device side, you're probably looking for this:
https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/
As @Increasingly-Idiotic points out, clang does not "call" nvcc internally, hence I don't think you can pass arguments to it.
I was trying some vectorisation after upgrading g++ from version 4.8.5 to 5.4.1, with these flags:
g++ particles-v3.cpp -o v3 -O3 -msse4.2 -mfpmath=sse -ftree-vectorizer-verbose=5 -ffast-math -m32 -march=native -std=c++11
While the same command gives over 4000 lines of detailed information about the vectorization with g++-4.8, with g++-5.4 it does not say anything.
Is there some major change in g++-5 that makes -ftree-vectorizer-verbose=X unusable, or is there simply something wrong with the command line? How can I change it so that it works?
EDIT:
I found out that using -fopt-info-vec-all gives exactly the information I wanted (the old -ftree-vectorizer-verbose option was deprecated in favour of the -fopt-info family). Question solved.
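For reference, the working command is the same as above with the verbose switch swapped out:
g++ particles-v3.cpp -o v3 -O3 -msse4.2 -mfpmath=sse -fopt-info-vec-all -ffast-math -m32 -march=native -std=c++11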
I'm new to nvcc and I've seen a library where compilation is done with the option -O3 for both g++ and nvcc.
CC=g++
CFLAGS=--std=c++11 -O3
NVCC=nvcc
NVCCFLAGS=--std=c++11 -arch sm_20 -O3
What is -O3 doing?
It's optimization level 3, basically a shortcut for several other options related to speed optimization and the like (see the links below).
I can't find any documentation on it.
... it is one of the best known options:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#options-for-altering-compiler-linker-behavior
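One detail worth knowing about the nvcc side (per the nvcc documentation; check the manual for your CUDA version): nvcc's -O option sets the optimization level for the host code it generates, while device code optimization is controlled separately, e.g. by forwarding an option to ptxas:
nvcc --std=c++11 -arch sm_20 -O3 -Xptxas -O3 -c kernel.cu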
I have a project where speed is paramount, so I was experimenting with compiler flags to try to get some free performance. I have two builds that are identical except for the additional flag -march=native in build B.
For completeness the flags are:
A) -std=c++14 -g -fno-omit-frame-pointer -O3
B) -std=c++14 -g -fno-omit-frame-pointer -O3 -march=native
Running benchmarks on these builds yields a confusing result:
A) 61s
B) 160s
What can possibly be going on here?
Using -march=native optimizes code for your current CPU. Most of the time this improves execution speed, but sometimes it does not produce the fastest possible code, because it enables certain CPU instructions that can turn out to be slower.
echo | clang -E - -march=native -###
will display what clang has enabled through -march=native. The most likely culprit is CMOV, which is enabled by -march=native. You can see an explanation of why that might slow things down in the answer to this question: gcc optimization flag -O3 makes code slower than -O2.
The parameter -march=native makes clang optimise code for your current CPU. In this case clang uses as many optimisations as possible, and using it may break compatibility with other CPUs that, for example, don't support instruction sets such as AVX2 or SSSE3.
You may run
echo | clang -E - -O3 -###
and
echo | clang -E - -march=native -O3 -###
to get the lists of features activated in both of your cases.
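To see only what -march=native adds, you can diff the two outputs directly (the -### output goes to stderr, hence the redirects):
diff <(echo | clang -E - -O3 -### 2>&1) <(echo | clang -E - -O3 -march=native -### 2>&1)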
I am experimenting with an algorithm I programmed in C++ using Xcode 7.0. When I compare the performance of the standard LLVM compiler in Xcode to a binary created by compiling with g++ (5.2.0), the LLVM binary is an order of magnitude (>10x) faster than the one produced by g++.
I am using the -o3 code optimisation flag for the g++ compiler as follows:
/usr/local/Cellar/gcc/5.2.0/bin/g++-5 -o3 -fopenmp -DNDEBUG main.cpp \
PattersonInstance.cpp \
... \
-o RROTprog
The g++ compilation is needed because the algorithm has to be compiled and run on a high performance computer where I cannot use the LLVM compiler. Plus, I would like to use OpenMP to make the code faster.
Any ideas on what is causing these speed differences and how they could be resolved are more than welcome.
Thanks in advance for the help!
L
I can bet that what happens is the following: you pass -o3 to the compiler instead of -O3 (i.e. with a capital O), and for this reason -o3 just instructs the compiler to output the executable to a file called "3". However, you use -o RROTprog later in the same command line, and the last -o is the one the compiler considers when outputting the executable.
The net effect: the -O3 is not present, hence no optimization is being done.
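The fix is just the capitalization; the rest of the command stays the same (the elided file list is kept as-is):
/usr/local/Cellar/gcc/5.2.0/bin/g++-5 -O3 -fopenmp -DNDEBUG main.cpp \
PattersonInstance.cpp \
... \
-o RROTprog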