I have a project where speed is paramount, so I was experimenting with compiler flags to try to get some free performance. I have two builds that are identical except for the additional flag -march=native in build B.
For completeness the flags are:
A) -std=c++14 -g -fno-omit-frame-pointer -O3
B) -std=c++14 -g -fno-omit-frame-pointer -O3 -march=native
Running benchmarks on these builds yields a confusing result:
A) 61s
B) 160s
What could possibly be going on here?
Using -march=native optimizes code for your current CPU. Most of the time this improves execution speed, but occasionally it produces slower code, because some of the CPU instructions it enables are not always a win.
echo | clang -E - -march=native -###
will display what clang enables through -march=native. The most likely culprit here is CMOV. You can see an explanation of why that might slow things down in the answer to this question: gcc optimization flag -O3 makes code slower than -O2.
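If you want to check quickly whether cmov is among the features -march=native turns on for your machine, you can filter that same output. This is only a rough sketch: -### writes to stderr, and the exact token layout varies between clang versions.
echo | clang -E - -march=native -### 2>&1 | tr ' ' '\n' | grep -i cmov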
The parameter -march=native makes clang optimise the code for your current CPU. Clang then uses as many CPU features as possible, which may break compatibility with other CPUs that, for example, lack support for instruction sets such as AVX2 or SSSE3.
You may run
echo | clang -E - -O3 -###
and
echo | clang -E - -march=native -O3 -###
to get the lists of features activated in both of your cases.
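For instance, a quick way to see only what -march=native adds is to capture both outputs and diff them (a rough sketch; -### prints to stderr, and the output format differs between clang versions):
echo | clang -E - -O3 -### 2>&1 | tr ' ' '\n' | sort > base.txt
echo | clang -E - -march=native -O3 -### 2>&1 | tr ' ' '\n' | sort > native.txt
diff base.txt native.txt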
I'd like to ensure that a large set of projects is built with the -fno-omit-frame-pointer flag, for easier debugging with tools like eBPF.
One way of course would be to modify the build scripts of each of these projects, but that is a lot of work.
I've come across the option of configuring gcc with --enable-frame-pointer, which restores the old default from the early gcc 4.x days of using -fno-omit-frame-pointer.
I've built gcc from source like that and confirmed:
❯ ./bin/gcc -m32 -O3 -Q --help=optimizers | grep omit
-fomit-frame-pointer [disabled]
❯ ./bin/gcc -m64 -O3 -Q --help=optimizers | grep omit
-fomit-frame-pointer [enabled]
Is anyone aware of a way to extend the effect of --enable-frame-pointer to 64-bit targets?
I believe you are using an older version of GCC, because support for this flag on x64 was enabled some time ago (see e.g. this commit).
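For reference, a configure invocation along those lines might look like this (the source path, prefix, and any additional configure options are placeholders); with a GCC new enough to include that change, the -m64 query from above should then also report -fomit-frame-pointer as [disabled]:
../gcc-src/configure --prefix=$HOME/gcc-fp --enable-frame-pointer
make && make install
$HOME/gcc-fp/bin/gcc -m64 -O3 -Q --help=optimizers | grep omit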
I want to get the vectorisation report regarding automatic vectorisation and OpenMP SIMD.
# gcc
-fopenmp-simd -O3 -ffast-math -march=native -fopt-info-omp-vec-optimized-missed
# clang
-fopenmp-simd -O3 -ffast-math -march=native -Rpass="loop|vect" -Rpass-missed="loop|vect" -Rpass-analysis="loop|vect"
# icc on Linux
-qopenmp-simd -O3 -ffast-math -march=native -qopt-report-file=stdout -qopt-report-format=vs -qopt-report=5 -qopt-report-phase=loop,vec
# msvc
-openmp -O2 -fp:fast -arch:AVX2 -Qvec-report:2
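As an example of how these would be used, a gcc report for a single translation unit could be generated with something like the following (kernel.cpp is just a placeholder; the report is printed to stderr):
g++ -fopenmp-simd -O3 -ffast-math -march=native -fopt-info-omp-vec-optimized-missed -c kernel.cpp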
I don't think Apple's flavor of clang supports OpenMP (at least not by default on macOS).
You may find ways to extend it though.
I'm new to nvcc and I've seen a library where compilation is done with option -O3, for g++ and nvcc.
CC=g++
CFLAGS=--std=c++11 -O3
NVCC=nvcc
NVCCFLAGS=--std=c++11 -arch sm_20 -O3
What is -O3 doing?
It's optimization level 3, basically a shortcut for several other options related to speed optimization (see the links below).
I can't find any documentation on it.
... it is one of the best known options:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#options-for-altering-compiler-linker-behavior
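If you want to see concretely which extra optimizations -O3 enables with your particular g++, you can also ask the compiler itself and compare the answers, for example:
g++ -Q -O2 --help=optimizers > o2.txt
g++ -Q -O3 --help=optimizers > o3.txt
diff o2.txt o3.txt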
I am experimenting with an algorithm I programmed in C++ using Xcode 7.0. When I compare the binary produced by the standard LLVM compiler in Xcode to the binary created when compiling with g++ (5.2.0), the LLVM binary is an order of magnitude (>10x) faster than the code produced by the g++ compiler.
I am using the -o3 code optimisation flag for the g++ compiler as follows:
/usr/local/Cellar/gcc/5.2.0/bin/g++-5 -o3 -fopenmp -DNDEBUG main.cpp \
PattersonInstance.cpp \
... \
-o RROTprog
The g++ compilation is needed because the algorithm has to be compiled and run on a high-performance computer where I cannot use the LLVM compiler. Plus, I would like to use OpenMP to make the code faster.
Any ideas on what is causing these speed differences and how they could be resolved are more than welcome.
Thanks in advance for the help!
L
I can bet that what happens is the following: you pass -o3 to the compiler instead of -O3 (i.e. with a capital O), and for this reason -o3 just instructs the compiler to output the executable to a file called "3". However, you use -o RROTprog later in the same command line, and the last -o is the one the compiler considers when outputting the executable.
The net effect: the -O3 is not present, hence no optimization is being done.
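In other words, the fix is simply to capitalize the flag; the corrected invocation would look like this (everything else unchanged):
/usr/local/Cellar/gcc/5.2.0/bin/g++-5 -O3 -fopenmp -DNDEBUG main.cpp \
PattersonInstance.cpp \
... \
-o RROTprog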
We are trying to implement a JIT compiler whose performance is supposed to be the same as compiling with clang -O4. Is there a place where I could easily get the list of optimization passes invoked by clang when -O4 is specified?
As far as I know, -O4 means the same thing as -O3 plus LTO (Link Time Optimization) enabled.
See the following code fragments:
Tools.cpp // Manually translate -O to -O2 and -O4 to -O3;
Driver.cpp // Check for -O4.
Also see here:
You can produce bitcode files from clang using -emit-llvm or -flto, or the -O4 flag which is synonymous with -O3 -flto.
For the optimizations used with the -O3 flag, see the PassManagerBuilder.cpp file (look for the OptLevel variable; it will have the value 3).
Note that as of LLVM version 5.1, -O4 no longer implies link-time optimization. If you want that, you need to pass -flto. See the Xcode 5 Release Notes.
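So on toolchains where -O4 was just shorthand, you can always spell the combination out explicitly; a minimal sketch with placeholder file names (it requires an LTO-capable linker):
clang++ -O3 -flto -c a.cpp
clang++ -O3 -flto -c b.cpp
clang++ -O3 -flto a.o b.o -o prog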