C++ auto-vectorization requirements for gcc, clang and msvc

Are the following statements correct?
With GCC and clang, my code will be auto-vectorized if I compile with:
-O2 -ftree-vectorize -march=XYZ (XYZ being the target instruction set: native, SSE, AVX2, etc.)
-O3 -march=XYZ
With MSVC, my code will be auto-vectorized if I compile with:
/O2
This video seems to suggest that I do not need to specify the architecture with MSVC. Is that correct? The compiler will use the native architecture by default, and fall back on scalar operations at runtime if vector instructions can't be found.

I do not need to specify the architecture with MSVC. Is that correct?
Yes, that is indeed correct. With MSVC, the auto-vectorizer is enabled by default and picks an appropriate instruction set for the fastest vectorization. Moreover, even if you do specify an architecture, the auto-vectorizer may generate different instructions than those specified by the /arch switch, as stated in the documentation: for example, when you compile with /arch:SSE2, SSE4.2 instructions may be emitted.
On another note, the MSVC vectorizer lacks quite a few features compared to GCC or Clang.
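As a quick illustration (a minimal sketch; the function name is mine), this is the kind of loop the MSVC auto-vectorizer targets at /O2 without any extra flags:

```cpp
#include <cstddef>

// A simple elementwise loop with no cross-iteration dependencies:
// with cl /O2 the MSVC auto-vectorizer can turn the body into SSE2
// (or wider) vector instructions automatically.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

On MSVC you can add /Qvec-report:2 to have the compiler report which loops it vectorized (and why it skipped the others).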
With GCC and clang, my code will be auto-vectorized if I compile with -O2 -ftree-vectorize -march=XYZ, or with -O3 -march=XYZ?
Not necessarily. To enable vectorization of floating-point reductions you also need -ffast-math or -fassociative-math. In general, though, yes, it will be enabled. You can find the same in the documentation: "Vectorization is enabled by the flag -ftree-vectorize and by default at -O3."
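For example, a floating-point reduction like the following (a minimal sketch) is only vectorized by GCC/clang when reassociation is allowed, because vectorizing it changes the order of the additions:

```cpp
#include <cstddef>

// Summing floats is a reduction: each iteration depends on the running
// total s, so vectorizing means regrouping the additions, which is not
// IEEE-safe. GCC/clang therefore only vectorize this loop when
// -ffast-math or -fassociative-math permits reassociation.
float sum(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += v[i];          // serial dependency on s
    return s;
}
```

On recent GCC you can verify this with -fopt-info-vec: compiled with -O3 alone the loop is reported as not vectorized; adding -ffast-math makes the report appear.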
PS: You may use https://godbolt.org to see all this in action!

Related

Is `-ftree-slp-vectorize` not enabled by `-O2` in GCC?

From https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
It says "-ftree-slp-vectorize: Perform basic block vectorization on trees. This flag is enabled by default at -O2 and by -ftree-vectorize, -fprofile-use, and -fauto-profile."
However, it seems I have to pass a flag explicitly to turn on SIMD. Did I misunderstand something here? It is enabled at -O3, though.
https://www.godbolt.org/z/1ffzdqMoT
Is -ftree-slp-vectorize not enabled by -O2 in GCC?
Yes and no. It depends on the version of the compiler.
From https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
You have linked to the latest version of documentation. It applies to the version that is currently under development, which at the moment is version 12.
However it seems I have to pass a flag explicitly to turn on SIMD.
https://www.godbolt.org/z/1ffzdqMoT
Your example uses GCC version 11.
Did I misunderstand something here?
You read one version of the documentation but used a different version of the compiler, hence your assumption didn't hold.
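For reference, -ftree-slp-vectorize targets straight-line code rather than loops; a minimal sketch of the kind of pattern it handles:

```cpp
// SLP (superword-level parallelism) vectorization works on straight-line
// code within a single basic block: these four independent multiplies on
// adjacent fields can be fused into one SIMD multiply and store, with no
// loop involved.
struct Vec4 { float x, y, z, w; };

void scale(Vec4& v, float k) {
    v.x *= k;
    v.y *= k;
    v.z *= k;
    v.w *= k;
}
```

This is distinct from the loop vectorizer controlled by -ftree-vectorize, which is why the two have separate flags and different default optimization levels depending on the GCC version.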

Intel Advisor optimal flags and settings

I'm reading this tutorial about code vectorization using Intel Advisor. In particular in this page they suggest to:
Build the target sample application in release mode ... compiler options: -O2 -g
And following:
To build your own applications to produce the most accurate and
complete Vectorization Advisor analysis results, build an optimized
binary in release mode using the following settings.
-g -O2 (or higher) -qopt-report=5 -vec -simd -qopenmp
Now, I have a couple of questions:
I thought that in release mode we don't produce any debug information (that is produced in "debug mode"), so no -g should be included.
The weirdest thing is that the Makefile given for the sample code (vec_samples in /opt/intel/advisor_*/...) uses only -g -O2 and none of the other options. Why?
The relevant entry point to the current Intel Advisor tutorials is Getting Started, where you can pick and choose appropriate sub-tutorials. The Vectorization Advisor sub-tutorial for Linux can be found here. It says precisely that:
-qopt-report=5 : necessary for version 15.0 of the Intel compiler; unnecessary for version 16.0 and higher
With regards to -vec, -simd, and -openmp, the tutorial slightly conflates the flags needed for proper Advisor functioning (-g, -O2, optionally -qopt-report) with the flags needed for "proper" compiler functioning (-vec, -simd, and -openmp). The latter are just flags controlling the compiler's vector code generation; they have nothing to do with Advisor's profiling capabilities, so you may or may not use them.
To give you a deeper understanding: there is an important feature in Advisor called the Intel Advisor Survey "Compiler Integration". This feature leverages data that is similar, but not identical, to the opt-report. To make this feature work you need:
1. Intel Compiler 14.x beta, 15.x, 16.x, or 17.x
2. -g (enable debug info) and -O2 or higher (enable some optimization)
3. Optionally (for Intel Compiler 15.x only), -qopt-report=5
All other features in Intel Advisor work equally well regardless of the compiler version (item 1 above) or the opt-report and its versions (item 3 above), but they all still require -g (part of item 2 above). -O2 is not needed for some features, but it is typically pointless to analyze the performance of binaries compiled at -O0 or -O1.

bcc64 optimizations -O1 vs -O2 still slower than bcc32 by 40% and more

I have a product consisting of a VCL executable plus a Standard C++ DLL, all built with C++ Builder XE4. I publish in 32-bit and 64-bit versions.
When doing performance testing with release builds, the 64-bit version runs much more slowly... 40% more slowly.
I understand that I need to have optimizations turned on for the performance testing to be meaningful. XE4 allows me to set (mutually exclusively):
-O1 = smallest possible code
-O2 = fastest possible code
I have built using each of these, but the results are unchanged.
I see from postings here that Linux/g++ programmers use -O3 (smallest AND fastest?) (see 64-bit executable runs slower than 32-bit version). But -O3 is not an option for my environment.
Are there other compiler settings I should be looking at?
Thanks for your help.
The main downside of 64-bit mode is that pointers double in size. Alignment rules might also make classes/structs bigger. Maybe your code just barely fit into cache in 32-bit mode, but not in 64-bit. This is especially likely if your code uses a lot of pointers.
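To make the effect concrete, here is a minimal sketch of a pointer-heavy type whose size roughly doubles between 32-bit and 64-bit builds (exact sizes depend on the ABI):

```cpp
#include <cstddef>

// A typical linked-list node. On a common 32-bit ABI this is
// 12 bytes (4 + 4 + 4); on a common 64-bit ABI the two pointers grow
// to 8 bytes each and alignment pads the struct to 24 bytes -- roughly
// twice the cache footprint for the same payload.
struct Node {
    Node* next;
    Node* prev;
    int   value;   // the actual payload is the same in both modes
};

// The growth comes entirely from the pointers and padding:
constexpr std::size_t pointer_bytes = 2 * sizeof(Node*);
constexpr std::size_t payload_bytes = sizeof(int);
static_assert(sizeof(Node) >= pointer_bytes + payload_bytes,
              "struct holds at least its members");
```

If your hot data structures look like this, the 64-bit build touches about twice as many cache lines per node, which can easily account for a 40% slowdown.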
Another possibility is that you call some external library, and your 32bit version of it has some asm speedups, but the 64bit version doesn't.
Use a profiler to see what's actually slow in your 64-bit version. On Windows, Intel's VTune may be a good choice. It can show where your code has a lot of cache misses; comparing total cache misses between the 32-bit and 64-bit builds should shed some light.
Re: -O1 vs. -O2: Different compilers have different meanings for options. gcc and clang have:
-Os: optimize for code size
-O0: minimal / no optimization (most things get stored/reloaded from RAM after every step)
-O1: some optimization without taking a lot of extra compile time
-O2: more optimizations
-O3: even more optimizations, including auto-vectorizing
Clang doesn't seem to document its optimization options, so I assume it mirrors gcc. (There are options to report on optimizations it did, and to use profile-guided optimization.) See the latest version of the gcc manual (online) for more descriptions of optimization options: e.g.
-Ofast: -O3 -ffast-math (and maybe "unsafe" optimizations.)
-Og: optimize without breaking debugging. Recommended for the edit/compile/debug cycle.
-funroll-loops: can help in some tight loops, but isn't enabled even at -O3. Don't use for everything, because larger code size can lead to I-cache misses which hurt more. -fprofile-use does enable this, so ideally just use PGO.
-fblah-blah: there are a ton more specific options. Usually just use -O3 to pick the recommended set.

Optimization issue

I'm developing a controller program used to run a humanoid kid-size robot. The OS is Debian 6 and all the programs are written in C++11. The CPU is a 1GHz VorteX86 SD and its architecture is Intel i486.
I need to compile my code with the maximum possible optimization. Currently I'm using GCC with the third optimization level and i486 tuning:
g++ -std=c++0x -O3 -march=i486 -mtune=i486
I'm wondering whether it's possible to get more optimized code. I searched around for optimization flags and compiler benchmarks, but didn't find any...
My question is: which C++ compiler generates faster code, especially for the i486 architecture?
Current candidates are: ICC XE, GCC 4.6, EkoPath
An option which typically makes the code faster is -funroll-loops
See the documentation. There are too many permutations to test them all; maybe give Acovea a try, which tests for the best one with a genetic approach.
If you have a lot of floating-point computation, you may try -ffast-math, or -Ofast, which includes -ffast-math. However, you lose IEEE floating-point compliance.
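The compliance loss is concrete: float addition is not associative, which is exactly the rule -ffast-math lets the compiler ignore. A minimal sketch (volatile is used only to keep the compiler from folding the arithmetic away):

```cpp
// Float addition is not associative. -ffast-math licenses the compiler
// to reassociate anyway -- that is what enables optimizations like
// vectorized reductions, but it can change results.
float sum_left() {            // (a + b) + c
    volatile float a = 1e8f, b = -1e8f, c = 1.0f;
    return (a + b) + c;       // 0 + 1  -> 1.0f
}

float sum_right() {           // a + (b + c)
    volatile float a = 1e8f, b = -1e8f, c = 1.0f;
    return a + (b + c);       // the 1.0 is absorbed: rounds to -1e8,
}                             // so the result is 0.0f
```

The two groupings differ by the full value of c, so whether this trade is acceptable depends on how sensitive your computation is to rounding.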

Reducing the execution time of the code using CLANG/LLVM compiler

Well... when I was searching for a good compiler I came across Clang/LLVM. This compiler gives me the same results as other compilers like icc and pgi. But the problem is that there are very few tutorials for this compiler... Kindly let me know where I can find tutorials on the Clang compiler.
Note by:
I have compiled my C code using the following flags: clang -O3 -mfpmath=sse file.c
Clang (the command-line compiler) takes GCC-compatible options, but accepts and ignores many flags that GCC takes (like -mfpmath=sse). We aim to generate good code out of the box. There are some flags that allow Clang to violate the language standards, which can be useful in some scenarios, like -ffast-math.
If you're looking for good performance, I highly recommend experimenting with link-time-optimization, which allows clang to optimize across source files in your application. Depending on what platform you're on, this is enabled by passing -O4 to the compiler. If you're on linux, you need to use the "gold" linker (see http://llvm.org/docs/GoldPlugin.html). If you're on the mac, it should "just work" with any recent version of Xcode.
Clang is not a full compiler; it is just the frontend of the LLVM compiler. So when you call clang, it parses the C/C++ file, but the optimization and code generation are handled by LLVM itself.
Here you can find documentation of LLVM's optimization and analysis passes: http://llvm.org/docs/Passes.html
The full documentation is here http://llvm.org/docs/
Also, useful options are listed here: http://linux.die.net/man/1/llvmc (I suspect clang will accept most of them too)