Expected performance of MSVC link-time code generation (C++)

I'm compiling the KiCad EDA suite, using MSVC 9 (15.0.30729.1).
This is a fairly complex piece of software, so a total compilation time of 3.5 hours for an /O2 Release build on an i3 is completely acceptable. To further optimize the code, I have enabled /GL and /LTCG to use the link time code generation feature.
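Concretely, the LTCG build boils down to something like this (a minimal sketch; file names are placeholders):

    cl /nologo /O2 /GL /c component.cpp
    link /nologo /LTCG component.obj /OUT:component.exe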
Looking at the largest component: on x86, this slowed down the link process somewhat (as expected), but gave no reduction in code size (7.3 MB); on x64, a single linker invocation now takes 1.5 hours on its own, and still does not reduce code size (10.1 MB) in the slightest.
For comparison, gcc on x64 generates 11 MB without -flto, and 9.5 MB with -flto (taking 10 minutes for the linker step) -- while I'm aware that this MSVC version is significantly older, I'm certainly not used to gcc generating smaller code in less time than MSVC.
As my experience with the MSVC toolchain is superficial at best: is it typical for link-time code generation to not yield a reduction in code size? Is there a compiler option I might have missed?

Related

bcc64 optimizations -O1 vs -O2 still slower than bcc32 by 40% and more

I have a product consisting of a VCL executable plus a Standard C++ DLL, all built with C++ Builder XE4. I publish in 32-bit and 64-bit versions.
When doing performance testing with release builds, the 64-bit version runs much more slowly... 40% more slowly.
I understand that I need to have optimizations turned on for the performance testing to be meaningful. XE4 allows me to set (mutually exclusively):
-O1 = smallest possible code
-O2 = fastest possible code
I have built using each of these, but the results are unchanged.
I see from postings here that Linux/g++ programmers use -O3 (smallest AND fastest?) (see 64-bit executable runs slower than 32-bit version). But -O3 is not an option for my environment.
Are there other compiler settings I should be looking at?
Thanks for your help.
The main downside of 64-bit mode is that pointers double in size. Alignment rules can also make classes/structs bigger. Maybe your code just barely fit into cache in 32-bit mode, but not in 64-bit. This is especially likely if your code uses a lot of pointers.
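As a minimal illustration (a hypothetical struct; exact sizes depend on the ABI):

    struct Node {
        Node* next;  // 4 bytes in a 32-bit build, 8 bytes in a 64-bit build
        int   key;   // 4 bytes in both
    };
    // sizeof(Node) is typically 8 on 32-bit but 16 on 64-bit: the pointer
    // doubles, and alignment pads the struct out to a multiple of 8.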
Another possibility is that you call some external library, and your 32bit version of it has some asm speedups, but the 64bit version doesn't.
Use a profiler to see what's actually slow in your 64-bit version. For Windows, Intel's VTune may be a good choice. You can see where your code has a lot of cache misses; comparing total cache misses between the 32-bit and 64-bit builds should shed some light.
Re: -O1 vs. -O2: Different compilers have different meanings for options. gcc and clang have:
-Os: optimize for code size
-O0: minimal / no optimization (most values are stored to and reloaded from memory after every statement)
-O1: some optimization without taking a lot of extra compile time
-O2: more optimizations
-O3: even more optimizations, including auto-vectorizing
Clang doesn't seem to document its optimization options, so I assume it mirrors gcc. (There are options to report on optimizations it did, and to use profile-guided optimization.) See the latest version of the gcc manual (online) for more descriptions of optimization options: e.g.
-Ofast: -O3 -ffast-math (and maybe "unsafe" optimizations.)
-Og: optimize without breaking debugging. Recommended for the edit/compile/debug cycle.
-funroll-loops: can help in some tight loops, but isn't enabled even at -O3. Don't use it for everything, because the larger code size can lead to I-cache misses which hurt more. -fprofile-use does enable it, so ideally just use PGO (see the sketch after this list).
-fblah-blah: there are a ton more specific options. Usually just use -O3 to pick the recommended set.
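A minimal PGO sketch for gcc, as mentioned under -funroll-loops above (file and workload names are placeholders):

    g++ -O2 -fprofile-generate app.cpp -o app
    ./app typical_workload        # run a representative input to collect profile data
    g++ -O2 -fprofile-use app.cpp -o app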

LLVM Clang 6.0 fatal error when building a huge C++ source code on OSX

I'm building a pretty huge source code, which builds just fine with MS compiler and Intel compiler, but Clang ends up with this:
fatal error: error in backend: Section too large, can't encode r_address (0x1000008) into 24 bits of scattered relocation entry.
If I remove half of it, it starts working fine, so obviously there are some limitations. This seems to be a well-known issue from Xcode 4.5, but now it's Xcode 6.2 and it still doesn't work! Are there any options I could enable to make it work? For example, on Windows I needed to use /bigobj to make the compiler work fine.
Solved by cutting the source file into multiple source files. It's quite a shame that a compiler bug can cause this, as the split may be a lot of work and can degrade performance unless the compiler provides whole-program optimization.
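For comparison, the Windows workaround mentioned in the question is a single extra switch (file name is a placeholder); I found no equivalent option in Clang:

    cl /c /bigobj huge_source.cpp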

Cross Compiling To Atom Z510 From Intel i7

I am writing a server application which has a large amount of source code. Compiling the application on my Intel Atom Z510 takes around 15-20 minutes, and about 2-3 minutes on my Intel i7.
I am very new to cross compiling, new as in I've never done it. I can't find any reference on how to cross compile for the Z510. I found a great SO article on optimization flags for the Atom here. However, there is no description of how to use them on my Intel i7 PC for my Intel Atom CPU.
I am making the assumption that anything compiled on my i7 will default to being optimized for my i7, causing performance drops on the Atom. Any advice/search terms/websites would be greatly appreciated.
As always, thank you so much ahead of time.
Edit: I am using gcc 4.4. Apologies. (The one that comes with Ubuntu 10.04)
Constantin
I think your assumption that code compiled on a given machine is automatically optimized for that machine is faulty.
Even if you request that behavior via -march=native -mtune=native, gcc 4.4 doesn't know how to optimize for Atom.
And code built on the i7 would run more slowly on the Atom than code compiled on the Atom only if you were passing flags that optimize specifically for the Core i7 (which I think also requires a later version of gcc). Without those flags, the compiler on the i7 generates the same code as the one on the Atom.
If you're on your i7 and want to compile binaries compatible with and optimised for your Atom, just pass -march=atom to gcc. The binaries produced should work, provided you're running the same OS on both systems (including agreeing on 32/64-bit-ness) and any necessary run-time dependencies are present.
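A minimal sketch of that invocation (file names are placeholders; as I recall, -march=atom only arrived in gcc 4.5, matching the caveat above):

    g++ -O2 -march=atom -mtune=atom -c server.cpp
    g++ -march=atom server.o -o server    # add -m32 if the Atom system runs a 32-bit OS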

Why do I get a faster binary with XCode's llvm vs. clang++ from MacPorts?

I have written a benchmark method to test my C++ program (which searches a game tree), and I am noticing that compiling with the "LLVM compiler 2.0" option in Xcode 4.0.2 gives me a significantly faster binary than if I compile with the latest version of clang++ from MacPorts.
If I understand correctly I am using a clang front-end and llvm back-end in both cases. Has Apple made improvements to their clang/llvm distribution to produce faster binaries for Mac OS? I can't find much information about the project.
Here are the benchmarks my program produces for various compilers, all using -O3 optimization (higher is better):
(Xcode) "gcc 4.2": 38.7
(Xcode) "llvm gcc 4.2": 51.2
(Xcode) "llvm compiler 2.0": 50.6
g++-mp-4.6: 43.4
clang++: 40.6
Also, how do I compile from the terminal with the clang/llvm that Xcode is using? I can't find the command.
EDIT: The scores I output are "thousands of games per second" which are calculated over a long enough run of the program. Scores are very consistent over multiple runs, and recent major algorithmic improvements have given me 1% - 5% speed ups, for example. A 25% speed up of 40 to 50 is huge for my program.
UPDATE: I wasn't invoking clang++ from the command line with -flto. Now when I compare clang++ -O3 -flto to /Developer/usr/bin/clang++ -O3 -flto from the command line the results are closer, but the Apple one is still 6.5% faster.
Now how to enable link time optimization for gcc? When I try g++ -flto I get the following error:
cc1plus: error: LTO support has not been enabled in this configuration
Apple LLVM Compiler should be available under /Developer/usr/bin/clang.
I can't think of any particular reason why MacPorts clang++ would generate slower code... I would check whether you're passing in comparable command-line options. One thing that would make a large difference is if you're producing 32-bit code with one compiler, and 64-bit code with the other.
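One quick way to rule out an architecture mismatch (assuming both compilers are on the PATH; bench.cpp is a placeholder):

    /Developer/usr/bin/clang++ -O3 -flto bench.cpp -o bench_apple
    clang++ -O3 -flto bench.cpp -o bench_macports
    file bench_apple bench_macports    # both should report the same architecture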
If GCC has no LTO then you need to build it yourself:
http://solarianprogrammer.com/2012/07/21/compiling-gcc-4-7-1-mac-osx-lion/
For LTO you need to add 'libelf' to the instructions.
http://sourceforge.net/apps/trac/mingw-w64/wiki/LTO%20and%20GCC
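Roughly, the configure step from the first link looks like this (version and prefix are placeholders; add libelf per the second link):

    ../gcc-4.7.1/configure --prefix=$HOME/gcc-4.7.1 --enable-languages=c,c++ --enable-lto
    make && make install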
The exact speed of an algorithm can depend on all kinds of things that are totally outside your and the compiler's control. You may have a loop where the execution time depends on precisely how the instructions are aligned in memory, in a way the compiler couldn't predict. I have seen cases where a loop could enter different "states" with different execution times per iteration (so after a context switch, it could take either 12 or 13 cycles per iteration, rather randomly). This can all be coincidence.
You might also be using different libraries, which is quite possibly the reason. On Mac OS X, they are using a new and presumably faster implementation of std::string and std::vector, for example.

Benchmarks for Intel C++ compiler and GCC

I have an AMD Opteron server running CentOS 5. I want to have a compiler for a fairly large C++ Boost based program. Which compiler should I choose?
There is an interesting PDF here which compares a number of compilers.
I hope this helps more than hurts :)
I did a little compiler shootout sometime over a year ago, and I am going off memory.
GCC 4.2 (Apple)
Intel 10
GCC 4.2 (Apple) + LLVM
I tested multiple template heavy audio signal processing programs that I'd written.
Compilation times: The Intel compiler was by far the slowest, more than the '2x slower' another poster cited.
GCC handled deep templates very well in comparison to Intel.
The Intel compiler generated huge object files.
GCC+LLVM yielded the smallest binary.
The generated code may have significant variance due to the program's construction, and where SIMD could be used.
For the way I write, I found that GCC + LLVM generated the best code. For programs which I'd written before I took optimization seriously (as I wrote), Intel was generally better.
Intel's results varied; it handled some programs far better, and some programs far worse. It handled raw processing very well, but I give GCC+LLVM the cake because when put into the context of a larger (normal) program... it did better.
Intel won for out-of-the-box number crunching on huge data sets.
GCC alone generated the slowest code, though it can be as fast with measurement and nano-optimizations. I prefer to avoid those because the wind may change direction with the next compiler release, so to speak.
I never measured poorly written programs in this test (i.e. results outperformed distributions of popular performance libraries).
Finally, the programs were written over several years, using GCC as the primary compiler in that time.
Update: I was also enabling optimizations/extensions for Core2Duo. The programs were clean enough to enable strict aliasing.
The MySQL team once posted that icc gave them about a 10% performance boost over gcc. I'll try to find the link.
In general I've found that the 'native' compilers perform better than gcc on their respective platforms
Edit: I was a little off. Typical gains were 20-30%, not 10%. Some narrow edge cases saw a doubling of performance. http://www.mysqlperformanceblog.com/files/presentations/LinuxWorld2004-Intel.pdf
I suppose it varies depending on the code, but with the codebase I am working on now, ICC 11.035 gives an almost 2x improvement over gcc 4.4.0 on a Xeon 5504.
icc options: -O2 -fno-alias
gcc options: -O3 -msse3 -mfpmath=sse -fargument-noalias-global
The options are specific to just the file containing the compute-intensive code, where I know there is no aliasing. Single-threaded code with a 5-level nested loop.
Although autovectorization is enabled, neither compiler generates vectorized code (not a fault of the compilers).
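Concretely, the per-file invocations look roughly like this (file name is a placeholder):

    icc -O2 -fno-alias -c hotloop.c
    gcc -O3 -msse3 -mfpmath=sse -fargument-noalias-global -c hotloop.c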
Update (2015/02/27):
While optimizing some geophysics code (Q2, 2013) to run on Sandy Bridge-E Xeons, I had an opportunity to compare the performance of ICC 11.1 against GCC 4.8.0, and GCC was now generating faster code than ICC. The code made use of AVX intrinsics and 8-way vectorized instructions (neither compiler autovectorized the code properly due to certain data layout requirements). In addition, GCC's LTO implementation (with the IR embedded in the .o files) was much easier to manage than ICC's. GCC with LTO was running roughly 3 times faster than ICC without LTO. I can't find the numbers right now for GCC without LTO, but I recall it was still faster than ICC. This is by no means a general statement on ICC's performance, but the results were sufficient for us to go ahead with GCC 4.8.*.
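For reference, a GCC LTO build of that era just means passing -flto at both compile and link time (file names are placeholders):

    g++ -O3 -mavx -flto -c kernel.cpp
    g++ -O3 -mavx -flto -c main.cpp
    g++ -O3 -mavx -flto kernel.o main.o -o app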
Looking forward to GCC 5.0 (http://www.phoronix.com/scan.php?page=article&item=gcc-50-broadwell)!
We use the Intel compiler on our product (DB2), on Linux and Windows IA32/AMD64, and on OS X (i.e. all our Intel platform ports except SunAMD).
I don't know the numbers, but the performance is good enough that we:
pay for the compiler, which I'm told is very expensive.
live with the 2x slower build times (primarily due to the time it spends acquiring licenses before it allows itself to run).
PHP - Compilation from source, with ICC rather than GCC, should result in a 10% to 20% speed improvement - http://www.papelipe.no/tags/ez_publish/benchmark_of_intel_compiled_icc_apache_php_and_apc
MySQL - Compilation from source, with ICC rather than GCC, should result in a 25% to 50% speed improvement - http://www.mysqlperformanceblog.com/files/presentations/LinuxWorld2005-Intel.pdf
I used to work on a fairly large signal processing system which ran on a large cluster. We used to reckon for heavy maths crunching, the Intel compiler gave us about 10% less CPU load than GCC. That's very unscientific but it was our experience (that was about 18 months ago).
What would have been interesting is if we'd been able to use Intel's math libraries as well which use their chipset more efficiently.
I used UnixBench (v. 5.1.3) on an openSUSE 12.2 (kernel 3.4.33-2.24-default x86_64), and compiled it first with GCC, and then with Intel's compiler.
With 1 parallel copy, UnixBench compiled with Intel's compiler is about 20% faster than the version compiled with GCC.
However, this hides huge differences: Dhrystone is about 25% slower with the Intel compiler, while Whetstone runs 2x faster.
With 4 copies of UnixBench running in parallel, the improvement of the Intel compiler over GCC is only 7%. Again, Intel is much better at Whetstone (> 200%) and slower at Dhrystone (about 20%).
Many optimizations which the Intel compiler performs routinely require, with gcc, specific source syntax and the use of -O3 -ffast-math. Unfortunately, the -funsafe-math-optimizations component of -ffast-math -O3 -march=native has turned out to be incompatible with -fopenmp, so I must split my source files into groups compiled with different options in the Makefile, as sketched below. Today I ran into a failure where a g++ build using -O3 -ffast-math -fopenmp -march=native was able to write to the screen but not redirect to a file.
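The split looks roughly like this (file names are placeholders):

    g++ -O3 -march=native -ffast-math -c math_kernels.cpp
    g++ -O3 -march=native -fopenmp -c omp_driver.cpp
    g++ -fopenmp math_kernels.o omp_driver.o -o app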
One of the more egregious differences, in my opinion, is that icpc optimizes std::max and std::min directly, whereas gcc/g++ want fmax/fmin (or fmaxf/fminf) together with -ffast-math, which changes their meaning away from the standard.
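A hypothetical loop showing the kind of code I mean (icpc vectorizes the std::max form directly; g++ at the time wanted the fmaxf form plus -ffast-math):

    #include <algorithm>
    #include <cmath>

    void clamp_floor(float* a, int n) {
        for (int i = 0; i < n; ++i) {
            a[i] = std::max(a[i], 0.0f);    // form icpc vectorizes directly
            // g++ equivalent that needs -ffast-math to vectorize:
            // a[i] = fmaxf(a[i], 0.0f);
        }
    }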