I see this flag a lot in the makefiles. What does it mean and when should it be used?
Optimization level 2.
From the GCC man page:
-O1 Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
-O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2. As compared to -O, this option increases both compilation time and the performance of the generated code.
-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload and -ftree-vectorize options.
-O0 Reduce compilation time and make debugging produce the expected results. This is the default.
-Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
Optimization level 2. The maximum is 3.
See: Options That Control Optimization
Note that a few years ago -O3 could cause glitches by "optimizing" the code too aggressively. AFAIK, that's no longer true with modern versions of GCC, but out of inertia -O2 is still considered "the maximum safe" level.
Compilers can use various optimization techniques, such as loop unrolling and CPU-pipeline-aware scheduling, to eliminate useless code, avoid data hazards, and speed up your code. For example, a loop that runs a fixed number of times can be converted to straight-line code without the loop-control overhead, or, if all the loop iterations are independent, some parallelization of the code is possible.
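As a rough illustration (this function is made up for this answer, not taken from any code above), a loop whose trip count is known at compile time is a typical candidate:
int sum_of_squares()
{
    int s = 0;
    for (int i = 0; i < 4; ++i)   // trip count known at compile time
        s += i * i;
    return s;                     // at -O2 the compiler typically unrolls this, or even folds it to the constant 14
}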
Setting the optimization level to 2 tells the compiler how much effort it should spend looking for those optimizations. The numeric levels range from 0 (no optimization) to 3 (the most aggressive).
You can learn more about what the compiler can do to optimize your code: Optimizing compiler
As per the man page:
-O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2. As compared to -O, this option increases both compilation time and the performance of the generated code.
In plain words: -O2 is the highest optimization level that is generally considered completely safe. -O3 performs reorganizations that can occasionally be troublesome. The subject as such is fairly deep.
Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results. Turning on optimization makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.
The default is optimization off. This results in the fastest compile time, but the compiler makes absolutely no attempt to optimize, and the generated programs are considerably larger and slower than when optimization is enabled. There are various -O switches (the permitted forms are -O0, -O1, -O2, -O3, and -Os) in GCC to control the optimization level:
-O0 No optimization; generates unoptimized code but has the fastest compilation time. This is the default.
-O1 Moderate optimization; optimizes reasonably well but does not degrade compilation time significantly. It takes a lot more memory for large functions.
-O2 GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2.
-O3 Full optimization as in -O2; also uses more aggressive automatic inlining of subprograms within a unit and attempts to vectorize loops. It also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload and -ftree-vectorize options.
-Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
To learn more about flags/options used at various optimization levels and their details:
Options That Control Optimization
Related
I am doing some profiling and performance is important to me (even 5%). The processor is an Intel Xeon Platinum 8280 ("Cascade Lake") on Frontera. I compile my code with the -Ofast flag, in Release mode. When I add -march=cascadelake, the timing gets worse (5-6%) in my test case. The same is true if I use -xCORE-AVX512 instead of -march. I am using icpc 19.1.1.217. Can anyone please explain why? Also, what compilation flags do you suggest for better performance?
Edit 1: I am solving a linear system, which consists of different operations such as dot products and matrix-vector products. So it would be hard for me to provide reproducible code, but I can say that there are multiple loops in my code to which the compiler can apply auto-vectorization. I have used Intel optimization reports on the critical loops in my code, and the reports mentioned potential speedups of at least 1.75x for them (for some of the loops it was over 5x potential speedup).
I have also used aligned_alloc(64, size) to allocate memory with 64-byte alignment, as this processor supports AVX-512. Also, I round up the size to be a multiple of 64.
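A minimal sketch of that allocation (the helper name and float element type are just illustrative):
#include <cstdlib>
#include <cstddef>

// Allocate `count` floats with 64-byte alignment. aligned_alloc expects the
// size to be a multiple of the alignment, so the byte count is rounded up to
// a multiple of 64. Release the buffer with std::free.
float* alloc_aligned_floats(std::size_t count) {
    std::size_t bytes = count * sizeof(float);
    bytes = (bytes + 63) & ~static_cast<std::size_t>(63);
    return static_cast<float*>(std::aligned_alloc(64, bytes));
}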
I have added OpenMP support to my code and have parallelized some loops, but for these experiments that I am reporting, I am using only 1 OpenMP thread.
I have tried -mavx2, and I got the same result as -xCORE-AVX512.
I have used -O3 instead of -Ofast. I did not get any speed-up.
I've created a very simple benchmark to illustrate short string optimization and ran it on quick-bench.com. The benchmark works very well for comparing the SSO-disabled/enabled string classes, and the results are very consistent with both GCC and Clang. However, I realized that when I disable optimizations, the reported times are around 4 times faster than those observed with optimizations enabled (-O2 or -O3), both with GCC and Clang.
The benchmark is here: http://quick-bench.com/DX2G2AdxUb7sGPE-zLRa41-MCk0.
Any idea what may cause the unoptimized benchmark to run 4-times faster?
Unfortunately, I can't see the generated assembly, so I don't know where the problem is (the "Record disassembly" box is checked but has no effect in my runs). Also, when I run the benchmark locally with Google Benchmark, the results are as expected, i.e., the optimized benchmark runs faster.
I also tried to compare both variants in Compiler Explorer and the unoptimized one seemingly executes much more instructions: https://godbolt.org/z/I4a171.
So, as discussed in the comments, the issue is that quick-bench.com does not show absolute time for the benchmarked code, but rather time relative to the time a no-op benchmark took. The no-op benchmark can be found in the source files of quick-bench.com:
static void Noop(benchmark::State& state) {
    for (auto _ : state) benchmark::DoNotOptimize(0);
}
All benchmarks of a run are compiled together, so the optimization flags apply to the no-op benchmark as well.
Reproducing and comparing the no-op benchmark at different optimization levels, one can see that there is roughly a 6-7x speedup from the -O0 to the -O1 version. When comparing benchmark runs done with different optimization flags, this factor in the baseline must be taken into account. The 4x speed-up observed in the question's benchmark is therefore more than compensated for, and the behavior is really as one would expect.
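To make that concrete with made-up numbers: suppose the no-op baseline takes 6.5 ns per iteration at -O0 and 1 ns at -O2, while the string benchmark takes 32 ns at -O0 and 20 ns at -O2. In absolute terms the optimized build is the faster one, as expected. But quick-bench displays time relative to the baseline: 32/6.5 ≈ 5 at -O0 versus 20/1 = 20 at -O2, so the unoptimized run appears roughly 4 times "faster" in the relative display even though its absolute time is worse.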
One main difference in the compilation of the no-op between -O0 and -O1 is that at -O0 there are some assertions and other additional branches in the google-benchmark code that are optimized out at higher optimization levels.
Additionally, at -O0 each iteration of the loop will load parts of state into registers, modify them, and store them back to memory multiple times, e.g. for decrementing the loop counter and for the conditionals on it, while the -O1 version keeps state in registers, making memory loads/stores inside the loop unnecessary. The former is much slower, taking at least a few cycles per iteration for the necessary store-to-load forwarding and/or reloads from memory.
My colleague likes to use gcc with '-g -O0' for building production binaries because debugging is easy if a core dump happens. He says there is no need to use compiler optimization or to tweak the code, because he finds the process in production does not have a high CPU load, e.g. around 30%.
I asked him the reason behind that and he told me: if the CPU load is not high, the bottleneck must not be our code's performance but some IO (disk/network), so using gcc -O2 is of no use for improving latency and throughput. It also indicates we don't have much to improve in the code, because the CPU is not a bottleneck. Is that correct?
About CPU usage ~ optimisation
I would expect most optimisation problems in a program to correlate to higher-than-usual CPU load, because we say that a sub-optimal program does more than it theoretically needs to. But "usual" here is a complicated word. I don't think you can pick a hard value of system-wide CPU load percentage at which optimisation becomes useful.
If my program reallocates a char buffer in a loop, when it doesn't need to, my program might run ten times slower than it needs to, and my CPU usage may be ten times higher than it needs to be, and optimising the function may yield ten-fold increases in application performance … but the CPU usage may still only be 0.5% of the whole system capacity.
Even if we were to choose a CPU load threshold at which to begin profiling and optimising, on a general-purpose server I'd say that 30% is far too high. But it depends on the system, because if you're programming for an embedded device that only runs your program, and has been chosen and purchased because it has just enough power to run your program, then 30% could be relatively low in the grand scheme of things.
Further still, not all optimisation problems will indeed have anything to do with higher-than-usual CPU load. Perhaps you're just waiting in a sleep longer than you actually need to, causing message latency to increase but substantially reducing CPU usage.
tl;dr: Your colleague's view is simplistic, and probably doesn't match reality in any useful way.
About build optimisation levels
Relating to the real crux of your question, though, it's fairly unusual to deploy a release build with all compiler optimisations turned off. Compilers are designed to emit pretty naive code at -O0, and to do the sort of optimisations that are pretty much "standard" in 2016 at -O1 and -O2. You're generally expected to turn these on for production use, otherwise you're wasting a huge portion of a modern compiler's capability.
Many folks also tend not to use -g in a release build, so that the deployed binary is smaller and easier for your customers to handle. You can drop a 45MB executable to 1MB by doing this, which is not just pocket change.
Does this make debugging more difficult? Yes, it can. Generally, if a bug is located, you want to receive reproduction steps that you can then repeat in a debug-friendly version of your application and analyse the stack trace that comes out of that.
But if the bug cannot be reproduced on demand, or it can only be reproduced in a release build, then you may have a problem. It may therefore seem reasonable to keep basic optimisations on (-O1) but also keep debug symbols in (-g); the optimisations themselves shouldn't vastly hinder your ability to analyse the core dump provided by your customer, and the debug symbols will allow you to correlate the information to source code.
That being said, you can have your cake and eat it too:
Build your application with -O2 -g
Copy the resulting binary
Perform strip on one of those copies, to remove the debug symbols; the binaries will otherwise be identical
Store them both forever
Deploy the stripped version
When you have a core dump to analyse, debug it against your original, non-stripped version
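A sketch of what that workflow might look like on the command line (the file names are hypothetical):
g++ -O2 -g -o myapp main.cpp     # optimized build with debug symbols
cp myapp myapp.debug             # keep the full copy with symbols forever
strip myapp                      # this is the binary you deploy
gdb myapp.debug core             # later: analyse a customer's core dump against the unstripped copy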
You should also have sufficient logging in your application to be able to track down most bugs without needing any of this.
Under certain circumstances he could be correct, while under others he is mostly incorrect (and under some, he is totally correct).
Suppose the program runs for 1 s: the CPU would be busy for 0.3 s and waiting for something else for 0.7 s. If you optimized the code and, say, got a 100% improvement, then the CPU would complete in 0.15 s what previously took 0.3 s, and the task would complete in 0.85 s instead of 1 s (given that the wait for something else takes the same time).
However, in a multi-core situation CPU load is sometimes defined as the fraction of total processing power being used. So if one core runs at 100% and two are idling, the CPU load would be reported as 33%; in such a scenario a 30% CPU load may simply mean the program is only able to make use of one core. In that case optimizing the code could still improve performance drastically.
Note that sometimes what is thought to be an optimization is actually a pessimization - that's why it's important to measure. I've seen a few "optimizations" that reduce performance. Also, sometimes optimizations alter behavior (in particular when you "improve" the source code), so you should probably make sure they don't break anything by having proper tests. After doing performance measurements you should decide whether it's worth trading debuggability for speed.
A possible improvement might be to compile with gcc -Og -g using a recent GCC. The -Og optimization is debugger-friendly.
Also, you can compile with gcc -O1 -g; you get many (simple) optimizations, so performance is usually 90% of -O2 (with of course some exceptions, where even -O3 matters). And the core dump is usually debuggable.
And it really depends upon the kind of software and the required reliability and ease of debugging. Numerical code (HPC) is quite different from small database post-processing.
Lastly, using -g3 instead of -g might help (e.g. gcc -Wall -O1 -g3).
BTW, synchronization issues and deadlocks might be more likely to appear in optimized code than in non-optimized code.
It's really simple: CPU time is not free. We like to think that it is, but it's patently false. There are all sorts of magnification effects that make every cycle count in some scenarios.
Suppose that you develop an app that runs on a million mobile devices. Every second your code wastes is worth 1-2 years of continuous device use on a 4-core device. Even with 0% CPU utilization, wall-time latency costs you backlight time, and that's not to be ignored either: the backlight uses about 30% of a device's power.
Suppose that you develop an app that runs in a data center. Every 10% of the core that you're using is what someone else won't be using. At the end of the day, you've only got so many cores on a server, and that server has power, cooling, maintenance and amortization costs. Every 1% of CPU usage has costs that are simple to determine, and they aren't zero!
On the other hand: developer time isn't free, and every second of a developer's attention requires commensurate energy and resource inputs just to keep him or her alive, fed, well and happy. Yet in this case all the developer needs to do is flip a compiler switch. I personally don't buy the "easier debugging" myth. Modern debugging information is expressive enough to capture register use, value liveness, code replication and such. Optimizations don't really get in the way as they did 15 years ago.
If your business has a single, underutilized server, then what the developer is doing might be OK, practically speaking. But all I see here really is an unwillingness to learn how to use the debugging tools or proper tools to begin with.
I have a heavy number-crunching program that does image processing. It is mostly convolutions. It is written in C++ and compiled with MinGW GCC 4.8.1. I run it on a laptop with an Intel Core i7 4900MQ (with SSE up to SSE4.2 and AVX2).
When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2), I see no speedup compared to using the default x87 FPU.
When I use doubles instead of floats, there is no slowdown.
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
Yes, you are.
The compiler is only as good as your code - remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It is not as easy as "turn the switch on and enjoy a 100% performance boost".
First of all, compile your code with -ftree-vectorizer-verbose=N to see what was really vectorized by the compiler.
N is the verbosity level; set it to 5 to see all available output (more info can be found here).
Also, you may want to read about GCC's vectorizer.
And keep in mind, that for performance-critical sections of code, using SSE/AVX intrinsics (brilliantly documented here) directly may be the best option.
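For illustration only (this is not the asker's convolution code, just a minimal sketch), SSE intrinsics processing four floats per iteration might look like this:
#include <immintrin.h>
#include <cstddef>

// Element-wise a[i] += b[i], four floats at a time with SSE, scalar tail for the rest.
void add_arrays(float* a, const float* b, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats (no alignment required)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));   // store 4 sums back
    }
    for (; i < n; ++i)
        a[i] += b[i];
}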
There is no code and no description of the test procedure, but it can generally be explained this way:
It's not only about being CPU-bound; the code can also be bound by memory speed.
Image processing usually has a large working set that exceeds the cache of your non-Xeon CPU. Eventually the CPU runs into starvation, which means the overall throughput can be bounded by memory speed.
You may be using an algorithm that is not friendly for vectorization.
Not every algorithm benefits from being vectorized. There are many conditions that have to be met - flow dependencies, memory layout, etc.
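As a rough illustration (these functions are made up, not the asker's code), compare a loop with a loop-carried flow dependency against one whose iterations are independent:
#include <cstddef>

// Each iteration needs the value written by the previous one, so the
// auto-vectorizer will typically leave this loop alone.
void prefix_sum(float* a, std::size_t n)
{
    for (std::size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

// Iterations are independent: a textbook auto-vectorization candidate.
void scale(float* a, float s, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        a[i] *= s;
}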
I have a program that has at its heart a 2D array in the form of a
std::vector<std::vector< int > > grid
And there's a simple double for loop going on that goes somewhat like this:
for(int i=1; i<N-1; ++i)
for(int j=1; j<N-1; ++j)
sum += grid[i][j-1] + grid[i][j+1] + grid[i-1][j] + grid[i+1][j] + grid[i][j]*some_float;
With g++ -O3 it runs pretty fast, but for further optimization I profiled with Callgrind and see an L1 cache miss rate of about 37%, and 33% for LL, which is a lot but not too surprising considering the random-ish nature of the computation. So I do profile-guided optimization à la
g++ -fprofile-generate -O3 ...
./program
g++ -fprofile-use -O3 ...
and the program runs about 48% faster! But the puzzling part: The cache misses have even increased! L1 data cache miss is now 40%, LL same.
How can that be? There are no conditionals in the loop for which prediction could have been optimised and the cache misses are even higher. Yet it is faster.
Edit: Alright, here's the SSCCE: http://pastebin.com/fLgskdQG . Play around with N for different runtimes. Compiled via
g++ -O3 -std=c++11 sscce.cpp
on GCC 4.8.1 under Linux.
Profile-guided optimization is done with the commands above. The Callgrind numbers are obtained with the g++ -g switch and valgrind --tool=callgrind --simulate-cache=yes ./sscce
I noticed only one significant difference between the assembly generated with and without PGO. Without PGO the sum variable is spilled from a register to memory once per inner-loop iteration. Writing this variable to memory and loading it back might in theory slow things down very significantly. Fortunately, modern processors mitigate it with store-to-load forwarding, so the slowdown is not that big. Still, Intel's optimization manual does not recommend spilling floating point variables to memory, especially when they are computed by long-latency operations like floating point multiplication.
What is really puzzling here is why GCC needs PGO to avoid spilling the register to memory. There are enough unused floating point registers, and even without PGO the compiler could get all the information necessary for proper optimization from the single source file...
These unnecessary load/store operations explain not only why the PGO code is faster, but also why it increases the percentage of cache misses. Without PGO the register is always spilled to the same location in memory, so this additional memory access increases both the number of memory accesses and the number of cache hits, while it does not change the number of cache misses. With PGO we have fewer memory accesses but the same number of cache misses, so their percentage increases.
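A hedged source-level sketch (my own assumption, not something taken from the PGO output): accumulating into a block-local variable and folding it into sum once per row gives the register allocator less reason to spill the accumulator on every inner iteration, which is essentially the effect PGO achieved here:
#include <vector>

double stencil_sum(const std::vector<std::vector<int>>& grid, float some_float)
{
    const int N = static_cast<int>(grid.size());
    double sum = 0.0;
    for (int i = 1; i < N - 1; ++i) {
        double row_sum = 0.0;                       // easy for the compiler to keep in a register
        for (int j = 1; j < N - 1; ++j)
            row_sum += grid[i][j-1] + grid[i][j+1] + grid[i-1][j] + grid[i+1][j]
                     + grid[i][j] * some_float;
        sum += row_sum;                             // sum is touched only once per row
    }
    return sum;
}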