I've created a very simple benchmark to illustrate short string optimization (SSO) and ran it on quick-bench.com. The benchmark works well for comparing the SSO-disabled and SSO-enabled string classes, and the results are very consistent with both GCC and Clang. However, I realized that when I disable optimizations, the reported times are around 4 times faster than those observed with optimizations enabled (-O2 or -O3), with both GCC and Clang.
The benchmark is here: http://quick-bench.com/DX2G2AdxUb7sGPE-zLRa41-MCk0.
Any idea what may cause the unoptimized benchmark to run 4 times faster?
Unfortunately, I can't see the generated assembly, so I don't know where the problem is (the "Record disassembly" box is checked but has no effect in my runs). Also, when I run the benchmark locally with Google Benchmark, the results are as expected, i.e., the optimized benchmark runs faster.
I also tried to compare both variants in Compiler Explorer, and the unoptimized one seemingly executes many more instructions: https://godbolt.org/z/I4a171.
So, as discussed in the comments, the issue is that quick-bench.com does not show absolute time for the benchmarked code, but rather time relative to the time a no-op benchmark took. The no-op benchmark can be found in the source files of quick-bench.com:
static void Noop(benchmark::State& state) {
  for (auto _ : state) benchmark::DoNotOptimize(0);
}
All benchmarks of a run are compiled together, so the optimization flags apply to the no-op benchmark as well.
Reproducing the no-op benchmark and comparing it across optimization levels, one can see that there is about a 6 to 7 times speedup from the -O0 version to the -O1 version. When comparing benchmark runs done with different optimization flags, this factor in the baseline must be taken into account. The 4x speedup observed in the question's benchmark is therefore more than compensated for, and the behavior is really what one would expect.
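As a rough worked example with those numbers: quick-bench reports T_benchmark / T_noop. If the -O0 run reports a ratio 4 times smaller than the -O2 run, while the -O0 no-op baseline is about 6.5 times slower, then the absolute -O0 time is roughly 6.5 / 4 ≈ 1.6 times the absolute -O2 time, i.e. the optimized build really is faster.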
One main difference in how the no-op compiles at -O0 versus -O1 is that at -O0 there are assertions and other additional branches in the Google Benchmark code that are optimized out at higher optimization levels.
Additionally, at -O0 each iteration of the loop will load parts of state into registers, modify them, and store them back to memory several times, e.g. to decrement the loop counter and to test it, while the -O1 version keeps state in registers, making loads/stores inside the loop unnecessary. The former is much slower, costing at least a few cycles per iteration for the necessary store-forwarding and/or reloads from memory.
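If you want to verify the baseline effect locally, a minimal sketch (assuming Google Benchmark is installed; link with -lbenchmark -lpthread) is to build the same no-op benchmark at several optimization levels and compare the absolute times it reports:

#include <benchmark/benchmark.h>

// The same no-op benchmark that quick-bench.com uses as its baseline.
static void Noop(benchmark::State& state) {
  for (auto _ : state) benchmark::DoNotOptimize(0);
}
BENCHMARK(Noop);

BENCHMARK_MAIN();

Compiling this once with -O0 and once with -O1 or higher should show the roughly 6 to 7 times difference in the baseline itself.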
Related
I find an interesting phenomenon:
#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p; // Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}
I use GCC 7.3.0 on an i5-5257U under macOS to compile the code without any optimization. Averaged over 10 runs, the version with the assignment q = p; consistently runs faster than the version without it.
Other people have also tested the case on other Intel platforms and gotten the same result.
I post the assembly generated by GCC here. The only difference between the two assembly listings is that, before addl $1, -12(%rbp), the faster one has two more instructions:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the idea that the store-forwarding speedup is specific to some Intel CPUs.
And BeeOnRope's comments and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it is worth discussing.
TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
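As an illustration (a sketch of my own, not taken from the question), forcing a counter through memory with volatile creates the same kind of store/reload dependency chain that gcc -O0 produces for a plain int:

#include <cstdio>

int main() {
    // 'volatile' forces every increment through a store and a reload, much like
    // the -O0 code keeps k at -12(%rbp). A plain int compiled at -O2 would stay
    // in a register (or the loop would be optimized away entirely).
    volatile int k = 0;
    for (int i = 0; i < 100000000; ++i)
        k = k + 1;
    std::printf("%d\n", k);
    return 0;
}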
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, particularly functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, at the risk of making things worse on some other CPUs. See the discussion in the comments.
I am doing some profiling, and performance is important to me (even 5%). The processor is an Intel Xeon Platinum 8280 ("Cascade Lake") on Frontera. I compile my code with the -Ofast flag, in Release mode. When I add -march=cascadelake, the timing gets worse (by 5-6%) in my test case. The same is true if I use -xCORE-AVX512 instead of -march. I am using icpc 19.1.1.217. Can anyone please explain why? Also, what compilation flags do you suggest for better performance?
Edit 1: I am solving a linear system, which consists of different operations such as dot products and matrix-vector products. So it would be hard for me to provide reproducible code, but I can say that there are multiple loops in my code to which the compiler can apply auto-vectorization. I have used Intel optimization reports on the critical loops in my code, and the reports mentioned potential speedups of at least 1.75x for them (for some of the loops it was over a 5x potential speedup).
I have also used aligned_alloc(64, size) to allocate memory with 64-byte alignment, as this processor supports AVX-512. Also, I round the size up to be a multiple of 64.
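For reference, a minimal sketch of that allocation pattern (my own illustration; the element count is only a placeholder):

#include <cstdio>
#include <cstdlib>

int main() {
    std::size_t n = 1000;                         // placeholder element count
    std::size_t bytes = n * sizeof(double);
    std::size_t padded = (bytes + 63) / 64 * 64;  // round up to a multiple of 64 bytes
    // 64-byte alignment matches the AVX-512 register width and the cache line size.
    double* buf = static_cast<double*>(std::aligned_alloc(64, padded));
    std::printf("%p\n", static_cast<void*>(buf));
    std::free(buf);
    return 0;
}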
I have added OpenMP support to my code and have parallelized some loops, but for these experiments that I am reporting, I am using only 1 OpenMP thread.
I have tried -mavx2, and I got the same result as -xCORE-AVX512.
I have used -O3 instead of -Ofast. I did not get any speed-up.
I have recently heard multiple people say that JIT compilation produces really fast code, faster even than any static compiler can produce. I find this hard to believe when it comes to C++ STL-style templated code, but these people (typically from a C#/Java background) insist that this is indeed the case.
My question is thus: what are the type of optimizations that you can make at runtime but not at compile time?
Edit: clarification: I'm more interested in the kind of things that are impossible to do statically rather than the typical case in any one industry.
JIT compilers can measure the likelihood of a conditional jump being taken, and adjust the emitted code accordingly. A static compiler can do this as well, but not automatically; it requires a hint from the programmer.
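For example (a sketch of my own, GCC/Clang-specific), such a static hint can be given with __builtin_expect, or with the C++20 [[likely]]/[[unlikely]] attributes:

#include <cstdio>

// Tell the compiler that the "negative value" branch is expected to be rare,
// so it lays out the common path as the fall-through case.
long sum_positive(const int* data, long n) {
    long sum = 0;
    for (long i = 0; i < n; ++i) {
        if (__builtin_expect(data[i] < 0, 0))  // "this condition is usually false"
            continue;
        sum += data[i];
    }
    return sum;
}

int main() {
    int data[] = {1, 2, -3, 4};
    std::printf("%ld\n", sum_positive(data, 4));
    return 0;
}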
Obviously this is just one factor among many, but it does indicate that it's possible for JIT to be faster under the right conditions.
Things you can do at runtime:
check which exotic instructions exist (AMD vs. Intel, ...)
detect cache topology
detect memory size
number of cores
and other things I missed from the list
Does this always make things 10x faster? No. But it certainly offers opportunities for optimization that are not available at compile time (for widely distributed code; obviously, if you know it's going to run on only 3 different hardware configs, then you can do custom builds, etc.).
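For illustration (a sketch of my own; the process_* functions are hypothetical placeholders), checking the instruction set and the core count at run time can look like this with GCC/Clang builtins and the standard library:

#include <cstdio>
#include <thread>

// Hypothetical placeholders for a vectorized routine and a portable fallback.
static void process_avx2()   { std::puts("AVX2 path"); }
static void process_scalar() { std::puts("scalar fallback"); }

int main() {
    // __builtin_cpu_supports is a GCC/Clang builtin (x86) that tests CPU features at run time.
    if (__builtin_cpu_supports("avx2"))
        process_avx2();
    else
        process_scalar();

    // Number of hardware threads (may return 0 if it cannot be determined).
    std::printf("hardware threads: %u\n", std::thread::hardware_concurrency());
    return 0;
}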
Contrary to what the answer above claims:
Architecture-specific extensions can easily be used by a static compiler. Visual Studio, for example, has a SIMD-extension option that can be toggled on and off.
Cache sizes are usually similar across processors of a given architecture. A typical Intel core, for example, has around 32 kB of L1 data cache and 256 kB to 1 MB of L2 cache, plus a few MB of shared L3 cache.
Optimizing for memory size would only be necessary if, for some reason, you are writing a massive program that can use over 4 GB of memory.
This may actually be a case where a JIT compiler is useful. However, you can create more threads than there are cores, meaning that on CPUs with more cores those threads will run on separate cores, while on CPUs with fewer cores they will simply share cores. I also think it's quite safe to assume that a CPU has 4 cores.
Still, even multi-core optimizations don't make a JIT compiler necessary, because a program's installer can check the number of available cores and install the version of the program best optimized for that core count.
I do not think that JIT compilation results in better performance than static compilation. You can always create multiple versions of your code, each optimized for a specific device. The only type of optimization I can think of that can make JIT code faster is when the code that processes your input can be optimized in a way that makes it faster for the most common case (which the JIT compiler might be able to discover) but slower for the rarer cases. Even then, you can perform that optimization yourself if you know the input distribution; the static compiler, however, would not be able to discover it on its own.
For example, let's say that you can perform an optimization on a mathematical algorithm that results in an error for values 1-100, but all higher numbers work with this optimization. You notice that values 1-100 can easily be pre-calculated, so you do this:
switch (num) {
    case 0: {
        //code
    }
    //...until case 100
}
//main calculation code
However, this is inefficient (assuming the switch statement is not compiled to a jump table), since cases 0-100 are rarely entered; those values can easily be found mentally, without the help of a computer. A JIT might be able to discover that the following is more efficient (upon seeing that values in the range 0-100 are rarely entered):
if (num < 101) {
    switch (num) {
        //...same as the switch above
    }
}
//main calculation code
In this version of the code, only one if is executed in the most common case; an average of 50 ifs run only in the rare cases (if the switch statement is implemented as a series of ifs).
I have a program that has at its heart a 2D array in the form of a
std::vector<std::vector<int>> grid
And there's a simple double for loop going on that goes somewhat like this:
for (int i = 1; i < N-1; ++i)
    for (int j = 1; j < N-1; ++j)
        sum += grid[i][j-1] + grid[i][j+1] + grid[i-1][j] + grid[i+1][j] + grid[i][j] * some_float;
With g++ -O3 it runs pretty fast, but for further optimization I profiled with Callgrind and see an L1 cache miss ratio of about 37%, and 33% for LL, which is a lot but not too surprising considering the random-ish nature of the computation. So I do profile-guided optimization à la
g++ -fprofile-generate -O3 ...
./program
g++ -fprofile-use -O3 ...
and the program runs about 48% faster! But here's the puzzling part: the cache misses have even increased! The L1 data cache miss ratio is now 40%, and LL is the same.
How can that be? There are no conditionals in the loop whose prediction could have been optimized, and the cache misses are even higher. Yet it is faster.
Edit: Alright, here's the SSCCE: http://pastebin.com/fLgskdQG . Play around with N for different runtimes. Compiled via
g++ -O3 -std=c++11 sscce.cpp
on GCC 4.8.1 under Linux.
Profile-guided optimization was done with the commands above. The Callgrind measurements were done with the -g switch added to g++ and valgrind --tool=callgrind --simulate-cache=yes ./sscce.
I noticed only one significant difference between the assembly generated with and without PGO: without PGO, the sum variable is spilled from a register to memory once per inner-loop iteration. Writing the variable to memory and loading it back might, in theory, slow things down very significantly. Fortunately, modern processors mitigate this with store-to-load forwarding, so the slowdown is not that big. Still, Intel's optimization manual recommends against spilling floating-point variables to memory, especially when they are computed by long-latency operations like floating-point multiplication.
What is really puzzling here is why GCC needs PGO to avoid spilling the register to memory. There are enough unused floating-point registers, and even without PGO the compiler could get all the information necessary for proper optimization from the single source file...
These unnecessary load/store operations explain not only why the PGO code is faster, but also why the percentage of cache misses increases. Without PGO, the register is always spilled to the same location in memory, so this additional memory access increases both the number of memory accesses and the number of cache hits, while it does not change the number of cache misses. With PGO we have fewer memory accesses but the same number of cache misses, so their percentage increases.
I see this flag a lot in the makefiles. What does it mean and when should it be used?
Optimization level 2.
From the GCC man page:
-O1 Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.

-O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2. As compared to -O, this option increases both compilation time and the performance of the generated code.

-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload and -ftree-vectorize options.

-O0 Reduce compilation time and make debugging produce the expected results. This is the default.

-Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
Optimization level 2. The maximum is 3.
See: Options That Control Optimization
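For example, enabling this level is just a matter of passing the flag on the compile line (the file name here is only a placeholder):

g++ -O2 -o myprogram myprogram.cpp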
Note that some years ago -O3 could cause glitches by "optimizing" the code too aggressively. AFAIK, that's no longer true with modern versions of GCC. But out of inertia, -O2 is still considered "the maximum safe" level.
Compilers can use various optimization techniques, like loop unrolling and CPU pipeline scheduling, to remove useless code and avoid data hazards, in order to speed up your code. For example, a loop that runs a fixed number of times can be converted to straight-line code without the loop-control overhead. Or, if all the loop iterations are independent, some parallelization of the code is possible.
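For instance (a sketch of my own, not from the original answer), a loop with a small fixed trip count can effectively be replaced by straight-line code:

#include <cstdio>

int main() {
    int data[4] = {1, 2, 3, 4};

    // The loop the compiler sees: a fixed trip count of 4...
    int sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += data[i];

    // ...which full unrolling effectively turns into straight-line code,
    // with no loop counter and no branch:
    int sum_unrolled = data[0] + data[1] + data[2] + data[3];

    std::printf("%d %d\n", sum, sum_unrolled);
    return 0;
}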
Setting the optimization level to 2 tells the compiler how much effort it should spend looking for those optimizations. The possible levels range from 0 to 3 (plus -Os for size).
You can learn more about what the compiler can do to optimize your code: Optimizing compiler
As per the man page:
-O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2. As compared to -O, this option increases both compilation time and the performance of the generated code.
In plain words: it is the highest optimization level that is generally considered safe. -O3 performs reorganizations which can be troublesome at times. The subject as such is fairly deep.
Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results. Turning on optimization makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.
The default is optimization off. This results in the fastest compile time, but the compiler makes absolutely no attempt to optimize, and the generated programs are considerably larger and slower than when optimization is enabled. There are various -O switches (the permitted forms are -O0, -O1, -O2, -O3, and -Os) in GCC to control the optimization level:
-O0 No optimization; generates unoptimized code but has the fastest compilation time. This is the default.
-O1 Moderate optimization; optimizes reasonably well but does not degrade compilation time significantly. It takes a lot more memory for a large function.
-O2 GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2.
-O3 Full optimization as in -O2; also uses more aggressive automatic inlining of subprograms within a unit and attempts to vectorize loops (a small sketch of such a loop follows this list). It also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload and -ftree-vectorize options.
-Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
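As a small illustration (a sketch of my own), this is the kind of loop that the -ftree-vectorize option enabled at -O3 can turn into SIMD code:

#include <cstddef>
#include <cstdio>

// At -O3 (which enables -ftree-vectorize) GCC can compile this loop into SIMD
// instructions that add several floats per instruction.
void add_arrays(float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}

int main() {
    float a[8] = {0};
    float b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    add_arrays(a, b, 8);
    std::printf("%f %f\n", a[0], a[7]);
    return 0;
}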
To learn more about flags/options used at various optimization levels and their details:
Options That Control Optimization