I'm working on a codebase that has a lot of SIMD intrinsic code. Now that we have AVX2, we still need SIMD code that runs on non-AVX2-capable processors, which will be significantly more work. Plus, the 128-bit lane-crossing limitations of AVX2 shuffles also complicate things. For these reasons, it's a good time to rely more on auto-vectorization. The main things that scare me are the prospect of a single innocent change killing the parallelism and the prospect of debugging auto-vectorized code in case there is a problem.
I've compiled the following with g++ -O1 -g -ftree-vectorize and attempted to step through with GDB (does anyone know why -ftree-vectorize doesn't work with -O0 ?)
float a[1000], b[1000], c[1000];

int main(int argc, char **argv)
{
    for (int i = 0; i < argc; ++i)
        c[i] = a[i] + b[i];
    return 0;
}
but don't get any meaningful results. For example sometimes the value for i says <optimized out> while other times it jumps by 20.
It seems the main problem is that it's difficult to map the SIMD state to the original C state for debugging. But realistically, can it be done?
Using a debugger on auto-vectorized code is tricky, especially when you want to inspect variables that behave differently in the vectorized version (e.g. the loop counter, which advances by a whole vector's worth of elements per iteration).
You can either use a debug build (-O0 or -Og), or you can understand how the compiler vectorized the code and examine the asm and registers. Depending on what kind of bug you need to track down, you might or might not have a problem with an auto-vectorized build.
It sounds from the comments like you're more interested in checking the efficiency of the auto-vectorization, rather than actually debugging to fix logic bugs in your code. Looking at the asm, and benchmarks, is probably your best bet. (even a simple rdtsc before/after a call, or in a unit-test that tests performance as well as correctness.)
Sometimes the compiler will generate multiple versions of a loop, e.g. for the case where the input arrays overlap, and for the case where they don't. Single-stepping (by instruction, with stepi, with layout asm in gdb) can help, until you find the loop that actually does most of the work. Then you can focus on just how it's vectorized. If you want to eliminate the checks and alternate versions, restrict pointers can be helpful. There's also p = __builtin_assume_aligned(p, 16).
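For example, here's a hedged sketch of what that can look like (GCC/Clang extensions; the function and names are just illustrative):

void add_arrays(float *__restrict__ c, const float *__restrict__ a,
                const float *__restrict__ b, int n)
{
    // Promise the compiler the arrays don't overlap and are 16-byte aligned,
    // so it can drop the runtime overlap checks and the scalar prologue.
    a = (const float *)__builtin_assume_aligned(a, 16);
    b = (const float *)__builtin_assume_aligned(b, 16);
    c = (float *)__builtin_assume_aligned(c, 16);
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}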
You could also use Intel's free code analyzer (IACA) to attempt to statically analyze how many cycles an iteration takes. Put IACA marks at the top of your loop body and after the closing brace of your loop, and hope GCC puts them in appropriate places in the auto-vectorized loop, and that the inline asm doesn't break auto-vectorization.
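A rough sketch of the marker placement (IACA_START / IACA_END are macros from Intel's iacaMarks.h; whether they end up in a useful spot after auto-vectorization is not guaranteed):

#include "iacaMarks.h"

void add_arrays(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; ++i) {
        IACA_START          // analysis window starts at the top of the loop body
        c[i] = a[i] + b[i];
    }
    IACA_END                // ...and ends after the loop
}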
No optimization answer would be complete without a link to http://agner.org/optimize/, so here you go.
I find an interesting phenomenon:
#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p; // Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}
I use GCC 7.3.0 on an i5-5257U running macOS to compile the code without any optimization. Here is the average run time over 10 runs:
Other people have also tested the case on other Intel platforms and got the same result.
I post the assembly generated by GCC here. The only difference between the two assembly listings is that before addl $1, -12(%rbp) the faster one has two more instructions:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the point that this store-forwarding speedup is specific to some Intel CPUs.
And BeeOnRope's comment and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it is worth discussing.
TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp), which can only sustain one iteration per ~6 cycles when run back-to-back, might instead create a bottleneck of only one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, at the risk of making things worse on some other CPUs. See the discussion in comments.
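To make that concrete, here is a hypothetical sketch of the pattern (all names made up): a tiny function in another translation unit, built without -flto, that takes and returns data through memory, so the accumulator round-trips through the store buffer on every call:

#include <cstddef>

struct Vec3 { double x, y, z; };

// Imagine this definition lives in a separate .cpp compiled without -flto,
// so it can't inline into the caller below.
Vec3 add_small(const Vec3 *a, const Vec3 *b)
{
    return { a->x + b->x, a->y + b->y, a->z + b->z };
}

Vec3 sum_all(const Vec3 *v, std::size_t n)
{
    Vec3 acc{0, 0, 0};
    for (std::size_t i = 0; i < n; ++i)
        acc = add_small(&acc, &v[i]);   // acc is stored and reloaded around
                                        // every call: a store-forwarding chain
    return acc;
}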
I've created a very simple benchmark to illustrate short string optimization and run it on quick-bench.com. The benchmark works very well for comparing the SSO-disabled/enabled string classes, and the results are very consistent with both GCC and Clang. However, I realized that when I disable optimizations, the reported times are around 4 times faster than those observed with optimizations enabled (-O2 or -O3), both with GCC and Clang.
The benchmark is here: http://quick-bench.com/DX2G2AdxUb7sGPE-zLRa41-MCk0.
Any idea what may cause the unoptimized benchmark to run 4-times faster?
Unfortunately, I can't see the generated assembly; don't know where the problem is (the "Record disassembly" box is checked but has no effect in my runs). Also, when I run the benchmark locally with Google Benchmark, the results are as expected, i.e., the optimized benchmark runs faster.
I also tried to compare both variants in Compiler Explorer and the unoptimized one seemingly executes many more instructions: https://godbolt.org/z/I4a171.
So, as discussed in the comments, the issue is that quick-bench.com does not show absolute time for the benchmarked code, but rather time relative to the time a no-op benchmark took. The no-op benchmark can be found in the source files of quick-bench.com:
static void Noop(benchmark::State& state) {
    for (auto _ : state) benchmark::DoNotOptimize(0);
}
All benchmarks of a run are compiled together. Therefore the optimization flags apply to it as well.
Reproducing and comparing the no-op benchmark at different optimization levels, one can see that there is about a 6-7x speedup from the -O0 to the -O1 version. When comparing benchmark runs done with different optimization flags, this factor in the baseline must be taken into account. The 4x "speed-up" observed in the question's benchmark is therefore more than compensated: if the no-op baseline is roughly 6x slower at -O0, a benchmark that appears 4x faster relative to that baseline is actually about 6/4 ≈ 1.5x slower in absolute time, so the behavior is really as one would expect.
One main difference in compilation of the no-op between -O0 and -O1 is that for -O0 there are some assertions and other additional branches in the google-benchmark code, that are optimized out for higher optimization levels.
Additionally, at -O0 each iteration of the loop will load parts of state into registers, modify them, and store them back to memory multiple times, e.g. to decrement the loop counter and to evaluate conditionals on it, while the -O1 version keeps state in registers, making memory loads/stores in the loop unnecessary. The former is much slower, taking at least a few cycles per iteration for the necessary store-forwardings and/or reloads from memory.
I was wondering if there is an optimization in gcc that can make some single-threaded code like the example below execute in parallel. If no, why? If yes, what kind of optimizations are possible?
#include <iostream>

int main(int argc, char *argv[])
{
    int array[10];
    for (int i = 0; i < 10; ++i) {
        array[i] = 0;
    }
    for (int i = 0; i < 10; ++i) {
        array[i] += 2;
    }
    return 0;
}
Added:
Thanks for the OpenMP links, and as much as I think it's useful, my question is about compiling the same code without the need to rewrite anything.
So basically I want to know if:
Is making code parallel (at least in some cases) without rewriting it possible?
If yes, what cases can be handled? If not, why?
The compiler can try to automatically parallelise your code, but it won't do it by creating threads. It may use vector (SIMD) instructions (SSE/AVX on an Intel CPU, for example) to operate on multiple elements at a time, where it can detect that using those instructions is possible (for example when you perform the same operation multiple times on consecutive elements of a correctly aligned data structure). You can help the compiler by telling it which instruction set your CPU supports (-mavx, -msse4.2 ... for example).
You can also use these instructions directly (as intrinsics), but that requires a non-trivial amount of work from the programmer. There are also libraries which do this already (see the vector class library on Agner Fog's site).
You can get the compiler to auto-parallelise using multiple threads by using OpenMP (OpenMP introduction), though that is more a case of you instructing the compiler to parallelise than of the compiler auto-parallelising by itself.
Yes, gcc with -ftree-parallelize-loops=4 will attempt to auto-parallelize with 4 threads, for example.
I don't know how well gcc does at auto-parallelization, but it is something that compiler developers have been working on for years. As other answers point out, giving the compiler some guidance with OpenMP pragmas can give better results. (e.g. by letting the compiler know that it doesn't matter what order something happens in, even when that may slightly change the result, which is common for floating point. Floating point math is not associative.)
And also, only doing auto-parallelization for #pragma omp loops means only the really important loops get this treatment. -ftree-parallelize-loops probably benefits from PGO (profile-guided optimization) to know which loops are actually hot and worth parallelizing and/or vectorizing.
It's somewhat related to finding the kind of parallelism that SIMD can take advantage of, for auto-vectorizing loops. (Which is enabled by default at -O3 in gcc, and at -O2 in clang).
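As a hedged sketch, here is the kind of loop both of those optimizations can apply to (the function, names, and suggested invocation g++ -O2 -ftree-parallelize-loops=4 are illustrative assumptions, not from the question):

#include <cstddef>

// Independent iterations over a large array: a candidate for auto-vectorization
// and, when the trip count is large enough to amortize the threading overhead,
// for auto-parallelization with -ftree-parallelize-loops.
void scale(float *dst, const float *src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}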
Compilers are allowed to do whatever they want as long as the observable behavior (see 1.9 [intro.execution] paragraph 8) is identical to that specified by the [correct(*)] program. Observable behavior is specified in terms of I/O operations (using standard C++ library I/O) and access to volatile objects (although the compiler actually isn't really required to treat volatile objects special if it can prove that these aren't in observable memory). To this end the C++ execution system may employ parallel techniques.
Your example program actually has no observable outcome, and compilers are good at constant-folding programs to find out that the program actually does nothing. At best, the heat radiated from the CPU could be an indication of work, but the amount of energy consumed isn't one of the observable effects, i.e., the C++ execution system isn't required to do that. If you compile the code above with clang with optimization turned on (-O2 or higher) it will actually remove the loops entirely (use the -S option to have the compiler emit assembly code so you can inspect the results reasonably easily).
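A hedged illustration of that point: when the data is reachable from outside the function (so a caller could observe the result), the stores become part of the observable effect and can't simply be deleted, though the compiler is still free to fuse the two loops or turn them into a pattern fill:

// With unknown inputs and externally visible output, the stores to 'out' must
// happen in some form; at -O2/-O3 the loops will typically be vectorized or
// turned into an equivalent fill rather than removed.
void init_and_bump(int *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = 0;
    for (int i = 0; i < n; ++i)
        out[i] += 2;
}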
Assuming you actually have loops which are forced to be executed, most contemporary compilers (at least gcc, clang, and icc) will try to vectorize the code, taking advantage of SIMD instructions. To do so, the compiler needs to comprehend the operations in the code to prove that parallel execution doesn't change the results or introduce data races (as far as I can tell, the exact results are actually not necessarily retained when floating point operations are involved, as some of the compilers happily parallelize, e.g., loops adding floats although floating point addition isn't associative).
I'm not aware of a contemporary compiler which will utilize different threads of execution to improve the speed of execution without some form of hints like OpenMP's pragmas. However, discussions at the committee meetings imply that compiler vendors are considering doing so, at least.
(*) The C++ standard imposes no restriction on the C++ execution system in case the program execution results in undefined behavior. Correct programs wouldn't invoke any form of undefined behavior.
tl;dr: compilers are allowed but not required to execute code in parallel and most contemporary compilers do so in some situations.
If you want to parallelize your C++ code, you can use OpenMP. Official documentation can be found here: OpenMP doc.
OpenMP provides pragmas so that you can indicate to the compiler that a portion of code has to use a certain number of threads. Sometimes you can set this manually, and other pragmas can automatically tune the number of cores used.
The code below is an example from the official documentation:
#include <cmath>

int main() {
    const int size = 256;
    double sinTable[size];

    #pragma omp parallel for
    for (int n = 0; n < size; ++n) {
        sinTable[n] = std::sin(2 * M_PI * n / size);
    }
}
This code will automatically parallelize the for loop, which answers your question. OpenMP offers a lot of other possibilities; you can read the documentation to learn more.
If you need to understand compiling for openmp support, see this stack overflow thread : openmp compilation thread.
Be careful: if you don't use OpenMP-specific compiler options (e.g. -fopenmp for GCC), the pragmas will simply be ignored and your code will run on 1 thread.
I hope this helps.
I believe it is usual to have code like this in C++:
for (size_t i = 0; i < ARRAY_SIZE; ++i)
    A[i] = B[i] * C[i];
One commonly advocated alternative is:
double* pA=A,pB=B,pC=C;
for (size_t i = 0; i < ARRAY_SIZE; ++i)
    *pA++ = (*pB++) * (*pC++);
What I am wondering is the best way of improving this code, as IMO the following things need to be considered:
CPU cache. How do CPUs fill up their caches to get the best hit rate?
I suppose SSE could improve this?
The other thing is: what if the code could be parallelized, e.g. using OpenMP? In that case, the pointer trick may not be available.
Any suggestions would be appreciated!
My g++ 4.5.2 produces absolutely identical code for both loops (having fixed the error: it should be double *pA=A, *pB=B, *pC=C;), and it is:
.L3:
movapd B(%rax), %xmm0
mulpd C(%rax), %xmm0
movapd %xmm0, A(%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
(where my ARRAY_SIZE was 10000)
The compiler authors know these tricks already. OpenMP and other concurrent solutions are worth investigating, though.
The rules for performance are:
not yet (don't optimize prematurely)
get a target
measure
get an idea of how much improvement is possible and verify it is worthwhile to spend the time to get it
This is even more true for modern processors. About your questions:
Simple index-to-pointer mapping is often done by compilers, and when they don't do it they may have good reasons.
Processors are often already optimized for sequential access to the cache: straightforward code generation will often give the best performance.
SSE can perhaps improve this, but not if you are already bandwidth-limited. So we are back to the measure-and-determine-bounds stage.
Parallelization: same thing as SSE. Using the multiple cores of a single processor won't help if you are bandwidth-limited. Using a different processor may help, depending on the memory architecture.
Manual loop unrolling (suggested in a now-deleted answer) is often a bad idea. Compilers know how to do this when it is worthwhile (for instance if they can do software pipelining), and with modern OoO processors it is often not the case (it increases the pressure on the instruction and trace caches, while OoO execution, branch speculation, and register renaming automatically bring most of the benefit of unrolling and software pipelining).
The first form is exactly the sort of structure that your compiler will recognize and optimize, almost certainly emitting SSE instructions automatically.
For this kind of trivial inner loop, cache effects are irrelevant, because you are iterating through everything. If you have nested loops, or a sequence of operations (like g(f(A,B),C)), then you might try to arrange to access small blocks of memory repeatedly to be more cache-friendly.
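A hedged sketch of that blocking idea for a chained operation like g(f(A,B),C) (the block size and the element-wise operations are made up for illustration):

#include <algorithm>
#include <cstddef>

void fused(double *out, const double *A, const double *B, const double *C,
           std::size_t n)
{
    constexpr std::size_t BLOCK = 4096;
    double tmp[BLOCK];
    for (std::size_t start = 0; start < n; start += BLOCK) {
        const std::size_t len = std::min(BLOCK, n - start);
        for (std::size_t i = 0; i < len; ++i)       // f: tmp = A * B
            tmp[i] = A[start + i] * B[start + i];
        for (std::size_t i = 0; i < len; ++i)       // g: out = tmp + C
            out[start + i] = tmp[i] + C[start + i];
        // tmp is still hot in cache when g reads it, instead of a full-size
        // temporary array that has been evicted by the time it is reused.
    }
}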
Do not unroll the loop by hand. Your compiler will already do that, too, if it is a good idea (which it may not be on a modern CPU).
OpenMP can maybe help if the loop is huge and the operations within are complicated enough that you are not already memory-bound.
In general, write your code in a natural and straightforward way, because that is what your optimizing compiler is most likely to understand.
When to start considering SSE or OpenMP? If both of these are true:
If you find that code similar to yours appears 20 times or more in your project:
for (size_t i = 0; i < ARRAY_SIZE; ++i)
    A[i] = B[i] * C[i];
or some similar operations
If ARRAY_SIZE is routinely bigger than 10 million, or, if profiler tells you that this operation is becoming a bottleneck
Then,
First, make it into a function: void array_mul(double* pa, const double* pb, const double* pc, size_t count){ for (...) }
Second, if you can afford to find a suitable SIMD library, change your function to use it.
Good portable SIMD library
SIMD C++ library
As a side note, if you have a lot of operations that are only slightly more complicated than this, e.g. A[i] = B[i] * C[i] + D[i], then a library which supports expression templates will be useful too.
You can use some easy parallelization methods. CUDA is hardware-dependent, but SSE is almost standard in every CPU. You can also use multiple threads; with multiple threads you can still use the pointer trick, which is not very important anyway. Those simple optimizations can be done by the compiler as well. If you are using Visual Studio 2010 you can use parallel_invoke to execute functions in parallel without dealing with Windows threads. On Linux, the pthread library is quite easy to use.
I think valarrays are specialised for such calculations, though I am not sure whether they will improve the performance.
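For reference, a minimal std::valarray version of the same computation (whether it beats the plain loop depends on the implementation, so measure before committing to it):

#include <cstddef>
#include <valarray>

int main()
{
    const std::size_t n = 10000;
    std::valarray<double> A(n), B(1.5, n), C(2.0, n);
    A = B * C;                         // element-wise multiply, no explicit loop
    return static_cast<int>(A[0]);     // use the result so it isn't optimized away
}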
In most C or C++ environments, there is a "debug" mode and a "release" mode compilation.
Looking at the difference between the two, you find that the debug mode adds the debug symbols (often the -g option on lots of compilers) but it also disables most optimizations.
In "release" mode, you usually have all sorts of optimizations turned on.
Why the difference?
Without any optimization on, the flow through your code is linear. If you are on line 5 and single step, you step to line 6. With optimization on, you can get instruction re-ordering, loop unrolling and all sorts of optimizations.
For example:
void foo() {
1:    int i;
2:    for(i = 0; i < 2; )
3:        i++;
4:    return;
}
In this example, without optimization, you could single step through the code and hit lines 1, 2, 3, 2, 3, 2, 4
With optimization on, you might get an execution path that looks like: 2, 3, 3, 4 or even just 4! (The function does nothing after all...)
Bottom line, debugging code with optimization enabled can be a royal pain! Especially if you have large functions.
Note that turning on optimization changes the code! In certain environment (safety critical systems), this is unacceptable and the code being debugged has to be the code shipped. Gotta debug with optimization on in that case.
While the optimized and non-optimized code should be "functionally" equivalent, under certain circumstances, the behavior will change.
Here is a simplistic example:
int* ptr = (int*)0xdeadbeef; // some address of a memory-mapped I/O device
*ptr = 0;                    // set up hardware device
while (*ptr == 1) {          // loop until hardware device is done
    // do something
}
With optimization off, this is straightforward, and you kinda know what to expect.
However, if you turn optimization on, a couple of things might happen:
The compiler might optimize the while block away (we initialize *ptr to 0, and as far as the compiler can see it will never become 1).
Instead of accessing memory, the pointer access might be kept in a register, so there is no I/O update.
The memory access might be cached (not necessarily related to compiler optimization).
In all these cases, the behavior would be drastically different and most likely wrong.
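A hedged sketch of the usual compiler-level fix for that example: qualify the pointee as volatile, so every access counts as observable behaviour and can't be hoisted or removed (the caching issue is a separate, hardware-level concern). The address and function name here are hypothetical:

volatile int *ptr = (volatile int *)0xdeadbeef;  // hypothetical MMIO address

void wait_for_device()
{
    *ptr = 0;              // the store must really happen
    while (*ptr == 1) {    // re-read from the device on every iteration
        // do something
    }
}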
Another crucial difference between debug and release is how local variables are stored. Conceptually, local variables are allocated storage in a function's stack frame. The symbol file generated by the compiler tells the debugger the offset of the variable in the stack frame, so the debugger can show it to you. The debugger peeks at the memory location to do this.
However, this means every time a local variable is changed the generated code for that source line has to write the value back to the correct location on the stack. This is very inefficient due to the memory overhead.
In a release build the compiler may assign a local variable to a register for a portion of a function. In some cases it may not assign stack storage for it at all (the more registers a machine has the easier this is to do).
However, the debugger doesn't know how registers map to local variables for a particular point in the code (I'm not aware of any symbol format that includes this information), so it can't show it to you accurately as it doesn't know where to go looking for it.
Another optimization would be function inlining. In optimized builds the compiler may replace a call to foo() with the actual code for foo everywhere it is used because the function is small enough. However, when you try to set a breakpoint on foo() the debugger wants to know the address of the instructions for foo(), and there is no longer a simple answer to this -- there may be thousands of copies of the foo() code bytes spread over your program. A debug build will guarantee that there is somewhere for you to put the breakpoint.
Optimizing code is an automated process that improves the runtime performance of the code while preserving semantics. This process can remove intermediate results which are unnecessary to complete an expression or function evaluation, but may be of interest to you when debugging. Similarly, optimizations can alter the apparent control flow so that things may happen in a slightly different order than what appears in the source code. This is done to skip unnecessary or redundant calculations. This rejiggering of code can mess with the mapping between source code line numbers and object code addresses, making it hard for a debugger to follow the flow of control as you wrote it.
Debugging in unoptimized mode allows you to see everything you've written as you've written it without the optimizer removing or reordering things.
Once you are happy that your program is working correctly you can turn on optimizations to get improved performance. Even though optimizers are pretty trustworthy these days, it's still a good idea to build a good quality test suite to ensure that your program runs identically (from a functional point of view, not considering performance) in both optimized and unoptimized mode.
The expectation is for the debug version to be - debugged! Setting breakpoints, single-stepping while watching variables, stack traces, and everything else you do in a debugger (IDE or otherwise) make sense if every line of non-empty, non-comment source code matches some machine code instruction.
Most optimizations mess with the order of machine code. Loop unrolling is a good example. Common subexpressions can be lifted out of loops. With optimization turned on, even at the simplest level, you may be trying to set a breakpoint on a line that, at the machine code level, doesn't exist. Sometimes you can't monitor a local variable because it is kept in a CPU register, or perhaps even optimized out of existence!
If you're debugging at the instruction level rather than the source level, it's an awful lot easier to map unoptimized instructions back to the source. Also, compilers are occasionally buggy in their optimizers.
In the Windows division at Microsoft, all release binaries are built with debugging symbols and full optimizations. The symbols are stored in separate PDB files and do not affect the performance of the code. They don't ship with the product, but most of them are available at the Microsoft Symbol Server.
Another of the issues with optimizations is inlined functions, also in the sense that you will always single-step through them.
With GCC, with debugging and optimizations enabled together, if you don't know what to expect you will think that the code is misbehaving and re-executing the same statement multiple times - it happened to a couple of my colleagues.
Also, the debugging info given by GCC with optimizations on tends to be of poorer quality than it could be, actually.
However, in languages hosted by a Virtual Machine like Java, optimizations and debugging can coexist - even during debugging, JIT compilation to native code continues, and only the code of debugged methods is transparently converted to an unoptimized version.
I would like to emphasize that optimization should not change the behaviour of the code, unless the used optimizer is buggy, or the code itself is buggy and relies on partially undefined semantics; the latter is more common in multithreaded programming or when inline assembly is also used.
Code with debugging symbols is larger, which may mean more cache misses, i.e. slower execution, which may be an issue for server software.
At least on Linux (and there's no reason why Windows should be different), debug info is packaged in a separate section of the binary and is not loaded during normal execution. It can be split into a different file to be used for debugging.
Also, on some compilers (including GCC, and I guess also Microsoft's C compiler) debugging info and optimizations can both be enabled together. If not, obviously the code is going to be slower.