I believe it is common to have code like this in C++:
for (size_t i = 0; i < ARRAY_SIZE; ++i)
    A[i] = B[i] * C[i];
One commonly advocated alternative is:
double* pA=A,pB=B,pC=C;
for(size_t i=0;i<ARRAY_SIZE;++i)
*pA++=(*pB++)*(*pC++);
What I am wondering is: what is the best way to improve this code? IMO, the following things need to be considered:
The CPU cache. How do CPUs fill their caches to get the best hit rate?
I suppose SSE could improve this?
The other thing is: what if the code can be parallelized, e.g. using OpenMP? In that case, the pointer trick may not be usable.
Any suggestions would be appreciated!
My g++ 4.5.2 produces absolutely identical code for both loops (after fixing the error in the declaration to double *pA=A, *pB=B, *pC=C;), and it is:
.L3:
movapd B(%rax), %xmm0
mulpd C(%rax), %xmm0
movapd %xmm0, A(%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
(where my ARRAY_SIZE was 10000)
The compiler authors know these tricks already. OpenMP and other concurrent solutions are worth investigating, though.
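For reference, here is a minimal sketch of the question's loop parallelized with OpenMP; the index form works fine, no pointer trick is needed, and the function name is mine.

#include <cstddef>

// A minimal sketch of the question's loop parallelized with OpenMP
// (compile with -fopenmp); n plays the role of ARRAY_SIZE.
void multiply_arrays(double* A, const double* B, const double* C, std::ptrdiff_t n)
{
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i)
        A[i] = B[i] * C[i];
}

Whether this helps at all depends on whether the loop is memory-bandwidth-bound, as the other answers point out.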
The rules for performance are:
not yet (i.e. don't optimize before you have to)
set a target
measure
get an idea of how much improvement is possible and verify that it is worthwhile to spend the time to get it.
This is even more true for modern processors. About your questions:
Simple index-to-pointer conversion is often done by compilers, and when they don't do it, they may have good reasons.
Processors are already optimized for sequential access to memory: straightforward code generation will often give the best performance.
SSE can perhaps improve this, but not if you are already bandwidth-limited. So we are back to the measure-and-determine-bounds stage.
Parallelization: same thing as SSE. Using the multiple cores of a single processor won't help if you are bandwidth-limited. Using a different processor may help, depending on the memory architecture.
Manual loop unrolling (suggested in a now-deleted answer) is often a bad idea. Compilers know how to do this when it is worthwhile (for instance, if it enables software pipelining), and with modern out-of-order processors it often isn't: it increases the pressure on the instruction and trace caches, while OOO execution, speculation over jumps, and register renaming automatically bring most of the benefits of unrolling and software pipelining.
The first form is exactly the sort of structure that your compiler will recognize and optimize, almost certainly emitting SSE instructions automatically.
For this kind of trivial inner loop, cache effects are irrelevant, because you are iterating through everything. If you have nested loops, or a sequence of operations (like g(f(A,B),C)), then you might try to arrange to access small blocks of memory repeatedly to be more cache-friendly.
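For example, if f is the element-wise multiply and g adds a third array, a blocked version could look like this sketch (the array names follow the question; the result array and the function name are mine):

#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the blocking idea for a chained operation like g(f(A,B), C):
// work on a chunk small enough to stay in cache, so the intermediate result of f is
// still hot when g consumes it.
void fused_mul_add_blocked(double* r, const double* a, const double* b,
                           const double* c, std::size_t count)
{
    constexpr std::size_t BLOCK = 4096;        // tune so the working set fits in L1/L2
    double tmp[BLOCK];
    for (std::size_t base = 0; base < count; base += BLOCK) {
        const std::size_t n = std::min(BLOCK, count - base);
        for (std::size_t i = 0; i < n; ++i)    // f: element-wise multiply
            tmp[i] = a[base + i] * b[base + i];
        for (std::size_t i = 0; i < n; ++i)    // g: add C while tmp is still in cache
            r[base + i] = tmp[i] + c[base + i];
    }
}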
Do not unroll the loop by hand. Your compiler will already do that, too, if it is a good idea (which it may not be on a modern CPU).
OpenMP may help if the loop is huge and the operations within it are complicated enough that you are not already memory-bound.
In general, write your code in a natural and straightforward way, because that is what your optimizing compiler is most likely to understand.
When to start considering SSE or OpenMP? If both of these are true:
If you find that code similar to yours appears 20 times or more in your project:
for (size_t i = 0; i < ARRAY_SIZE; ++i)
    A[i] = B[i] * C[i];
or some similar operations
If ARRAY_SIZE is routinely bigger than 10 million, or if a profiler tells you that this operation is becoming a bottleneck
Then,
First, make it into a function: void array_mul(double* pa, const double* pb, const double* pc, size_t count){ for (...) }
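A plain scalar version of that function might look like this sketch:

#include <cstddef>

// Sketch of the plain scalar version; a SIMD library can later replace the body.
void array_mul(double* pa, const double* pb, const double* pc, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        pa[i] = pb[i] * pc[i];
}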
Second, if you can afford to find a suitable SIMD library, change your function to use it.
Good portable SIMD library
SIMD C++ library
As a side note, if you have a lot of operations that are only slightly more complicated than this, e.g. A[i] = B[i] * C[i] + D[i], then a library which supports expression templates will be useful too.
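To make the point concrete, here is a bare-bones sketch of what an expression-template library does under the hood for A = B * C + D: the whole right-hand side is evaluated in one loop, with no temporary arrays. All names here are made up for illustration; real libraries do this far more completely.

#include <cstddef>
#include <vector>

template <class E>
struct Expr {                                   // CRTP base: anything element-indexable
    const E& self() const { return static_cast<const E&>(*this); }
};

template <class L, class R>
struct Mul : Expr<Mul<L, R> > {
    const L& l; const R& r;
    Mul(const L& l, const R& r) : l(l), r(r) {}
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

template <class L, class R>
struct Add : Expr<Add<L, R> > {
    const L& l; const R& r;
    Add(const L& l, const R& r) : l(l), r(r) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Array : Expr<Array> {
    std::vector<double> v;
    explicit Array(std::size_t n) : v(n) {}
    double  operator[](std::size_t i) const { return v[i]; }
    double& operator[](std::size_t i)       { return v[i]; }

    template <class E>
    Array& operator=(const Expr<E>& e) {        // one pass, no temporary arrays
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = e.self()[i];
        return *this;
    }
};

template <class L, class R>
Mul<L, R> operator*(const Expr<L>& l, const Expr<R>& r) { return Mul<L, R>(l.self(), r.self()); }

template <class L, class R>
Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) { return Add<L, R>(l.self(), r.self()); }

// Usage: A = B * C + D;  the expression objects only hold references and live
// until the end of the statement, when the single assignment loop runs.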
You can use some easy parallelization method. CUDA is hardware-dependent, but SSE is available in almost every CPU. You can also use multiple threads. With multiple threads you can still use the pointer trick, although it is not very important; those simple optimizations can be done by the compiler as well. If you are using Visual Studio 2010, you can use parallel_invoke to execute functions in parallel without dealing with Windows threads directly. On Linux, the pthread library is quite easy to use.
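As a sketch of the parallel_invoke idea, the question's loop could be split into two halves and run as two tasks (the function name is mine; A, B, C are as in the question):

#include <ppl.h>       // Microsoft PPL, Visual Studio 2010 and later
#include <cstddef>

void multiply_halves(double* A, const double* B, const double* C, std::size_t size)
{
    const std::size_t half = size / 2;
    concurrency::parallel_invoke(
        [&] { for (std::size_t i = 0;    i < half; ++i) A[i] = B[i] * C[i]; },
        [&] { for (std::size_t i = half; i < size; ++i) A[i] = B[i] * C[i]; });
}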
I think std::valarray is specialised for such calculations. I am not sure whether it will improve performance, though.
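For reference, a minimal sketch of the valarray version; whether it beats the plain loop depends on the standard library implementation:

#include <valarray>

// Sketch: std::valarray expresses the element-wise multiply without an explicit loop.
void multiply(std::valarray<double>& A,
              const std::valarray<double>& B,
              const std::valarray<double>& C)
{
    A = B * C;   // element-wise multiply
}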
Related
I found an interesting phenomenon:
#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;

    s = clock();
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p;  // Removing this line can increase the running time.
            }
        }
    }
    e = clock();

    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}
I used GCC 7.3.0 on an i5-5257U (macOS) to compile the code without any optimization, and averaged the run time over 10 runs.
Other people have also tested the case on other Intel platforms and got the same result.
I post the assembly generated by GCC here. The only difference between the two assembly listings is that, before addl $1, -12(%rbp), the faster one has two more instructions:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the point that the store-forwarding speedup is specific to some Intel CPUs.
BeeOnRope's comment and advice drove me to rewrite the question. :)
The core of this question is the interesting phenomenon, which is related to processor architecture and assembly, so I think it is worth discussing.
TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, and maybe make things worse on some other CPUs. See the discussion in the comments.
Since there is no AVX version of _mm_movelh_ps, I usually use _mm256_shuffle_ps(a, b, 0x44) for AVX registers as a replacement. However, I remember reading in other questions that swizzle instructions without a control integer (like _mm256_unpacklo_ps or _mm_movelh_ps) should be preferred if possible (for some reason I don't know). Yesterday it occurred to me that another alternative might be the following:
_mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Also, if that is indeed the case, it would be nice if somebody could explain in simple words (I have a very limited understanding of assembly and microarchitecture) why one should prefer instructions without a control integer.
Thanks in advance
Additional note:
Clang actually optimizes the shuffle to vunpcklpd: https://godbolt.org/z/9XFP8D
So it seems that my idea is not too bad. However, GCC and ICC create a shuffle instruction.
Avoiding an immediate saves 1 byte of machine-code size; that's all. It's at the bottom of the list for performance considerations, but all else equal shuffles like _mm256_unpacklo_pd with an implicit "control" are very slightly better than an immediate control byte for that reason.
(But taking the control operand in another vector, as vpermilps can and vpermd requires, is usually worse, unless you have some weird front-end bottleneck in a long-running loop and can load the shuffle control outside the loop. That is not very plausible, and at that point you'd have to be writing by hand in asm to care that much about code size/alignment; in C++ that's still not something you can really control directly.)
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Ice Lake has 2/clock vshufps vs. 1/clock vunpcklpd, according to testing by uops.info on real hardware, running on port 1 or port 5. Definitely use _mm256_shuffle_ps. The trivial extra code-size cost probably doesn't actually hurt at all on earlier CPUs, and is probably worth it for the future benefit on ICL, unless you're sure that port 5 won't be a bottleneck.
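For example, a small helper for the missing 256-bit movelh could be written as follows; the helper name is mine, and 0x44 is the same immediate the question already uses:

#include <immintrin.h>

// Hypothetical helper: per-128-bit-lane "movelh" for AVX, i.e. the low 64-bit
// halves of a and b within each lane. 0x44 == _MM_SHUFFLE(1,0,1,0).
static inline __m256 mm256_movelh_ps(__m256 a, __m256 b)
{
    return _mm256_shuffle_ps(a, b, 0x44);
}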
Ice Lake has a 2nd shuffle unit on port 1 that can handle some common XMM and in-lane YMM shuffles, including vpshufb and apparently some 2-input shuffles like vshufps. I have no idea why it doesn't just decode vunpcklpd as a vshufps with that control vector, or otherwise manage to run that shuffle on port 1. We know the shuffle HW itself can do the shuffle so I guess it's just a matter of control hardware to set up implicit shuffles, mapping an opcode to a shuffle control somehow.
Other than that, it's equal or better on older AVX CPUs; no CPU has penalties for using PD shuffles between other PS instructions. The only difference on any existing CPU is code size. Old CPUs like K8 and Core 2 had faster pd shuffles than ps, but no CPU with AVX has shuffle units with that weakness. Also, AVX's non-destructive instructions level out the differences over which operand has to be the destination.
As you can see from the Godbolt link, there are zero extra instructions before/after the shuffle. The "cast" intrinsics aren't doing conversion, just reinterpreting to keep the C++ type system happy, because Intel decided to have separate types for __m256 vs. __m256d (vs. __m256i) instead of one generic YMM type. They chose not to have separate uint8x16 vs. uint32x4 vectors the way ARM did, though; for integer SIMD there is just __m256i.
So there's no need for compilers to emit extra instructions for casts, and in practice that's true; they don't introduce extra vmovaps/apd register copies or anything like that.
If you're using clang you can just write it conveniently and let clang's shuffle optimizer emit vunpcklpd for you. Or in other cases, do whatever it's going to do anyway; sometimes it makes worse choices than the source, often it does a good job.
Clang gets this wrong with -march=icelake-client, still using vunpcklpd even if you write _mm256_shuffle_ps. (Or depending on surrounding code, might optimize that shuffle into part of something else.)
Related bug report.
I am trying to use Intel SIMD intrinsics to accelerate a query-answer program. Suppose query_cnt is input-dependent but is always smaller than the SIMD register count (i.e. there are enough SIMD registers to hold them). Since queries are the hot data in my application, instead of loading them each time they are needed, may I load them once at first and keep them in registers?
Suppose queries are float type, and AVX256 is supported. Now I have to use something like:
std::vector<__m256> vec_queries(query_cnt / 8);
for (int i = 0; i < query_cnt / 8; ++i) {
    vec_queries[i] = _mm256_loadu_ps((float const *)(curr_query_ptr));
    curr_query_ptr += 8;
}
I know it is not good practice since there is potential load/store overhead, but at least there is a slight chance that vec_queries[i] gets optimized so that the values are kept in registers. Still, I don't think it is a good approach.
Any better ideas?
From the code sample you posted, it looks like you're just doing a variable-length memcpy. Depending on what the compiler does, and the surrounding code, you might get better results from just actually calling memcpy. e.g. for aligned copies with a size that's a multiple of 16B, the break-even point between a vector loop and rep movsb is maybe as low as ~128 bytes on Intel Haswell. Check Intel's optimization manual for some implementation notes on memcpy, and a graph of size vs. cycles for a couple of different strategies. (Links in the x86 tag wiki.)
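For example, something like this sketch, using the names from your snippet; it assumes query_cnt is a multiple of 8, as your loop already does:

#include <cstddef>
#include <cstring>
#include <vector>
#include <immintrin.h>

// Sketch: let memcpy (or the compiler's builtin expansion of it) do the copy.
std::vector<__m256> load_queries(const float* curr_query_ptr, std::size_t query_cnt)
{
    std::vector<__m256> vec_queries(query_cnt / 8);
    std::memcpy(vec_queries.data(), curr_query_ptr, query_cnt * sizeof(float));
    return vec_queries;
}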
You didn't say what CPU, so I'm just assuming recent Intel.
I think you're too worried about registers. Loads that hit in L1 cache are extremely cheap. Haswell (and Skylake) can do two __m256 loads per clock (and a store in the same cycle). Previous to that, Sandybridge/IvyBridge can do two memory operations per clock, with a max of one of them being a store. Or under ideal conditions (256b loads/stores), they can manage 2x 16B loaded and 1x 16B stored per clock. So loading/storing 256b vectors is more expensive than on Haswell, but still very cheap if they're aligned and hot in L1 cache.
I mentioned in comments that GNU C global register variables might be a possibility, but mostly in a "this is technically possible in theory" sense. You probably don't want multiple vector registers dedicated to this purpose for the entire run-time of your program (including library function calls, so you'd have to recompile them).
In reality, just make sure the compiler can inline (or at least see while optimizing) the definitions for every function you use inside any important loops. That way it can avoid having to spill/reload vector regs across function calls (since both the Windows and System V x86-64 ABIs have no call-preserved YMM (__m256) registers).
See Agner Fog's microarch pdf to learn even more about the microarchitectural details of modern CPUs, at least the details that are possible to measure by experiment and tune for.
I am learning assembly and doing some inline assembly with my Digital Mars C++ compiler. I looked into ways to make a program faster and came up with these parameters to tune:
use a better C++ compiler // thinking of GCC or the Intel compiler
use assembly only in the critical parts of the program
find a better algorithm
Cache miss, cache contention.
Loop-carried dependency chain.
Instruction fetching time.
Instruction decoding time.
Instruction retirement.
Register read stalls.
Execution port throughput.
Execution unit throughput.
Suboptimal reordering and scheduling of micro-ops.
Branch misprediction.
Floating point exception.
I understood all except "register read stalls".
Question: Can anybody tell me how this happens in the CPU, and what the "superscalar" form of "out-of-order execution" is?
Normal "out of order" execution seemed logical, but I couldn't find a clear explanation of the "superscalar" form.
Question 2: Can someone also point me to a good instruction list for SSE, SSE2, and newer CPUs, preferably with micro-op tables, port throughputs, execution units, and latency tables, so I can find the real bottleneck of a piece of code?
I would be happy with a small example like this:
// loop-carried dependency chain breaking:
__asm
{
loop_begin:
    ....
    ....
    sub edx, 05h    // rather than computing i*5 each iteration, we subtract 5 each iteration
    sub ecx, 01h    // i-- counter
    ...
    ...
    jnz loop_begin  // edit: the sub ecx must come after the sub edx so that jnz uses the counter's flags
}
// the sub edx gets rid of a multiplication and is also independent of ecx, breaking the dependency
Thank you.
Computer: Pentium M 2 GHz, Windows XP 32-bit
You should take a look at Agner Fog's optimization manuals: Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms or Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
But to really be able to outsmart a modern compiler, you need some good background knowledge of the arch you want to optimize for: The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
My two cents: Intel Architecture Developers Manuals
Really detailed; they cover all the SSE instructions as well, with opcodes, instruction latency and throughput, and all the gory details you might need :)
The "superscalar" stalls is an added problem for scheduling instructions. A modern processor can not only execute instructions out of order, it can also do 3-4 simple instructions at a time, using parallel execution units.
But to actually do that, the instructions must be sufficiently independent of each other. If, for example, one instruction uses the result of a previous instruction, it must wait for that result to be available.
In practice, this makes creating an optimal assembly program by hand extremely difficult. You really have to be like a computer (compiler) to calculate the optimal order of the instructions. And if you change one instruction, you have to do it all over again....
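A C++-level illustration of that independence requirement is summing an array with one accumulator versus several. This is only a sketch; the actual speedup depends on the CPU and on the floating-point add latency.

#include <cstddef>

// One accumulator: every add depends on the previous one, so the loop runs at
// the latency of a floating-point add per element.
double sum_serial(const double* a, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Four independent accumulators: a superscalar, out-of-order core can keep
// several adds in flight at once. (Assumes n is a multiple of 4 for brevity.)
double sum_ilp(const double* a, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}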
For question #1 I would highly recommend Computer Architecture: A Quantitative Approach. It does a very good job of explaining the concepts in context, so you can see the big picture. The examples are also very useful for a person who is interested in optimizing code, because they always focus on prioritizing and improving the bottleneck.
Newer ARM processors include the PLD and PLI instructions.
I'm writing tight inner loops (in C++) which have a non-sequential memory access pattern, but a pattern that my code, naturally, fully understands. I would anticipate a substantial speedup if I could prefetch the next location while processing the current one, and it seems quick enough to try out that the experiment is worth doing!
I'm using a recent, expensive compiler from ARM, and it doesn't seem to include PLD instructions anywhere, let alone in this particular loop that I care about.
How can I include explicit prefetch instructions in my C++ code?
There should be some compiler-specific features for this; there is no standard way to do it in C/C++. Check your compiler's reference guide. For the RealView compiler, see this or this.
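For example, GCC and Clang provide __builtin_prefetch, which on ARM typically compiles to a PLD. A sketch, where the order array and do_work() are hypothetical placeholders for your real access pattern and loop body:

// Sketch: software prefetch with GCC/Clang's __builtin_prefetch.
void do_work(float x);

void process(const float* data, const int* order, int n)
{
    for (int i = 0; i < n; ++i) {
        if (i + 4 < n)
            __builtin_prefetch(&data[order[i + 4]]);   // prefetch a few iterations ahead
        do_work(data[order[i]]);
    }
}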
If you are trying to extract truly maximum performance from these loops, then I would recommend writing the entire looping construct in assembler. You should be able to use inline assembly, depending on the data structures involved in your loop. Even better if you can unroll any piece of your loop (like the parts involved in making the access non-sequential).
At the risk of asking the obvious: have you verified the compiler's target architecture? For example (humor me), if by default the compiler is targeted to ARM7, you're never going to see the PLD instruction.
It is not outside the realm of possibility that other optimizations like software pipelining and loop unrolling may achieve the same effect as your prefetching idea (hiding the latency of the loads by overlapping it with useful computation), but without the extra instruction-cache pressure caused by the extra instructions. I would even go so far as to say that this is the case more often than not for tight inner loops, which tend to have few instructions and little control flow. Is your compiler doing these types of traditional optimizations instead? If so, it may be worth looking at the pipeline diagram to develop a more detailed cost model of how your processor works, and evaluate more quantitatively whether prefetching would help.