x86 assembly instructions optimisation - c++

I'm trying to optimize a block of instructions in a loop, called thousands of times, which is the bottleneck in my algorithm.
This block of code computes the multiplication of N 3x3 matrices (iA array) against N 3-component vectors (iV array) and stores the N results in the oV array. (N is not fixed and is usually between 3000 and 15000.)
Each row of the matrices and each vector is 128-bit aligned (4 floats) to exploit SSE optimisation (the 4th float value is ignored).
C++ code :
__m128* ip = (__m128*)iV;
__m128* op = (__m128*)oV;
__m128* A = (__m128*)iA;
__m128 res1, res2, res3;
int i;
for (i = 0; i < N; i++)
{
    res1 = _mm_dp_ps(*A++, *ip, 0x71);
    res2 = _mm_dp_ps(*A++, *ip, 0x72);
    res3 = _mm_dp_ps(*A++, *ip++, 0x74);
    *op++ = _mm_or_ps(res1, _mm_or_ps(res2, res3));
}
The compiler generates these instructions :
000007FEE7DD4FE0 movaps xmm2,xmmword ptr [rsi] //move "ip" in register
000007FEE7DD4FE3 movaps xmm1,xmmword ptr [rdi+10h] //move second line of A in register
000007FEE7DD4FE7 movaps xmm0,xmmword ptr [rdi+20h] //move third line of A in register
000007FEE7DD4FEB inc r11d //i++
000007FEE7DD4FEE add rbp,10h //op++
000007FEE7DD4FF2 add rsi,10h //ip++
000007FEE7DD4FF6 dpps xmm0,xmm2,74h //dot product of 3rd line of A against ip
000007FEE7DD4FFC dpps xmm1,xmm2,72h //dot product of 2nd line of A against ip
000007FEE7DD5002 orps xmm0,xmm1 //"merge" of the result of the two dot products
000007FEE7DD5005 movaps xmm3,xmmword ptr [rdi] //move first line of A in register
000007FEE7DD5008 add rdi,30h //A+=3
000007FEE7DD500C dpps xmm3,xmm2,71h //dot product of 1st line of A against ip
000007FEE7DD5012 orps xmm0,xmm3 //"merge" of the result
000007FEE7DD5015 movaps xmmword ptr [rbp-10h],xmm0 //move result in memory (op)
000007FEE7DD5019 cmp r11d,dword ptr [rbx+28h] //compare i
000007FEE7DD501D jl MyFunction+370h (7FEE7DD4FE0h) //loop
I'm not very familiar with low-level optimisations, so the question is: do you see any possible optimisations if I write the assembly code myself?
For example, will it run faster if I change:
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
to
movaps xmmword ptr [rbp],xmm0
add rbp,10h
I have also read that the ADD instruction is faster than INC...

Calculating an indirect address with an offset, such as rbp-10h, is very cheap, because there is special hardware for these sorts of calculations in the "effective address calculation" unit (usually called the address generation unit, or AGU).
There is, however, a dependency between the add rbp,10h and [rbp-10h], which could possibly cause a problem - but I doubt it in this particular case. In your code there is a long distance between the add and the store that uses rbp-10h, so it's not an issue. The compiler is probably placing the add that far up because it's "free" at that point, since the processor will be waiting for the data read earlier from memory into the xmm registers to arrive. In other words, any work we can stick between the reads of xmm0, xmm1 and xmm2 at the beginning of the loop and the dpps instructions using xmm0, xmm1 and xmm2 will be beneficial, because the processor will be waiting for that data to "arrive" before it can compute the dpps result.

I've done lots of x86 assembly optimizations, and I can tell you it was a great learning experience. It taught me a lot about how compilers work as well, and the biggest thing I learned was that compilers are in general pretty good at what they do. I know that's a flippant comment, but it is true...
I also learned that optimizations you make can have a positive result on one processor family, and a negative result on another processor family. Things like pipelining, branch prediction, and processor cache play a huge role... so unless you're targeting a very specific hardware configuration, be careful about assumptions regarding improvements you make.
To your specific question about reordering the add to remove the rbp-10h offset... it looks like an obvious improvement, and you'd have to verify by reading the instruction manual, but I'd guess the -10h memory offset comes for free in that instruction. Moving the add may also throw off a pipelined instruction and actually cost a clock cycle. You'd have to experiment.

There are a few things you could do to the above code to improve it. Generally, using a value after it has been altered incurs a processor stall as it waits for the result. So these lines would incur a penalty:-
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
but in the code snippet above those two lines are quite far apart, so that isn't really an issue. As others have already said, the rbp-10h is 'free' in that the address calculation hardware handles it.
You could move the movaps xmm3,xmmword ptr [rdi] up a line and maybe rearrange a couple of other lines.
Would it be worth it?
NO
You'd be lucky to see any real performance gain from any of this because your algorithm is
<blink> memory bandwidth limited </blink>*
which means that the time taken to read the data from RAM into the CPU is greater than the time it takes the CPU to do its processing. At worst, reading a memory address can involve a page fault and a disk read. The prefetch instructions won't help either; it's called 'Streaming SIMD Extension' because it's optimised to stream data into the CPU (the memory interface can handle four separate streams IIRC).
If you were doing a lot of computation on a small set of data (an FFT perhaps) then you could gain a lot from hand-crafting the assembler. But your algorithm is quite simple so the CPU is idling a lot of the time waiting for the data to arrive. If you know the speed of your RAM you could work out the maximum throughput for the algorithm and use that to compare against what it's currently achieving (you'll never reach the maximum theoretical throughput though).
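To make that concrete, here is a minimal back-of-the-envelope sketch, assuming a hypothetical 20 GB/s of sustained RAM bandwidth (substitute your own measured figure). Each iteration reads 3 matrix rows and one input vector and writes one result, all 16 bytes each:
#include <cstdio>

int main()
{
    const double bandwidth = 20e9;                 // assumed sustained bytes/s, not measured
    const double bytes_per_iter = 3*16 + 16 + 16;  // 80 B of traffic per matrix*vector product
    std::printf("upper bound: ~%.0f million products per second\n",
                bandwidth / bytes_per_iter / 1e6); // ~250 million/s for the assumed bandwidth
    return 0;
}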
There are things you can do to minimise the memory stalling, and it's a higher level change rather than fiddling with individual instructions (often, optimising the algorithms gets better results). The simplest is to double buffer the input data. Divide the register set into two groups (possible to do here as you're only using four of the SIMD registers):-
load set 1
mainloop:
load set 2
do processing on set 1
save set 1 result
load set 1
do processing on set 2
save set 2 result
goto mainloop
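Here is a minimal C++ sketch of that idea applied to the loop from the question, assuming the same iA/iV/oV layout. The loads for iteration i+1 are issued before the dot products for iteration i, so the CPU has useful work while it waits for memory; treat it as an illustration, not a tuned implementation:
#include <smmintrin.h>   // _mm_dp_ps (SSE4.1)

void mul3x3_batch(const float* iA, const float* iV, float* oV, int N)
{
    if (N <= 0) return;
    const __m128* A  = (const __m128*)iA;
    const __m128* ip = (const __m128*)iV;
    __m128*       op = (__m128*)oV;

    // "set 1": data for the current iteration, loaded ahead of time
    __m128 v = ip[0], a0 = A[0], a1 = A[1], a2 = A[2];

    for (int i = 0; i < N; i++)
    {
        // start loading "set 2" (the next iteration) before using set 1
        int nxt = (i + 1 < N) ? i + 1 : i;       // avoid reading past the arrays
        __m128 vN  = ip[nxt];
        __m128 a0N = A[3*nxt], a1N = A[3*nxt + 1], a2N = A[3*nxt + 2];

        // process set 1 and store its result
        op[i] = _mm_or_ps(_mm_dp_ps(a0, v, 0x71),
                _mm_or_ps(_mm_dp_ps(a1, v, 0x72),
                          _mm_dp_ps(a2, v, 0x74)));

        // set 2 becomes set 1 for the next iteration
        v = vN; a0 = a0N; a1 = a1N; a2 = a2N;
    }
}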
Hopefully that's given you some ideas. Even if it doesn't go much faster, it's still an interesting exercise and you can learn a lot from it.
RIP blink.

Related

Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

While I experimented with measuring the execution time of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block with two arithmetic operations in the for loop body. Here is the code I ended up testing:
#include <iostream>
#include <chrono>
#include <cstdint>

#define NUM_ITERATIONS 100000000

int main()
{
    // Block 1: one operation in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();
        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    // Block 2: two operations in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();
        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    return 0;
}
I tested this with different levels of code optimization (-O0, -O1, -O2, -O3), with different online compilers (for example onlinegdb.com), on my work machine, on my home PC and laptop, on a Raspberry Pi and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), changed integer types. But I always got a similar result: the block with one line in the loop is SLOWER than the block with two lines:
1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000
I checked the assembly output on https://godbolt.org/ and everything looked as I expected: the second block just had one more operation in the assembly output.
Three operations always behaved as expected: they are slower than one and faster than four. So why do two operations produce such an anomaly?
Edit:
Let me repeat: I see this behaviour on all of my Windows and Unix machines with unoptimized code. I looked at the assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. In any case, if the loop is optimized away, there is nothing left in the code that I am asking about. I added the note about optimizations to the question to avoid "do not measure unoptimized code" answers, because optimizations are not what I am asking about. The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in execution time is 5-25% in my tests (quite noticeable).
This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into the loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.
On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.
This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.
This is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.
Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
See @rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.
With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it does just stay in a register. Unfortunately @SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower-latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as the desktop version of the same generation.)
Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result
I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.
(later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.
So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.
See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.
The loop with optimization enabled:
As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:
orig_main:
...
call std::chrono::_V2::system_clock::now() # demangled C++ symbol name
mov rbp, rax # save the return value = start
call std::chrono::_V2::system_clock::now()
# end in RAX
So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.
With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:
#include <benchmark/benchmark.h>

static void TargetFunc(benchmark::State& state) {
    uint64_t x2 = 0, y2 = 0;
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        benchmark::DoNotOptimize(x2 += 31);
        benchmark::DoNotOptimize(y2 += 31);
    }
}
// Register the function as a benchmark
BENCHMARK(TargetFunc);
# just the main loop, from gcc10.1 -O3
.L7: # do{
add rax, 31 # x2 += 31
add rdx, 31 # y2 += 31
sub rbx, 1
jne .L7 # }while(--count != 0)
I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.
For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.
static void TargetFunc(benchmark::State& state) {
    uint64_t x2 = 0, y2 = 0;
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        x2 += 16;
        y2 += 17;
        asm volatile("" : "+r"(x2), "+r"(y2));
    }
}
All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see @rcgldr's answer.
Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.
You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g. x + (y<<2) + 4 can compile to a single LEA instruction for x86.
The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away
TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.
Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).
ETA: This was a guess, and Peter Cordes has made a very good argument about why it's incorrect. Go upvote Peter's answer.
I'm leaving my answer here because some found the information useful. Though this doesn't correctly explain the behavior seen in the OP, it highlights some of the issues that make it infeasible (and meaningless) to try to measure the speed of a particular instruction on a modern processor.
Educated guess:
It's the combined effect of pipelining, powering down portions of a core, and dynamic frequency scaling.
Modern processors pipeline so that multiple instructions can be executing at the same time. This is possible because the processor actually works on micro-ops rather than the assembly-level instructions we usually think of as machine language. Processors "schedule" micro-ops by dispatching them to different portions of the chip while keeping track of the dependencies between the instructions.
Suppose the core running your code has two arithmetic/logic units (ALUs). A single arithmetic instruction repeated over and over requires only one ALU. Using two ALUs doesn't help because the next operation depends on completion of the current one, so the second ALU would just be waiting around.
But in your two-expression test, the expressions are independent. To compute the next value of y, you do not have to wait for the current operation on x to complete. Now, because of power-saving features, that second ALU may be powered down at first. The core might run a few iterations before realizing that it could make use of the second ALU. At that point, it can power up the second ALU and most of the two-expression loop will run as fast as the one-expression loop. So you might expect the two examples to take approximately the same amount of time.
Finally, many modern processors use dynamic frequency scaling. When the processor detects that it's not running hard, it actually slows its clock a little bit to save power. But when it's used heavily (and the current temperature of the chip permits), it might increase the actual clock speed as high as its rated speed.
I assume this is done with heuristics. In the case where the second ALU stays powered down, the heuristic may decide it's not worth boosting the clock. In the case where two ALUs are powered up and running at top speed, it may decide to boost the clock. Thus the two-expression case, which should already be just about as fast as the one-expression case, actually runs at a higher average clock frequency, enabling it to complete twice as much work in slightly less time.
Given your numbers, the difference is about 14%. My Windows machine idles at about 3.75 GHz, and if I push it a little by building a solution in Visual Studio, the clock climbs to about 4.25GHz (eyeballing the Performance tab in Task Manager). That's a 13% difference in clock speed, so we're in the right ballpark.
I split up the code into C++ and assembly. I just wanted to test the loops, so I didn't return the sum(s). I'm running on Windows, the calling convention is rcx, rdx, r8, r9, the loop count is in rcx. The code is adding immediate values to 64 bit integers on the stack.
I'm getting similar times for both loops, less than 1% variation, same or either one up to 1% faster than the other.
There is an apparent dependency factor here: each add to memory has to wait for the prior add to the same location to complete, so two adds to different memory locations can be performed essentially in parallel.
Changing test2 to do 3 adds to memory ends up about 6% slower, and 4 adds to memory 7.5% slower.
My system is Intel 3770K 3.5 GHz CPU, Intel DP67BG motherboard, DDR3 1600 9-9-9-27 memory, Win 7 Pro 64 bit, Visual Studio 2015.
.code
public test1
align 16
test1 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst10: add qword ptr[rsp+8],17
dec rcx
jnz tst10
add rsp,16
ret
test1 endp
public test2
align 16
test2 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst20: add qword ptr[rsp+0],17
add qword ptr[rsp+8],-37
dec rcx
jnz tst20
add rsp,16
ret
test2 endp
end
I also tested with add immediate to register, 1 or 2 registers within 1% (either could be faster, but we'd expect them both to execute at 1 iteration / clock on Ivy Bridge, given its 3 integer ALU ports; What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?).
3 registers 1.5 times as long, somewhat worse than the ideal 1.333 cycles / iterations from 4 uops (including the loop counter macro-fused dec/jnz) for 3 back-end ALU ports with perfect scheduling.
4 registers, 2.0 times as long, bottlenecked on the front-end: Is performance reduced when executing loops whose uop count is not a multiple of processor width?. Haswell and later microarchitectures would handle this better.
.code
public test1
align 16
test1 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst10: add rdx,17
dec rcx
jnz tst10
ret
test1 endp
public test2
align 16
test2 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst20: add rdx,17
add r8,-37
dec rcx
jnz tst20
ret
test2 endp
public test3
align 16
test3 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst30: add rdx,17
add r8,-37
add r9,47
dec rcx
jnz tst30
ret
test3 endp
public test4
align 16
test4 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst40: add rdx,17
add r8,-37
add r9,47
add r10,-17
dec rcx
jnz tst40
ret
test4 endp
end
@PeterCordes proved this answer to be wrong in many assumptions, but it could still be useful as a blind research attempt at the problem.
I set up some quick benchmarks, thinking it may somehow be connected to code memory alignment, truly a crazy thought.
But it seems that @Adrian McCarthy got it right with the dynamic frequency scaling.
Anyway, the benchmarks show that inserting some NOPs could help with the issue, with 15 NOPs after the x+=31 in Block 1 leading to nearly the same performance as Block 2. Truly mind-blowing how 15 NOPs in a single-instruction loop body increase performance.
http://quick-bench.com/Q_7HY838oK5LEPFt-tfie0wy4uA
I also tried -Ofast, thinking compilers might be smart enough to throw away such inserted NOPs, but that seems not to be the case.
http://quick-bench.com/so2CnM_kZj2QEWJmNO2mtDP9ZX0
Edit: Thanks to @PeterCordes it was made clear that optimizations were never working quite as expected in the benchmarks above (as a global variable required add instructions to access memory); the new benchmark http://quick-bench.com/HmmwsLmotRiW9xkNWDjlOxOTShE clearly shows that Block 1 and Block 2 performance is equal for stack variables. But NOPs could still help a single-threaded application with a loop accessing a global variable, which you probably should not do in that case anyway - just copy the global into a local for the loop and assign it back afterwards (see the sketch below).
Edit 2: Actually optimizations never worked due to the quick-benchmark macros making variable access volatile, preventing important optimizations. It is only logical to load the variable once, as we are only modifying it in the loop, so the volatile access (or disabled optimizations) is the bottleneck. So this answer is basically wrong, but at least it shows how NOPs could speed up unoptimized code execution, if that makes any sense in the real world (there are better ways, like bucketing counters).
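A sketch of that last suggestion, with a hypothetical global counter purely for illustration: copy the global into a local for the hot loop so the compiler can keep it in a register at -O1 and above, and write it back once afterwards.
#include <cstdint>

int64_t counter;                    // hypothetical global that the rest of the program reads

void hot_loop(int64_t n)
{
    int64_t local = counter;        // load the global once
    for (int64_t i = 0; i < n; i++)
        local += 31;                // stays in a register with optimization enabled
    counter = local;                // store once, after the loop
}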
Processors are so complex these days that we can only guess.
The assembly emitted by your compiler is not what is really executed. The microcode/firmware/whatever of your CPU will interpret it and turn it into instructions for its execution engine, much like JIT languages such as C# or Java do.
One thing to consider here is that for each loop, there are not 1 or 2 instructions, but n + 2, as you also increment and compare i against your number of iterations. In the vast majority of cases it wouldn't matter, but here it does, as the loop body is so simple.
Let's see the assembly :
Some defines:
#define NUM_ITERATIONS 1000000000ll
#define X_INC 17
#define Y_INC -31
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM :
mov QWORD PTR [rbp-32], 0
.L13:
cmp QWORD PTR [rbp-32], 999999999
jg .L12
add QWORD PTR [rbp-24], 17
add QWORD PTR [rbp-32], 1
jmp .L13
.L12:
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=X_INC; y+=Y_INC;}
ASM:
mov QWORD PTR [rbp-80], 0
.L21:
cmp QWORD PTR [rbp-80], 999999999
jg .L20
add QWORD PTR [rbp-64], 17
sub QWORD PTR [rbp-72], 31
add QWORD PTR [rbp-80], 1
jmp .L21
.L20:
So both assemblies look pretty similar. But then let's think twice: modern CPUs have ALUs which operate on values which are wider than their register size. So there is a chance that in the first case, the operations on x and i are done on the same computing unit. But then you have to read i again, as you put a condition on the result of this operation. And reading means waiting.
So, in the first case, to iterate on x, the CPU might have to be in sync with the iteration on i.
In the second case, maybe x and y are treated on a different unit than the one dealing with i. So in fact, your loop body runs in parallel with the condition driving it. And there goes your CPU computing and computing until someone tells it to stop. It doesn't matter if it goes too far, going back a few loops is still fine compared to the amount of time it just gained.
So, to compare what we want to compare (one operation vs two operations), we should try to get i out of the way.
One solution is to completely get rid of it by using a while loop:
C/C++:
while (x < (X_INC * NUM_ITERATIONS)) { x+=X_INC; }
ASM:
.L15:
movabs rax, 16999999999
cmp QWORD PTR [rbp-40], rax
jg .L14
add QWORD PTR [rbp-40], 17
jmp .L15
.L14:
Another one is to use the antiquated "register" C keyword:
C/C++:
register long i;
for (i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM:
mov ebx, 0
.L17:
cmp rbx, 999999999
jg .L16
add QWORD PTR [rbp-48], 17
add rbx, 1
jmp .L17
.L16:
Here are my results:
x1 for: 10.2985 seconds. x,y = 17000000000,0
x1 while: 8.00049 seconds. x,y = 17000000000,0
x1 register-for: 7.31426 seconds. x,y = 17000000000,0
x2 for: 9.30073 seconds. x,y = 17000000000,-31000000000
x2 while: 8.88801 seconds. x,y = 17000000000,-31000000000
x2 register-for :8.70302 seconds. x,y = 17000000000,-31000000000
Code is here: https://onlinegdb.com/S1lAANEhI

Same operations taking different time

I am in the process of optimizing my code for my n-body simulator, and when profiling my code I have seen this:
These two lines,
float diffX = (pNode->CenterOfMassx - pBody->posX);
float diffY = (pNode->CenterOfMassy - pBody->posY);
Where pNode is a pointer to an object of type Node that I have defined, and contains (among other things) 2 floats, CenterOfMassx and CenterOfMassy,
Where pBody is a pointer to an object of type Body that I have defined, and contains (among other things) 2 floats, posX and posY.
Should take the same amount of time, but do not. In fact the first line accounts for 0.46% of function samples, but the second accounts for 5.20%.
Now I can see the second line has 3 instructions, and the first only has one.
My question is: why do these lines seemingly do the same thing, but in practice take such different times?
As previously stated, the profiler is listing only one assembly instruction for the first line, but three for the second line. However, because the optimizer can move code around a lot, this isn't very meaningful. It looks like the code is optimized to load all of the values into registers first, and then perform the subtractions. So it performs an action from the first line, then an action from the second line (the loads), followed by an action from the first line and an action from the second line (the subtractions). Since this is difficult to represent, it just does a best approximation of which line corresponds to which assembly code when displaying the disassembly inline with the code.
Take note that the first load is executed and may still be in the CPU pipeline when the next load instruction is executing. The second load has no dependence on the registers used in the first load. However, the first subtraction does. This instruction requires the previous load instruction to be far enough in the pipeline that the result can be used as one of the operands of the subtraction. This will likely cause a stall in the CPU while the pipeline lets the load finish.
All of this really reinforces the concept of memory optimizations being more important on modern CPUs than CPU optimizations. If, for example, you had loaded the required values into registers 15 instructions earlier, the subtractions might have occurred much quicker.
Generally the best thing you can do for optimizations is to keep the cache fresh with the memory you are going to be using, and make sure it gets loaded as soon as possible, not right before the memory is needed. Beyond that, optimizations are complicated.
Of course, all of this is further complicated by modern CPUs that might look ahead 40-60 instructions for out of order execution.
To optimize this further, you might consider using a library that does vector and matrix operations in an optimized manner. Using one of these libraries, it might be possible to use two vector instructions instead of 4 scalar instructions.
According to the expanded assembly, the instructions are reordered to load pNode's data members before performing the subtraction with pBody's data members. The purpose may be to take advantage of memory caching.
Thus, the execution order is not the same as in the C code anymore. Comparing 1 movss, which is attributed to the 1st C statement, with 1 movss + 2 subss, which are attributed to the 2nd C statement, is unfair.
Performance counters aren't cycle-accurate. Sometimes the wrong instruction gets blamed. But in this case, it's probably pointing the finger at the instruction that was producing the result everything else was waiting for.
So probably it ran out of things it could do while waiting for the result of the memory access and FP sub. If cache misses are happening, look for ways to structure your code for better memory locality, or at least for memory access to happen in-order. Hardware prefetchers can detect sequential access patterns up to some limit of stride length.
Also, your compiler could have vectorized this. It loads two scalars from sequential addresses, then subtracts two scalars from sequential addresses. It would be faster to
movq xmm0, [esi+30h] # or movlps, but that wouldn't break the dep
movq xmm1, [edi] # on the old value of xmm0 / xmm1
subps xmm0, xmm1
That leaves diffX and diffY in element 0 and 1 of xmm0, rather than 2 different regs, so the benefit depends on the surrounding code.
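For reference, a hedged intrinsics sketch of the same movq/subps idea, assuming the two float members are adjacent in each struct; the minimal types here are placeholders for the real Node and Body:
#include <xmmintrin.h>

struct Node { float CenterOfMassx, CenterOfMassy; /* other members */ };
struct Body { float posX, posY; /* other members */ };

static __m128 diff_xy(const Node* pNode, const Body* pBody)
{
    __m128 com = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)&pNode->CenterOfMassx);
    __m128 pos = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)&pBody->posX);
    return _mm_sub_ps(com, pos);    // diffX in element 0, diffY in element 1
}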

Why would a compiler generate this assembly?

While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:
void QMainWindowLayout::invalidate()
{
    QLayout::invalidate();
    minSize = szHint = QSize();
}
It is compiled to this:
<invalidate()> push %rbx
<invalidate()+1> mov %rdi,%rbx
<invalidate()+4> callq 0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
<invalidate()+43> pop %rbx
<invalidate()+44> retq
The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 to %rbx+0x564 and %rbx+0x568, but then it loads that -1 from %rbx+0x564 back into a register just to write it out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.
So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?
(Note: This code is from the normal release library build shipped by ubuntu, so it was presumably compiled by GCC in optimize mode. The minSize and szHint variables are normal variables of type QSize.)
Not sure you're correct when you say it's stupid. I think the compiler might be trying to optimize for code size here. There is no 64-bit immediate-to-memory mov instruction, so the compiler would have to generate 2 more mov-immediate instructions just like the first two. Each of those would be 10 bytes, whereas the load/store pair it generated instead is 14 bytes. The location has just been written to, so there is most likely no memory latency, and I do not think you'll take any performance hit here.
The code is "less than perfect".
For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:
00000000 31C0 xor eax,eax
00000002 48F7D0 not rax
00000005 48898364050000 mov [rbx+0x564],rax
0000000C 4889836C050000 mov [rbx+0x56c],rax
;Note: XOR above clears RAX due to zero extension
For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:
xor eax,eax
not rax ;Must wait until previous instruction finishes
mov [rbx+0x564],rax ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax ;Must wait until "not" finishes
For performance you want to do this:
00000000 48C7C0FFFFFFFF mov rax,0xffffffff
00000007 C78364050000FFFFFFFF mov dword [rbx+0x564],0xffffffff
00000011 C78368050000FFFFFFFF mov dword [rbx+0x568],0xffffffff
0000001B C7836C050000FFFFFFFF mov dword [rbx+0x56c],0xffffffff
00000025 C78370050000FFFFFFFF mov dword [rbx+0x570],0xffffffff
;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension
This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (45 bytes).
If you try to get a balance between code size and performance; then you could hope that the first instruction (that sets the value in RAX) completes before the last instruction/s needs to know the value in RAX. This might be something like this:
mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov dword [rbx+0x56c],rax
This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.
Now; let's look at the original code and see why it is bad:
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564] ;Massive problem
mov [rbx+0x56C],rax ;Depends on previous instruction
Modern CPUs do have something called "store forwarding", where writes are stored in a buffer and future reads can get the value from this buffer to avoid reading the value from cache. Ironically, this only works if the size of the read is smaller than or equal to the size of the write. The "store forwarding" will not work for this code as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache and then has to read the value from cache; which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything) so that's another problem.
I'd break down the lines like this (I think several of them comment on the same steps):
These two lines come from the inline definition of QSize() http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h
which sets each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint, which is also set at the same time.
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
These lines are finally setting minSize using 64-bit operations, because the compiler now knows the size of a QSize object. And the address of minSize is 0x56c(%rbx).
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
Note: the first part is setting two separate fields, and the next part is copying a QSize object (regardless of content). The question then is: should the compiler be smart enough to build a compound 64-bit value because it saw the preset values just earlier? Not sure about that...
In addition to Guillaume's answer, the 64 bit load/store is not aligned. But according to the Intel optimization guide (p 3-62)
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.
Which imo implies that an unaligned load/store that does not cross a cache line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cacheline.
Normally Intel processors use store-to-load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load that spans several smaller stores does not store-to-load forward but stalls: (p 3-66, p 3-68)
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of
a load which is forwarded from a store must be completely contained
within the store data.
; A. Large load stall
mov mem, eax ; Store dword to address "MEM"
mov mem + 4, ebx ; Store dword to address "MEM + 4"
fld mem ; Load qword at address "MEM", stalls
So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?
EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.

8-bit FFT for CPU architectures?

I am looking for an FFT engine that can handle 8-bit real to complex transforms (of size 65K). The need for this is to accelerate a real-time signal processing engine. It is currently limited by 8-bit -> FP32 and FP32 -> 8-bit conversions, as well as the actual FFT being memory bandwidth bound (we're using FFTW at the moment).
I thought that the Spiral project might be able to do this http://spiral.net, but the only code that seems to be available on their webpage is for single- or double-precision transforms.
Anyone know of any C or C++ libraries that can do this?
Some time ago I encountered the same problem. FFTW for my data frame executed in 14 ms (forward, some calculations, and backward), while a straightforward byte (or short) to float array conversion took 12-19 ms. So I made an SSE function to convert bytes to floats (4 elements per loop iteration), and got a significant speed gain - now the conversion is accomplished in 2.2-5 ms.
If your compiler can use autovectorization, try it first.
If not, write a simple conversion function with intrinsics.
I've used inline assembler (a MOVD, PUNPCKLBW, PUNPCKLWD, CVTDQ2PS, MOVAPS instruction sequence).
procedure BytesToSingles(Src, Dst: Pointer; Count: Integer);
asm
//EAX = Src pointer to byte array
//EDX = Dst pointer to float array !!! 16 byte-aligned !!!
//ECX = Count (multiple of four)
SHR ECX, 2 // 4 elements per cycle
JZ @@Exit
PXOR XMM7, XMM7 // zeros
@@Cycle:
MOVD XMM1, [EAX] // load 4 bytes
PUNPCKLBW XMM1, XMM7 // unpack to words
PUNPCKLWD XMM1, XMM7 // words to int32
CVTDQ2PS XMM0, XMM1 // convert integers to 4 floats
MOVAPS [EDX], XMM0 // store 4 floats to destination array
ADD EAX, 4 // move array pointers
ADD EDX, 16
LOOP @@Cycle
@@Exit:
end;
Note that an FFT implementation on 8-bit data will suffer from numerical error issues, as Paul R wrote in a comment.
You do not want to do all the processing in fixed point. Your data will turn to mush in an FFT of that size. Technically, you could use 32-bit fixed point and keep all your dynamics, but you'd still have to convert the data and it will be slower than using floats (you tagged SSE, so I assume you are on an Intel machine with an FPU). I base my opinions on my work creating kissfft.
Focus instead on speeding up the type conversion.
I've not run MBo's assembly code, but it looks like the right approach. I think unrolling might make it faster.
If you are not accustomed to assembly, use SSE2 compiler intrinsics instead. It will be just as fast (assuming a decent compiler) and it will make your code more readable and maintainable. This answer will give you most of what you need.
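As a rough illustration, here is a hedged SSE2-intrinsics sketch of the same MOVD/PUNPCKLBW/PUNPCKLWD/CVTDQ2PS/MOVAPS sequence shown above, assuming count is a multiple of 4 and dst is 16-byte aligned:
#include <emmintrin.h>
#include <cstdint>
#include <cstring>

void bytes_to_floats(const uint8_t* src, float* dst, int count)
{
    const __m128i zero = _mm_setzero_si128();
    for (int i = 0; i < count; i += 4)
    {
        int32_t packed;
        std::memcpy(&packed, src + i, 4);             // load 4 bytes (like MOVD)
        __m128i v = _mm_cvtsi32_si128(packed);
        v = _mm_unpacklo_epi8(v, zero);               // bytes -> 16-bit words
        v = _mm_unpacklo_epi16(v, zero);              // words -> 32-bit ints
        _mm_store_ps(dst + i, _mm_cvtepi32_ps(v));    // ints  -> floats (aligned store)
    }
}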

Assembly Performance Tuning

I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).
I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:
#include "stdafx.h"
#include <intrin.h>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
__int64 startval;
__int64 stopval;
unsigned int value; // Keep the value to keep from it being optomized out
startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode
// Simple Math: a = (a << 3) + 0x0054E9
_asm {
mov ebx, 0x1E532 // Seed
shl ebx, 3
add ebx, 0x0054E9
mov value, ebx
}
stopval = __rdtsc();
__int64 val = (stopval - startval);
cout << "Result: " << value << " -> " << val << endl;
int i;
cin >> i;
return 0;
}
I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples the number is still impossibly varied.
I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.
Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program the Z80 for my TI calculator I had a tool where I could select some assembly and it would tell me how many clock cycles it took to execute the code -- can that not be done with our new-fangled modern processors?
EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.
To even have a hope of repeatable, deterministic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.
You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.
Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.
As such, the normal timing sequence for your code would be something like this:
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID ; Intel says by the third execution, the timing will be stable.
RDTSC ; read the clock
push eax ; save the start time
push edx
mov ebx, 0x1E532 // Seed // execute test sequence
shl ebx, 3
add ebx, 0x0054E9
mov value, ebx
XOR EAX, EAX ; serialize
CPUID
rdtsc ; get end time
pop ecx ; get start time back
pop ebp
sub eax, ebp ; find end-start
sbb edx, ecx
We're starting to get close, but there's one last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.
Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.
I'd suggest taking a look at Agner Fog's "Software optimization resources" - in particular, the assembly and microarchitecture manuals (2 and 3), and the test code, which includes a rather more sophisticated framework for measurements using the performance monitor counters.
The Z80, and possibly the TI, had the advantage of synchronized memory access, no caches, and in-order execution of the instructions. That made it a lot easier to calculate the number of clocks per instruction.
On current x86 CPUs, instructions using AX or EAX are not faster per se, but some instructions might be shorter than the instructions using other registers. That might just save a byte in the instruction cache!
Go here and download the Architectures Optimization Reference Manual.
There are many myths. I think the EAX claim is one of them.
Also note that you can't talk anymore about 'which instruction is faster'. On today's hardware there is no one-to-one relation between instructions and execution time. Some instructions are preferred to others not because they are 'faster' but because they break dependencies between other instructions.
I believe that if there's a difference nowadays it will only be because some of the legacy instructions have a shorter encoding for the variant that uses EAX. To test this, repeat your test case a million times or more before you compare cycle counts.
You're getting ridiculous variance because rdtsc does not serialize execution. Depending on inaccessible details of the execution state, the instructions you're trying to benchmark may in fact be executed entirely before or after the interval between the rdtsc instructions! You will probably get better results if you insert a serializing instruction (such as cpuid) immediately after the first rdtsc and immediately before the second. See this Intel tech note (PDF) for gory details.
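A minimal sketch of that serialize-then-time pattern using MSVC intrinsics (the same environment as the question's code); it mirrors the CPUID/RDTSC sequence shown in the earlier answer:
#include <intrin.h>

unsigned __int64 time_region()
{
    int regs[4];
    __cpuid(regs, 0);                  // serialize before the first timestamp
    unsigned __int64 start = __rdtsc();

    // ... code under test goes here ...

    __cpuid(regs, 0);                  // serialize again before the second timestamp
    unsigned __int64 stop = __rdtsc();
    return stop - start;
}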
Starting your program is going to take much longer than running 4 assembly instructions once, so any difference from your assembly will drown in the noise. Running the program many times won't help, but it would probably help if you run the 4 assembly instructions inside a loop, say, a million times. That way the program start-up happens only once.
There can still be variation. One especially annoying thing that I've experienced myself is that your CPU might have a feature like Intel's Turbo Boost where it will dynamically adjust its speed based on things like the temperature of your CPU. This is more likely to be the case on a laptop. If you've got that, then you will have to turn it off for any benchmark results to be reliable.
I think what the article tries to say about the EAX register, is that since some operations can only be performed on EAX, it's better to use it from the start. This was very true with the 8086 (MUL comes to mind), but the 386 made the ISA much more orthogonal, so it's much less true these days.