difference in CPU time for two similar lines

difference in CPU time for two similar lines - c++

There is a while loop in my program, where IterZNext, IterZ are pointers to nodes in a list. The nodes in the list are of type struct with a field called "Index".
double xx = 20.0;
double yy = 10000.0;
double zz;
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
IterZ=IterZNext;
IterZNext = IterZ->Next;
if (IterZNext!=NULL)
{
zz = xx + yy;
NextIndex1 = IterZNext->Index; // line (*)
NextIndex = IterZNext->Index; // line (**)
IterZNext->Index;
}
}
When I profiled my program, I found the line (*)
NextIndex1 = IterZNext->Index;
consumes most of CPU time (2.193s), while the line (**)
NextIndex = IterZNext->Index;
which is all most the same with the line (*) only uses 0.093s. I used the Intel VTune Amplifier to see the assembly of these two lines, which is as follows:
Address Line Assembly CPU Time Instructions Retired
Line (*):
0x1666 561 mov eax, dword ptr [ebp-0x44] 0.015s 50,000,000
0x1669 561 mov ecx, dword ptr [eax+0x8]
0x166c 561 mov dword ptr [ebp-0x68], ecx 2.178s 1,614,000,000
Line (**):
0x166f 562 mov byte ptr [ebp-0x155], 0x1 0.039s 80,000,000
0x1676 562 mov eax, dword ptr [ebp-0x44] 0.027s 44,000,000
0x1679 562 mov ecx, dword ptr [eax+0x8]
0x167c 562 mov dword ptr [ebp-0x5c], ecx 0.026s 94,000,000
If I change the order of the line () and the line (*), then the program changes to
double xx = 20.0;
double yy = 10000.0;
double zz;
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
IterZ=IterZNext;
IterZNext = IterZ->Next;
if (IterZNext!=NULL)
{
zz = xx + yy;
NextIndex = IterZNext->Index; // line (**)
NextIndex1 = IterZNext->Index; // line (*)
IterZNext->Index;
}
}
and the result for assembly changes to
Address Line Assembly CPU Time Instructions Retired
Line (**):
0x1666 560 mov byte ptr [ebp-0x155], 0x1 0.044s 84,000,000
0x166d 560 mov eax, dword ptr [ebp-0x44] 0.006s 2,000,000
0x1670 560 mov ecx, dword ptr [eax+0x8] 0.001s 4,000,000
0x1673 560 mov dword ptr [ebp-0x5c], ecx 1.193s 1,536,000,000
Line (*):
0x1676 561 mov eax, dword ptr [ebp-0x44] 0.052s 128,000,000
0x1679 561 mov ecx, dword ptr [eax+0x8]
0x167c 561 mov dword ptr [ebp-0x68], ecx 0.034s 112,000,000
In this case, line (*) uses most of CPU time (1.245s) while line () only uses 0.086s.
Could someone tell me:
(1) Why does it take so long to make the first assignment? Notice that the line zz=xx+yy only uses 0.058s. Is this related to the cache misses? since all nodes in the list are dynamically genereated.
(2) Why is there huge difference in CPU time between this two lines?
Thanks!

All modern CPUs are superscaler & out-of-order - which means that instructions are not actually executed in the order of the assembly, and there isn't really such thing as the current PC - there are many 10s of instructions in flight and executing at once.
Therefore any sampling information a CPU reports is just a rough area the CPU was executing - it was executing the instruction it indicated when the sampling interrupt went off; but it was also executing all the other in-flight ones!
However people have got used to (and expect) profiling tools to tell them exactly which individual instruction the CPU is currently running - so when the sampling interrupt triggers the CPU essentially picks one of the many active instructions to be the 'current' one.

CPU line caching is probably the reason. accessing [ebp-0x5c] also brings into the cache [ebp-0x68], which then will be fetched much faster (for the second case, vice versa for the first).

It is definitly due to the cache miss. Then bigger miss then more performance penalty will be introduced by processor. Actually in the modern world CPU performs much faster then memory. If nowadays processor can have clock frequency in about 4GHz, memory still run with frequency ~0.3GHz. It is great performance gap that is still continue to grow. Cache introduction was driven by desire to hide this gap. Without cache using modern processor will spend huge amount of time waiting data from the memory and doing nothing at that time. In addition to the performance gap, each memory access incurrs additional latencies related to posible concurrency on memory bus with other CPUs and DMA devices and times required for memory access request processing and routing on the side of the processor memory management logic (checking of the caches of all levels, virtual-to-physical address translation that can involve TLB miss with additional access to memory, request pushing to the memory bus etc.)and memory controller(request routing from the CPU-controller to the controller memory bus, possible waiting for memory bank refresh cycle complition etc.). So, to summarize, raw access to the memory have really big cost in comparision to the L1 cache hit or register access. The difference in cost comparable to the difference in cost to access data in memory and in secondary storage (HDD).
Furthermore, the cost of memory access will grow with moving from the processor to the memory. L2 access will provide bigger penalty then L1 or CPU registers access, L3 access will provide penalty bigger then L2 access and, finally, memory access will provide penalty bigger then memory access. For example, you can compare cost of data access on different levels of memory hierarchy in the next table (captured from http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/6)
Cache/Memory Latency Comparison
-----------------------------------------------------------
| |L1| L2| L3| Main Memory |
-----------------------------------------------------------
|AMD FX-8150 (3.6GHz) | 4| 21| 65| 195 |
-----------------------------------------------------------
|AMD Phenom II X4 975 BE (3.6GHz)| 3| 15| 59| 182 |
-----------------------------------------------------------
|AMD Phenom II X6 1100T (3.3GHz) | 3| 14| 55| 157 |
-----------------------------------------------------------
|Intel Core i5 2500K (3.3GHz) | 4| 11| 25| 148 |
-----------------------------------------------------------
In regard to your particular case:
0x1669 561 mov ecx, dword ptr [eax+0x8]
0x166c 561 mov dword ptr [ebp-0x68], ecx 2.178s 1,614,000,000
0x1670 560 mov ecx, dword ptr [eax+0x8] 0.001s 4,000,000 /* confusing and looks like wrong report for me*/
0x1673 560 mov dword ptr [ebp-0x5c], ecx 1.193s 1,536,000,000
You have penalty on dereferencing of the Index value in the code line.
mov ecx, dword ptr [eax+0x8]
Note that it is first access to the data in each subsequent node of you list, up to this moment you manipulate only by node address, but the data of that address and due to this haven't memory access.
You stated, that you use dynamical list, and this is bad from cache hit rate point of vies. In addition, I suppose that you have big enough list that mean that you will have cache fluded by the previously accessed data (list nodes accessed on previous iterations)and almost always will have cache miss or cache hit only on L3 cache during the accessing Index on each new iteration. But note, that during first access to the Index on each involving cache miss on each new iteration data returned from the memory will be stored in the L1 cache. And when you access Index on the second time during the same cycle iteration, you will have low cost L1 cache hit!
So I hope I provide you detailed answer on both your questions.
In regard to the correctness of the VTune report correctness. I want to advocate Intel VTune developers. Of course, modern processors are very complex devices with number of ILP improving technologies on the board including pipelining, superscalaring, Out-Of-Order execution, branch prediction etc, and of cource this make detailed instruction level performance analysis harder and more precious. But such tools like VTune is developed with that processor features in the mind, and I belive that they are not so stupied to develop and provide the tool or feature that haven't any sence. Furthermore it looks like the developers from the Intel like no another have access to the full understanding of the all processor features details and as no another can take this details into account during profiler designing and development.

Related

imul then mov vs mov then imul - any difference?

If I compile the following C++ program:
int baz(int x) { return x * x; }
in clang 15, I get:
baz(int):
mov eax, edi
imul eax, edi
ret
while gcc 12.2 gives me:
baz(int):
imul edi, edi
mov eax, edi
ret
(See this on GodBolt)
Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.

Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.
This is true in general for mov/and, mov/sub, etc, as long as you don't have a use for the original value. If you do, then sometimes mov to make a copy and then modify the original to hide mov latency for CPUs without move elimination. (mov/add or small shift should normally be lea).
CPU with mov-elimination
mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See How do *move elimination* slots work in Intel CPU?
All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)
It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.
To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.
CPU without mov-elimination
mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.
(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)
But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.
Things that don't matter
On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.
Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.
Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (Why doesn't GCC use partial registers?)
But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

AND operator + addition faster than a subtraction

I've measured the execution time of following codes:
volatile int r = 768;
r -= 511;
volatile int r = 768;
r = (r & ~512) + 1;
assembly:
mov eax, DWORD PTR [rbp-4]
sub eax, 511
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
and ah, 253
add eax, 1
mov DWORD PTR [rbp-4], eax
the results:
Subtraction time: 141ns
AND + addition: 53ns
I've run the snippet multiple times with consistent results.
Can someone explain me why is this the case even tho there is one more line of assembly for AND + addition version?

Your assertion that one snippet is faster than the other is mistaken.
If you look at the code:
mov eax, DWORD PTR [rbp-4]
....
mov DWORD PTR [rbp-4], eax
You'll see that the running time is dominated by the load/store to memory.
Even on Skylake this will take 2+2 = 4 cycles minimum.
The 1 cycles that the sub or the 3*) cycles that the and bytereg/add full reg takes simply disappears into memory access time.
On older processors such as Core2 it takes 5 cycles minimum to do a load/store pair to the same address.
It is difficult to time such short sequences of code and care should be taken to ensure you have the correct methodology.
You also need to remember that rdstc is not accurate on Intel processors and runs out of order to boot.
If you use proper timing code like:
.... x 100,000 //stress the cpu using integercode in a 100,000 x loop to ensure it's running at 100%
cpuid //serialize instruction to make sure rdtscp does not run early.
rdstcp //use the serializing version to ensure it does not run late
push eax
push edx
mov reg1,1000*1000 //time a minimum of 1,000,000 runs to ensure accuracy
loop:
... //insert code to time here
sub reg1,1 //don't use dec, it causes a partial register stall on the flags.
jnz loop //loop
//kernel mode only!
//mov eax,cr0 //reading and writing to cr0 serializes as well.
//mov cr0,eax
cpuid //serialization in user mode.
rdstcp //make sure to use the 'p' version of rdstc.
push eax
push edx
pop 4x //retrieve the start and end times from the stack.
Run the timing code a 100x and take the lowest cycle count.
Now you'll have an accurate count to within 1 or 2 cycles.
You'll want to time an empty loop as well and subtract the times for that so that you can see the net time spend executing the instructions of interest.
If you do this you'll discover that add and sub run at exactly the same speed, just like it does/did in every x86/x64 CPU since the 8086.
This, of course, is also what Agner Fog, the Intel CPU manuals, the AMD cpu manuals, and just about any other source available say.
*) and ah,value takes 1 cycle, then the CPU stalls for 1 cycle due the partial register write and the add eax,value takes another cycle.
Optimized code
sub DWORD PTR [rbp-4],511
Might be faster if you don't need to reuse the value elsewhere, the latency is slow at 5 cycles, but the reciprocal throughput is 1 cycle, which is much better than either of your versions.

The full machine code is
8b 45 fc mov eax,DWORD PTR [rbp-0x4]
2d ff 01 00 00 sub eax,0x1ff
89 45 fc mov DWORD PTR [rbp-0x4],eax
vs
8b 45 fc mov eax,DWORD PTR [rbp-0x4]
80 e4 fd and ah,0xfd
83 c0 01 add eax,0x1
89 45 fc mov DWORD PTR [rbp-0x4],eax
This means for the code for the secound operation is in fact only one byte longer (11 vs 12). Most likely the CPU fetches code in larger units them bytes, so fetching isn't much slower. Also it can decode multiple instructions at the same time, so there the first sample doesn't have an advantage either. Executing a single add, and or sub each takes up a single ALU pass so they all take only one clock on a single execution unit. That's a 1 ns advantage for you sub on a 1GHz CPU.
So basically both operations are more or less the same. The difference may be attributed to some other factors. Maybe memory cell rbp-0x4 is still in L1 cache before your run the secound code sniplet. Or the instructions for the first sniplet are located worse reachable in memory. Or the CPU was able to run the secound sniplet speculativly before you started measuring etc., you would need to know how you measured the speed etc. to decide that.

Why does this function push RAX to the stack as the first operation?

In the assembly of the C++ source below. Why is RAX pushed to the stack?
RAX, as I understand it from the ABI could contain anything from the calling function. But we save it here, and then later move the stack back by 8 bytes. So the RAX on the stack is, I think only relevant for the std::__throw_bad_function_call() operation ... ?
The code:-
#include <functional>
void f(std::function<void()> a)
{
a();
}
Output, from gcc.godbolt.org, using Clang 3.7.1 -O3:
f(std::function<void ()>): # #f(std::function<void ()>)
push rax
cmp qword ptr [rdi + 16], 0
je .LBB0_1
add rsp, 8
jmp qword ptr [rdi + 24] # TAILCALL
.LBB0_1:
call std::__throw_bad_function_call()
I'm sure the reason is obvious, but I'm struggling to figure it out.
Here's a tailcall without the std::function<void()> wrapper for comparison:
void g(void(*a)())
{
a();
}
The trivial:
g(void (*)()): # #g(void (*)())
jmp rdi # TAILCALL

The 64-bit ABI requires that the stack is aligned to 16 bytes before a call instruction.
call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.
(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine. (See the comments).

The reason push rax is there is to align the stack back to a 16-byte boundary to conform to the 64-bit System V ABI in the case where je .LBB0_1 branch is taken. The value placed on the stack isn't relevant. Another way would have been subtracting 8 from RSP with sub rsp, 8. The ABI states the alignment this way:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is
passed on stack) byte boundary. In other words, the value (%rsp + 8) is always
a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
Prior to the call to function f the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f the return address was placed on the stack misaligning the stack by 8. push rax is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call()the stack will be properly aligned for that call to work.
In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8 instruction is executed. The return address of the CALLER to function f will now be back at the top of the stack and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24] to transfer control to the function a. This will JMP to the function not CALL it. When function a does a RET it will return directly back to the function that called f.
At a higher optimization level I would have expected that the compiler should be smart enough to do the comparison, and let it fall through directly to the JMP. What is at label .LBB0_1 could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call() works properly.
As #CodyGray pointed out, if you use GCC (not CLANG) with optimization level of -O2 or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:
f(std::function<void ()>):
cmp QWORD PTR [rdi+16], 0 # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
je .L7 #,
jmp [QWORD PTR [rdi+24]] # MEM[(const struct function *)a_2(D)]._M_invoker
.L7:
sub rsp, 8 #,
call std::__throw_bad_function_call() #
This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than CLANG.

In other cases, clang typically fixes up the stack before returning with a pop rcx.
Using push has an upside for efficiency in code-size (push is only 1 byte vs. 4 bytes for sub rsp, 8), and also in uops on Intel CPUs. (No need for a stack-sync uop, which you'd get if you access rsp directly because the call that brought us to the top of the current function makes the stack engine "dirty").
This long and rambling answer discusses the worst-case performance risks of using push rax / pop rcx for aligning the stack, and whether or not rax and rcx are good choices of register. (Sorry for making this so long.)
(TL:DR: looks good, the possible downside is usually small and the upside in the common case makes this worth it. Partial-register stalls could be a problem on Core2/Nehalem if al or ax are "dirty", though. No other 64-bit capable CPU has big problems (because they don't rename partial regs, or merge efficiently), and 32-bit code needs more than 1 extra push to align the stack by 16 for another call unless it was already saving/restoring some call-preserved regs for its own use.)
Using push rax instead of sub rsp, 8 introduces a dependency on the old value of rax, so you'd think it might slow things down if the value of rax is the result of a long-latency dependency chain (and/or a cache miss).
e.g. the caller might have done something slow with rax that's unrelated to the function args, like var = table[ x % y ]; var2 = foo(x);
# example caller that leaves RAX not-ready for a long time
mov rdi, rax ; prepare function arg
div rbx ; very high latency
mov rax, [table + rdx] ; rax = table[ value % something ], may miss in cache
mov [rsp + 24], rax ; spill the result.
call foo ; foo uses push rax to align the stack
Fortunately out-of-order execution will do a good job here.
The push doesn't make the value of rsp dependent on rax. (It's either handled by the stack engine, or on very old CPUs push decodes to multiple uops, one of which updates rsp independently of the uops that store rax. Micro-fusion of the store-address and store-data uops let push be a single fused-domain uop, even though stores always take 2 unfused-domain uops.)
As long as nothing depends on the output push rax / pop rcx, it's not a problem for out-of-order execution. If push rax has to wait because rax isn't ready, it won't cause the ROB (ReOrder Buffer) to fill up and eventually block the execution of later independent instruction. The ROB would fill up even without the push because the instruction that's slow to produce rax, and whatever instruction in the caller consumes rax before the call are even older, and can't retire either until rax is ready. Retirement has to happen in-order in case of exceptions / interrupts.
(I don't think a cache-miss load can retire before the load completes, leaving just a load-buffer entry. But even if it could, it wouldn't make sense to produce a result in a call-clobbered register without reading it with another instruction before making a call. The caller's instruction that consumes rax definitely can't execute/retire until our push can do the same.)
When rax does become ready, push can execute and retire in a couple cycles, allowing later instructions (which were already executed out of order) to also retire. The store-address uop will have already executed, and I assume the store-data uop can complete in a cycle or two after being dispatched to the store port. Stores can retire as soon as the data is written to the store buffer. Commit to L1D happens after retirement, when the store is known to be non-speculative.
So even in the worst case, where the instruction that produces rax was so slow that it led to the ROB filling up with independent instructions that are mostly already executed and ready to retire, having to execute push rax only causes a couple extra cycles of delay before independent instructions after it can retire. (And some of the caller's instructions will retire first, making a bit of room in the ROB even before our push retires.)
A push rax that has to wait will tie up some other microarchitectural resources, leaving one fewer entry for finding parallelism between other later instructions. (An add rsp,8 that could execute would only be consuming a ROB entry, and not much else.)
It will use up one entry in the out-of-order scheduler (aka Reservation Station / RS). The store-address uop can execute as soon as there's a free cycle, so only the store-data uop will be left. The pop rcx uop's load address is ready, so it should dispatch to a load port and execute. (When the pop load executes, it finds that its address matches the incomplete push store in the store buffer (aka memory order buffer), so it sets up the store-forwarding which will happen after the store-data uop executes. This probably consumes a load buffer entry.)
Even an old CPUs like Nehalem has a 36 entry RS, vs. 54 in Sandybridge, or 97 in Skylake. Keeping 1 entry occupied for longer than usual in rare cases is nothing to worry about. The alternative of executing two uops (stack-sync + sub) is worse.
(off topic)
The ROB is larger than the RS, 128 (Nehalem), 168 (Sandybridge), 224 (Skylake). (It holds fused-domain uops from issue to retirement, vs. the RS holding unfused-domain uops from issue to execution). At 4 uops per clock max frontend throughput, that's over 50 cycles of delay-hiding on Skylake. (Older uarches are less likely to sustain 4 uops per clock for as long...)
ROB size determines the out-of-order window for hiding a slow independent operation. (Unless register-file size limits are a smaller limit). RS size determines the out-of-order window for finding parallelism between two separate dependency chains. (e.g. consider a 200 uop loop body where every iteration is independent, but within each iteration it's one long dependency chain without much instruction-level parallelism (e.g. a[i] = complex_function(b[i])). Skylake's ROB can hold more than 1 iteration, but we can't get uops from the next iteration into the RS until we're within 97 uops of the end of the current one. If the dep chain wasn't so much larger than RS size, uops from 2 iterations could be in flight most of the time.)
There are cases where push rax / pop rcx can be more dangerous:
The caller of this function knows that rcx is call-clobbered, so won't read the value. But it might have a false dependency on rcx after we return, like bsf rcx, rax / jnz or test eax,eax / setz cl. Recent Intel CPUs don't rename low8 partial registers anymore, so setcc cl has a false dep on rcx. bsf actually leaves its destination unmodified if the source is 0, even though Intel documents it as an undefined value. AMD documents leave-unmodified behaviour.
The false dependency could create a loop-carried dep chain. On the other hand, a false dependency can do that anyway, if our function wrote rcx with instructions dependent on its inputs.
It would be worse to use push rbx/pop rbx to save/restore a call-preserved register that we weren't going to use. The caller likely would read it after we return, and we'd have introduced a store-forwarding latency into the caller's dependency chain for that register. (Also, it's maybe more likely that rbx would be written right before the call, since anything the caller wanted to keep across the call would be moved to call-preserved registers like rbx and rbp.)
On CPUs with partial-register stalls (Intel pre-Sandybridge), reading rax with push could cause a stall or 2-3 cycles on Core2 / Nehalem if the caller had done something like setcc al before the call. Sandybridge doesn't stall while inserting a merging uop, and Haswell and later don't rename low8 registers separately from rax at all.
It would be nice to push a register that was less likely to have had its low8 used. If compilers tried to avoid REX prefixes for code-size reasons, they'd avoid dil and sil, so rdi and rsi would be less likely to have partial-register issues. But unfortunately gcc and clang don't seem to favour using dl or cl as 8-bit scratch registers, using dil or sil even in tiny functions where nothing else is using rdx or rcx. (Although lack of low8 renaming in some CPUs means that setcc cl has a false dependency on the old rcx, so setcc dil is safer if the flag-setting was dependent on the function arg in rdi.)
pop rcx at the end "cleans" rcx of any partial-register stuff. Since cl is used for shift counts, and functions do sometimes write just cl even when they could have written ecx instead. (IIRC I've seen clang do this. gcc more strongly favours 32-bit and 64-bit operand sizes to avoid partial-register issues.)
push rdi would probably be a good choice in a lot of cases, since the rest of the function also reads rdi, so introducing another instruction dependent on it wouldn't hurt. It does stop out-of-order execution from getting the push out of the way if rax is ready before rdi, though.
Another potential downside is using cycles on the load/store ports. But they are unlikely to be saturated, and the alternative is uops for the ALU ports. With the extra stack-sync uop on Intel CPUs that you'd get from sub rsp, 8, that would be 2 ALU uops at the top of the function.

x86 assembly instructions optimisation

I'm trying to optimize a block of instructions in a loop, called thousands of time, which is the bottleneck in my algorithm.
This block of code compute the multiplication of a N matrices 3x3 (iA array) against N vectors 3 (iV array) and store the N results in oV array. (N is not fix and is usually between 3000 and 15000)
Each line of matrices and vectors are 128-bits aligned (4 floats) to exploit SSE optimisation (the 4th floating value is ignored).
C++ code :
__m128* ip = (__m128*)iV;
__m128* op = (__m128*)oV;
__m128* A = (__m128*)iA;
__m128 res1, res2, res3;
int i;
for (i=0; i<N; i++)
{
res1 = _mm_dp_ps(*A++, *ip, 0x71);
res2 = _mm_dp_ps(*A++, *ip, 0x72);
res3 = _mm_dp_ps(*A++, *ip++, 0x74);
*op++ = _mm_or_ps(res1, _mm_or_ps(res2, res3));
}
The compiler generates these instructions :
000007FEE7DD4FE0 movaps xmm2,xmmword ptr [rsi] //move "ip" in register
000007FEE7DD4FE3 movaps xmm1,xmmword ptr [rdi+10h] //move second line of A in register
000007FEE7DD4FE7 movaps xmm0,xmmword ptr [rdi+20h] //move third line of A in register
000007FEE7DD4FEB inc r11d //i++
000007FEE7DD4FEE add rbp,10h //op++
000007FEE7DD4FF2 add rsi,10h //ip++
000007FEE7DD4FF6 dpps xmm0,xmm2,74h //dot product of 3rd line of A against ip
000007FEE7DD4FFC dpps xmm1,xmm2,72h //dot product of 2nd line of A against ip
000007FEE7DD5002 orps xmm0,xmm1 //"merge" of the result of the two dot products
000007FEE7DD5005 movaps xmm3,xmmword ptr [rdi] //move first line of A in register
000007FEE7DD5008 add rdi,30h //A+=3
000007FEE7DD500C dpps xmm3,xmm2,71h //dot product of 1st line of A against ip
000007FEE7DD5012 orps xmm0,xmm3 //"merge" of the result
000007FEE7DD5015 movaps xmmword ptr [rbp-10h],xmm0 //move result in memory (op)
000007FEE7DD5019 cmp r11d,dword ptr [rbx+28h] //compare i
000007FEE7DD501D jl MyFunction+370h (7FEE7DD4FE0h) //loop
I'm not very familiar with low-level optimisations, so the question is : Do you see some possible optimisations if I write assembly code myself ?
For example, will it run faster if I change :
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
by
movaps xmmword ptr [rbp],xmm0
add rbp,10h
I have also read that ADD instruction is faster than INC...

Calculating indirect address with offset, such as rbp-10 is very cheap, because there is special hardware for these sort of calculations in the "effective address calculation" unit [which I think has a different name, but can't think of or have any success with google to find it's name].
There is, however, a dependency between the add rbp,10h and [rbp-10h], which could possibly cause a problem - but I doubt it in this particular case. In your case, there is a long distance between rbp-10 and using it, so it's not an issue. The compiler is probably putting it that far up because it's "free" at that point, since the processor will be waiting for the data to come in from the outside into the xmm registers that has been read earlier. In other words, any work we can stick between the reads of xmm0, xmm1 and xmm2 at the beginning of the loop, and the dpps instructions using xmm0, xmm1 and xmm2 will be beneficial, because the processor will be waiting for that data to "arrive" before it can compute the dpps result.

I've done lots of x86 assembly optimizations, and I can tell you it was a great learning experience. It taught me a lot about how compilers work as well, and the biggest thing I learned was that compilers are in general pretty good at what they do. I know that's a flippant comment, but it is true...
I also learned that optimizations you make can have a positive result on one processor family, and a negative result on another processor family. Things like pipelining, branch prediction, and processor cache play a huge role... so unless you're targeting a very specific hardware configuration, be careful about assumptions regarding improvements you make.
To your specific question about reordering the add to remove the rbp-10h offset... it look like an obvious improvement, and you'd have to verify by reading the instruction manual, but I'd guess the -10h memory offset comes for free in that instruction. And moving the add may throw off a pipelined instruction and actually cause a clock cycle loss. You'd have to experiment.

There are a few things you could do to the above code to improve it. Generally, using a value after it has been altered incurs a processor stall as it waits for the result. So these lines would incur a penalty:-
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
but in the code snippet above those two lines a quite far apart, so that isn't really an issue. As others have already said, the rbp-10h is 'free' in that the address calculation hardware handles it.
You could move the movaps xmm3,xmmword ptr [rdi] up a line and maybe rearrange a couple of other lines.
Would it be worth it?
NO
You'd be lucky to see any real performance gain from any of this because your algorithm is
<blink> memory bandwidth limited </blink>*
which means that the the time taken to read the data from RAM into the CPU is greater than the time it takes the CPU to do its processing. At worst, reading a memory address can involve a page fault and a disk read. The prefetch instructions won't help either, it's called 'Streaming SIMD Extension' because it's optimised to stream data into the CPU (the memory interface can handle four separate streams IIRC).
If you were doing a lot of computation on a small set of data (an FFT perhaps) then you could gain a lot from hand-crafting the assembler. But your algorithm is quite simple so the CPU is idling a lot of the time waiting for the data to arrive. If you know the speed of your RAM you could work out the maximum throughput for the algorithm and use that to compare against what it's currently achieving (you'll never reach the maximum theoretical throughput though).
There are things you can do to minimise the memory stalling, and it's a higher level change rather than fiddling with individual instructions (often, optimising the algorithms gets better results). The simplest is to double buffer the input data. Divide the register set into two groups (possible to do here as you're only using four of the SIMD registers):-
load set 1
mainloop:
load set 2
do processing on set 1
save set 1 result
load set 1
do processing on set 2
save set 2 result
goto mainloop
Hopefully that's given you some ideas. Even if it doesn't go much faster, it's still an interesting exercise and you can learn a lot from it.
RIP blink.

Why would a compiler generate this assembly?

While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:
void QMainWindowLayout::invalidate()
{
QLayout::invalidate()
minSize = szHint = QSize();
}
It is compiled to this:
<invalidate()> push %rbx
<invalidate()+1> mov %rdi,%rbx
<invalidate()+4> callq 0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
<invalidate()+43> pop %rbx
<invalidate()+44> retq
The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 to %rbx+0x564 and %rbx+0x568, but then it loads that -1 from %rbx+0x564 back into a register just to write it out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.
So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?
(Note: This code is from the normal release library build shipped by ubuntu, so it was presumably compiled by GCC in optimize mode. The minSize and szHint variables are normal variables of type QSize.)

Not sure you're correct when you're saying it's stupid. I think the compiler might be trying to optimize the code size here. There is no 64-bit immediate to memory mov instruction. So the compiler has to generate 2 mov instructions just like it did above. Each of them would be 10 bytes, the 2 moves generated are 14 bytes. It's been written to so there is most likely no memory latency so I do not think you'll take any performance hit here.

The code is "less than perfect".
For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:
00000000 31C0 xor eax,eax
00000002 48F7D0 not rax
00000005 48898364050000 mov [rbx+0x564],rax
0000000C 4889836C050000 mov [rbx+0x56c],rax
;Note: XOR above clears RAX due to zero extension
For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:
xor eax,eax
not rax ;Must wait until previous instruction finishes
mov [rbx+0x564],rax ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax ;Must wait until "not" finishes
For performance you want to do this:
00000000 48C7C0FFFFFFFF mov rax,0xffffffff
00000007 C78364050000FFFFFFFF mov dword [rbx+0x564],0xffffffff
00000011 C78368050000FFFFFFFF mov dword [rbx+0x568],0xffffffff
0000001B C7836C050000FFFFFFFF mov dword [rbx+0x56c],0xffffffff
00000025 C78370050000FFFFFFFF mov dword [rbx+0x570],0xffffffff
;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension
This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (45 bytes).
If you try to get a balance between code size and performance; then you could hope that the first instruction (that sets the value in RAX) completes before the last instruction/s needs to know the value in RAX. This might be something like this:
mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov dword [rbx+0x56c],rax
This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.
Now; let's look at the original code and see why it is bad:
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564] ;Massive problem
mov [rbx+0x56C],rax ;Depends on previous instruction
Modern CPUs do have something called "store forwarding", where writes are stored in a buffer and future reads can get the value from this buffer to avoid reading the value from cache. Ironically, this only works if the size of the read is smaller than or equal to the size of the write. The "store forwarding" will not work for this code as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache and then has to read the value from cache; which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything) so that's another problem.

I'd break down the lines as this (think several have comment same steps)
These two lines comes from the inline definition of QSize() http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h
which set each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint which is also set at the same time.
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
These lines are finally setting minSize using 64bit operations because the compiler now know the size of a QSize object. And the address of minSize is 0x56c(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
Note. First part is setting two separate fields, and next part is copying a QSize object (regardless content). The question then is, should the compiler be smart enough to build a compound 64bit value because it saw preset values just earlier? Not sure about that...

In addition to Guillaume's answer, the 64 bit load/store is not aligned. But according to the Intel optimization guide (p 3-62)
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.
Which imo implies that an unaligned load/store that does not cross a cache line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cacheline.
Normally Intel processors use store to load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load of several smaller stores does not store-load-forward but stalls: (p 3-66, p 3-68)
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of
a load which is forwarded from a store must be completely contained
within the store data.
; A. Large load stall
mov mem, eax ; Store dword to address “MEM"
mov mem + 4, ebx ; Store dword to address “MEM + 4"
fld mem ; Load qword at address “MEM", stalls
So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?
EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js