Assembly Performance Tuning - c++

I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).
I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:
#include "stdafx.h"
#include <intrin.h>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
__int64 startval;
__int64 stopval;
unsigned int value; // Keep the result so the computation is not optimized out
startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode
// Simple Math: a = (a << 3) + 0x0054E9
_asm {
mov ebx, 0x1E532 // Seed
shl ebx, 3
add ebx, 0x0054E9
mov value, ebx
}
stopval = __rdtsc();
__int64 val = (stopval - startval);
cout << "Result: " << value << " -> " << val << endl;
int i;
cin >> i;
return 0;
}
I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples the number is still impossibly varied.
I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.
Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles it would take to execute the code -- can that not be done with our newfangled modern processors?
EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.

To even have a hope of repeatable, deterministic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.
You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.
Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.
As such, the normal timing sequence for your code would be something like this:
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID ; Intel says by the third execution, the timing will be stable.
RDTSC ; read the clock
push eax ; save the start time
push edx
mov ebx, 0x1E532 ; Seed -- start of the test sequence
shl ebx, 3
add ebx, 0x0054E9
mov value, ebx
XOR EAX, EAX ; serialize
CPUID
rdtsc ; get end time
pop ecx ; get start time back
pop ebp
sub eax, ebp ; find end-start
sbb edx, ecx
We're starting to get close, but there's one last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.
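The serialize, read, run, serialize, read sequence above can also be approximated from C++ instead of hand-written assembly. What follows is only a minimal sketch under stated assumptions: it relies on GNU-style inline asm on x86-64 (GCC or Clang), falls back to std::chrono::steady_clock elsewhere, and the helper names serialized_ticks and min_ticks are made up for illustration. It does nothing about code alignment, and it reports the minimum over many runs because any single sample may include an interrupt.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: serialize the pipeline, then read a timestamp.
inline std::uint64_t serialized_ticks() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::uint32_t a = 0, b = 0, c = 0, d = 0;
    // CPUID (leaf 0) is used only because it is serializing; outputs are ignored.
    asm volatile("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d) : : "memory");
    std::uint32_t lo = 0, hi = 0;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    (void)a; (void)b; (void)c; (void)d;
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    // Assumption: no TSC access on this target, so use a monotonic clock instead.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical helper: smallest tick count seen over `runs` timed executions.
template <class F>
std::uint64_t min_ticks(F&& code, int runs = 1000) {
    assert(runs > 0);
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (int i = 0; i < runs; ++i) {
        const std::uint64_t start = serialized_ticks();
        code();  // code under test
        const std::uint64_t stop = serialized_ticks();
        best = std::min(best, stop - start);
    }
    return best;
}
```

Timing an empty lambda first, as in min_ticks([] {}), gives a rough figure for the overhead of the CPUID/RDTSC pair itself, which can then be subtracted from the figure for the code under test.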
Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.

I'd suggest taking a look at Agner Fog's "Software optimization resources" - in particular, the assembly and microarchitecture manuals (2 and 3), and the test code, which includes a rather more sophisticated framework for measurements using the performance monitor counters.

The Z80, and possibly the TI, had the advantage of synchronized memory access, no caches, and in-order execution of the instructions. That made it a lot easier to calculate the number of clocks per instruction.
On current x86 CPUs, instructions using AX or EAX are not faster per se, but some instructions might be shorter than the instructions using other registers. That might just save a byte in the instruction cache!
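To make the encoding point concrete, here is a small sketch that writes out the machine-code bytes for two of the instructions from the question, once with EAX and once with EBX. The byte values follow the standard x86 encodings (ADD EAX, imm32 is opcode 05; ADD r/m32, imm32 is 81 /0; SHL r/m32, imm8 is C1 /4); the array and function names are invented for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// add eax, 0x0054E9 : short accumulator form, opcode 05 + imm32
constexpr std::array<std::uint8_t, 5> add_eax_imm32 = {0x05, 0xE9, 0x54, 0x00, 0x00};
// add ebx, 0x0054E9 : general form, opcode 81, ModRM C3 (/0, rm = ebx) + imm32
constexpr std::array<std::uint8_t, 6> add_ebx_imm32 = {0x81, 0xC3, 0xE9, 0x54, 0x00, 0x00};

// shl eax, 3 and shl ebx, 3 : C1 /4 ib, no special accumulator form
constexpr std::array<std::uint8_t, 3> shl_eax_3 = {0xC1, 0xE0, 0x03};
constexpr std::array<std::uint8_t, 3> shl_ebx_3 = {0xC1, 0xE3, 0x03};

// Hypothetical helper: code bytes saved by using EAX for this shl/add pair.
constexpr std::size_t bytes_saved_by_eax() {
    return (add_ebx_imm32.size() - add_eax_imm32.size()) +
           (shl_ebx_3.size() - shl_eax_3.size());
}

static_assert(bytes_saved_by_eax() == 1,
              "only the add has a shorter EAX-specific encoding");
```

So for this particular sequence the whole difference is one byte of code size, and only because the add happens to have an accumulator-specific short form.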

Go here and download the Architectures Optimization Reference Manual.
There are many myths. I think the EAX claim is one of them.
Also note that you can no longer talk about 'which instruction is faster'. On today's hardware there is no one-to-one relation between instructions and execution time. Some instructions are preferred to others not because they are 'faster' but because they break dependencies between other instructions.

I believe that if there's a difference nowadays it will only be because some of the legacy instructions have a shorter encoding for the variant that uses EAX. To test this, repeat your test case a million times or more before you compare cycle counts.

You're getting ridiculous variance because rdtsc does not serialize execution. Depending on inaccessible details of the execution state, the instructions you're trying to benchmark may in fact be executed entirely before or after the interval between the rdtsc instructions! You will probably get better results if you insert a serializing instruction (such as cpuid) immediately after the first rdtsc and immediately before the second. See this Intel tech note (PDF) for gory details.

Starting your program is going to take much longer than running 4 assembly instructions once, so any difference from your assembly will drown in the noise. Running the program many times won't help, but it would probably help if you run the 4 assembly instructions inside a loop, say, a million times. That way the program start-up happens only once.
There can still be variation. One especially annoying thing that I've experienced myself is that your CPU might have a feature like Intel's Turbo Boost where it will dynamically adjust its speed based on things like the temperature of your CPU. This is more likely to be the case on a laptop. If you've got that, then you will have to turn it off for any benchmark results to be reliable.

I think what the article tries to say about the EAX register, is that since some operations can only be performed on EAX, it's better to use it from the start. This was very true with the 8086 (MUL comes to mind), but the 386 made the ISA much more orthogonal, so it's much less true these days.

Related

imul then mov vs mov then imul - any difference?

If I compile the following C++ program:
int baz(int x) { return x * x; }
in clang 15, I get:
baz(int):
mov eax, edi
imul eax, edi
ret
while gcc 12.2 gives me:
baz(int):
imul edi, edi
mov eax, edi
ret
(See this on GodBolt)
Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.
Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.
This is true in general for mov/and, mov/sub, etc., as long as you don't have a use for the original value. If you do, it is sometimes better to mov to make a copy and then modify the original, which hides the mov latency on CPUs without move elimination. (mov/add or a small shift should normally be an lea.)
CPU with mov-elimination
mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See How do *move elimination* slots work in Intel CPU?
All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)
It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.
To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.
CPU without mov-elimination
mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.
(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)
But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.
Things that don't matter
On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.
Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) renames low-8 registers like DIL and DL separately from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.
Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (Why doesn't GCC use partial registers?)
But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

While I experimented with measuring time of execution of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block, but with two arithmetic operations in the for loop body. Here is the code I ended up testing:
#include <iostream>
#include <chrono>
#define NUM_ITERATIONS 100000000
int main()
{
// Block 1: one operation in loop body
{
int64_t x = 0, y = 0;
auto start = std::chrono::high_resolution_clock::now();
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end-start;
std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
}
// Block 2: two operations in loop body
{
int64_t x = 0, y = 0;
auto start = std::chrono::high_resolution_clock::now();
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end-start;
std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
}
return 0;
}
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3), with different online compilers (for example onlinegdb.com), on my work machine, on my home PC and laptop, on RaspberryPi and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), changed integer types. But I always got similar result: the block with one line in loop is SLOWER than block with two lines:
1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000
I checked the assembly output on https://godbolt.org/ but everything looked like I expected: second block just had one more operation in assembly output.
Three operations always behaved as expected: they are slower than one and faster than four. So why do two operations produce such an anomaly?
Edit:
Let me repeat: I have such behaviour on all of my Windows and Unix machines with code not optimized. I looked at assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. Anyway if the loop is optimized away, there is nothing I ask about in the code which left. I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about. The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in time of execution is 5-25% on my tests (quite noticeable).
This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into the loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.
On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.
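Since the effect depends on the variables living in memory, one way to reproduce the same store/reload pattern at any optimization level is to make the variables volatile. The sketch below is an assumption-laden illustration, not the asker's exact code: the function names and the XY struct are invented here, and whether the one-operation loop actually comes out slower depends on the CPU.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// volatile forces a store and a reload of each variable on every iteration,
// which is roughly what -O0 code does for ordinary variables.
std::int64_t one_op(long n) {
    assert(n >= 0);
    volatile std::int64_t x = 0;
    for (volatile long i = 0; i < n; i = i + 1) { x = x + 31; }
    return x;
}

struct XY { std::int64_t x, y; };  // hypothetical result type

XY two_ops(long n) {
    assert(n >= 0);
    volatile std::int64_t x = 0, y = 0;
    for (volatile long i = 0; i < n; i = i + 1) { x = x + 17; y = y - 37; }
    return {x, y};
}

// Hypothetical helper: wall-clock seconds taken by one call of f().
template <class F>
double seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing seconds([] { one_op(100000000); }) against seconds([] { two_ops(100000000); }) on the same machine shows whether the store-forwarding quirk is visible there; repeating each measurement several times and in both orders helps rule out startup effects.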
This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.
This is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.
Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
See @rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.
With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it does just stay in a register. Unfortunately @SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short-NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as the desktop version of the same generation.)
Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result
I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.
(later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.
So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.
See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
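The linear-scaling sanity check mentioned above can be sketched as follows. This is a rough illustration only: work, seconds_for and scaling_ratio are hypothetical names, the workload is an arbitrary volatile accumulation chosen so the compiler cannot delete it, and the one warm-up run is a crude way to get past cold caches, page faults and clock ramp-up.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical workload: a chain of adds through a volatile, so it cannot be removed.
std::uint64_t work(std::uint64_t reps) {
    volatile std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < reps; ++i) { sink = sink + i; }
    return sink;
}

// Wall-clock seconds for one call of work(reps).
double seconds_for(std::uint64_t reps) {
    const auto start = std::chrono::steady_clock::now();
    work(reps);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// time(2N) / time(N). A value far from 2.0 suggests the timed region
// is dominated by something other than the loop being measured.
double scaling_ratio(std::uint64_t reps) {
    assert(reps > 0);
    seconds_for(reps);  // warm-up run, result discarded
    const double t_n = seconds_for(reps);
    const double t_2n = seconds_for(2 * reps);
    return t_2n / t_n;
}
```

If scaling_ratio(N) stays near 1.0 as N grows, the loop has most likely been optimized away and only the clock calls are being timed, which is exactly the situation at -O1 and higher in the question.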
I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.
The loop with optimization enabled:
As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:
orig_main:
...
call std::chrono::_V2::system_clock::now() # demangled C++ symbol name
mov rbp, rax # save the return value = start
call std::chrono::_V2::system_clock::now()
# end in RAX
So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.
With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:
#include <benchmark/benchmark.h>
static void TargetFunc(benchmark::State& state) {
uint64_t x2 = 0, y2 = 0;
// Code inside this loop is measured repeatedly
for (auto _ : state) {
benchmark::DoNotOptimize(x2 += 31);
benchmark::DoNotOptimize(y2 += 31);
}
}
// Register the function as a benchmark
BENCHMARK(TargetFunc);
# just the main loop, from gcc10.1 -O3
.L7: # do{
add rax, 31 # x2 += 31
add rdx, 31 # y2 += 31
sub rbx, 1
jne .L7 # }while(--count != 0)
I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.
For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.
static void TargetFunc(benchmark::State& state) {
uint64_t x2 = 0, y2 = 0;
// Code inside this loop is measured repeatedly
for (auto _ : state) {
x2 += 16;
y2 += 17;
asm volatile("" : "+r"(x2), "+r"(y2));
}
}
All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see @rcgldr's answer.
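For anyone without Google Benchmark installed, the same register-only loop can be timed with std::chrono alone. This is a sketch under assumptions: it depends on GNU C inline asm (GCC or Clang), the names Sums, two_adds_in_registers and ns_per_iteration are invented for illustration, and on other compilers the loop may simply be optimized away.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

struct Sums { std::uint64_t x, y; };  // hypothetical result type

Sums two_adds_in_registers(std::uint64_t iterations) {
    std::uint64_t x = 0, y = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x += 16;
        y += 17;
#if defined(__GNUC__) || defined(__clang__)
        // Empty asm: the compiler must have x and y in registers here and
        // must assume both were modified, so the adds cannot be folded.
        asm volatile("" : "+r"(x), "+r"(y));
#endif
    }
    return {x, y};
}

// Average nanoseconds per loop iteration, measured around the whole loop.
double ns_per_iteration(std::uint64_t iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    const Sums s = two_adds_in_registers(iterations);
    const auto stop = std::chrono::steady_clock::now();
    (void)s;
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           static_cast<double>(iterations);
}
```

Multiplying the nanoseconds per iteration by the core clock in GHz gives an approximate cycles-per-iteration figure, which can be compared with the expected 1 cycle per iteration if the clock speed is known and stable.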
Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.
You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g. x + (y<<2) + 4 can compile to a single LEA instruction for x86.
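As a tiny illustration of that last point, the expression can be put into a function of its own and inspected on a compiler explorer. The function name is hypothetical, and the expected instruction in the comment is an assumption about typical GCC/Clang output for x86-64 at -O2 rather than a guarantee.

```cpp
// Two additions and a shift in the source...
int scaled_sum(int x, int y) {
    // ...which optimizing x86-64 compilers typically emit as one instruction,
    // something like: lea eax, [rdi + rsi*4 + 4]
    return x + (y << 2) + 4;
}
```

There is no separate add or shl in such output to attribute a cost to, which is why timing "the + operator" in isolation is not meaningful.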
The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away
TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.
Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).
ETA: This was a guess, and Peter Cordes has made a very good argument about why it's incorrect. Go upvote Peter's answer.
I'm leaving my answer here because some found the information useful. Though this doesn't correctly explain the behavior seen in the OP, it highlights some of the issues that make it infeasible (and meaningless) to try to measure the speed of a particular instruction on a modern processor.
Educated guess:
It's the combined effect of pipelining, powering down portions of a core, and dynamic frequency scaling.
Modern processors pipeline so that multiple instructions can be executing at the same time. This is possible because the processor actually works on micro-ops rather than the assembly-level instructions we usually think of as machine language. Processors "schedule" micro-ops by dispatching them to different portions of the chip while keeping track of the dependencies between the instructions.
Suppose the core running your code has two arithmetic/logic units (ALUs). A single arithmetic instruction repeated over and over requires only one ALU. Using two ALUs doesn't help because the next operation depends on completion of the current one, so the second ALU would just be waiting around.
But in your two-expression test, the expressions are independent. To compute the next value of y, you do not have to wait for the current operation on x to complete. Now, because of power-saving features, that second ALU may be powered down at first. The core might run a few iterations before realizing that it could make use of the second ALU. At that point, it can power up the second ALU and most of the two-expression loop will run as fast as the one-expression loop. So you might expect the two examples to take approximately the same amount of time.
Finally, many modern processors use dynamic frequency scaling. When the processor detects that it's not running hard, it actually slows its clock a little bit to save power. But when it's used heavily (and the current temperature of the chip permits), it might increase the actual clock speed as high as its rated speed.
I assume this is done with heuristics. In the case where the second ALU stays powered down, the heuristic may decide it's not worth boosting the clock. In the case where two ALUs are powered up and running at top speed, it may decide to boost the clock. Thus the two-expression case, which should already be just about as fast as the one-expression case, actually runs at a higher average clock frequency, enabling it to complete twice as much work in slightly less time.
Given your numbers, the difference is about 14%. My Windows machine idles at about 3.75 GHz, and if I push it a little by building a solution in Visual Studio, the clock climbs to about 4.25GHz (eyeballing the Performance tab in Task Manager). That's a 13% difference in clock speed, so we're in the right ballpark.
I split up the code into C++ and assembly. I just wanted to test the loops, so I didn't return the sum(s). I'm running on Windows, the calling convention is rcx, rdx, r8, r9, the loop count is in rcx. The code is adding immediate values to 64 bit integers on the stack.
I'm getting similar times for both loops, less than 1% variation, same or either one up to 1% faster than the other.
There is an apparent dependency factor here: each add to memory has to wait for the prior add to the same location to complete, but adds to different locations are independent of each other, so the two memory adds can be performed essentially in parallel.
Changing test2 to do 3 add to memories, ends up about 6% slower, 4 add to memories, 7.5% slower.
My system is Intel 3770K 3.5 GHz CPU, Intel DP67BG motherboard, DDR3 1600 9-9-9-27 memory, Win 7 Pro 64 bit, Visual Studio 2015.
.code
public test1
align 16
test1 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst10: add qword ptr[rsp+8],17
dec rcx
jnz tst10
add rsp,16
ret
test1 endp
public test2
align 16
test2 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst20: add qword ptr[rsp+0],17
add qword ptr[rsp+8],-37
dec rcx
jnz tst20
add rsp,16
ret
test2 endp
end
I also tested with add immediate to register, 1 or 2 registers within 1% (either could be faster, but we'd expect them both to execute at 1 iteration / clock on Ivy Bridge, given its 3 integer ALU ports; What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?).
3 registers 1.5 times as long, somewhat worse than the ideal 1.333 cycles / iterations from 4 uops (including the loop counter macro-fused dec/jnz) for 3 back-end ALU ports with perfect scheduling.
4 registers, 2.0 times as long, bottlenecked on the front-end: Is performance reduced when executing loops whose uop count is not a multiple of processor width?. Haswell and later microarchitectures would handle this better.
.code
public test1
align 16
test1 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst10: add rdx,17
dec rcx
jnz tst10
ret
test1 endp
public test2
align 16
test2 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst20: add rdx,17
add r8,-37
dec rcx
jnz tst20
ret
test2 endp
public test3
align 16
test3 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst30: add rdx,17
add r8,-37
add r9,47
dec rcx
jnz tst30
ret
test3 endp
public test4
align 16
test4 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst40: add rdx,17
add r8,-37
add r9,47
add r10,-17
dec rcx
jnz tst40
ret
test4 endp
end
@PeterCordes proved this answer to be wrong in many assumptions, but it could still be useful as some blind research attempt of the problem.
I set up some quick benchmarks, thinking it may somehow be connected to code memory alignment, truly a crazy thought.
But it seems that @Adrian McCarthy got it right with the dynamic frequency scaling.
Anyway benchmarks tell that inserting some NOPs could help with the issue, with 15 NOPs after the x+=31 in Block 1 leading to nearly the same performance as the Block 2. Truly mind blowing how 15 NOPs in single instruction loop body increase performance.
http://quick-bench.com/Q_7HY838oK5LEPFt-tfie0wy4uA
I also tried -Ofast thinking compilers might be smart enough to throw away some code memory inserting such NOPs, but it seems not to be the case.
http://quick-bench.com/so2CnM_kZj2QEWJmNO2mtDP9ZX0
Edit: Thanks to @PeterCordes it was made clear that optimizations were never working quite as expected in benchmarks above (as global variable required add instructions to access memory), new benchmark http://quick-bench.com/HmmwsLmotRiW9xkNWDjlOxOTShE clearly shows that Block 1 and Block 2 performance is equal for stack variables. But NOPs could still help with single-threaded application with loop accessing global variable, which you probably should not use in that case and just assign global variable to local variable after the loop.
Edit 2: Actually optimizations never worked due to quick-benchmark macros making variable access volatile, preventing important optimizations. It is only logical to load the variable once as we are only modifying it in the loop, so it is volatile or disabled optimizations being the bottleneck. So this answer is basically wrong, but at least it shows how NOPs could speed-up unoptimized code execution, if it makes any sense in the real world (there are better ways like bucketing counters).
Processors are so complex these days that we can only guess.
The assembly emitted by your compiler is not what is really executed. The microcode/firmware/whatever of your CPU will interpret it and turn it into instructions for its execution engine, much like JIT languages such as C# or Java do.
One thing to consider here is that in each loop there are not 1 or 2 instructions, but n + 2, as you also increment i and compare it to your number of iterations. In the vast majority of cases it wouldn't matter, but here it does, as the loop body is so simple.
Let's see the assembly :
Some defines:
#define NUM_ITERATIONS 1000000000ll
#define X_INC 17
#define Y_INC -31
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM :
mov QWORD PTR [rbp-32], 0
.L13:
cmp QWORD PTR [rbp-32], 999999999
jg .L12
add QWORD PTR [rbp-24], 17
add QWORD PTR [rbp-32], 1
jmp .L13
.L12:
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=X_INC; y+=Y_INC;}
ASM:
mov QWORD PTR [rbp-80], 0
.L21:
cmp QWORD PTR [rbp-80], 999999999
jg .L20
add QWORD PTR [rbp-64], 17
sub QWORD PTR [rbp-72], 31
add QWORD PTR [rbp-80], 1
jmp .L21
.L20:
So both assemblies look pretty similar. But then let's think twice: modern CPUs have ALUs which operate on values which are wider than their register size. So there is a chance that in the first case, the operations on x and i are done on the same computing unit. But then you have to read i again, as you put a condition on the result of this operation. And reading means waiting.
So, in the first case, to iterate on x, the CPU might have to be in sync with the iteration on i.
In the second case, maybe x and y are treated on a different unit than the one dealing with i. So in fact, your loop body runs in parallel with the condition driving it. And there goes your CPU computing and computing until someone tells it to stop. It doesn't matter if it goes too far, going back a few loops is still fine compared to the amount of time it just gained.
So, to compare what we want to compare (one operation vs two operations), we should try to get i out of the way.
One solution is to completely get rid of it by using a while loop:
C/C++:
while (x < (X_INC * NUM_ITERATIONS)) { x+=X_INC; }
ASM:
.L15:
movabs rax, 16999999999
cmp QWORD PTR [rbp-40], rax
jg .L14
add QWORD PTR [rbp-40], 17
jmp .L15
.L14:
Another one is to use the antiquated "register" C keyword:
C/C++:
register long i;
for (i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM:
mov ebx, 0
.L17:
cmp rbx, 999999999
jg .L16
add QWORD PTR [rbp-48], 17
add rbx, 1
jmp .L17
.L16:
Here are my results:
x1 for: 10.2985 seconds. x,y = 17000000000,0
x1 while: 8.00049 seconds. x,y = 17000000000,0
x1 register-for: 7.31426 seconds. x,y = 17000000000,0
x2 for: 9.30073 seconds. x,y = 17000000000,-31000000000
x2 while: 8.88801 seconds. x,y = 17000000000,-31000000000
x2 register-for: 8.70302 seconds. x,y = 17000000000,-31000000000
Code is here: https://onlinegdb.com/S1lAANEhI

C/C++. Why could a simple integer addition on a volatile be translated to a different asm instruction on gcc and clang?

I wrote a simple loop:
int volatile value = 0;
void loop(int limit) {
for (int i = 0; i < limit; ++i) {
++value;
}
}
I compiled this with gcc and clang (-O3 -fno-unroll-loops) and got different outputs. They differ in the ++value part:
clang:
add dword ptr [rip + value], 1 # ++value
add edi, -1 # --limit
jne .LBB0_1 # if limit > 0 then continue looping
gcc:
mov eax, DWORD PTR value[rip] # copy value to a register
add edx, 1 # ++i
add eax, 1 # increment a copy of value
mov DWORD PTR value[rip], eax # store incremented copy to value, i. e. ++value
cmp edi, edx # compare i < limit
jne .L3 # if i < limit then continue looping
The C and C++ versions are the same on each compiler (https://gcc.godbolt.org/z/5x5jGP).
So my questions are:
1) Is gcc doing something wrong? What is the point of copying the value?
2) I have benchmarked that code and for some reason the profiler shows that in gcc's version 73% of time is wasted on instruction add edx, 1, 13% on mov DWORD PTR value[rip], eax and 13% on cmp edi, edx. Am I interpreting these results wrong? Why other addition and move instructions take less than 1% of the time?
3) Why can performance differ on gcc/clang in such a primitive code?
This is all because you used volatile and GCC doesn't optimize it as aggressively
Without volatile, e.g. for a single ++*int_ptr, you get a memory-destination add. (And hopefully not inc when tuning for Intel CPUs; inc reg is fine but inc mem costs an extra uop vs. add 1. Unfortunately gcc and clang both get this wrong and use inc mem with -march=skylake: https://godbolt.org/z/_1Ri20)
clang knows that it can fold the volatile read / write accesses into the load and store portions of a memory-destination add.
GCC does not know how to do this optimization for volatile. Using volatile in GCC typically results in separate mov loads and stores, avoiding x86's ability to save code-size by using CISC memory operands for ALU instructions. On a load/store machine (like any RISC) you'd need separate load and store instructions anyway so it would be non-issue.
TL:DR: different compiler internals around volatile, specifically a GCC missed-optimization.
This missed optimization barely matters because volatile is rarely used. But feel free to report it on GCC's bugzilla if you want.
Without volatile, the loop would of course optimize away. But you can see a single memory-destination add from GCC or clang for a function that just does ++*p.
1) Is gcc doing something wrong? What is the point of copying the value?
It's only copying it to a register. We don't normally call this "copying", just bringing it into a register where it can operate on it.
Note that gcc and clang also differ in how they implement the loop condition, with clang optimizing to just dec/jnz (actually add -1, but it would use dec with -march=skylake or something with efficient dec, i.e. not Silvermont).
GCC spends an extra uop on the loop condition (on Intel CPUs where add/jnz can macro-fuse into a single uop). IDK why it compiles it naively like that.
73% of time is wasted on instruction add edx, 1
perf counters typically blame the instruction that's waiting for a slow result, not the instruction that's actually slow to produce it.
add edx,1 is waiting for the reload of value. With 4 to 5 cycle store-forwarding latency, this is the major bottleneck in your loop.
(Whether it's between the multiple uops of a memory-destination add or between separate instructions makes essentially no difference. There are no other memory accesses in your loop, so none of the weird effects of store-forwarding latency being lower if you don't try too soon come into play; see "Adding a redundant assignment speeds up code when compiled without optimization" or "Loop with function call faster than an empty loop".)
Why other addition and move instructions take less than 1% of the time?
Because out-of-order execution hides them under the latency of the critical path. They are very rarely the instruction that gets blamed when statistical sampling has to pick one out of the many that are in flight at once in any given cycle.
3) Why can performance differ on gcc/clang in such a primitive code?
I'd expect both of those loops to run at the same speed. Did you just mean performance as in how well the compilers themselves performed in making code that's both fast and compact?

x86 assembly instructions optimisation

I'm trying to optimize a block of instructions in a loop, called thousands of times, which is the bottleneck in my algorithm.
This block of code computes the product of N 3x3 matrices (iA array) with N 3-vectors (iV array) and stores the N results in the oV array. (N is not fixed and is usually between 3000 and 15000.)
Each row of the matrices and each vector is 128-bit aligned (4 floats) to exploit SSE optimisation (the 4th float is ignored).
C++ code :
__m128* ip = (__m128*)iV;
__m128* op = (__m128*)oV;
__m128* A = (__m128*)iA;
__m128 res1, res2, res3;
int i;
for (i=0; i<N; i++)
{
res1 = _mm_dp_ps(*A++, *ip, 0x71);
res2 = _mm_dp_ps(*A++, *ip, 0x72);
res3 = _mm_dp_ps(*A++, *ip++, 0x74);
*op++ = _mm_or_ps(res1, _mm_or_ps(res2, res3));
}
The compiler generates these instructions :
000007FEE7DD4FE0 movaps xmm2,xmmword ptr [rsi] //move "ip" in register
000007FEE7DD4FE3 movaps xmm1,xmmword ptr [rdi+10h] //move second line of A in register
000007FEE7DD4FE7 movaps xmm0,xmmword ptr [rdi+20h] //move third line of A in register
000007FEE7DD4FEB inc r11d //i++
000007FEE7DD4FEE add rbp,10h //op++
000007FEE7DD4FF2 add rsi,10h //ip++
000007FEE7DD4FF6 dpps xmm0,xmm2,74h //dot product of 3rd line of A against ip
000007FEE7DD4FFC dpps xmm1,xmm2,72h //dot product of 2nd line of A against ip
000007FEE7DD5002 orps xmm0,xmm1 //"merge" of the result of the two dot products
000007FEE7DD5005 movaps xmm3,xmmword ptr [rdi] //move first line of A in register
000007FEE7DD5008 add rdi,30h //A+=3
000007FEE7DD500C dpps xmm3,xmm2,71h //dot product of 1st line of A against ip
000007FEE7DD5012 orps xmm0,xmm3 //"merge" of the result
000007FEE7DD5015 movaps xmmword ptr [rbp-10h],xmm0 //move result in memory (op)
000007FEE7DD5019 cmp r11d,dword ptr [rbx+28h] //compare i
000007FEE7DD501D jl MyFunction+370h (7FEE7DD4FE0h) //loop
I'm not very familiar with low-level optimisations, so the question is : Do you see some possible optimisations if I write assembly code myself ?
For example, will it run faster if I change :
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
by
movaps xmmword ptr [rbp],xmm0
add rbp,10h
I have also read that ADD instruction is faster than INC...
Calculating an indirect address with an offset, such as rbp-10h, is very cheap, because there is special hardware for this sort of calculation in the "effective address calculation" unit [which I think has a different name, but I can't think of it and haven't had any success finding its name with Google].
There is, however, a dependency between the add rbp,10h and [rbp-10h], which could possibly cause a problem, but I doubt it in this particular case. In your case, there is a long distance between computing rbp and using rbp-10h, so it's not an issue. The compiler is probably putting the add that far up because it's "free" at that point, since the processor will be waiting for the data that was requested earlier to come in from memory into the xmm registers. In other words, any work we can stick between the loads of xmm0, xmm1 and xmm2 at the beginning of the loop and the dpps instructions using xmm0, xmm1 and xmm2 will be beneficial, because the processor will be waiting for that data to "arrive" before it can compute the dpps result.
I've done lots of x86 assembly optimizations, and I can tell you it was a great learning experience. It taught me a lot about how compilers work as well, and the biggest thing I learned was that compilers are in general pretty good at what they do. I know that's a flippant comment, but it is true...
I also learned that optimizations you make can have a positive result on one processor family, and a negative result on another processor family. Things like pipelining, branch prediction, and processor cache play a huge role... so unless you're targeting a very specific hardware configuration, be careful about assumptions regarding improvements you make.
To your specific question about reordering the add to remove the rbp-10h offset... it looks like an obvious improvement, and you'd have to verify by reading the instruction manual, but I'd guess the -10h memory offset comes for free in that instruction. And moving the add may throw off a pipelined instruction and actually cause a clock cycle loss. You'd have to experiment.
There are a few things you could do to the above code to improve it. Generally, using a value after it has been altered incurs a processor stall as it waits for the result. So these lines would incur a penalty:-
add rbp,10h
movaps xmmword ptr [rbp-10h],xmm0
but in the code snippet above those two lines are quite far apart, so that isn't really an issue. As others have already said, the rbp-10h is 'free' in that the address calculation hardware handles it.
You could move the movaps xmm3,xmmword ptr [rdi] up a line and maybe rearrange a couple of other lines.
Would it be worth it?
NO
You'd be lucky to see any real performance gain from any of this because your algorithm is
<blink> memory bandwidth limited </blink>*
which means that the time taken to read the data from RAM into the CPU is greater than the time it takes the CPU to do its processing. At worst, reading a memory address can involve a page fault and a disk read. The prefetch instructions won't help either; it's called 'Streaming SIMD Extension' because it's optimised to stream data into the CPU (the memory interface can handle four separate streams IIRC).
If you were doing a lot of computation on a small set of data (an FFT perhaps) then you could gain a lot from hand-crafting the assembler. But your algorithm is quite simple so the CPU is idling a lot of the time waiting for the data to arrive. If you know the speed of your RAM you could work out the maximum throughput for the algorithm and use that to compare against what it's currently achieving (you'll never reach the maximum theoretical throughput though).
There are things you can do to minimise the memory stalling, and it's a higher level change rather than fiddling with individual instructions (often, optimising the algorithms gets better results). The simplest is to double buffer the input data. Divide the register set into two groups (possible to do here as you're only using four of the SIMD registers):-
load set 1
mainloop:
load set 2
do processing on set 1
save set 1 result
load set 1
do processing on set 2
save set 2 result
goto mainloop
Hopefully that's given you some ideas. Even if it doesn't go much faster, it's still an interesting exercise and you can learn a lot from it.
RIP blink.

Why would a compiler generate this assembly?

While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:
void QMainWindowLayout::invalidate()
{
QLayout::invalidate();
minSize = szHint = QSize();
}
It is compiled to this:
<invalidate()> push %rbx
<invalidate()+1> mov %rdi,%rbx
<invalidate()+4> callq 0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
<invalidate()+43> pop %rbx
<invalidate()+44> retq
The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 to %rbx+0x564 and %rbx+0x568, but then it loads that -1 from %rbx+0x564 back into a register just to write it out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.
So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?
(Note: This code is from the normal release library build shipped by ubuntu, so it was presumably compiled by GCC in optimize mode. The minSize and szHint variables are normal variables of type QSize.)
Not sure you're correct when you're saying it's stupid. I think the compiler might be trying to optimize the code size here. There is no mov instruction that stores a 64-bit immediate to memory, so the compiler would have to generate 2 more immediate mov instructions just like the ones above. Each of them would be 10 bytes (20 bytes in total), whereas the 2 moves it generated are 14 bytes. The location has just been written to, so there is most likely no memory latency, and I do not think you'll take any performance hit here.
The code is "less than perfect".
For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:
00000000 31C0 xor eax,eax
00000002 48F7D0 not rax
00000005 48898364050000 mov [rbx+0x564],rax
0000000C 4889836C050000 mov [rbx+0x56c],rax
;Note: XOR above clears RAX due to zero extension
For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:
xor eax,eax
not rax ;Must wait until previous instruction finishes
mov [rbx+0x564],rax ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax ;Must wait until "not" finishes
For performance you want to do this:
00000000 48C7C0FFFFFFFF mov rax,0xffffffff
00000007 C78364050000FFFFFFFF mov dword [rbx+0x564],0xffffffff
00000011 C78368050000FFFFFFFF mov dword [rbx+0x568],0xffffffff
0000001B C7836C050000FFFFFFFF mov dword [rbx+0x56c],0xffffffff
00000025 C78370050000FFFFFFFF mov dword [rbx+0x570],0xffffffff
;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension
This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (47 bytes).
If you try to get a balance between code size and performance; then you could hope that the first instruction (that sets the value in RAX) completes before the last instruction/s needs to know the value in RAX. This might be something like this:
mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov [rbx+0x56c],rax
This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.
Now; let's look at the original code and see why it is bad:
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564] ;Massive problem
mov [rbx+0x56C],rax ;Depends on previous instruction
Modern CPUs do have something called "store forwarding", where writes are stored in a buffer and future reads can get the value from this buffer to avoid reading the value from cache. Ironically, this only works if the size of the read is smaller than or equal to the size of the write. The "store forwarding" will not work for this code as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache and then has to read the value from cache; which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything) so that's another problem.
I'd break down the lines like this (I think several people have commented on the same steps).
These two lines come from the inline definition of QSize() http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h
which sets each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint, which is also set at the same time.
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
These lines finally set minSize using 64-bit operations, because the compiler now knows the size of a QSize object. The address of minSize is 0x56c(%rbx).
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
Note: the first part sets two separate fields, and the next part copies a QSize object (regardless of its content). The question then is, should the compiler be smart enough to build a compound 64-bit value because it saw the preset values just earlier? Not sure about that...
In addition to Guillaume's answer, the 64 bit load/store is not aligned. But according to the Intel optimization guide (p 3-62)
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.
Which imo implies that an unaligned load/store that does not cross a cache line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cacheline.
Normally Intel processors use store to load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load of several smaller stores does not store-load-forward but stalls: (p 3-66, p 3-68)
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of
a load which is forwarded from a store must be completely contained
within the store data.
; A. Large load stall
mov mem, eax ; Store dword to address "MEM"
mov mem + 4, ebx ; Store dword to address "MEM + 4"
fld mem ; Load qword at address "MEM", stalls
So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?
EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.