Compiler generates costly MOVZX instruction - c++

My profiler has identified the following function profiling as the hotspot.
typedef unsigned short ushort;
bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
for( unsigned i = 0; i < dim; ++i )
{
if( point[i + 1] >= coord[i] ) return false;
}
return true;
}
In particular one assembly instruction MOVZX (Move with Zero-Extend) is responsible for the bulk of the runtime. The if statement is compiled into
mov rcx, QWORD PTR [rdi]
lea r8d, [rax+1]
add rsi, 2
movzx r9d, WORD PTR [rsi-2]
mov rax, r8
cmp WORD PTR [rcx+r8*2], r9w
jae .L5
I'd like to coax the compiler out of generating this instruction but I suppose I first need to understand why this instruction is generated. Why the widening/zero extension, considering that I'm working with the same data type?
(Find the entire function on godbolt compiler explorer.)

Thank you for the good question!
Clearing Registers and Dependency Breaking Idioms
A Quote from the Intel® 64 and IA-32 Architectures
Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
movzx vs mov
The compiler knows that movzx is not costly and uses it as often as possible. It may take more bytes to encode movzx than mov, but it is not expensive to execute.
Contrary to the logic, a program with movzx (that fills the entire registers) actually works faster than with just mov, which only sets lower parts of the registers.
Let me demonstrate this conclusion to you on the following code fragment. It is part of the code that implements CRC-32 calculation using the Slicing by-N algorithm. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
skipped 6 more similar triplets that do movzx, shr, xor.
dec <<<a counter register >>>>
jnz …… <<repeat the whole loop again>>>
Here is the second code fragment. We have cleared ecx in advance, and now just instead of “movzx ecx, bl” do “mov cl, bl”:
// ecx is already cleared here to 0
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
<<< and so on – as in the example #1>>>
Now guess which of the two above code fragments runs faster? Did you think previously that the speed is the same, or the movzx version is slower? In fact, the movzx code is faster because all the CPUs since Pentium Pro do Out-Of-Order execution of instructions and register renaming.
Register Renaming
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
Let me just take the first 4 instructions from the first code fragment:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
As you see, instruction 4 depends on instruction 2. Instruction 4 does not rely on the result of instruction 3.
So the CPU could execute instructions 3 and 4 in parallel (together), but instruction 3 uses the register (read-only) modified by instruction 4, thus instruction 4 may only start executing after instruction 3 fully completes. Let us then rename the register ecx to edx after the first triplet to avoid this dependency:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
shr ebx, 8
xor eax, dword ptr [edx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
Here is what we have now:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
Now instruction 4 in no way uses any register needed for instruction 3, and vice versa, so instructions 3 and 4 can execute simultaneously for sure!
This is what the CPU does for us. The CPU, when translating instructions to micro-operations (micro-ops) which the Out-of-order algorithm will execute, renames the registers internally to eliminate these dependencies, so the micro-ops deal with renamed, internal registers, rather than with the real ones as we know them. Thus we don't need to rename registers ourselves as I have just renamed in the above example – the CPU will automatically rename everything for us while translating instructions to micro-ops.
The micro-ops of instruction 3 and instruction 4 will be executed in parallel, since micro-ops of instruction 4 will deal with entirely different internal register (exposed to outside as ecx) than micro-ops of instruction 3, so we don't need to rename anything.
Let me revert the code to the initial version. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
(instructions 3 and 4 run in parallel because ecx of instruction 3 is not that ecx as of instruction 4, but a different, renamed register – the CPU has automatically allocated for instruction 4 micro-ops a new, fresh register from the pool of internally available registers).
Now let us go back to movxz vs mov.
Movzx clears a register entirely, so the CPU for sure knows that we do not depend on any previous value that remained in higher bits of the register. When the CPU sees the movxz instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with previous instructions. Now take the first 4 instructions from our example #2, where we use mov rather than movzx:
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
In this case, instruction 4, by modifying cl, modifies bits 0-7 of the ecx, leaving bits 8-32 unchanged. Thus the CPU cannot just rename the register for instruction 4 and allocate another, fresh register, because instruction 4 depends on bits 8-32 left from previous instructions. The CPU has to preserve bits 8-32 before it can execute instruction 4. Thus it cannot just rename the register. It will wait until instruction 3 completes before executing instruction 4. Instruction 4 didn't become fully independent - it depends on the previous value of ECX and the previous value of bl. So it depends on two registers at once. If we had used movzx, it would have depended on just one register - bl. Consequently, instructions 3 and 4 would not run in parallel because of their interdependence. Sad but true.
That's why it is always faster to operate complete registers. Suppose we need only to modify a part of the register. In that case, it's always quicker to alter the entire register (for example, use movzx) – to let the CPU know for sure that the register no longer depends on its previous value. Modifying complete registers allows the CPU to rename the register and let the Out-of-order execution algorithm execute this instruction together with the other instructions, rather than execute them one-by-one.

The movzx instruction zero extends a quantity into a register of larger size. In your case, a word (two bytes) is zero extended into a dword (four bytes). Zero extending itself is usually free, the slow part is loading the memory operand WORD PTR [rsi-2] from RAM.
To speed this up, you can try to ensure that the datum you want to fetch from RAM is in the L1 cache at the time you need it. You can do this by placing strategic prefetch intrinsics into an appropriate place. For example, assuming that one cache line is 64 bytes, you could add a prefetch intrinsic to fetch array entry i + 32 every time you go through the loop.
You can also consider an algorithmic improvement such that less data needs to be fetched from memory, but that seems unlikely to be possible.

Related

Performance differ significantly when inner loop iterates less times?

I wrote a toy program that compares the performance of two very similar functions. The entire file (minus a couple of macros) looks like this:
constexpr int iterations = 100000;
using u64 = uint64_t;
// Slow.
void test1()
{
u64 sum = 0;
for (int i = 0; i < iterations; i++) {
for (int j = 0; j < 4; j++)
sum += i + j;
doNotOptimize(sum);
}
}
// Fast.
void test2()
{
u64 sum = 0;
for (int i = 0; i < iterations; i++) {
for (int j = 0; j < 10; j++)
sum += i + j;
doNotOptimize(sum);
}
}
void driver()
{
int runs = 1000;
BENCH(runs, test1);
BENCH(runs, test2);
}
I am measuring 1000 executions of each function using __rdtsc and computing the average. With this formula, I am seeing a performance difference of ~172,000 (ticks?) between test1 and test2. What's surprising is that test2 is the one that's faster.
An exotic little detail is that the only magic numbers for which test1 is slower are 4, 8, and 16. If I change the internal loop's condition to j < x where x is anything but those 3 numbers, performance match up.
In the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations performed as operands of lea. So in this case, it would make sense if both functions were equally fast. But that's not at all what's happening. Here's the disassembly and the program source in its entirety: https://godbolt.org/z/d5PsG4YeY
So what's really going on? Is something wrong with my measurements?
Execution environment:
Processor: Intel(R) Core(TM) i5-7200U CPU # 2.50GHz (Kaby Lake)
L1 Cache: 128Kib
OS: Linux (64-bit)
Toolchain: GCC Version 10.3.0
Compiler Options: -O3 -fno-tree-vectorize
4 and 8 are scale factors that x86 addressing modes support, tempting GCC into using a slow-LEA on the critical path dependency chain when adding 4*i or 8*i to the sum along with the constant sum of 1..4 or 1..8 (either way just a constant, irrelevant what it is). Apparently also as part of the dep chain for 16. And you used inline asm to force the sum dep chain to include a store/reload.
Analyzing the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations done as operands of lea. So in this case, it would make sense if both functions ran at the same speed.
They're both fast, but the different multiples of i take different instructions. So there's little reason to assume they'd be the same. The interesting thing here is the one with more total instructions is faster, because it has a shorter dependency chain through sum.
And you forced a store/reload of it with the somewhat-clunky asm volatile("" :: "g"(&sum) : "memory"), instead of just requiring the compiler to have the value in a register with asm volatile("" : "+r"(sum)). So that dep chain includes store-forwarding latency (typically 3 to 5 cycles) so it's a bigger bottleneck than front-end or ALU throughput of the independent work.
test1():
xor eax, eax # i = 0
xor ecx, ecx # sum = 0
lea rsi, [rsp-8] # operand for inline asm
jmp .L3
.L5:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx # i = i+1
.L3:
lea rdx, [rax+1] # i+1, leaving the old value in RAX
lea rax, [rcx+6+rax*4] # sum += 4*i + 6. (1+2+3)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L5
ret
An LEA with 3 components (two + operations) in the addressing mode has 3 cycle latency on your Skylake-derived CPU. See x86 Multiplication with 3: IMUL vs SHL + ADD
So the loop-carried dependency chain is a slow-LEA (3 cycles) between load/store.
test2():
xor eax, eax
xor ecx, ecx
lea rdi, [rsp-8] # operand for inline asm
jmp .L8
.L9:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx
.L8:
lea rsi, [rax+36+rax*8]
lea rdx, [rax+1]
lea rax, [rax+9+rsi] # prepare some stuff to be added
add rax, rcx # first use of the RCX load result (sum)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L9
ret
So the loop-carried dep chain through the store/reload only includes an add, 1 cycle latency instead of 3.
I assume your performance ratios were something like 3:4 or so, from the 5+1 cycles (6) vs. 5+3 (8) cycles.
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details.
The compiler could have spent more instructions in test1 to reduce the critical path latency to 1 cycle there, instead of folding more work into an lea on the critical path. For a loop running 100k iterations, this is pretty much a missed optimization. But I assume the optimizer isn't tuned for artificially-induced spill/reload from inline asm; normally it would only have to introduce that store/reload if it ran out of registers from doing a lot of stuff in a loop, and that many different values would usually mean there was some instruction-level parallelism.
GCC makes better asm from simpler source on Quick-bench
#TedLyngmo linked the code on Quick-bench without -fno-tree-vectorize, and using benchmark::DoNotOptimize(sum); which only forces GCC to materialize the value in a register, without blocking constant-propagation through it, or as many other optimizations. Unlike taking its address and telling GCC that was potentially globally visible, like the current custom asm.
The inner loop body is just add %rdx,%rax / add $0x4,%rdx (and cmp rdx + jne as the loop branch), if you look at the asm on Quickbench. Or with rdx+=10 for the other loop. So same loops, different constants. Same performance, of course.
The current source is essentially compiling to asm that does
for (int i = 0 ; i<iterations ; i++){
sum += 8*i + 1234;
force_store_reload(&sum);
}
But if you actually write it that way (https://godbolt.org/z/4ja38or9j), we get asm like quick-bench, except for keeping the value in memory instead of a register. (So about 6x slower.)
.L6:
add QWORD PTR [rsp-8], rax # memory destination add
add rax, 4
cmp rax, 401234
jne .L6
It seems to be a missed-optimization bug that GCC doesn't compile your existing source to that. Specifically, missing the strength-reduction from 8*i re-evaluated for every i into tmp += 8.
BTW, it looks like omitting -fno-tree-vectorize makes your original test1() compile even worse. It starts without jumping into the middle of the loop, but it has a longer dep chain.
#gcc 10.3 -O3 (without -fno-tree-vectorize)
test1_orig():
mov QWORD PTR [rsp-8], 6 # init sum
lea rsi, [rsp-8] # operand for inline asm
mov eax, 1
.L2:
mov rdx, QWORD PTR [rsp-8] # load sum
mov rcx, rax
add rdx, rax # used: 1 cycle latency
add rax, 1
add rdx, rax # used: 1 cycle latency
lea rdx, [rdx+5+rcx*2] # used: 3 cycle latency
mov QWORD PTR [rsp-8], rdx # store
cmp rax, 100000
jne .L2
ret

Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems

I'm writing an arbitrary precision integer class to be used in C# (64-bit). Currently I'm working on the multiplication routine, using a recursive divide-and-conquer algorithm to break down the multi-bit multiplication into a series of primitive 64-to-128-bit multiplications, the results of which are recombined then by simple addition. In order to get a significant performance boost, I'm writing the code in native x64 C++, embedded in a C++/CLI wrapper to make it callable from C# code.
It all works great so far, regarding the algorithms. However, my problem is the optimization for speed. Since the 64-to-128-bit multiplication is the real bottleneck here, I tried to optimize my code right there. My first simple approach was a C++ function that implements this multiplication by performing four 32-to-64-bit multiplications and recombining the results with a couple of shifts and adds. This is the source code:
// 64-bit to 128-bit multiplication, using the following decomposition:
// (a*2^32 + i) (b*2^32 + i) = ab*2^64 + (aj + bi)*2^32 + ij
public: static void Mul (UINT64 u8Factor1,
UINT64 u8Factor2,
UINT64& u8ProductL,
UINT64& u8ProductH)
{
UINT64 u8Result1, u8Result2;
UINT64 u8Factor1L = u8Factor1 & 0xFFFFFFFFULL;
UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
UINT64 u8Factor1H = u8Factor1 >> 32;
UINT64 u8Factor2H = u8Factor2 >> 32;
u8ProductL = u8Factor1L * u8Factor2L;
u8ProductH = u8Factor1H * u8Factor2H;
u8Result1 = u8Factor1L * u8Factor2H;
u8Result2 = u8Factor1H * u8Factor2L;
if (u8Result1 > MAX_UINT64 - u8Result2)
{
u8Result1 += u8Result2;
u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
}
else
{
u8Result1 += u8Result2;
u8Result2 = (u8Result1 >> 32);
}
if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
{
u8Result2++;
}
u8ProductL += u8Result1;
u8ProductH += u8Result2;
return;
}
This function expects two 64-bit values and returns a 128-bit result as two 64-bit quantities passed as reference. This works fine. In the next step, I tried to replace the call to this function by ASM code that calls the CPU's MUL instruction. Since there's no inline ASM in x64 mode anymore, the code must be put into a separate .asm file. This is the implementation:
_TEXT segment
; =============================================================================
; multiplication
; -----------------------------------------------------------------------------
; 64-bit to 128-bit multiplication, using the x64 MUL instruction
AsmMul1 proc ; ?AsmMul1##$$FYAX_K0AEA_K1#Z
; ecx : Factor1
; edx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mul rdx ; rdx:rax = Factor1 * Factor2
mov qword ptr [r8], rax ; [r8] = ProductL
mov qword ptr [r9], rdx ; [r9] = ProductH
ret
AsmMul1 endp
; =============================================================================
_TEXT ends
end
That's utmost simple and straightforward. The function is referenced from C++ code using an extern "C" forward definition:
extern "C"
{
void AsmMul1 (UINT64, UINT64, UINT64&, UINT64&);
}
To my surprise, it turned out to be significantly slower than the C++ function. To properly benchmark the performance, I've written a C++ function that computes 10,000,000 pairs of pseudo-random unsigned 64-bit values and performs multiplications in a tight loop, using those implementations one after another, with exactly the same values. The code is compiled in Release mode with optimizations turned on. The time spent in the loop is 515 msec for the ASM version, compared to 125 msec (!) for the C++ version.
That's quite strange. So I opened the disassembly window in the debugger and copied the ASM code generated by the compiler. This is what I found there, slightly edited for readability and for use with MASM:
AsmMul3 proc ; ?AsmMul3##$$FYAX_K0AEA_K1#Z
; ecx : Factor1
; edx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov eax, 0FFFFFFFFh
and rax, rcx
; UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
mov r10d, 0FFFFFFFFh
and r10, rdx
; UINT64 u8Factor1H = u8Factor1 >> 32;
shr rcx, 20h
; UINT64 u8Factor2H = u8Factor2 >> 32;
shr rdx, 20h
; u8ProductL = u8Factor1L * u8Factor2L;
mov r11, r10
imul r11, rax
mov qword ptr [r8], r11
; u8ProductH = u8Factor1H * u8Factor2H;
mov r11, rdx
imul r11, rcx
mov qword ptr [r9], r11
; u8Result1 = u8Factor1L * u8Factor2H;
imul rax, rdx
; u8Result2 = u8Factor1H * u8Factor2L;
mov rdx, rcx
imul rdx, r10
; if (u8Result1 > MAX_UINT64 - u8Result2)
mov rcx, rdx
neg rcx
dec rcx
cmp rcx, rax
jae label1
; u8Result1 += u8Result2;
add rax, rdx
; u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
mov rdx, rax
shr rdx, 20h
mov rcx, 100000000h
or rcx, rdx
jmp label2
; u8Result1 += u8Result2;
label1:
add rax, rdx
; u8Result2 = (u8Result1 >> 32);
mov rcx, rax
shr rcx, 20h
; if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
label2:
shl rax, 20h
mov rdx, qword ptr [r8]
mov r10, rax
neg r10
dec r10
cmp r10, rdx
jae label3
; u8Result2++;
inc rcx
; u8ProductL += u8Result1;
label3:
add rdx, rax
mov qword ptr [r8], rdx
; u8ProductH += u8Result2;
add qword ptr [r9], rcx
ret
AsmMul3 endp
Copying this code into my MASM source file and calling it from my benchmark routine resulted in 547 msec spent in the loop. That's slightly slower than the ASM function, and considerably slower than the C++ function. That's even stranger, since the latter are supposed to execute exactly the same machine code.
So I tried another variant, this time using hand-optimized ASM code that does exactly the same four 32-to-64-bit multiplications, but in a more straightforward way. The code should avoid jumps and immediate values, make use of the CPU FLAGS for carry evaluation, and use interleaving of instructions in order to avoid register stalls. This is what I came up with:
; 64-bit to 128-bit multiplication, using the following decomposition:
; (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij
AsmMul2 proc ; ?AsmMul2##$$FYAX_K0AEA_K1#Z
; ecx : Factor1
; edx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mov r11, rdx ; r11 = Factor2
shr rax, 32 ; rax = Factor1H
shr r11, 32 ; r11 = Factor2H
and ecx, ecx ; rcx = Factor1L
mov r10d, eax ; r10 = Factor1H
and edx, edx ; rdx = Factor2L
imul rax, r11 ; rax = ab = Factor1H * Factor2H
imul r10, rdx ; r10 = aj = Factor1H * Factor2L
imul r11, rcx ; r11 = bi = Factor1L * Factor2H
imul rdx, rcx ; rdx = ij = Factor1L * Factor2L
xor ecx, ecx ; rcx = 0
add r10, r11 ; r10 = aj + bi
adc ecx, ecx ; rcx = carry (aj + bi)
mov r11, r10 ; r11 = aj + bi
shl rcx, 32 ; rcx = carry (aj + bi) << 32
shl r10, 32 ; r10 = lower (aj + bi) << 32
shr r11, 32 ; r11 = upper (aj + bi) >> 32
add rdx, r10 ; rdx = ij + (lower (aj + bi) << 32)
adc rax, r11 ; rax = ab + (upper (aj + bi) >> 32)
mov qword ptr [r8], rdx ; save ProductL
add rax, rcx ; add carry (aj + bi) << 32
mov qword ptr [r9], rax ; save ProductH
ret
AsmMul2 endp
The benchmark yielded 500 msec, so this seems to be the fastest version of those three ASM implementations. However, the performance differences of them are quite marginal - but all of them are about four times slower than the naive C++ approach!
So what's going on here? It seems to me that there's some general performance penalty for calling ASM code from C++, but I can't find anything on the internet that might explain it. The way I'm interfacing ASM is exactly how Microsoft recommends it.
But now, watch out for another still stranger thing! Well, there are compiler intrinsics, anren't they? The _umul128 intrinsic supposedly should do exactly what my AsmMul1 function does, i.e. call the 64-bit CPU MUL instruction. So I replaced the AsmMul1 call by a corresponding call to _umul128. Now see what performance values I've got in return (again, I'm running all four benchmarks sequentially in a single function):
_umul128: 109 msec
AsmMul2: 94 msec (hand-optimized ASM)
AsmMul3: 125 msec (compiler-generated ASM)
C++ function: 828 msec
Now the ASM versions are blazingly fast, with about the same relative differences as before. However, the C++ function is terribly lazy now! Somehow the use of an intrinsic turns the entire performance values upside down. Scary...
I haven't got any explanation for this strange behavior, and would be thankful at least for any hints about what's going on here. It would be even better if someone could explain how to get these performance issues under control. Currently I'm quite worried, because obviously a small change in the code can have huge performance impacts. I would like to understand the mechanisms underlying here, and how to get reliable results.
And another thing: Why is the 64-to-128-bit MUL slower than four 64-to-64-bit IMULs?!
After a lot of trial-and-error, and additional extensive research on the Internet, it seems I've found the reason for this strange performance behavior. The magic word is thunking of function entry points. But let me start from the beginning.
One observation I made is that it doesn't really matter which compiler intrinsic is used in order to turn my benchmark results upside down. Actually, it suffices to put a __nop() (CPU NOP opcode) anywhere inside a function to trigger this effect. It works even if it's placed right before the return. More tests have shown that the effect is restricted to the function that contains the intrinsic. The __nop() does nothing with respect to the code flow, but obviously it changes the properties of the containing function.
I've found a question on stackoverflow that seems to tackle a similar problem: How to best avoid double thunking in C++/CLI native types In the comments, the following additional information is found:
One of my own classes in our base library - which uses MFC - is called
about a million times.
We are seeing massive sporadic performance issues, and firing up the
profiler I can see a thunk right at the bottom of this chain. That
thunk takes longer than the method call.
That's exactly what I'm observing as well - "something" on the way of the function call is taking about four times longer than my code. Function thunks are explained to some extend in the documentation of the __clrcall modifier and in an article about Double Thunking. In the former, there's a hint to a side effect of using intrinsics:
You can directly call __clrcall functions from existing C++ code that
was compiled by using /clr as long as that function has an MSIL
implementation. __clrcall functions cannot be called directly from
functions that have inline asm and call CPU-specific intrinisics, for
example, even if those functions are compiled with /clr.
So, as far as I understand it, a function that contains intrinsics loses its __clrcall modifier which is added automatically when the /clr compiler switch is specified - which is usually the case if the C++ functions should be compiled to native code.
I don't get all of the details of this thunking and double thunking stuff, but obviously it is required to make unmanaged functions callable from managed functions. However, it is possible to switch it off per function by embedding it into a #pragma managed(push, off) / #pragma managed(pop) pair. Unfortunately, this #pragma doesn't work inside namespace blocks, so some editing might be required to place it everywhere where it is supposed to occur.
I've tried this trick, placing all of my native multi-precision code inside this #pragma, and got the following benchmark results:
AsmMul1: 78 msec (64-to-128-bit CPU MUL)
AsmMul2: 94 msec (hand-optimized ASM, 4 x IMUL)
AsmMul3: 125 msec (compiler-generated ASM, 4 x IMUL)
C++ function: 109 msec
Now this looks reasonable, finally! Now all versions have about the same execution times, which is what I would expect from an optimized C++ program. Alas, there's still no happy end... Placing the winner AsmMul1 into my multi-precision multiplier yielded twice the execution time of the version with the C++ function without #pragma. The explanation is, in my opinion, that this code makes calls to unmanaged functions in other classes, which are outside the #pragma and hence have a __clrcall modifier. This seems to create significant overhead again.
Frankly, I'm tired of investigating further into this issue. Although the ASM PROC with the single MUL instruction seems to beat all other attempts, the gain is not as big as expected, and getting the thunking out of the way leads to so many changes in my code that I don't think it's worth the hassle. So I'll go on with the C++ function I've written in the very beginning, originally destined to be just a placeholder for something better...
It seems to me that ASM interfacing in C++/CLI is not well supported, or maybe I'm still missing something basic here. Maybe there's a way to get this function thunking out of the way for just the ASM functions, but so far I haven't found a solution. Not even remotely.
Feel free to add your own thoughts and observations here - even if they are just speculative. I think it's still a highly interesting topic that needs much more investigation.

How to implement a synthetic benchmark?

I'm doing my first programs in C++ and assembly code. I already know how to program in C++, but I have a lot of problems when I try to program in assembly code.
I want to do a synthetic benchmark, which
"is designed to mimic a particular type of workload on a component or system. Synthetic benchmarks do this by specially created programs that impose the workload on the component." (Wikipedia)
For example, if I want to calculate the factorial of long fact = pow(3.0, 2000), how can I measure the performance of a component in C++? (And not the performance of the whole system).
The rest of the code (the calculation of fact) is done in assembly code.
Following is part of one of my benchmarks (for Linux) that uses assembly code, where repetitive calculations can be carried out without fear of over optimisation. You need to use a high resolution timer with an assembly based loop long enough for reasonable execution time. You might want to repeat calculations in the loop to fill up pipelines.
This one repeats 20M adds 10 times to find maximum speed. IntCount1 value is checked at the end as a simple integrity check
C Code
intCount1 = 0;
max = 0;
for (i=0; i<10; i++)
{
count = intCount1;
start_time();
_mips1Reg();
end_time();
count = intCount1 - count;
mips = (int)((double)count / 1000000.0 / secs + 0.5);
if(mips > max) max = mips;
}
mipsReg[0] = max;
printf(" 1 Register %7d 32 Bit Integer MIPS\n", mipsReg[0]);
########################################################
Hi-Res Timer Used
clock_gettime(CLOCK_REALTIME, &tp1);
theseSecs = tp1.tv_sec + tp1.tv_nsec / 1e9;
########################################################
Assembly Code
global _mips1Reg
_mips1Reg:
push eax
push ebx
push ecx
push edx
push edi
mov edi, 1000000
mov eax, [intCount1]
align 8
dlp:add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 3
dec edi
jnz dlp
mov [intCount1], eax
pop edi
pop edx
pop ecx
pop ebx
pop eax
ret

Is shifting or multiplying faster and why?

What generally is a faster solution, multiplying or bit shifting?
If I want to multiply by 10000, which code would be faster?
v = (v<<13) + (v<<11) + (v<<4) - (v<<8);
or
v = 10000*v;
And the second part of the question - How to find the lowest number of shifts required to do some multiplication? (I'm intereseted in multiplying by 10000, 1000 and 100).
It really depends on the architecture of the processor, as well as the compiler that you're using.
But you can simply view the dis-assembly of each option, and see for yourself.
Here is what I got using Visual-Studio 2010 compiler for Pentium:
int v2 = (v<<13) + (v<<11) + (v<<4) - (v<<8);
mov eax,dword ptr [v]
shl eax,0Dh
mov ecx,dword ptr [v]
shl ecx,0Bh
add eax,ecx
mov edx,dword ptr [v]
shl edx,4
add eax,edx
mov ecx,dword ptr [v]
shl ecx,8
sub eax,ecx
mov dword ptr [v2],eax
int v2 = 10000*v;
mov eax,dword ptr [v]
imul eax,eax,2710h
mov dword ptr [v2],eax
So it appears that the second option is faster in my case.
BTW, you might get a different result if you enable optimization (mine was disabled)...
To the first question: Don't bother. The compiler knows better and will optimize it according to the respective target hardware.
To the second question: Look at the binary representation:
For example: bin(10000) = 0b10011100010000:
1 0 0 1 1 1 0 0 0 1 0 0 0 0
13 12 11 10 9 8 7 6 5 4 3 2 1 0
So you have to shift by 13, 10, 9, 8 and 4. If you want to shortcut consecutive ones (by subtracting as in your question) you need at least three consecutive ones in order to gain anything.
But again, let the compiler do this. It's his job.
there is only one situation in which shift operation are faster than *, and it's defined by two condition:
the operation is with a value power of two
when you multiply with a fractional number -> division.
Let's look a little deeper:
multiplication/division, shift operation are done by units in the HW
architecture; usually you have shifters, multipliers/dividers to
perform these operations but each of the operation is performed by a
different set of registers inside a Arithmeric Locgic Unit.
multiplication/division with a power of two is equivalent to a
left_shift/right_shift operation
if you are not dealing with power of 2 than multiplication and division are performed slightly differently:
Multiplication is performed by the HW ( ALU unit) in a single instrucion (depending on the data type but let's not overcomplicate things)
Division is performed in a loop as consecutive subtractions -> more than one instruction
Summarizing:
multiplication is only one instruction; while replacing
multiplication with a series of shift operations is multiple
instruction -> the first option is faster (even on a parallel
architecture)
multiplication with a power of two is the same as a shift operation; the compiler usually generates a shift when it detects this in the code.
division is multiple instruction; replaving this with a series of shifts might prove faster but it depends on each situation.
division with a power of two is multiple operations and can be replaced with a single right_shift operation; a smart compiler will
do this automatically
An older Microsoft C compiler optimized the shift sequence using lea (load effective address), which allows multiples of 5:
lea eax, DWORD PTR [eax+eax*4] ;eax = v*5
lea ecx, DWORD PTR [eax+eax*4] ;ecx = v*25
lea edx, DWORD PTR [ecx+ecx*4] ;edx = v*125
lea eax, DWORD PTR [edx+edx*4] ;eax = v*625
shl eax, 4 ;eax = v*10000
multiply (signed or unsigned) was still faster on my system with Intel 2600K 3.4ghz. Visual Studio 2005 and 2012 multiplied v*10256, then subtracted (v<<8). Shift and add / subtract sequence was slower than the lea method above:
shl eax,4 ;ecx = v*(16)
mov ecx,eax
shl eax,4 ;ecx = v*(16-256)
sub ecx,eax
shl eax,3 ;ecx = v*(16-256+2048)
add ecx,eax
shl eax,2 ;eax = v*(16-256+2048+8192) = v*(10000)
add eax,ecx

Performance optimization with modular addition when the modulus is of special form

I'm writing a program which performs millions of modular additions. For more efficiency, I started thinking about how machine-level instructions can be used to implement modular additions.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>
int main()
{
unsigned int x, y, z, i;
clock_t t1, t2;
x = y = 0x90000000;
t1 = clock();
for(i = 0; i <20000000 ; i++)
z = (x + y) % 0x100000000ULL;
t2 = clock();
printf("%x\n", z);
printf("%u\n", (int)(t2-t1));
return 0;
}
Compiling using GCC with the following options (I used -O0 to prevent GCC from unfolding the loop):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions,
I expected the performance gain by proper choice of the modulus to be at least 100%. However, running the output, I noted that the speed is increased by only 10%. I suspected the OS is using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to 1 didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second code is around 12 times slower than the first example!
Art nailed it.
For the power-of-2 modulus, the code for the computation generated with -O0 and -O3 is identical, the difference is the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
Running times for both moduli are similar with unoptimised code. About the same as the running time without any modulus. About the same as the running time of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop control code
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register (here) and decremented, then the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time, removing the computation from the generated assembly now reduces the running time by a factor of 3 resp. 2.
So when compiled with -O0, you're not measuring the speed of the computation, but the speed of the loop control code, thus the small difference. With optimisations, you are measuring both, computation and loop control, and the difference of speed in the computation shows clearly.
The difference between the two boils down to the fact that divisions by powers of 2 can be transformed easily in logic instruction.
a/n where n is power of two is equivalent to a >> log2 n
for the modulo it's the same
a mod n can be rendered by a & (n-1)
But in your case it goes even further than that:
your value 0x100000000ULL is 2^32. This means that any unsigned 32bit variable will automatically be a modulo 2^32 value.
The compiler was smart enough to remove the operation because it is an unnecessary operation on 32 bit variables. The ULL specifier
doesn't change that fact.
For the value 0x0F0000000 which fits in a 32 bit variable, the compiler can not elide the operation. It uses a transformation that
seems faster than a division operation.