How do I optimize/improve upon this hash function - c++

I have a hashtable that stores quadtree entries.
The hashfunction looks like this:
Quadtree hash
#define node_hash(a,b,c,d) \
(((int)(d))+3*(((int)(c))+3*(((int)(b))+3*((int)(a))+3)))
Note that the result of this operation is always chunked down using a modulus prime number like so:
h = node_hash(p->nw, p->ne, p->sw, p->se) ;
h %= hashprime ;
...
Comparison with optimal hash
Some statistical analysis shows that this hash is optimum in terms of collision reduction.
Given a hashtable with b buckets and n entries. The collision risk using a perfect hash is:
(n - b * (1 - power((b-1)/b,n)))) * 100 / n
When n = b this means a collision risk of 37%.
Some testing shows that the above hash lines up very nicely with the norm (for all fill levels of the hashtable).
Running time
The runtime is heavily dependent on the value of hashprime
Timings (best out of 1000 runs) are:
hashprime CPU-cycles per run
--------------------------------
4049 56
16217 68
64871 127 <-- whoooh
Is there a way to improve on this, whilst still retaining the optimum collision risk?
Either by optimizing the modulus operation (replacing it with a multiplication using 'magic' numbers computer outside the loop).
Replacing the hash function with some other hash function.
Background
The following assembly is produced:
//--------h = node_hash(p->nw, p->ne, p->sw, p->se) ;
mov eax,[rcx+node.nw] <<+
lea eax,[eax+eax*2+3] |
add eax,[rcx+node.ne] |
lea eax,[eax+eax*2] +- takes +/- 12 cycles
add eax,[rcx+node.sw] |
lea eax,[eax+eax*2] |
add eax,[rcx+node.se] <<+
//--------h %= hashprime ;
mov esi,[hashprime]
xor edx,edx
div esi
mov rax,rdx <<--- takes all the rest
[EDIT]
I may be able to do something with the fact that:
C = A % B is equivalent to C = A – B * (A / B)
Using the fact that integer division is the same as multiplication by its reciprocal.
Thus converting the formula to C = A - B * (A * rB)
Note that for integer division the reciprocals are magic numbers, see: http://www.hackersdelight.org/magic.htm
C code is here: http://web.archive.org/web/20070713211039/http://hackersdelight.org/HDcode/magic.c
[FNV hashes]
See: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-1a
hash = offset_basis
for each byte to be hashed
hash = hash xor octet_of_data
hash = hash * FNV_prime (for 32 bits = 16777619)
return hash
For 4 pointers truncated to 32 bits = 16 bytes the FNV hash takes 27 cycles (hand crafted assembly)
Unfortunately this leads to hash collisions of 81% where it should be 37%.
Running the full 15 multiplications takes 179 cycles.

Replacing modulus by reciprocal multiplication
The main cycle eater in this hash function is the modulus operator.
If you replace this division with a multiplication by the reciprocal the calculation is much faster.
Note that calculating the reciprocal involves 3 divides, so this should only be done when the reciprocal can be reused enough times.
OK, here's the code used: http://www.agner.org/optimize/asmlib.zip
From: http://www.agner.org/optimize/
// ;************************* divfixedi64.asm *********************************
// ; Author: Agner Fog
//extern "C" void setdivisoru32(uint Buffer[2], uint d)
asm
mov r8d, edx // x
mov r9, rcx // Buffer
dec r8d // r8d = r8d or esi
mov ecx, -1 // value for bsr if r8d = 0
bsr ecx, r8d // floor(log2(d-1))
inc r8d
inc ecx // L = ceil(log2(d))
mov edx, 1
shl rdx, cl // 2^L (64 bit shift because cl may be 32)
sub edx, r8d
xor eax, eax
div r8d
inc eax
mov [r9], eax // multiplier
sub ecx, 1
setae dl
movzx edx, dl // shift1
seta al
neg al
and al,cl
movzx eax, al // shift 2
shl eax, 8
or eax, edx
mov [r9+4], eax // shift 1 and shift 2
ret
end;
and the code for the modulus operation:
//extern "C" uint modFixedU32(uint Buffer[2], uint d)
asm
mov eax, edx
mov r10d, edx // x
mov r11d, edx // save x
mul dword [rcx] // Buffer (i.e.: m')
sub r10d, edx // x-t
mov ecx, [rcx+4] // shift 1 and shift 2
shr r10d, cl
lea eax, [r10+rdx]
mov cl, ch
shr eax, cl
// Result:= x - m * fastDiv32.dividefixedu32(Buffer, x);
mul r8d // m * ...
sub r11d, eax // x - (m * ...)
mov eax,r11d
ret
end;
The difference in time is as follows:
hashprime classic hash (mod) new hash new old
(# of runs) cycles/run per run (no cache) (no cache)
--------------------------------------------------------------------
4049 56 21 16.6 51
16217 68 not measured
64871 127 89 16.5 50
Cache issues
The increase in cycle time is caused by the data overflowing the cache, causing main memory to be accessed.
This can be seen clearly when I remove cache effects by hashing the same value over and over.

Something like this might be useful:
static inline unsigned int hash4(unsigned int a, unsigned int b,
unsigned int c, unsigned int d) {
unsigned long long foo = 123456789*(long long)a ^ 243956871*(long long)b
^ 918273645*(long long)c ^ 347562981*(long long)d;
return (unsigned int)(foo >> 32);
}
Replace the four odd numbers I typed in with randomly generated 64-bit odd numbers; the ones above won't work that great. (64-bit so that the high 32 bits are somehow a random mix of the lower bits.) This is about as fast as the code you gave, but it lets you use power-of-two table sizes instead of prime table sizes without fear.
The thing everyone uses for similar workloads is the FNV hash. I'm not sure whether FNV actually has better properties than hashes of the type above, but it's similarly fast and it's in rather widespread use.

Assuming hashprime is a constant, you could implement the modulo-operation as bitwise masks. I'm not sure about the details, but maybe this answer can push you in the right direction.

Related

Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems

I'm writing an arbitrary precision integer class to be used in C# (64-bit). Currently I'm working on the multiplication routine, using a recursive divide-and-conquer algorithm to break down the multi-bit multiplication into a series of primitive 64-to-128-bit multiplications, the results of which are recombined then by simple addition. In order to get a significant performance boost, I'm writing the code in native x64 C++, embedded in a C++/CLI wrapper to make it callable from C# code.
It all works great so far, regarding the algorithms. However, my problem is the optimization for speed. Since the 64-to-128-bit multiplication is the real bottleneck here, I tried to optimize my code right there. My first simple approach was a C++ function that implements this multiplication by performing four 32-to-64-bit multiplications and recombining the results with a couple of shifts and adds. This is the source code:
// 64-bit to 128-bit multiplication, using the following decomposition:
// (a*2^32 + i) (b*2^32 + i) = ab*2^64 + (aj + bi)*2^32 + ij
public: static void Mul (UINT64 u8Factor1,
UINT64 u8Factor2,
UINT64& u8ProductL,
UINT64& u8ProductH)
{
UINT64 u8Result1, u8Result2;
UINT64 u8Factor1L = u8Factor1 & 0xFFFFFFFFULL;
UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
UINT64 u8Factor1H = u8Factor1 >> 32;
UINT64 u8Factor2H = u8Factor2 >> 32;
u8ProductL = u8Factor1L * u8Factor2L;
u8ProductH = u8Factor1H * u8Factor2H;
u8Result1 = u8Factor1L * u8Factor2H;
u8Result2 = u8Factor1H * u8Factor2L;
if (u8Result1 > MAX_UINT64 - u8Result2)
{
u8Result1 += u8Result2;
u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
}
else
{
u8Result1 += u8Result2;
u8Result2 = (u8Result1 >> 32);
}
if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
{
u8Result2++;
}
u8ProductL += u8Result1;
u8ProductH += u8Result2;
return;
}
This function expects two 64-bit values and returns a 128-bit result as two 64-bit quantities passed as reference. This works fine. In the next step, I tried to replace the call to this function by ASM code that calls the CPU's MUL instruction. Since there's no inline ASM in x64 mode anymore, the code must be put into a separate .asm file. This is the implementation:
_TEXT segment
; =============================================================================
; multiplication
; -----------------------------------------------------------------------------
; 64-bit to 128-bit multiplication, using the x64 MUL instruction
AsmMul1 proc ; ?AsmMul1##$$FYAX_K0AEA_K1#Z
; ecx : Factor1
; edx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mul rdx ; rdx:rax = Factor1 * Factor2
mov qword ptr [r8], rax ; [r8] = ProductL
mov qword ptr [r9], rdx ; [r9] = ProductH
ret
AsmMul1 endp
; =============================================================================
_TEXT ends
end
That's utmost simple and straightforward. The function is referenced from C++ code using an extern "C" forward definition:
extern "C"
{
void AsmMul1 (UINT64, UINT64, UINT64&, UINT64&);
}
To my surprise, it turned out to be significantly slower than the C++ function. To properly benchmark the performance, I've written a C++ function that computes 10,000,000 pairs of pseudo-random unsigned 64-bit values and performs multiplications in a tight loop, using those implementations one after another, with exactly the same values. The code is compiled in Release mode with optimizations turned on. The time spent in the loop is 515 msec for the ASM version, compared to 125 msec (!) for the C++ version.
That's quite strange. So I opened the disassembly window in the debugger and copied the ASM code generated by the compiler. This is what I found there, slightly edited for readability and for use with MASM:
AsmMul3 proc ; ?AsmMul3##$$FYAX_K0AEA_K1#Z
; ecx : Factor1
; edx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov eax, 0FFFFFFFFh
and rax, rcx
; UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
mov r10d, 0FFFFFFFFh
and r10, rdx
; UINT64 u8Factor1H = u8Factor1 >> 32;
shr rcx, 20h
; UINT64 u8Factor2H = u8Factor2 >> 32;
shr rdx, 20h
; u8ProductL = u8Factor1L * u8Factor2L;
mov r11, r10
imul r11, rax
mov qword ptr [r8], r11
; u8ProductH = u8Factor1H * u8Factor2H;
mov r11, rdx
imul r11, rcx
mov qword ptr [r9], r11
; u8Result1 = u8Factor1L * u8Factor2H;
imul rax, rdx
; u8Result2 = u8Factor1H * u8Factor2L;
mov rdx, rcx
imul rdx, r10
; if (u8Result1 > MAX_UINT64 - u8Result2)
mov rcx, rdx
neg rcx
dec rcx
cmp rcx, rax
jae label1
; u8Result1 += u8Result2;
add rax, rdx
; u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
mov rdx, rax
shr rdx, 20h
mov rcx, 100000000h
or rcx, rdx
jmp label2
; u8Result1 += u8Result2;
label1:
add rax, rdx
; u8Result2 = (u8Result1 >> 32);
mov rcx, rax
shr rcx, 20h
; if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
label2:
shl rax, 20h
mov rdx, qword ptr [r8]
mov r10, rax
neg r10
dec r10
cmp r10, rdx
jae label3
; u8Result2++;
inc rcx
; u8ProductL += u8Result1;
label3:
add rdx, rax
mov qword ptr [r8], rdx
; u8ProductH += u8Result2;
add qword ptr [r9], rcx
ret
AsmMul3 endp
Copying this code into my MASM source file and calling it from my benchmark routine resulted in 547 msec spent in the loop. That's slightly slower than the ASM function, and considerably slower than the C++ function. That's even stranger, since the latter are supposed to execute exactly the same machine code.
So I tried another variant, this time using hand-optimized ASM code that does exactly the same four 32-to-64-bit multiplications, but in a more straightforward way. The code should avoid jumps and immediate values, make use of the CPU FLAGS for carry evaluation, and use interleaving of instructions in order to avoid register stalls. This is what I came up with:
; 64-bit to 128-bit multiplication, using the following decomposition:
; (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij
AsmMul2 proc ; ?AsmMul2##$$FYAX_K0AEA_K1#Z
; ecx : Factor1
; edx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mov r11, rdx ; r11 = Factor2
shr rax, 32 ; rax = Factor1H
shr r11, 32 ; r11 = Factor2H
and ecx, ecx ; rcx = Factor1L
mov r10d, eax ; r10 = Factor1H
and edx, edx ; rdx = Factor2L
imul rax, r11 ; rax = ab = Factor1H * Factor2H
imul r10, rdx ; r10 = aj = Factor1H * Factor2L
imul r11, rcx ; r11 = bi = Factor1L * Factor2H
imul rdx, rcx ; rdx = ij = Factor1L * Factor2L
xor ecx, ecx ; rcx = 0
add r10, r11 ; r10 = aj + bi
adc ecx, ecx ; rcx = carry (aj + bi)
mov r11, r10 ; r11 = aj + bi
shl rcx, 32 ; rcx = carry (aj + bi) << 32
shl r10, 32 ; r10 = lower (aj + bi) << 32
shr r11, 32 ; r11 = upper (aj + bi) >> 32
add rdx, r10 ; rdx = ij + (lower (aj + bi) << 32)
adc rax, r11 ; rax = ab + (upper (aj + bi) >> 32)
mov qword ptr [r8], rdx ; save ProductL
add rax, rcx ; add carry (aj + bi) << 32
mov qword ptr [r9], rax ; save ProductH
ret
AsmMul2 endp
The benchmark yielded 500 msec, so this seems to be the fastest version of those three ASM implementations. However, the performance differences of them are quite marginal - but all of them are about four times slower than the naive C++ approach!
So what's going on here? It seems to me that there's some general performance penalty for calling ASM code from C++, but I can't find anything on the internet that might explain it. The way I'm interfacing ASM is exactly how Microsoft recommends it.
But now, watch out for another still stranger thing! Well, there are compiler intrinsics, anren't they? The _umul128 intrinsic supposedly should do exactly what my AsmMul1 function does, i.e. call the 64-bit CPU MUL instruction. So I replaced the AsmMul1 call by a corresponding call to _umul128. Now see what performance values I've got in return (again, I'm running all four benchmarks sequentially in a single function):
_umul128: 109 msec
AsmMul2: 94 msec (hand-optimized ASM)
AsmMul3: 125 msec (compiler-generated ASM)
C++ function: 828 msec
Now the ASM versions are blazingly fast, with about the same relative differences as before. However, the C++ function is terribly lazy now! Somehow the use of an intrinsic turns the entire performance values upside down. Scary...
I haven't got any explanation for this strange behavior, and would be thankful at least for any hints about what's going on here. It would be even better if someone could explain how to get these performance issues under control. Currently I'm quite worried, because obviously a small change in the code can have huge performance impacts. I would like to understand the mechanisms underlying here, and how to get reliable results.
And another thing: Why is the 64-to-128-bit MUL slower than four 64-to-64-bit IMULs?!
After a lot of trial-and-error, and additional extensive research on the Internet, it seems I've found the reason for this strange performance behavior. The magic word is thunking of function entry points. But let me start from the beginning.
One observation I made is that it doesn't really matter which compiler intrinsic is used in order to turn my benchmark results upside down. Actually, it suffices to put a __nop() (CPU NOP opcode) anywhere inside a function to trigger this effect. It works even if it's placed right before the return. More tests have shown that the effect is restricted to the function that contains the intrinsic. The __nop() does nothing with respect to the code flow, but obviously it changes the properties of the containing function.
I've found a question on stackoverflow that seems to tackle a similar problem: How to best avoid double thunking in C++/CLI native types In the comments, the following additional information is found:
One of my own classes in our base library - which uses MFC - is called
about a million times.
We are seeing massive sporadic performance issues, and firing up the
profiler I can see a thunk right at the bottom of this chain. That
thunk takes longer than the method call.
That's exactly what I'm observing as well - "something" on the way of the function call is taking about four times longer than my code. Function thunks are explained to some extend in the documentation of the __clrcall modifier and in an article about Double Thunking. In the former, there's a hint to a side effect of using intrinsics:
You can directly call __clrcall functions from existing C++ code that
was compiled by using /clr as long as that function has an MSIL
implementation. __clrcall functions cannot be called directly from
functions that have inline asm and call CPU-specific intrinisics, for
example, even if those functions are compiled with /clr.
So, as far as I understand it, a function that contains intrinsics loses its __clrcall modifier which is added automatically when the /clr compiler switch is specified - which is usually the case if the C++ functions should be compiled to native code.
I don't get all of the details of this thunking and double thunking stuff, but obviously it is required to make unmanaged functions callable from managed functions. However, it is possible to switch it off per function by embedding it into a #pragma managed(push, off) / #pragma managed(pop) pair. Unfortunately, this #pragma doesn't work inside namespace blocks, so some editing might be required to place it everywhere where it is supposed to occur.
I've tried this trick, placing all of my native multi-precision code inside this #pragma, and got the following benchmark results:
AsmMul1: 78 msec (64-to-128-bit CPU MUL)
AsmMul2: 94 msec (hand-optimized ASM, 4 x IMUL)
AsmMul3: 125 msec (compiler-generated ASM, 4 x IMUL)
C++ function: 109 msec
Now this looks reasonable, finally! Now all versions have about the same execution times, which is what I would expect from an optimized C++ program. Alas, there's still no happy end... Placing the winner AsmMul1 into my multi-precision multiplier yielded twice the execution time of the version with the C++ function without #pragma. The explanation is, in my opinion, that this code makes calls to unmanaged functions in other classes, which are outside the #pragma and hence have a __clrcall modifier. This seems to create significant overhead again.
Frankly, I'm tired of investigating further into this issue. Although the ASM PROC with the single MUL instruction seems to beat all other attempts, the gain is not as big as expected, and getting the thunking out of the way leads to so many changes in my code that I don't think it's worth the hassle. So I'll go on with the C++ function I've written in the very beginning, originally destined to be just a placeholder for something better...
It seems to me that ASM interfacing in C++/CLI is not well supported, or maybe I'm still missing something basic here. Maybe there's a way to get this function thunking out of the way for just the ASM functions, but so far I haven't found a solution. Not even remotely.
Feel free to add your own thoughts and observations here - even if they are just speculative. I think it's still a highly interesting topic that needs much more investigation.

Bit trick to detect if any of some integers has a specific value

Is there any clever bit trick to detect if any of a small number of integers (say 3 or 4) has a specific value?
The straightforward
bool test(int a, int b, int c, int d)
{
// The compiler will pretty likely optimize it to (a == d | b == d | c == d)
return (a == d || b == d || c == d);
}
in GCC compiles to
test(int, int, int, int):
cmp ecx, esi
sete al
cmp ecx, edx
sete dl
or eax, edx
cmp edi, ecx
sete dl
or eax, edx
ret
Those sete instructions have higher latency than I want to tolerate, so I would rather use something bitwise (&, |, ^, ~) stuff and a single comparison.
The only solution I've found yet is:
int s1 = ((a-d) >> 31) | ((d-a) >> 31);
int s2 = ((b-d) >> 31) | ((d-b) >> 31);
int s3 = ((c-d) >> 31) | ((d-c) >> 31);
int s = s1 & s2 & s3;
return (s & 1) == 0;
alternative variant:
int s1 = (a-d) | (d-a);
int s2 = (b-d) | (d-b);
int s3 = (c-d) | (d-c);
int s = (s1 & s2 & s3);
return (s & 0x80000000) == 0;
both are translated to:
mov eax, ecx
sub eax, edi
sub edi, ecx
or edi, eax
mov eax, ecx
sub eax, esi
sub esi, ecx
or esi, eax
and esi, edi
mov eax, edx
sub eax, ecx
sub ecx, edx
or ecx, eax
test esi, ecx
setns al
ret
which has less sete instructions, but obviously more mov/sub.
Update: as BeeOnRope# suggested - it makes sense to cast input variables to unsigned
It is not a full bit trick. Any zero yields a zero product, which gives a zero result. Negate 0 yields a 1. Does not deal with overflow.
bool test(int a, int b, int c, int d)
{
return !((a^d)*(b^d)*(c^d));
}
gcc 7.1 -O3 output. (d is in ecx, the other inputs start in other integer regs).
xor edi, ecx
xor esi, ecx
xor edx, ecx
imul edi, esi
imul edx, edi
test edx, edx
sete al
ret
It might be faster than the original on Core2 or Nehalem where partial-register stalls are a problem. imul r32,r32 has 3c latency on Core2/Nehalem (and later Intel CPUs), and 1 per clock throughput, so this sequence has 7 cycle latency from the inputs to the 2nd imul result, and another 2 cycles of latency for test/sete. Throughput should be fairly good if this sequence runs on multiple independent inputs.
Using a 64-bit multiply would avoid the overflow problem on the first multiply, but the second could still overflow if the total is >= 2**64. It would still be the same performance on Intel Nehalem and Sandybridge-family, and AMD Ryzen. But it would be slower on older CPUs.
In x86 asm, doing the second multiply with a full-multiply one-operand mul instruction (64x64b => 128b) would avoid overflow, and the result could be checked for being all-zero or not with or rax,rdx. We can write that in GNU C for 64-bit targets (where __int128 is available)
bool test_mulwide(unsigned a, unsigned b, unsigned c, unsigned d)
{
unsigned __int128 mul1 = (a^d)*(unsigned long long)(b^d);
return !(mul1*(c^d));
}
and gcc/clang really do emit the asm we hoped for (each with some useless mov instructions):
# gcc -O3 for x86-64 SysV ABI
mov eax, esi
xor edi, ecx
xor eax, ecx
xor ecx, edx # zero-extends
imul rax, rdi
mul rcx # 64 bit inputs (rax implicit), 128b output in rdx:rax
mov rsi, rax # this is useless
or rsi, rdx
sete al
ret
This should be almost as fast as the simple version that can overflow, on modern x86-64. (mul r64 is still only 3c latency, but 2 uops instead of 1 for imul r64,r64 that doesn't produce the high-half), on Intel Sandybridge-family.)
It's still probably worse than clang's setcc/or output from the original version, which uses 8-bit or instructions to avoid reading 32-bit registers after writing the low byte (i.e. no partial-register stalls).
See both sources with both compilers on the Godbolt compiler explorer. (Also included: #BeeOnRope's ^ / & version that risks false positives, with and without a fallback to a full check.)

Compiler generates costly MOVZX instruction

My profiler has identified the following function profiling as the hotspot.
typedef unsigned short ushort;
bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
for( unsigned i = 0; i < dim; ++i )
{
if( point[i + 1] >= coord[i] ) return false;
}
return true;
}
In particular one assembly instruction MOVZX (Move with Zero-Extend) is responsible for the bulk of the runtime. The if statement is compiled into
mov rcx, QWORD PTR [rdi]
lea r8d, [rax+1]
add rsi, 2
movzx r9d, WORD PTR [rsi-2]
mov rax, r8
cmp WORD PTR [rcx+r8*2], r9w
jae .L5
I'd like to coax the compiler out of generating this instruction but I suppose I first need to understand why this instruction is generated. Why the widening/zero extension, considering that I'm working with the same data type?
(Find the entire function on godbolt compiler explorer.)
Thank you for the good question!
Clearing Registers and Dependency Breaking Idioms
A Quote from the Intel® 64 and IA-32 Architectures
Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
movzx vs mov
The compiler knows that movzx is not costly and uses it as often as possible. It may take more bytes to encode movzx than mov, but it is not expensive to execute.
Contrary to the logic, a program with movzx (that fills the entire registers) actually works faster than with just mov, which only sets lower parts of the registers.
Let me demonstrate this conclusion to you on the following code fragment. It is part of the code that implements CRC-32 calculation using the Slicing by-N algorithm. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
skipped 6 more similar triplets that do movzx, shr, xor.
dec <<<a counter register >>>>
jnz …… <<repeat the whole loop again>>>
Here is the second code fragment. We have cleared ecx in advance, and now just instead of “movzx ecx, bl” do “mov cl, bl”:
// ecx is already cleared here to 0
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
<<< and so on – as in the example #1>>>
Now guess which of the two above code fragments runs faster? Did you think previously that the speed is the same, or the movzx version is slower? In fact, the movzx code is faster because all the CPUs since Pentium Pro do Out-Of-Order execution of instructions and register renaming.
Register Renaming
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
Let me just take the first 4 instructions from the first code fragment:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
As you see, instruction 4 depends on instruction 2. Instruction 4 does not rely on the result of instruction 3.
So the CPU could execute instructions 3 and 4 in parallel (together), but instruction 3 uses the register (read-only) modified by instruction 4, thus instruction 4 may only start executing after instruction 3 fully completes. Let us then rename the register ecx to edx after the first triplet to avoid this dependency:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
shr ebx, 8
xor eax, dword ptr [edx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
Here is what we have now:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
Now instruction 4 in no way uses any register needed for instruction 3, and vice versa, so instructions 3 and 4 can execute simultaneously for sure!
This is what the CPU does for us. The CPU, when translating instructions to micro-operations (micro-ops) which the Out-of-order algorithm will execute, renames the registers internally to eliminate these dependencies, so the micro-ops deal with renamed, internal registers, rather than with the real ones as we know them. Thus we don't need to rename registers ourselves as I have just renamed in the above example – the CPU will automatically rename everything for us while translating instructions to micro-ops.
The micro-ops of instruction 3 and instruction 4 will be executed in parallel, since micro-ops of instruction 4 will deal with entirely different internal register (exposed to outside as ecx) than micro-ops of instruction 3, so we don't need to rename anything.
Let me revert the code to the initial version. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
(instructions 3 and 4 run in parallel because ecx of instruction 3 is not that ecx as of instruction 4, but a different, renamed register – the CPU has automatically allocated for instruction 4 micro-ops a new, fresh register from the pool of internally available registers).
Now let us go back to movxz vs mov.
Movzx clears a register entirely, so the CPU for sure knows that we do not depend on any previous value that remained in higher bits of the register. When the CPU sees the movxz instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with previous instructions. Now take the first 4 instructions from our example #2, where we use mov rather than movzx:
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
In this case, instruction 4, by modifying cl, modifies bits 0-7 of the ecx, leaving bits 8-32 unchanged. Thus the CPU cannot just rename the register for instruction 4 and allocate another, fresh register, because instruction 4 depends on bits 8-32 left from previous instructions. The CPU has to preserve bits 8-32 before it can execute instruction 4. Thus it cannot just rename the register. It will wait until instruction 3 completes before executing instruction 4. Instruction 4 didn't become fully independent - it depends on the previous value of ECX and the previous value of bl. So it depends on two registers at once. If we had used movzx, it would have depended on just one register - bl. Consequently, instructions 3 and 4 would not run in parallel because of their interdependence. Sad but true.
That's why it is always faster to operate complete registers. Suppose we need only to modify a part of the register. In that case, it's always quicker to alter the entire register (for example, use movzx) – to let the CPU know for sure that the register no longer depends on its previous value. Modifying complete registers allows the CPU to rename the register and let the Out-of-order execution algorithm execute this instruction together with the other instructions, rather than execute them one-by-one.
The movzx instruction zero extends a quantity into a register of larger size. In your case, a word (two bytes) is zero extended into a dword (four bytes). Zero extending itself is usually free, the slow part is loading the memory operand WORD PTR [rsi-2] from RAM.
To speed this up, you can try to ensure that the datum you want to fetch from RAM is in the L1 cache at the time you need it. You can do this by placing strategic prefetch intrinsics into an appropriate place. For example, assuming that one cache line is 64 bytes, you could add a prefetch intrinsic to fetch array entry i + 32 every time you go through the loop.
You can also consider an algorithmic improvement such that less data needs to be fetched from memory, but that seems unlikely to be possible.

Is shifting or multiplying faster and why?

What generally is a faster solution, multiplying or bit shifting?
If I want to multiply by 10000, which code would be faster?
v = (v<<13) + (v<<11) + (v<<4) - (v<<8);
or
v = 10000*v;
And the second part of the question - How to find the lowest number of shifts required to do some multiplication? (I'm intereseted in multiplying by 10000, 1000 and 100).
It really depends on the architecture of the processor, as well as the compiler that you're using.
But you can simply view the dis-assembly of each option, and see for yourself.
Here is what I got using Visual-Studio 2010 compiler for Pentium:
int v2 = (v<<13) + (v<<11) + (v<<4) - (v<<8);
mov eax,dword ptr [v]
shl eax,0Dh
mov ecx,dword ptr [v]
shl ecx,0Bh
add eax,ecx
mov edx,dword ptr [v]
shl edx,4
add eax,edx
mov ecx,dword ptr [v]
shl ecx,8
sub eax,ecx
mov dword ptr [v2],eax
int v2 = 10000*v;
mov eax,dword ptr [v]
imul eax,eax,2710h
mov dword ptr [v2],eax
So it appears that the second option is faster in my case.
BTW, you might get a different result if you enable optimization (mine was disabled)...
To the first question: Don't bother. The compiler knows better and will optimize it according to the respective target hardware.
To the second question: Look at the binary representation:
For example: bin(10000) = 0b10011100010000:
1 0 0 1 1 1 0 0 0 1 0 0 0 0
13 12 11 10 9 8 7 6 5 4 3 2 1 0
So you have to shift by 13, 10, 9, 8 and 4. If you want to shortcut consecutive ones (by subtracting as in your question) you need at least three consecutive ones in order to gain anything.
But again, let the compiler do this. It's his job.
there is only one situation in which shift operation are faster than *, and it's defined by two condition:
the operation is with a value power of two
when you multiply with a fractional number -> division.
Let's look a little deeper:
multiplication/division, shift operation are done by units in the HW
architecture; usually you have shifters, multipliers/dividers to
perform these operations but each of the operation is performed by a
different set of registers inside a Arithmeric Locgic Unit.
multiplication/division with a power of two is equivalent to a
left_shift/right_shift operation
if you are not dealing with power of 2 than multiplication and division are performed slightly differently:
Multiplication is performed by the HW ( ALU unit) in a single instrucion (depending on the data type but let's not overcomplicate things)
Division is performed in a loop as consecutive subtractions -> more than one instruction
Summarizing:
multiplication is only one instruction; while replacing
multiplication with a series of shift operations is multiple
instruction -> the first option is faster (even on a parallel
architecture)
multiplication with a power of two is the same as a shift operation; the compiler usually generates a shift when it detects this in the code.
division is multiple instruction; replaving this with a series of shifts might prove faster but it depends on each situation.
division with a power of two is multiple operations and can be replaced with a single right_shift operation; a smart compiler will
do this automatically
An older Microsoft C compiler optimized the shift sequence using lea (load effective address), which allows multiples of 5:
lea eax, DWORD PTR [eax+eax*4] ;eax = v*5
lea ecx, DWORD PTR [eax+eax*4] ;ecx = v*25
lea edx, DWORD PTR [ecx+ecx*4] ;edx = v*125
lea eax, DWORD PTR [edx+edx*4] ;eax = v*625
shl eax, 4 ;eax = v*10000
multiply (signed or unsigned) was still faster on my system with Intel 2600K 3.4ghz. Visual Studio 2005 and 2012 multiplied v*10256, then subtracted (v<<8). Shift and add / subtract sequence was slower than the lea method above:
shl eax,4 ;ecx = v*(16)
mov ecx,eax
shl eax,4 ;ecx = v*(16-256)
sub ecx,eax
shl eax,3 ;ecx = v*(16-256+2048)
add ecx,eax
shl eax,2 ;eax = v*(16-256+2048+8192) = v*(10000)
add eax,ecx

Performance optimization with modular addition when the modulus is of special form

I'm writing a program which performs millions of modular additions. For more efficiency, I started thinking about how machine-level instructions can be used to implement modular additions.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>
int main()
{
unsigned int x, y, z, i;
clock_t t1, t2;
x = y = 0x90000000;
t1 = clock();
for(i = 0; i <20000000 ; i++)
z = (x + y) % 0x100000000ULL;
t2 = clock();
printf("%x\n", z);
printf("%u\n", (int)(t2-t1));
return 0;
}
Compiling using GCC with the following options (I used -O0 to prevent GCC from unfolding the loop):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions,
I expected the performance gain by proper choice of the modulus to be at least 100%. However, running the output, I noted that the speed is increased by only 10%. I suspected the OS is using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to 1 didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second code is around 12 times slower than the first example!
Art nailed it.
For the power-of-2 modulus, the code for the computation generated with -O0 and -O3 is identical, the difference is the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
Running times for both moduli are similar with unoptimised code. About the same as the running time without any modulus. About the same as the running time of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop control code
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register (here) and decremented, then the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time, removing the computation from the generated assembly now reduces the running time by a factor of 3 resp. 2.
So when compiled with -O0, you're not measuring the speed of the computation, but the speed of the loop control code, thus the small difference. With optimisations, you are measuring both, computation and loop control, and the difference of speed in the computation shows clearly.
The difference between the two boils down to the fact that divisions by powers of 2 can be transformed easily in logic instruction.
a/n where n is power of two is equivalent to a >> log2 n
for the modulo it's the same
a mod n can be rendered by a & (n-1)
But in your case it goes even further than that:
your value 0x100000000ULL is 2^32. This means that any unsigned 32bit variable will automatically be a modulo 2^32 value.
The compiler was smart enough to remove the operation because it is an unnecessary operation on 32 bit variables. The ULL specifier
doesn't change that fact.
For the value 0x0F0000000 which fits in a 32 bit variable, the compiler can not elide the operation. It uses a transformation that
seems faster than a division operation.