What generally is a faster solution, multiplying or bit shifting?
If I want to multiply by 10000, which code would be faster?
v = (v<<13) + (v<<11) + (v<<4) - (v<<8);
or
v = 10000*v;
And the second part of the question: how do I find the lowest number of shifts required for a given multiplication? (I'm interested in multiplying by 10000, 1000 and 100.)
It really depends on the architecture of the processor, as well as the compiler that you're using.
But you can simply view the disassembly of each option and see for yourself.
Here is what I got using the Visual Studio 2010 compiler for Pentium:
int v2 = (v<<13) + (v<<11) + (v<<4) - (v<<8);
mov eax,dword ptr [v]
shl eax,0Dh
mov ecx,dword ptr [v]
shl ecx,0Bh
add eax,ecx
mov edx,dword ptr [v]
shl edx,4
add eax,edx
mov ecx,dword ptr [v]
shl ecx,8
sub eax,ecx
mov dword ptr [v2],eax
int v2 = 10000*v;
mov eax,dword ptr [v]
imul eax,eax,2710h
mov dword ptr [v2],eax
So it appears that the second option is faster in my case.
BTW, you might get a different result if you enable optimization (mine was disabled)...
To the first question: Don't bother. The compiler knows better and will optimize it according to the respective target hardware.
To the second question: Look at the binary representation:
For example: bin(10000) = 0b10011100010000:
1 0 0 1 1 1 0 0 0 1 0 0 0 0
13 12 11 10 9 8 7 6 5 4 3 2 1 0
So you have to shift by 13, 10, 9, 8 and 4. If you want to shortcut consecutive ones (by subtracting as in your question) you need at least three consecutive ones in order to gain anything.
But again, let the compiler do this. That's its job.
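The bit-position rule above can be sketched in C++ roughly as follows. The names mul_by_shifts and naive_shift_count are made up for illustration. This is the naive decomposition (one shift per set bit of the constant), not necessarily the minimum once subtraction tricks are allowed.

```cpp
#include <cassert>
#include <cstdint>

// Multiply v by k using one shift-and-add per set bit of k.
// Hypothetical helper, for illustration only.
std::uint32_t mul_by_shifts(std::uint32_t v, std::uint32_t k) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        if (k & (std::uint32_t{1} << bit)) {
            result += v << bit;  // unsigned arithmetic wraps modulo 2^32
        }
    }
    return result;
}

// Number of shifts the naive decomposition needs: the count of set bits in k.
int naive_shift_count(std::uint32_t k) {
    int count = 0;
    for (; k != 0; k &= k - 1) {  // clears the lowest set bit each round
        ++count;
    }
    return count;
}
```

For 10000 this counts the five set bits (13, 10, 9, 8, 4) listed above. The question's four-shift version saves one by writing the run 10, 9, 8 as (v<<11) - (v<<8).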
There is only one situation in which shift operations are faster than *, and it is defined by two conditions:
the operation is with a value that is a power of two
when you multiply by a fractional number -> division.
Let's look a little deeper:
multiplication/division and shift operations are done by units in the HW architecture; usually you have shifters and multipliers/dividers to perform these operations, but each operation is performed by a different set of registers inside an Arithmetic Logic Unit.
multiplication/division by a power of two is equivalent to a left_shift/right_shift operation
if you are not dealing with a power of 2, then multiplication and division are performed slightly differently:
Multiplication is performed by the HW (the ALU) in a single instruction (depending on the data type, but let's not overcomplicate things)
Division is performed in a loop as consecutive subtractions -> more than one instruction
Summarizing:
multiplication is only one instruction, while replacing multiplication with a series of shift operations takes multiple instructions -> the first option is faster (even on a parallel architecture)
multiplication by a power of two is the same as a shift operation; the compiler usually generates a shift when it detects this in the code.
division is multiple instructions; replacing it with a series of shifts might prove faster, but it depends on each situation.
division by a power of two is multiple operations and can be replaced with a single right_shift operation; a smart compiler will do this automatically
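As a minimal sketch of the power-of-two equivalence described above, for unsigned values only (the function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Multiplying by 2^3 is the same as shifting left by 3 (modulo 2^32).
std::uint32_t times_eight(std::uint32_t v) { return v << 3; }

// Dividing an unsigned value by 2^3 is the same as shifting right by 3.
std::uint32_t div_by_eight(std::uint32_t v) { return v >> 3; }
```

For signed values the two are not identical: / truncates toward zero while an arithmetic right shift rounds toward negative infinity, so compilers emit a small fix-up around the shift.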
An older Microsoft C compiler optimized the shift sequence using lea (load effective address), which allows multiples of 5:
lea eax, DWORD PTR [eax+eax*4] ;eax = v*5
lea ecx, DWORD PTR [eax+eax*4] ;ecx = v*25
lea edx, DWORD PTR [ecx+ecx*4] ;edx = v*125
lea eax, DWORD PTR [edx+edx*4] ;eax = v*625
shl eax, 4 ;eax = v*10000
Multiply (signed or unsigned) was still faster on my system with an Intel 2600K at 3.4 GHz. Visual Studio 2005 and 2012 multiplied v*10256, then subtracted (v<<8). The shift and add/subtract sequence was slower than the lea method above:
shl eax,4 ;eax = v*(16)
mov ecx,eax ;ecx = v*(16)
shl eax,4 ;ecx = v*(16-256)
sub ecx,eax
shl eax,3 ;ecx = v*(16-256+2048)
add ecx,eax
shl eax,2 ;eax = v*(16-256+2048+8192) = v*(10000)
add eax,ecx
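To make the arithmetic behind the two sequences easier to follow, here is a C++ sketch that mirrors them step by step. The function names are invented for illustration, and a compiler is free to turn either one into a plain imul.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the lea chain: each step computes x + x*4 = 5*x, then a final shift by 4.
std::uint32_t mul10000_lea_style(std::uint32_t v) {
    std::uint32_t x = v + v * 4;  // v*5
    x = x + x * 4;                // v*25
    x = x + x * 4;                // v*125
    x = x + x * 4;                // v*625
    return x << 4;                // v*625*16 = v*10000
}

// Mirrors the shift and add/subtract sequence: 16 - 256 + 2048 + 8192 = 10000.
std::uint32_t mul10000_shift_style(std::uint32_t v) {
    return (v << 4) - (v << 8) + (v << 11) + (v << 13);
}
```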
I wrote a toy program that compares the performance of two very similar functions. The entire file (minus a couple of macros) looks like this:
constexpr int iterations = 100000;
using u64 = uint64_t;
// Slow.
void test1()
{
u64 sum = 0;
for (int i = 0; i < iterations; i++) {
for (int j = 0; j < 4; j++)
sum += i + j;
doNotOptimize(sum);
}
}
// Fast.
void test2()
{
u64 sum = 0;
for (int i = 0; i < iterations; i++) {
for (int j = 0; j < 10; j++)
sum += i + j;
doNotOptimize(sum);
}
}
void driver()
{
int runs = 1000;
BENCH(runs, test1);
BENCH(runs, test2);
}
I am measuring 1000 executions of each function using __rdtsc and computing the average. With this formula, I am seeing a performance difference of ~172,000 (ticks?) between test1 and test2. What's surprising is that test2 is the one that's faster.
An odd detail is that the only magic numbers for which test1 is slower are 4, 8, and 16. If I change the inner loop's condition to j < x, where x is anything but those three numbers, the performance matches up.
In the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations performed as operands of lea. So in this case, it would make sense if both functions were equally fast. But that's not at all what's happening. Here's the disassembly and the program source in its entirety: https://godbolt.org/z/d5PsG4YeY
So what's really going on? Is something wrong with my measurements?
Execution environment:
Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Kaby Lake)
L1 Cache: 128 KiB
OS: Linux (64-bit)
Toolchain: GCC Version 10.3.0
Compiler Options: -O3 -fno-tree-vectorize
4 and 8 are scale factors that x86 addressing modes support, tempting GCC into using a slow-LEA on the critical path dependency chain when adding 4*i or 8*i to the sum along with the constant sum of 1..4 or 1..8 (either way just a constant, irrelevant what it is). Apparently also as part of the dep chain for 16. And you used inline asm to force the sum dep chain to include a store/reload.
Analyzing the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations done as operands of lea. So in this case, it would make sense if both functions ran at the same speed.
They're both fast, but the different multiples of i take different instructions. So there's little reason to assume they'd be the same. The interesting thing here is the one with more total instructions is faster, because it has a shorter dependency chain through sum.
And you forced a store/reload of it with the somewhat-clunky asm volatile("" :: "g"(&sum) : "memory"), instead of just requiring the compiler to have the value in a register with asm volatile("" : "+r"(sum)). So that dep chain includes store-forwarding latency (typically 3 to 5 cycles) so it's a bigger bottleneck than front-end or ALU throughput of the independent work.
test1():
xor eax, eax # i = 0
xor ecx, ecx # sum = 0
lea rsi, [rsp-8] # operand for inline asm
jmp .L3
.L5:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx # i = i+1
.L3:
lea rdx, [rax+1] # i+1, leaving the old value in RAX
lea rax, [rcx+6+rax*4] # sum += 4*i + 6. (1+2+3)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L5
ret
An LEA with 3 components (two + operations) in the addressing mode has 3 cycle latency on your Skylake-derived CPU. See x86 Multiplication with 3: IMUL vs SHL + ADD
So the loop-carried dependency chain is a slow-LEA (3 cycles) between load/store.
test2():
xor eax, eax
xor ecx, ecx
lea rdi, [rsp-8] # operand for inline asm
jmp .L8
.L9:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx
.L8:
lea rsi, [rax+36+rax*8]
lea rdx, [rax+1]
lea rax, [rax+9+rsi] # prepare some stuff to be added
add rax, rcx # first use of the RCX load result (sum)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L9
ret
So the loop-carried dep chain through the store/reload only includes an add, 1 cycle latency instead of 3.
I assume your performance ratios were something like 3:4 or so, from the 5+1 cycles (6) vs. 5+3 (8) cycles.
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details.
The compiler could have spent more instructions in test1 to reduce the critical path latency to 1 cycle there, instead of folding more work into an lea on the critical path. For a loop running 100k iterations, this is pretty much a missed optimization. But I assume the optimizer isn't tuned for artificially-induced spill/reload from inline asm; normally it would only have to introduce that store/reload if it ran out of registers from doing a lot of stuff in a loop, and that many different values would usually mean there was some instruction-level parallelism.
GCC makes better asm from simpler source on Quick-bench
@TedLyngmo linked the code on Quick-bench without -fno-tree-vectorize, and using benchmark::DoNotOptimize(sum); which only forces GCC to materialize the value in a register, without blocking constant-propagation through it or as many other optimizations. That is unlike taking its address and telling GCC it was potentially globally visible, as the current custom asm does.
The inner loop body is just add %rdx,%rax / add $0x4,%rdx (and cmp rdx + jne as the loop branch), if you look at the asm on Quickbench. Or with rdx+=10 for the other loop. So same loops, different constants. Same performance, of course.
The current source is essentially compiling to asm that does
for (int i = 0 ; i<iterations ; i++){
sum += 8*i + 1234;
force_store_reload(&sum);
}
But if you actually write it that way (https://godbolt.org/z/4ja38or9j), we get asm like quick-bench, except for keeping the value in memory instead of a register. (So about 6x slower.)
.L6:
add QWORD PTR [rsp-8], rax # memory destination add
add rax, 4
cmp rax, 401234
jne .L6
It seems to be a missed-optimization bug that GCC doesn't compile your existing source to that. Specifically, missing the strength-reduction from 8*i re-evaluated for every i into tmp += 8.
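A minimal C++ sketch of the strength reduction mentioned above, using the constants 8 and 1234 from the snippet. Both function names are hypothetical and the store/reload barrier is left out, so this only shows the arithmetic transformation, not the benchmark.

```cpp
#include <cassert>
#include <cstdint>

// Re-evaluates 8*i for every i, as the original source effectively does.
std::uint64_t sum_recomputed(int iterations) {
    std::uint64_t sum = 0;
    for (int i = 0; i < iterations; ++i) {
        sum += 8ull * i + 1234;
    }
    return sum;
}

// Strength-reduced form: the multiply becomes a running term bumped by 8.
std::uint64_t sum_strength_reduced(int iterations) {
    std::uint64_t sum = 0;
    std::uint64_t term = 1234;  // value of 8*i + 1234 at i = 0
    for (int i = 0; i < iterations; ++i) {
        sum += term;
        term += 8;              // tmp += 8 instead of recomputing 8*i
    }
    return sum;
}
```

In the reduced form the only loop-carried work on sum is a single add, which is the short dependency chain the answer is after.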
BTW, it looks like omitting -fno-tree-vectorize makes your original test1() compile even worse. It starts without jumping into the middle of the loop, but it has a longer dep chain.
#gcc 10.3 -O3 (without -fno-tree-vectorize)
test1_orig():
mov QWORD PTR [rsp-8], 6 # init sum
lea rsi, [rsp-8] # operand for inline asm
mov eax, 1
.L2:
mov rdx, QWORD PTR [rsp-8] # load sum
mov rcx, rax
add rdx, rax # used: 1 cycle latency
add rax, 1
add rdx, rax # used: 1 cycle latency
lea rdx, [rdx+5+rcx*2] # used: 3 cycle latency
mov QWORD PTR [rsp-8], rdx # store
cmp rax, 100000
jne .L2
ret
For my project I need to read a vector from a file, compare it with two vectors A and B, and find which of A and B is closer to the vector read from the file.
I already did the C++ part (reading the values of X from the file, etc.).
For example: for X(1,3,5) and A(2,4,6), the distance from A to X is |2-1|+|4-3|+|6-5| = 3. Then I need to do the same operation for B and find which value is smaller (which means closer to the X vector).
Basically, for arrays of size 3, I need to find the differences between the 1st, 2nd and 3rd elements of X and A, sum their absolute values, do the same for B, and then compare the two results.
But I'm really stuck with the assembly part.
So far I know that to find the distance I need the absolute value, but before using the code below I need to find the difference between two elements and then apply it to get the absolute value.
Here is the code for finding the absolute value; I don't know if it helps:
mov ebx, eax ; move eax to ebx
neg eax ; eax = -eax
cmovl eax, ebx ; if negative move ebx back to eax
But my main problem is: how can I take the first elements of both X and A and get the difference between them in assembly? (I need to do this for the 2nd and 3rd values of both arrays as well.) Then I need to do the same operations for X and B, but if you show me how for A, I'm sure I can apply the same algorithm to B.
My C++ prototype of the assembly function is this:
distance(int n, int * Xptr, int * Aptr, int * Bptr);
and A and B are defined as arrays with 3 members.
You access an array using indirect addressing.
Like so:
;ecx = number of items in the array (dword elements)
push ebx
push esi
push edi
xor ebx,ebx ;outcome is zero.
mov esi,Array1 ;esi = address of array1
mov edi,Array2 ;edi = address of array2
lea esi,[esi+ecx*4] ;esi = end of array1 (4 bytes per item)
lea edi,[edi+ecx*4] ;edi = end of array2
neg ecx ;start at the beginning of each array
jz done ;count is zero, nothing to do
next: ;for (i=0;i<count;i++)
mov edx,[edi+ecx*4] ;edx = Array2[i] or Array2[start+length-count]
mov eax,[esi+ecx*4] ;eax = Array1[i]
sub eax,edx ;calculate difference
cdq ;edx = eax < 0? -1:0
add eax, edx
xor eax, edx ;eax = abs(eax)
add ebx,eax
inc ecx ;i++
jnz next
done:
mov eax,ebx
pop edi
pop esi
pop ebx
ret
Let me walk you through the code.
We start with setting the sum to zero and setting pointers to the array.
Then we negate the count and update the pointers to the end of the array.
This seemingly complicated setup is a speed hack: it allows you to count from -count to zero without having to keep an extra variable around to track the array indexes.
Then we do some magic to do an abs without having to do jumps or conditional moves.
You call this routine twice. Once to get abs(A[]-X[]) and again to get abs(B[]-X[]).
For the abs trick, see: https://www.strchr.com/optimized_abs_function
You'll have to do some changes to adjust it to your calling convention. I leave this as an exercise for the reader. You might adjust the code to do all of the comparisons in one go, which I also leave up to the reader.
Just for fun, let's pick apart the abs sample:
Alt-A            cycles  bytes    Alt-B          cycles  bytes
mov ebx, eax     0       2        cdq            1       1
neg eax          1       2        add eax,edx    1       2
cmovl eax, ebx   2       3        xor eax,edx    1       2
As you can see there is very little difference between the two samples. I just prefer the cdq variant because it's more elegant.
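For reference, the same computation can be sketched in C++. The names abs_branchless and l1_distance are hypothetical, and the sketch assumes 32-bit int with an arithmetic right shift for negative values (guaranteed since C++20, and what mainstream compilers did before that).

```cpp
#include <cassert>
#include <cstdint>

// Branchless abs mirroring cdq / add / xor. Done in unsigned arithmetic so
// that the most negative input does not overflow.
std::uint32_t abs_branchless(std::int32_t x) {
    std::uint32_t mask = static_cast<std::uint32_t>(x >> 31);  // 0 or 0xFFFFFFFF, like edx after cdq
    return (static_cast<std::uint32_t>(x) + mask) ^ mask;
}

// Sum of absolute element differences between two n-element arrays.
std::uint32_t l1_distance(int n, const int* x, const int* a) {
    std::uint32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += abs_branchless(x[i] - a[i]);
    }
    return sum;
}
```

Calling l1_distance once with A and once with B and comparing the two results gives the answer the question asks for. It is also a handy reference to check the assembly routine against.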
My profiler has identified the following function as the hotspot.
typedef unsigned short ushort;
bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
for( unsigned i = 0; i < dim; ++i )
{
if( point[i + 1] >= coord[i] ) return false;
}
return true;
}
In particular one assembly instruction MOVZX (Move with Zero-Extend) is responsible for the bulk of the runtime. The if statement is compiled into
mov rcx, QWORD PTR [rdi]
lea r8d, [rax+1]
add rsi, 2
movzx r9d, WORD PTR [rsi-2]
mov rax, r8
cmp WORD PTR [rcx+r8*2], r9w
jae .L5
I'd like to coax the compiler out of generating this instruction but I suppose I first need to understand why this instruction is generated. Why the widening/zero extension, considering that I'm working with the same data type?
(Find the entire function on godbolt compiler explorer.)
Thank you for the good question!
Clearing Registers and Dependency Breaking Idioms
A quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
movzx vs mov
The compiler knows that movzx is not costly and uses it as often as possible. It may take more bytes to encode movzx than mov, but it is not expensive to execute.
Counterintuitively, a program that uses movzx (which fills the entire register) actually runs faster than one that uses just mov, which only sets the lower part of the register.
Let me demonstrate this conclusion to you on the following code fragment. It is part of the code that implements CRC-32 calculation using the Slicing by-N algorithm. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
skipped 6 more similar triplets that do movzx, shr, xor.
dec <<<a counter register >>>>
jnz …… <<repeat the whole loop again>>>
Here is the second code fragment. We have cleared ecx in advance, and now just instead of “movzx ecx, bl” do “mov cl, bl”:
// ecx is already cleared here to 0
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
<<< and so on – as in the example #1>>>
Now guess which of the two above code fragments runs faster? Did you think previously that the speed is the same, or the movzx version is slower? In fact, the movzx code is faster because all the CPUs since Pentium Pro do Out-Of-Order execution of instructions and register renaming.
Register Renaming
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
Let me just take the first 4 instructions from the first code fragment:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
As you see, instruction 4 depends on instruction 2. Instruction 4 does not rely on the result of instruction 3.
So the CPU could execute instructions 3 and 4 in parallel (together), but instruction 3 uses the register (read-only) modified by instruction 4, thus instruction 4 may only start executing after instruction 3 fully completes. Let us then rename the register ecx to edx after the first triplet to avoid this dependency:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
shr ebx, 8
xor eax, dword ptr [edx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
Here is what we have now:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
Now instruction 4 in no way uses any register needed for instruction 3, and vice versa, so instructions 3 and 4 can execute simultaneously for sure!
This is what the CPU does for us. The CPU, when translating instructions to micro-operations (micro-ops) which the Out-of-order algorithm will execute, renames the registers internally to eliminate these dependencies, so the micro-ops deal with renamed, internal registers, rather than with the real ones as we know them. Thus we don't need to rename registers ourselves as I have just renamed in the above example – the CPU will automatically rename everything for us while translating instructions to micro-ops.
The micro-ops of instruction 3 and instruction 4 will be executed in parallel, since micro-ops of instruction 4 will deal with entirely different internal register (exposed to outside as ecx) than micro-ops of instruction 3, so we don't need to rename anything.
Let me revert the code to the initial version. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
(instructions 3 and 4 run in parallel because ecx of instruction 3 is not that ecx as of instruction 4, but a different, renamed register – the CPU has automatically allocated for instruction 4 micro-ops a new, fresh register from the pool of internally available registers).
Now let us go back to movzx vs mov.
Movzx clears a register entirely, so the CPU for sure knows that we do not depend on any previous value that remained in higher bits of the register. When the CPU sees the movzx instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with previous instructions. Now take the first 4 instructions from our example #2, where we use mov rather than movzx:
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
In this case, instruction 4, by modifying cl, modifies bits 0-7 of ecx, leaving bits 8-31 unchanged. Thus the CPU cannot just rename the register for instruction 4 and allocate another, fresh register, because instruction 4 depends on bits 8-31 left from previous instructions. The CPU has to preserve bits 8-31 before it can execute instruction 4. It will wait until instruction 3 completes before executing instruction 4. Instruction 4 didn't become fully independent: it depends on the previous value of ECX and the previous value of bl, so it depends on two registers at once. If we had used movzx, it would have depended on just one register, bl. Consequently, instructions 3 and 4 would not run in parallel because of their interdependence. Sad but true.
That's why it is always faster to operate on complete registers. Suppose we need only to modify a part of the register. In that case, it's always quicker to alter the entire register (for example, use movzx) to let the CPU know for sure that the register no longer depends on its previous value. Modifying complete registers allows the CPU to rename the register and let the Out-of-order execution algorithm execute this instruction together with the other instructions, rather than execute them one-by-one.
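For readers who think in C++ rather than assembly, the movzx / shr / xor triplets above correspond roughly to the loop below. This is only a sketch of the access pattern: slice_step and the 4 x 256 table layout are assumptions made for illustration, not the author's actual CRC-32 code.

```cpp
#include <cassert>
#include <cstdint>

// One 32-bit step in the style of the triplets above: peel off the low byte
// as a full-width index (movzx ecx, bl), shift it out (shr ebx, 8), and XOR
// in a table entry. `table` is a hypothetical lookup table from the caller.
std::uint32_t slice_step(std::uint32_t word, const std::uint32_t table[4][256]) {
    std::uint32_t acc = 0;
    for (int k = 3; k >= 0; --k) {
        std::uint32_t index = word & 0xFFu;  // zero-extended byte, no stale upper bits
        word >>= 8;
        acc ^= table[k][index];
    }
    return acc;
}
```

Writing the index as a fresh full-width value is what typically lets a compiler choose movzx rather than a partial-register mov.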
The movzx instruction zero extends a quantity into a register of larger size. In your case, a word (two bytes) is zero extended into a dword (four bytes). Zero extending itself is usually free, the slow part is loading the memory operand WORD PTR [rsi-2] from RAM.
To speed this up, you can try to ensure that the datum you want to fetch from RAM is in the L1 cache at the time you need it. You can do this by placing strategic prefetch intrinsics into an appropriate place. For example, assuming that one cache line is 64 bytes, you could add a prefetch intrinsic to fetch array entry i + 32 every time you go through the loop.
You can also consider an algorithmic improvement such that less data needs to be fetched from memory, but that seems unlikely to be possible.
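A sketch of the prefetch idea, assuming GCC or Clang, where __builtin_prefetch is available (on other compilers the hint is simply compiled out here). The function name isInteriorToPrefetch is made up, and the distance of 32 entries comes from the 64-byte cache line assumption above. Whether it helps at all has to be measured.

```cpp
#include <cassert>
#include <vector>

typedef unsigned short ushort;

// Same test as isInteriorTo, with a prefetch hint 32 entries ahead.
// Assumes point.size() > dim, as the original function does.
bool isInteriorToPrefetch(const std::vector<ushort>& point,
                          const ushort* coord, const ushort dim) {
    for (unsigned i = 0; i < dim; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + 32 < dim) {  // only form addresses that are inside the arrays
            __builtin_prefetch(&point[i + 1 + 32]);
            __builtin_prefetch(&coord[i + 32]);
        }
#endif
        if (point[i + 1] >= coord[i]) return false;
    }
    return true;
}
```

Since one prefetch covers a whole cache line, issuing it on every iteration is redundant. A tuned version would issue it once per line.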
I have a hashtable that stores quadtree entries.
The hashfunction looks like this:
Quadtree hash
#define node_hash(a,b,c,d) \
(((int)(d))+3*(((int)(c))+3*(((int)(b))+3*((int)(a))+3)))
Note that the result of this operation is always chunked down using a modulus prime number like so:
h = node_hash(p->nw, p->ne, p->sw, p->se) ;
h %= hashprime ;
...
Comparison with optimal hash
Some statistical analysis shows that this hash is optimum in terms of collision reduction.
Given a hashtable with b buckets and n entries. The collision risk using a perfect hash is:
(n - b * (1 - power((b-1)/b, n))) * 100 / n
When n = b this means a collision risk of 37%.
Some testing shows that the above hash lines up very nicely with the norm (for all fill levels of the hashtable).
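The collision formula above is easy to evaluate directly. A small C++ sketch (the function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Expected percentage of entries that collide when n entries are hashed
// uniformly into b buckets: (n - b * (1 - ((b-1)/b)^n)) * 100 / n.
double expected_collision_percent(double n, double b) {
    double occupied_buckets = b * (1.0 - std::pow((b - 1.0) / b, n));
    return (n - occupied_buckets) * 100.0 / n;
}
```

For n = b the result approaches 100/e, which is where the 37% figure comes from.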
Running time
The runtime is heavily dependent on the value of hashprime
Timings (best out of 1000 runs) are:
hashprime CPU-cycles per run
--------------------------------
4049 56
16217 68
64871 127 <-- whoooh
Is there a way to improve on this, whilst still retaining the optimum collision risk?
Either by optimizing the modulus operation (replacing it with a multiplication using 'magic' numbers computed outside the loop),
or by replacing the hash function with some other hash function.
Background
The following assembly is produced:
//--------h = node_hash(p->nw, p->ne, p->sw, p->se) ;
mov eax,[rcx+node.nw] <<+
lea eax,[eax+eax*2+3] |
add eax,[rcx+node.ne] |
lea eax,[eax+eax*2] +- takes +/- 12 cycles
add eax,[rcx+node.sw] |
lea eax,[eax+eax*2] |
add eax,[rcx+node.se] <<+
//--------h %= hashprime ;
mov esi,[hashprime]
xor edx,edx
div esi
mov rax,rdx <<--- takes all the rest
[EDIT]
I may be able to do something with the fact that:
C = A % B is equivalent to C = A - B * (A / B)
Using the fact that integer division is the same as multiplication by its reciprocal.
Thus converting the formula to C = A - B * (A * rB)
Note that for integer division the reciprocals are magic numbers, see: http://www.hackersdelight.org/magic.htm
C code is here: http://web.archive.org/web/20070713211039/http://hackersdelight.org/HDcode/magic.c
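To illustrate C = A - B * (A * rB), here is a deliberately simple C++ sketch. It is not the Hacker's Delight magic-number algorithm. It uses a truncated 32-bit fixed-point reciprocal, so the quotient estimate can be one too small, and a single conditional subtraction corrects that. The names FixedDivisor, make_fixed_divisor and fixed_mod are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct FixedDivisor {
    std::uint32_t d;           // the divisor, e.g. hashprime (must be nonzero)
    std::uint64_t reciprocal;  // floor(2^32 / d), computed once
};

FixedDivisor make_fixed_divisor(std::uint32_t d) {
    return FixedDivisor{d, (std::uint64_t{1} << 32) / d};
}

std::uint32_t fixed_mod(std::uint32_t a, const FixedDivisor& fd) {
    std::uint64_t q = (a * fd.reciprocal) >> 32;  // estimate of a / d, exact or one too small
    std::uint64_t r = a - q * fd.d;               // 0 <= r < 2*d
    if (r >= fd.d) r -= fd.d;                     // correct the possible off-by-one
    return static_cast<std::uint32_t>(r);
}
```

The one-time division in make_fixed_divisor only pays off when the same divisor is reused for many hashes.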
[FNV hashes]
See: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-1a
hash = offset_basis
for each byte to be hashed
hash = hash xor octet_of_data
hash = hash * FNV_prime (for 32 bits = 16777619)
return hash
For 4 pointers truncated to 32 bits = 16 bytes the FNV hash takes 27 cycles (hand crafted assembly)
Unfortunately this leads to hash collisions of 81% where it should be 37%.
Running the full 15 multiplications takes 179 cycles.
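The FNV-1a pseudocode above translates to C++ almost line for line. This sketch uses the standard 32-bit offset basis 2166136261 and the prime 16777619. It is a plain byte-at-a-time version, not the hand-crafted assembly that was timed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 32-bit FNV-1a over an arbitrary byte buffer.
std::uint32_t fnv1a_32(const void* data, std::size_t length) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::uint32_t hash = 2166136261u;  // 32-bit offset_basis
    for (std::size_t i = 0; i < length; ++i) {
        hash ^= bytes[i];              // hash = hash xor octet_of_data
        hash *= 16777619u;             // hash = hash * FNV_prime (wraps modulo 2^32)
    }
    return hash;
}
```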
Replacing modulus by reciprocal multiplication
The main cycle eater in this hash function is the modulus operator.
If you replace this division with a multiplication by the reciprocal the calculation is much faster.
Note that calculating the reciprocal involves 3 divides, so this should only be done when the reciprocal can be reused enough times.
OK, here's the code used: http://www.agner.org/optimize/asmlib.zip
From: http://www.agner.org/optimize/
// ;************************* divfixedi64.asm *********************************
// ; Author: Agner Fog
//extern "C" void setdivisoru32(uint Buffer[2], uint d)
asm
mov r8d, edx // x
mov r9, rcx // Buffer
dec r8d // r8d = r8d or esi
mov ecx, -1 // value for bsr if r8d = 0
bsr ecx, r8d // floor(log2(d-1))
inc r8d
inc ecx // L = ceil(log2(d))
mov edx, 1
shl rdx, cl // 2^L (64 bit shift because cl may be 32)
sub edx, r8d
xor eax, eax
div r8d
inc eax
mov [r9], eax // multiplier
sub ecx, 1
setae dl
movzx edx, dl // shift1
seta al
neg al
and al,cl
movzx eax, al // shift 2
shl eax, 8
or eax, edx
mov [r9+4], eax // shift 1 and shift 2
ret
end;
and the code for the modulus operation:
//extern "C" uint modFixedU32(uint Buffer[2], uint d)
asm
mov eax, edx
mov r10d, edx // x
mov r11d, edx // save x
mul dword [rcx] // Buffer (i.e.: m')
sub r10d, edx // x-t
mov ecx, [rcx+4] // shift 1 and shift 2
shr r10d, cl
lea eax, [r10+rdx]
mov cl, ch
shr eax, cl
// Result:= x - m * fastDiv32.dividefixedu32(Buffer, x);
mul r8d // m * ...
sub r11d, eax // x - (m * ...)
mov eax,r11d
ret
end;
The difference in time is as follows:
hashprime     classic hash (mod)   new hash    new          old
(# of runs)   cycles/run           per run     (no cache)   (no cache)
--------------------------------------------------------------------
4049          56                   21          16.6         51
16217         68                   not measured
64871         127                  89          16.5         50
Cache issues
The increase in cycle time is caused by the data overflowing the cache, causing main memory to be accessed.
This can be seen clearly when I remove cache effects by hashing the same value over and over.
Something like this might be useful:
static inline unsigned int hash4(unsigned int a, unsigned int b,
unsigned int c, unsigned int d) {
unsigned long long foo = 123456789*(long long)a ^ 243956871*(long long)b
^ 918273645*(long long)c ^ 347562981*(long long)d;
return (unsigned int)(foo >> 32);
}
Replace the four odd numbers I typed in with randomly generated 64-bit odd numbers; the ones above won't work that great. (64-bit so that the high 32 bits are somehow a random mix of the lower bits.) This is about as fast as the code you gave, but it lets you use power-of-two table sizes instead of prime table sizes without fear.
The thing everyone uses for similar workloads is the FNV hash. I'm not sure whether FNV actually has better properties than hashes of the type above, but it's similarly fast and it's in rather widespread use.
Assuming hashprime is a constant, you could implement the modulo-operation as bitwise masks. I'm not sure about the details, but maybe this answer can push you in the right direction.
I'm writing a program which performs millions of modular additions. For more efficiency, I started thinking about how machine-level instructions can be used to implement modular additions.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>
int main()
{
unsigned int x, y, z, i;
clock_t t1, t2;
x = y = 0x90000000;
t1 = clock();
for(i = 0; i <20000000 ; i++)
z = (x + y) % 0x100000000ULL;
t2 = clock();
printf("%x\n", z);
printf("%u\n", (int)(t2-t1));
return 0;
}
Compiling using GCC with the following options (I used -O0 to prevent GCC from unfolding the loop):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions,
I expected the performance gain by proper choice of the modulus to be at least 100%. However, running the output, I noted that the speed is increased by only 10%. I suspected the OS is using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to 1 didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second code is around 12 times slower than the first example!
Art nailed it.
For the power-of-2 modulus, the code for the computation generated with -O0 and -O3 is identical, the difference is the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
Running times for both moduli are similar with unoptimised code. About the same as the running time without any modulus. About the same as the running time of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop control code
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register (here) and decremented, and the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time; removing the computation from the generated assembly now reduces the running time by a factor of 3 and 2 respectively.
So when compiled with -O0, you're not measuring the speed of the computation, but the speed of the loop control code, thus the small difference. With optimisations, you are measuring both, computation and loop control, and the difference of speed in the computation shows clearly.
The difference between the two boils down to the fact that divisions by powers of 2 can easily be transformed into logic instructions.
a/n where n is a power of two is equivalent to a >> log2 n
For the modulo it's the same:
a mod n can be rendered as a & (n-1)
But in your case it goes even further than that:
Your value 0x100000000ULL is 2^32. This means that any unsigned 32-bit variable is automatically a value modulo 2^32.
The compiler was smart enough to remove the operation because it is unnecessary on 32-bit variables. The ULL specifier doesn't change that fact.
For the value 0x0F0000000, which fits in a 32-bit variable, the compiler cannot elide the operation. It uses a transformation that seems faster than a division operation.
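The three cases can be sketched in C++ as follows. The function names are hypothetical. reduce_once is my reading of the compare-and-subtract sequence in the second listing, and it is only valid when the modulus is larger than 2^31, so that a 32-bit value can exceed it at most once.

```cpp
#include <cassert>
#include <cstdint>

// a mod n for a power-of-two n: keep the low bits.
std::uint32_t mod_pow2(std::uint32_t a, std::uint32_t n) {
    return a & (n - 1);
}

// Addition modulo 2^32: unsigned overflow simply discards the carry.
std::uint32_t add_wrapping(std::uint32_t x, std::uint32_t y) {
    return x + y;
}

// s mod m for any 32-bit s when m > 2^31: at most one subtraction is needed.
std::uint32_t reduce_once(std::uint32_t s, std::uint32_t m) {
    return s >= m ? s - m : s;
}
```

With the question's inputs, add_wrapping(0x90000000, 0x90000000) is 0x20000000, and reducing that by 0xF0000000 leaves it unchanged, so both versions of the program print the same value.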