How to implement a synthetic benchmark? - c++

I'm doing my first programs in C++ and assembly code. I already know how to program in C++, but I have a lot of problems when I try to program in assembly code.
I want to do a synthetic benchmark, which
"is designed to mimic a particular type of workload on a component or system. Synthetic benchmarks do this by specially created programs that impose the workload on the component." (Wikipedia)
For example, if I want to calculate the factorial of long fact = pow(3.0, 2000), how can I measure the performance of a component in C++? (And not the performance of the whole system).
The rest of the code (the calculation of fact) is done in assembly code.

Following is part of one of my benchmarks (for Linux) that uses assembly code, where repetitive calculations can be carried out without fear of over optimisation. You need to use a high resolution timer with an assembly based loop long enough for reasonable execution time. You might want to repeat calculations in the loop to fill up pipelines.
This one repeats 20M adds 10 times to find maximum speed. IntCount1 value is checked at the end as a simple integrity check
C Code
intCount1 = 0;
max = 0;
for (i=0; i<10; i++)
{
count = intCount1;
start_time();
_mips1Reg();
end_time();
count = intCount1 - count;
mips = (int)((double)count / 1000000.0 / secs + 0.5);
if(mips > max) max = mips;
}
mipsReg[0] = max;
printf(" 1 Register %7d 32 Bit Integer MIPS\n", mipsReg[0]);
########################################################
Hi-Res Timer Used
clock_gettime(CLOCK_REALTIME, &tp1);
theseSecs = tp1.tv_sec + tp1.tv_nsec / 1e9;
########################################################
Assembly Code
global _mips1Reg
_mips1Reg:
push eax
push ebx
push ecx
push edx
push edi
mov edi, 1000000
mov eax, [intCount1]
align 8
dlp:add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 1
add eax, 3
dec edi
jnz dlp
mov [intCount1], eax
pop edi
pop edx
pop ecx
pop ebx
pop eax
ret

Related

Performance differ significantly when inner loop iterates less times?

I wrote a toy program that compares the performance of two very similar functions. The entire file (minus a couple of macros) looks like this:
constexpr int iterations = 100000;
using u64 = uint64_t;
// Slow.
void test1()
{
u64 sum = 0;
for (int i = 0; i < iterations; i++) {
for (int j = 0; j < 4; j++)
sum += i + j;
doNotOptimize(sum);
}
}
// Fast.
void test2()
{
u64 sum = 0;
for (int i = 0; i < iterations; i++) {
for (int j = 0; j < 10; j++)
sum += i + j;
doNotOptimize(sum);
}
}
void driver()
{
int runs = 1000;
BENCH(runs, test1);
BENCH(runs, test2);
}
I am measuring 1000 executions of each function using __rdtsc and computing the average. With this formula, I am seeing a performance difference of ~172,000 (ticks?) between test1 and test2. What's surprising is that test2 is the one that's faster.
An exotic little detail is that the only magic numbers for which test1 is slower are 4, 8, and 16. If I change the internal loop's condition to j < x where x is anything but those 3 numbers, performance match up.
In the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations performed as operands of lea. So in this case, it would make sense if both functions were equally fast. But that's not at all what's happening. Here's the disassembly and the program source in its entirety: https://godbolt.org/z/d5PsG4YeY
So what's really going on? Is something wrong with my measurements?
Execution environment:
Processor: Intel(R) Core(TM) i5-7200U CPU # 2.50GHz (Kaby Lake)
L1 Cache: 128Kib
OS: Linux (64-bit)
Toolchain: GCC Version 10.3.0
Compiler Options: -O3 -fno-tree-vectorize
4 and 8 are scale factors that x86 addressing modes support, tempting GCC into using a slow-LEA on the critical path dependency chain when adding 4*i or 8*i to the sum along with the constant sum of 1..4 or 1..8 (either way just a constant, irrelevant what it is). Apparently also as part of the dep chain for 16. And you used inline asm to force the sum dep chain to include a store/reload.
Analyzing the assembly, I am observing that the inner loops in both functions are eliminated and replaced by a few arithmetic operations done as operands of lea. So in this case, it would make sense if both functions ran at the same speed.
They're both fast, but the different multiples of i take different instructions. So there's little reason to assume they'd be the same. The interesting thing here is the one with more total instructions is faster, because it has a shorter dependency chain through sum.
And you forced a store/reload of it with the somewhat-clunky asm volatile("" :: "g"(&sum) : "memory"), instead of just requiring the compiler to have the value in a register with asm volatile("" : "+r"(sum)). So that dep chain includes store-forwarding latency (typically 3 to 5 cycles) so it's a bigger bottleneck than front-end or ALU throughput of the independent work.
test1():
xor eax, eax # i = 0
xor ecx, ecx # sum = 0
lea rsi, [rsp-8] # operand for inline asm
jmp .L3
.L5:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx # i = i+1
.L3:
lea rdx, [rax+1] # i+1, leaving the old value in RAX
lea rax, [rcx+6+rax*4] # sum += 4*i + 6. (1+2+3)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L5
ret
An LEA with 3 components (two + operations) in the addressing mode has 3 cycle latency on your Skylake-derived CPU. See x86 Multiplication with 3: IMUL vs SHL + ADD
So the loop-carried dependency chain is a slow-LEA (3 cycles) between load/store.
test2():
xor eax, eax
xor ecx, ecx
lea rdi, [rsp-8] # operand for inline asm
jmp .L8
.L9:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx
.L8:
lea rsi, [rax+36+rax*8]
lea rdx, [rax+1]
lea rax, [rax+9+rsi] # prepare some stuff to be added
add rax, rcx # first use of the RCX load result (sum)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L9
ret
So the loop-carried dep chain through the store/reload only includes an add, 1 cycle latency instead of 3.
I assume your performance ratios were something like 3:4 or so, from the 5+1 cycles (6) vs. 5+3 (8) cycles.
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details.
The compiler could have spent more instructions in test1 to reduce the critical path latency to 1 cycle there, instead of folding more work into an lea on the critical path. For a loop running 100k iterations, this is pretty much a missed optimization. But I assume the optimizer isn't tuned for artificially-induced spill/reload from inline asm; normally it would only have to introduce that store/reload if it ran out of registers from doing a lot of stuff in a loop, and that many different values would usually mean there was some instruction-level parallelism.
GCC makes better asm from simpler source on Quick-bench
#TedLyngmo linked the code on Quick-bench without -fno-tree-vectorize, and using benchmark::DoNotOptimize(sum); which only forces GCC to materialize the value in a register, without blocking constant-propagation through it, or as many other optimizations. Unlike taking its address and telling GCC that was potentially globally visible, like the current custom asm.
The inner loop body is just add %rdx,%rax / add $0x4,%rdx (and cmp rdx + jne as the loop branch), if you look at the asm on Quickbench. Or with rdx+=10 for the other loop. So same loops, different constants. Same performance, of course.
The current source is essentially compiling to asm that does
for (int i = 0 ; i<iterations ; i++){
sum += 8*i + 1234;
force_store_reload(&sum);
}
But if you actually write it that way (https://godbolt.org/z/4ja38or9j), we get asm like quick-bench, except for keeping the value in memory instead of a register. (So about 6x slower.)
.L6:
add QWORD PTR [rsp-8], rax # memory destination add
add rax, 4
cmp rax, 401234
jne .L6
It seems to be a missed-optimization bug that GCC doesn't compile your existing source to that. Specifically, missing the strength-reduction from 8*i re-evaluated for every i into tmp += 8.
BTW, it looks like omitting -fno-tree-vectorize makes your original test1() compile even worse. It starts without jumping into the middle of the loop, but it has a longer dep chain.
#gcc 10.3 -O3 (without -fno-tree-vectorize)
test1_orig():
mov QWORD PTR [rsp-8], 6 # init sum
lea rsi, [rsp-8] # operand for inline asm
mov eax, 1
.L2:
mov rdx, QWORD PTR [rsp-8] # load sum
mov rcx, rax
add rdx, rax # used: 1 cycle latency
add rax, 1
add rdx, rax # used: 1 cycle latency
lea rdx, [rdx+5+rcx*2] # used: 3 cycle latency
mov QWORD PTR [rsp-8], rdx # store
cmp rax, 100000
jne .L2
ret

Compiler generates costly MOVZX instruction

My profiler has identified the following function profiling as the hotspot.
typedef unsigned short ushort;
bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
for( unsigned i = 0; i < dim; ++i )
{
if( point[i + 1] >= coord[i] ) return false;
}
return true;
}
In particular one assembly instruction MOVZX (Move with Zero-Extend) is responsible for the bulk of the runtime. The if statement is compiled into
mov rcx, QWORD PTR [rdi]
lea r8d, [rax+1]
add rsi, 2
movzx r9d, WORD PTR [rsi-2]
mov rax, r8
cmp WORD PTR [rcx+r8*2], r9w
jae .L5
I'd like to coax the compiler out of generating this instruction but I suppose I first need to understand why this instruction is generated. Why the widening/zero extension, considering that I'm working with the same data type?
(Find the entire function on godbolt compiler explorer.)
Thank you for the good question!
Clearing Registers and Dependency Breaking Idioms
A Quote from the Intel® 64 and IA-32 Architectures
Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
movzx vs mov
The compiler knows that movzx is not costly and uses it as often as possible. It may take more bytes to encode movzx than mov, but it is not expensive to execute.
Contrary to the logic, a program with movzx (that fills the entire registers) actually works faster than with just mov, which only sets lower parts of the registers.
Let me demonstrate this conclusion to you on the following code fragment. It is part of the code that implements CRC-32 calculation using the Slicing by-N algorithm. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
skipped 6 more similar triplets that do movzx, shr, xor.
dec <<<a counter register >>>>
jnz …… <<repeat the whole loop again>>>
Here is the second code fragment. We have cleared ecx in advance, and now just instead of “movzx ecx, bl” do “mov cl, bl”:
// ecx is already cleared here to 0
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
<<< and so on – as in the example #1>>>
Now guess which of the two above code fragments runs faster? Did you think previously that the speed is the same, or the movzx version is slower? In fact, the movzx code is faster because all the CPUs since Pentium Pro do Out-Of-Order execution of instructions and register renaming.
Register Renaming
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
Let me just take the first 4 instructions from the first code fragment:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
As you see, instruction 4 depends on instruction 2. Instruction 4 does not rely on the result of instruction 3.
So the CPU could execute instructions 3 and 4 in parallel (together), but instruction 3 uses the register (read-only) modified by instruction 4, thus instruction 4 may only start executing after instruction 3 fully completes. Let us then rename the register ecx to edx after the first triplet to avoid this dependency:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
shr ebx, 8
xor eax, dword ptr [edx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
Here is what we have now:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
Now instruction 4 in no way uses any register needed for instruction 3, and vice versa, so instructions 3 and 4 can execute simultaneously for sure!
This is what the CPU does for us. The CPU, when translating instructions to micro-operations (micro-ops) which the Out-of-order algorithm will execute, renames the registers internally to eliminate these dependencies, so the micro-ops deal with renamed, internal registers, rather than with the real ones as we know them. Thus we don't need to rename registers ourselves as I have just renamed in the above example – the CPU will automatically rename everything for us while translating instructions to micro-ops.
The micro-ops of instruction 3 and instruction 4 will be executed in parallel, since micro-ops of instruction 4 will deal with entirely different internal register (exposed to outside as ecx) than micro-ops of instruction 3, so we don't need to rename anything.
Let me revert the code to the initial version. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
(instructions 3 and 4 run in parallel because ecx of instruction 3 is not that ecx as of instruction 4, but a different, renamed register – the CPU has automatically allocated for instruction 4 micro-ops a new, fresh register from the pool of internally available registers).
Now let us go back to movxz vs mov.
Movzx clears a register entirely, so the CPU for sure knows that we do not depend on any previous value that remained in higher bits of the register. When the CPU sees the movxz instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with previous instructions. Now take the first 4 instructions from our example #2, where we use mov rather than movzx:
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
In this case, instruction 4, by modifying cl, modifies bits 0-7 of the ecx, leaving bits 8-32 unchanged. Thus the CPU cannot just rename the register for instruction 4 and allocate another, fresh register, because instruction 4 depends on bits 8-32 left from previous instructions. The CPU has to preserve bits 8-32 before it can execute instruction 4. Thus it cannot just rename the register. It will wait until instruction 3 completes before executing instruction 4. Instruction 4 didn't become fully independent - it depends on the previous value of ECX and the previous value of bl. So it depends on two registers at once. If we had used movzx, it would have depended on just one register - bl. Consequently, instructions 3 and 4 would not run in parallel because of their interdependence. Sad but true.
That's why it is always faster to operate complete registers. Suppose we need only to modify a part of the register. In that case, it's always quicker to alter the entire register (for example, use movzx) – to let the CPU know for sure that the register no longer depends on its previous value. Modifying complete registers allows the CPU to rename the register and let the Out-of-order execution algorithm execute this instruction together with the other instructions, rather than execute them one-by-one.
The movzx instruction zero extends a quantity into a register of larger size. In your case, a word (two bytes) is zero extended into a dword (four bytes). Zero extending itself is usually free, the slow part is loading the memory operand WORD PTR [rsi-2] from RAM.
To speed this up, you can try to ensure that the datum you want to fetch from RAM is in the L1 cache at the time you need it. You can do this by placing strategic prefetch intrinsics into an appropriate place. For example, assuming that one cache line is 64 bytes, you could add a prefetch intrinsic to fetch array entry i + 32 every time you go through the loop.
You can also consider an algorithmic improvement such that less data needs to be fetched from memory, but that seems unlikely to be possible.

Is shifting or multiplying faster and why?

What generally is a faster solution, multiplying or bit shifting?
If I want to multiply by 10000, which code would be faster?
v = (v<<13) + (v<<11) + (v<<4) - (v<<8);
or
v = 10000*v;
And the second part of the question - How to find the lowest number of shifts required to do some multiplication? (I'm intereseted in multiplying by 10000, 1000 and 100).
It really depends on the architecture of the processor, as well as the compiler that you're using.
But you can simply view the dis-assembly of each option, and see for yourself.
Here is what I got using Visual-Studio 2010 compiler for Pentium:
int v2 = (v<<13) + (v<<11) + (v<<4) - (v<<8);
mov eax,dword ptr [v]
shl eax,0Dh
mov ecx,dword ptr [v]
shl ecx,0Bh
add eax,ecx
mov edx,dword ptr [v]
shl edx,4
add eax,edx
mov ecx,dword ptr [v]
shl ecx,8
sub eax,ecx
mov dword ptr [v2],eax
int v2 = 10000*v;
mov eax,dword ptr [v]
imul eax,eax,2710h
mov dword ptr [v2],eax
So it appears that the second option is faster in my case.
BTW, you might get a different result if you enable optimization (mine was disabled)...
To the first question: Don't bother. The compiler knows better and will optimize it according to the respective target hardware.
To the second question: Look at the binary representation:
For example: bin(10000) = 0b10011100010000:
1 0 0 1 1 1 0 0 0 1 0 0 0 0
13 12 11 10 9 8 7 6 5 4 3 2 1 0
So you have to shift by 13, 10, 9, 8 and 4. If you want to shortcut consecutive ones (by subtracting as in your question) you need at least three consecutive ones in order to gain anything.
But again, let the compiler do this. It's his job.
there is only one situation in which shift operation are faster than *, and it's defined by two condition:
the operation is with a value power of two
when you multiply with a fractional number -> division.
Let's look a little deeper:
multiplication/division, shift operation are done by units in the HW
architecture; usually you have shifters, multipliers/dividers to
perform these operations but each of the operation is performed by a
different set of registers inside a Arithmeric Locgic Unit.
multiplication/division with a power of two is equivalent to a
left_shift/right_shift operation
if you are not dealing with power of 2 than multiplication and division are performed slightly differently:
Multiplication is performed by the HW ( ALU unit) in a single instrucion (depending on the data type but let's not overcomplicate things)
Division is performed in a loop as consecutive subtractions -> more than one instruction
Summarizing:
multiplication is only one instruction; while replacing
multiplication with a series of shift operations is multiple
instruction -> the first option is faster (even on a parallel
architecture)
multiplication with a power of two is the same as a shift operation; the compiler usually generates a shift when it detects this in the code.
division is multiple instruction; replaving this with a series of shifts might prove faster but it depends on each situation.
division with a power of two is multiple operations and can be replaced with a single right_shift operation; a smart compiler will
do this automatically
An older Microsoft C compiler optimized the shift sequence using lea (load effective address), which allows multiples of 5:
lea eax, DWORD PTR [eax+eax*4] ;eax = v*5
lea ecx, DWORD PTR [eax+eax*4] ;ecx = v*25
lea edx, DWORD PTR [ecx+ecx*4] ;edx = v*125
lea eax, DWORD PTR [edx+edx*4] ;eax = v*625
shl eax, 4 ;eax = v*10000
multiply (signed or unsigned) was still faster on my system with Intel 2600K 3.4ghz. Visual Studio 2005 and 2012 multiplied v*10256, then subtracted (v<<8). Shift and add / subtract sequence was slower than the lea method above:
shl eax,4 ;ecx = v*(16)
mov ecx,eax
shl eax,4 ;ecx = v*(16-256)
sub ecx,eax
shl eax,3 ;ecx = v*(16-256+2048)
add ecx,eax
shl eax,2 ;eax = v*(16-256+2048+8192) = v*(10000)
add eax,ecx

Why access in a for loop is faster than access in a ranged-for in -O0 but not in -O3?

I'm learning performance in C++ (and C++11). And I need to performance in Debug and Release mode because I spend time in debugging and in executing.
I'm surprise with this two tests and how much change with the different compiler flags optimizations.
Test iterator 1:
Optimization 0 (-O0): faster.
Optimization 3 (-O3): slower.
Test iterator 2:
Optimization 0 (-O0): slower.
Optimization 3 (-O3): faster.
P.D.: I use the following clock code.
Test iterator 1:
void test_iterator_1()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
size_t count = v.size();
for (unsigned int i = 0; i < count; ++i) {
v[i] = 1;
}
}
Test iterator 2:
void test_iterator_2()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
for (int& i : v) {
i = 1;
}
}
UPDATE: The problem is still the same, but for ranged-for in -O3 the differences is small. So for loop 1 is the best.
UPDATE 2: Results:
With -O3:
t1: 80 units
t2: 74 units
With -O0:
t1: 287 units
t2: 538 units
UPDATE 3: The CODE!. Compile with: g++ -std=c++11 test.cpp -O0 (and then -O3)
Your first test is actually setting the value of each element in the vector to 1.
Your second test is setting the value of a copy of each element in the vector to 1 (the original vector is the same).
When you optimize, the second loop more than likely is removed entirely as it is basically doing nothing.
If you want the second loop to actually set the value:
for (int& i : v) // notice the &
{
i = 1;
}
Once you make that change, your loops are likely to produce assembly code that is almost identical.
As a side note, if you wanted to initialize the entire vector to a single value, the better way to do it is:
std::vector<int> v(SIZE, 1);
EDIT
The assembly is fairly long (100+ lines), so I won't post it all, but a couple things to note:
Version 1 will store a value for count and increment i, testing for it each time. Version 2 uses iterators (basically the same as std::for_each(b.begin(), v.end() ...)). So the code for the loop maintenance is very different (it is more setup for version 2, but less work each iteration).
Version 1 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
push eax
lea ecx, DWORD PTR _v$[ebp]
call ??A?$vector#HV?$allocator#H#std###std##QAEAAHI#Z ; std::vector<int,std::allocator<int> >::operator[]
mov DWORD PTR [eax], 1
Version 2 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
mov DWORD PTR [eax], 1
When they get optimized, this all changes and (other than the ordering of a few instructions), the output is almost identical.
Version 1 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov ecx, DWORD PTR _v$[ebp+4]
mov edx, DWORD PTR _v$[ebp]
sub ecx, edx
sar ecx, 2 ; this is the only differing instruction
test ecx, ecx
je SHORT $LN3#test_itera
push edi
mov eax, 1
mov edi, edx
rep stosd
pop edi
$LN3#test_itera:
test edx, edx
je SHORT $LN21#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN21#test_itera:
mov esp, ebp
pop ebp
ret 0
Version 2 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov edx, DWORD PTR _v$[ebp]
mov ecx, DWORD PTR _v$[ebp+4]
mov eax, edx
cmp edx, ecx
je SHORT $LN1#test_itera
$LL33#test_itera:
mov DWORD PTR [eax], 1
add eax, 4
cmp eax, ecx
jne SHORT $LL33#test_itera
$LN1#test_itera:
test edx, edx
je SHORT $LN47#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN47#test_itera:
mov esp, ebp
pop ebp
ret 0
Do not worry about how much time each operation takes, that falls squarely under the premature optimization is the root of all evil quote by Donald Knuth. Write easy to understand, simple programs, your time while writing the program (and reading it next week to tweak it, or to find out why the &%$# it is giving crazy results) is much more valuable than any computer time wasted. Just compare your weekly income to the price of an off-the-shelf machine, and think how much of your time is required to shave off a few minutes of compute time.
Do worry when you have measurements showing that the performance isn't adequate. Then you must measure where your runtime (or memory, or whatever else resource is critical) is spent, and see how to make that better. The (sadly out of print) book "Writing Efficient Programs" by Jon Bentley (much of it also appears in his "Programming Pearls") is an eye-opener, and a must read for any budding programmer.
Optimization is pattern matching: The compiler has a number of different situations it can recognize and optimize. If you change the code in a way that makes the pattern unrecognizable to the compiler, suddenly the effect of your optimization vanishes.
So, what you are witnessing is nothing more or less than that the ranged for loop produces more bloated code without optimization, but that in this form the optimizer is able to recognize a pattern that it cannot recognize for the iterator-free case.
In any case, if you are curious, you should take a look at the produced assembler code (compile with -S option).

Performance optimization with modular addition when the modulus is of special form

I'm writing a program which performs millions of modular additions. For more efficiency, I started thinking about how machine-level instructions can be used to implement modular additions.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>
int main()
{
unsigned int x, y, z, i;
clock_t t1, t2;
x = y = 0x90000000;
t1 = clock();
for(i = 0; i <20000000 ; i++)
z = (x + y) % 0x100000000ULL;
t2 = clock();
printf("%x\n", z);
printf("%u\n", (int)(t2-t1));
return 0;
}
Compiling using GCC with the following options (I used -O0 to prevent GCC from unfolding the loop):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions,
I expected the performance gain by proper choice of the modulus to be at least 100%. However, running the output, I noted that the speed is increased by only 10%. I suspected the OS is using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to 1 didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second code is around 12 times slower than the first example!
Art nailed it.
For the power-of-2 modulus, the code for the computation generated with -O0 and -O3 is identical, the difference is the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
Running times for both moduli are similar with unoptimised code. About the same as the running time without any modulus. About the same as the running time of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop control code
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register (here) and decremented, then the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time, removing the computation from the generated assembly now reduces the running time by a factor of 3 resp. 2.
So when compiled with -O0, you're not measuring the speed of the computation, but the speed of the loop control code, thus the small difference. With optimisations, you are measuring both, computation and loop control, and the difference of speed in the computation shows clearly.
The difference between the two boils down to the fact that divisions by powers of 2 can be transformed easily in logic instruction.
a/n where n is power of two is equivalent to a >> log2 n
for the modulo it's the same
a mod n can be rendered by a & (n-1)
But in your case it goes even further than that:
your value 0x100000000ULL is 2^32. This means that any unsigned 32bit variable will automatically be a modulo 2^32 value.
The compiler was smart enough to remove the operation because it is an unnecessary operation on 32 bit variables. The ULL specifier
doesn't change that fact.
For the value 0x0F0000000 which fits in a 32 bit variable, the compiler can not elide the operation. It uses a transformation that
seems faster than a division operation.