I'm writing mathematical code containing a complicated arithmetic expression that is used with varying types of inputs, and for the purpose of hyper-optimization AND readability I want to express it as a macro rather than a function. However, I'm having trouble getting g++ (4.9.2) to optimize it in a way I would think is obvious, and I need help.
The arithmetic operation of interest appears in only one location, but the rub is that there are numerous permutations of the input variables. That is, each equation variable can be either a singleton or an array. To conserve memory and CPU cache, I want to code for each variation (i.e. array and singleton input). Another element of the problem is that some terms of the equation drop out under certain conditions: a variable might not just be a singleton, but known a priori to be exactly 0 for certain instances, making terms of the equation trivial.
Consider the following simplified, illustrative example:
float a[128],b[128],c[128],d[128],e[128],r[128];
float a0, b0, c0, d0, e0;
...
#define ITERATE for (int n=0; n<N; n++)
#define THE_EQ(A,B,C,D,E,F) r[n] = A*(B-C) + D*E + F
...
switch(condition) {
case 1: ITERATE THE_EQ(a[n],b[n],c[n],d[n],e[n],0); break;
case 2: ITERATE THE_EQ( a0 ,b[n],c[n],d[n],e[n],0); break;
case 3: ITERATE THE_EQ( 0 ,b[n],c[n],d[n],e[n],0); break;
case 4: ITERATE THE_EQ(a[n],b[n],c[n],d[n],e[n],0); break;
case 5: ITERATE THE_EQ( a0 ,b[n],c[n], d0 ,e[n],0); break;
case 6: ITERATE THE_EQ( 0 ,b[n],c[n], 0 ,e[n],1); break;
....
} // switch
The real code has a more complex equation/macro and many more permutations of input variables, but the above code is my test of the g++ compiler/optimizer. What I'm finding is that g++ is not optimizing as I had hoped: it is not dropping, when appropriate, the trivial arithmetic operations.
For example, in case 3 I expect the final machine code to be free of any b[n] - c[n] related operations. And for case 6, I expect the elements of array r[n] to simply be set to 1, with no arithmetic operations at all. Unfortunately, neither is the case. :(
The optimizations that do occur are AVX vectorization and the use of VXORPS to generate the zero value. But with that optimally generated 0, g++ still goes through with the trivial 0*(b[n]-c[n]) and 0*e[n] operations.
Witness:
; case 6: ITERATE THE_EQ(0,b[n],c[n],0,e[n],1); break;
0x400778 vxorps xmm1,xmm1,xmm1 ; generate 0
...
0x4007a0 vmovaps ymm0,YMMWORD PTR [rsi+rax*1]
0x4007a5 vmulps ymm2,ymm1,YMMWORD PTR [rdx+rax*1] ; 0*e[n]
0x4007aa vsubps ymm0,ymm0,YMMWORD PTR [rcx+rax*1] ; b[n]-c[n]
0x4007af vmulps ymm0,ymm0,ymm1 ; (b[n]-c[n])*0
0x4007b3 vaddps ymm0,ymm0,ymm2 ; 0*(b[n]-c[n])+0*e[n]
0x4007b7 vaddps ymm0,ymm0,ymm3 ; ....+1
0x4007bb vmovaps YMMWORD PTR [rdi+rax*1],ymm0
0x4007c0 add rax,0x20
0x4007c4 cmp rax,0x200
0x4007ca jne 0x4007a0 <main()+128>
Obviously, this is NOT optimal. Is there a technique (via pragma or compile flag) that will encourage g++ to drop those trivial arithmetic operations? Or is there another method, other than hardcoding 60+ different expressions, to get to my desired level of optimization?
The compile command is:
g++-4.9 -Wall -fexceptions -O3 -g -march=corei7-avx main.cpp
with g++ 4.9.2
/* Maybe fold x * 0 to 0. The expressions aren't the same
when x is NaN, since x * 0 is also NaN. Nor are they the
same in modes with signed zeros, since multiplying a
negative value by 0 gives -0, not +0. */
This comment from GCC's own source shows that the optimization is guarded by a few conditions. -ffast-math is the easiest way to convince gcc that you do not care about those cases.
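With your command line, that would be, for example:
g++-4.9 -Wall -fexceptions -O3 -ffast-math -march=corei7-avx main.cpp
If the NaN and signed-zero concerns in the quoted comment are the only blockers, the narrower pair -ffinite-math-only -fno-signed-zeros should (to my understanding; I haven't checked every GCC version) be enough to let x*0 fold to 0 without the rest of -ffast-math.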
Intro
Kahan summation / compensated summation is a technique that addresses computers' inability to respect the associative property of real numbers. Truncation errors result in (a+b)+c not being exactly equal to a+(b+c), and thus an undesired relative error accumulates over longer series of sums, which is a common obstacle in scientific computing.
Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.
Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.
Standard summation: Naive implementation which accumulates a root mean square relative error that grows as O(sqrt(N))
Kahan summation [g++]: Compensated summation using the c/c++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below).
Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
extern "C" void csumasm(double&, double, double&);
__asm__(
"csumasm:\n"
"movsd (%rcx), %xmm0\n" //xmm0 = a
"subsd (%r8), %xmm1\n" //xmm1 - r8 (c) | y = b-c
"movapd %xmm0, %xmm2\n"
"addsd %xmm1, %xmm2\n" //xmm2 + xmm1 (y) | b = a+y
"movapd %xmm2, %xmm3\n"
"subsd %xmm0, %xmm3\n" //xmm3 - xmm0 (a) | b - a
"movapd %xmm3, %xmm0\n"
"subsd %xmm1, %xmm0\n" //xmm0 - xmm1 (y) | - y
"movsd %xmm0, (%r8)\n" //xmm0 to c
"movsd %xmm2, (%rcx)\n" //b to a
"ret\n"
);
void csum(double &a,double b,double &c) { //this function adds a and b, and passes c as a compensation term
double y = b-c; //y is the correction of b argument
b = a+y; //add corrected b argument to a argument. The output of the current summation
c = (b-a)-y; //find new error to be passed as a compensation term
a = b;
}
double fun(double fMin, double fMax){
double f = (double)rand()/RAND_MAX;
return fMin + f*(fMax - fMin); //returns random value
}
int main(int argc, char** argv) {
int N = 1000;
srand(0); //use 0 seed for each method
double sum1 = 0;
for (int n = 0; n < N; ++n)
sum1 += fun(0,1);
srand(0);
double sum2 = 0;
double c = 0; //compensation term
for (int n = 0; n < N; ++n)
csum(sum2,fun(0,1),c);
srand(0);
double sum3 = 0;
c = 0;
for (int n = 0; n < N; ++n)
csumasm(sum3,fun(0,1),c);
printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
return 0;
}
The output with -O3 is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902127e+002 (error: 0.0000000000000000e+000)
Kahan compensated summation [asm]:
5.1991955320902127e+002
The output with -O3 -ffast-math is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [asm]:
5.1991955320902127e+002
It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.
Question
Is it possible to construct a better/faster asm x64 code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?
If no better asm codes are possible, is there a c++ way to implement Kahan summation that can be used with -ffast-math without devolving to the naive summation? Perhaps a c++ implementation is generally more flexible for the compiler to optimize.
Ideas or suggestions are appreciated.
Further information
The contents of "fun" cannot be inlined, but the "csum" function could be.
The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
The intended summation function is called indefinitely, several hundred million times per second, which motivates the pursuit of a high-performance low-level implementation.
Higher-precision arithmetic such as long double, float128 or arbitrary-precision libraries is not to be considered, for performance reasons.
Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)
subsd xmm0, QWORD PTR [rsp+32]
movapd xmm1, xmm3
addsd xmm3, xmm0
movsd QWORD PTR [rsp+16], xmm3
subsd xmm3, xmm1
movapd xmm1, xmm3
subsd xmm1, xmm0
movsd QWORD PTR [rsp+32], xmm1
You can put functions that need to not use -ffast-math (like a csum loop) in a separate file that gets compiled without -ffast-math.
Possibly you could also use __attribute__((optimize("no-fast-math"))), but https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html says that optimization-level pragmas and attributes aren't "suitable in production code", unfortunately.
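If you do go the attribute route, it would look something like this sketch (same algorithm as the csum above; per the caveat, GCC doesn't call this production-suitable):
__attribute__((optimize("no-fast-math")))
void csum(double &a, double b, double &c) {
    // Strict FP semantics are kept in this function even if the rest of
    // the translation unit is built with -ffast-math.
    double y = b - c;    // corrected addend
    double t = a + y;    // new running total
    c = (t - a) - y;     // new compensation term
    a = t;
}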
update: apparently part of the question was based on a misunderstanding that -O3 wasn't safe, or something? It is; ISO C++ specifies FP math rules that are like GCC's -fno-fast-math. Compiling everything with just -O3 apparently makes the OP's code run quickly and safely. See the bottom of this answer for workarounds like OpenMP to get some of the benefit of fast-math for some parts of your code without actually enabling -ffast-math.
ICC defaults to fast-math, so you have to specifically enable FP=strict for it to be safe with -O3, but gcc/clang default to fully strict FP regardless of other optimization settings (except -Ofast = -O3 -ffast-math).
You should be able to vectorize Kahan summation by keeping a vector (or four) of totals and an equal number of vectors of compensations. You can do that with intrinsics (as long as you don't enable fast-math for that file).
e.g. use SSE2 __m128d for 2 packed additions per instruction. Or AVX __m256d. On modern x86, addpd / subpd have the same performance as addsd and subsd (1 uop, 3 to 5 cycle latency depending on microarchitecture: https://agner.org/optimize/).
So you're effectively doing 8 compensated summations in parallel, each sum getting every 8th input element.
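As a sketch, the 4-lane version of that with a single __m256d accumulator could look like this (my own untested illustration; it must live in a file compiled without -ffast-math, and the tail elements are left out for brevity):

#include <immintrin.h>
#include <cstddef>

double kahan_sum_avx(const double *data, size_t n) {
    __m256d sum  = _mm256_setzero_pd();   // 4 running totals
    __m256d comp = _mm256_setzero_pd();   // 4 compensation terms
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m256d y = _mm256_sub_pd(_mm256_loadu_pd(data + i), comp); // y = x - c
        __m256d t = _mm256_add_pd(sum, y);                          // t = sum + y
        comp = _mm256_sub_pd(_mm256_sub_pd(t, sum), y);             // c = (t - sum) - y
        sum  = t;
    }
    double lane[4], lane_c[4];
    _mm256_storeu_pd(lane, sum);
    _mm256_storeu_pd(lane_c, comp);
    double total = 0, c = 0;
    for (int k = 0; k < 4; ++k) {         // fold the 4 lanes with scalar Kahan steps
        double y = (lane[k] - lane_c[k]) - c;
        double t = total + y;
        c = (t - total) - y;
        total = t;
    }
    return total;   // the last n % 4 elements still need a scalar loop
}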
Generating random numbers on the fly with your fun() is significantly slower than reading them from memory. If your normal use-case has data in memory, you should be benchmarking that. Otherwise I guess scalar is interesting.
If you're going to use inline asm, it would be much better to actually use it inline so you can get multiple inputs and multiple outputs in XMM registers with Extended asm, not stored/reloaded through memory.
Defining a stand-alone function that actually takes args by reference looks pretty performance-defeating. (Especially when it doesn't even return either of them as a return value to avoid one of the store/reload chains). Even just making a function call introduces a lot of overhead by clobbering many registers. (Not as bad in Windows x64 as in x86-64 System V where all the XMM regs are call-clobbered, and more of the integer regs.)
Also your stand-alone function is specific to the Windows x64 calling convention so it's less portable than inline asm inside a function would be.
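For reference, here is a hedged sketch of the same algorithm as GNU extended asm, with every operand kept in an XMM register instead of going through memory (untested; the constraints are easy to get wrong, which is part of why inline asm is discouraged below):

static inline void csum_inline(double &a, double b, double &c) {
    double t;
    __asm__(
        "subsd  %[c], %[b]\n\t"   // b := y = b - c
        "movapd %[a], %[t]\n\t"
        "addsd  %[b], %[t]\n\t"   // t = a + y
        "movapd %[t], %[c]\n\t"
        "subsd  %[a], %[c]\n\t"   // c := t - a
        "subsd  %[b], %[c]\n\t"   // c := (t - a) - y
        "movapd %[t], %[a]"       // a = t
        : [a] "+x" (a), [b] "+x" (b), [c] "+x" (c), [t] "=&x" (t));
}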
And BTW, clang managed to implement csum(double&, double, double&) with only two movapd instructions, instead of the 3 in your asm (which I assume you copied from GCC's asm output): https://godbolt.org/z/lw6tug. If you can assume AVX is available, you can avoid any movapd at all.
And BTW, movaps is 1 byte smaller and should be used instead. No CPUs have had separate data domains / forwarding networks for double vs. float, just vec-FP vs. vec-int (vs. GP integer).
But really, by far your best bet is to get GCC to compile a file or function without -ffast-math (https://gcc.gnu.org/wiki/DontUseInlineAsm). That lets the compiler avoid the movaps instructions when AVX is available, besides letting it optimize better when unrolling.
If you're willing to accept the overhead of a function-call for every element, you might as well let the compiler generate that asm by putting csum in a separate file. (Hopefully link-time optimization respects -fno-fast-math for one file, perhaps by not inlining that function.)
But it would be much better to disable fast-math for the whole function containing the summation loop by putting it in a separate file. You may be stuck choosing where non-inline function-call boundaries need to be, based on compiling some code with fast-math and others without.
Ideally compile all of your code with -O3 -march=native, and profile-guided optimization. Also -flto link-time optimization to enable cross-file inlining.
It's not surprising that -ffast-math breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. If you need other parts of -ffast-math like -fno-math-errno and -fno-trapping-math so math functions can inline better, then enable those manually. Those are basically always safe and a good idea; nobody checks errno after calling sqrt, so that requirement to set errno for some inputs is just a terrible misdesign of C that burdens implementations unnecessarily.
GCC's -ftrapping-math is on by default even though it's broken (it doesn't always exactly reproduce the number of FP exceptions you'd get if you unmasked any), so it should really be off by default. Turning it off doesn't enable any optimizations that would break NaN propagation; it only tells GCC that the number of exceptions isn't a visible side-effect.
Or maybe try -ffast-math -fno-associative-math for your Kahan summation file, but that's the main one that's needed to auto-vectorize FP loops that involve reductions, and helps in other cases. But still, there are several other valuable optimizations that you'd still get.
Another way to get optimizations that normally require fast-math is #pragma omp simd, which enables auto-vectorization with OpenMP even in files compiled without -ffast-math. You can declare an accumulator variable for a reduction to let gcc reorder operations on it as if they were associative.
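A minimal sketch of that (compile with -fopenmp or -fopenmp-simd; names are illustrative):

#include <cstddef>

double sum_array(const double *x, size_t n) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; ++i)
        s += x[i];   // the reduction clause lets gcc reassociate this accumulator
    return s;
}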
In a research project of mine I'm writing C++ code, but the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag-manipulating instructions, in particular to ADC, but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
b += a;
unsigned c = b < a;
return c + b + X;
}
Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky, and the code generated by g++ -O3 (version 9.1) is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect measurements (not very scientific, though). Am I right? If so, is this a "missed optimization opportunity" kind of bug that is worth reporting?
For what it's worth, clang -O3 (version 8.0.0) always uses ADC (as I wanted) and icc -O3 (version 19.0.1) never does.
I've tried using the intrinsic _addcarry_u32 but it didn't help.
unsigned f2(unsigned a, unsigned b) {
b += a;
unsigned char c = b < a;
_addcarry_u32(c, b, X, &b);
return b;
}
I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). But what's the point of using it if it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0 I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1 I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.¹
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.
I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
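To make that special case concrete, a full carry-out in plain C++ has to look something like this sketch: when carry_in is 1, the wrapped sum can equal a exactly (e.g. b = 0xffffffff), so sum < a alone misses a carry.

unsigned add3(unsigned a, unsigned b, unsigned carry_in, unsigned &carry_out) {
    unsigned sum = a + b + carry_in;
    // with carry_in == 1: a + b + 1 >= 2^32  <=>  sum <= a  (note <=, not <)
    carry_out = carry_in ? (sum <= a) : (sum < a);
    return sum;
}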
So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32 correctly.
The point of its existence is to let you express an add with carry-in as well as carry-out, which is hard in pure C. GCC and clang don't optimize it well, often failing to just keep the carry result in CF.
If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64,
// even though __restrict guarantees non-overlap.
#include <immintrin.h>
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
unsigned char carry;
carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
Footnote 1: Or for uop count: equal on Intel Haswell and earlier, where adc is 2 uops except with a zero immediate, which Sandybridge-family's decoders special-case as 1 uop.
But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc would still be a better choice.
https://agner.org/optimize/
I'm trying to find an efficient way to check if an integer is zero without jumping.
I have two integer variables in and out. If in is zero, I want out to be zero. If in is not zero, I want out to be one.
If it may help, I know that in will be zero or a power of two (only one set bit). I also know that the most significant and the least significant bits are never set.
I could do the obvious: out = (in == 0 ? 0 : 1); But that implies a jump, which is costly.
I could do something like this: out = (in * 0xFFFFFFFF) >> 63;. This implies a multiplication and a shift that I would like to avoid, but I can't find a way around them. Maybe it's not possible.
Any other way I could do this without jump and only using bit-wise operators and arithmetic?
Thanks
This will differ by architecture, but the code doesn't compile to a jump on Intel CPUs.
This code:
int square(int in) {
int out = (in != 0);
return out;
}
is compiled to:
square(int):
xor eax, eax
test edi, edi
setne al
ret
or:
square, COMDAT PROC
xor eax, eax
test ecx, ecx
setne al
ret 0
square ENDP
by msvc, clang and gcc with -O2:
msvc: https://godbolt.org/g/Mfh2Qj
clang: https://godbolt.org/g/6p7kL1
gcc: https://godbolt.org/g/vUM2Zv
It only compiles to a jump with no optimization, which you would never use anyway.
I've also found the need to do this, to index a length-2 array at 0 for zero values and 1 for non-zero values.
Cast the int to bool, and then back to int. This does not produce a jump on any compiler I've tried (gcc, clang, recent MSVC) except MSVC pre-2018. I recommend you check the assembly code to make sure on your platform.
int one_if_nonzero_else_zero(int value) { return (bool) value; }
EDIT: This does not satisfy your constraint "only using bit-wise operators and arithmetic" but this cast takes advantage of assembly optimization and will be very efficient.
EDIT: Your "obvious" solution out = (in == 0 ? 0 : 1); results in assembly code identical to the solutions posted by Jerry Jeremiah and myself on gcc, clang, and msvc. No jump after optimization! I suggest you use it for clarity.
I have two integer variables in and out. If in is zero, I want out to be zero. If in is not zero, I want out to be one.
Try this:
int in = ...;
int out = !!in;
Live Demo
C++ has an implicit conversion defined from int to bool, so in as a bool will be false when in is 0, and will be true otherwise.
Then the first ! flips that: !false becomes true, and !true becomes false.
Then the second ! flips it back: !true becomes false, and !false becomes true.
Then there is also an implicit conversion defined from bool to int, so true as an int will be 1, and false will be 0.
Thus, out will be 0 when in is 0, and will be 1 otherwise.
int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);
When compiling, I got
...
::00401C1E:: C70424 24304000 MOV DWORD PTR [ESP],403024 %d %d
::00401C25:: E8 36FFFFFF CALL 00401B60 scanf
::00401C2A:: 0FB64C24 1C MOVZX ECX,BYTE PTR [ESP+1C]
::00401C2F:: 8B4424 18 MOV EAX,[ESP+18]
::00401C33:: 31D2 XOR EDX,EDX
::00401C35:: F7F1 DIV ECX
::00401C37:: 894424 04 MOV [ESP+4],EAX
::00401C3B:: C70424 2A304000 MOV DWORD PTR [ESP],40302A %d\x0A
::00401C42:: E8 21FFFFFF CALL 00401B68 printf
Will it be faster if the DIV is turned into a MUL, using an array to store the multiplier values? If so, how can I get the compiler to do that optimization?
int main() {
uint a, s=0, i, t;
scanf("%d", &a);
diviuint aa = a;
t = clock();
for (i=0; i<1000000000; i++)
s += i/a;
printf("Result:%10u\n", s);
printf("Time:%12u\n", clock()-t);
return 0;
}
where the hypothetical type diviuint stores a precomputed multiplicative inverse of a and divides by multiplying with it instead.
Using s += i/aa makes it twice as fast as s += i/a.
You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
Using a multiplicative inverse
If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is worth the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.
Note that different constants need different shifts of the high half of the full-multiply, and some constants need more shifts than others. (Another way of saying that: some of the shift counts are zero for some constants.) So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, variable-count shifts are more expensive than immediate-count shifts.)
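For a concrete point of reference (the standard divide-by-constant transformation, written out by hand here): with a compile-time constant divisor of 9, compilers multiply by a scaled reciprocal, keep the high half of the product, and shift.

#include <cstdint>

uint32_t div9(uint32_t x) {
    // floor(x / 9) for all 32-bit x: multiply by ceil(2^33 / 9) = 0x38E38E39,
    // take the high 32 bits of the 64-bit product, then shift right once more
    return (uint32_t)(((uint64_t)x * 0x38E38E39u) >> 33);
}

A different divisor gets a different magic constant and a different total shift count, which is the variability described above.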
libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.
(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)
Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.
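A hedged sketch of the scalar usage with libdivide's C++ API (https://libdivide.com/); I haven't benchmarked this, and the function name is just illustrative:

#include <cstddef>
#include <cstdint>
#include "libdivide.h"

uint32_t sum_of_quotients(const uint32_t *v, size_t n, uint32_t a) {
    libdivide::divider<uint32_t> fast_a(a);  // compute the multiplicative inverse once
    uint32_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += v[i] / fast_a;  // overloaded operator/: multiply + shift, no div instruction
    return s;
}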
Other ways to get the div out of the loop
for (i=0; i<1000000000; i++)
s += i/a;
In your example, you might get better results from using a uint128_t s accumulator and dividing by a outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because integer division truncates instead of rounding to nearest.)
I think you can account for that by looping with i += a; tmp++ and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1].
Obviously that was just a trivial example, and looping more efficiently is usually not actually possible. It's off-topic for this question, but worth saying anyway: look for big optimizations by restructuring code or taking advantage of some math before trying to speed up doing the exact same thing faster.
Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*1 + a*2 + a*3 ... a*max. I may have an off-by-one here, but I'm confident a closed-form, simple, constant-time calculation will give the same answer as the loop for any a:
uint32_t n = 1000000000 / a;
uint32_t s = a * n*(n+1)/2 + 1000000000 % a;
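One way to pin down that possible off-by-one: each quotient q = 0 .. n-1 occurs exactly a times, and the final quotient n = N/a occurs N % a times, so the exact identity is sum(i/a, i=0..N-1) == a*n*(n-1)/2 + n*(N % a). A quick brute-force check of that claim (my own sketch, not from the original answer):

#include <cassert>
#include <cstdint>

int main() {
    for (uint64_t N = 1; N < 2000; ++N)
        for (uint64_t a = 1; a < 50; ++a) {
            uint64_t loop = 0;
            for (uint64_t i = 0; i < N; ++i)
                loop += i / a;                 // the original loop, shrunk down
            uint64_t n = N / a;                // quotient of the final partial block
            assert(loop == a * n * (n - 1) / 2 + n * (N % a));
        }
    return 0;
}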
If you just needed i/a in a loop, it might be worth it to do something like:
// another optimization for an unlikely case
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
// use i_over_a
++remainder;
    if (remainder == a) { // if you don't need the remainder in the loop, it could save
                          // an insn or two to count down from a to 0 instead of up from
                          // 0 to a (e.g. on x86). But then you need a clever variable
                          // name other than remainder.
remainder = 0;
++i_over_a;
}
}
Again, this is unlikely: it only works if you're dividing the loop counter by a constant. However, it should work well. Either a is large, so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on the microarchitecture.
Branchless asm is probably possible but not worth it, e.g. handling ++i_over_a with an add-with-carry and a conditional move for zeroing (x86 pseudo-code: cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0; the b (below) condition is just CF==1, the same as the c (carry) condition). The branchless asm would be simplified by decrementing from a to 0: you don't need a zeroed reg for cmov, and you could keep a in a reg instead of a-1.
Replacing DIV with MUL may make sense (but doesn't have to in all cases) when one of the values is known at compile time. When both are user inputs, you don't know the range, so the usual tricks will not work.
Basically you need to handle both a and b anywhere between INT_MIN and INT_MAX. There's no room left for scaling them up or down. Even if you wanted to extend them to larger types, it would probably take longer just to invert b and check that the result is consistent.
The only way to KNOW whether div or mul is faster is to benchmark both. (Obviously, if you use your above code, you'd mostly measure the time of reading/writing the inputs and results, not the actual divide instruction, so you need a setup where you can isolate the divide instruction(s) from the input and output.)
My guess would be that on slightly older processors mul is a bit faster; on modern processors, div will be as fast as, if not faster than, a lookup of 256 int values.
If you have ONE target system, then it's plausible to test this. If you have several different systems you want to run on, you will have to ensure the "improved code" is faster on at least some of them, and not significantly slower on the rest.
Note also that you would introduce a dependency, which may in itself slow down the sequence of operations; modern CPUs are pretty good at "hiding" latency as long as there are other instructions to execute (so you should test this in as realistic a scenario as possible).
There is a wrong assumption in the question. The multiplicative inverse of an integer greater than 1 is a fraction less than one. These don't exist in the world of integers. A lookup table doesn't work because you can't lookup what doesn't exist. Even if you "scale" the dividend the results will not be correct in the sense of being the same as an integer division. Take this example:
printf("%x %x\n", 0x10/0x9, 0x30/0x9);
// prints: 1 5
Assuming a multiplicative inverse existed, both terms are divided by the same divisor (9) so must have the same lookup table value (multiplicative inverse). Any fixed lookup value corresponding to the divisor (9) multiplied by an integer will be precisely 3 times greater in the second term relative to the first term. As you can see from the example, the result of an actual integer division is a 5, not a 3.
You can approximate things by using a scaled lookup table. For instance a lookup table that is the multiplicative inverse when the result is divided by 2^16. You would then multiply by the lookup table value and shift the result 16 bits to the right. Time consuming and requires a 1024 byte lookup table. Even so, this would not produce the same results as an integer divide. A compiler optimization is not going to produce "approximate" results of an integer division.
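A minimal illustration of that scaled-lookup idea and its failure mode (my own sketch): with a 2^16 scale, 0x30/9 happens to come out right, but 9/9 comes out 0 instead of 1 because the truncated reciprocal is slightly too small.

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t inv = (1u << 16) / 9u;                        // truncated scaled reciprocal of 9
    printf("%u vs %u\n", (0x30u * inv) >> 16, 0x30u / 9u); // 5 vs 5 (matches)
    printf("%u vs %u\n", (9u * inv) >> 16, 9u / 9u);       // 0 vs 1 (off by one)
    return 0;
}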
I have the following line in a function to count the number of 'G' and 'C' in a sequence:
count += (seq[i] == 'G' || seq[i] == 'C');
Are compilers smart enough to do nothing when they see count += 0, or do they actually lose time 'adding' 0?
Generally
x += y;
is faster than
if (y != 0) { x += y; }
even when y is 0, because there is no branch in the first option. If it's really important, you'll have to check the compiler output, but don't assume your way is faster just because it sometimes skips an add.
Honestly, who cares??
[Edit:] This actually turned out somewhat interesting. Contrary to my initial assumption, in unoptimized compilation, n += b; is better than n += b ? 1 : 0;. However, with optimizations, the two are identical. More importantly, though, the optimized version of this form is always better than if (b) ++n;, which always creates a cmp/je instruction. [/Edit]
If you're terribly curious, put the following code through your compiler and compare the resulting assembly! (Be sure to test various optimization settings.)
int n;
void f(bool b) { n += b ? 1 : 0; }
void g(bool b) { if (b) ++n; }
I tested it with GCC 4.6.1: With g++ and with no optimization, g() is shorter. With -O3, however, f() is shorter:
g():
        cmpb    $0, 4(%esp)
        je      .L1
        addl    $1, n
.L1:
        rep

f():
        movzbl  4(%esp), %eax
        addl    %eax, n
Note that the optimization for f() actually does what you wrote originally: It literally adds the value of the conditional to n. This is in C++ of course. It'd be interesting to see what the C compiler would do, absent a bool type.
Another note, since you tagged this C as well: In C, if you don't use bools (from <stdbool.h>) but rather ints, then the advantage of one version over the other disappears, since both now have to do some sort of testing.
It depends on your compiler, its optimization options that you used and its optimization heuristics. Also, on some architectures it may be faster to add than to perform a conditional jump to avoid the addition of 0.
Compilers will NOT optimize away the +0 unless the expression on the right is a compile-time constant equal to zero. But adding zero is much faster on all modern processors than branching (if/then) to try to avoid the add. So the compiler ends up doing the smartest thing available in the given situation: simply adding 0.
Some are and some are not smart enough; it's highly dependent on the optimizer implementation.
The optimizer might also determine that a branch is slower than an add, so it will still do the addition.