A good way to do a fast divide in C++? - c++

Sometimes I see and have used the following variation for a fast divide in C++ with floating point numbers.
// orig loop
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
// alternative
double y = 44100;
double y_div = 1.0 / y;
for(int i=0; i<10000; ++i) {
double z = x * y_div;
}
But someone hinted recently that this might not be the most accurate way.
Any thoughts?

On just about every CPU, a floating point divide is several times as expensive as a floating point multiply, so multiplying by the inverse of your divisor is a good optimization. The downside is that there is a possibility that you will lose a very small portion of accuracy on certain processors - eg, on modern x86 processors, 64-bit float operations are actually internally computed using 80 bits when using the default FPU mode, and storing it off in a variable will cause those extra precision bits to be truncated according to your FPU rounding mode (which defaults to nearest). This only really matters if you are concatenating many float operations and have to worry about the error accumulation.

Wikipedia agrees that this can be faster. The linked article also contains several other fast division algorithms that might be of interest.
I would guess that any industrial-strength modern compiler will make that optimization for you if it is going to profit you at all.

Your original
// original loop:
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
can easily be optimized to
// haha:
double y = 44100.0;
double z = x / y;
and the performance is pretty nice. ;-)
EDIT: People keep voting this down, so here's the not so funny version:
If there were a general way to make division faster for all cases, don't you think compiler writers might have happened upon it by now? Of course they would have done. Also, some of the people doing FPU circuits aren't exactly stupid, either.
So the only way you're going to get better performance is to know what specific situation you have at hand and doing optimal code for that. Most likely this is a complete waste of your time, because your program is slow for some other reason such as performing math on loop invariants. Go find a better algorithm instead.

In your example using gcc the division with the options -O3 -ffast-math yields the same code as the multiplication without -ffast-math. (In a testing environment with enough volatiles around that the loop is still there.)
So if you really want to optimise those divisions away and don’t care about the consequences, that’s the way to go. Multiplication seems to be roughly 15 times faster, btw.

multiplication is faster than division so the second method is faster. It might be slightly less accurate but unless you are doing hard core numerics the level of accuracy should be more than enough.

When processing audio, I prefer to use fixed point math instead. I suppose this depends on the level of precision you need. But, let's assume that 16.16 fixed point integers will do (meaning high 16 bits is whole number, low 16 is the fraction). Now, all calculation can be done as simple integer math:
unsigned int y = 44100 << 16;
unsigned int z = x / (y >> 16); // divisor must be the whole number portion
Or with macros to help:
#define FP_INT(x) (x << 16)
#define FP_MUL(x, y) (x * (y >> 16))
#define FP_DIV(x, y) (x / (y >> 16))
unsigned int y = FP_INT(44100);
unsigned int z = FP_MUL(x, y);

Audio, hunh? It's not just 44,100 divisions per second when you have, say, five tracks of audio running at once. Even a simple fader consumes cycles, after all. And that's just for a fairly bare-bones, minimal example -- what if you want to have, say, an eq and a compressor? Maybe a little reverb? Your total math budget, so to speak, gets eaten up quickly. It does make sense to wring out a little extra performance in those cases.
Profilers are good. Profilers are your friend. Profilers deserve blowjobs and pudding. But you already know where the main bottle neck is in audio work -- it's in the loop that processes samples, and the faster you can make that, the happier your users will be. Use everything you can! Multiply by reciprocals, shift bits whenever possible (exp(x*y) = exp (x)*exp(y), after all), use lookup tables, refer to variables by reference instead of values (less pushing/popping on the stack), refactor terms, and so forth. (If you're good, you'll laugh at these elementary optimizations.)

I presume from the original post that x is not a constant shown there but probably data from an array so x[i] is likely to be the source of the data and similarly for the output, it will be stored somewhere in memory.
I suggest that if the loop count really is 10,000 as in the original post that it will make little difference which you use as the whole loop won't even take a fraction of millisecond anyway on a modern cpu. If the loop count really is very much higher, perhaps 1,000,000 or more, then I would expect that the cost of memory access would likely make the faster operation completely irrelevent anyway as it will always be waiting for the data anyway.
I suggest trying both with your code and testing if it actually makes any significant difference in run time and if it doesn't then just write the straightforward division if that's what the algorithm needs.

here's the problem with doing it with a reciprocal, you still have to do the division before you can actually divide by Y. unless your only dividing by Y then i suppose this may be useful. this is not very practical since division is done in binary with similar algorithms.

I are looping 10,000 times simply to make the code take long enough to measure the time easily? Or do you have 10000 numbers to divide by the same number? If the former, put the "y_div = 1.0 / y;" inside the loop, because it's part of the operation.
If the latter, yes, floating point multiplication is generally faster than division. Don't change your code from the obvious to the arcane based on guesses, though. Benchmark first to find slow spots, and then optimize those (and take measurements before and after to make sure your idea actually causes an improvement)

On old CPUs like the 80286, floating point maths was abysmally slow and we employed lots of trickiness to speed things up.
On modern CPUs floating point maths is blindingly fast and optimising compilers can generally do wonders with fine-tuning things.
It is almost never worth the effort to employ little micro-optimisations like that.
Try to make your code simple and idiot-proof. Only of you find a real bottleneck (using a profiler) would you think of optimisations in your floating point calculations.

Related

How profitable is it to use powers of two in the calculations?

The question is whether it is possible to achieve a noticeable increase in productivity by using powers of two in multiplications and divisions, since the compiler could convert them to a shift (or it could be explicitly using a shift for this). I have a lot of multiplications by one number in my task (a coefficient that I myself entered), but I can use for example 512 instead of 500.
for(i=0;i<X;i++)
{
cout<<i*512 // or i*500
}
or i need do it same:
for(i=0;i<X;i++)
{
cout<<i>>9;
}
and an additional question - does it make sense to introduce a variable for the condition so that the compiler does not repeatedly read the condition again or does it do it automatically?
For example:
for(int i=0;i<10*K*H;i++)
{
// K and H cant change in this loop
}
I was trying to check it in Compulier Explorer, but it create less lines of code when i divide and no create same code when i multiply
About the limit in the for loop, you may want to give the compiler some assistance.
Compute the limit before the loop:
const int limit = 10 * K * H;
for (i = 0; i < limit; ++i)
{
}
This can help when compiling with no optimizations (e.g. debug mode). Your compiler may perform better optimizations when you increase the optimization level.
I recommend printing the assembly language for your for loop and comparing with the assembly language for the above code. The truth is in the assembly language.
Edit 1: shifting vs. multiplication
In most processors, bit shifting is often faster than multiplication. In modern processors, the savings is in the order of nanoseconds, or possibly microseconds.
Many compilers will convert a multiplication into a bit shift, depending on the optimization level and the context.
In your example, you will probably not notice the optimization gain, because the gain will be wasted in the call to cout. I/O consumes more time than the time gained by micro-optimizations.
Profiling your code will give you the best data for making these kinds of decisions. Also read about benchmarking to collect better data. For example, you may have to run your loop for 1E6 or more iterations to rule out outliers such as interrupts and task swaps.

Set all floating point literals to floats MSVC++

I am writing some numeric code in C++ and I want to be able to swap between using double and float. I have therefore added a #define MYFLT which I can make either a float or a double as needed. However, how do I deal with the various numeric literals.
For example
MYFLT someNumber = 1.2;
MYFLT someOtherNumber = 1.5f;
gives compiler warnings for the first line when MYFLT is a float and for the second line when MYFLT is a double. I know this is a trivial example, but there are other cases where I have longer expresions with literals in and floats can end up being converted to doubles then the result back to floats which I think is costing me significant performance. How should I deal with this?
I could do things like
MYFLT someNumber = MYFLT(1.2);
MYFLT someOtherNumber = MYFLT(1.5);
but this is quite tedious. I'm assuming that in that if I do this the compiler is clever enough to just use a float when needed (can anyone confirm that?). What would be better would be if there was a MSVC++ compiler switch or #define that will tell the compiler to treat all floating point literals as floats instead of doubles. Does such a switch exist?
Even when I wrap all my literals as above my code runs 50% slower when I use float rather than double. I was expecting a performance boost through simd type operations, not a penalty!
Phil
What you'd want is #define MYFLTCONST(x) x##f or #define MYFLTCONST(x) x depending on whether you want a f suffix for float appended.
This is a (not quite complete) answer to my own question.
I found that a small function that was called many times (a fast approximation to sin) didn't have its literals cast as MYFLT. The extra computational hit of this also meant that the compiler wasn't inlining it. This function accounted for most of the difference. Some further profiling seemed to indicate that accessing std::vector<float> was slower than std::vector<double> ( I am using [] to do the access if it matters ). Replacing std::vectors with raw fixed sized arrays sped up the double implementation a little and closed the gap significantly for the float implementation. The float version is now only about 10% slower than the double version. But definitely no speed increase due to either RAM access nor vectorization. I guess I need to think more carefully about my loops to get any benefit there.
I guess the conclusion here (yet again) is that the compiler is pretty good at optimising code - it's much better to work with it and do careful profiling than it is to try and do your own blind "optimisations" which might actually have negative effects, like stopping the compiler performing good inlining.

Performance wise, how fast are Bitwise Operators vs. Normal Modulus?

Does using bitwise operations in normal flow or conditional statements like for, if, and so on increase overall performance and would it be better to use them where possible? For example:
if(i++ & 1) {
}
vs.
if(i % 2) {
}
Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.
TL;DR Write for semantics first, optimize measured hot-spots second.
At the CPU level, integer modulus and divisions are among the slowest operations. But you are not writing at the CPU level, instead you write in C++, which your compiler translates to an Intermediate Representation, which finally is translated into assembly according to the model of CPU for which you are compiling.
In this process, the compiler will apply Peephole Optimizations, among which figure Strength Reduction Optimizations such as (courtesy of Wikipedia):
Original Calculation Replacement Calculation
y = x / 8 y = x >> 3
y = x * 64 y = x << 6
y = x * 2 y = x << 1
y = x * 15 y = (x << 4) - x
The last example is perhaps the most interesting one. Whilst multiplying or dividing by powers of 2 is easily converted (manually) into bit-shifts operations, the compiler is generally taught to perform even smarter transformations that you would probably think about on your own and who are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).
This is obviously CPU dependent, but you can expect that bitwise operations will never take more, and typically take less, CPU cycles to complete. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines having a specific instruction complete earlier doesn't mean your program necessarily runs faster.
Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).
For whatever it's worth, on x86 instructions specifically - and when the divisor is a runtime-variable value so can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs, the time taken by / and % operations in CPU cycles can be looked up here. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs - if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data, DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.
By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)
So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.
For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.
Unfortunately integer division is slightly irregular on signed values, as it rounds towards zero and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division by a simple bit operation. Instead it may either call a routine for integer division, or replace it with bit operations with additional logic to handle the irregularity. This may depends on the optimization level and on which of the operands are constants.
This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.
C operators cannot be meaningfully compared in therms of "performance". There's no such thing as "faster" or "slower" operators at language level. Only the resultant compiled machine code can be analyzed for performance. In your specific example the resultant machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.
Here is the compiler (GCC 4.6) generated optimized -O3 code for both options:
int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;
Generated code for opt1:
l %r1,520(%r11)
nilf %r1,1
st %r1,516(%r11)
asi 520(%r11),1
Generated code for opt2:
l %r1,520(%r11)
nilf %r1,2147483649
ltr %r1,%r1
jhe .L14
ahi %r1,-1
oilf %r1,4294967294
ahi %r1,1
.L14: st %r1,512(%r11)
So 4 extra instructions...which are nothing for a prod environment. This would be a premature optimization and just introduce complexity
Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:
int modulus (int a, int b) {
return a % b;
}
and a program launches modulus(135, 16), nowhere in the compiled code there will be any trace of bitwise magic. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved and it leaves it as such.
But you can know if only powers of two will be passed to a function. And if that is the case, the only way to optimize your code is to rewrite your function as
unsigned int modulus_2 (unsigned int a, unsigned int b) {
return a & (b - 1);
}
The compiler cannot do that for you.
Bitwise operations are much faster.
This is why the compiler will use bitwise operations for you.
Actually, I think it will be faster to implement it as:
~i & 1
Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x=0. But (I hope) you are not going to use this in your C++ code.
In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.

Optimal (Time paradigm) solution to check variable within boundary

Sorry if the question is very naive.
I will have to check the below condition in my code
0 < x < y
i.e code similar to if(x > 0 && x < y)
The basic problem at system level is - currently, for every call (Telecom domain terminology), my existing code is hit (many times). So performance is very very critical, Now, I need to add a check for boundary checking (at many location - but different boundary comparison at each location).
At very normal level of coding, the above comparison would look very naive without any issue. However, when added over my statistics module (which is dipped many times), performance will go down.
So I would like to know the best possible way to handle the above scenario (kind of optimal way for limits checking technique). Like for example, if bit comparison works better than normal comparison or can both the comparison be evaluation in shorter time span?
Other Info
x is unsigned integer (which must be checked to be greater than 0 and less than y).
y is unsigned integer.
y is a non-const and varies for every comparison.
Here time is the constraint compared to space.
Language - C++.
Now, later if I need to change the attribute of y to a float/double, would there be another way to optimize the check (i.e will the suggested optimal technique for integer become non-optimal solution when y is changed to float/double).
Thanks in advance for any input.
PS : OS used is SUSE 10 64 bit x64_64, AIX 5.3 64 bit, HP UX 11.1 A 64.
As always, profile first, optimize later. But, given that this is actually an issue, these could be things to look into:
"Unsigned and greater than zero" is the same as "not equal to zero", which is usually about as fast as a comparison gets. So a first optimization would be to do x != 0 && x < y.
Make sure that you do the comparison that is most likely to fail the first one, to maximize the gain from short circuiting.
If possible, use compiler directives to tell the compiler about the most likely code path. This will optimize instruction prefetching etc. I.e. for GCC look at something like this, done in the kernel.
I don't think tricks with subtraction and comparison against zero, etc. will be of any gain. If that is the most effective way to do a less-than comparison, you can be sure your compiler already knows about it.
This eliminates a compare and branch at the expense of two adds; it should be faster:
(x-1) < (y-1)
It works as long as y is guaranteed non-zero.
You probably don't need to change y to a float or a double; you should endeavor to stay in integer for as much as you can. Instead of representing y as seconds, try microseconds or milliseconds (depending on the resolution you need).
Anyway- I suspect you can change
if (x > 0 && x < y)
;
to
if ((unsigned int)x < (unsigned int)y)
;
but that's probably not going to actually speed anything up. Checking against zero is often one or two instructions (depending on ISA) so the read from memory is certainly the bottleneck here.
After you've profiled your code and determined that this is actually where the performance problems are, you could investigate tweaking the branch predictor, since that's somewhere a lot of time can be wasted if it's regularly mispredicting. Different compilers do it differently, but some have an intrinsic like __expect(x < 0);, which will tell the predictor to assume that's usually the case.

What is faster (x < 0) or (x == -1)?

Variable x is int with possible values: -1, 0, 1, 2, 3.
Which expression will be faster (in CPU ticks):
1. (x < 0)
2. (x == -1)
Language: C/C++, but I suppose all other languages will have the same.
P.S. I personally think that answer is (x < 0).
More widely for gurus: what if x from -1 to 2^30?
That depends entirely on the ISA you're compiling for, and the quality of your compiler's optimizer. Don't optimize prematurely: profile first to find your bottlenecks.
That said, in x86, you'll find that both are equally fast in most cases. In both cases, you'll have a comparison (cmp) and a conditional jump (jCC) instructions. However, for (x < 0), there may be some instances where the compiler can elide the cmp instruction, speeding up your code by one whole cycle.
Specifically, if the value x is stored in a register and was recently the result of an arithmetic operation (such as add, or sub, but there are many more possibilities) that sets the sign flag SF in the EFLAGS register, then there's no need for the cmp instruction, and the compiler can emit just a js instruction. There's no simple jCC instruction that jumps when the input was -1.
Try it and see! Do a million, or better, a billion of each and time them. I bet there is no statistical significance in your results, but who knows -- maybe on your platform and compiler, you might find a result.
This is a great experiment to convince yourself that premature optimization is probably not worth your time--and may well be "the root of all evil--at least in programming".
Both operations can be done in a single CPU step, so they should be the same performance wise.
x < 0 will be faster. If nothing else, it prevents fetching the constant -1 as an operand.
Most architectures have special instructions for comparing against zero, so that will help too.
It could be dependent on what operations precede or succeed the comparison. For example, if you assign a value to x just before doing the comparison, then it might be faster to check the sign flag than to compare to a specific value. Or the CPU's branch-prediction performance could be affected by which comparison you choose.
But, as others have said, this is dependent upon CPU architecture, memory architecture, compiler, and a lot of other things, so there is no general answer.
The important consideration, anyway, is which actually directs your program flow accurately, and which just happens to produce the same result?
If x is actually and index or a value in an enum, then will -1 always be what you want, or will any negative value work? Right now, -1 is the only negative, but that could change.
You can't even answer this question out of context. If you try for a trivial microbenchmark, it's entirely possible that the optimizer will waft your code into the ether:
// Get time
int x = -1;
for (int i = 0; i < ONE_JILLION; i++) {
int dummy = (x < 0); // Poof! Dummy is ignored.
}
// Compute time difference - in the presence of good optimization
// expect this time difference to be close to useless.
Same, both operations are usually done in 1 clock.
It depends on the architecture, but the x == -1 is more error-prone. x < 0 is the way to go.
As others have said there probably isn't any difference. Comparisons are such fundamental operations in a CPU that chip designers want to make them as fast as possible.
But there is something else you could consider. Analyze the frequencies of each value and have the comparisons in that order. This could save you quite a few cycles. Of course you still need to compile your code to asm to verify this.
I'm sure you're confident this is a real time-taker.
I would suppose asking the machine would give a more reliable answer than any of us could give.
I've found, even in code like you're talking about, my supposition that I knew where the time was going was not quite correct. For example, if this is in an inner loop, if there is any sort of function call, even an invisible one inserted by the compiler, the cost of that call will dominate by far.
Nikolay, you write:
It's actually bottleneck operator in
the high-load program. Performance in
this 1-2 strings is much more valuable
than readability...
All bottlenecks are usually this
small, even in perfect design with
perfect algorithms (though there is no
such). I do high-load DNA processing
and know my field and my algorithms
quite well
If so, why not to do next:
get timer, set it to 0;
compile your high-load program with (x < 0);
start your program and timer;
on program end look at the timer and remember result1.
same as 1;
compile your high-load program with (x == -1);
same as 3;
on program end look at the timer and remember result2.
compare result1 and result2.
You'll get the Answer.