Is it possible to roll a significantly faster version of sqrt - c++

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead though.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?

Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.

There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE2's ssqrts is about 2x faster than FPU fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.

You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this still will only bring your execution time down by 2,5%. For most applications users wouldn't even notice this, except if they use a watch to measure.

How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).

Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.

Here is a fast way with a look up table of only 8KB. Mistake is ~0.5% of the result. You can easily enlarge the table, thus reducing the mistake. Runs about 5 times faster than the regular sqrt()
// LUT for fast sqrt of floats. Table will be consist of 2 parts, half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11; // Use only 11 most sagnificant bits from the 23 of float. We can use 15 bits instead. It will produce less error but take more place in a memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision; // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision+1)); // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE; // Once initialized will be true
// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
unsigned short i;
float f;
UINT32 *fi = (UINT32*)&f;
if (is_sqrttab_initialized)
return;
const int halfTableSize = (tableSize>>1);
for (i=0; i < halfTableSize; i++){
*fi = 0;
*fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127
// Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
f = sqrtf(f);
sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);
// Repeat the process, this time with an exponent of 1, stored as 128
*fi = 0;
*fi = (i << nUnusedBits) | (128 << 23);
f = sqrtf(f);
sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
}
is_sqrttab_initialized = TRUE;
}
// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
if (n <= 0.f)
return 0.f; // On 0 or negative return 0.
UINT32 *num = (UINT32*)&n;
short e; // Exponent
e = (*num >> 23) - 127; // In 'float' the exponent is stored with 127 added.
*num &= 0x7fffff; // leave only the mantissa
// If the exponent is odd so we have to look it up in the second half of the lookup table, so we set the high bit.
const int halfTableSize = (tableSize>>1);
const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
if (e & 0x01)
*num |= secondHalphTableIdBit;
e >>= 1; // Divide the exponent by two (note that in C the shift operators are sign preserving for signed operands
// Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
*num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
return n;
}

Related

How to force pow(float, int) to return float

The overloaded function float pow(float base, int iexp ) was removed in C++11 and now pow returns a double. In my program, I am computing lots of these (in single precision) and I am interested in the most efficient way how to do it.
Is there some special function (in standard libraries or any other) with the above signature?
If not, is it better (in terms of performance in single precision) to explicitly cast result of pow into float before any other operations (which would cast everything else into double) or cast iexp into float and use overloaded function float pow(float base, float exp)?
EDIT: Why I need float and do not use double?
The primarily reason is RAM -- I need tens or hundreds of GB so this reduction is huge advantage. So I need from float to get float. And now I need the most efficient way to achieve that (less casts, use already optimize algorithms, etc).
You could easily write your own fpow using exponentiation by squaring.
float my_fpow(float base, unsigned exp)
{
float result = 1.f;
while (exp)
{
if (exp & 1)
result *= base;
exp >>= 1;
base *= base;
}
return result;
}
Boring part:
This algorithm gives the best accuracy, that can be archived with float type when |base| > 1
Proof:
Let we want to calculate pow(a, n) where a is base and n is exponent.
Let's define b1=a1, b2=a2, b3=a4, b4=a8,and so on.
Then an is a product over all such bi where ith bit is set in n.
So we have ordered set B={bk1,bk1,...,bkn} and for any j the bit kj is set in n.
The following obvious algorithm A can be used for rounding error minimization:
If B contains single element, then it is result
Pick two elements p and q from B with minimal modulo
Remove them from B
Calculate product s = p*q and put it to B
Go to the first step
Now, lets prove that elements in B could be just multiplied from left to right without loosing accuracy. It comes form the fact, that:
bj > b1*b2*...*bj-1
because bj=bj-1*bj-1=bj-1*bj-2*bj-2=...=bj-1*bj-2*...*b1*b1
Since, b1 = a1 = a and its modulo more than one then:
bj > b1*b2*...*bj-1
Hence we may conclude, that during multiplication from left to right the accumulator variable is less than any element from B.
Then, expression result *= base; (except the very first iteration, for sure) does multiplication of two minimal numbers from B, so the rounding error is minimal. So, the code employs algorithm A.
Another question that can only be honestly answered with "wrong question". Or at least: "Are you really willing to go there?". float theoretically needs ca. 80% less die space (for the same number of cycles) and so can be much cheaper for bulk processing. GPUs love float for this reason.
However, let's look at x86 (admittedly, you didn't say what architecture you're on, so I picked the most common). The price in die space has already been paid. You literally gain nothing by using float for calculations. Actually, you may even lose throughput because additional extensions from float to double are required, and additional rounding to intermediate float precision. In other words, you pay extra to have a less accurate result. This is typically something to avoid except maybe when you need maximum compatibility with some other program.
See Jens' comment as well. These options give the compiler permission to disregard some language rules to achieve higher performance. Needless to say this can sometimes backfire.
There are two scenarios where float might be more efficient, on x86:
GPU (including GPGPU), in fact many GPUs don't even support double and if they do, it's usually much slower. Yet, you will only notice when doing very many calculations of this sort.
CPU SIMD aka vectorization
You'd know if you did GPGPU. Explicit vectorization by using compiler intrinsics is also a choice – one you could make, for sure, but this requires quite a cost-benefit analysis. Possibly your compiler is able to auto-vectorize some loops, but this is usually limited to "obvious" applications, such as where you multiply each number in a vector<float> by another float, and this case is not so obvious IMO. Even if you pow each number in such a vector by the same int, the compiler may not be smart enough to vectorize this effectively, especially if pow resides in another translation unit, and without effective link time code generation.
If you are not ready to consider changing the whole structure of your program to allow effective use of SIMD (including GPGPU), and you're not on an architecture where float is indeed much cheaper by default, I suggest you stick with double by all means, and consider float at best a storage format that may be useful to conserve RAM, or to improve cache locality (when you have a lot of them). Even then, measuring is an excellent idea.
That said, you could try ivaigult's algorithm (only with double for the intermediate and for the result), which is related to a classical algorithm called Egyptian multiplication (and a variety of other names), only that the operands are multiplied and not added. I don't know how pow(double, double) works exactly, but it is conceivable that this algorithm could be faster in some cases. Again, you should be OCD about benchmarking.
If you're targeting GCC you can try
float __builtin_powif(float, int)
I have no idea about it's performance tough.
Is there some special function (in standard libraries or any other) with the above signature?
Unfortunately, not that I know of.
But, as many have already mentioned benchmarking is necessary to understand if there is even an issue at all.
I've assembled a quick benchmark online. Benchmark code:
#include <iostream>
#include <boost/timer/timer.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#include <cmath>
int main ()
{
boost::random::mt19937 gen;
boost::random::uniform_real_distribution<> dist(0, 10000000);
const size_t size = 10000000;
std::vector<float> bases(size);
std::vector<float> fexp(size);
std::vector<int> iexp(size);
std::vector<float> res(size);
for(size_t i=0; i<size; i++)
{
bases[i] = dist(gen);
iexp[i] = std::floor(dist(gen));
fexp[i] = iexp[i];
}
std::cout << "float pow(float, int):" << std::endl;
{
boost::timer::auto_cpu_timer timer;
for(size_t i=0; i<size; i++)
res[i] = std::pow(bases[i], iexp[i]);
}
std::cout << "float pow(float, float):" << std::endl;
{
boost::timer::auto_cpu_timer timer;
for(size_t i=0; i<size; i++)
res[i] = std::pow(bases[i], fexp[i]);
}
return 0;
}
Benchmark results (quick conclusions):
gcc: c++11 is consistently faster than c++03.
clang: indeed int-version of c++03 seems a little faster. I'm not sure if it is within a margin of error, since I only run the benchmark online.
Both: even with c++11 calling pow with int seems to be a tad more performant.
It would be great if others could verify if this holds for their configurations as well.
Try using powf() instead. This is C99 function that should be also available in C++11.

How to let GCC compiler turn variable-division into mul(if faster)

int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);
When compiling, I got
...
::00401C1E:: C70424 24304000 MOV DWORD PTR [ESP],403024 %d %d
::00401C25:: E8 36FFFFFF CALL 00401B60 scanf
::00401C2A:: 0FB64C24 1C MOVZX ECX,BYTE PTR [ESP+1C]
::00401C2F:: 8B4424 18 MOV EAX,[ESP+18]
::00401C33:: 31D2 XOR EDX,EDX
::00401C35:: F7F1 DIV ECX
::00401C37:: 894424 04 MOV [ESP+4],EAX
::00401C3B:: C70424 2A304000 MOV DWORD PTR [ESP],40302A %d\x0A
::00401C42:: E8 21FFFFFF CALL 00401B68 printf
Will it be faster if the DIV turn into MUL and use an array to store the mulvalue? If so, how to let the compiler do the optimization?
int main() {
uint a, s=0, i, t;
scanf("%d", &a);
diviuint aa = a;
t = clock();
for (i=0; i<1000000000; i++)
s += i/a;
printf("Result:%10u\n", s);
printf("Time:%12u\n", clock()-t);
return 0;
}
where diviuint(a) make a memory of 1/a and use multiple instead
Using s+=i/aa makes the speed 2 times of s+=i/a
You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
Using a multiplicative inverse
If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is with the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.
Note that different constants need different shifts of the high half of the full-multiply, and some constants need more different shifts than others. (Another way of saying that some of the shift-counts are zero for some constants). So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, this is more expensive than immediate-count shifts).
libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.
(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)
Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.
Other ways to get the div out of the loop
for (i=0; i<1000000000; i++)
s += i/a;
In your example, you might get better results from using a uint128_t s accumulator and dividing by a outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because integer division truncates instead of rounding to nearest.)
I think you can account for that by looping with i += a; tmp++, and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1]. Obviously that was just a trivial example, and looping more efficiently is usually not actually possible. It's off-topic for this question, but worth saying anyway: Look for big optimizations by re-structuring code or taking advantage of some math before trying to speed up doing the exact same thing faster. Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*1 + a*2 + a*3 ... a*max. I may have an off-by-one here, but I'm confident a closed-form simple constant time calculation will give the same answer as the loop for any a:
uint32_t n = 1000000000 / a;
uint32_t s = a * n*(n+1)/2 + 1000000000 % a;
If you just needed i/a in a loop, it might be worth it to do something like:
// another optimization for an unlikely case
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
// use i_over_a
++remainder;
if (remainder == a) { // if you don't need the remainder in the loop, it could save an insn or two to count down from a to 0 instead of up from 0 to a, e.g. on x86. But then you need a clever variable name other than remainder.
remainder = 0;
++i_over_a;
}
}
Again, this is unlikely: it only works if you're dividing the loop counter by a constant. However, it should work well. Either a is large so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on microarchitecture. Branchless asm is probably possible but not worth it. e.g. handle ++i_over_a with an add-with-carry and a conditional move for zeroing. (e.g. x86 pseudo-code cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0. The b (below) condition is just CF==1, same as the c (carry) condition. The branchless asm would be simplified by decrementing from a to 0. (don't need a zeroed reg for cmov, and could have a in a reg instead of a-1))
Replacing DIV with MUL may make sense (but doesn't have to in all cases) when one of the values is known at compile time. When both are user inputs, you don't know what's the range, so all usual tricks will not work.
Basically you need to handle both a and b between INT_MAX and INT_MIN. There's no space left for scaling them up/down. Even if you wanted to extend them to larger types, it would probably take longer time just to invert b and check that the result will be consistent.
The only way to KNOW if div or mul is faster is by testing both in a benchmark [obviously, if you use your above code, you'd mostly measure the time of read/write of the inputs and results, not the actual divide instruction, so you need something where you can isolate the divide instruction(s) from the input and output].
My guess would be that on slightly older processors, mul is a bit faster, on modern processors, div will be as fast as, if not faster than, a lookup of 256 int values.
If you have ONE target system, then it's plausible to test this. If you have several different systems you want to run on, you will have to ensure the "improved code" is faster on at least some of them - and not significantly slower on the rest.
Note also that you would introduce a dependency, which may in itself slow down the sequence of operations - modern CPU's are pretty good at "hiding" latency as long as there are other instructions to execute [so you should use this in an "as realistic scenario as possible"].
There is a wrong assumption in the question. The multiplicative inverse of an integer greater than 1 is a fraction less than one. These don't exist in the world of integers. A lookup table doesn't work because you can't lookup what doesn't exist. Even if you "scale" the dividend the results will not be correct in the sense of being the same as an integer division. Take this example:
printf("%x %x\n", 0x10/0x9, 0x30/0x9);
// prints: 1 5
Assuming a multiplicative inverse existed, both terms are divided by the same divisor (9) so must have the same lookup table value (multiplicative inverse). Any fixed lookup value corresponding to the divisor (9) multiplied by an integer will be precisely 3 times greater in the second term relative to the first term. As you can see from the example, the result of an actual integer division is a 5, not a 3.
You can approximate things by using a scaled lookup table. For instance a lookup table that is the multiplicative inverse when the result is divided by 2^16. You would then multiply by the lookup table value and shift the result 16 bits to the right. Time consuming and requires a 1024 byte lookup table. Even so, this would not produce the same results as an integer divide. A compiler optimization is not going to produce "approximate" results of an integer division.

Alternative to using % operator and / Operator in C++

It is told that modulo operator "%" and divide operator "/" are very inefficient in embedded C++.
How can I alternatively achieve the following expression:
a = b % c;
I understand that this can be achieved using the following logic:
a = b - c;
while (a >= c) {
a = a - c;
}
But my question is, is this code involving while loops efficient enough, compared to % operator?
Thanks,
Kirti
Division and modulus are indeed costly hardware operations, whatever you do (this is more related to hardware architecture than to languages or compilers), perhaps ten times slower than addition.
However, on current laptops or servers, and on high-end microcontrollers, cache misses are often much slower than divisions!
The GCC compiler is often able to optimize them, when the divisor is a constant.
Your naive loop is usually much more slower than using the hardware division instruction (or the library routine doing it, if not provided by hardware). I believe you are wrong in avoiding the division & replacing it with your loop.
You might tune your algorithms -e.g. by having power of twos- but I don't recommend using your code. Remember that premature optimization is evil so first try to get your program correct, then profile it to find the trouble spots.
Nothing is going to be considerably more efficient than the % operator. If there was a better way to do it, then any reasonable compiler would automatically convert it. When you're told that % and / are inefficient, that's just because those are difficult operations - if you need to perform a modulo, then do that.
There may be special cases when there are better ways - for example, mod a power of two can be written as a binary or - but those are probably optimized by your compiler.
That code will almost certainly be slower than however your processor/compiler decides to perform the divide/mod. Generally, shortcuts are pretty hard to come by for basic arithmetic operators, since the mcu/cpu designers and compiler programmers are pretty good at optimizing this for almost all applications.
One common shortcut in embedded devices (where every cycle/byte can make a difference) is to keep everything in terms of base-2 to use the bit shift operators to perform multiplication and division, and the bitwise and (&) to perform modulo.
Examples:
unsigned int x = 100;
unsigned int y1 = x << 4; // same as x * 2^4 = x*16
unsigned int y2 = x >> 6; // same as x / 2^6 = x/64
unsigned int y3 = x & 0x07; // same as x % 8
If the divisor is known at compile time, the operation can be transformed into a multiplication by a reciprocal, with some shifts, adds, and other fast operations. This will be faster on any modern processor, even if it implements division in hardware. Embedded targets usually have highly optimized routines for divide / modulo, since these operations are required by the standard.
If you have profiled your code carefully and found that a modulo operator is the major cost in an inner loop then there is an optimisation that might help. You might be already familiar with the trick for determining the sign of an integer using arithmetic left shifts (for 32 bit values):
sign = ( x >> 31 ) | 1;
This extends the sign bit across the word, so negative values yield -1 and positive values 0. Then bit 0 is set so that positive values result in 1.
If we're only incrementing values by a quantity that is less than the modulo then this same trick can be used to wrap the result:
val += inc;
val -= modulo & ( static_cast< int32_t >( ( ( modulo - 1 ) - val ) ) >> 31 );
Alternatively, if you are decrementing by values less than the modulo then the relevant code is:
int32_t signedVal = static_cast< int32_t >( val - dec );
val = signedVal + ( modulo & ( signedVal >> 31 ) );
I've added the static_cast operators because I was passing in uint32_t, but you might not find them necessary.
Does this help much as opposed to a simple % operator? That depends on your compiler and CPU architecture. I found a simple loop ran 60% faster on my i3 processor when compiled under VS2012, however on the ARM11 chip in the Raspberry Pi and compiling with GCC I only got a 20% improvement.
Division by a constant can be achieved by a shift if a power of 2 or a mul add shift combination for others.
http:// masm32.com/board/index.php?topic=9937.0 has x86 assembly version as well as C source in download from first post. that generates this code for you.

Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplication or division by 2 (or powers of 2) of my values. I would really like these values to be int so that I could access the bitshift operators
int a = 1;
int b = a<<24
However, I cannot, and I have to stick with doubles.
My question is : as there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2?
I can even assume that the number of bits is going to be fixed (the software will work on machines that will always have 64 bits long doubles)
P.S : And yes, the algorithm mostly does these operations only. This is the bottleneck (it's already multithreaded).
Edit : Or am I completely mistaken and clever compilers already optimize things for me?
Temporary results (with Qt to measure time, overkill, but I don't care):
#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>
#include <iostream>
#include <math.h>
using namespace std;
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
while(true)
{
QElapsedTimer timer;
timer.start();
int n=100000000;
volatile double d=12.4;
volatile double D;
for(unsigned int i=0; i<n; ++i)
{
//D = d*32; // 200 ms
//D = d*(1<<5); // 200 ms
D = ldexp (d,5); // 6000 ms
}
qDebug() << "The operation took" << timer.elapsed() << "milliseconds";
}
return a.exec();
}
Runs suggest that D = d*(1<<5); and D = d*32; run in the same time (200 ms) whereas D = ldexp (d,5); is much slower (6000 ms). I know that this is a micro benchmark, and that suddenly, my RAM has exploded because Chrome has suddenly asked to compute Pi in my back every single time I run ldexp(), so this benchmark is worth nothing. But I'll keep it nevertheless.
On the other had, I'm having trouble doing reinterpret_cast<uint64_t *> because there's a const violation (seems the volatile keyword interferes)
This is one of those highly-application specific things. It may help in some cases and not in others. (In the vast majority of cases, a straight-forward multiplication is still best.)
The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (this will work as long as you don't hit NAN or INF)
So something like this:
union{
uint64 i;
double f;
};
f = 123.;
i += 0x0010000000000000ull;
// Check for zero. And if it matters, denormals as well.
Note that this code is not C-compliant in any way, and is shown just to illustrate the idea. Any attempt to implement this should be done directly in assembly or SSE intrinsics.
However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing a multiplication outright. This is especially the case for pre-SSE era where the value needs to be stored from the x87 FPU into memory and then read back into the integer registers.
In the SSE era, the Integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According the Agner Fog, there's a 1 to 2 cycle penalty for moving data between the Integer SSE and FP SSE execution units. So the cost is much better than the x87 era, but it's still there.
All-in-all, it will depend on what else you have on your pipeline. But in most cases, multiplying will still be faster. I've run into this exact same problem before so I'm speaking from first-hand experience.
Now with 256-bit AVX instructions that only support FP instructions, there's even less of an incentive to play tricks like this.
How about ldexp?
Any half-decent compiler will generate optimal code on your platform.
But as #Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.
Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).
And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.
[update]
OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...
...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.
So, multiplying by 1 << p is probably your best bet, as you have discovered.
You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarley (esp. when you get into subnormals). In the common cases, however, this should work:
const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;
void unsafe_shl(double* d, int shift) {
unsigned long long* i = (unsigned long long*)d;
if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
*i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
} else if (*i) {
*d *= (1 << shift);
}
}
EDIT: After doing some timing, this method is oddly slower than the double method on my compiler and machine, even stripped to the minimum executed code:
double ds[0x1000];
for (int i = 0; i != 0x1000; i++)
ds[i] = 1.2;
clock_t t = clock();
for (int j = 0; j != 1000000; j++)
for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
ds[i] *= 1 << 4;
#else
((unsigned int*)&ds[i])[1] += 4 << 20;
#endif
clock_t e = clock();
printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
In the DOUBLE_SHIFT completes in 1.6 seconds, with an inner loop of
movupd xmm0,xmmword ptr [ecx]
lea ecx,[ecx+10h]
mulpd xmm0,xmm1
movupd xmmword ptr [ecx-10h],xmm0
Versus 2.4 seconds otherwise, with an inner loop of:
add dword ptr [ecx],400000h
lea ecx, [ecx+8]
Truly unexpected!
EDIT 2: Mystery solved! One of the changes for VC11 is now it always vectorizes floating point loops, effectively forcing /arch:SSE2, though VC10, even with /arch:SSE2 is still worse with 3.0 seconds with an inner loop of:
movsd xmm1,mmword ptr [esp+eax*8+38h]
mulsd xmm1,xmm0
movsd mmword ptr [esp+eax*8+38h],xmm1
inc eax
VC10 without /arch:SSE2 (even with /arch:SSE) is 5.3 seconds... with 1/100th of the iterations!!, inner loop:
fld qword ptr [esp+eax*8+38h]
inc eax
fmul st,st(1)
fstp qword ptr [esp+eax*8+30h]
I knew the x87 FP stack was aweful, but 500 times worse is kinda ridiculous. You probably won't see these kinds of speedups converting, i.e. matrix ops to SSE or int hacks, since this is the worst case loading into the FP stack, doing one op, and storing from it, but it's a good example for why x87 is not the way to go for anything perf. related.
The fastest way to do this is probably:
x *= (1 << p);
This sort of thing may simply be done by calling an machine instruction to add p to the exponent. Telling the compiler to instead extract the some bits with a mask and doing something manually to it will probably make things slower, not faster.
Remember, C/C++ is not assembly language. Using a bitshift operator does not necessarily compile to a bitshift assembly operation, not does using multiplication necessarily compile to multiplication. There's all sorts of weird and wonderful things going on like what registers are being used and what instructions can be run simultaneously which I'm not smart enough to understand. But your compiler, with many man years of knowledge and experience and lots of computational power, is much better at making these judgements.
p.s. Keep in mind, if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiple 2, or even 4 doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
Since c++17 you can also use hexadecimal floating literals. That way you can multiply by higher powers of 2. For instance:
d *= 0x1p64;
will multiply d by 2^64. I use it to implement my fast integer arithmetic in a conversion to double.
What other operations does this algorithm require? You might be able to break your floats into int pairs (sign/mantissa and magnitude), do your processing, and reconstitute them at the end.
Multiplying by 2 can be replaced by an addition: x *= 2 is equivalent to x += x.
Division by 2 can be replaced by multiplication by 0.5. Multiplication is usually significantly faster than division.
Although there is little/no practical benefit to treating powers of two specially for float of double types there is a case for this for double-double types. Double-double multiplication and division is complicated in general but is trivial for multiplying and dividing by a power of two.
E.g. for
typedef struct {double hi; double lo;} doubledouble;
doubledouble x;
x.hi*=2, x.lo*=2; //multiply x by 2
x.hi/=2, x.lo/=2; //divide x by 2
In fact I have overloaded << and >> for doubledouble so that it's analogous to integers.
//x is a doubledouble type
x << 2 // multiply x by four;
x >> 3 // divide x by eight.
Depending on what you're multiplying, if you have data that is recurring enough, a look up table might provide better performance, at the expense of memory.

Hash function for floats

I'm currently implementing a hash table in C++ and I'm trying to make a hash function for floats...
I was going to treat floats as integers by padding the decimal numbers, but then I realized that I would probably reach the overflow with big numbers...
Is there a good way to hash floats?
You don't have to give me the function directly, but I'd like to see/understand different concepts...
Notes:
I don't need it to be really fast, just evenly distributed if possible.
I've read that floats should not be hashed because of the speed of computation, can someone confirm/explain this and give me other reasons why floats should not be hashed? I don't really understand why (besides the speed)
It depends on the application but most of time floats should not be hashed because hashing is used for fast lookup for exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.
EDIT:
Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.
Here is a small example that demonstrates the problem:
float x = 1.0f;
x = x / 41;
x = x * 41;
if (x != 1.0f)
{
std::cout << "ooops...\n";
}
Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.
There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.
If your hash function did the following you'd get some degree of fuzziness on the hash lookup
unsigned int Hash( float f )
{
unsigned int ui;
memcpy( &ui, &f, sizeof( float ) );
return ui & 0xfffff000;
}
This way you'll mask off the 12 least significant bits allowing for a degree of uncertainty ... It really depends on yout application however.
You can use the std hash, it's not bad:
std::size_t myHash = std::cout << std::hash<float>{}(myFloat);
unsigned hash(float x)
{
union
{
float f;
unsigned u;
};
f = x;
return u;
}
Technically undefined behavior, but most compilers support this. Alternative solution:
unsigned hash(float x)
{
return (unsigned&)x;
}
Both solutions depend on the endianness of your machine, so for example on x86 and SPARC, they will produce different results. If that doesn't bother you, just use one of these solutions.
You can of course represent a float as an int type of the same size to hash it, however this naive approach has some pitfalls you need to be careful of...
Simply converting to a binary representation is error prone since values which are equal wont necessarily have the same binary representation.
An obvious case: -0.0 wont match 0.0 for example. *
Further, simply converting to an int of the same size wont give very even distribution, which is often important (implementing a hash/set that uses buckets for example).
Suggested steps for implementation:
filter out non-finite cases (nan, inf) and (0.0, -0.0 whether you need to do this explicitly or not depends on the method used).
convert to an int of the same size(that is - use a union for example to represent the float as an int, not simply cast to an int).
re-distribute the bits, (intentionally vague here!), this is basically a speed vs quality tradeoff. But if you have many values in a small range you probably don't want them to in a similar range too.
*: You may wan't to check for (nan and -nan) too. How to handle those exactly depends on your use case (you may want to ignore sign for all nan's as CPython does).
Python's _Py_HashDouble is a good reference for how you might hash a float, in production code (ignore the -1 check at the end, since that's a special value for Python).
If you're interested, I just made a hash function that uses floating point and can hash floats. It also passes SMHasher ( which is the main bias-test for non-crypto hash functions ). It's a lot slower than normal non-cryptographic hash functions due to the float calculations.
I'm not sure if tifuhash will become useful for all applications, but it's interesting to see a simple floating point function pass both PractRand and SMHasher.
The main state update function is very simple, and looks like:
function q( state, val, numerator, denominator ) {
// Continued Fraction mixed with Egyptian fraction "Continued Egyptian Fraction"
// with denominator = val + pos / state[1]
state[0] += numerator / denominator;
state[0] = 1.0 / state[0];
// Standard Continued Fraction with a_i = val, b_i = (a_i-1) + i + 1
state[1] += val;
state[1] = numerator / state[1];
}
Anyway, you can get it on npm
Or you can check out the github
Using is simple:
const tifu = require('tifuhash');
const message = 'The medium is the message.';
const number = 333333333;
const float = Math.PI;
console.log( tifu.hash( message ),
tifu.hash( number ),
tifu.hash( float ),
tifu.hash( ) );
There's a demo of some hashes on runkit here https://runkit.com/593a239c56ebfd0012d15fc9/593e4d7014d66100120ecdb9
Side note: I think that in future using floating point,possibly big arrays of floating point calculations, could be a useful way to make more computationally-demanding hash functions in future. A weird side effect I discovered of using floating point is that the hashes are target dependent, and I surmise maybe they could be use to fingerprint the platforms they were calculated on.
Because of the IEEE byte ordering the Java Float.hashCode() and Double.hashCode() do not give good results. This problem is wellknown and can be adressed by this scrambler:
class HashScrambler {
/**
* https://sites.google.com/site/murmurhash/
*/
static int murmur(int x) {
x ^= x >> 13;
x *= 0x5bd1e995;
return x ^ (x >> 15);
}
}
You then get a good hash function, which also allows you to use Float and Double in hash tables. But you need to write your own hash table that allows a custom hash function.
Since in a hash table you need also test for equality, you need an exact equality to make it work. Maybe the later is what President James K. Polk intends to adress?