Is there a fast fabsf replacement for "float" in C++?

I'm just doing some benchmarking and found out that fabsf() is often something like 10x slower than fabs(). So I disassembled it, and it turns out the double version uses the fabs instruction while the float version does not. Can this be improved? The following is faster, but not by much, and I'm afraid it may not work; it's a little too low-level:
typedef unsigned int MUINT32;  // assumed: a 32-bit unsigned integer typedef

float mabs(float i)
{
    // Clear the IEEE-754 sign bit in place (note: this violates strict aliasing).
    (*reinterpret_cast<MUINT32*>(&i)) &= 0x7fffffff;
    return i;
}
Edit: Sorry, I forgot about the compiler - I still use the good old VS2005, no special libs.

You can easily test the different possibilities using the code below. It essentially pits your bit-fiddling against a naive template abs and against std::abs. Somewhat surprisingly, the naive template abs wins; I'd have expected std::abs to be equally fast. Note that -O3 actually makes things slower (at least on Coliru).
Coliru's host system shows these timings:
random number generation: 4240 ms
naive template abs: 190 ms
ugly bitfiddling abs: 241 ms
std::abs: 204 ms
::fabsf: 202 ms
And these timings for a VirtualBox VM running Arch Linux with GCC 4.9 on a Core i7:
random number generation: 1453 ms
naive template abs: 73 ms
ugly bitfiddling abs: 97 ms
std::abs: 57 ms
::fabsf: 80 ms
And these timings on MSVS2013 (Windows 7 x64):
random number generation: 671 ms
naive template abs: 59 ms
ugly bitfiddling abs: 129 ms
std::abs: 109 ms
::fabsf: 109 ms
If I haven't made some blatantly obvious mistake in this benchmark code (don't shoot me over it, I wrote this up in about 2 minutes), I'd say just use std::abs, or the template version if that turns out to be slightly faster for you.
The code:
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <random>
#include <vector>
#include <math.h>   // for ::fabsf

using Clock = std::chrono::high_resolution_clock;
using milliseconds = std::chrono::milliseconds;

template<typename T>
T abs_template(T t)
{
    return t > 0 ? t : -t;
}

float abs_ugly(float f)
{
    // Clear the sign bit in place (violates strict aliasing, like the OP's version).
    (*reinterpret_cast<std::uint32_t*>(&f)) &= 0x7fffffff;
    return f;
}

int main()
{
    std::random_device rd;
    std::mt19937 mersenne(rd());
    // The distribution's range (b - a) must not overflow float,
    // so use half of the full lowest()..max() range.
    std::uniform_real_distribution<float> dist(
        std::numeric_limits<float>::lowest() / 2.0f,
        std::numeric_limits<float>::max() / 2.0f);
    std::vector<float> v(100000000);

    Clock::time_point t0 = Clock::now();
    std::generate(std::begin(v), std::end(v), [&dist, &mersenne]() { return dist(mersenne); });
    Clock::time_point trand = Clock::now();

    volatile float temp;  // volatile sink keeps the loops from being optimized away
    for (float f : v)
        temp = abs_template(f);
    Clock::time_point ttemplate = Clock::now();

    for (float f : v)
        temp = abs_ugly(f);
    Clock::time_point tugly = Clock::now();

    for (float f : v)
        temp = std::abs(f);
    Clock::time_point tstd = Clock::now();

    for (float f : v)
        temp = ::fabsf(f);
    Clock::time_point tfabsf = Clock::now();

    milliseconds random_time = std::chrono::duration_cast<milliseconds>(trand - t0);
    milliseconds template_time = std::chrono::duration_cast<milliseconds>(ttemplate - trand);
    milliseconds ugly_time = std::chrono::duration_cast<milliseconds>(tugly - ttemplate);
    milliseconds std_time = std::chrono::duration_cast<milliseconds>(tstd - tugly);
    milliseconds c_time = std::chrono::duration_cast<milliseconds>(tfabsf - tstd);

    std::cout << "random number generation: " << random_time.count() << " ms\n"
              << "naive template abs: " << template_time.count() << " ms\n"
              << "ugly bitfiddling abs: " << ugly_time.count() << " ms\n"
              << "std::abs: " << std_time.count() << " ms\n"
              << "::fabsf: " << c_time.count() << " ms\n";
}
Oh, and to answer your actual question: if the compiler can't generate more efficient code, I doubt there is a faster way save for micro-optimized assembly, especially for elementary operations such as this.

There are many things at play here. First off, the x87 co-processor is deprecated in favor of SSE/AVX, so I'm surprised to read that your compiler still uses the fabs instruction. It's quite possible that the others who posted benchmark answers on this question use a platform that supports SSE. Your results might be wildly different.
I'm not sure why your compiler uses different logic for fabs and fabsf. It's entirely possible to load a float onto the x87 stack and use the fabs instruction on it just as easily. The problem with reproducing this yourself, without compiler support, is that you can't integrate the operation into the compiler's normal optimizing pipeline: if you say "load this float, use the fabs instruction, return this float to memory", then the compiler will do exactly that... and it may involve storing to memory a float that was already ready to be processed, loading it back, using the fabs instruction, storing it back to memory, and loading it again onto the x87 stack to resume the normal, optimizable pipeline. That would be four wasted load/store operations when all it needed to do was fabs.
In short, you are unlikely to beat integrated compiler support for floating-point operations. If you don't have this support, inline assembler might just make things even slower than they presumably already are. The fastest thing for you to do might even be to use the fabs function instead of the fabsf function on your floats.
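To make the wasted round trip concrete, here is a sketch of what the hand-rolled version looks like as 32-bit MSVC inline assembly (illustrative only; fabs_x87 is a name made up here, and the fld/fstp pair around fabs is exactly the extra memory traffic described above):
float fabs_x87(float f)
{
    __asm {
        fld  f    ; load the float onto the x87 stack
        fabs      ; clear the sign bit
        fstp f    ; store it back to memory
    }
    return f;
}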
For reference, modern compilers and modern platforms use the SSE instructions andps (for floats) and andpd (for doubles) to AND out the bit sign, very much like you're doing yourself, but dodging all the language semantics issues. They're both as fast. Modern compilers may also detect patterns like x < 0 ? -x : x and produce the optimal andps/andpd instruction without the need for a compiler intrinsic.
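If your compiler and target support SSE2, a hedged sketch of doing the same mask directly with intrinsics (this is essentially the andps the compiler would emit; abs_sse is a name made up for illustration):
#include <emmintrin.h>
inline float abs_sse(float x)
{
    // Build the 0x7fffffff mask and AND away the sign bit of the low lane.
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    return _mm_cvtss_f32(_mm_and_ps(_mm_set_ss(x), mask));
}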

Did you try the std::abs overload for float? That would be the canonical C++ way.
Also as an aside, I should note that your bit-modifying version does violate the strict-aliasing rules (in addition to the more fundamental assumption that int and float have the same size) and as such would be undefined behavior.
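For completeness, a minimal aliasing-safe variant of the bit trick (assuming float is 32-bit IEEE 754; std::memcpy is the sanctioned way to type-pun in C++, and compilers typically optimize the copies away):
#include <cstdint>
#include <cstring>
inline float abs_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    u &= 0x7fffffffu;              // clear the sign bit
    std::memcpy(&f, &u, sizeof f);
    return f;
}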

Related

Extreme slow-down when starting at second permutation

Consider the following code:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>
int main() {
    std::vector<int> v(12);
    std::iota(v.begin(), v.end(), 0);
    //std::next_permutation(v.begin(), v.end());
    using clock = std::chrono::high_resolution_clock;
    clock c;
    auto start = c.now();
    unsigned long counter = 0;
    do {
        ++counter;
    } while (std::next_permutation(v.begin(), v.end()));
    auto end = c.now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << counter << " permutations took " << duration.count() / 1000.0f << " s";
}
Compiled with GCC (MinGW) 5.3 at -O2 on my 4.1 GHz AMD CPU, this takes 2.3 s. However, if I uncomment the commented line, it slows down to 3.4 s. I would expect a minimal speed-up, because we measure the time for one permutation less. With -O3 the difference is less extreme: 2.0 s versus 2.4 s.
Can anyone explain this? Could a super-smart compiler detect that I want to traverse all permutations and optimize this code?
I think the compiler gets confused by the function being called from two separate lines in your code, causing it not to be inlined.
GCC 8.0.0 behaves the same way.
What are the benefits of inline functions in C++? Inlining provides a simple mechanism for the compiler to apply more optimizations, so losing the inlining opportunity may cause a severe drop in performance, in some cases.

Correct way of portably timing code using C++11

I'm in the midst of writing some timing code for a part of a program that has a low latency requirement.
Looking at what's available in the std::chrono library, I'm finding it a bit difficult to write timing code that is portable.
std::chrono::high_resolution_clock
std::chrono::steady_clock
std::chrono::system_clock
The system_clock is useless as it's not steady; the remaining two clocks are problematic:
The high_resolution_clock isn't necessarily steady on all platforms.
The steady_clock does not necessarily support fine-grained resolution (e.g. nanoseconds).
For my purposes having a steady clock is the most important requirement and I can sort of get by with microsecond granularity.
My question is if one wanted to time code that could be running on different h/w architectures and OSes - what would be the best option?
Use steady_clock. On all implementations its precision is nanoseconds. You can check this yourself for your platform by printing out steady_clock::period::num and steady_clock::period::den.
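For example, a quick check of the tick period:
#include <chrono>
#include <iostream>
int main()
{
    using period = std::chrono::steady_clock::period;
    // Prints the tick period as a fraction of a second,
    // e.g. 1/1000000000 for nanosecond ticks.
    std::cout << period::num << '/' << period::den << '\n';
}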
Now that doesn't mean that it will actually measure nanosecond precision. But platforms do their best. For me, two consecutive calls to steady_clock (with optimizations enabled) will report times on the order of 100ns apart.
#include "chrono_io.h"
#include <chrono>
#include <iostream>
int
main()
{
using namespace std::chrono;
using namespace date;
auto t0 = steady_clock::now();
auto t1 = steady_clock::now();
auto t2 = steady_clock::now();
auto t3 = steady_clock::now();
std::cout << t1-t0 << '\n';
std::cout << t2-t1 << '\n';
std::cout << t3-t2 << '\n';
}
The above example uses "chrono_io.h" from a free, open-source, header-only library, purely for convenience of formatting the duration. You can format things yourself (I'm lazy). For me this just output:
287ns
116ns
75ns
YMMV.
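If you'd rather not pull in the helper header, a minimal self-contained equivalent that prints the raw nanosecond counts:
#include <chrono>
#include <iostream>
int main()
{
    using namespace std::chrono;
    auto t0 = steady_clock::now();
    auto t1 = steady_clock::now();
    std::cout << duration_cast<nanoseconds>(t1 - t0).count() << "ns\n";
}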

pow(NAN) is very slow

What is the reason for the catastrophic performance of pow() for NaN values? As far as I can work out, NaNs should not have an impact on performance if the floating-point math is done with SSE instead of the x87 FPU.
This seems to be true for elementary operations, but not for pow(). I compared multiplication and division of a double to squaring and then taking the square root. If I compile the piece of code below with g++ -lrt, I get the following result:
multTime(3.14159): 20.1328ms
multTime(nan): 244.173ms
powTime(3.14159): 92.0235ms
powTime(nan): 1322.33ms
As expected, calculations involving NaN take considerably longer. Compiling with g++ -lrt -msse2 -mfpmath=sse however results in the following times:
multTime(3.14159): 22.0213ms
multTime(nan): 13.066ms
powTime(3.14159): 97.7823ms
powTime(nan): 1211.27ms
The multiplication / division of NaN is now much faster (actually faster than with a real number), but the squaring and taking the square root still takes a very long time.
Test code (compiled with GCC 4.1.2 on 32-bit openSUSE 10.2 in VMware; the CPU is a Core i7-2620M):
#include <iostream>
#include <sys/time.h>
#include <cmath>

// Link with -lrt for clock_gettime on older glibc.

void multTime( double d )
{
    struct timespec startTime, endTime;
    double durationNanoseconds;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startTime);
    for (int i = 0; i < 1000000; i++)
    {
        d = 2*d;
        d = 0.5*d;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endTime);
    durationNanoseconds = 1e9*(endTime.tv_sec - startTime.tv_sec) + (endTime.tv_nsec - startTime.tv_nsec);
    std::cout << "multTime(" << d << "): " << durationNanoseconds/1e6 << "ms" << std::endl;
}

void powTime( double d )
{
    struct timespec startTime, endTime;
    double durationNanoseconds;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startTime);
    for (int i = 0; i < 1000000; i++)
    {
        d = pow(d,2);
        d = pow(d,0.5);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endTime);
    durationNanoseconds = 1e9*(endTime.tv_sec - startTime.tv_sec) + (endTime.tv_nsec - startTime.tv_nsec);
    std::cout << "powTime(" << d << "): " << durationNanoseconds/1e6 << "ms" << std::endl;
}

int main()
{
    multTime(3.14159);
    multTime(NAN);
    powTime(3.14159);
    powTime(NAN);
}
Edit:
Unfortunately, my knowledge on this topic is extremely limited, but I guess that the glibc pow() never uses SSE on a 32bit system, but rather some assembly in sysdeps/i386/fpu/e_pow.S. There is a function __ieee754_pow_sse2 in more recent glibc versions, but it's in sysdeps/x86_64/fpu/multiarch/e_pow.c and therefore probably only works on x64. However, all of this might be irrelevant here, because pow() is also a gcc built-in function. For an easy fix, see Z boson's answer.
"NaNs should not have an impact on performance if the floating-point math is done with SSE instead of the x87 FPU."
I'm not sure this follows from the resource you quote. In any case, pow is a C library function; it is not implemented as a single instruction, even on x87. So there are two separate issues here: how SSE handles NaN values, and how a pow implementation handles NaN values.
If the pow function implementation uses a different path for special values like +/-Inf, or NaN, you might expect a NaN value for the base, or exponent, to return a value quickly. On the other hand, the implementation might not handle this as a separate case, and simply relies on floating-point operations to propagate intermediate results as NaN values.
Starting with Sandy Bridge, many of the performance penalties associated with denormals were reduced or eliminated, though not all: the author describes a remaining penalty for mulps. Therefore it would be reasonable to expect that not all arithmetic operations involving NaNs are 'fast'. Some architectures might even fall back to microcode to handle NaNs in different contexts.
Your math library is too old. Either find another math library which handles pow with NAN better, or implement a fix like this:
inline double pow_fix(double x, double y)
{
    // NaN is the only value that compares unequal to itself.
    if (x != x) return x;
    if (y != y) return y;
    return pow(x, y);
}
Compile with g++ -O3 -msse2 -mfpmath=sse foo.cpp.
If you want to square a value or take a square root, use d*d or sqrt(d). pow(d,2) and pow(d,0.5) will be slower and possibly less accurate, unless your compiler optimizes them based on the constant second arguments 2 and 0.5; note that such an optimization may not always be possible for pow(d,0.5), since it returns +0.0 if d is negative zero, while sqrt(d) returns -0.0.
For those doing timings, please make sure that you test the same thing.
With a complex function like pow() there are lots of ways that NaN could trigger slowness. It could be that the operations on NaNs are slow, or it could be that the pow() implementation checks for all sorts of special values that it can handle efficiently, and the NaN values fail all of those tests, leading to a more expensive path being taken. You'd have to step through the code to find out for sure.
A more recent implementation of pow() might include additional checks to handle NaN more efficiently, but this is always a tradeoff -- it would be a shame to have pow() handle 'normal' cases more slowly in order to accelerate NaN handling.
My blog post only applied to individual instructions, not complex functions like pow().

Programme execution time counter

What is the most accurate way to calculate elapsed time in C++? I used clock() to calculate this, but I have a feeling it's wrong, as I get 0 ms 90% of the time and 15 ms the rest, which makes little sense to me.
Even if the duration is really small and very close to 0 ms, is there a more accurate method that will give me the exact value rather than a rounded-down 0 ms?
clock_t tic = clock();
/*
main programme body
*/
clock_t toc = clock();
double time = (double)(toc-tic);
cout << "\nTime taken: " << (1000*(time/CLOCKS_PER_SEC)) << " (ms)";
Thanks
With C++11, I'd use
#include <chrono>
auto t0 = std::chrono::high_resolution_clock::now();
...
auto t1 = std::chrono::high_resolution_clock::now();
auto dt = 1.e-9*std::chrono::duration_cast<std::chrono::nanoseconds>(t1-t0).count();
for the elapsed time in seconds.
For pre-2011 C++, you can use QueryPerformanceCounter() on Windows or gettimeofday() on Linux/OS X. For example (this is actually C, not C++):
timeval oldCount,newCount;
gettimeofday(&oldCount, NULL);
...
gettimeofday(&newCount, NULL);
double t = double(newCount.tv_sec -oldCount.tv_sec )
+ double(newCount.tv_usec-oldCount.tv_usec) * 1.e-6;
for the elapsed time in seconds.
std::chrono::high_resolution_clock is as portable a solution as you can get; however, it may not actually have higher resolution than what you already saw.
Pretty much any function which returns system time is going to jump forward whenever the system time is updated by the timer interrupt handler, and 10ms is a typical interval for that on modern OSes.
For better timing precision, you need to access either a CPU cycle counter or a high-precision event timer (HPET). Standard library vendors ought to use these for high_resolution_clock, but not all do, so you may need OS-specific APIs.
(Note: specifically, Visual C++'s high_resolution_clock uses the low-resolution system clock, and there are likely other implementations that do the same.)
On Win32, for example, the QueryPerformanceFrequency() and QueryPerformanceCounter() functions are a good choice. For a wrapper that conforms to the C++11 clock interface and uses these functions, see Mateusz's answer to "Difference between std::system_clock and std::steady_clock?".
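For reference, a minimal sketch of the QueryPerformanceCounter route (Win32 only; error handling omitted, and elapsed_ms is a name made up for illustration):
#include <windows.h>
double elapsed_ms()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // counts per second, fixed at boot
    QueryPerformanceCounter(&t0);
    /* ... code under test ... */
    QueryPerformanceCounter(&t1);
    return double(t1.QuadPart - t0.QuadPart) * 1000.0 / double(freq.QuadPart);
}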
If you have C++11 available, use the chrono library.
Also, different platforms provide access to high-precision clocks: on Linux, use clock_gettime; on Windows, use the high-performance counter API.
Example:
C++11:
// assuming: #include <chrono>, #include <iostream>, using namespace std::chrono;
auto start = high_resolution_clock::now();
... // do stuff
auto diff = duration_cast<milliseconds>(high_resolution_clock::now() - start);
std::clog << diff.count() << "ms elapsed" << std::endl;
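And a hedged sketch of the Linux clock_gettime route mentioned above (CLOCK_MONOTONIC is the steady choice; on older glibc, link with -lrt; elapsed_ms is again a made-up name):
#include <time.h>
double elapsed_ms()
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... code under test ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}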

Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplication or division by 2 (or powers of 2) of my values. I would really like these values to be int so that I could access the bitshift operators
int a = 1;
int b = a << 24;
However, I cannot, and I have to stick with doubles.
My question is : as there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2?
I can even assume that the number of bits is fixed (the software will run on machines whose doubles are always 64 bits long).
P.S : And yes, the algorithm mostly does these operations only. This is the bottleneck (it's already multithreaded).
Edit : Or am I completely mistaken and clever compilers already optimize things for me?
Temporary results (with Qt to measure time, overkill, but I don't care):
#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>
#include <iostream>
#include <math.h>
using namespace std;
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    while (true)
    {
        QElapsedTimer timer;
        timer.start();
        int n = 100000000;
        volatile double d = 12.4;
        volatile double D;
        for (unsigned int i = 0; i < n; ++i)
        {
            //D = d*32;       // 200 ms
            //D = d*(1<<5);   // 200 ms
            D = ldexp(d, 5);  // 6000 ms
        }
        qDebug() << "The operation took" << timer.elapsed() << "milliseconds";
    }
    return a.exec();
}
Runs suggest that D = d*(1<<5); and D = d*32; run in the same time (200 ms), whereas D = ldexp(d,5); is much slower (6000 ms). I know this is a micro-benchmark, and that maybe my RAM exploded because Chrome suddenly decided to compute Pi behind my back every single time I ran ldexp(), so this benchmark is worth nothing. But I'll keep it nevertheless.
On the other hand, I'm having trouble doing reinterpret_cast<uint64_t *> because there's a const violation (it seems the volatile keyword interferes).
This is one of those highly-application specific things. It may help in some cases and not in others. (In the vast majority of cases, a straight-forward multiplication is still best.)
The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (this will work as long as you don't hit NAN or INF)
So something like this:
#include <stdint.h>

union {
    uint64_t i;
    double   f;
};

f = 123.;
i += 0x0010000000000000ull;  // adds 1 to the 11-bit exponent field, i.e. f *= 2
// Check for zero. And if it matters, denormals as well.
Note that this code is not C-compliant in any way, and is shown just to illustrate the idea. Any attempt to implement this should be done directly in assembly or SSE intrinsics.
However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing the multiplication outright. This is especially true of the pre-SSE era, where the value needed to be stored from the x87 FPU to memory and then read back into the integer registers.
In the SSE era, integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According to Agner Fog, there's a 1 to 2 cycle penalty for moving data between the integer SSE and FP SSE execution units. So the cost is much better than in the x87 era, but it's still there.
All-in-all, it will depend on what else you have on your pipeline. But in most cases, multiplying will still be faster. I've run into this exact same problem before so I'm speaking from first-hand experience.
Now, with 256-bit AVX, whose first version supports only FP instructions, there's even less incentive to play tricks like this.
How about ldexp?
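For reference, it scales a value by a power of two by adjusting the exponent:
#include <cmath>
double x = 3.14159;
double y = std::ldexp(x, 5);   // y == x * 2^5 == x * 32
double z = std::ldexp(x, -3);  // z == x / 8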
Any half-decent compiler will generate optimal code on your platform.
But as #Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.
Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).
And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.
[update]
OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...
...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.
So, multiplying by 1 << p is probably your best bet, as you have discovered.
You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (especially when you get into subnormals). In the common cases, however, this should work:
const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;

void unsafe_shl(double* d, int shift) {
    unsigned long long* i = (unsigned long long*)d;
    // Fast path: exponent is neither all-zeros (zero/denormal) nor all-ones (Inf/NaN).
    if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
        *i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
    } else if (*i) {
        // Slow path for denormals: fall back to an actual multiply.
        *d *= (1 << shift);
    }
}
EDIT: After doing some timing, this method is oddly slower than the plain double multiplication on my compiler and machine, even stripped down to the minimum executed code:
double ds[0x1000];
for (int i = 0; i != 0x1000; i++)
    ds[i] = 1.2;

clock_t t = clock();

for (int j = 0; j != 1000000; j++)
    for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
        ds[i] *= 1 << 4;
#else
        ((unsigned int*)&ds[i])[1] += 4 << 20;
#endif

clock_t e = clock();

printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
The DOUBLE_SHIFT case completes in 1.6 seconds, with an inner loop of:
movupd xmm0,xmmword ptr [ecx]
lea ecx,[ecx+10h]
mulpd xmm0,xmm1
movupd xmmword ptr [ecx-10h],xmm0
Versus 2.4 seconds otherwise, with an inner loop of:
add dword ptr [ecx],400000h
lea ecx, [ecx+8]
Truly unexpected!
EDIT 2: Mystery solved! One of the changes in VC11 is that it now always vectorizes floating-point loops, effectively forcing /arch:SSE2. VC10, even with /arch:SSE2, is still worse, at 3.0 seconds with an inner loop of:
movsd xmm1,mmword ptr [esp+eax*8+38h]
mulsd xmm1,xmm0
movsd mmword ptr [esp+eax*8+38h],xmm1
inc eax
VC10 without /arch:SSE2 (even with /arch:SSE) takes 5.3 seconds... with 1/100th of the iterations! Inner loop:
fld qword ptr [esp+eax*8+38h]
inc eax
fmul st,st(1)
fstp qword ptr [esp+eax*8+30h]
I knew the x87 FP stack was awful, but 500 times worse is kind of ridiculous. You probably won't see these kinds of speedups when converting, e.g., matrix ops to SSE or int hacks, since this is the worst case: loading onto the FP stack, doing one op, and storing from it. But it's a good example of why x87 is not the way to go for anything performance-related.
The fastest way to do this is probably:
x *= (1 << p);
This sort of thing may simply be compiled to a machine instruction that adds p to the exponent. Telling the compiler to instead extract some bits with a mask and fiddle with them manually will probably make things slower, not faster.
Remember, C/C++ is not assembly language. Using a bitshift operator does not necessarily compile to a bitshift instruction, nor does using multiplication necessarily compile to a multiply. There are all sorts of weird and wonderful things going on, like which registers are used and which instructions can run simultaneously, that I'm not smart enough to understand. But your compiler, with many man-years of knowledge and experience and lots of computational power, is much better at making these judgements.
P.S. Keep in mind that if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiply 2, or even 4, doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
Since C++17 you can also use hexadecimal floating-point literals. That way you can multiply by higher powers of 2. For instance:
d *= 0x1p64;
will multiply d by 2^64. I use it to implement my fast integer arithmetic in a conversion to double.
What other operations does this algorithm require? You might be able to break your floats into int pairs (sign/mantissa and magnitude), do your processing, and reconstitute them at the end.
Multiplying by 2 can be replaced by an addition: x *= 2 is equivalent to x += x.
Division by 2 can be replaced by multiplication by 0.5. Multiplication is usually significantly faster than division.
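Both rewrites are exact in IEEE 754 arithmetic, since 2 and 0.5 are exactly representable powers of two; a trivial illustration (function names made up here):
// x + x and x * 0.5 round identically to x * 2 and x / 2.
inline double times2(double x) { return x + x; }
inline double half(double x)   { return x * 0.5; }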
Although there is little or no practical benefit to treating powers of two specially for float or double types, there is a case for it with double-double types. Double-double multiplication and division is complicated in general, but is trivial for multiplying and dividing by a power of two.
E.g. for
typedef struct { double hi; double lo; } doubledouble;
doubledouble x;
x.hi *= 2, x.lo *= 2;  // multiply x by 2
x.hi /= 2, x.lo /= 2;  // divide x by 2
In fact, I have overloaded << and >> for doubledouble so that it's analogous to integers:
//x is a doubledouble type
x << 2;  // multiply x by four
x >> 3;  // divide x by eight
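A hypothetical sketch of what those overloads might look like (reusing the doubledouble struct above; the scaling is exact because the factor is a power of two):
inline doubledouble operator<<(doubledouble x, int n)
{
    const double s = double(1ULL << n);        // 2^n, exact for n < 64
    doubledouble r = { x.hi * s, x.lo * s };
    return r;
}
inline doubledouble operator>>(doubledouble x, int n)
{
    const double s = 1.0 / double(1ULL << n);  // 2^-n, also exact
    doubledouble r = { x.hi * s, x.lo * s };
    return r;
}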
Depending on what you're multiplying, if you have data that is recurring enough, a look up table might provide better performance, at the expense of memory.