Using AVX intrinsics instead of SSE does not improve speed -- why? - c++

I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed-up my programs. This, unfortunately, was not the case until now. Probably I am doing a stupid mistake, so I would be very grateful if somebody could help me out.
I use Ubuntu 11.10 with g++ 4.6.1. I compiled my program (see below) with
g++ simpleExample.cpp -O3 -march=native -o simpleExample
The test system has a Intel i7-2600 CPU.
Here is the code which exemplifies my problem. On my system, I get the output
98.715 ms, b[42] = 0.900038 // Naive
24.457 ms, b[42] = 0.900038 // SSE
24.646 ms, b[42] = 0.900038 // AVX
Note that the computation sqrt(sqrt(sqrt(x))) was only chosen to ensure that memory bandwith does not limit execution speed; it is just an example.
simpleExample.cpp:
#include <immintrin.h>
#include <iostream>
#include <math.h>
#include <sys/time.h>
using namespace std;
// -----------------------------------------------------------------------------
// This function returns the current time, expressed as seconds since the Epoch
// -----------------------------------------------------------------------------
double getCurrentTime(){
struct timeval curr;
struct timezone tz;
gettimeofday(&curr, &tz);
double tmp = static_cast<double>(curr.tv_sec) * static_cast<double>(1000000)
+ static_cast<double>(curr.tv_usec);
return tmp*1e-6;
}
// -----------------------------------------------------------------------------
// Main routine
// -----------------------------------------------------------------------------
int main() {
srand48(0); // seed PRNG
double e,s; // timestamp variables
float *a, *b; // data pointers
float *pA,*pB; // work pointer
__m128 rA,rB; // variables for SSE
__m256 rA_AVX, rB_AVX; // variables for AVX
// define vector size
const int vector_size = 10000000;
// allocate memory
a = (float*) _mm_malloc (vector_size*sizeof(float),32);
b = (float*) _mm_malloc (vector_size*sizeof(float),32);
// initialize vectors //
for(int i=0;i<vector_size;i++) {
a[i]=fabs(drand48());
b[i]=0.0f;
}
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// Naive implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
s = getCurrentTime();
for (int i=0; i<vector_size; i++){
b[i] = sqrtf(sqrtf(sqrtf(a[i])));
}
e = getCurrentTime();
cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
// -----------------------------------------------------------------------------
for(int i=0;i<vector_size;i++) {
b[i]=0.0f;
}
// -----------------------------------------------------------------------------
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// SSE2 implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
pA = a; pB = b;
s = getCurrentTime();
for (int i=0; i<vector_size; i+=4){
rA = _mm_load_ps(pA);
rB = _mm_sqrt_ps(_mm_sqrt_ps(_mm_sqrt_ps(rA)));
_mm_store_ps(pB,rB);
pA += 4;
pB += 4;
}
e = getCurrentTime();
cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
// -----------------------------------------------------------------------------
for(int i=0;i<vector_size;i++) {
b[i]=0.0f;
}
// -----------------------------------------------------------------------------
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// AVX implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
pA = a; pB = b;
s = getCurrentTime();
for (int i=0; i<vector_size; i+=8){
rA_AVX = _mm256_load_ps(pA);
rB_AVX = _mm256_sqrt_ps(_mm256_sqrt_ps(_mm256_sqrt_ps(rA_AVX)));
_mm256_store_ps(pB,rB_AVX);
pA += 8;
pB += 8;
}
e = getCurrentTime();
cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
_mm_free(a);
_mm_free(b);
return 0;
}
Any help is appreciated!

This is because VSQRTPS (AVX instruction) takes exactly twice as many cycles as SQRTPS (SSE instruction) on a Sandy Bridge processor. See Agner Fog's optimize guide: instruction tables, page 88.
Instructions like square root and division don't benefit from AVX. On the other hand, additions, multiplications, etc., do.

If you are interested in increasing square root performance, instead of VSQRTPS you can use VRSQRTPS and Newton-Raphson formula:
x0 = vrsqrtps(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
VRSQRTPS itself doesn't benefit from AVX, but other calculations do.
Use it if 23 bits of precision is enough for you.

Just for completeness. The Newton-Raphson (NR) implementation for operations like the division or the square root will only be beneficial if you have a limited number of those operations in your code. This is because if you used these alternative methods you will generate more pressure on other ports such as the multiplication and addition ports. That's basically the reason why x86 architectures have special hardware unit to handle these operation instead of the alternative software solutions (like NR). I quote from Intel 64 and IA-32
Architectures Optimization Reference Manual p.556:
"In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution."
So be careful when using NR in large algorithms. Actually, I had my master's thesis around this point and I will leave a link to it here for future reference, once it is published .
Also for people how always wonder about the throughput and the latency of certain instructions, have a look on IACA. It is a very useful tool provided by Intel to statically analyze the in-core execution performance of codes.
edited
here is a link to the thesis for those who are interested thesis

Depending on your processor hardware, the AVX instructions may be emulated in the hardware as SSE instructions. You'd need to look up your processor's part number to get exact specs on it, but this is one of the main differences between low-end and high-end intel processors, the number of specialize execution units vs. hardware emulation.

Related

C++ Linux fastest way to measure time (faster than std::chrono) ? Benchmark included

#include <iostream>
#include <chrono>
using namespace std;
class MyTimer {
private:
std::chrono::time_point<std::chrono::steady_clock> starter;
std::chrono::time_point<std::chrono::steady_clock> ender;
public:
void startCounter() {
starter = std::chrono::steady_clock::now();
}
double getCounter() {
ender = std::chrono::steady_clock::now();
return double(std::chrono::duration_cast<std::chrono::nanoseconds>(ender - starter).count()) /
1000000; // millisecond output
}
// timer need to have nanosecond precision
int64_t getCounterNs() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - starter).count();
}
};
MyTimer timer1, timer2, timerMain;
volatile int64_t dummy = 0, res1 = 0, res2 = 0;
// time run without any time measure
void func0() {
dummy++;
}
// we're trying to measure the cost of startCounter() and getCounterNs(), not "dummy++"
void func1() {
timer1.startCounter();
dummy++;
res1 += timer1.getCounterNs();
}
void func2() {
// start your counter here
dummy++;
// res2 += end your counter here
}
int main()
{
int i, ntest = 1000 * 1000 * 100;
int64_t runtime0, runtime1, runtime2;
timerMain.startCounter();
for (i=1; i<=ntest; i++) func0();
runtime0 = timerMain.getCounter();
cout << "Time0 = " << runtime0 << "ms\n";
timerMain.startCounter();
for (i=1; i<=ntest; i++) func1();
runtime1 = timerMain.getCounter();
cout << "Time1 = " << runtime1 << "ms\n";
timerMain.startCounter();
for (i=1; i<=ntest; i++) func2();
runtime2 = timerMain.getCounter();
cout << "Time2 = " << runtime2 << "ms\n";
return 0;
}
I'm trying to profile a program where certain critical parts have execution time measured in < 50 nanoseconds. I found that my timer class using std::chrono is too expensive (code with timing takes 40% more time than code without). How can I make a faster timer class?
I think some OS-specific system calls would be the fastest solution. The platform is Linux Ubuntu.
Edit: all code is compiled with -O3. It's ensured that each timer is only initialized once, so the measured cost is due to the startMeasure/stopMeasure functions only. I'm not doing any text printing.
Edit 2: the accepted answer doesn't include the method to actually convert number-of-cycles to nanoseconds. If someone can do that, it'd be very helpful.
What you want is called "micro-benchmarking". It can get very complex. I assume you are using Ubuntu Linux on x86_64. This is not valid form ARM, ARM64 or any other platforms.
std::chrono is implemented at libstdc++ (gcc) and libc++ (clang) on Linux as simply a thin wrapper around the GLIBC, the C library, which does all the heavy lifting. If you look at std::chrono::steady_clock::now() you will see calls to clock_gettime().
clock_gettime() is a VDSO, ie it is kernel code that runs in userspace. It should be very fast but it might be that from time to time it has to do some housekeeping and take a long time every n-th call. So I would not recommend for microbenchmarking.
Almost every platform has a cycle counter and x86 has the assembly instruction rdtsc. This instruction can be inserted in your code by crafting asm calls or by using the compiler-specific builtins __builtin_ia32_rdtsc() or __rdtsc().
These calls will return a 64-bit integer representing the number of clocks since the machine power up. rdtsc is not immediate but fast, it will take roughly 15-40 cycles to complete.
It is not guaranteed in all platforms that this counter will be the same for each core so beware when the process gets moved from core to core. In modern systems this should not be a problem though.
Another problem with rdtsc is that compilers will often reorder instructions if they find they don't have side effects and unfortunately rdtsc is one of them. So you have to use fake barriers around these counter reads if you see that the compiler is playing tricks on you - look at the generated assembly.
Also a big problem is cpu out of order execution itself. Not only the compiler can change the order of execution but the cpu can as well. Since the x86 486 the Intel CPUs are pipelined so several instructions can be executed at the same time - roughly speaking. So you might end up measuring spurious execution.
I recommend you to get familiar with the quantum-like problems of micro-benchmarking. It is not straightforward.
Notice that rdtsc() will return the number of cycles. You have to convert to nanoseconds using the timestamp counter frequency.
Here is one example:
#include <iostream>
#include <cstdio>
void dosomething() {
// yada yada
}
int main() {
double sum = 0;
const uint32_t numloops = 100000000;
for ( uint32_t j=0; j<numloops; ++j ) {
uint64_t t0 = __builtin_ia32_rdtsc();
dosomething();
uint64_t t1 = __builtin_ia32_rdtsc();
uint64_t elapsed = t1-t0;
sum += elapsed;
}
std::cout << "Average:" << sum/numloops << std::endl;
}
This paper is a bit outdated (2010) but it is sufficiently up to date to give you a good introduction to micro-benchmarking:
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures

Adding a print statement speeds up code by an order of magnitude

I've hit a piece of extremely bizarre performance behavior in a piece of C/C++ code, as suggested in the title, which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;
const int pp = 29;
typedef complex<double> cdbl;
int main() {
cdbl ff[pp], gg[pp];
for(int ii = 0; ii < pp; ii++) {
ff[ii] = gg[ii] = 1.0;
}
for(int it = 0; it < 1000; it++) {
cdbl dual[pp];
for(int ii = 0; ii < pp; ii++) {
dual[ii] = 0.0;
}
for(int h1 = 0; h1 < pp; h1 ++) {
for(int h2 = 0; h2 < pp; h2 ++) {
cdbl avg_right = 0.0;
for(int xx = 0; xx < pp; xx ++) {
int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
c11 = (xx + h1 + h2) % pp;
avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
}
avg_right /= static_cast<cdbl>(pp);
for(int xx = 0; xx < pp; xx ++) {
int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
c11 = (xx + h1 + h2) % pp;
dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
}
}
}
for(int ii = 0; ii < pp; ii++) {
dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
}
for(int ii = 0; ii < pp; ii++) {
gg[ii] = dual[ii];
}
#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
}
printf("%.15lf\n", gg[0].real());
return 0;
}
Here are the results of running this on my system:
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me#mine $ time ./test.elf > /dev/null
real 0m7.329s
user 0m7.328s
sys 0m0.000s
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me#mine $ time ./test.elf > /dev/null
real 0m0.492s
user 0m0.490s
sys 0m0.001s
me#mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g, printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what an earth could be causing this. Worse, it does actually affect my life because I'm running related code a lot, and going round inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;
const int pp = 29;
typedef complex<double> cdbl;
int main() {
cdbl ff[pp];
cdbl blah = 0.0;
for(int ii = 0; ii < pp; ii++) {
ff[ii] = 1.0;
}
for(int it = 0; it < 1000; it++) {
cdbl xx = 0.0;
for(int kk = 0; kk < 100; kk++) {
for(int ii = 0; ii < pp; ii++) {
for(int jj = 0; jj < pp; jj++) {
xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
}
}
}
blah += xx;
printf("%.15lf\n", blah.real());
}
printf("%.15lf\n", blah.real());
return 0;
}
I could make it even smaller but already the machine code is manageable. If I change five bytes of the binary corresponding to the callq instruction for that first printf, to 0x90, the execution goes from fast to slow.
The compiled code is very heavy with function calls to __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions so it's a difference in instructions / cycle (about 0.16 vs about 2.8).
Also, compiling -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function — including some foolish ones I made up that do nothing — for the first time, puts the execution into slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, importantly the first time __muldc3() is called we go into slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;
int main() {
complex<double> foo = 0.0;
foo += foo * foo; // 1
char str[10];
sprintf(str, "%c\n", 'c');
//fflush(stdout); // 2
for(int it = 0; it < 100000000; it++) {
foo += foo * foo;
}
return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".
For the record, I finally figured this out.
It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of `__muldc3' (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the MMY registers in a non-zero state. When you call sprintf among other things it zeros out the upper bits of the MMY registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.

Using the Apple Accelerate Framework vForce library to improve performance

I have successfully implemented the BLAS library from Apple's Accelerate Framework to improve the performance of my basic vector and matrix operations.
Being satisfied with this, I turned my attention to vForce to vectorize my basic math functions. Here I was a bit surprised to get quite poor performance compared to naive implementations (using automatic compiler optimization -Os).
As a simple benchmark, I ran the following test: Matrix is the basic Matrix type, using a double pointer, AccelerateMatrix is a subclass of Matrix which uses the exponentiation function from vForce:
Matrix A(vec_size);
AccelerateMatrix B(vec_size);
for (int i=0; i<vec_size;i++ ) {
A[i] = i;
B[i] = i;
}
double elapsed_time;
clock_t start = clock();
for(int i=0;i<reps;i++){
A.exp();
A.log();
}
clock_t stop = clock();
elapsed_time = (double)(stop-start)/CLOCKS_PER_SEC/reps;
cerr << "Basic matrix exponentiation/log time = " << elapsed_time << endl;
start = clock();
for(int i=0;i<reps;i++){
B.exp();
B.log();
}
stop = clock();
elapsed_time = (double)(stop-start)/CLOCKS_PER_SEC/reps;
cerr << "Accelerate matrix exponentiation/log time = " << elapsed_time << endl;
The exponentiate/log member functions are implemented as follows:
void AccelerateMatrix::exp(){
int size =(int)this->getSize();
this->goToStart();
vvexp(this->ptr, this->ptr, &size);}
void Matrix::exp(){
double *ptr = data;
while (!atEnd()) {
*ptr = std::exp(*ptr);
ptr++;
}
}
data is the pointer to the first element of the double array.
Below is the output of the performance:
Number of matrix elements = 1000000
Basic matrix exponentiation/log time(secs) = 0.0089806
Accelerate matrix exponentiation/log time(secs) = 0.0149955
I am running in from XCode in Release mode.
My processor is a 2.3 GHz Intel Core i7.
The memory is 8 GB 1600 MHz DDR3.
It appears the issue is to do with how vForce manipulates memory. Essentially it is not good at handling large matrices in one go. For vec_size = 1000; vForce computes the exponential/log twice as fast as the compiler optimised, naive version. I broke the larger example vec_size = 1000000 up into batches of 1000 each, and lo and behold, the vForce implementation was twice as fast as the naive one. Nice!

Benchmarking math.h square root and Quake square root

Okay so I was board and wondered how fast math.h square root was in comparison to the one with the magic number in it (made famous by Quake but made by SGI).
But this has ended up in a world of hurt for me.
I first tried this on the Mac where the math.h would win hands down every time then on Windows where the magic number always won, but I think this is all down to my own noobness.
Compiling on the Mac with "g++ -o sq_root sq_root_test.cpp" when the program ran it takes about 15 seconds to complete. But compiling in VS2005 on release takes a split second. (in fact I had to compile in debug just to get it to show some numbers)
My poor man's benchmarking? is this really stupid? cos I get 0.01 for math.h and 0 for the Magic number. (it cant be that fast can it?)
I don't know if this matters but the Mac is Intel and the PC is AMD. Is the Mac using hardware for math.h sqroot?
I got the fast square root algorithm from http://en.wikipedia.org/wiki/Fast_inverse_square_root
//sq_root_test.cpp
#include <iostream>
#include <math.h>
#include <ctime>
float invSqrt(float x)
{
union {
float f;
int i;
} tmp;
tmp.f = x;
tmp.i = 0x5f3759df - (tmp.i >> 1);
float y = tmp.f;
return y * (1.5f - 0.5f * x * y * y);
}
int main() {
std::clock_t start;// = std::clock();
std::clock_t end;
float rootMe;
int iterations = 999999999;
// ---
rootMe = 2.0f;
start = std::clock();
std::cout << "Math.h SqRoot: ";
for (int m = 0; m < iterations; m++) {
(float)(1.0/sqrt(rootMe));
rootMe++;
}
end = std::clock();
std::cout << (difftime(end, start)) << std::endl;
// ---
std::cout << "Quake SqRoot: ";
rootMe = 2.0f;
start = std::clock();
for (int q = 0; q < iterations; q++) {
invSqrt(rootMe);
rootMe++;
}
end = std::clock();
std::cout << (difftime(end, start)) << std::endl;
}
There are several problems with your benchmarks. First, your benchmark includes a potentially expensive cast from int to float. If you want to know what a square root costs, you should benchmark square roots, not datatype conversions.
Second, your entire benchmark can be (and is) optimized out by the compiler because it has no observable side effects. You don't use the returned value (or store it in a volatile memory location), so the compiler sees that it can skip the whole thing.
A clue here is that you had to disable optimizations. That means your benchmarking code is broken. Never ever disable optimizations when benchmarking. You want to know which version runs fastest, so you should test it under the conditions it'd actually be used under. If you were to use square roots in performance-sensitive code, you'd enable optimizations, so how it behaves without optimizations is completely irrelevant.
Also, you're not benchmarking the cost of computing a square root, but of the inverse square root.
If you want to know which way of computing the square root is fastest, you have to move the 1.0/... division down to the Quake version. (And since division is a pretty expensive operation, this might make a big difference in your results)
Finally, it might be worth pointing out that Carmacks little trick was designed to be fast on 12 year old computers. Once you fix your benchmark, you'll probably find that it's no longer an optimization, because today's CPU's are much faster at computing "real" square roots.

How to speed up floating-point to integer number conversion? [duplicate]

This question already has answers here:
What is the fastest way to convert float to int on x86
(10 answers)
Closed 8 years ago.
We're doing a great deal of floating-point to integer number conversions in our project. Basically, something like this
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
The default C function which performs the conversion turns out to be quite time consuming.
Is there any work around (maybe a hand tuned function) which can speed up the process a little bit? We don't care much about a precision.
Most of the other answers here just try to eliminate loop overhead.
Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.
Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU
Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.
inline int float2int( double d )
{
union Cast
{
double d;
long l;
};
volatile Cast c;
c.d = d + 6755399441055744.0;
return c.l;
}
// this is the same thing but it's
// not always optimizer safe
inline int float2int( double d )
{
d += 6755399441055744.0;
return reinterpret_cast<int&>(d);
}
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float2int(float_array[i]);
The double parameter is not a mistake! There is way to do this trick with floats directly but it gets ugly trying to cover all the corner cases. In its current form this function will round the float the nearest whole number if you want truncation instead use 6755399441055743.5 (0.5 less).
I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions which are twice as fast as even the magic-number technique.
Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.
There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.
Is the time large enough that it outweighs the cost of starting a couple of threads?
Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.
The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq to convert two packed floats to two packed int64s. Do this twice (getting four int64s across two SSE registers) and you can shuffle them together to get four int32s (losing the top 32 bits of each conversion result). You don't need assembly to do this; MSVC exposes compiler intrinsics to the relevant instructions -- _mm_cvtpd_epi32() if my memory serves me correctly.
If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software pipeline a little and process sixteen floats at once in each loop, eg (assuming that the "functions" here are actually calls to compiler intrinsics):
for(int i = 0; i < HUGE_NUMBER; i+=16)
{
//int_array[i] = float_array[i];
__m128 a = sse_load4(float_array+i+0);
__m128 b = sse_load4(float_array+i+4);
__m128 c = sse_load4(float_array+i+8);
__m128 d = sse_load4(float_array+i+12);
a = sse_convert4(a);
b = sse_convert4(b);
c = sse_convert4(c);
d = sse_convert4(d);
sse_write4(int_array+i+0, a);
sse_write4(int_array+i+4, b);
sse_write4(int_array+i+8, c);
sse_write4(int_array+i+12, d);
}
The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)
Failing this SSE juju you can supply the /QIfist option to MSVC which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating point code is fast now, but a disassembler will show you that this is unjustifiedly optimistic. Even /fp:fast simply results to a call to _ftol_sse2, which though faster than the egregious _ftol is still a function call followed by a latent SSE op, and thus unnecessarily slow.
I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.
You might be able to load all of the integers into the SSE module of your processor using some magic assembly code, then do the equivalent code to set the values to ints, then read them as floats. I'm not sure this would be any faster though. I'm not a SSE guru, so I don't know how to do this. Maybe someone else can chime in.
In Visual C++ 2008, the compiler generates SSE2 calls by itself, if you do a release build with maxed out optimization options, and look at a disassembly (though some conditions have to be met, play around with your code).
See this Intel article for speeding up integer conversions:
http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/
According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.
http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx
most c compilers generate calls to _ftol or something for every float to int conversion. putting a reduced floating point conformance switch (like fp:fast) might help - IF you understand AND accept the other effects of this switch. other than that, put the thing in a tight assembly or sse intrinsic loop, IF you are ok AND understand the different rounding behavior.
for large loops like your example you should write a function that sets up floating point control words once and then does the bulk rounding with only fistp instructions and then resets the control word - IF you are ok with an x86 only code path, but at least you will not change the rounding.
read up on the fld and fistp fpu instructions and the fpu control word.
What compiler are you using? In Microsoft's more recent C/C++ compilers, there is an option under C/C++ -> Code Generation -> Floating point model, which has options: fast, precise, strict. I think precise is the default, and works by emulating FP operations to some extent. If you are using a MS compiler, how is this option set? Does it help to set it to "fast"? In any case, what does the disassembly look like?
As thirtyseven said above, the CPU can convert float<->int in essentially one instruction, and it doesn't get any faster than that (short of a SIMD operation).
Also note that modern CPUs use the same FP unit for both single (32 bit) and double (64 bit) FP numbers, so unless you are trying to save memory storing a lot of floats, there's really no reason to favor float over double.
On Intel your best bet is inline SSE2 calls.
I'm surprised by your result. What compiler are you using? Are you compiling with optimization turned all the way up? Have you confirmed using valgrind and Kcachegrind that this is where the bottleneck is? What processor are you using? What does the assembly code look like?
The conversion itself should be compiled to a single instruction. A good optimizing compiler should unroll the loop so that several conversions are done per test-and-branch. If that's not happening, you can unroll the loop by hand:
for(int i = 0; i < HUGE_NUMBER-3; i += 4) {
int_array[i] = float_array[i];
int_array[i+1] = float_array[i+1];
int_array[i+2] = float_array[i+2];
int_array[i+3] = float_array[i+3];
}
for(; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
If your compiler is really pathetic, you might need to help it with the common subexpressions, e.g.,
int *ip = int_array+i;
float *fp = float_array+i;
ip[0] = fp[0];
ip[1] = fp[1];
ip[2] = fp[2];
ip[3] = fp[3];
Do report back with more info!
If you do not care very much about the rounding semantics, you can use the lrint() function. This allows for more freedom in rounding and it can be much faster.
Technically, it's a C99 function, but your compiler probably exposes it in C++. A good compiler will also inline it to one instruction (a modern G++ will).
lrint documentation
rounding only
excellent trick, only the use 6755399441055743.5 (0.5 less) to do rounding won't work.
6755399441055744 = 2^52 + 2^51 overflowing decimals off the end of the mantissa leaving the integer that you want in bits 51 - 0 of the fpu register.
In IEEE 754
6755399441055744.0 =
sign exponent mantissa
0 10000110011 1000000000000000000000000000000000000000000000000000
6755399441055743.5
will also however compile to
0100001100111000000000000000000000000000000000000000000000000000
the 0.5 overflows off the end (rounding up) which is why this works in the first place.
to do truncation you would have to add 0.5 to your double then do this
the guard digits should take care of rounding to the correct result done this way.
also watch out for 64 bit gcc linux where long rather annoyingly means a 64 bit integer.
If you have very large arrays (bigger than a few MB--the size of the CPU cache), time your code and see what the throughput is. You're probably saturating the memory bus, not the FP unit. Look up the maximum theoretical bandwidth for your CPU and see how close to it you are.
If you're being limited by the memory bus, extra threads will just make it worse. You need better hardware (e.g. faster memory, different CPU, different motherboard).
In response to Larry Gritz's comment...
You are correct: the FPU is a major bottleneck (and using the xs_CRoundToInt trick allows one to come very close to saturating the memory bus).
Here are some test results for a Core 2 (Q6600) processor. The theoretical main-memory bandwidth for this machine is 3.2 GB/s (L1 and L2 bandwidths are much higher). The code was compiled with Visual Studio 2008. Similar results for 32-bit and 64-bit, and with /O2 or /Ox optimizations.
WRITING ONLY...
1866359 ticks with 33554432 array elements (33554432 touched). Bandwidth: 1.91793 GB/s
154749 ticks with 262144 array elements (33554432 touched). Bandwidth: 23.1313 GB/s
108816 ticks with 8192 array elements (33554432 touched). Bandwidth: 32.8954 GB/s
USING CASTING...
5236122 ticks with 33554432 array elements (33554432 touched). Bandwidth: 0.683625 GB/s
2014309 ticks with 262144 array elements (33554432 touched). Bandwidth: 1.77706 GB/s
1967345 ticks with 8192 array elements (33554432 touched). Bandwidth: 1.81948 GB/s
USING xs_CRoundToInt...
1490583 ticks with 33554432 array elements (33554432 touched). Bandwidth: 2.40144 GB/s
1079530 ticks with 262144 array elements (33554432 touched). Bandwidth: 3.31584 GB/s
1008407 ticks with 8192 array elements (33554432 touched). Bandwidth: 3.5497 GB/s
(Windows) source code:
// floatToIntTime.cpp : Defines the entry point for the console application.
//
#include <windows.h>
#include <iostream>
using namespace std;
double const _xs_doublemagic = double(6755399441055744.0);
inline int xs_CRoundToInt(double val, double dmr=_xs_doublemagic) {
val = val + dmr;
return ((int*)&val)[0];
}
static size_t const N = 256*1024*1024/sizeof(double);
int I[N];
double F[N];
static size_t const L1CACHE = 128*1024/sizeof(double);
static size_t const L2CACHE = 4*1024*1024/sizeof(double);
static size_t const Sz[] = {N, L2CACHE/2, L1CACHE/2};
static size_t const NIter[] = {1, N/(L2CACHE/2), N/(L1CACHE/2)};
int main(int argc, char *argv[])
{
__int64 freq;
QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
cout << "WRITING ONLY..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = 13;
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING CASTING..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = (int)F[n];
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING xs_CRoundToInt..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = xs_CRoundToInt(F[n]);
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
return 0;
}