Consider the following code:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>
int main() {
    std::vector<int> v(12);
    std::iota(v.begin(), v.end(), 0);
    //std::next_permutation(v.begin(), v.end());
    using clock = std::chrono::high_resolution_clock;
    clock c;
    auto start = c.now();
    unsigned long counter = 0;
    do {
        ++counter;
    } while (std::next_permutation(v.begin(), v.end()));
    auto end = c.now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << counter << " permutations took " << duration.count() / 1000.0f << " s";
}
Compiled with GCC (MinGW) 5.3 at -O2 on my 4.1 GHz AMD CPU this takes 2.3 s. However, if I uncomment the commented-out line it slows down to 3.4 s. I would expect a minimal speed-up, because we measure the time for one permutation fewer. With -O3 the difference is less extreme: 2.0 s versus 2.4 s.
Can anyone explain that? Could a super-smart compiler detect that I want to traverse all permutations and optimize this code?
I think the compiler gets confused by the function being called from two separate lines in your code, which prevents it from being inlined.
GCC 8.0.0 behaves the same way as yours.
As discussed in Benefits of inline functions in C++?, inlining gives the compiler a simple mechanism to apply more optimizations, so losing the inlining can cause a severe drop in performance in some cases.
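One way to test that explanation (a hedged sketch, not the original code) is to keep std::next_permutation at a single call site and adjust the counter instead of making an extra call before the loop:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(12);
    std::iota(v.begin(), v.end(), 0);

    using clock = std::chrono::high_resolution_clock;
    auto start = clock::now();

    unsigned long counter = 0;
    do {
        ++counter;
    } while (std::next_permutation(v.begin(), v.end()));
    --counter;  // report one permutation fewer without a second call site

    auto end = clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << counter << " permutations took " << duration.count() / 1000.0f << " s\n";
}

If this variant runs at the fast (2.3 s) speed, the lost inlining is at least a plausible culprit.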
Here is a minimal working example.
program test
    implicit none
    double precision :: c1, c2, rate
    integer :: ci, cj, cr, cm, i
    integer, parameter :: max_iter = 10000000 ! 10^7

    c1 = 0.0d+0
    CALL system_clock(count_rate=cr)
    CALL system_clock(count_max=cm)
    rate = REAL(cr)
    CALL SYSTEM_CLOCK(ci)
    do i = 1, max_iter
        c1 = c1 + log(DBLE(i))
    end do
    CALL SYSTEM_CLOCK(cj)
    WRITE(*,*) "system_clock : ", (cj - ci)/rate
    print*, c1
end program test
When I compile with gfortran -Ofast -march=core-avx2 -fopt-info-vec-optimized, the do loop containing the log function does not get vectorized. I have also tried -O3, but the result does not change.
But if I write the equivalent C++ code,
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstdio>

using namespace std;
using namespace std::chrono;

int main()
{
    double c1 = 0;
    const int max_iter = 10000000; // 10^7

    auto start = high_resolution_clock::now();
    for (int i = 1; i <= max_iter; i++)
    {
        c1 += log(i);
    }
    auto stop = high_resolution_clock::now();

    auto duration = duration_cast<milliseconds>(stop - start);
    cout << duration.count() << " ms" << '\n';
    printf("%0.15f\n", c1);
    return 0;
}
and compile it with g++ -Ofast -march=core-avx2 -fopt-info-vec-optimized, the for loop gets vectorized and runs almost 10 times faster.
What should I do to make the Fortran loop vectorize?
The problem with vectorizing loops that contain math functions (like log) is that the compiler has to be taught the semantics of the vectorized math functions. If you look at the assembler output, you can see that the Fortran version calls the "normal" scalar function (a line like call log), whereas your C++ version calls the vectorized version (call _ZGVdN4v___log_finite). There has been some work on making GFortran understand the glibc vector math library (libmvec), but I'm not sure what the current status is. See the thread starting at https://gcc.gnu.org/legacy-ml/gcc/2018-04/msg00062.html and continuing in June 2018 at https://gcc.gnu.org/legacy-ml/gcc/2018-06/msg00167.html for more details.
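If you want to check this yourself, one way (assuming the sources are saved as test.f90 and test.cpp; the exact symbol names depend on your glibc and GCC versions) is to dump the assembly and grep for the log calls:

gfortran -Ofast -march=core-avx2 -S test.f90 -o test_f90.s
g++ -Ofast -march=core-avx2 -S test.cpp -o test_cpp.s
grep 'call.*log' test_f90.s test_cpp.s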
I'm trying to understand why std::for_each, which runs on a single thread, is ~3 times faster than __gnu_parallel::for_each in the example below:
Time =0.478101 milliseconds
vs
Time =0.166421 milliseconds
Here is the code I'm using to benchmark:
#include <iostream>
#include <chrono>
#include <vector>
#include <parallel/algorithm>

// The struct I'm using for timing
struct TimerAvrg
{
    std::vector<double> times;
    size_t curr = 0, n;
    std::chrono::high_resolution_clock::time_point begin, end;

    TimerAvrg(int _n = 30)
    {
        n = _n;
        times.reserve(n);
    }

    inline void start()
    {
        begin = std::chrono::high_resolution_clock::now();
    }

    inline void stop()
    {
        end = std::chrono::high_resolution_clock::now();
        double duration = double(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()) * 1e-6;
        if (times.size() < n)
            times.push_back(duration);
        else {
            times[curr] = duration;
            curr++;
            if (curr >= times.size()) curr = 0;
        }
    }

    double getAvrg()
    {
        double sum = 0;
        for (auto t : times)
            sum += t;
        return sum / double(times.size());
    }
};

int main(int argc, char** argv)
{
    float sum = 0;
    for (int alpha = 0; alpha < 5000; alpha++)
    {
        TimerAvrg Fps;
        Fps.start();
        std::vector<float> v(1000000);
        std::for_each(v.begin(), v.end(), [](auto v) { v = 0; });
        Fps.stop();
        sum = sum + Fps.getAvrg() * 1000;
    }
    std::cout << "\rTime =" << sum / 5000 << " milliseconds" << std::endl;
    return 0;
}
This is my configuration:
gcc version 7.3.0 (Ubuntu 7.3.0-21ubuntu1~16.04)
Intel® Core™ i7-7600U CPU @ 2.80GHz × 4
htop to check if the program is running in single or multiple threads
g++ -std=c++17 -fomit-frame-pointer -Ofast -march=native -ffast-math -mmmx -msse -msse2 -msse3 -DNDEBUG -Wall -fopenmp benchmark.cpp -o benchmark
The same code doesn't compile with GCC 8.1.0. I get this error message:
/usr/include/c++/8/tr1/cmath:1163:20: error: ‘__gnu_cxx::conf_hypergf’ has not been declared
using __gnu_cxx::conf_hypergf;
I already checked a couple of posts, but they're either very old or not about the same issue.
My questions are:
Why is it slower in parallel?
Am I using the wrong functions?
cppreference says that GCC does not support the Standardization of Parallelism TS (marked in red in the table), and yet my code is running in parallel!?
Your function [](auto v){ v=0;} is extremely simple.
The call may be replaced with a single call to memset, or the compiler may use SIMD instructions for single-threaded parallelism. With the knowledge that it overwrites the same state as the vector initially had, the entire loop could be optimised away. It may be easier for the optimiser to replace std::for_each than a parallel implementation.
Furthermore, assuming the parallel loop uses threads, one must remember that thread creation and eventual synchronisation (in this case there is no need for synchronisation during processing) have overhead, which may be significant in relation to your trivial operation.
Threaded parallelism is often only worth it for computationally expensive tasks. v=0 is one of the least computationally expensive operations there are.
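To illustrate (a hedged sketch, not the original benchmark, and the crossover point depends entirely on your machine): if the per-element work is genuinely expensive and the lambda writes through a reference, the parallel version at least has a chance to pay for its thread management:

#include <cmath>
#include <parallel/algorithm>  // __gnu_parallel::for_each (GCC parallel mode)
#include <vector>

int main()
{
    std::vector<double> v(1000000);

    // A deliberately expensive per-element operation, written back through a
    // reference so the result is actually stored in the vector.
    auto heavy = [](double& x) { x = std::log(1.0 + x) + std::sqrt(x + 2.0); };

    __gnu_parallel::for_each(v.begin(), v.end(), heavy);  // build with -fopenmp
}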
Your benchmark is faulty; I'm even surprised it takes any time to run at all.
You wrote:
std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
As v is a by-value parameter of the lambda's operator() that is never read, I would expect the compiler to remove the assignment entirely.
That leaves a loop with an empty body, which can be removed as well, as it has no observable effect.
And similarly, the vector itself can be removed, as nothing ever reads from it.
So, without any side effects, all of this can be removed. If you use a parallel algorithm, chances are there is some kind of synchronization, which makes optimizing this much harder, as there might be side effects in another thread; proving there aren't is more complex, not to mention the side effects of the thread management itself.
To solve this, a lot of benchmark frameworks have tricks (often hidden in macros) to force the compiler to assume side effects. Use them in the lambda so the compiler doesn't remove the work.
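A minimal sketch of such an escape helper, in the spirit of benchmark::DoNotOptimize from Google Benchmark (the helper name is mine, and the inline asm is a GCC/Clang extension):

#include <algorithm>
#include <vector>

// Tell the optimiser that `value` is read and possibly modified in ways it
// cannot see, so stores to it cannot be eliminated as dead code.
template <class T>
inline void do_not_optimize(T& value)
{
    asm volatile("" : "+r,m"(value) : : "memory");
}

int main()
{
    std::vector<float> v(1000000);
    std::for_each(v.begin(), v.end(), [](float& x) {
        x = 0;
        do_not_optimize(x);  // keeps the store (and hence the loop and vector) alive
    });
}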
I'm in the midst of writing some timing code for a part of a program that has a low latency requirement.
Looking at what's available in the std::chrono library, I'm finding it a bit difficult to write timing code that is portable.
std::chrono::high_resolution_clock
std::chrono::steady_clock
std::chrono::system_clock
The system_clock is useless, as it's not steady; the remaining two clocks are problematic:
The high_resolution_clock isn't necessarily steady on all platforms.
The steady_clock does not necessarily support a fine-grained resolution (e.g. nanoseconds).
For my purposes having a steady clock is the most important requirement and I can sort of get by with microsecond granularity.
My question is: if one wanted to time code that could be running on different hardware architectures and OSes, what would be the best option?
Use steady_clock. On all implementations its precision is nanoseconds. You can check this yourself for your platform by printing out steady_clock::period::num and steady_clock::period::den.
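For example (a trivial check, nothing beyond <chrono> assumed):

#include <chrono>
#include <iostream>

int main()
{
    using std::chrono::steady_clock;
    // Prints the tick period as a ratio; 1/1000000000 means nanosecond ticks.
    std::cout << steady_clock::period::num << '/'
              << steady_clock::period::den << '\n';
}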
Now that doesn't mean that it will actually measure nanosecond precision. But platforms do their best. For me, two consecutive calls to steady_clock (with optimizations enabled) will report times on the order of 100ns apart.
#include "chrono_io.h"
#include <chrono>
#include <iostream>
int
main()
{
    using namespace std::chrono;
    using namespace date;
    auto t0 = steady_clock::now();
    auto t1 = steady_clock::now();
    auto t2 = steady_clock::now();
    auto t3 = steady_clock::now();
    std::cout << t1-t0 << '\n';
    std::cout << t2-t1 << '\n';
    std::cout << t3-t2 << '\n';
}
The above example uses a free, open-source, header-only library ("chrono_io.h") only for the convenience of formatting the duration. You can format things yourself (I'm lazy). For me this just output:
287ns
116ns
75ns
YMMV.
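If you'd rather not pull in an extra header, formatting the difference yourself is just a duration_cast away (a minimal sketch):

#include <chrono>
#include <iostream>

int main()
{
    using namespace std::chrono;
    auto t0 = steady_clock::now();
    auto t1 = steady_clock::now();
    // Convert the steady_clock duration explicitly and print the raw count.
    std::cout << duration_cast<nanoseconds>(t1 - t0).count() << "ns\n";
}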
How can I keep track of time in seconds and milliseconds since my game/program started? I can use the clock() function but I hear it is not that accurate. Is there a better way?
You can use the chrono library in C++
Here is a code sample:
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;
int main() {
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    cout << time_span.count() << " seconds\n";
    return 0;
}
Note that this is C++11, so to compile it you should use the flag -std=c++11:
$ g++ -std=c++11 test.cpp -o test
This exact piece of code gave 4e-07 seconds on my PC.
Hope that helps.
A cross-platform and easy solution would be to use the chrono library
Example:
#include <iostream>
#include <chrono>
void gameFunction()
{
    // start()
    // end()
}

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    gameFunction();
    auto t2 = std::chrono::high_resolution_clock::now();
    auto elapsed_time = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
    std::cout << elapsed_time << std::endl;
    return 0;
}
Finally, note that you will need at least C++11 for this to work, so set the -std= flag to at least c++11. For example:
g++ -std=c++11 game.cpp -o game
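Since the question asks for the time since the game/program started, here is a minimal sketch of that (the class name is mine; it uses steady_clock so the value never jumps backwards):

#include <chrono>
#include <iostream>

// Remembers when it was constructed and reports the elapsed time since then.
class GameTimer
{
    std::chrono::steady_clock::time_point start_ = std::chrono::steady_clock::now();
public:
    double seconds() const
    {
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start_).count();
    }
    long long milliseconds() const
    {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::steady_clock::now() - start_).count();
    }
};

int main()
{
    GameTimer timer;
    // ... run the game loop here ...
    std::cout << timer.seconds() << " s, " << timer.milliseconds() << " ms\n";
    return 0;
}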
I highly recommend you look at Handmade Hero. Casey records creating an entire game in a series of video episodes. In one of the early episodes he discusses determining wall-clock time on Windows using QueryPerformanceCounter and QueryPerformanceFrequency.
He also discusses using CPU cycle counts, although those are only useful assuming a constant processor clock speed.
He talks about these issues again in later episodes: https://hero.handmade.network/episode/game-architecture/day113 and https://hero.handmade.network/episode/game-architecture/day177
Since you are looking for a solution in a game loop, these videos will probably be of interest even if you use the cross-platform solution from @Pranshu. For a game, you are likely fine using a platform-dependent method to get a more accurate clock.
I'll point out that high_resolution_clock provides the highest-resolution clock provided by the system that the library knows about, so it might not be any more accurate than system_clock or steady_clock. (It is an alias for steady_clock on my OS X box.)
I'm just doing some benchmarking and found out that fabsf() is often about 10x slower than fabs(). So I disassembled it, and it turns out the double version uses the fabs instruction while the float version does not. Can this be improved? This is faster, but not by much, and I'm afraid it may not work; it's a little too low-level:
float mabs(float i)
{
    (*reinterpret_cast<MUINT32*>(&i)) &= 0x7fffffff;
    return i;
}
Edit: Sorry, I forgot about the compiler - I still use the good old VS2005, no special libs.
You can easily test different possibilities using the code below. It essentially tests your bit-fiddling against a naive template abs and against std::abs. Not surprisingly, the naive template abs wins. Well, it is kind of surprising that it wins; I'd expect std::abs to be equally fast. Note that -O3 actually makes things slower (at least on Coliru).
Coliru's host system shows these timings:
random number generation: 4240 ms
naive template abs: 190 ms
ugly bitfiddling abs: 241 ms
std::abs: 204 ms
::fabsf: 202 ms
And these timings for a Virtualbox VM running Arch with GCC 4.9 on a Core i7:
random number generation: 1453 ms
naive template abs: 73 ms
ugly bitfiddling abs: 97 ms
std::abs: 57 ms
::fabsf: 80 ms
And these timings on MSVS2013 (Windows 7 x64):
random number generation: 671 ms
naive template abs: 59 ms
ugly bitfiddling abs: 129 ms
std::abs: 109 ms
::fabsf: 109 ms
If I haven't made some blatantly obvious mistake in this benchmark code (don't shoot me over it, I wrote this up in about 2 minutes), I'd say just use std::abs, or the template version if that turns out to be slightly faster for you.
The code:
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <limits>
#include <random>
#include <vector>
#include <math.h>

using Clock = std::chrono::high_resolution_clock;
using milliseconds = std::chrono::milliseconds;

template<typename T>
T abs_template(T t)
{
    return t > 0 ? t : -t;
}

float abs_ugly(float f)
{
    (*reinterpret_cast<std::uint32_t*>(&f)) &= 0x7fffffff;
    return f;
}

int main()
{
    std::random_device rd;
    std::mt19937 mersenne(rd());
    std::uniform_real_distribution<> dist(-std::numeric_limits<float>::lowest(), std::numeric_limits<float>::max());
    std::vector<float> v(100000000);

    Clock::time_point t0 = Clock::now();
    std::generate(std::begin(v), std::end(v), [&dist, &mersenne]() { return dist(mersenne); });
    Clock::time_point trand = Clock::now();

    volatile float temp;
    for (float f : v)
        temp = abs_template(f);
    Clock::time_point ttemplate = Clock::now();

    for (float f : v)
        temp = abs_ugly(f);
    Clock::time_point tugly = Clock::now();

    for (float f : v)
        temp = std::abs(f);
    Clock::time_point tstd = Clock::now();

    for (float f : v)
        temp = ::fabsf(f);
    Clock::time_point tfabsf = Clock::now();

    milliseconds random_time = std::chrono::duration_cast<milliseconds>(trand - t0);
    milliseconds template_time = std::chrono::duration_cast<milliseconds>(ttemplate - trand);
    milliseconds ugly_time = std::chrono::duration_cast<milliseconds>(tugly - ttemplate);
    milliseconds std_time = std::chrono::duration_cast<milliseconds>(tstd - tugly);
    milliseconds c_time = std::chrono::duration_cast<milliseconds>(tfabsf - tstd);

    std::cout << "random number generation: " << random_time.count() << " ms\n"
              << "naive template abs: " << template_time.count() << " ms\n"
              << "ugly bitfiddling abs: " << ugly_time.count() << " ms\n"
              << "std::abs: " << std_time.count() << " ms\n"
              << "::fabsf: " << c_time.count() << " ms\n";
}
Oh, and to answer your actual question: if the compiler can't generate more efficient code, I doubt there is a faster way save for micro-optimized assembly, especially for elementary operations such as this.
There are many things at play here. First off, the x87 co-processor is deprecated in favor of SSE/AVX, so I'm surprised to read that your compiler still uses the fabs instruction. It's quite possible that the others who posted benchmark answers on this question use a platform that supports SSE. Your results might be wildly different.
I'm not sure why your compiler uses a different logic for fabs and fabsf. It's totally possible to load a float to the x87 stack and use the fabs instruction on it just as easily. The problem with reproducing this by yourself, without compiler support, is that you can't integrate the operation into the compiler's normal optimizing pipeline: if you say "load this float, use the fabs instruction, return this float to memory", then the compiler will do exactly that... and it may involve putting back to memory a float that was already ready to be processed, loading it back in, using the fabs instruction, putting it back to memory, and loading it again to the x87 stack to resume the normal, optimizable pipeline. This would be four wasted load-store operations because it only needed to do fabs.
In short, you are unlikely to beat integrated compiler support for floating-point operations. If you don't have this support, inline assembler might just make things even slower than they presumably already are. The fastest thing for you to do might even be to use the fabs function instead of the fabsf function on your floats.
For reference, modern compilers and modern platforms use the SSE instructions andps (for floats) and andpd (for doubles) to mask out the sign bit, very much like you're doing yourself, but dodging all the language-semantics issues. Both are equally fast. Modern compilers may also detect patterns like x < 0 ? -x : x and produce the optimal andps/andpd instruction without the need for a compiler intrinsic.
Did you try the std::abs overload for float? That would be the canonical C++ way.
Also as an aside, I should note that your bit-modifying version does violate the strict-aliasing rules (in addition to the more fundamental assumption that int and float have the same size) and as such would be undefined behavior.
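If you do want the bit trick without the undefined behaviour, the usual well-defined route in C++11 and later (a sketch, assuming float and std::uint32_t have the same size, which the static_assert checks) is to go through memcpy; compilers typically boil it down to the same andps:

#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

float mabs(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined type punning
    bits &= 0x7fffffffu;                  // clear the sign bit
    std::memcpy(&f, &bits, sizeof f);
    return f;
}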