Fastest way to get square root in float value

I am trying to find the fastest way to compute the square root of a float in C++. I use this kind of function in a large particle-movement simulation, for example to calculate the distance between two particles, which needs a square root. Any suggestions would be very helpful.
I have tried the following; my code is below.
#include <math.h>
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
#define CHECK_RANGE 100
inline float msqrt(float a)
{
int i;
for (i = 0;i * i <= a;i++);
float lb = i - 1; //lower bound
if (lb * lb == a)
return lb;
float ub = lb + 1; // upper bound
float pub = ub; // previous upper bound
for (int j = 0;j <= 20;j++)
{
float ub2 = ub * ub;
if (ub2 > a)
{
pub = ub;
ub = (lb + ub) / 2; // mid value of lower and upper bound
}
else
{
lb = ub;
ub = pub;
}
}
return ub;
}
void check_msqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
msqrt(i);
}
}
void check_sqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
sqrt(i);
}
}
int main()
{
auto start1 = high_resolution_clock::now();
check_msqrt();
auto stop1 = high_resolution_clock::now();
auto duration1 = duration_cast<microseconds>(stop1 - start1);
cout << "Time for check_msqrt = " << duration1.count() << " micro secs\n";
auto start2 = high_resolution_clock::now();
check_sqrt();
auto stop2 = high_resolution_clock::now();
auto duration2 = duration_cast<microseconds>(stop2 - start2);
cout << "Time for check_sqrt = " << duration2.count() << " micro secs";
//cout << msqrt(3);
return 0;
}
The output of the above code shows that my implementation is about 4 times slower than sqrt from math.h.
I need something faster than the math.h version.

In short, I do not think it is possible to implement something generally faster than the standard library version of sqrt.
Performance is a very important parameter when implementing standard library functions and it is fair to assume that such a commonly used function as sqrt is optimized as much as possible.
Beating the standard library function would require a special case, such as:
Availability of a suitable assembler instruction - or other specialized hardware support - on the particular system for which the standard library has not been specialized.
Knowledge of the needed range or precision. The standard library function must handle all cases specified by the standard. If the application only needs a subset of that or maybe only requires an approximate result then perhaps an optimization is possible.
Mathematically reducing the calculation, or combining several calculation steps in a smart way, so that an efficient implementation can be made for that combination.
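For the particle case in the question, the last point can be illustrated with a cutoff test: if a distance is only compared against a threshold, the comparison can be done on squared values and no square root is needed at all. This is a minimal sketch; the names Vec3, dist2 and within_cutoff are hypothetical and not from the question.

```cpp
#include <cassert>

// Hypothetical particle position type, not taken from the question.
struct Vec3 { float x, y, z; };

// Squared Euclidean distance: no square root involved.
inline float dist2(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// For non-negative d and r, d < r is equivalent to d*d < r*r,
// so the cutoff test can be done entirely on squared values.
inline bool within_cutoff(const Vec3& a, const Vec3& b, float cutoff) {
    return dist2(a, b) < cutoff * cutoff;
}
```

This only helps when the distance itself is not needed, for example in neighbour searches or collision tests.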

Here's an alternative to binary search. It may not be as fast as std::sqrt (I haven't tested it), but it will definitely be faster than your binary search.
auto
Sqrt(float x)
{
using F = decltype(x);
if (x == 0 || x == INFINITY || isnan(x))
return x;
if (x < 0)
return F{NAN};
int e;
x = std::frexp(x, &e);
if (e % 2 != 0)
{
++e;
x /= 2;
}
auto y = (F{-160}/567*x + F{2'848}/2'835)*x + F{155}/567;
y = (y + x/y)/2;
y = (y + x/y)/2;
return std::ldexp(y, e/2);
}
After getting +/-0, nan, inf, and negatives out of the way, it works by decomposing the float into a mantissa in the range [1/4, 1) times 2^e, where e is an even integer. The answer is then sqrt(mantissa) * 2^(e/2).
Finding the sqrt of the mantissa can be guessed at with a least squares quadratic curve fit in the range [1/4, 1]. Then that good guess is refined by two iterations of Newton–Raphson. This will get you within 1 ulp of the correctly rounded result. A good std::sqrt will typically get that last bit correct.

I have also tried the algorithm described at https://en.wikipedia.org/wiki/Fast_inverse_square_root, but it did not give the desired result. Please check:
#include <math.h>
#include <iostream>
#include <chrono>
#include <bit>
#include <limits>
#include <cstdint>
using namespace std;
using namespace std::chrono;
#define CHECK_RANGE 10000
inline float msqrt(float a)
{
int i;
for (i = 0;i * i <= a;i++);
float lb = i - 1; //lower bound
if (lb * lb == a)
return lb;
float ub = lb + 1; // upper bound
float pub = ub; // previous upper bound
for (int j = 0;j <= 20;j++)
{
float ub2 = ub * ub;
if (ub2 > a)
{
pub = ub;
ub = (lb + ub) / 2; // mid value of lower and upper bound
}
else
{
lb = ub;
ub = pub;
}
}
return ub;
}
/* mentioned here -> https://en.wikipedia.org/wiki/Fast_inverse_square_root */
inline float Q_sqrt(float number)
{
union Conv {
float f;
uint32_t i;
};
Conv conv;
conv.f= number;
conv.i = 0x5f3759df - (conv.i >> 1);
conv.f *= 1.5F - (number * 0.5F * conv.f * conv.f);
return 1/conv.f;
}
void check_Qsqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
Q_sqrt(i);
}
}
void check_msqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
msqrt(i);
}
}
void check_sqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
sqrt(i);
}
}
int main()
{
auto start1 = high_resolution_clock::now();
check_msqrt();
auto stop1 = high_resolution_clock::now();
auto duration1 = duration_cast<microseconds>(stop1 - start1);
cout << "Time for check_msqrt = " << duration1.count() << " micro secs\n";
auto start2 = high_resolution_clock::now();
check_sqrt();
auto stop2 = high_resolution_clock::now();
auto duration2 = duration_cast<microseconds>(stop2 - start2);
cout << "Time for check_sqrt = " << duration2.count() << " micro secs\n";
auto start3 = high_resolution_clock::now();
check_Qsqrt();
auto stop3 = high_resolution_clock::now();
auto duration3 = duration_cast<microseconds>(stop3 - start3);
cout << "Time for check_Qsqrt = " << duration3.count() << " micro secs\n";
//cout << Q_sqrt(3);
//cout << sqrt(3);
//cout << msqrt(3);
return 0;
}
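One caveat about timing loops of this kind: they call the function and throw the result away, so an optimizing compiler is free to remove some or all of the calls, which makes the numbers hard to interpret. A minimal sketch of a loop that keeps the results observable is shown below; bench is a hypothetical helper name, and the volatile accumulator is only one of several ways to do this.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical helper: evaluates f(0), f(1), ..., f(n-1), accumulates the
// results into sum_out, and returns the elapsed time in microseconds.
template <class F>
long long bench(F f, int n, float& sum_out) {
    volatile float sink = 0.0f;  // volatile: every store must really happen
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        sink = sink + f(static_cast<float>(i));
    const auto t1 = std::chrono::steady_clock::now();
    sum_out = sink;
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

It could be called as bench([](float v) { return std::sqrt(v); }, CHECK_RANGE, sum). The accumulation adds the same small cost to every candidate, so the comparison between them stays fair.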

Related

C++17 parallel algorithm vs tbb parallel vs openmp performance

Since the C++17 standard library supports parallel algorithms, I thought it would be the go-to option for us, but after comparing it with TBB and OpenMP I changed my mind: I found the standard library version to be much slower.
With this post I want to ask for professional advice on whether I should abandon the standard library's parallel algorithms and use TBB or OpenMP instead. Thanks!
Env:
Mac OSX, Catalina 10.15.7
GNU g++-10
Benchmark code:
#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <tbb/parallel_for.h>
#include <numeric>
#include <string>
#include <vector>
const size_t N = 1000000;
double std_for() {
auto values = std::vector<double>(N);
size_t n_par = 5lu;
auto indices = std::vector<size_t>(n_par);
std::iota(indices.begin(), indices.end(), 0lu);
size_t stride = static_cast<size_t>(N / n_par) + 1;
std::for_each(
std::execution::par,
indices.begin(),
indices.end(),
[&](size_t index) {
const size_t begin = index * stride;
const size_t end = std::min(N, (index + 1) * stride); // clamp: n_par * stride is larger than N
for (size_t i = begin; i < end; ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
double total = 0;
for (double value : values)
{
total += value;
}
return total;
}
double tbb_for() {
auto values = std::vector<double>(N);
tbb::parallel_for(
tbb::blocked_range<int>(0, values.size()),
[&](tbb::blocked_range<int> r) {
for (int i=r.begin(); i<r.end(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
double omp_for()
{
auto values = std::vector<double>(N);
#pragma omp parallel for
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
double seq_for()
{
auto values = std::vector<double>(N);
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
void time_it(double(*fn_ptr)(), const std::string& fn_name) {
auto t1 = std::chrono::high_resolution_clock::now();
auto rez = fn_ptr();
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
std::cout << fn_name << ", rez = " << rez << ", dur = " << duration << std::endl;
}
int main(int argc, char** argv) {
std::string op(argv[1]);
if (op == "std_for") {
time_it(&std_for, op);
} else if (op == "omp_for") {
time_it(&omp_for, op);
} else if (op == "tbb_for") {
time_it(&tbb_for, op);
} else if (op == "seq_for") {
time_it(&seq_for, op);
}
}
Compile options:
g++ --std=c++17 -O3 b.cpp -ltbb -I /usr/local/include -L /usr/local/lib -fopenmp
Results:
std_for, rez = 500106, dur = 11119
tbb_for, rez = 500106, dur = 7372
omp_for, rez = 500106, dur = 4781
seq_for, rez = 500106, dur = 27910
We can see that std_for is faster than seq_for(sequential for-loop), but it's still much slower than tbb and openmp.
UPDATE
As people suggested in the comments, I now run each loop in a separate process to be fair. The code above has been updated, and the results are as follows:
>>> ./a.out seq_for
seq_for, rez = 500106, dur = 29885
>>> ./a.out tbb_for
tbb_for, rez = 500106, dur = 10619
>>> ./a.out omp_for
omp_for, rez = 500106, dur = 10052
>>> ./a.out std_for
std_for, rez = 500106, dur = 12423
As people said, running the 4 versions in a row within one process was not a fair comparison; compare with the previous results.

You already found that it matters what exactly is measured and how. Your final task will certainly be quite different from this simple exercise, so the results found here will not carry over entirely.
Besides caching and warm-up, which are affected by the order in which the tasks run (you studied this explicitly in your updated question), there is another issue in your example that you should consider.
The actual parallel code is what matters. If it does not determine your performance/runtime, then parallelization is not the right solution. But in your example you also measure resource allocation, initialization, and the final summation. If those drive the real costs in your final application then, again, parallelization is not the silver bullet. Thus, for a fair comparison that really measures the performance of the parallel code itself, I suggest modifying your code along these lines (sorry, I don't have OpenMP installed) and continuing your studies:
#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <tbb/parallel_for.h>
#include <numeric>
#include <string>
#include <vector>
const size_t N = 10000000; // #1
void std_for(std::vector<double>& values,
std::vector<size_t> const& indices,
size_t const stride) {
std::for_each(
std::execution::par,
indices.begin(),
indices.end(),
[&](size_t index) {
const size_t begin = index * stride;
const size_t end = std::min(N, (index + 1) * stride); // clamp: n_par * stride is larger than N
for (size_t i = begin; i < end; ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
}
void tbb_for(std::vector<double>& values) {
tbb::parallel_for(
tbb::blocked_range<int>(0, values.size()),
[&](tbb::blocked_range<int> r) {
for (int i=r.begin(); i<r.end(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
}
/*
double omp_for()
{
auto values = std::vector<double>(N);
#pragma omp parallel for
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
*/
void seq_for(std::vector<double>& values)
{
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
}
void time_it(void(*fn_ptr)(std::vector<double>&), const std::string& fn_name) {
std::vector<double> values = std::vector<double>(N);
auto t1 = std::chrono::high_resolution_clock::now();
fn_ptr(values);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
double total = 0;
for (double value : values) {
total += value;
}
std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}
void time_it_std(void(*fn_ptr)(std::vector<double>&, std::vector<size_t> const&, size_t const), const std::string& fn_name) {
std::vector<double> values = std::vector<double>(N);
size_t n_par = 5lu; // #2
auto indices = std::vector<size_t>(n_par);
std::iota(indices.begin(), indices.end(), 0lu);
size_t stride = static_cast<size_t>(N / n_par) + 1;
auto t1 = std::chrono::high_resolution_clock::now();
fn_ptr(values, indices, stride);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
double total = 0;
for (double value : values) {
total += value;
}
std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}
int main(int argc, char** argv) {
std::string op(argv[1]);
if (op == "std_for") {
time_it_std(&std_for, op);
// } else if (op == "omp_for") {
//time_it(&omp_for, op);
} else if (op == "tbb_for") {
time_it(&tbb_for, op);
} else if (op == "seq_for") {
time_it(&seq_for, op);
}
}
On my (slow) system this results in:
std_for, res = 5.00046e+06, dur = 66393
tbb_for, res = 5.00046e+06, dur = 51746
seq_for, res = 5.00046e+06, dur = 196156
I note here that the difference from seq_for to tbb_for has further increased. It is now ~4x while in your example it looks more like ~3x. And std_for is still about 20..30% slower than tbb_for.
However, there are further parameters. After increasing N (see #1) by a factor of 10 (ok, this is not very important) and n_par (see #2) from 5 to 100 (this is important) the results are
tbb_for, res = 5.00005e+07, dur = 486179
std_for, res = 5.00005e+07, dur = 479306
Here std_for is on-par with tbb_for!
Thus, to answer your question: I clearly would NOT discard c++17 std parallelization right away.
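When the index range is split by hand into n_par chunks, as in std_for, the chunk boundaries deserve some care: with a stride of N / n_par + 1, the unclamped end of the last chunk (n_par * stride) is larger than N. The following is a small sketch of a clamped partition; chunk_range is a hypothetical helper, not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: half-open index range [first, second) of chunk k
// when n items are split into n_chunks pieces of (almost) equal size.
inline std::pair<std::size_t, std::size_t>
chunk_range(std::size_t n, std::size_t n_chunks, std::size_t k) {
    const std::size_t stride = (n + n_chunks - 1) / n_chunks;  // ceil(n / n_chunks)
    const std::size_t begin = std::min(n, k * stride);
    const std::size_t end = std::min(n, begin + stride);       // never past n
    return {begin, end};
}
```

Each parallel task would then loop over its own [first, second) range. More chunks than cores (for example 100 instead of 5) gives the scheduler more freedom to balance the load.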

Perhaps you already know this, but something I don't see mentioned here is that (at least for GCC and Clang) the PSTL is actually implemented on top of a backend: TBB, OpenMP (currently on Clang only, I believe), or a sequential implementation.
I'm guessing you're using libc++ since you are on Mac; as far as I know, for Linux at least, the LLVM distributions do not come with the PSTL enabled, and if building PSTL and libcxx/libcxxabi from source, it defaults to a sequential backend.
https://github.com/llvm/llvm-project/blob/main/pstl/CMakeLists.txt
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/pstl/pstl_config.h

OpenMP is good for straightforward parallel coding.
On the other hand, TBB uses a work-stealing mechanism, which can give you better performance for loops that are imbalanced or nested.
I prefer TBB over OpenMP for complex and nested parallelism (OpenMP has a large overhead for nested parallelism).

Matrix inversion slower using threads

I wrote a function that computes a matrix inverse, and then a multithreaded version of it, since I have to invert arrays larger than 2000 x 2000.
A 1000x1000 array takes 2.5 seconds unthreaded (on an i5-4460, 4 cores, 2.9 GHz)
and 7.25 seconds multithreaded.
I placed the threads in the part that takes most of the time. What is wrong?
Is it because vectors are used instead of 2-dimensional arrays?
This is the minimum code to test both versions:
#include <cmath>
#include <cstdio>
#include <functional>
#include <iostream>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <thread>
const int NUCLEOS = 8;
#ifdef __linux__
#include <unistd.h> //usleep()
typedef std::chrono::system_clock t_clock; //try to use high_resolution_clock on new linux x64 computer!
#else
typedef std::chrono::high_resolution_clock t_clock;
#pragma warning(disable:4996)
#endif
using namespace std;
std::chrono::time_point<t_clock> start_time, stop_time = start_time; char null_char = '\0';
void timer(char *title = 0, int data_size = 1) { stop_time = t_clock::now(); double us = (double)chrono::duration_cast<chrono::microseconds>(stop_time - start_time).count(); if (title) printf("%s time = %7lgms = %7lg MOPs\n", title, (double)us*1e-3, (double)data_size / us); start_time = t_clock::now(); }
//makes columns 0
void colum_zero(vector< vector<double> > &x, vector< vector<double> > &y, int pos0, int pos1,int dim, int ord);
//returns inverse of x, x is not modified, not threaded
vector< vector<double> > inverse(vector< vector<double> > x)
{
if (x.size() != x[0].size())
{
cout << "ERROR on inverse() not square array" << endl; getchar(); return{};//returns a null
}
size_t dim = x.size();
int i, j, ord;
vector< vector<double> > y(dim,vector<double>(dim,0));//initializes output = 0
//init_2Dvector(y, dim, dim);
//1. Unity array y:
for (i = 0; i < dim; i++)
{
y[i][i] = 1.0;
}
double diagon, coef;
double *ptrx, *ptry, *ptrx2, *ptry2;
for (ord = 0; ord<dim; ord++)
{
//2. Make the diagonal element of x equal to 1
int i2;
if (fabs(x[ord][ord])<1e-15) //If that element is 0, a line that contains a non zero is added
{
for (i2 = ord + 1; i2<dim; i2++)
{
if (fabs(x[i2][ord])>1e-15) break;
}
if (i2 >= dim)
return{};//error, returns null
for (i = 0; i<dim; i++)//added a line without 0
{
x[ord][i] += x[i2][i];
y[ord][i] += y[i2][i];
}
}
diagon = 1.0/x[ord][ord];
ptry = &y[ord][0];
ptrx = &x[ord][0];
for (i = 0; i < dim; i++)
{
*ptry++ *= diagon;
*ptrx++ *= diagon;
}
//uses the same function but not threaded:
colum_zero(x,y,0,dim,dim,ord);
}//end ord
return y;
}
//threaded version
vector< vector<double> > inverse_th(vector< vector<double> > x)
{
if (x.size() != x[0].size())
{
cout << "ERROR on inverse() not square array" << endl; getchar(); return{};//returns a null
}
int dim = (int) x.size();
int i, ord;
vector< vector<double> > y(dim, vector<double>(dim, 0));//initializes output = 0
//init_2Dvector(y, dim, dim);
//1. Unity array y:
for (i = 0; i < dim; i++)
{
y[i][i] = 1.0;
}
std::thread tarea[NUCLEOS];
double diagon;
double *ptrx, *ptry;// , *ptrx2, *ptry2;
for (ord = 0; ord<dim; ord++)
{
//2. Make the diagonal element of x equal to 1
int i2;
if (fabs(x[ord][ord])<1e-15) //If a diagonal element=0 it is added a column that is not 0 the diagonal element
{
for (i2 = ord + 1; i2<dim; i2++)
{
if (fabs(x[i2][ord])>1e-15) break;
}
if (i2 >= dim)
return{};//error, returns null
for (i = 0; i<dim; i++)//It is looked for a line without zero to be added to make the number a non zero one to avoid later divide by 0
{
x[ord][i] += x[i2][i];
y[ord][i] += y[i2][i];
}
}
diagon = 1.0 / x[ord][ord];
ptry = &y[ord][0];
ptrx = &x[ord][0];
for (i = 0; i < dim; i++)
{
*ptry++ *= diagon;
*ptrx++ *= diagon;
}
int pos0 = 0, N1 = dim;//initial array position
if ((N1<1) || (N1>5000))
{
cout << "It is detected out than 1-5000 simulations points=" << N1 << " ABORT or press enter to continue" << endl; getchar();
}
//cout << "Initiation of " << NUCLEOS << " threads" << endl;
for (int thread = 0; thread<NUCLEOS; thread++)
{
int pos1 = (int)((thread + 1)*N1 / NUCLEOS);//next position
tarea[thread] = std::thread(colum_zero, std::ref(x), std::ref(y), pos0, pos1, dim, ord);//note: coil current=1
pos0 = pos1;//next thread will work at next point
}
for (int thread = 0; thread<NUCLEOS; thread++)
{
tarea[thread].join();
//cout << "Thread num: " << thread << " end\n";
}
}//end ord
return y;
}
//makes columns 0
void colum_zero(vector< vector<double> > &x, vector< vector<double> > &y, int pos0, int pos1,int dim, int ord)
{
double coef;
double *ptrx, *ptry, *ptrx2, *ptry2;
//Zero out column ord, except for the diagonal element:
for (int i = pos0; i<pos1; i++)//Begin to end for every thread
{
if (i == ord) continue;
coef = x[i][ord];//element to make 0
if (fabs(coef)<1e-15) continue; //If already zero, it is avoided
ptry = &y[i][0];
ptry2 = &y[ord][0];
ptrx = &x[i][0];
ptrx2 = &x[ord][0];
for (int j = 0; j < dim; j++)
{
*ptry++ -= coef * (*ptry2++);//first matrix
*ptrx++ -= coef * (*ptrx2++);//second matrix
}
}
}
void test_6_inverse(int dim)
{
vector< vector<double> > vec1(dim, vector<double>(dim));
for (int i=0;i<dim;i++)
for (int j = 0; j < dim; j++)
{
vec1[i][j] = (-1.0 + 2.0*rand() / RAND_MAX) * 10000;
}
vector< vector<double> > vec2,vec3;
double ini, end;
ini = (double)clock();
vec2 = inverse(vec1);
end = (double)clock();
cout << "=== Time inverse unthreaded=" << (end - ini) / CLOCKS_PER_SEC << endl;
ini=end;
vec3 = inverse_th(vec1);
end = (double)clock();
cout << "=== Time inverse threaded=" << (end - ini) / CLOCKS_PER_SEC << endl;
cout<<vec2[2][2]<<" "<<vec3[2][2]<<endl;//to make the sw to do de inverse
cout << endl;
}
int main()
{
test_6_inverse(1000);
cout << endl << "=== END ===" << endl; getchar();
return 1;
}

After looking deeper into the code of colum_zero(), I saw that one thread writes to data that is used by other threads, so the threads are not INDEPENDENT of each other. Fortunately, the compiler seems to detect this and avoid it.
Conclusions:
It is not recommended to use the Gauss-Jordan method on its own as the basis for multithreading.
If a multithreaded version turns out slower even though the work is divided correctly among the threads, the cause may be that one thread's results are used by another.
The main function inverse() works and can be used by other programmers, so this question should not be deleted.
Unanswered question:
Which matrix inversion method can be split into many independent threads so it can run on a GPU?
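A separate point worth checking in the test code of the question is the timer. clock() reports processor time, and on implementations that add up the time of all threads, a run that keeps several cores busy can report more than the elapsed wall-clock time. A minimal sketch that measures both side by side follows; Timing and time_both are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical result type: wall-clock seconds and processor-time seconds.
struct Timing { double wall_s; double cpu_s; };

// Runs work() once and reports both kinds of time for it.
template <class F>
Timing time_both(F work) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    work();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    return { std::chrono::duration<double>(w1 - w0).count(),
             static_cast<double>(c1 - c0) / CLOCKS_PER_SEC };
}
```

If cpu_s comes out clearly larger than wall_s for the threaded version, the clock() figures are counting work done on several cores rather than elapsed time.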

Eigen: coefficient-wise pow with small integer exponent slow

In Eigen, with
ArrayXXf a;
a = ArrayXXf::Random(1000, 10000);
doing
a = a.pow(4);
takes ~500ms on my pc, whereas doing
a = a.square().square();
takes only about 5ms. I'm compiling with a recent GCC in release.
Is this the expected behaviour, or am I doing something wrong? I would expect that, at least for small integer exponents (say < 20, if not using a cost function), an overload would exist that catches such cases.

After a whole day of debugging I realized this was the bottleneck in my code. The documentation says that there is no SIMD for a.pow(). For whatever reason, on my machine a * a * a * a actually turns out to be the fastest variant for a 300 x 50 Eigen::ArrayXXd.
#include <iostream>
#include <chrono>
#include <Eigen/Dense>
int main() {
Eigen::ArrayXXd m = Eigen::ArrayXd::LinSpaced(300 * 50, 0, 300 * 50 - 1).reshaped(300, 50);
{
decltype(m) result; // prevent loop from being eliminated
auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < 100000; i++)
{
result = m.square().square();
}
auto end = std::chrono::steady_clock::now();
std::cout << "Per run(microsecond)=" << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 100000.0 << std::endl;
}
{
decltype(m) result; // prevent loop from being eliminated
auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < 100000; i++)
{
result = m * m * m * m;
}
auto end = std::chrono::steady_clock::now();
std::cout << "Per run(microsecond)=" << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 100000.0 << std::endl;
}
{
decltype(m) result; // prevent loop from being eliminated
auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < 10000; i++)
{
result = m.pow(4);
}
auto end = std::chrono::steady_clock::now();
std::cout << "Per run(microsecond)=" << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 10000.0 << std::endl;
}
return 0;
}
Output:
m.square().square();
Per run(microsecond)=17.9101
m * m * m * m;
Per run(microsecond)=10.3267
m.pow(4);
Per run(microsecond)=431.636

With C++17 if constexpr this could be possible, but otherwise it's not. So currently, a.pow(x) is equivalent to calling std::pow(a[i],x) for each i.
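Until such an overload exists, a small integer power can be built by hand from multiplications. The sketch below uses repeated squaring; ipow is a hypothetical helper, not an Eigen function. It only requires that the type supports copying and operator*, and it has been checked here with plain scalars only, so its use with Eigen array types is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: computes base^n for n >= 1 by repeated squaring,
// so it needs roughly log2(n) squarings instead of n - 1 multiplications.
template <class T>
T ipow(T base, unsigned n) {
    assert(n >= 1);
    T result = base;  // start from base, so no "one" value of type T is needed
    --n;
    while (n > 0) {
        if (n & 1u) result = result * base;  // fold in the current power
        base = base * base;                  // base, base^2, base^4, ...
        n >>= 1;
    }
    return result;
}
```

For an exponent of 4 this performs three multiplications, which is the same count as a * a * a * a.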

Eigen: simplifying expression with Eigen intrinsics

I'm trying to scale all the columns in a matrix by a corresponding value from a vector. Where this value is 0, I want to replace that column with a column from another matrix scaled by a constant. Sounds complicated, but in Matlab it's pretty simple (but probably not fully optimized):
a(:,b ~= 0) = a(:,b ~= 0)./b(b ~= 0);
a(:,b == 0) = c(:,b == 0)*x;
doing it with a for loop in C++ would also be pretty simple:
RowVectorXf b;
Matrix3Xf a, c;
float x;
for (int i = 0; i < b.size(); i++) {
if (b(i) != 0) {
a.col(i) = a.col(i) / b(i);
} else {
a.col(i) = c.col(i) * x;
}
}
Is there a possibility to do this operation (faster) with Eigen intrinsics such as colwise and select?
P.S. I tried to shorten the if condition to the form
a.col(i) = (b(i) != 0) ? (a.col(i) / b(i)) : (c.col(i) * x);
But this does not compile; it fails with: error: operands to ?: have different types ...(long listing of the types)
Edit:
I added the code for testing the answers, here it is:
#include <Eigen/Dense>
#include <stdlib.h>
#include <chrono>
#include <iostream>
using namespace std;
using namespace Eigen;
void flushCache()
{
const int size = 20 * 1024 * 1024; // Allocate 20M. Set much larger than L2
volatile char *c = (char *) malloc(size);
volatile int i = 8;
for (volatile int j = 0; j < size; j++)
c[j] = i * j;
free((void*) c);
}
int main()
{
Matrix3Xf a(3, 1000000);
RowVectorXf b(1000000);
Matrix3Xf c(3, 1000000);
float x = 0.4;
a.setRandom();
b.setRandom();
c.setRandom();
for (int testNumber = 0; testNumber < 4; testNumber++) {
flushCache();
chrono::high_resolution_clock::time_point t1 = chrono::high_resolution_clock::now();
for (int repetition = 0; repetition < 1000; repetition++) {
switch (testNumber) {
case 0:
for (int i = 0; i < b.size(); i++) {
if (b(i) != 0) {
a.col(i) = a.col(i) / b(i);
} else {
a.col(i) = c.col(i) * x;
}
}
break;
case 1:
for (int i = 0; i < b.size(); i++) {
a.col(i) = (b(i) != 0) ? (a.col(i) / b(i)).eval() : (c.col(i) * x).eval();
}
break;
case 2:
for (int i = 0; i < b.size(); i++) {
a.col(i) = (b(i) != 0) ? (a.col(i) * (1.0f / b(i))) : (c.col(i) * x);
}
break;
case 3:
a = b.cwiseEqual(0.0f).replicate< 3, 1 >().select(c * x, a.cwiseQuotient(b.replicate< 3, 1 >()));
break;
default:
break;
}
}
chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast< chrono::milliseconds >(t2 - t1).count();
cout << "duration: " << duration << "ms" << endl;
}
return 0;
}
Sample output is:
duration: 14391ms
duration: 15219ms
duration: 9148ms
duration: 13513ms
By the way, not using setRandom to init the variables, the output is totally different:
duration: 10255ms
duration: 11076ms
duration: 8250ms
duration: 5198ms
@chtz suggests it's because of denormalized values, but I think it's because of branch prediction. Evidence for branch prediction is that initializing with b.setZero(); leads to the same timings as not initializing.

a.col(i) = (b(i) != 0) ? (a.col(i) * (1.0f/b(i))) : (c.col(i) * x);
would work, but only because both expressions would then have the same type, and it will likely not save any time (a ?: expression is essentially translated to the same code as an if-else branch).
If you prefer writing it into one line, the following expression should work:
a = b.cwiseEqual(0.0f).replicate<3,1>().select(c*x, a.cwiseQuotient(b.replicate<3,1>()));
Again, I doubt it will make any significant performance difference.

Why can I not view the run time (nanoseconds)?

I am trying to view what the run-time on my code is. The code is my attempt at Project Euler Problem 5. When I try to output the run time it gives 0ns.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#define MAX_DIVISOR 20
bool isDivisible(long, int);
int main() {
auto begin = std::chrono::high_resolution_clock::now();
int d = 2;
long inc = 1;
long i = 1;
while (d < (MAX_DIVISOR + 1)) {
if ((i % d) == 0) {
inc = i;
i = inc;
d++;
}
else {
i += inc;
}
}
auto end = std::chrono::high_resolution_clock::now();
printf("Run time: %lld ns\n", (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()); // Gives 0 here.
std::cout << "ANS: " << i << std::endl;
system("pause");
return 0;
}

The timing resolution of std::chrono::high_resolution_clock::now() is system dependent.
You can find its order of magnitude with the small piece of code below (edit: this is a more accurate version):
chrono::nanoseconds mn(1000000000); // assuming the resolution is higher
for (int i = 0; i < 5; i++) {
using namespace std::chrono;
nanoseconds dt;
long d = 1000 * pow(10, i);
for (long e = 0; e < 10; e++) {
long j = d + e*pow(10, i)*100;
cout << j << " ";
auto begin = high_resolution_clock::now();
while (j>0)
k = ((j-- << 2) + 1) % (rand() + 100);
auto end = high_resolution_clock::now();
dt = duration_cast<nanoseconds>(end - begin);
cout << dt.count() << "ns = "
<< duration_cast<milliseconds>(dt).count() << " ms" << endl;
if (dt > nanoseconds(0) && dt < mn)
mn = dt;
}
}
cout << "Minimum resolution observed: " << mn.count() << "ns\n";
where k is a global volatile long k; to keep the optimizer from interfering too much.
Under Windows, I get 15 ms here. Beyond that, there are platform-specific alternatives. For Windows, there is a high-performance clock that lets you measure times below the 10 µs range (see http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx) but still not in the nanosecond range.
If you want to time your code very accurately, you could re-execute it in a big loop and divide the total time by the number of iterations.
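A more direct way to probe the clock is to read it in a tight loop until the returned value changes and to keep the smallest step seen. The sketch below does that; observed_tick is a hypothetical helper, and the value it returns is an upper bound on the tick, since it also includes the cost of calling now().

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: busy-waits until Clock::now() changes and returns the
// smallest step observed over `samples` attempts.
template <class Clock = std::chrono::steady_clock>
std::chrono::nanoseconds observed_tick(int samples = 100) {
    auto best = std::chrono::nanoseconds::max();
    for (int s = 0; s < samples; ++s) {
        const auto t0 = Clock::now();
        auto t1 = Clock::now();
        while (t1 == t0) t1 = Clock::now();  // spin until the clock advances
        const auto dt = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
        if (dt < best) best = dt;
    }
    return best;
}
```

Any code whose run time is below this value can be reported as taking 0 ns, which is why such code has to be repeated many times before it can be timed.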

The estimate you are going to make is not precise. A better approach is to measure the CPU time consumed by your program: other processes run concurrently with yours, so the time you are trying to measure can be greatly affected if CPU-intensive tasks are running in parallel with your process.
So my advice is to use an existing profiler if you want to estimate your code's performance.
As for your task: if the OS doesn't provide the needed timer precision, you need to increase the total time you are measuring. The easiest way is to run the program n times and calculate the average. Averaging has the added advantage of reducing errors caused by CPU-intensive tasks running concurrently with your process.
Here is a code snippet showing how I see a possible implementation:
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;
#define MAX_DIVISOR 20
bool isDivisible(long n, int d)
{
return (n % d) == 0;
}
// Returns the smallest number divisible by every integer from 2 to MAX_DIVISOR.
long doRoutine()
{
int d = 2;
long inc = 1;
long i = 1;
while (d < (MAX_DIVISOR + 1))
{
if (isDivisible(i, d))
{
inc = i;
d++;
}
else
{
i += inc;
}
}
return i;
}
int main() {
const int nOfTrials = 1000000;
volatile long answer = 0; // volatile keeps the result observable, so the loop is not removed entirely
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < nOfTrials; ++i)
answer = doRoutine();
auto end = std::chrono::high_resolution_clock::now();
printf("Run time: %lld ns\n", (long long)(std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / nOfTrials)); // average per run
std::cout << "ANS: " << answer << std::endl;
system("pause");
return 0;
}
