Tensor Product Algorithm Optimization - c++

double data[12] = {1, z, z^2, z^3, 1, y, y^2, y^3, 1, x, x^2, x^3};
double result[64] = {1, z, z^2, z^3, y, zy, (z^2)y, (z^3)y, y^2, z(y^2), (z^2)(y^2), (z^3)(y^2), y^3, z(y^3), (z^2)(y^3), (z^3)(y^3), x, zx, (z^2)x, (z^3)x, yx, zyx, (z^2)yx, (z^3)yx, (y^2)x, z(y^2)x, (z^2)(y^2)x, (z^3)(y^2)x, (y^3)x, z(y^3)x, (z^2)(y^3)x, (z^3)(y^3)x, x^2, z(x^2), (z^2)(x^2), (z^3)(x^2), y(x^2), zy(x^2), (z^2)y(x^2), (z^3)y(x^2), (y^2)(x^2), z(y^2)(x^2), (z^2)(y^2)(x^2), (z^3)(y^2)(x^2), (y^3)(x^2), z(y^3)(x^2), (z^2)(y^3)(x^2), (z^3)(y^3)(x^2), x^3, z(x^3), (z^2)(x^3), (z^3)(x^3), y(x^3), zy(x^3), (z^2)y(x^3), (z^3)y(x^3), (y^2)(x^3), z(y^2)(x^3), (z^2)(y^2)(x^3), (z^3)(y^2)(x^3), (y^3)(x^3), z(y^3)(x^3), (z^2)(y^3)(x^3), (z^3)(y^3)(x^3)};
What is the fastest way (fewest operations) to produce result given data? Assume that data is variable in size, but its size is always a multiple of 4 (e.g., 4, 8, 12, etc.).
No Boost. I am trying to keep my dependencies small. STL Algorithms are ok.
HINT: result array size should always be 4^(data size / 4) (e.g., 4, 16, 64, etc.).
BONUS: If you can compute result just given x, y, z
Additional examples:
double data[4] = {1, z, z^2, z^3};
double result[4] = {1, z, z^2, z^3};
double data[8] = {1, z, z^2, z^3, 1, y, y^2, y^3};
double result[16] = { ... };
I chose the accepted answer code after running this benchmark: https://gist.github.com/1232406. Basically, the top two codes were run and the one with the smallest execution time won.

#include <array>
#include <vector>
void Tensor(std::vector<double>& result, double x, double y, double z) {
    result.resize(64); // almost a no-op if already the right size
    double tz = z*z;
    double ty = y*y;
    double tx = x*x;
    // powers of z, y and x; data[0] and data[1] are unused because 1 and z are applied directly below
    std::array<double, 12> data = {{0, 0, tz, tz*z, 1, y, ty, ty*y, 1, x, tx, tx*x}};
    std::vector<double>::iterator iter = result.begin();
    for(int xi=0; xi<4; ++xi) {
        for(int yi=0; yi<4; ++yi) {
            double xy = data[4+yi]*data[8+xi];
            *iter = xy; // a smart compiler can do these four in parallel
            *(++iter) = z*xy;
            *(++iter) = data[2]*xy;
            *(++iter) = data[3]*xy;
            ++iter; // separate increment: workaround for speed
        }
    }
}
There's probably at least one bug in here somewhere, but it should be fast, with no dependencies (outside of std::vector/std::array), and it just takes x, y, z. I avoided recursion though, so it only works for 3 in/64 out. The concept can be applied to any number of parameters; you just have to write out each instantiation yourself.

A good compiler will autovectorize this; I guess none of my compilers are good:
void tensor(const double *restrict data,
int dimensions,
double *restrict result) {
result[0] = 1.0;
for (int i = 0; i < dimensions; i++) {
for (int j = (1 << (i * 2)) - 1; j > -1; j--) {
double alpha = result[j];
{
double *restrict dst = &result[j * 4];
const double *restrict src = &data[(dimensions - 1 - i) * 4];
for (int k = 0; k < 4; k++) dst[k] = alpha * src[k];
}
}
}
}

You should use a dynamic programming approach, that is, reuse previous results. For example, keep the y^2 result and use it when computing (y^2)z instead of computing it again.

#include <vector>
#include <cstddef>
#include <cmath>
void Tensor(std::vector<double>& result, const std::vector<double>& variables, size_t index)
{
double p1 = variables[index];
double p2 = p1*p1;
double p3 = p1*p2;
if (index == variables.size() - 1) {
result.push_back(1);
result.push_back(p1);
result.push_back(p2);
result.push_back(p3);
} else {
Tensor(result, variables, index+1);
ptrdiff_t size = result.size();
for(int j=0; j<size; ++j)
result.push_back(result[j]*p1);
for(int j=0; j<size; ++j)
result.push_back(result[j]*p2);
for(int j=0; j<size; ++j)
result.push_back(result[j]*p3);
}
}
std::vector<double> Tensor(const std::vector<double>& params) {
std::vector<double> result;
size_t rsize = size_t(1) << (2*params.size()); // 4^n entries
result.reserve(rsize);
Tensor(result, params, 0); // start the recursion at the first parameter
return result;
}
int main() {
std::vector<double> params;
params.push_back(3.1415926535);
params.push_back(2.7182818284);
params.push_back(42);
params.push_back(65536);
std::vector<double> result = Tensor(params);
}
I verified that this one compiles and runs (http://ideone.com/IU1eQ). It runs fast, with no dependencies (outside of std::vector). It also takes any number of parameters. Since calling the recursive form is awkward, I made a wrapper. It makes one function call for each parameter, and one dynamic memory allocation (in the wrapper).
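For comparison, the same reuse of previous results can be written without recursion. The following is only a sketch, under the assumption that the parameters are passed as {x, y, z} and that the last one should vary fastest, as in the question's result array. The name tensor_powers is made up for this illustration and is not part of any answer above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds all products p0^a * p1^b * ... with exponents
// 0..3 each, with the last parameter varying fastest in the output.
std::vector<double> tensor_powers(const std::vector<double>& params) {
    assert(params.size() < 32);  // keeps the shift below well defined
    std::vector<double> result;
    result.reserve(std::size_t(1) << (2 * params.size()));  // 4^n entries
    result.push_back(1.0);
    // Walk the parameters from last to first so the last one varies fastest.
    for (std::size_t i = params.size(); i-- > 0;) {
        const double p1 = params[i];
        const double p2 = p1 * p1;
        const double p3 = p2 * p1;
        const std::size_t size = result.size();
        // Each pass reuses the products computed so far instead of recomputing them.
        for (std::size_t j = 0; j < size; ++j) result.push_back(result[j] * p1);
        for (std::size_t j = 0; j < size; ++j) result.push_back(result[j] * p2);
        for (std::size_t j = 0; j < size; ++j) result.push_back(result[j] * p3);
    }
    return result;
}
```

With params = {x, y, z} this should place z at index 1, y at index 4 and x at index 16, which is the layout shown in the question.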

Look into Pascal's pyramid for a fast solution. Useful link 1, useful link 2, useful link 3 and useful link 4.
One more thing: as I see it, this would be the basis of a finite element solver. Writing your own BLAS routines is usually not a good idea. Do not reinvent the wheel! I think you should use a BLAS library such as Intel MKL or a CUDA-based BLAS.


R crashes when calling a Rcpp function in a loop

So I have this Rcpp function in a .cpp file. You'll see that it is calling other custom functions that I don't show for simplicity, but those don't show any problem whatsoever.
// [[Rcpp::export]]
int sim_probability(float present_wealth , int time_left, int n, float mu, float sigma, float r, float gamma, float gu, float gl){
int i;
int count = 0;
float final_wealth;
NumericVector y(time_left);
NumericVector rw(time_left);
for(i=0;i<n;i++){
rw = random_walk(time_left, 0);
y = Y(rw, mu, sigma, r, gamma);
final_wealth = y[time_left-1] - y[0] + present_wealth;
if(final_wealth <= gu && final_wealth >= gl){
count = count + 1;
}
}
return count;
}
Then I can call this function from a .R seamlessly:
library(Rcpp)
sourceCpp("functions.cpp")
sim_probability(present_wealth = 100, time_left = 10, n = 1e3, mu = 0.05, sigma = 0.20, r = 0, gamma = 2, gu = 200, gl = 90)
But, if I call it inside a for loop, no matter how small it is, R crashes without popping any apparent error. The chunk below would make R crash.
for(l in 1:1){
sim_probability(present_wealth = 100, time_left = 10, n = 1e3, mu = 0.05, sigma = 0.20, r = 0, gamma = 2, gu = 200, gl = 90)
}
I've also tried to execute it manually (Ctrl + Enter) many times as fast as I could, and if I'm fast enough it also crashes.
I have tried smaller or bigger loops, both outside and within the function. It also crashes if it's called from another Rcpp function. I know I shouldn't call Rcpp functions in an R loop. Eventually I intend to call it from another Rcpp function (to generate a matrix of data) but it crashes all the same.
I have followed other cases that I've found by googling and tried a few things, such as changing to [] brackets for the arrays' indices (this question) and playing with the gc() garbage collector (as suggested here).
I suspected that something happened with the NumericVector definitions. But as far as I can tell they are declared properly.
It has been fairly pointed out in the comments that this is not a reproducible example. I'll add the missing functions Y() and random_walk() below:
// [[Rcpp::export]]
NumericVector Y(NumericVector path, float mu, float sigma, float r, float gamma){
int time_step, n, i;
time_step = 1;
float theta, y0, prev, inc_W;
theta = (mu - r) / sigma;
y0 = theta / (sigma*gamma);
n = path.size();
NumericVector output(n);
for(i=0;i<n;i++){
if(i == 0){
prev = y0;
inc_W = path[0];
}else{
prev = output[i-1];
inc_W = path[i] - path[i-1];
}
output[i] = prev + (theta / gamma) * (theta * time_step + inc_W);
}
return output;
}
// [[Rcpp::export]]
NumericVector random_walk(int length, float starting_point){
if(length == 1){return starting_point;}
NumericVector output(length);
output[1] = starting_point;
int i;
for(i=0; i<length; i++){output[i+1] = output[i] + R::rnorm(0,1);}
return output;
}
Edit1: Added more code so it is reproducible.
Edit2: I was assigning local variables when calling the functions. That was dumb on my part, but harmless. The same error still persists. But I've fixed that.
Edit3: As it's been pointed out by Dirk in the comments, I was doing a pointless exercise redefining the rnorm(). Now it's removed and fixed.
The question has been solved in the comments by @coatless. I put the answer here to keep it for future readers. The thing is that the random_walk() function wasn't set up correctly.
The problem was that the loop inside the function let the index i+1 run past the end of the vector output (and output[0] was never set to the starting point). An out-of-bounds write like this is undefined behavior: it may appear to work when the function is called once, but it corrupts memory and blows up when the function is called many times in quick succession.
So in order to avoid this error and many others, the function should have been defined as
// [[Rcpp::export]]
NumericVector random_walk(int length, float starting_point){
if(length == 0){return starting_point;}
NumericVector output(length);
output[0] = starting_point;
int i;
for(i=0; i<length-1; i++){output[i+1] = output[i] + R::rnorm(0,1);}
return output;
}
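The corrected loop bounds can also be checked outside of R. Below is a minimal sketch in plain C++, assuming std::vector and std::normal_distribution as stand-ins for NumericVector and R::rnorm. The name random_walk_std is hypothetical and used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the Rcpp function: same loop bounds as the
// corrected version, but on a std::vector so it can be tested without R.
std::vector<double> random_walk_std(std::size_t length, double starting_point,
                                    std::mt19937& rng) {
    std::vector<double> output(length);
    if (length == 0) return output;  // nothing to fill
    std::normal_distribution<double> step(0.0, 1.0);
    output[0] = starting_point;
    for (std::size_t i = 0; i + 1 < length; ++i) {
        // Invariant the original loop violated: i + 1 must be a valid index.
        assert(i + 1 < output.size());
        output[i + 1] = output[i] + step(rng);
    }
    return output;
}
```

The assert documents the bound that has to hold on every iteration; with the original condition i < length it would fire on the last pass instead of silently writing past the end.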

Vectorization of pcl_ros::transformPointCloud

I just noticed that the function pcl_ros::transformPointCloud is not vectorized. Below is the code snippet copied from here.
void transformPointCloud(
const Eigen::Matrix4f& transform,
const sensor_msgs::PointCloud2& in,
sensor_msgs::PointCloud2& out)
{
int x_idx = pcl::getFieldIndex(in, "x");
int y_idx = pcl::getFieldIndex(in, "y");
int z_idx = pcl::getFieldIndex(in, "z");
Eigen::Array4i xyz_offset(
in.fields[x_idx].offset,
in.fields[y_idx].offset,
in.fields[z_idx].offset, 0);
// most of the code is not shown here
for (size_t i = 0; i < in.width * in.height; ++i)
{
Eigen::Vector4f pt(*(float*)&in.data[xyz_offset[0]],
*(float*)&in.data[xyz_offset[1]],
*(float*)&in.data[xyz_offset[2]], 1);
Eigen::Vector4f pt_out;
pt_out = transform * pt;
memcpy(&out.data[xyz_offset[0]], &pt_out[0], sizeof(float));
memcpy(&out.data[xyz_offset[1]], &pt_out[1], sizeof(float));
memcpy(&out.data[xyz_offset[2]], &pt_out[2], sizeof(float));
xyz_offset += in.point_step;
}
}
The code above iterates over each point in the point cloud and multiplies it by the transformation.
I am wondering if this can be vectorized so as to minimize the elapsed time.
I am looking for suggestions to implement/incorporate the same. I am using ROS Indigo (PCL 1.7.1) on Ubuntu 14.04 LTS PC.
Assuming the x, y and z field offsets are 0, 4 and 8 bytes and you don't care about all the special case handling of non-finite data, etc., you can simplify the inner loop to something like this:
void foo(char* data_out, Eigen::Index N, int out_step, const Eigen::Matrix4f& T, const char* data_in, int in_step)
{
for(Eigen::Index i=0; i<N; ++i)
{
Eigen::Vector3f::Map((float*)(data_out + i*out_step)).noalias()
= (T * Eigen::Vector3f::Map((const float*)(data_in + i*in_step)).homogeneous()).head<3>();
}
}
N would be in.width * in.height and out_step and in_step would be the corresponding point_step members. Minor possible improvement: You can copy T into a local variable so it does not need to be loaded from memory every time.
If point_step is a multiple of sizeof(float) you could also reduce this to a single assignment, using out_stride = out.point_step / sizeof(float), etc. However, this usually generates less efficient code than the version above (may change in future versions of Eigen).
void foo2(float* data_out, Eigen::Index N, int out_stride, const Eigen::Matrix4f& T, const float* data_in, int in_stride)
{
Eigen::Matrix3Xf::Map(data_out, 3, N, Eigen::OuterStride<>(out_stride)).noalias()
= (T *
Eigen::Matrix3Xf::Map(data_in, 3, N, Eigen::OuterStride<>(in_stride))
.colwise().homogeneous()
).topRows<3>();
}
Godbolt-Link
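For readers who want to experiment without Eigen, the per-point loop can be sketched in plain C++. This is not the Eigen code above, only a dependency-free illustration under the same assumptions: x, y and z are three consecutive floats at the start of every point record, and the transform is given as a row-major 4x4 array. The name transform_points and its parameter list are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical dependency-free sketch of the strided per-point transform.
// T is a row-major 4x4 matrix; only its top three rows are applied.
void transform_points(unsigned char* data_out, std::size_t n, std::size_t out_step,
                      const float T[16],
                      const unsigned char* data_in, std::size_t in_step) {
    assert(out_step >= 3 * sizeof(float) && in_step >= 3 * sizeof(float));
    for (std::size_t i = 0; i < n; ++i) {
        float p[3];
        // memcpy sidesteps alignment and aliasing problems of pointer casts.
        std::memcpy(p, data_in + i * in_step, sizeof p);
        float q[3];
        for (int r = 0; r < 3; ++r) {
            // Homogeneous multiply with an implicit w = 1.
            q[r] = T[4 * r + 0] * p[0] + T[4 * r + 1] * p[1]
                 + T[4 * r + 2] * p[2] + T[4 * r + 3];
        }
        std::memcpy(data_out + i * out_step, q, sizeof q);
    }
}
```

A loop of this shape has no dependency between iterations, so whether it gets vectorized is up to the compiler and the point layout; measuring both versions is the only reliable way to tell.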

Passing double parameters but getting Jet<double, 6> when using Ceres Solver

I'm new to Ceres Solver. I add the residual block using
problem.AddResidualBlock( new ceres::AutoDiffCostFunction<Opt, 1, 6> (new Opt(Pts[i][j].x, Pts[i][j].y, Pts[i][j].z, Ns[i].at<double>(0, 0), Ns[i].at<double>(1, 0), Ns[i].at<double>(2, 0), Ds[i], weights[i]) ),
NULL,
param );
where param is double[6];
struct Opt
{
const double ptX, ptY, ptZ, nsX, nsY, nsZ, ds, w;
Opt( double ptx, double pty, double ptz, double nsx, double nsy, double nsz, double ds1, double w1):
ptX(ptx), ptY(pty), ptZ(ptz), nsX(nsx), nsY(nsy), nsZ(nsz), ds(ds1), w(w1) {}
template<typename T>
bool operator()(const T* const x, T* residual) const
{
Mat R(3, 3, CV_64F), r(1, 3, CV_64F);
Mat inverse(3,3, CV_64F);
T newP[3];
T xyz[3];
for (int i = 0; i < 3; i++){
r.at<T>(i) = T(x[i]);
cout<<x[i]<<endl;
}
Rodrigues(r, R);
inverse = R.inv();
newP[0]=T(ptX)-x[3];
newP[1]=T(ptY)-x[4];
newP[2]=T(ptZ)-x[5];
xyz[0]= inverse.at<T>(0, 0)*newP[0] + inverse.at<T>(0, 1)*newP[1] + inverse.at<T>(0, 2)*newP[2];
xyz[1] = inverse.at<T>(1, 0)*newP[0] + inverse.at<T>(1, 1)*newP[1] + inverse.at<T>(1, 2)*newP[2];
xyz[2] = inverse.at<T>(2, 0)*newP[0] + inverse.at<T>(2, 1)*newP[1] + inverse.at<T>(2, 2)*newP[2];
T ds1 = T(nsX) * xyz[0] + T(nsY) * xyz[1] + T(nsZ) * xyz[2];
residual[0] = (ds1 - T(ds)) * T(w);
}
};
but when I output the x[0], I got this:
[-1.40926 ; 1, 0, 0, 0, 0, 0]
After I changed the type of x to double,
I got this error:
note: no known conversion for argument 1 from ‘const ceres::Jet<double, 6>* const’ to ‘const double*’
in
bool operator()(const double* const x, double* residual) const
What's wrong with my code?
Thanks a lot!
I am guessing you are using cv::Mat.
The reason the functor is templated is that Ceres evaluates it with doubles when it needs just the residuals, and with ceres::Jet objects when it needs to compute the Jacobian. So your attempt to fill r as
for (int i = 0; i < 3; i++){
r.at<T>(i) = T(x[i]);
cout<<x[i]<<endl;
}
is trying to convert a Jet into a double, which is what the compiler is correctly complaining about.
You can rewrite your code as follows (I have not compiled it, so there may be a minor typo or two).
// requires #include <ceres/rotation.h>
template<typename T>
bool operator()(const T* const x, T* residual) const {
    // x[0..2] is the angle-axis rotation; negating it gives the inverse rotation
    const T inverse_rotation[3] = {-x[0], -x[1], -x[2]};
    // x[3..5] is the translation
    const T newP[3] = {ptX - x[3], ptY - x[4], ptZ - x[5]};
    T xyz[3];
    ceres::AngleAxisRotatePoint(inverse_rotation, newP, xyz);
    const T ds1 = nsX * xyz[0] + nsY * xyz[1] + nsZ * xyz[2];
    residual[0] = (ds1 - ds) * w;
    return true;
}
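To see why the functor has to stay templated, it can help to evaluate one functor with two scalar types. The sketch below is not Ceres code: Dual is a hypothetical, stripped-down stand-in for the idea behind ceres::Jet (a value plus one derivative), and ToyResidual is an invented one-parameter residual.

```cpp
#include <cassert>

// Hypothetical minimal dual number: value v and derivative d with respect
// to a single parameter. ceres::Jet carries N derivatives instead of one.
struct Dual {
    double v;
    double d;
};
inline Dual operator-(Dual a, Dual b) { return {a.v - b.v, a.d - b.d}; }
inline Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.v * b.d + a.d * b.v}; }

// Invented toy residual r(x) = (x*x - target) * weight, written once for any T.
struct ToyResidual {
    double target, weight;
    template <typename T>
    bool operator()(const T* const x, T* residual) const {
        // Constants are lifted into T, so no T is ever converted back to double.
        residual[0] = (x[0] * x[0] - T{target}) * T{weight};
        return true;
    }
};

// Usage: the same functor yields the residual alone, or residual plus derivative.
inline void demo_toy_residual() {
    ToyResidual f{1.0, 2.0};
    double x = 3.0, r = 0.0;
    f(&x, &r);
    assert(r == 16.0);                      // (9 - 1) * 2
    Dual xd{3.0, 1.0}, rd{0.0, 0.0};        // seed dx/dx = 1
    f(&xd, &rd);
    assert(rd.v == 16.0 && rd.d == 12.0);   // dr/dx = 2 * x * weight
}
```

Storing an intermediate in a double-only container such as a cv::Mat of CV_64F would drop the derivative part, which is why the question's version cannot be differentiated automatically.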
Automatic differentiation (AutoDiff) needs a templated cost function to keep track of the operations.
Please take a look at the ceres documentation (http://ceres-solver.org/nnls_modeling.html#autodiffcostfunction). There are a lot of nice examples too. I used them as starting point for my first ceres experiments.
I'm not sure if you can use ceres cost functions with OpenCV functions. In most cases Eigen is used to make the cost function.
Ceres comes with a lot of "ready-to-use" components for cost functions like yours.

How to use arrays of double[4][4] contained in a vector?

I would like to ask the community about my problem.
I have a series of double[4][4] arrays in this format:
double T1[4][4] = {
{-0.9827, -0.1811, -0.0388, 0.1234},
{0.0807, -0.2303, -0.9698, 0.1755},
{0.1666, -0.9561, 0.2409, 0.6729},
{0, 0, 0, 1.00000 }};
double T2[4][4] = {
{-0.8524, -0.5029, -0.1432, 0.1963},
{0.1580, 0.0135, -0.9874, 0.1285},
{0.4984, -0.8643, 0.0680, 0.6237},
{0, 0, 0, 1.00000 }};
T3, T4, and so on....
I need to put all of these arrays in a container and pick them up one at a time from another function that needs arrays in that format, because it performs these computations:
int verifica_punti(punto P, Mat& I, double TC[4][4], const double fc[2],const double KC[5], const double cc[2],const double alpha){
//punto
double P1[4] = {P.x, P.y, P.z, 1.0};
//iniz
double Pc[3] = {TC[0][3], TC[1][3], TC[2][3]};
//calc
for(int i=0; i<3; i++){
for(int j=0; j<3; j++){
Pc[i] += TC[i][j] * P1[j];
}
}
//norm
double PN[2] = { Pc[0]/Pc[2], Pc[1]/Pc[2] };
Searching this site and the internet, I've found some examples of how to do this, but they don't work in my case. Using vector, array, queue... I don't understand a thing.
I paste my code here and ask you to help me fix this problem.
This is my code:
//array of TC
typedef array<array<double,4>,4> Matrix;
//single TC
Matrix T1 = {{
{{-1.0000, 0.0000, -0.0000, 0.1531}},
{{0.0000, 0.0000, -1.0000, 0.1502 }},
{{-0.0000, -1.0000, -0.0000, 1.0790}},
{{0 , 0, 0, 1.0000 }}}};
Matrix T2 = {{
{{-1.0000, 0.0009, 0.0019, 0.1500}},
{{-0.0021, -0.4464, -0.8948, 0.1845}},
{{0.0000, -0.8948, 0.4464, 0.8094 }},
{{ 0, 0, 0, 1.0000 }}}};
etc. Then I declare the container and fill it:
vector <Matrix> TCS;
TCS.push_back(T1);
TCS.push_back(T2);
TCS.push_back(T3);
TCS.push_back(T4);
TCS.push_back(T5);
TCS.push_back(T6);
TCS.push_back(T7);
TCS.push_back(T8);
TCS.push_back(T9);
Now, how can I obtain a single matrix in double[4][4] format to pass it to the function "verifica_punti" (shown above)?
I need one TC at a time, but in FIFO order (the first one I pushed is the first one I need to pop and use).
How can I do this? I've written
double temp[4][4] = TCS.pop_back()
or double temp[4][4] = TCS[i];
but that isn't correct.
I'm on Visual C++ 2010 on Windows 7 64-bit.
Help me please :-( thanks in advance.
with
typedef array<array<double,4>,4> Matrix;
vector <Matrix> TCS;
You have
//double temp[4][4] = TCS[i]; // Illegal
Matrix m1 = TCS[i]; // legal
const Matrix& m2 = TCS[i]; // legal, and avoid a copy.
Now, you have to change:
int verifica_punti(punto P, Mat& I, double TC[][4], const double fc[], const double KC[], const double cc[], const double alpha);
to
int verifica_punti(punto P, Mat& I, Matrix& TC, const double fc[], const double KC[], const double cc[], const double alpha);
std::array< std::array<double,4>, 4> and double[4][4] are distinct types. The former encapsulates the latter so that it's copyable and can be used in containers, and it has a practically identical interface. But you can't use them interchangeably.
You already have your typedef, so use that:
while (!TCS.empty()) {
// get the last one
Matrix m = TCS.back();
/* do stuff with m */
// pop the last one out
TCS.pop_back();
}
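Note that back() and pop_back() hand the matrices out in reverse order of insertion. If the first-in, first-out order asked for in the question matters, one possible sketch is to store them in a std::deque and take elements from the front. The helpers pop_oldest and translation_x are hypothetical names for this illustration; translation_x merely stands in for a function such as verifica_punti that receives the matrix by const reference.

```cpp
#include <array>
#include <cassert>
#include <deque>

typedef std::array<std::array<double, 4>, 4> Matrix;

// Hypothetical helper: removes and returns the matrix that was pushed first.
Matrix pop_oldest(std::deque<Matrix>& queue) {
    assert(!queue.empty());          // precondition: caller checks emptiness
    Matrix oldest = queue.front();   // copy before the element is destroyed
    queue.pop_front();
    return oldest;
}

// Hypothetical stand-in for the consuming function: indexing works exactly
// as it does on a double[4][4].
double translation_x(const Matrix& TC) {
    return TC[0][3];
}
```

A consuming loop would then read while (!TCS.empty()) { Matrix m = pop_oldest(TCS); /* use m */ }. With a std::vector the same order can be had by iterating from begin() to end() without removing anything.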

Need help optimizing code (minimum image convention)

I have written some simulation code and am using the "randomly break in GDB" method of debugging. I am finding that 99.9% of my program's time is spent in this routine (it's the minimum image convention):
inline double distanceSqPeriodic(double const * const position1, double const * const position2, double boxWidth) {
double xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
}
The optimizations I have performed so far (maybe not very significant ones):
Return the square of the distance instead of the square root
Inline it
Const what I can
No standard library bloat
Compiling with every g++ optimization flag I can think of
I am running out of things I can do with this. Maybe I could use floats instead of doubles but I would prefer that be a last resort. And maybe I could somehow use SIMD on this, but I've never done that so I imagine that's a lot of work. Any ideas?
Thanks
First, you're not using the right algorithm. What if the two points are greater than boxWidth apart? Second, if you have multiple particles, calling a single function that does all of the distance calculations and places the results in an output buffer is going to be significantly more efficient. Inlining helps reduce some of this, but not all. Any of the precalculation -- like dividing the box length by 2 in your algorithm -- is going to be repeated when it doesn't need to be.
Here is some SIMD code to do the calculation. You need to compile with -msse4. Using -O3, on my machine (macbook pro, llvm-gcc-4.2), I get a speed up of about 2x. This does require using 32bit floats instead of double precision arithmetic.
SSE really isn't that complicated, it just looks terrible. e.g. instead of writing a*b, you have to write the clunky _mm_mul_ps(a,b).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>
// you can compile this code with -DDOUBLE to try using doubles vs. floats
// in the unoptimized code. The SSE code uses only floats.
#ifdef DOUBLE
typedef double real;
#else
typedef float real;
#endif
static inline __m128 loadFloat3(const float* value) {
// Load (x,y,z) into a SSE register, leaving the last entry
// set to zero.
__m128 x = _mm_load_ss(&value[0]);
__m128 y = _mm_load_ss(&value[1]);
__m128 z = _mm_load_ss(&value[2]);
__m128 xy = _mm_movelh_ps(x, y);
return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0));
}
int fdistanceSqPeriodic(float* position1, float* position2, const float boxWidth,
float* out, const int n_points) {
int i;
__m128 r1, r2, r12, s12, r12_2, s, box, invBox;
box = _mm_set1_ps(boxWidth);
invBox = _mm_div_ps(_mm_set1_ps(1.0f), box);
for (i = 0; i < n_points; i++) {
r1 = loadFloat3(position1);
r2 = loadFloat3(position2);
r12 = _mm_sub_ps(r1, r2);
s12 = _mm_mul_ps(r12, invBox);
s12 = _mm_sub_ps(s12, _mm_round_ps(s12, _MM_FROUND_TO_NEAREST_INT));
r12 = _mm_mul_ps(box, s12);
r12_2 = _mm_mul_ps(r12, r12);
// double horizontal add instruction accumulates the sum of
// all four elements into each of the elements
// (e.g. s.x = s.y = s.z = s.w = r12_2.x + r12_2.y + r12_2.z + r12_2.w)
s = _mm_hadd_ps(r12_2, r12_2);
s = _mm_hadd_ps(s, s);
_mm_store_ss(out++, s);
position1 += 3;
position2 += 3;
}
return 1;
}
inline real distanceSqPeriodic(real const * const position1, real const * const position2, real boxWidth) {
real xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
}
int main(void) {
real* position1;
real* position2;
real* output;
int n_runs = 10000000;
posix_memalign((void**) &position1, 16, n_runs*3*sizeof(real));
posix_memalign((void**) &position2, 16, n_runs*3*sizeof(real));
posix_memalign((void**) &output, 16, n_runs*sizeof(real));
real boxWidth = 1.8;
real result = 0;
int i;
clock_t t;
#ifdef OPT
printf("Timing optimized SSE implementation\n");
#else
printf("Timing original implementation\n");
#endif
#ifdef DOUBLE
printf("Using double precision\n");
#else
printf("Using single precision\n");
#endif
t = clock();
#ifdef OPT
fdistanceSqPeriodic(position1, position2, boxWidth, output, n_runs);
#else
for (i = 0; i < n_runs; i++) {
*output = distanceSqPeriodic(position1, position2, boxWidth);
position1 += 3;
position2 += 3;
output++;
}
#endif
t = clock() - t;
printf("It took me %d clicks (%f seconds).\n", (int) t, ((float)t)/CLOCKS_PER_SEC);
}
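The rounding step in the SSE loop (subtracting the nearest integer from the scaled separation) also covers points that are more than one box width apart. A scalar sketch of the same idea, assuming double precision and a cubic box, might look like this. The name distanceSqPeriodicRound is chosen for this illustration only.

```cpp
#include <cassert>
#include <cmath>

// Scalar sketch of the round-based wrap: each component is shifted by the
// nearest whole number of box widths before squaring.
inline double distanceSqPeriodicRound(const double* position1,
                                      const double* position2,
                                      double boxWidth) {
    assert(boxWidth > 0.0);
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        double d = position2[i] - position1[i];
        d -= boxWidth * std::round(d / boxWidth);  // now |d| <= boxWidth / 2
        sum += d * d;
    }
    return sum;
}
```

Whether this is faster than the branching version depends on how the compiler lowers std::round, so it is worth timing on the target machine before adopting it.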
You may want to use fabs (standardized in ISO C90), since it can usually be reduced to a single non-branching instruction.
Return the square of the distance instead of the square root
That's a good idea as long as you are comparing squares to squares.
Inline it
This is sometimes a counter-optimization: Inlined code takes up space in the execution pipeline/cache, whether it is branched to or not.
Often it makes no difference because the compiler has the final word on whether to inline or not.
Const what I can
Normally no difference at all.
No standard library bloat
What bloat?
Compiling with every g++ optimization flag I can think of
That's good: Leave most optimizations to the compiler. Only once you have measured your real bottleneck, and determined that it is significant, should you invest in hand-optimizing.
What you could try is to make your code branch-free. Without using bitmasks, this may look like this:
//if (z > zhw)
// z -= boxWidths[2];
//else if (z < -zhw)
// z += boxWidths[2];
const double z_a[] = {
z,
z - boxWidths[2]
};
z = z_a[z>zhw];
...
or
z -= (z>zhw) * boxWidths[2];
However, there is no guarantee that this is faster. Your compiler may now have a harder time identifying SIMD opportunities in your code, or the branch predictor may already be doing a good job because most of the time you take the same code paths through your function.
You need to get rid of the comparisons, as those are hard to predict.
The function to be implemented is:
/ / /\ /\
/ / / \/ \
----0----- or ------------ , as (-x)^2 == x^2
/ /
/ /
The latter is a result of two abs statements:
x = abs(half-abs(diff))-half;
The code
double tst(double a[4], double b[4], double half)
{
double sum=0.0,t;
int i;
for (i=0;i<3;i++) { t=fabs(fabs(b[i]-a[i])-half)-half; sum+=t*t;}
return sum;
}
beats the original implementation by a factor of four (+some) -- and at this point there's not even full parallelism: only the lower half of xmm registers are used.
With parallel processing of x && y, there's a theoretical gain of about 50% to be achieved. Using floats instead of doubles could in theory make it still about 3x faster.
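As a sanity check on the two-fabs form, here is a small sketch that places it next to the branching version from the question. It assumes half is boxWidth / 2; the function names are made up for this comparison. The two forms compute the same squared distance mathematically, although results may differ in the last bits because of rounding.

```cpp
#include <cassert>
#include <cmath>

// Branching reference, following the routine in the question.
inline double distanceSqBranching(const double* a, const double* b, double boxWidth) {
    assert(boxWidth > 0.0);
    const double half = boxWidth / 2.0;
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        double d = b[i] - a[i];
        if (d > half) d -= boxWidth;
        else if (d < -half) d += boxWidth;
        sum += d * d;
    }
    return sum;
}

// Branch-free variant: fabs(fabs(d) - half) - half is minus the wrapped
// distance, and the sign disappears when squaring.
inline double distanceSqFabs(const double* a, const double* b, double boxWidth) {
    assert(boxWidth > 0.0);
    const double half = boxWidth / 2.0;
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        const double t = std::fabs(std::fabs(b[i] - a[i]) - half) - half;
        sum += t * t;
    }
    return sum;
}
```

A check like this is cheap to keep around as a regression test while experimenting with faster variants.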