Vectorization of pcl_ros::transformPointCloud - c++

I just noticed that the function pcl_ros::transformPointCloud is not vectorized. Below is the code snippet copied from here.
void transformPointCloud(
  const Eigen::Matrix4f& transform,
  const sensor_msgs::PointCloud2& in,
  sensor_msgs::PointCloud2& out)
{
  int x_idx = pcl::getFieldIndex(in, "x");
  int y_idx = pcl::getFieldIndex(in, "y");
  int z_idx = pcl::getFieldIndex(in, "z");
  Eigen::Array4i xyz_offset(
    in.fields[x_idx].offset,
    in.fields[y_idx].offset,
    in.fields[z_idx].offset, 0);
  // most of the code is not shown here
  for (size_t i = 0; i < in.width * in.height; ++i)
  {
    Eigen::Vector4f pt(*(float*)&in.data[xyz_offset[0]],
                       *(float*)&in.data[xyz_offset[1]],
                       *(float*)&in.data[xyz_offset[2]], 1);
    Eigen::Vector4f pt_out;
    pt_out = transform * pt;

    memcpy(&out.data[xyz_offset[0]], &pt_out[0], sizeof(float));
    memcpy(&out.data[xyz_offset[1]], &pt_out[1], sizeof(float));
    memcpy(&out.data[xyz_offset[2]], &pt_out[2], sizeof(float));

    xyz_offset += in.point_step;
  }
}
The code above iterates over each point in the point cloud and multiplies it by the transformation.
I am wondering whether this can be vectorized so as to minimize the elapsed time.
I am looking for suggestions on how to implement/incorporate that. I am using ROS Indigo (PCL 1.7.1) on Ubuntu 14.04 LTS.

Assuming the x, y and z fields sit at offsets 0, 4 and 8, and you don't care about all the special-case handling of non-finite data, etc., you can simplify the inner loop to something like this:
void foo(char* data_out, Eigen::Index N, int out_step, const Eigen::Matrix4f& T, const char* data_in, int in_step)
{
    for (Eigen::Index i = 0; i < N; ++i)
    {
        Eigen::Vector3f::Map((float*)(data_out + i*out_step)).noalias()
            = (T * Eigen::Vector3f::Map((const float*)(data_in + i*in_step)).homogeneous()).head<3>();
    }
}
N would be in.width * in.height and out_step and in_step would be the corresponding point_step members. Minor possible improvement: You can copy T into a local variable so it does not need to be loaded from memory every time.
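For illustration, wiring foo into transformPointCloud could look roughly like this. This is only a sketch under the assumptions above (x, y and z stored as consecutive floats within each point, and out.data already allocated to the same layout as in.data):
const Eigen::Index N = static_cast<Eigen::Index>(in.width) * in.height;
foo(reinterpret_cast<char*>(&out.data[0]) + xyz_offset[0], N,
    static_cast<int>(out.point_step), transform,
    reinterpret_cast<const char*>(&in.data[0]) + xyz_offset[0],
    static_cast<int>(in.point_step));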
If point_step is a multiple of sizeof(float) you could also reduce this to a single assignment, using out_stride = out.point_step / sizeof(float), etc. However, this usually generates less efficient code than the version above (may change in future versions of Eigen).
void foo2(float* data_out, Eigen::Index N, int out_stride, const Eigen::Matrix4f& T, const float* data_in, int in_stride)
{
    Eigen::Matrix3Xf::Map(data_out, 3, N, Eigen::OuterStride<>(out_stride)).noalias()
        = (T *
           Eigen::Matrix3Xf::Map(data_in, 3, N, Eigen::OuterStride<>(in_stride))
               .colwise().homogeneous()
          ).topRows<3>();
}
Godbolt-Link

Related

Adding 3D vectors using SIMD intrinsics

I've got two streams of 3D vectors which I'd like to add using x86 AVX2 intrinsics. I'm using the GNU compiler 11.1.0. Hopefully, the code illustrates what I want to do:
// Example program
#include <utility> // std::size_t
#include <immintrin.h>
struct v3
{
float data[3] = {};
};
void add(const v3* a, const v3* b, v3* c, const std::size_t& n)
{
// c <- a + b
for (auto i = std::size_t{}; i < n; i += 2) // 2 vector3s at a time ~6 data
{
// masking
// [95:0] of a[i] move into [255:128], [95:0] of a[i+1] move into [255:128] of *another* 256-bit register
// ^same with b[i]
static const auto p1_mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, 0, 0, 0);
static const auto p2_mask = _mm256_setr_epi32(0, 0, 0, -1, -1, -1, 0, 0);
const auto p1_leftop_packed = _mm256_maskload_ps(a[i].data, p1_mask);
const auto p2_leftop_packed = _mm256_maskload_ps(a[i].data, p2_mask);
const auto p1_rightop_packed = _mm256_maskload_ps(b[i].data, p1_mask);
const auto p2_rightop_packed = _mm256_maskload_ps(b[i].data, p2_mask);
// addition is being done inefficiently with 2 AVX2 instructions!
const auto result1_packed = _mm256_add_ps(p1_leftop_packed, p1_rightop_packed);
const auto result2_packed = _mm256_add_ps(p2_leftop_packed, p2_rightop_packed);
// store them back
_mm256_maskstore_ps(c[i].data, p1_mask, result1_packed);
_mm256_maskstore_ps(c[i].data, p2_mask, result2_packed);
}
}
int main()
{
// data
const auto n = std::size_t{1000};
v3 a[n] = {};
v3 b[n] = {};
v3 c[n] = {};
// run
add(a, b, c, n);
return 0;
}
The above code works but the performance is quite terrible. To correct it, I think I need a version which looks approximately like the following:
// c <- a + b
for (auto i = std::size_t{}; i < n; i += 2) // 2 vector3s at a time ~6 data
{
// masking
// [95:0] of a[i] move into [255:128], [95:0] of a[i+1] in [127:0]
const auto leftop_packed = /*code required here*/;
const auto rightop_packed = /*code required here*/;
// addition is being done with only 1 AVX2 instruction
const auto result_packed = _mm256_add_ps(leftop_packed, rightop_packed);
// store them back
// [95:0] of result_packed move into c[i], [223:128] of result_packed into c[i+1]
/*code required here*/
}
How do I achieve this? I will gladly provide any additional information when needed. Any help would be much appreciated.
The two comments below say essentially the same thing. They are good advice. Do as they say.
I think you can just load 8 floats at a time and then if you have anything left over at the end you can do a masked store (not sure about this part). – LHLaurini
Use char*, float*, or __m256* to work in 32-byte or 8-float chunks, ignoring vector boundaries since you're just doing pure vertical addition. float* should be good for cleanup of the last up-to-7 floats – Peter Cordes
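To make those comments concrete, here is a minimal sketch of the flat-array approach (the function name add_flat is illustrative; it treats the v3 arrays as one contiguous float buffer of length 3*n, adds 8 floats per iteration with unaligned loads/stores, and finishes the last few floats with a scalar loop):
#include <immintrin.h>
#include <cstddef>
void add_flat(const float* a, const float* b, float* c, std::size_t n_floats)
{
    std::size_t i = 0;
    for (; i + 8 <= n_floats; i += 8)              // 8 floats = 32 bytes per step
    {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n_floats; ++i)                      // scalar cleanup, at most 7 floats
        c[i] = a[i] + b[i];
}
// e.g. add_flat(a[0].data, b[0].data, c[0].data, 3 * n);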
The Eigen library supports vectorization. It also has a lot of the vector/matrix math algorithms already implemented, and quite efficiently too. If you can, I'd recommend looking into using it instead of rolling your own logic.
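As a sketch of what that could look like here (it assumes the v3 array may be viewed as one contiguous float buffer, the same assumption as the flat-array approach above; add_eigen is an illustrative name):
#include <Eigen/Core>
void add_eigen(const v3* a, const v3* b, v3* c, std::size_t n)
{
    const Eigen::Index len = static_cast<Eigen::Index>(3 * n);
    Eigen::Map<const Eigen::ArrayXf> va(a[0].data, len);
    Eigen::Map<const Eigen::ArrayXf> vb(b[0].data, len);
    Eigen::Map<Eigen::ArrayXf>       vc(c[0].data, len);
    vc = va + vb;  // Eigen emits vectorized code for this expression
}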

pass parameters of double but get Jet<double,6>when using ceres solver

I'm new to Ceres Solver. When adding the residual block using
problem.AddResidualBlock( new ceres::AutoDiffCostFunction<Opt, 1, 6> (new Opt(Pts[i][j].x, Pts[i][j].y, Pts[i][j].z, Ns[i].at<double>(0, 0), Ns[i].at<double>(1, 0), Ns[i].at<double>(2, 0), Ds[i], weights[i]) ),
NULL,
param );
where param is double[6];
struct Opt
{
const double ptX, ptY, ptZ, nsX, nsY, nsZ, ds, w;
Opt( double ptx, double pty, double ptz, double nsx, double nsy, double nsz, double ds1, double w1):
ptX(ptx), ptY(pty), ptZ(ptz), nsX(nsx), nsY(nsy), nsZ(nsz), ds(ds1), w(w1) {}
template<typename T>
bool operator()(const T* const x, T* residual) const
{
Mat R(3, 3, CV_64F), r(1, 3, CV_64F);
Mat inverse(3,3, CV_64F);
T newP[3];
T xyz[3];
for (int i = 0; i < 3; i++){
r.at<T>(i) = T(x[i]);
cout<<x[i]<<endl;
}
Rodrigues(r, R);
inverse = R.inv();
newP[0]=T(ptX)-x[3];
newP[1]=T(ptY)-x[4];
newP[2]=T(ptZ)-x[5];
xyz[0]= inverse.at<T>(0, 0)*newP[0] + inverse.at<T>(0, 1)*newP[1] + inverse.at<T>(0, 2)*newP[2];
xyz[1] = inverse.at<T>(1, 0)*newP[0] + inverse.at<T>(1, 1)*newP[1] + inverse.at<T>(1, 2)*newP[2];
xyz[2] = inverse.at<T>(2, 0)*newP[0] + inverse.at<T>(2, 1)*newP[1] + inverse.at<T>(2, 2)*newP[2];
T ds1 = T(nsX) * xyz[0] + T(nsY) * xyz[1] + T(nsZ) * xyz[2];
residual[0] = (ds1 - T(ds)) * T(w);
}
};
but when I output the x[0], I got this:
[-1.40926 ; 1, 0, 0, 0, 0, 0]
after I change the type of the x to double
I got this error :
note: no known conversion for argument 1 from ‘const ceres::Jet<double, 6>* const’ to ‘const double*’
in
bool operator()(const double* const x, double* residual) const
what's wrong with my codes?
Thanks a lot!
I am guessing you are using cv::Mat.
The reason the functor is templated is because Ceres evaluates it using doubles when it needs just the residuals, and evaluates with ceres::Jet objects when it needs to compute the Jacobian. So your attempt to fill r as
for (int i = 0; i < 3; i++){
r.at<T>(i) = T(x[i]);
cout<<x[i]<<endl;
}
is trying to convert a Jet into a double, which is what the compiler is correctly complaining about.
You can re-write your code as follows (I have not compiled it, so there may be a minor typo or two):
template<typename T>
bool operator()(const T* const x, T* residual) const {
    const T inverse_rotation[3] = {-x[0], -x[1], -x[2]};
    const T newP[3] = {ptX - x[3], ptY - x[4], ptZ - x[5]};
    T xyz[3];
    ceres::AngleAxisRotatePoint(inverse_rotation, newP, xyz);
    const T ds1 = nsX * xyz[0] + nsY * xyz[1] + nsZ * xyz[2];
    residual[0] = (ds1 - ds) * w;
    return true;
}
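Note that ceres::AngleAxisRotatePoint comes from Ceres' rotation utilities, so the file defining the functor also needs
#include <ceres/rotation.h>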
Automatic differentiation (AutoDiff) needs a templated cost function to keep track of the operations.
Please take a look at the ceres documentation (http://ceres-solver.org/nnls_modeling.html#autodiffcostfunction). There are a lot of nice examples too. I used them as starting point for my first ceres experiments.
I'm not sure if you can use ceres cost functions with OpenCV functions. In most cases Eigen is used to make the cost function.
Ceres comes with a lot of "ready-to-use" components for cost functions like yours.

How to speed up this GSL code for selecting a submatrix?

I wrote a very simple function in GSL, to select a submatrix from an existing matrix in a struct.
EDIT: I had timed VERY INCORRECTLY and didn't notice the changed number of zeros in front. Still, I hope this can be sped up.
For 100x100 submatrices of a 10000x10000 matrix, it takes 1.2E-5 seconds. So repeating that 1E4 times takes 50 times longer than I need to diagonalise the 100x100 matrix.
EDIT:
I realise it happens even if I comment out everything except return(0);
Thus, I theorize it must be something about struct TOWER. This is how TOWER looks:
struct TOWER
{
int array_level[TOWERSIZE];
int array_window[TOWERSIZE];
gsl_matrix *matrix_ordered_covariance;
gsl_matrix *matrix_peano_covariance;
double array_angle_tw[XISTEP];
double array_correl_tw[XISTEP];
gsl_interp_accel *acc_correl; // interpolating for correlation
gsl_spline *spline_correl;
double array_all_eigenvalues[TOWERSIZE]; //contains all eiv. of whole matrix
std::vector< std::vector<double> > cropped_peano_covariance, peano_mask;
};
Below comes my function!
/* --- --- */
int monolevelsubmatrix(int i, int j, struct TOWER *tower, gsl_matrix *result) //relying on spline!! //must add auto vanishing
{
    int firstrow, firstcol, mu, nu, a, b;
    double aux, correl;
    firstrow = helix*i;
    firstcol = helix*j;
    gsl_matrix_view Xi = gsl_matrix_submatrix(tower->matrix_ordered_covariance, firstrow, firstcol, helix, helix);
    gsl_matrix_memcpy(result, &(Xi.matrix));
    return(0);
}
/* --- --- */
The problem is almost certainly gsl_matrix_memcpy. The source for that is in copy_source.c, with:
const size_t src_tda = src->tda;
const size_t dest_tda = dest->tda;
size_t i, j;
for (i = 0; i < src_size1; i++)
{
    for (j = 0; j < MULTIPLICITY * src_size2; j++)
    {
        dest->data[MULTIPLICITY * dest_tda * i + j]
            = src->data[MULTIPLICITY * src_tda * i + j];
    }
}
This would be quite slow. Note that gsl_matrix_memcpy returns a GSL_ERROR if the matrices are different sizes, so it's very likely the copy could be served with a CRT memcpy on the data members of dest and src.
This loop is very slow. Each cell is dereferenced through the dest and src structs for the data member, and THEN indexed.
You could choose to write a replacement for the library, or write your own personal version of this matrix copy, with something like (untested suggestion code here):
unsigned int cellsize = sizeof( src->data[0] ); // just pseudocode here
memcpy( dest->data, src->data, cellsize * src_size1 * src_size2 * MULTIPLICITY );
Note that MULTIPLICITY is a define, usually 1 or 2, probably depends on library configuration - might not apply to your usage (if it's 1 )
Now, important caveat....if the source matrix is a subview, then you have to go by rows...that is, a loop of rows in i where crt's memcpy is limited to rows at a time, not the entire matrix as I show above.
In other words, you do have to account for the source matrix geometry from which the subview was taken...that's probably why they index each cell (makes it simple).
If, however, you KNOW the geometry, you can very likely optimize this WAY above the performance you're seeing.
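For example, a row-wise copy for this submatrix case could look like the following untested sketch. It assumes real (double) matrices, so MULTIPLICITY is 1, and that result was allocated as an ordinary helix x helix matrix:
gsl_matrix *src = tower->matrix_ordered_covariance;
for (size_t r = 0; r < (size_t)helix; ++r)
{
    memcpy(result->data + r * result->tda,
           src->data + (firstrow + r) * src->tda + firstcol,
           helix * sizeof(double));
}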
If all you did was take out the src/dest dereference, you'd see SOME performance gain, as in:
const size_t src_tda = src->tda;
const size_t dest_tda = dest->tda;
size_t i, j;
float* dest_data = dest->data; // pseudocode here
float* src_data = src->data;   // pseudocode here
for (i = 0; i < src_size1; i++)
{
    for (j = 0; j < MULTIPLICITY * src_size2; j++)
    {
        dest_data[MULTIPLICITY * dest_tda * i + j]
            = src_data[MULTIPLICITY * src_tda * i + j];
    }
}
We'd HOPE the compiler recognized that anyway, but...sometimes...

Parallelizing a for loop gives no performance gain

I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel.
The main thing to notice here is that each pixel can be calculated independently from other pixels (doesn't rely on previous calculations) and so the algorithm is a natural candidate for paralleization. The calculation does however rely on some preset arrays which all threads will be accessing in the same time but will not change.
However, when I tried parallelizing the main for with MS's concurrency::parallel_for, I gained no boost in performance. In fact, for an input of size 3264X2540 running over a 4-core CPU, the non-parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).
Using Intel's compiler with tbb::parallel_for gave near exact results.
For comparison, I started out with this algorithm implemented in C#, in which I also used parallel_for loops, and there I encountered nearly 4x performance gains (I opted for C++ because for this particular task C++ was faster even with a single core).
Any ideas what is preventing my code from parallelizing well?
My code:
template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
//Translates index offset in Bayer image to channel offset in RGB image
int offsets[4];
//calculate offsets according to color space
switch (ColorSpace)
{
case ColorSpace::BGGR:
offsets[0] = 2;
offsets[1] = 1;
offsets[2] = 1;
offsets[3] = 0;
break;
...other color spaces
}
memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row%2)*2 + (col%2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
First of all, your algorithm is memory bandwidth bound. That is, the memory loads/stores outweigh any index calculations you do.
Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.
Increasing the amount of work per iteration is also useless - both PPL and TBB are smart enough not to create a thread per iteration; they use some good partitioning, which additionally tries to preserve locality. For instance, here is a quote from TBB's parallel_for:
When worker threads are available, parallel_for executes iterations in non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.
What really matters is to reduce memory operations. Any superfluous traversal over the input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.
You are fully traversing the input and output data. Even if you skip something in the output, that doesn't matter, because memory operations happen in 64-byte chunks on modern hardware. So, calculate the size of your input and output, measure the time of the algorithm, divide size by time, and compare the result with the maximal characteristics of your system (for instance, measured with a benchmark).
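For example, a rough bandwidth figure for this kernel could be computed like this (a sketch; elapsed_seconds stands in for your measured time, and float pixels are assumed as in the test below):
const double bytes_moved = double(Width) * Height * sizeof(float)        // read Bayer
                         + double(Width) * Height * 3 * sizeof(float);   // write RGB
const double gb_per_s = bytes_moved / 1e9 / elapsed_seconds;             // compare with your RAM bandwidth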
I have made a test for Microsoft PPL, OpenMP and a native for; the results are (I used 8x your height):
Native_For 0.21 s
OpenMP_For 0.15 s
Intel_TBB_For 0.15 s
MS_PPL_For 0.15 s
If the memset is removed:
Native_For 0.15 s
OpenMP_For 0.09 s
Intel_TBB_For 0.09 s
MS_PPL_For 0.09 s
As you can see, the memset (which is highly optimized) is responsible for a significant amount of the execution time, which shows how memory bound your algorithm is.
FULL SOURCE CODE:
#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>
using namespace boost;
using namespace std;
const auto Width = 3264;
const auto Height = 2540*8;
struct MS_PPL_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
concurrency::parallel_for(first,last,f);
}
};
struct Intel_TBB_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
tbb::parallel_for(first,last,f);
}
};
struct Native_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
for(; first!=last; ++first) f(first);
}
};
struct OpenMP_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
#pragma omp parallel for
for(auto i=first; i<last; ++i) f(i);
}
};
template<typename T>
struct ConvertBayerToRgbImageAsIs
{
const T* BayerChannel;
T* RgbChannel;
template<typename For>
void operator()(For for_)
{
cout << type_name<For>() << "\t";
progress_timer t;
int offsets[] = {2,1,1,0};
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
for_(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row % 2)*2 + (col % 2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
};
int main()
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
for(auto i=0;i!=4;++i)
{
mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
cout << string(16,'_') << endl;
}
}
Synchronization overhead
I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and run the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that causes fewer iterations and more work per iteration. The reasoning behind this is that there is too much synchronization being done.
Cache usage
An important factor may be how the data is split (partitioned) for the processing. If the processed rows are separated as in the bad case below, then more rows will cause cache misses. This effect will become more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs reasonable partitioning, then manual work-splitting will not give any gain.
bad good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2
Also make sure that you access your data in the same way it is aligned; it is possible that each access to offsets[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive. Almost all operations are either accessing a memory segment or writing to it. Preventing cache misses and minimizing memory access is crucial.
Code optimizations
The optimizations shown below may be done by the compiler and may not give better results. They are worth knowing about nonetheless.
// is the memset really necessary?
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
int rowMod = (row & 1) << 1;
for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
{
auto rgbIndex = tripleBayerIndex + offsets[rowMod];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
//unrolled the loop to save col & 1 operation
rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
}
});
Here comes my suggestion:
Compute larger chunks in parallel
Get rid of the modulo/multiplication
Unroll the inner loop to compute one full pixel (simplifies the code)
template<typename T> void static ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
{
    // convert BGGR->RGB
    // have as many chunks as the hardware concurrency is
    const int chunk = Height / static_cast<int>(thread::hardware_concurrency());
    parallel_for(0, Height, chunk, [&] (int stride)
    {
        for (auto row = stride; row < stride + chunk; row++)
        {
            for (auto col = row*Width, rgbCol = row*Width; col < row*Width + Width; rgbCol += 3, col += 4)
            {
                RgbChannel[rgbCol+0] = BayerChannel[col+3];
                RgbChannel[rgbCol+1] = BayerChannel[col+1];
                // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                RgbChannel[rgbCol+2] = BayerChannel[col+0];
            }
        }
    });
}
This code is 60% faster than the original version but still only half as fast as the non parallelized version on my laptop. This seemed to be due to the memory boundedness of the algorithm as others have pointed out already.
edit: But I was not happy with that. I could greatly improve the parallel performance when going from parallel_for to std::async:
int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i<hc; ++i)
{
res[i] = async(Converter<char>(bayerChannel, rgbChannel, rows, cols, rows/hc*i, rows/hc*(i+1)));
}
for (int i = 0; i<hc; ++i)
{
res[i].wait();
}
delete [] res;
with Converter being a simple class:
template <class T> class Converter
{
public:
Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) :
BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height), startRow(startRow), endRow(endRow)
{
}
void operator()()
{
// convert BGGR->RGB
for(int row = startRow; row < endRow; row++)
{
for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
};
}
private:
T* BayerChannel;
T* RgbChannel;
int Width;
int Height;
int startRow;
int endRow;
};
This is now 3.5 times faster than the non parallelized version. From what I have seen in the profiler so far, I assume that the work stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.
I have not used tbb::parallel_for nor concurrency::parallel_for, but if your numbers are correct they seem to carry too much overhead. However, I strongly advise you to run more than 10 iterations when testing, and also be sure to do enough warmup iterations before timing.
I tested your code exactly using three different methods, averaged over 1000 tries.
Serial: 14.6 ± 1.0 ms
std::async: 13.6 ± 1.6 ms
workers: 11.8 ± 1.2 ms
The first is serial calculation. The second is done using four calls to std::async. The last is done by sending four jobs to four already started (but sleeping) background threads.
The gains aren't big, but at least they are gains. I did the test on a 2012 MacBook Pro, with dual hyper threaded cores = 4 logical cores.
For reference, here's my std::async parallel for:
template<typename Int=int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
auto N = std::thread::hardware_concurrency();
std::vector<std::future<void>> futures;
for (Int ti=0; ti<N; ++ti) {
Int b = ti * (end - beg) / N;
Int e = (ti+1) * (end - beg) / N;
if (ti == N-1) { e = end; }
futures.emplace_back( std::async([&,b,e]() {
for (Int ix=b; ix<e; ++ix) {
fun( ix );
}
}));
}
for (auto&& f : futures) {
f.wait();
}
}
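For instance, the original row loop can be driven with it like this (same body and names as the question's function; only the driver changes):
std_par_for(0, Height, [&](int row)
{
    for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
    {
        auto offset = (row % 2) * 2 + (col % 2);
        auto rgbIndex = bayerIndex * 3 + offsets[offset];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
    }
});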
Things to check or do
Are you using a Core 2 or older processor? They have a very narrow memory bus that's easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (it's not possible for a single memory-bound thread to fully saturate it).
Have you populated all of your memory channels? E.g. if you have a dual-channel CPU but have just one RAM card installed or two that are on the same channel, you're getting half the available bandwidth.
How are you timing your code?
The timing should be done inside the application like Evgeny Panasyuk suggests.
You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code to launch the thread pools, etc.
Remove the superfluous memset, as others have explained.
As ogni42 and others have suggested, unroll your inner loop (I didn't bother checking the correctness of that solution, but if it's wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it's a good idea anyway.
Make sure your machine is otherwise idle when doing performance testing.
Additional timings
I've merged the suggestions of Evgeny Panasyuk and ogni42 in a bare-bones C++03 Win32 implementation:
#include "stdafx.h"
#include <omp.h>
#include <vector>
#include <iostream>
#include <stdio.h>
using namespace std;
const int Width = 3264;
const int Height = 2540*8;
class Timer {
private:
string name;
LARGE_INTEGER start;
LARGE_INTEGER stop;
LARGE_INTEGER frequency;
public:
Timer(const char *name) : name(name) {
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&start);
}
~Timer() {
QueryPerformanceCounter(&stop);
LARGE_INTEGER time;
time.QuadPart = stop.QuadPart - start.QuadPart;
double elapsed = ((double)time.QuadPart /(double)frequency.QuadPart);
printf("%-20s : %5.2f\n", name.c_str(), elapsed);
}
};
static const int offsets[] = {2,1,1,0};
template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
for (int col = 0, bayerIndex = row * Width;
col < Width; col++, bayerIndex++)
{
int offset = (row % 2)*2 + (col % 2); //0...3
int rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
}
// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
for (int col = row*Width, rgbCol =row*Width;
col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
}
int _tmain(int argc, _TCHAR* argv[])
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
for(int i = 0; i < 4; ++i)
{
{
Timer t("serial_orig");
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_dynamic_orig");
#pragma omp parallel for
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_static_orig");
#pragma omp parallel for schedule(static)
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("serial_unrolled");
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_dynamic_unrolled");
#pragma omp parallel for
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_static_unrolled");
#pragma omp parallel for schedule(static)
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
printf("-----------------------------\n");
}
return 0;
}
Here are the timings I see on a triple-channel 8-way hyperthreaded Core i7-950 box:
serial_orig : 0.13
omp_dynamic_orig : 0.10
omp_static_orig : 0.10
serial_unrolled : 0.06
omp_dynamic_unrolled : 0.04
omp_static_unrolled : 0.04
The "static" versions tell the compiler to evenly divide up the work between threads at loop entry. This avoids the overhead of attempting to do work stealing or other dynamic load balancing. For this code snippet, it doesn't seem to make a difference, even though the workload is very uniform across threads.
The performance reduction might be happening because you are trying to distribute the for loop over "row" number of threads, far more than the cores available, so it again becomes like sequential execution with the added overhead of parallelism.
Not very familiar with parallel for loops, but it seems to me the contention is in the memory access. It appears your threads are overlapping access to the same pages.
Can you break up your array access into 4k chunks somewhat aligned with the page boundary?
There is no point talking about parallel performance before having optimized the for loop for serial code. Here is my attempt at that (some good compilers may be able to obtain similarly optimized versions, but I'd rather not rely on that):
parallel_for(0, Height, [=] (int row) noexcept
{
    for (auto col = 0, bayerindex = row*Width,
              rgb0 = 3*bayerindex + offsets[(row%2)*2],
              rgb1 = 3*bayerindex + 3 + offsets[(row%2)*2 + 1];
         col < Width; col += 2, bayerindex += 2, rgb0 += 6, rgb1 += 6)
    {
        RgbChannel[rgb0] = BayerChannel[bayerindex    ];
        RgbChannel[rgb1] = BayerChannel[bayerindex + 1];
    }
});

Tensor Product Algorithm Optimization

double data[12] = {1, z, z^2, z^3, 1, y, y^2, y^3, 1, x, x^2, x^3};
double result[64] = {1, z, z^2, z^3, y, zy, (z^2)y, (z^3)y, y^2, z(y^2), (z^2)(y^2), (z^3)(y^2), y^3, z(y^3), (z^2)(y^3), (z^3)(y^3), x, zx, (z^2)x, (z^3)x, yx, zyx, (z^2)yx, (z^3)yx, (y^2)x, z(y^2)x, (z^2)(y^2)x, (z^3)(y^2)x, (y^3)x, z(y^3)x, (z^2)(y^3)x, (z^3)(y^3)x, x^2, z(x^2), (z^2)(x^2), (z^3)(x^2), y(x^2), zy(x^2), (z^2)y(x^2), (z^3)y(x^2), (y^2)(x^2), z(y^2)(x^2), (z^2)(y^2)(x^2), (z^3)(y^2)(x^2), (y^3)(x^2), z(y^3)(x^2), (z^2)(y^3)(x^2), (z^3)(y^3)(x^2), x^3, z(x^3), (z^2)(x^3), (z^3)(x^3), y(x^3), zy(x^3), (z^2)y(x^3), (z^3)y(x^3), (y^2)(x^3), z(y^2)(x^3), (z^2)(y^2)(x^3), (z^3)(y^2)(x^3), (y^3)(x^3), z(y^3)(x^3), (z^2)(y^3)(x^3), (z^3)(y^3)(x^3)};
What is the fastest way (fewest operations) to produce result given data? Assume that data is variable in size, but its length is always a multiple of 4 (e.g., 4, 8, 12, etc.).
No Boost. I am trying to keep my dependencies small. STL Algorithms are ok.
HINT: the result array size is always a power of 4 (e.g., 4, 16, 64, etc.), namely 4^(data size / 4).
BONUS: If you can compute result just given x, y, z
Additional examples:
double data[4] = {1, z, z^2, z^3};
double result[4] = {1, z, z^2, z^3};
double data[8] = {1, z, z^2, z^3, 1, y, y^2, y^3};
double result[16] = { ... };
I chose the accepted answer code after running this benchmark: https://gist.github.com/1232406. Basically, the top two codes were run and the one with the smallest execution time won.
#include <array>
#include <vector>
void Tensor(std::vector<double>& result, double x, double y, double z) {
    result.resize(64); // almost a noop if already the right size
    double tz = z*z;
    double ty = y*y;
    double tx = x*x;
    std::array<double, 12> data = {0, 0, tz, tz*z, 1, y, ty, ty*y, 1, x, tx, tx*x};
    register std::vector<double>::iterator iter = result.begin();
    register int yi;
    register double xy;
    for (register int xi = 0; xi < 4; ++xi) {
        for (yi = 0; yi < 4; ++yi) {
            xy = data[4+yi] * data[8+xi];
            *iter = xy; // a smart compiler can do these four in parallel
            *(++iter) = z*xy;
            *(++iter) = data[2]*xy;
            *(++iter) = data[3]*xy;
            ++iter; // workaround for speed!
        }
    }
}
There's probably at least one bug in here somewhere, but it should be fast, with no dependencies (outside of std::vector/std::array); it just takes x, y, z. I avoided recursion, though, so it only works for 3 inputs / 64 outputs. The concept can be applied to any number of parameters, though; you just have to write out the instantiation yourself.
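For example, a quick sanity check of the output ordering (a sketch; the values are illustrative):
std::vector<double> result;
Tensor(result, /*x=*/2.0, /*y=*/3.0, /*z=*/5.0);
// result[0] == 1, result[1] == z == 5, result[4] == y == 3, result[16] == x == 2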
A good compiler will autovectorize this. I guess none of my compilers are good:
void tensor(const double *restrict data,
int dimensions,
double *restrict result) {
result[0] = 1.0;
for (int i = 0; i < dimensions; i++) {
for (int j = (1 << (i * 2)) - 1; j > -1; j--) {
double alpha = result[j];
{
double *restrict dst = &result[j * 4];
const double *restrict src = &data[(dimensions - 1 - i) * 4];
for (int k = 0; k < 4; k++) dst[k] = alpha * src[k];
}
}
}
}
You should use a dynamic-programming algorithm; that is, reuse previous results. For example, keep the y^2 result and use it when computing (y^2)z instead of computing it again.
#include <vector>
#include <cstddef>
#include <cmath>
void Tensor(std::vector<double>& result, const std::vector<double>& variables, size_t index)
{
double p1 = variables[index];
double p2 = p1*p1;
double p3 = p1*p2;
if (index == variables.size() - 1) {
result.push_back(1);
result.push_back(p1);
result.push_back(p2);
result.push_back(p3);
} else {
Tensor(result, variables, index+1);
ptrdiff_t size = result.size();
for(int j=0; j<size; ++j)
result.push_back(result[j]*p1);
for(int j=0; j<size; ++j)
result.push_back(result[j]*p2);
for(int j=0; j<size; ++j)
result.push_back(result[j]*p3);
}
}
std::vector<double> Tensor(const std::vector<double>& params) {
std::vector<double> result;
size_t rsize = size_t(1) << (2 * params.size());
result.reserve(rsize);
Tensor(result, params, 0);
return result;
}
int main() {
std::vector<double> params;
params.push_back(3.1415926535);
params.push_back(2.7182818284);
params.push_back(42);
params.push_back(65536);
std::vector<double> result = Tensor(params);
}
I verified that this one compiles and runs (http://ideone.com/IU1eQ). It runs fast, with no dependencies (outside of std::vector). It also takes any number of parameters. Since calling the recursive form is awkward, I made a wrapper. It makes one function call for each parameter, and one call to dynamic memory (in the wrapper).
You should look at Pascal's pyramid to get a fast solution. Useful link 1, useful link 2, useful link 3 and useful link 4.
One more thing: as far as I can see, this would be the basis of a finite element solver. Writing your own BLAS routines is usually not a good idea. Do not reinvent the wheel! I think you should use a BLAS implementation like Intel MKL or a CUDA-based BLAS.