I am using Ceres Solver for the non-linear least-squares optimization of a model whose number of parameters is not known at compile time. Because of that, I implement the computation of the cost function and the corresponding automatic differentiation using the DynamicAutoDiffCostFunction class, as described in http://ceres-solver.org/nnls_modeling.html#dynamicautodiffcostfunction.
This cost function looks roughly like this:
class MyCostFunctor {
private:
    unsigned int num_parameters;
    const Measurement &meas;

public:
    MyCostFunctor(unsigned int num_parameters, const Measurement &meas)
        : num_parameters(num_parameters), meas(meas) {}

    template<typename T>
    bool operator()(T const* const* parameters, T* residuals) const {
        // Scratch buffer in the transformed parameter space, allocated on
        // every evaluation (this is the inefficiency discussed below).
        T *transformed_parameters = new T[transformed_space_dim(num_parameters)];
        TransformParams<T>(parameters[0], num_parameters, transformed_parameters);
        ComputeResiduals<T>(transformed_parameters, num_parameters, meas, residuals);
        delete[] transformed_parameters;
        return true;
    }
};
where the function TransformParams transforms the parameters to a different space, which is then used by ComputeResiduals to compute the residuals.
The code above sketches the idea of what I want to compute. However, it would be very inefficient to allocate and free the memory (pointed to by transformed_parameters) that represents the model in its intermediate parameter space every time the residuals are computed. Therefore I would like to allocate this memory in the constructor of MyCostFunctor. The problem with that is that the intermediate results have to be of type T, which is a template parameter of the operator() method.
Since I cannot come up with a solution (one that is not dirty) for pre-allocating the transformed_parameters array, I am wondering if someone else has a nice solution to my problem.
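For illustration, here is a minimal sketch of one possible workaround, based on the fact that DynamicAutoDiffCostFunction only instantiates operator() with T = double and T = ceres::Jet<double, Stride>, so two pre-allocated buffers suffice. The stride value 4 and the placeholder transformed_space_dim are assumptions for this example, not part of the question:

#include <vector>
#include <ceres/jet.h>

class MyCostFunctor {
    static constexpr int kStride = 4;  // must match DynamicAutoDiffCostFunction<..., 4>
    using Jet = ceres::Jet<double, kStride>;

    unsigned int num_parameters;
    mutable std::vector<double> scratch_double;  // buffer used when T = double
    mutable std::vector<Jet> scratch_jet;        // buffer used when T = Jet

    // Tag dispatch on T to pick the matching buffer.
    std::vector<double>& scratch(double*) const { return scratch_double; }
    std::vector<Jet>&    scratch(Jet*)    const { return scratch_jet; }

    static unsigned int transformed_space_dim(unsigned int n) { return n; }  // placeholder

public:
    explicit MyCostFunctor(unsigned int num_parameters)
        : num_parameters(num_parameters),
          scratch_double(transformed_space_dim(num_parameters)),
          scratch_jet(transformed_space_dim(num_parameters)) {}

    template<typename T>
    bool operator()(T const* const* parameters, T* residuals) const {
        // Allocated once in the constructor, reused on every evaluation.
        std::vector<T>& transformed = scratch(static_cast<T*>(nullptr));
        // TransformParams<T>(parameters[0], num_parameters, transformed.data());
        // ComputeResiduals<T>(transformed.data(), num_parameters, meas, residuals);
        return true;
    }
};

Note that this makes the functor stateful, so each instance should belong to a single residual block (the usual setup with DynamicAutoDiffCostFunction anyway); otherwise the shared buffers could race under multi-threaded evaluation.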
I'm using Eigen extensively in a scientific application I've been developing for some time. Since I'm implementing a numerical method, numbers below a certain threshold (e.g. 1e-15) are of no interest and only slow down the calculation and increase the error rate.
Hence, I want to round numbers below that threshold off to 0. I can do it with a for loop, but hammering multiple relatively big matrices (2M cells and up per matrix) with a for-if loop is expensive and slows me down, since I need to do it multiple times.
Is there a more efficient way to do this with the Eigen library?
In other words, I'm trying to eliminate numbers below a certain threshold from my calculation pipeline.
The shortest way to write what you want is

void foo(Eigen::VectorXf& inout, float threshold)
{
    inout = (threshold < inout.array().abs()).select(inout, 0.0f);
}
However, neither comparisons nor the select method get vectorized by Eigen (as of now).
If speed is essential, you need to either write some manual SIMD code, or write a custom functor which supports a packet method (this uses internal functionality of Eigen, so it is not guaranteed to be stable!):
#include <cmath>       // std::abs
#include <Eigen/Core>  // EIGEN_STRONG_INLINE and packet primitives

template<typename Scalar> struct threshold_op {
    Scalar threshold;
    threshold_op(const Scalar& value) : threshold(value) {}

    EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Scalar operator()(const Scalar& a) const {
        return threshold < std::abs(a) ? a : Scalar(0);
    }

    template<typename Packet>
    EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Packet packetOp(const Packet& a) const {
        using namespace Eigen::internal;
        // Keep a where |a| > threshold, zero it otherwise, via a bitmask.
        return pand(pcmp_lt(pset1<Packet>(threshold), pabs(a)), a);
    }
};
namespace Eigen { namespace internal {
template<typename Scalar>
struct functor_traits<threshold_op<Scalar> > {
    enum {
        Cost = 3 * NumTraits<Scalar>::AddCost,
        PacketAccess = packet_traits<Scalar>::HasAbs
    };
};
}}
This can then be passed to unaryExpr:
inout = inout.unaryExpr(threshold_op<float>(threshold));
Godbolt-Demo (should work with SSE/AVX/AVX512/NEON/...): https://godbolt.org/z/bslATI
It might actually be that the only reason for your slowdown is subnormal numbers. In that case, a simple
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
should do the trick (cf: Why does changing 0.1f to 0 slow down performance by 10x?)
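For completeness, a minimal sketch of enabling this (the mode is set per thread, so do it on each worker thread before the computation; the denormals-are-zero mode shown alongside is an addition of mine and requires SSE3):

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

int main()
{
    // Flush subnormal results to zero...
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // ...and treat subnormal inputs as zero as well.
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    // ... run the Eigen pipeline here ...
}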
Eigen has a method called unaryExpr which applies a given functor (e.g. a function pointer) to every coefficient in a matrix (it has sparse and array variants too).
Will test its performance and update this answer accordingly.
Suppose I have a matrix class. To add two large matrices, I can either overload operator + or define a function Add, like this:
matrix operator + (const matrix &A, const matrix &B)
{
    matrix C;
    /* all required things */
    for (int i .........) {
        C(i) = A(i) + B(i);
    }
    return C;
}
and call it like this:
matrix D = A+B;
Now if I define the Add function,
void Add(const matrix &A, const matrix &B, matrix &C)
{
    C.resize(); // according to the dimensions of A and B
    // for C.resize, the copy constructor will be called.
    /* all required things */
    for (int i .........) {
        C(i) = A(i) + B(i);
    }
}
and I have to call this function like this:
matrix D;
Add(A,B,D); //D=A+B
Which of the above methods is faster and more efficient? Which should we use?
Without using any tools,
like a profiler (e.g. gprof) to see how much time is spent where,
or other tools like valgrind + cachegrind to see how many operations are performed in either of the two functions,
and ignoring all compiler optimizations (i.e. compiling with -O0),
and assuming whatever else there is in the two functions (what you represent as /* all required things */) is trivial,
all one can say just by looking at your functions is that both have a complexity of O(n), since both spend most of their time in the for loop. If the matrices are really large, everything else in the code is pretty much insignificant when it comes down to speed.
So, what your question boils down to, in my opinion, is: how much time it takes to call the constructor of C plus return this C, versus how much time it takes to call the resize function for C plus call the copy constructor of C.
You can measure this, crudely but relatively quickly, using std::clock() or chrono, as shown in multiple answers here:
#include <chrono>

auto t_start = std::chrono::high_resolution_clock::now();
matrix D = A + B; // on the 2nd run, replace with: matrix D; Add(A, B, D);
auto t_end = std::chrono::high_resolution_clock::now();
double elapsedTimeMs = std::chrono::duration<double, std::milli>(t_end - t_start).count();
Although, once again, in my honest opinion, if your matrices are big, most of the time will go into the for loop.
p.s. Premature optimization is the root of all evil.
An external library gives me a raw pointer of doubles that I want to map to an Eigen type. The raw array is logically a big ordered collection of small dense fixed-size matrices, all of the same size. The main issue is that the small dense matrices may be in row-major or column-major ordering and I want to accommodate them both.
My current approach is as follows. Note that all the entries of a small fixed-size block (in the array of blocks) need to be contiguous in memory.
template<int bs, class Mattype>
void block_operation(double *const vals, const int numblocks)
{
    Eigen::Map<Mattype> mappedvals(vals,
                                   Mattype::IsRowMajor ? numblocks*bs : bs,
                                   Mattype::IsRowMajor ? bs : numblocks*bs);

    for(int i = 0; i < numblocks; i++)
        if(Mattype::IsRowMajor)
            mappedvals.template block<bs,bs>(i*bs, 0) = block_operation_rowmajor(mappedvals);
        else
            mappedvals.template block<bs,bs>(0, i*bs) = block_operation_colmajor(mappedvals);
}
The calling function first figures out the Mattype (out of 2 options) and then calls the above function with the correct template parameter.
Thus all my algorithms need to be written twice and my code is interspersed with these layout checks. Is there a way to do this in a layout-agnostic way? Keep in mind that this code needs to be as fast as possible.
Ideally, I would Map the data just once and use it for all the operations needed. However, the only solution I could come up with was invoking the Map constructor once for every small block, whenever I need to access the block.
template<int bs, StorageOptions layout>
inline Map<Matrix<double,bs,bs,layout>> extractBlock(double *const vals,
                                                     const int bindex)
{
    return Map<Matrix<double,bs,bs,layout>>(vals + bindex*bs*bs);
}
Would this function be optimized away to nothing (by a modern compiler like GCC 7.3 or Intel 2017 under -std=c++14 -O3), or would I be paying a small penalty every time I invoke this function (once for each block, and there are a LOT of small blocks)? Is there a better way to do this?
Your extractBlock is fine; a simpler but somewhat uglier solution is to use a reinterpret_cast at the start of block_operation:
using BlockType = Matrix<double,bs,bs,layout|DontAlign>;
BlockType* blocks = reinterpret_cast<BlockType*>(vals);
for(int i = 0; i < numblocks; ++i)
    blocks[i] = /* ... per-block operation ... */;
This will work for fixed-size matrices only. Also note the DontAlign, which is important unless you can guarantee that vals is aligned to 16 or even 32 bytes, depending on the presence of AVX and on bs... so just use DontAlign!
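To make the idea concrete, here is a small self-contained sketch of the cast in context; the scaling is just a stand-in for whatever per-block operation is actually needed:

#include <Eigen/Dense>
using namespace Eigen;

template<int bs, int layout>
void scale_blocks(double *const vals, const int numblocks, const double factor)
{
    // View the raw buffer as a contiguous array of small fixed-size matrices.
    using BlockType = Matrix<double, bs, bs, layout | DontAlign>;
    BlockType* blocks = reinterpret_cast<BlockType*>(vals);
    for (int i = 0; i < numblocks; ++i)
        blocks[i] *= factor;  // placeholder per-block operation
}

// e.g. scale_blocks<4, RowMajor>(vals, numblocks, 2.0);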
I have a structure that looks like this:
struct SoA
{
int arr1[COUNT];
int arr2[COUNT];
};
And I want it to look like this:
struct AoS
{
int arr1_data;
int arr2_data;
};
std::vector<AoS> points;
as quickly as possible. Order must be preserved.
Is constructing each AoS object individually and pushing it back the fastest way to do this, or is there a faster option?
SoA before;
std::vector<AoS> after;
for (int i = 0; i < COUNT; i++)
    after.push_back(AoS{before.arr1[i], before.arr2[i]});
There are SoA/AoS related questions on StackOverflow, but I haven't found one related to fastest-possible conversion. Because of struct packing differences I can't see any way to avoid copying the data from one format to the next, but I'm hoping someone can tell me there's a way to simply reference the data differently and avoid a copy.
Off the wall solutions especially encouraged.
The binary layouts of SoA and AoS[]/std::vector<AoS> are different, so there is really no way to transform one into the other without a copy operation.
The code you have is pretty close to optimal; one improvement may be to pre-allocate the vector with the expected number of elements, as in the snippet below. Alternatively, try a raw array, with both construction of whole elements and per-property initialization. Changes need to be measured carefully (definitely measure using a fully optimized build with the array sizes you expect) and weighed against readability/correctness of the code.
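For example, a sketch of the pre-allocated version, using the structs from the question:

SoA before;
std::vector<AoS> after;
after.reserve(COUNT);  // one allocation up front, no reallocation inside the loop
for (int i = 0; i < COUNT; i++)
    after.push_back(AoS{before.arr1[i], before.arr2[i]});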
If you don't need the exact binary layout (which seems to be the case, since you are using a vector), you may be able to achieve similar-looking syntax by creating a couple of custom classes that expose the existing data differently. This avoids copying altogether.
You would need an "array" type (providing indexing/iteration over an instance of SoA) and an "element" type (initialized with a reference to an instance of SoA and an index, exposing accessors for the separate fields at that index).
Rough sketch of code (add iterators,...):
class AoS_Element
{
    SoA& soa;
    int index;
public:
    AoS_Element(SoA& soa, int index) : soa(soa), index(index) {}
    int arr1_data() { return soa.arr1[index]; }
    int arr2_data() { return soa.arr2[index]; }
};

class AoS
{
    SoA& soa;
public:
    AoS(SoA& _soa) : soa(_soa) {}
    AoS_Element operator[](int index) { return AoS_Element(soa, index); }
};
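A hypothetical usage example of this sketch (here AoS is the wrapper class from the sketch above, not the struct from the question):

SoA data;                     // filled elsewhere
AoS view(data);               // no copy, just a reference wrapper
int x = view[5].arr1_data();  // reads data.arr1[5] on the fly
int y = view[5].arr2_data();  // reads data.arr2[5] on the fly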
I'm going to write a templatized implementation of a KDTree, which for now should only work as a quadtree or octree for a Barnes-Hut implementation.
The crucial point here is the design: I would like to specify the number of dimensions over which the tree is defined as a template parameter, and then simply declare some common methods which automatically behave the correct way (I think some template specialization is needed for that).
I would like to specialize the template in order to have 2^2 (quadtree) or 2^3 (octree) nodes.
Does someone have some design ideas? I would like to avoid inheritance because it constrains me to use dynamic memory allocation rather than static allocation.
Here N can be 2 or 3
template<int N>
class NTree
{
public:
    NTree(const std::vector<Mass *> &);
    ~NTree()
    {
        for (int i = 0; i < pow(2,N); i++)
            delete nodes[i];
    }
private:
    void insert(Mass *m);
    NTree *nodes[pow(2,N)]; // is it possible in a templatized way?
};
Another problem is that a quadtree has 4 nodes but 2 dimensions, and an octree has 8 nodes but 3 dimensions, i.e. the number of nodes is 2^dimension. Can I specify this via template metaprogramming? I would like to keep the numbers 4 and 8 explicit so the loop unroller can be faster.
Thank you!
You can use 1 << N instead of pow(2, N). This works because 1 << N is a compile-time constant, whereas pow(2, N) is not a compile-time constant (even if the compiler happens to evaluate it at compile time anyway).
If you are using a C++11 compiler that supports constexpr, you could write yourself a constexpr version of pow to do the calculation at compile time.
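To make this concrete, here is a minimal sketch of both suggestions applied to the class from the question (the name ipow is mine):

#include <vector>

struct Mass;  // as in the question

// C++11 constexpr integer power: usable in constant expressions
// such as array bounds, unlike std::pow.
constexpr int ipow(int base, int exp)
{
    return exp == 0 ? 1 : base * ipow(base, exp - 1);
}

template<int N>
class NTree
{
public:
    // A compile-time constant: 4 for N=2 (quadtree), 8 for N=3 (octree).
    static const int num_children = 1 << N;  // or: ipow(2, N)

    explicit NTree(const std::vector<Mass *> &);
    ~NTree()
    {
        for (int i = 0; i < num_children; i++)
            delete nodes[i];
    }
private:
    void insert(Mass *m);
    NTree *nodes[num_children];  // fixed-size array, legal now
};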