I'm using Eigen extensively in a scientific application I've been developing for some time. Since I'm implementing a numerical method, numbers below a certain threshold (e.g. 1e-15) are of no interest; they only slow down the calculations and increase the error.
Hence, I want to round off numbers below that threshold to 0. I can do it with a for loop, but hammering multiple relatively big matrices (2M cells and up per matrix) with a for-if loop is expensive and slowing me down since I need to do it multiple times.
Is there a more efficient way to do this with the Eigen library?
In other words, I'm trying to eliminate numbers below a certain threshold in my calculation pipeline.
The shortest way to write what you want is
void foo(Eigen::VectorXf& inout, float threshold)
{
inout = (threshold < inout.array().abs()).select(inout, 0.0f);
}
However, neither comparisons nor the select method get vectorized by Eigen (as of now).
If speed is essential, you need to either write some manual SIMD code, or write a custom functor which supports a packet method (this uses internal functionality of Eigen, so it is not guaranteed to be stable!):
template<typename Scalar> struct threshold_op {
Scalar threshold;
threshold_op(const Scalar& value) : threshold(value) {}
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Scalar operator() (const Scalar& a) const{
return threshold < std::abs(a) ? a : Scalar(0); }
template<typename Packet>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Packet packetOp(const Packet& a) const {
using namespace Eigen::internal;
return pand(pcmp_lt(pset1<Packet>(threshold),pabs(a)), a);
}
};
namespace Eigen { namespace internal {
template<typename Scalar>
struct functor_traits<threshold_op<Scalar> >
{ enum {
Cost = 3*NumTraits<Scalar>::AddCost,
PacketAccess = packet_traits<Scalar>::HasAbs };
};
}}
This can then be passed to unaryExpr:
inout = inout.unaryExpr(threshold_op<float>(threshold));
Godbolt-Demo (should work with SSE/AVX/AVX512/NEON/...): https://godbolt.org/z/bslATI
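For comparison, the same rule can be written against a plain buffer using only the standard library. This is a minimal sketch, assuming the data sits in a contiguous float vector; flush_small is a hypothetical name, not an Eigen function. Whether the compiler auto-vectorizes it depends on the compiler and flags, so it is worth benchmarking against the functor above.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of Eigen): zero every entry whose magnitude
// is not strictly above `threshold`. Same rule as threshold_op:
// keep a only if threshold < |a|.
inline void flush_small(std::vector<float>& v, float threshold) {
    std::transform(v.begin(), v.end(), v.begin(), [threshold](float a) {
        return threshold < std::fabs(a) ? a : 0.0f;
    });
}
```

Calling flush_small(values, 1e-15f) modifies the vector in place, just like foo above does for an Eigen::VectorXf.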
It might actually be that the only reason for your slowdown is subnormal numbers. In that case, a simple
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
should do the trick (cf. Why does changing 0.1f to 0 slow down performance by 10x?).
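Before changing the floating-point mode, it may help to check whether subnormals actually occur in the data. A minimal diagnostic sketch using only the standard library; count_subnormals is a hypothetical helper and the data is assumed to be a plain float vector:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical diagnostic: count entries that are subnormal (denormal) floats.
inline std::size_t count_subnormals(const std::vector<float>& v) {
    std::size_t n = 0;
    for (float x : v) {
        if (std::fpclassify(x) == FP_SUBNORMAL) ++n;
    }
    return n;
}
```

If the count stays at zero throughout the pipeline, subnormals are probably not the cause of the slowdown.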
Eigen has a method called unaryExpr which applies a given function (a functor or a function pointer) to every coefficient in a matrix (it has sparse and array variants too).
Will test its performance and update this answer accordingly.
I am using Ceres Solver for the non-linear least squares optimization of a model for which the number of parameters is not known at compile time. Because the number of parameters is not known at compile time, I implement the computation of the cost function and the corresponding automatic differentiation using the DynamicAutoDiffCostFunction class as described in http://ceres-solver.org/nnls_modeling.html#dynamicautodiffcostfunction.
This cost function looks roughly like this:
class MyCostFunctor {
private:
unsigned int num_parameters;
const Measurement &meas;
public:
MyCostFunctor (unsigned int num_parameters, const Measurement &meas)
: num_parameters(num_parameters), meas(meas) {}
template<typename T>
bool operator()(T const* const* parameters, T* residuals) const {
T *transformed_parameters = new T[transformed_space_dim(num_parameters)];
TransformParams<T> (parameters[0], num_parameters, transformed_parameters);
ComputeResiduals<T> (transformed_parameters, num_parameters, meas, residuals);
delete[] transformed_parameters;
return true;
}
};
where the function TransformParams transforms the parameters to a different space, which is then used by ComputeResiduals to compute the residuals.
The code above sketches the idea of what I want to compute. However, it would be very inefficient to allocate and free the memory (pointed to by transformed_parameters) that represents the model in its intermediate parameter space every time the residuals are computed. Therefore I would like to allocate this memory in the constructor of MyCostFunctor. The problem is that the intermediate results must be of type T, which is a template parameter of the operator() method.
Since I cannot come up with a solution (one, that is not dirty) how to implement a pre-allocation of the transformed_parameters array, I am wondering if someone else has a nice solution to my problem.
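One possible direction, sketched here without any Ceres types, is to keep a thread-local buffer inside the templated operator(), so that each scalar type T gets its own buffer that is reused across calls. Everything in this sketch is hypothetical (the class name, the placeholder transform, the single residual); it only illustrates the buffering idea and assumes operator() is not called re-entrantly on the same thread. Note that the buffer is shared by all functor instances that run on the same thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the cost functor; no Ceres types are used here.
class BufferedFunctor {
public:
    explicit BufferedFunctor(std::size_t transformed_dim)
        : transformed_dim_(transformed_dim) {}

    template <typename T>
    bool operator()(const T* parameters, T* residuals) const {
        // One buffer per scalar type T and per thread. It keeps its capacity
        // between calls, so later calls normally do not allocate.
        static thread_local std::vector<T> buffer;
        buffer.resize(transformed_dim_);
        for (std::size_t i = 0; i < transformed_dim_; ++i) {
            buffer[i] = parameters[i] * parameters[i];  // placeholder transform
        }
        T sum = T(0);
        for (const T& v : buffer) sum += v;             // placeholder residual
        residuals[0] = sum;
        return true;
    }

private:
    std::size_t transformed_dim_;
};
```

Because the buffer is declared inside the template, the compiler creates a separate one for double and for each automatic-differentiation scalar type the functor is instantiated with.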
I want to note that pseudo-random number generation in C++ is overcomplicated. Older languages such as Pascal had a function Random(n), where n is an integer and the generated range is 0 to n-1. Coming back to modern C++, I want a similar interface, but with a function random_int(a,b) that generates numbers in the range [a,b].
Consider the following example:
#include <random>
namespace utils
{
namespace implementation_details
{
struct eng_wrap {
std::mt19937 engine;
eng_wrap()
{
std::random_device device;
engine.seed(device());
}
std::mt19937& operator()()
{
return engine;
}
};
eng_wrap rnd_eng;
}
template <typename int_t, int_t a, int_t b> int_t random_int()
{
static_assert(a <= b);
static std::uniform_int_distribution<int_t> distr(a, b);
return distr(implementation_details::rnd_eng());
}
}
You can see that distr is marked with the static keyword. Because of this, repeated calls with the same arguments will not construct a new std::uniform_int_distribution object.
In some cases, at the compilation time we do not know the generation boundaries.
Therefore, we have to rewrite this function:
template <typename int_t> int_t random_int2(int_t a, int_t b)
{
std::uniform_int_distribution<int_t> distr(a, b);
return distr(implementation_details::rnd_eng());
}
Next, suppose the second version of this function is called many times:
int a, b;
std::cin>>a>>b;
for (int i=1;i!=1000000;++i)
std::cout<<utils::random_int2(a,b)<<' ';
Question
What is the cost of creating std::uniform_int_distribution in each iteration of the loop?
Can you suggest a more optimized function that returns a pseudo-random number in the passed range for a normal desktop application?
If you want to use the same a and b repeatedly, use a class with a member function—that’s what they’re for. If you don’t want to expose your rnd_eng (choosing instead to preclude useful multithreaded clients), write the class to use it:
template<class T>
struct random_int {
random_int(T a,T b) : d(a,b) {}
T operator()() {return d(implementation_details::rnd_eng());} // not const: the distribution's operator() is non-const
private:
std::uniform_int_distribution<T> d;
};
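For reference, here is a self-contained variant of the class above together with how it might be used. It is only a sketch: the name bounded_random and the choice to hold the engine as a member (with an optional seed) are assumptions made so that the example compiles on its own.

```cpp
#include <random>

// Self-contained variant of the class above. The engine is a member here,
// which is an assumption made for this sketch.
template <class T>
class bounded_random {
public:
    bounded_random(T a, T b, unsigned seed = std::random_device{}())
        : engine_(seed), dist_(a, b) {}

    // Non-const: drawing a number advances the engine state.
    T operator()() { return dist_(engine_); }

private:
    std::mt19937 engine_;
    std::uniform_int_distribution<T> dist_;
};
```

With this, bounded_random<int> die(a, b); constructs the distribution once, and each die() call inside the loop only draws a number.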
IMO, for most simple programs such as games, graphics, and Monte Carlo simulations, the API you actually want is
static xoshiro256ss g;
// Generate a random number between 0 and n-1.
// For example, randint0(2) flips a coin; randint0(6) rolls a die.
int randint0(int n) {
return g() % n;
}
// This version is useful for games like NetHack, where you often
// want to express an ad-hoc percentage chance of something happening.
bool pct(int n) {
return randint0(100) < n;
}
(or substitute std::mt19937 for xoshiro256ss but be aware you're trading away performance in exchange for... something. :))
The % n above is mathematically dubious when n is astronomically large (e.g. if you're rolling a 12297829382473034410-sided die, you'll find that values between 0 and 6148914691236517205 come up twice as often as they should). So you may prefer to use C++11's uniform_int_distribution:
int randint0(int n) {
return std::uniform_int_distribution<int>(0, n-1)(g);
}
However, again be aware you're gaining mathematical perfection at the cost of raw speed. uniform_int_distribution is more for when you don't already trust your random number engine to be sane (e.g. if the engine's output range might be 0 to 255 but you want to generate numbers from 1 to 1000), or when you're writing template code to work with any arbitrary integer distribution (e.g. binomial_distribution, geometric_distribution) and need a uniform distribution object of that same general "shape" to plug into your template.
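The % n bias mentioned earlier is easier to see at a small scale. The sketch below assumes a hypothetical engine that produces every value in 0..255 equally often and counts how often each result of % 6 occurs; since 256 = 6 * 42 + 4, the residues 0 to 3 occur 43 times and the residues 4 and 5 only 42 times.

```cpp
#include <array>
#include <cstddef>

// Count how often each residue appears when every output of a hypothetical
// 8-bit engine (0..255) is reduced with % 6.
inline std::array<int, 6> residue_counts_mod6() {
    std::array<int, 6> counts{};
    for (int x = 0; x < 256; ++x) {
        ++counts[static_cast<std::size_t>(x % 6)];
    }
    return counts;
}
```

With a 64-bit engine and a small n the same imbalance exists but is proportionally far smaller, which is why it only becomes noticeable for very large n.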
The answer to your question #1 is "The cost is free." You will not gain anything by stashing the result of std::uniform_int_distribution<int>(0, n-1) into a static variable. A distribution object is very small, trivially copyable, and basically free to construct. In fact, the cost of constructing the uniform_int_distribution in this case is orders of magnitude cheaper than the cost of thread-safe static initialization.
(There are special cases such as std::normal_distribution where not-stashing the distribution object between calls can result in your doing twice as much work as needed; but uniform_int_distribution is not one of those cases.)
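If you would still rather keep a single distribution object around while the bounds change from call to call, the standard distributions accept their parameters per call through param_type. A minimal sketch; random_in_range is a hypothetical name, and in line with the reasoning above it is not expected to be measurably faster than constructing the distribution each time.

```cpp
#include <random>

// Hypothetical helper: one long-lived distribution object per thread,
// with the bounds [a, b] supplied on every call via param_type.
template <typename Int, typename Engine>
Int random_in_range(Engine& engine, Int a, Int b) {
    using Dist = std::uniform_int_distribution<Int>;
    static thread_local Dist dist;  // its default range is not used below
    return dist(engine, typename Dist::param_type(a, b));
}
```

A call looks like random_in_range(engine, a, b), where engine is any standard engine such as std::mt19937.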
Basically, I have a collection std::vector<std::pair<std::vector<float>, unsigned int>> which contains pairs of templates std::vector<float> of size 512 (2048 bytes) and their corresponding identifier unsigned int.
I am writing a function in which I am provided with a template and I need to return the identifier of the most similar template in the collection. I am using dot product to compute the similarity.
My naive implementation looks as follows:
// Should return false if no match is found (ie. similarity is 0 for all templates in collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
bool found = false;
similarity = 0.f;
for (size_t i = 0; i < collection.size(); ++i) {
const float* candidateTemplate = collection[i].first.data();
float consinSimilarity = getSimilarity(data, candidateTemplate, length); // computes the cosine similarity between two vectors; implementation depends on architecture.
if (consinSimilarity > similarity) {
found = true;
similarity = consinSimilarity;
label = collection[i].second;
}
}
return found;
}
How can I speed this up using parallelization? My collection can potentially contain millions of templates. I have read that you can add #pragma omp parallel for reduction, but I am not entirely sure how to use it (or whether this is even the best option).
Also note:
For my dot product implementation, if the base architecture supports AVX & FMA, I am using this implementation.
Will this affect performance when we parallelize since there are only a limited number of SIMD registers?
Since we don't have access to an example that actually compiles (which would have been nice), I didn't actually try to compile the example below. Nevertheless, some minor typos (maybe) aside, the general idea should be clear.
The task is to find the highest similarity value and the corresponding label, and for this we can indeed use a reduction. Since we need the maximum of one value together with its label, we store both in a std::pair so that the search can be expressed as a single OpenMP reduction.
I have slightly rewritten your code, which may have made it a bit harder to read because of the generic variable name (temp). Basically, we perform the search in parallel, so each thread finds its own best value; we then ask OpenMP to pick the best result among the threads (the reduction) and we are done.
//Reduce by finding the maximum and also storing the corresponding label, this is why we use a std::pair.
void reduce_custom (std::pair<float, unsigned int>& output, std::pair<float, unsigned int>& input) {
if (input.first > output.first) output = input;
}
//Declare an OpenMP reduction with our pair and our custom reduction function.
#pragma omp declare reduction(custom_reduction : \
std::pair<float, unsigned int>: \
reduce_custom(omp_out, omp_in)) \
initializer(omp_priv(omp_orig))
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
std::pair<float, unsigned int> temp(0.0, label); //Stores thread local similarity and corresponding best label.
#pragma omp parallel for reduction(custom_reduction:temp)
for (size_t i = 0; i < collection.size(); ++i) {
const float* candidateTemplate = collection[i].first.data();
float consinSimilarity = getSimilarity(data, candidateTemplate, length);
if (consinSimilarity > temp.first) {
temp.first = consinSimilarity;
temp.second = collection[i].second;
}
}
if (temp.first > 0.f) {
similarity = temp.first;
label = temp.second;
return true;
}
return false;
}
Regarding your concern about the limited number of SIMD registers: their number depends on the specific CPU you are using. To the best of my understanding, each core has its own set of vector registers, so as long as your code did not need more registers than were available before, it should be fine now as well. Besides, AVX512, for instance, provides 32 vector registers and 2 arithmetic units for vector operations per core, so running out of compute resources is not trivial. You are more likely to suffer from poor memory locality (particularly in your case, with the vectors being stored all over the place). I might of course be wrong; if so, please feel free to correct me in the comments.
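To make the memory-locality point concrete, here is one way the templates could be stored back to back in a single buffer instead of in one std::vector per template. This is only a sketch under assumptions: FlatCollection, dot and best_match are hypothetical names, a plain dot product stands in for getSimilarity, and every template is assumed to have exactly dim entries.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical container: all templates stored contiguously in one buffer.
struct FlatCollection {
    std::size_t dim = 0;               // floats per template (512 in the question)
    std::vector<float> data;           // size() == dim * labels.size()
    std::vector<unsigned int> labels;  // labels[i] belongs to template i

    // Assumes tmpl.size() == dim.
    void add(const std::vector<float>& tmpl, unsigned int label) {
        data.insert(data.end(), tmpl.begin(), tmpl.end());
        labels.push_back(label);
    }
    const float* row(std::size_t i) const { return data.data() + i * dim; }
};

// Plain dot product as a placeholder for getSimilarity.
inline float dot(const float* a, const float* b, std::size_t n) {
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Returns {best similarity, label}; similarity stays 0 if nothing scores above 0.
inline std::pair<float, unsigned int> best_match(const FlatCollection& c,
                                                 const float* query) {
    std::pair<float, unsigned int> best(0.f, 0u);
    for (std::size_t i = 0; i < c.labels.size(); ++i) {
        const float s = dot(query, c.row(i), c.dim);
        if (s > best.first) best = {s, c.labels[i]};
    }
    return best;
}
```

The loop in best_match walks memory strictly front to back, and it has the same shape as the loop in identify, so the same reduction could be applied to it.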
I am using Eigen in a C++ program to solve linear systems for very small square matrices (4x4).
My test code looks like this:
template<template <typename MatrixType> typename EigenSolver>
Vector3d solve(){
//Solve Ax = b and A is a real symmetric matrix and positive semidefinite
... // Construct 4X4 square matrix A and 4X1 vector b
EigenSolver<Matrix4d> solver(A);
auto x = solver.solve(b);
... // Compute relative error for validating
}
I tested several solvers, including:
FullPivLU
PartialPivLU
HouseholderQR
ColPivHouseholderQR
ColPivHouseholderQR
CompleteOrthogonalDecomposition
LDLT
Direct Inverse
Direct Inverse is:
template<typename MatrixType>
struct InverseSolve
{
private:
MatrixType inv;
public:
InverseSolve(const MatrixType &matrix) :inv(matrix.inverse()) {
}
template<typename VectorType>
auto solve(const VectorType & b) {
return inv * b;
}
};
I found that the fastest method is Direct Inverse. Even when I linked Eigen with MKL, the result did not change.
These are the test results:
FullPivLU : 477 ms
PartialPivLU : 468 ms
HouseholderQR : 849 ms
ColPivHouseholderQR : 766 ms
ColPivHouseholderQR : 857 ms
CompleteOrthogonalDecomposition : 832 ms
LDLT : 477 ms
Direct Inverse : 88 ms
All tests use 1000000 matrices filled with random doubles drawn from a uniform distribution on [0,100]. I first construct the upper triangle and then copy it to the lower triangle.
The only problem with Direct Inverse is that its relative error is slightly larger than that of the other solvers, but still acceptable.
Is there a faster or more elegant solution for my program? Is Direct Inverse the fastest solution for my program?
Direct Inverse does not use the symmetry information, so why is it so much faster than LDLT?
Despite the common advice never to compute an inverse explicitly when you only want to solve a linear system, for very small matrices this can actually be beneficial, since there are closed-form solutions using cofactors.
All the other alternatives you tested will be slower, since they do pivoting (which implies branching), even for small fixed-size matrices. Also, most of them require more divisions and do not vectorize as well as the direct computation.
To increase the accuracy (this technique can actually be used independent of the solver if required), you can refine an initial solution by solving the system again with the residual:
Eigen::Vector4d solveDirect(const Eigen::Matrix4d& A, const Eigen::Vector4d& b)
{
Eigen::Matrix4d inv = A.inverse();
Eigen::Vector4d x = inv * b;
x += inv*(b-A*x);
return x;
}
I don't think Eigen directly provides a way to exploit the symmetry of A here (for the directly computed inverse). You can try hinting that by explicitly copying a selfadjoint view of A into a temporary and hope that the compiler is smart enough to find common sub-expressions:
Eigen::Matrix4d tmp = A.selfadjointView<Eigen::Upper>();
Eigen::Matrix4d inv = tmp.inverse();
To reduce some divisions, you can also compile with -freciprocal-math (on gcc or clang); this will slightly reduce accuracy, of course.
If this is really performance critical, try implementing a hand-tuned inverse_4x4_symmetric method.
Exploiting the symmetry in inv * b is unlikely to be beneficial for such small matrices.
I am using the boost geometry library to compare two different polygons. Specifically, I am using the equals algorithm to see if two polygons are congruent (equal dimensions).
The problem is that the tolerance on the algorithm is too tight and two polygons that should be congruent (after some floating point operations) are not within the tolerance defined by the algorithm.
I'm almost certain that the library is using std::numeric_limits<double>::epsilon() (~2.22e-16) to establish the tolerance. I would like to set the tolerance to be larger (say 1.0e-10).
Any ideas on how to do this?
EDIT: I've changed the title to reflect the responses in the comments. Please respond to the follow-up below:
Is it possible to override just the boost::geometry::math::detail::equals<Type,true>::apply function?
This way I could replace only the code where the floating point comparison occurs and I wouldn't have to rewrite a majority of the boost::geometry::equals algorithm.
For reference, here is the current code from the boost library:
template <typename Type, bool IsFloatingPoint>
struct equals
{
static inline bool apply(Type const& a, Type const& b)
{
return a == b;
}
};
template <typename Type>
struct equals<Type, true>
{
static inline Type get_max(Type const& a, Type const& b, Type const& c)
{
return (std::max)((std::max)(a, b), c);
}
static inline bool apply(Type const& a, Type const& b)
{
if (a == b)
{
return true;
}
// See http://www.parashift.com/c++-faq-lite/newbie.html#faq-29.17,
// FUTURE: replace by some boost tool or boost::test::close_at_tolerance
return std::abs(a - b) <= std::numeric_limits<Type>::epsilon() * get_max(std::abs(a), std::abs(b), 1.0);
}
};
The mentioned code can be found in boost/geometry/util/math.hpp, currently in Boost 1.56 or older (here on GitHub).
There is a free function boost::geometry::math::equals() that internally calls boost::geometry::math::detail::equals<>::apply(). So to change the default behavior you could overload this function or specialize the struct for some coordinate type or types. Keep in mind that in some algorithms that type may be promoted to a more precise type.
Of course you could also use your own, non-standard coordinate type and implement required operators or overload the function mentioned above.
But... you might consider describing a specific case in which you think the calculated result is wrong, to be sure that this question is not an XY problem. Playing with epsilon might improve the result in some cases but make things worse in others. What if some parts of the algorithm not related to the comparison could be improved? It would also be helpful if you wrote which version of Boost.Geometry you're using, which compiler, etc.
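For reference, this is roughly what a comparison with a configurable tolerance could look like as a standalone function. It mirrors the formula in the quoted code, with the tolerance passed in instead of std::numeric_limits<Type>::epsilon(); approx_equal is a hypothetical name and is not part of Boost.Geometry.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical standalone comparison, not part of Boost.Geometry.
// Same scaling as the quoted code: |a - b| <= tolerance * max(|a|, |b|, 1).
inline bool approx_equal(double a, double b, double tolerance) {
    if (a == b) return true;
    const double scale = std::max({std::fabs(a), std::fabs(b), 1.0});
    return std::fabs(a - b) <= tolerance * scale;
}
```

With a tolerance of 1.0e-10, the values 1.0 and 1.0 + 1e-12 compare equal, whereas with a tolerance near machine epsilon they do not.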