Eigen matrix library coefficient-wise modulo operation - c++

In one of the functions in the project I am working on, I need to find the remainder of each element of my eigen library matrix when divided by a given number. Here is the Matlab equivalent to what I want to do:
mod(X,num)
where X is the dividend matrix and num is the divisor.
What is the easiest way to achieve this?

You can use a C++11 lambda with unaryExpr:
MatrixXi A(4,4), B;
A.setRandom();
B = A.unaryExpr([](const int x) { return x%2; });
or:
int l = 2;
B = A.unaryExpr([&](const int x) { return x%l; });

For completeness, another solution would be to:
transform X to an eigen array (for coeffwise operations),
Apply the modulo formula a%b = a - (b * int(a/b))
C++ code that return an eigen array:
auto mod_array = X.array() - (num * (X.array()/num));
C++ code to get a matrix:
auto mod_matrix = (X.array() - (num * (X.array()/num))).matrix();
Note that, parentheses are important especially in (X.array()/num) as eigen will optimize (num * X.array()/num) to X.array() which is not what we expect.
The first version with eigen array is faster than version with unaryExpr.
The second version with matrix takes about the same time as version with unaryExpr.
If X contains float numbers, you will need to cast X.array() in (X.array()/num) to int

Related

Performance bottlenecks in fast evaluation of trig functions using Eigen and MEX

In a project using Matlab's C++ MEX API, I have to compute the value exp(j * 2pi * x) for over 100,000 values of x where x is always a positive double. I've written some helper functions that breakdown the computation into sin/cos using euler's formula. I then apply the method of range reduction to reduce my values to their corresponding points in the domain [0,T/4] where T is the period of the exponential I'm computing. I keep track of which quadrant in [0, T] the original value would have fallen into for later. I can then compute the trig function using a taylor series polynomial in horner form and apply the appropriate shift depending on which quadrant the original value was in. For further information on some of the concepts in this technique, check out this answer. Here is the code for this function:
Eigen::VectorXcd calcRot2(const Eigen::Ref<const Eigen::VectorXd>& idxt) {
Eigen::VectorXd vidxt = idxt.array() - idxt.array().floor();
Eigen::VectorXd quadrant = (vidxt.array()*2+0.5).floor();
vidxt.array() -= (quadrant.array()*0.5);
vidxt.array() *= 2*3.14159265358979;
const Eigen::VectorXd sq = vidxt.array()*vidxt.array();
Eigen::VectorXcd M(vidxt.size());
M.real() = fastCos2(sq);
M.imag() = fastSin2(vidxt,sq);
M = (quadrant.array() == 1).select(-M,M);
return M;
}
I profiled the code segment in which this function is called using std::chrono and averaged over 500 calls to the function (where each call to the mex function processes all 100,000+ values by calling calcRot2 in a loop. Each iteration passes about 200 values to calcRot2). I find the following average runtimes:
runtime with calcRot2: 75.4694 ms
runtime with fastSin/Cos commented out: 50.2409 ms
runtime with calcRot2 commented out: 30.2547 ms
Looking at the difference between the two extreme cases, it seems like calcRot has a large contribution to the runtime. However, only a portion of that comes from the sin/cos calculation. I would assume Eigen's implicit vectorization and the compiler would make the runtime of the other operations in the function effectively negligible. (floor shouldn't be a problem!) Where exactly is the performance bottleneck here?
This is the compilation command I'm performing (It uses MinGW64 which I think is the same as gcc):
mex(ipath,'CFLAGS="$CFLAGS -O3 -fno-math-errno -ffast-math -fopenmp -mavx2"','LDFLAGS="$LDFLAGS -fopenmp"','DAS.cpp','DAShelper.cpp')
Reference Code
For reference, here is the code segment in the main mex function where the timer is called, and the helper function that calls calcRot2():
MEX function call:
chk1 = std::chrono::steady_clock::now();
// Calculate beamformed signal at each point
Eigen::MatrixXcd bfVec(p.nPoints,1);
#pragma omp parallel for
for (int i = 0; i < p.nPoints; i++) {
calcPoint(idxt.col(i),SIG,p,bfVec(i));
}
chk2 = std::chrono::steady_clock::now();
auto diff3 = chk2 - chk1;
calcPoint:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
const Eigen::Ref<const Eigen::MatrixXcd>& SIG,
Parameters& p, std::complex<double>& bfVal) {
Eigen::VectorXcd pRot = calcRot2(idxt*p.fc/p.fs);
int j = 0;
for (auto x : idxt) {
if(x >= 0) {
int vIDX = static_cast<int>(x);
bfVal += (SIG(vIDX,j)*(vIDX + 1 - x) + SIG(vIDX+1,j)*(x - vIDX))*pRot(j);
}
j++;
}
}
Clarification
To clarify, the line
(vidxt.array()*2+0.5).floor()
is meant to yield:
0 if vidxt is between [0,0.25]
1 if vidxt is between [0.25,0.75]
2 if vidxt is between [0.75,1]
The idea here is that when vidxt is in the second interval (quadrants 2 and 3 on the unit circle for functions with period 2pi), then the value needs to map to its negative value. Otherwise, the range reduction maps the values to the correct values.
The benefits of Eigen's vectorization are outweighed because you evaluate your expressions into temporary vectors. Allocating, deallocating, filling and reading these vectors has cost that seems significant. This is especially so because the expressions themselves are relatively simple (just a few scalar operations).
Expression objects
What usually helps here is aggregating into fewer expressions. For example line 3 and 4 can be collapsed into one:
vidxt.array() = 2*3.14159265358979 * (vidxt.array() - quadrant.array()*0.5);
(BTW: Note that that math.h contains a constant M_PI with pi in double precision).
Beyond that, Eigen expressions can be combined and reused. Something like this:
auto vidxt0 = idxt.array() - idxt.array().floor();
auto quadrant = (vidxt0*2+0.5).floor();
auto vidxt = 2*3.14159265358979 * (vidxt0 - quadrant.array()*0.5);
auto sq = vidxt.array().square();
Eigen::VectorXcd M(vidxt.size());
M.real() = fastCos2(sq);
M.imag() = fastSin2(vidxt,sq);
M = (quadrant.array() == 1).select(-M,M);
Note that none of the auto values are vectors. They are expression objects that behave like arrays and can be evaluated into vectors or arrays.
You can pass these on to your fastCos2 and fastSin2 function by declaring them as templates. The typical Eigen pattern would be something like
template<Derived>
void fastCos2(const Eigen::ArrayBase<Derived>& sq);
The idea here is that ultimately, everything compiles into one huge loop that gets executed when you evaluate the expression into a vector or array. If you reference the same sub-expression multiple times, the compiler may be able to eliminate the redundant computations.
Unfortunately, I could not get any better performance out of this particular code, so it is no real help here but it is still something worth exploring in these kind of cases.
fastSin/Cos return value
Speaking of temporary vectors: You didn't include the code for your fastSin/Cos functions but it looks a lot like you return a temporary vector which is then copied into the real and imaginary parts or the actual return value. This is another temporary that you may want to avoid. Something like this:
template<class Derived1, class Derived2>
void fastCos2(const Eigen::MatrixBase<Derived1>& M, const Eigen::MatrixBase<Derived2>& sq)
{
Eigen::MatrixBase<Derived1>& M_mut = const_cast<Eigen::MatrixBase<Derived1>&>(M);
M_mut = sq...;
}
fastCos2(M.real(), sq);
Please refer to Eigen's documentation on the topic of function arguments.
The downside of this approach in this particular case is that now the output is not consecutive (real and imaginary parts are interleaved). This may affect vectorization negatively. You may be able to work around this by combining the sin and cos functions into one expression for both. Benchmarking is required.
Using a plain loop
As others have pointed out, using a loop may be easier in this particular case. You noted that this was slower. I have a theory why: You did not specify -DNDEBUG in your compile options. If you don't, all array indices in Eigen vectors are range-checked with an assertion. These cost time and prevent vectorization. If you include this compile flag, I find my code significantly faster than using Eigen expressions.
Alternatively, you can use raw C pointers to the input and output vector. Something like this:
std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
const double* iidxt = idxt.data();
std::complex<double>* iM = M.data();
for(std::ptrdiff_t j = 0; j < n; ++j) {
double ival = iidxt[j];
double vidxt = ival - std::floor(ival);
double quadrant = std::floor(vidxt * 2. + 0.5);
vidxt = (vidxt - quadrant * 0.5) * (2. * 3.14159265358979);
double sq = vidxt * vidxt;
// stand-in for sincos
std::complex<double> jval(sq, vidxt + sq);
iM[j] = quadrant == 1. ? -jval : jval;
}
Fixed sized arrays
To avoid the cost of memory allocation and make it easier for the compiler to avoid memory operations in the first place, it can help to run the computation on blocks of fixed size. Something like this:
std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
std::ptrdiff_t i;
for(i = 0; i + 4 <= n; i += 4) {
Eigen::Array4d idxt_i = idxt.segment<4>(i);
...
M.segment<4>(i) = ...;
}
if(i + 2 <= n) {
Eigen::Array2D idxt_i = idxt.segment<2>(i);
...
M.segment<2>(i) = ...;
i += 2;
}
if(i < n) {
// last index scalar
}
This kind of stuff needs careful tuning to ensure that vectorized code is generated and there are no unnecessary temporary values on the stack. If you can read assembler, Godbolt is very helpful.
Other remarks
Eigen includes vectorized versions of sin and cos. Have you compared your code to these instead of e.g. Eigen's complex exp function?
Depending on your math library, there is also an explicit sincos function to compute sine and cosine in one function. It is not vectorized but still saves time on range reduction. You can (usually) access it through std::polar. Try this:
Eigen::VectorXd scale = ...;
Eigen::VectorXd phase = ...;
// M = scale * exp(-2 pi j phase)
Eigen::VectorXd M = scale.binaryExpr(-2. * M_PI * phase,
[](double s, double p) noexcept -> std::complex<double> {
return std::polar(s, p);
});
If your goal is an approximation instead of a precise result, shouldn't your first step be to cast to single precision? Maybe after the range reduction to avoid losing too many decimal places. At the very least it will double the work done per clock cycle. Also, regular sine and cosine implementations take less time in float.
Edit
I had to correct myself on the cast to int64 instead of int. There is no vectorized conversion to int64_t until AVX512
The line (vidxt.array()*2+0.5).floor() bugs me slightly. This is meant to round down to negative infinity for [0, 0.5) and up to positive infinity for [0.5, 1), correct? vidxt is never negative. Therefore this line should be equivalent to (vidxt.array()*2).round(). With AVX2 and -ffast-math that saves one instruction. With SSE2 none of these actually vectorize, as can be seen on Godbolt

How do I input a quadratic equation in C++?

We were given an assignment to program the following, which I have been trying to figure out how to solve for the last 2 hours but to no avail.
How do you actually solve a complex formula having different operations in one mathematical expression?
For you to properly understand which of the operations are to be solved in order, try recreating the quadratic equation, ax^2 + bx + c, in C++ on your own!
Instructions:
The value of a, b, c, and x are already provided for you in the code editor. Remake the formula using C++'s math functions and operators and store it into one variable.
Print the value of the variable that stores the formula. To understand which to actually solve first in the equation, try tracing it out by yourself and manually solve it and see if your answer match that of the sample output.
Sample output: 16
TL;DR I am told to recreate the quadratic equation on my own using the given variables with their values.
Here is my current output which failed:
#include<iostream>
#include <cmath>
int main(void) {
int a = 2;
int b = 2;
int c = 4;
int x = 2;
// TODO:
// 1. Compute for the result of the quadratic equation
// using the variables provided above and the math library
double result;
result = (a * x + b * x) pow(2, 2) + c;
// 2. Print the output required by the output sample
std::cout << result;
return 0;
}
It prints out 4 when it should be 16.
You use this code:
result = a * pow(x, 2) + b * x + c;

Is Eigen library matrix/vector manipulation faster than .net ones if the matrix is dense and unsymmetrical?

I have some matrix operations, mostly dealing with operations like running over all the each of the rows and columns of the matrix and perform multiplication a*mat[i,j]*mat[ii,j]:
public double[] MaxSumFunction()
{
var maxSum= new double[vector.GetLength(1)];
for (int j = 0; j < matrix.GetLength(1); j++)
{
for (int i = 0; i < matrix.GetLength(0); i++)
{
for (int ii = 0; ii < matrix.GetLength(0); ii++)
{
double wi= Math.Sqrt(vector[i]);
double wii= Math.Sqrt(vector[ii]);
maxSum[j] += SomePowerFunctions(wi, wii) * matrix[i, j]*matrix[ii, j];
}
}
}
}
private double SomePowerFunctions(double wi, double wj)
{
var betaij = wi/ wj;
var numerator = 8 * Math.Sqrt(wi* wj) * Math.Pow(betaij, 3.0 / 2)
* (wi+ betaij * wj);
var dominator = Math.Pow(1 - betaij * betaij, 2) +
4 * wi* wj* betaij * (1 + Math.Pow(betaij, 2)) +
4 * (wi* wi+ wj* wj) * Math.Pow(betaij, 2);
if (wi== 0 && wj== 0)
{
if (Math.Abs(betaij - 1) < 1.0e-8)
return 1;
else
return 0;
}
return numerator / dominator;
}
I found such loops to be particularly slow if the matrix size is big.
I want the speed to be fast. So I am thinking about re-implementing these algorithms using the Eigen library.
My matrix is not symmetrical, not sparse and contains no regularity that any solver can exploit reliably.
I read that Eigen solver can be fast because of:
Compiler optimization
Vectorization
Multi-thread support
But I wonder those advantages are really applicable given my matrix characteristics?
Note: I could have just run a sample or two to find out, but I believe that asking the question here and have it documented on the Internet is going to help others as well.
Before thinking about low level optimizations, look at your code and observe that many quantities are recomputed many time. For instance, f(wi,wii) does not depend on j, so they could either be precomputed once (see below) or you can rewrite your loop to make the loop on j the nested one. Then the nested loop will simply be a coefficient wise product between a constant scalar and two columns of your matrix (I don't .net and assume j is indexing columns). If the storage if column-major, then this operation should be fully vectorized by your compiler (again, I don't know .net, but any C++ compiler will do, and if you Eigen, it will be vectorized explicitly). This should be enough to get a huge performance boost.
Depending on the sizes of matrix, you might also try to leverage optimized matrix-matrix implementation by precomputed f(wi,wii) into a MatrixXd F; (using Eigen's language), and then observe that the whole computation amount to:
VectorXd v = your_vector;
MatrixXd F = MatrixXd::nullaryExpr(n,n,[&](Index i,Index j) {
return SomePowerFunctions(sqrt(v(i)), sqrt(v(j)));
});
MatrixXd M = your_matrix;
MatrixXd FM = F * M;
VectorXd maxSum = (M.array() * FM.array()).colwise().sum();

From Matlab to C++ Eigen matrix operations - vector normalization

Converting some Matlab code to C++.
Questions (how to in C++):
Concatenate two vectors in a matrix. (already found the solution)
Normalize each array ("pts" col) dividing it by its 3rd value
Matlab code for 1 and 2:
% 1. A 3x1 vector. d0, d1 double.
B = [d0*A (d0+d1)*A]; % B is 3x2
% 2. Normalize a set of 3D points
% Divide each col by its 3rd value
% pts 3xN. C 3xN.
% If N = 1 you can do: C = pts./pts(3); if not:
C = bsxfun(#rdivide, pts, pts(3,:));
C++ code for 1 and 2:
// 1. Found the solution for that one!
B << d0*A, (d0 + d1)*A;
// 2.
for (int i=0, i<N; i++)
{
// Something like this, but Eigen may have a better solution that I don't know.
C.block<3,1>(0,i) = C.block<3,1>(0,i)/C(0,i);
}
Edit:
I hope the question is more clear now².
For #2:
C = C.array().rowwise() / C.row(2).array();
Only arrays have multiplication and division operators defined for row and column partial reductions. The array converts back to a matrix when you assign it back into C

Avoid numerical underflow when obtaining determinant of large matrix in Eigen

I have implemented a MCMC algorithm in C++ using the Eigen library. The main part of the algorithm is a loop in which first some some matrix calculations are performed after which the determinant of the resulting matrix is obtained and added to the output. E.g.:
MatrixXd delta0;
NumericVector out(3);
out[0] = 0;
out[1] = 0;
for (int i = 0; i < s; i++) {
...
delta0 = V*(A.cast<double>()-(A+B).cast<double>()*theta.asDiagonal());
...
I = delta0.determinant()
out[1] += I;
out[2] += std::sqrt(I);
}
return out;
Now on certain matrices I unfortunately observe a numerical underflow so that the determinant is outputted as zero (which it actually isn't).
How can I avoid this underflow?
One solution would be to obtain, instead of the determinant, the log of the determinant. However,
I do not know how to do this;
how could I then add up these logs?
Any help is greatly appreciated.
There are 2 main options that come to my mind:
The product of eigenvalues of square matrix is the determinant of this matrix, therefore a sum of logarithms of each eigenvalue is a logarithm of the determinant of this matrix. Assume det(A) = a and det(B) = b for compact notation. After applying aforementioned for 2 matrices A and B, we end up with log(a) and log(b), then actually the following is true:
log(a + b) = log(a) + log(1 + e ^ (log(b) - log(a)))
Yes, we get a logarithm of the sum. What would you do with it next? I don't know, depends on what you have to. If you have to remove logarithm by e ^ log(a + b) = a + b, then you might be lucky that the value of a + b does not underflow now, but in some cases it can still underflow as well.
Perform clever preconditioning; there might be tons of options here, and you better read about them from some trusted sources as this is a serious topic. The simplest (and probably the cheapest ever) example of preconditioning for this particular problem could be to recall that det(c * A) = (c ^ n) * det(A), where A is n by n matrix, and to premultiply your matrix with some c, compute the determinant, and then to divide it by c ^ n to get the actual one.
Update
I thought about one more option. If on the last stages of #1 or #2 you still experience underflow too frequently, then it might be a good idea to increase precision specifically for these last operations, for example, by utilizing GNU MPFR.
You can use Householder elimination to get the QR decomposition of delta0. Then the determinant of the Q part is +/-1 (depending on whether you did an even or odd number of reflections) and the determinant of the R part is the product of the diagonal elements. Both of these are easy to compute without running into underflow hell---and you might not even care about the first.