Sum of product: can we vectorize the following in C++? (using Eigen or other libraries) - c++

UPDATE: the (sparse) three-dimensional matrix v in my question below is symmetric: v(i1,i2,i3) = v(j1,j2,j3) where (j1,j2,j3) is any of the 6 permutations of (i1,i2,i3), i.e.
v(i1,i2,i3) = v(i1,i3,i2) = v(i2,i3,i1) = v(i2,i1,i3) = v(i3,i1,i2) = v(i3,i2,i1).
Moreover, v(i1,i2,i3) != 0 only when i1 != i2 && i1 != i3 && i2 != i3.
E.g. v(i,i,j) = 0, v(i, k, k) = 0, v(k, j, k) = 0, etc...
I thought that with this additional information, I could already get a significant speed-up by doing the following:
Remark: v contains duplicate values (a triplet (i,j,k) has 6 permutations, and the values of v for these 6 are the same).
So I defined a more compact matrix u that contains only the non-duplicate values of v. The indices of u are (i1,i2,i3) where i1 < i2 < i3. The length of u is equal to the length of v divided by 6.
I computed the sum by iterating over the new value vector and the new index vectors.
With this, I only got a little speed-up. I realized that instead of iterating N times doing a multiplication each time, I iterated N/6 times doing 6 multiplications each time, and that's pretty much the same as before :(
Hope somebody could come up with a better solution.
--- (Original question) ---
In my program I have an expensive operation that is repeated every iteration.
I have three n-dimensional vectors x1, x2 and x3 that are supposed to change every iteration.
I have four N-dimensional vectors I1, I2, I3 and v that are pre-defined and will not change, where:
I1, I2 and I3 contain the indices of respectively x1, x2 and x3 (the elements in I_i are between 0 and n-1)
v is a vector of values.
For example:
We can see v as a (reshaped) sparse three-dimensional matrix, each index k of v corresponds to a triplet (i1,i2,i3) of indices of x1, x2, x3.
I want to compute at each iteration three n-dimensional vectors y1, y2 and y3 defined by:
y1[i1] = sum_{i2,i3} v(i1,i2,i3)*x2(i2)*x3(i3)
y2[i2] = sum_{i1,i3} v(i1,i2,i3)*x1(i1)*x3(i3)
y3[i3] = sum_{i1,i2} v(i1,i2,i3)*x1(i1)*x2(i2)
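For reference, the sum for y1 can be written as a single pass over the N stored entries. This is only a minimal sketch, assuming the data are held in std::vector; the function name contract_y1 is made up for illustration and is not part of Eigen or any other library. The loops for y2 and y3 would look the same with the roles of the index vectors swapped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y1[i1] = sum over stored entries k with I1[k] == i1 of v[k] * x2[I2[k]] * x3[I3[k]].
// contract_y1 is a hypothetical helper name used only for this sketch.
std::vector<double> contract_y1(const std::vector<int>& I1,
                                const std::vector<int>& I2,
                                const std::vector<int>& I3,
                                const std::vector<double>& v,
                                const std::vector<double>& x2,
                                const std::vector<double>& x3,
                                std::size_t n)
{
    assert(I1.size() == v.size() && I2.size() == v.size() && I3.size() == v.size());
    std::vector<double> y1(n, 0.0);
    for (std::size_t k = 0; k < v.size(); ++k) {
        // Two indirect reads (gathers) and one indirect update (scatter) per entry.
        y1[I1[k]] += v[k] * x2[I2[k]] * x3[I3[k]];
    }
    return y1;
}
```

The indirect accesses inside the loop are what make this hard to vectorize automatically.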
More precisely what the program does is:
Repeat:
Compute y1 then update x1 = f(y1)
Compute y2 then update x2 = f(y2)
Compute y3 then update x3 = f(y3)
where f is some external function.
I would like to know if there is a C++ library that helps me to do so as fast as possible. Using for loops is just too slow.
Thank you very much for your help!
Update: Looks like it's not easy to get a better solution than the straight-forward for loops. If the vector of indices I1 above is ordered in non-decreasing order, can we compute y1 faster?
For example: I1 = [0 0 0 0 1 1 2 2 2 3 3 3 ... n n].

The simple answer is no, at least not trivially. Your access pattern (e.g. x2(i2)*x3(i3)) does not (at least at compile time) access contiguous memory, but rather goes through a layer of indirection. Because of this, SIMD instructions are pretty useless, as they work on contiguous chunks of memory. What you may want to consider is creating a copy of xM ordered according to iM, which removes the layer of indirection. This should reduce the number of cache misses that xM(iM) generates, and since it is accessed N times, that may reduce some of the wall time (assuming N is large).
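The reordered-copy idea can be sketched as three separate passes: gather, contiguous multiply, scatter. This is a rough illustration under the assumption that everything lives in std::vector; the helper names gather, triple_product and scatter_add are invented here. Only the middle pass is free of indirection, so only that pass is a candidate for auto-vectorization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy x into the order given by idx, so that g[k] == x[idx[k]].
std::vector<double> gather(const std::vector<double>& x, const std::vector<int>& idx)
{
    std::vector<double> g(idx.size());
    for (std::size_t k = 0; k < idx.size(); ++k) g[k] = x[idx[k]];
    return g;
}

// Element-wise product over contiguous arrays: no indirection in this loop,
// so a compiler is free to vectorize it.
std::vector<double> triple_product(const std::vector<double>& v,
                                   const std::vector<double>& g2,
                                   const std::vector<double>& g3)
{
    assert(v.size() == g2.size() && v.size() == g3.size());
    std::vector<double> p(v.size());
    for (std::size_t k = 0; k < v.size(); ++k) p[k] = v[k] * g2[k] * g3[k];
    return p;
}

// Accumulate the products into y1; this step is still an indirect update.
std::vector<double> scatter_add(const std::vector<double>& p,
                                const std::vector<int>& I1,
                                std::size_t n)
{
    assert(p.size() == I1.size());
    std::vector<double> y1(n, 0.0);
    for (std::size_t k = 0; k < p.size(); ++k) y1[I1[k]] += p[k];
    return y1;
}
```

Whether this pays off depends on how often the gathered copies have to be rebuilt: in the question x2 and x3 change every iteration, so the gather pass itself still performs the indirect reads each time.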
If maximal accuracy is not critical, you may want to consider using a FFT method instead of the convolution (at least, that's how I understood your question. Feel free to correct me if I'm wrong).
Assuming you are doing a convolution and the vectors (a and b, same size as in your question) are large, the result (c) can be calculated naïvely as
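// O(n^2) circular convolution (a, b and c all have the same length)
for (int i = 0; i < c.size(); i++) {
    c(i) = 0;
    for (int j = 0; j < a.size(); j++)
        c(i) += a(j) * b((i - j + b.size()) % b.size());
}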
Using the convolution theorem, you could instead take the Fourier transform of both a and b, perform an element-wise multiplication, and then take the inverse Fourier transform of the result to get c (the result will probably differ a little):
// O(n log(n)); note A, B, and C are vectors of complex floating point numbers.
// Assuming fft is an Eigen::FFT object: its fwd() and inv() take the destination first.
fft.fwd(A, a);
fft.fwd(B, b);
C = A.array() * B.array();
fft.inv(c, C);
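To make the convolution-theorem route concrete without depending on any library, here is a self-contained sketch that uses a small recursive radix-2 transform as a stand-in for a real FFT library. The names fft_radix2 and circular_convolve_fft are made up for this example, the input length is assumed to be a power of two, and the result is the circular convolution c[i] = sum_j a[j] * b[(i - j) mod n].

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Recursive radix-2 Cooley-Tukey transform; a.size() must be a power of two.
// sign = -1 gives the forward transform, sign = +1 the unnormalized inverse.
cvec fft_radix2(const cvec& a, int sign)
{
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    if (n == 1) return a;
    cvec even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    const cvec E = fft_radix2(even, sign);
    const cvec O = fft_radix2(odd, sign);
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        // Twiddle factor exp(sign * 2*pi*i * k / n).
        const double angle = sign * 2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const std::complex<double> w = std::polar(1.0, angle);
        out[k] = E[k] + w * O[k];
        out[k + n / 2] = E[k] - w * O[k];
    }
    return out;
}

// Circular convolution via: transform both inputs, multiply element-wise, transform back.
std::vector<double> circular_convolve_fft(const std::vector<double>& a,
                                          const std::vector<double>& b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    cvec A(a.begin(), a.end()), B(b.begin(), b.end());
    A = fft_radix2(A, -1);
    B = fft_radix2(B, -1);
    for (std::size_t k = 0; k < n; ++k) A[k] *= B[k];
    const cvec C = fft_radix2(A, +1);
    std::vector<double> c(n);
    for (std::size_t i = 0; i < n; ++i) c[i] = C[i].real() / static_cast<double>(n);
    return c;
}
```

The results agree with the direct double loop only up to floating-point rounding, which is the small difference mentioned above.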

Related

Given N lines on a Cartesian plane. How to find the bottommost intersection of lines efficiently?

I have N distinct lines on a cartesian plane. Since slope-intercept form of a line is, y = mx + c, slope and y-intercept of these lines are given. I have to find the y coordinate of the bottommost intersection of any two lines.
I have implemented a O(N^2) solution in C++ which is the brute-force approach and is too slow for N = 10^5. Here is my code:
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>
using namespace std;

int main() {
    int n;
    cin >> n;
    vector<pair<int, int>> lines(n); // (slope, y-intercept)
    for (int i = 0; i < n; ++i) {
        int slope, y_intercept;
        cin >> slope >> y_intercept;
        lines[i].first = slope;
        lines[i].second = y_intercept;
    }
    double min_y = 1e9;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            // since lines are distinct, two lines with the same slope never intersect
            if (lines[i].first == lines[j].first)
                continue;
            double x = (double) (lines[j].second - lines[i].second) / (lines[i].first - lines[j].first); // x-coordinate of intersection point
            double y = lines[i].first * x + lines[i].second; // y-coordinate of intersection point
            min_y = min(y, min_y);
        }
    }
    cout << min_y << endl;
}
How to solve this efficiently?
In case you are considering solving this by means of Linear Programming (LP), it could be done efficiently, since the solution which minimizes or maximizes the objective function always lies in the intersection of the constraint equations. I will show you how to model this problem as a maximization LP. Suppose you have N=2 first degree equations to consider:
y = 2x + 3
y = -4x + 7
then you will set up your simplex tableau like this:
x0   x1   x2   x3   b
-2    1    1    0   3
 4    1    0    1   7
where column x0 holds the negated coefficient of "x" in each original first-degree function, x1 holds the coefficient of "y" (generally +1), x2 and x3 form the N-by-N identity matrix (they are the slack variables), and b holds the independent term. In this case, the constraints use the <= operator.
Now, the objective function should be:
x0 x1 x2 x3
1 1 0 0
To solve this LP, you may use the "simplex" algorithm which is generally efficient.
Furthermore, the result will be an array representing the assigned values to each variable. In this scenario the solution is:
x0             x1             x2    x3
0.6666666667   4.3333333333   0.0   0.0
The pair (x0, x1) represents the point you are looking for, where x0 is its x-coordinate and x1 is its y-coordinate. Other outcomes are possible; for example, there may be no solution. You can find out more in plenty of books, such as "Linear Programming and Extensions" by George Dantzig.
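As a quick cross-check of the numbers quoted above, the two example lines can also be intersected directly. This small sketch is independent of any LP solver, and the helper name intersect is invented for the illustration.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Intersection of y = m1*x + c1 and y = m2*x + c2; the slopes must differ.
// Returns the point as (x, y).
std::pair<double, double> intersect(double m1, double c1, double m2, double c2)
{
    assert(m1 != m2);
    const double x = (c2 - c1) / (m1 - m2);
    return {x, m1 * x + c1};
}
```

For y = 2x + 3 and y = -4x + 7 this gives x = 4/6 and y = 13/3, matching the values 0.6666666667 and 4.3333333333 in the solution row.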
Keep in mind that the simplex algorithm only works for non-negative values of x0, x1, ..., xn. This means that before applying the simplex, you must make sure the optimum point you are looking for is not outside of the feasible region.
EDIT 2:
I believe making the problem feasible could be done easily in O(N) by shifting the original functions into a new position by means of adding a big factor to the independent terms of each function. Check my comment below. (EDIT 3: this implies it won't work for every possible scenario, though it's quite easy to implement. If you want an exact answer for any possible scenario, check the following explanation on how to convert the infeasible quadrants into the feasible back and forth)
EDIT 3:
A better method to address this problem, one that is capable of precisely inferring the minimum point even if it is in the negative side of either x or y: converting to quadrant 1 all of the other 3.
Consider the following generic first degree function template:
f(x) = mx + k
Consider the following generic cartesian plane point template:
p = (p0, p1)
Converting a function and a point from y-negative quadrants to y-positive:
y_negative_to_y_positive( f(x) ) = -mx - k
y_negative_to_y_positive( p ) = (p0, -p1)
Converting a function and a point from x-negative quadrants to x-positive:
x_negative_to_x_positive( f(x) ) = -mx + k
x_negative_to_x_positive( p ) = (-p0, p1)
Summarizing:
Quadrant     Sign of (x, y)   Converting f(x) or p to Q1
Quadrant 1   (+, +)           f(x)
Quadrant 2   (-, +)           x_negative_to_x_positive( f(x) )
Quadrant 3   (-, -)           y_negative_to_y_positive( x_negative_to_x_positive( f(x) ) )
Quadrant 4   (+, -)           y_negative_to_y_positive( f(x) )
Now convert the functions from quadrants 2, 3 and 4 into quadrant 1. Run simplex 4 times, one based on the original quadrant 1 and the other 3 times based on the converted quadrants 2, 3 and 4. For the cases originating from a y-negative quadrant, you will need to model your simplex as a minimization instance, with negative slack variables, which will turn your constraints to the >= format. I will leave to you the details on how to model the same problem based on a minimization task.
Once you have the results for each quadrant, you will have at most 4 points at hand (because you might find, for example, that there is no point in a specific quadrant). Convert each of them back to its original quadrant by applying the original conversion in reverse.
Now you may freely compare the 4 points with each other and decide which one is the one you need.
EDIT 1:
Note that you may have the quantity N of first degree functions as huge as you wish.
Other methods for solving this problem could be better.
EDIT 3: Check out the complexity of simplex. In the average case scenario, it works efficiently.
Cheers!

Efficient sparse matrix addition in Armadillo

I am trying to construct a sparse matrix L of the form
L = sum_{i} Hi^T * Hi (summing over the p row vectors Hi)
L and Hi are respectively a very sparse matrix and row vector. The final L matrix should have a density of around 1%.
Armadillo provides an arma::sp_mat class that seems to suit my needs. The assembly of L then looks like this
arma::sp_mat L(N,N);
arma::sp_mat Hi(1,N);
for (int i = 0; i < p; ++ i){
// The non-zero terms in Hi are populated here
L += Hi.t() * Hi;
}
The number of non-zero elements in Hi is constant with i. I do not have much experience with sparse matrices but I was expecting the incremental assembly of L to be relatively constant in speed.
Yet, it seems that the speed at which Hi.t() * Hi is added to L decreases over time. Am I doing something wrong in the way I assemble L? Should I preconstruct L by specifying which of its components I know will not be zero?
It seems that L is not initialized so that it effectively changes size when incremented with Hi.t() * Hi. This was likely the cause for the decrease in the speed.
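One way to avoid growing L on every update is to collect all contributions first and build the matrix once at the end. The sketch below shows that idea in plain C++ with an invented Triplet struct and invented helper names; it deliberately does not use Armadillo's API. Armadillo's documentation describes batch-insertion constructors for sp_mat that accept location and value lists of this kind, so check there for the exact form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// One non-zero contribution to L (hypothetical type for this sketch).
struct Triplet { int row; int col; double val; };

// Append the non-zeros of the outer product h^T h, where the sparse row vector h
// is given as parallel arrays of column indices and values.
void append_outer_product(const std::vector<int>& cols,
                          const std::vector<double>& vals,
                          std::vector<Triplet>& out)
{
    assert(cols.size() == vals.size());
    for (std::size_t a = 0; a < cols.size(); ++a)
        for (std::size_t b = 0; b < cols.size(); ++b)
            out.push_back({cols[a], cols[b], vals[a] * vals[b]});
}

// Sort by (col, row) and sum duplicates, so each non-zero of L appears exactly once.
std::vector<Triplet> merge_duplicates(std::vector<Triplet> t)
{
    std::sort(t.begin(), t.end(), [](const Triplet& p, const Triplet& q) {
        return std::tie(p.col, p.row) < std::tie(q.col, q.row);
    });
    std::vector<Triplet> merged;
    for (const Triplet& e : t) {
        if (!merged.empty() && merged.back().row == e.row && merged.back().col == e.col)
            merged.back().val += e.val;
        else
            merged.push_back(e);
    }
    return merged;
}
```

The loop over i would call append_outer_product once per Hi, and merge_duplicates would run a single time after the loop, so the storage for L is allocated once rather than resized at every step.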

Product of a multi-dimensional array (or tensor) and vectors

I would like to ask for a fast way to perform the following operations, either in native Matlab, C++, or using toolboxes/libraries, whichever would give the fastest solutions.
Let M be a tensor of D dimensions: n1 x n2 x... x nD, and let v1, v2,..., vD be D vectors whose dimensions are respectively n1, n2,..., nD.
Compute the product M*vi (1 <= i <= D). The result is a multi-dimensional array of (D-1) dimensions.
Compute the product of M with all vectors, except vi.
For example, with D = 3:
The product of M and v1 is a tensor N of 2 dimensions (i.e. a matrix) where
N[i2][i3] = Sum_over_i1 of M[i1][i2][i3]*v1[i1]
The product of M and v2 is a matrix N where
N[i1][i3] = Sum_over_i2 of M[i1][i2][i3]*v2[i2]
The product of M and v2 and v3 is a vector v where
v[i1] = Sum_over_i2 of (Sum_over_i3 of M[i1][i2][i3]*v2[i2]*v3[i3])
A further question: the above but for sparse tensors.
An example of Matlab code is given below.
Thank you very much in advance for your help!!
n1 = 3;
n2 = 5;
n3 = 4;
M = randn(n1,n2,n3);
v1 = randn(n1,1);
v2 = randn(n2,1);
v3 = randn(n3,1);
%% N = M*v2
N = zeros(n1,n3);
for i1=1:n1
for i3=1:n3
for i2=1:n2
N(i1,i3) = N(i1,i3) + M(i1,i2,i3)*v2(i2);
end
end
end
%% v = M*v2*v3
v = zeros(n1,1);
for i1=1:n1
for i2=1:n2
for i3=1:n3
v(i1) = v(i1) + M(i1,i2,i3)*v2(i2)*v3(i3);
end
end
end
I've noticed that the operation you are describing takes (D - 1)-dimensional slices of M, scales them by the corresponding entries of vi, and then sums the result over the indices of vi. This code seems to work for getting N in your example:
N2 = squeeze(sum(M.*(v2)', 2));
To get v in your code, all you need to do is multiply N2 by v3:
v2 = N2*v3;
EDIT
On older versions of MATLAB the element-wise operator .* doesn't work the way I've used it above. One alternative is bsxfun:
N2 = squeeze(sum(bsxfun(@times, M, v2'), 2));
Just checked: In terms of performance, the bsxfun way seems as fast as the .* way for large arrays, at least on R2016b.
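Since the question also allows C++, the same product of M with v2 can be sketched on a flat array. This is a minimal illustration that assumes M is stored column-major, as MATLAB does, so element (i1,i2,i3) sits at offset i1 + n1*(i2 + n2*i3); the function name mode2_product is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// N(i1,i3) = sum over i2 of M(i1,i2,i3) * v2(i2).
// M is column-major: element (i1,i2,i3) is at i1 + n1*(i2 + n2*i3).
// The result is an n1 x n3 matrix, also column-major: N(i1,i3) is at i1 + n1*i3.
std::vector<double> mode2_product(const std::vector<double>& M,
                                  std::size_t n1, std::size_t n2, std::size_t n3,
                                  const std::vector<double>& v2)
{
    assert(M.size() == n1 * n2 * n3 && v2.size() == n2);
    std::vector<double> N(n1 * n3, 0.0);
    for (std::size_t i3 = 0; i3 < n3; ++i3)
        for (std::size_t i2 = 0; i2 < n2; ++i2)
            // Innermost loop runs over i1, which is contiguous in memory.
            for (std::size_t i1 = 0; i1 < n1; ++i1)
                N[i1 + n1 * i3] += M[i1 + n1 * (i2 + n2 * i3)] * v2[i2];
    return N;
}
```

Keeping i1 as the innermost loop means both M and N are read and written with unit stride, which is usually the friendlier order for the cache and for auto-vectorization.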

Divide and conquer algorithm: find the minimum of a matrix

I'm starting to learn how to implement divide and conquer algorithms, but I'm having some serious trouble with this exercise.
I have written an algorithm which finds the minimum value in a given vector using the divide and conquer method:
int minimum (int v[], int inf, int sup)
{
int med, m1, m2;
if (inf == sup)
return v[inf];
med = (inf+sup)/2;
m1 = minimum (v, inf, med);
m2 = minimum (v, med+1, sup);
if (m1 < m2)
return m1;
else
return m2;
}
And it works. Now, I have to do the same exercise on a matrix, but I'm getting lost. Specifically, I have been told to do the following:
Let n = 2^k. Consider a nxn square matrix. Calculate its minimum value using a recursive function
double minmatrix(Matrix M)
{
    return minmatrix2(M, 0, 0, M.row);
}
(the Matrix type is given, and as you can imagine M.row gives the number of rows and columns of the matrix). Use an auxiliary function
double minmatrix2(Matrix M, int i, int j, int m)
This has to be done using a recursive divide and conquer algorithm.
So.. I can't figure out a way of doing it. I have been given the suggestion of splitting the matrix in 4 parts each time (from (i,j) to (i+m/2, j+m/2), from (i+m/2,j) to (i+m,j+m/2), from (i,j+m/2) to (i+m/2,j+m), from (i+m/2,j+m/2) to (i+m,j+m)) and try to implement a code working in a similar way to the one I have written for the array.. but I just seem to be unable to do it. Any suggestions? Even if you don't want to post a complete answer, just give me some indications. I really want to understand this.
EDIT: All right, I've done this. I'm posting the code I have used just in case someone else has the same doubt.
double minmatrix2(Matrix M, int i, int j, int m)
{
    // double rather than int, so that the values are not truncated
    double a1, a2, a3, a4;
    if (m == 1)
        return M[i][j];
    // minimum of each of the four (m/2) x (m/2) quadrants
    a1 = minmatrix2(M, i, j, m/2);
    a2 = minmatrix2(M, i+(m/2), j, m/2);
    a3 = minmatrix2(M, i, j+(m/2), m/2);
    a4 = minmatrix2(M, i+(m/2), j+(m/2), m/2);
    return min (min (a1, a2), min (a3, a4));
}
(function min defined elsewhere)
Consider that a 2D matrix in C or C++ is often implemented as accessor functions on top of a 1D array. You already know how to do this for a 1D array, so the only difference is how you address the cells. If you do this, your performance will intrinsically be optimal because you will address neighboring cells together.
Alternatively, consider that a 2D matrix has two dimensions N and M. Just break it in half along the larger dimension repeatedly until the larger dimension is less than X, some reasonable value to stop and do the actual computation sequentially. This is not entirely optimal because you will have to "skip" over parts of the matrix as you address memory.
A final idea is to divide by the major dimension first, then the minor one. In C this means divide by rows until you have single rows, then run your 1D array algorithm on each row. This produces roughly optimal performance.
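The first suggestion, treating the matrix as a 1D array and reusing the vector algorithm, can be sketched like this. The FlatMatrix struct is a hypothetical row-major stand-in, since the actual Matrix type from the exercise is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix: element (r, c) is stored at data[r * cols + c].
struct FlatMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;
};

// Divide-and-conquer minimum over data[lo..hi], both ends inclusive.
double minimum(const std::vector<double>& data, std::size_t lo, std::size_t hi)
{
    if (lo == hi) return data[lo];
    const std::size_t mid = lo + (hi - lo) / 2;
    const double m1 = minimum(data, lo, mid);
    const double m2 = minimum(data, mid + 1, hi);
    return m1 < m2 ? m1 : m2;
}

// The matrix minimum is just the minimum of its underlying 1D storage.
double min_matrix(const FlatMatrix& m)
{
    assert(!m.data.empty() && m.data.size() == m.rows * m.cols);
    return minimum(m.data, 0, m.data.size() - 1);
}
```

Each recursive call works on one contiguous range of memory, which is why this variant addresses neighboring cells together.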

2D rigid body physics using runge kutta

Does anyone know any c++/opengl sourcecode demos for 2D rigid body physics using runge kutta?
I want to build a physics engine but I need some reference code to understand better how others have implemented this.
There are a lot of things you have to take care of to do this nicely. I will focus on the integrator implementation and on what I have found works well for me.
For all the degrees of freedom in your system implement a function to return the accelerations a as a function of time t, positions x and velocities v. This should operate on arrays or vectors of quantities and not just scalars.
a = accel(t,x,v);
After each RK step evaluate the acceleration to be ready for the next step. In the loop then do this:
{
// assume t,x[],v[], a[] are known
// step time t -> t+h and calc new values
float h2=h/2;
vec q1 = v + h2*a;
vec k1 = accel(t+h2, x+h2*v, q1);
vec q2 = v + h2*k1;
vec k2 = accel(t+h2, x+h2*q1, q2);
vec q3 = v + h*k2;
vec k3 = accel(t+h, x+h*q2, q3);
float h6 = h/6;
t = t + h;
x = x + h*(v+h6*(a+k1+k2));
v = v + h6*(a+2*k1+2*k2+k3);
a = accel(t,x,v);
}
Why? Well, the standard RK method requires you to build a state vector of 2N elements, but the derivatives of the first N elements are simply equal to the last N elements. If you split the problem up into two state vectors of N elements each and simplify a little, you arrive at the above scheme, which is the classical fourth-order RK method specialised to second-order equations.
I have done this and the results are identical to those of commercial software for a planar system with N=6 degrees of freedom.
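To see the scheme in action, here is a scalar sketch that applies the same step to a simple test problem. The problem x'' = -x with x(0) = 1, v(0) = 0 is an assumption chosen because its exact solution is x = cos(t); the State struct and the names rk_step and integrate are invented for the example, and plain doubles replace the vec type.

```cpp
#include <cassert>
#include <cmath>

// Time, position, velocity and acceleration of a single degree of freedom.
struct State { double t; double x; double v; double a; };

// Acceleration of a unit-mass, unit-stiffness spring: x'' = -x (assumed test problem).
double accel(double /*t*/, double x, double /*v*/) { return -x; }

// One step t -> t + h of the scheme described above.
State rk_step(const State& s, double h)
{
    const double h2 = h / 2;
    const double h6 = h / 6;
    const double q1 = s.v + h2 * s.a;
    const double k1 = accel(s.t + h2, s.x + h2 * s.v, q1);
    const double q2 = s.v + h2 * k1;
    const double k2 = accel(s.t + h2, s.x + h2 * q1, q2);
    const double q3 = s.v + h * k2;
    const double k3 = accel(s.t + h, s.x + h * q2, q3);
    State n;
    n.t = s.t + h;
    n.x = s.x + h * (s.v + h6 * (s.a + k1 + k2));
    n.v = s.v + h6 * (s.a + 2 * k1 + 2 * k2 + k3);
    n.a = accel(n.t, n.x, n.v); // ready for the next step
    return n;
}

// Integrate from x(0) = 1, v(0) = 0 for the given number of steps.
State integrate(double h, int steps)
{
    assert(h > 0 && steps >= 0);
    State s{0.0, 1.0, 0.0, 0.0};
    s.a = accel(s.t, s.x, s.v);
    for (int i = 0; i < steps; ++i) s = rk_step(s, h);
    return s;
}
```

Comparing integrate(0.01, 100) against cos(1) and -sin(1) is a simple way to check that the step has been transcribed correctly before moving on to a full rigid body model.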