Matlab VS. C++ in matrix calculation


I am using C++ with the Armadillo library to do some matrix calculations.
I tried to make the code similar to my Matlab version.
But when I run it, C++ takes about 20 min, while Matlab takes about 2-3 min.
I searched a bit and found that other people have also asked why C++ is slower than Matlab for matrix calculations.
But I had heard that C++ is much faster than Matlab, so I am wondering whether C++ is generally not as good as Matlab for matrix calculations.
Below is just part of my code.
Is there any way I can speed up C++ matrix calculations?
Should I use a different library?
while (dif >= tol && it <= itmax) {
    it = it + 1;
    V = Vnew;
    Vfuture = beta * (Ptrans(0) * Vnew.slice(0) + Ptrans(1) * Vnew.slice(1) + Ptrans(2) * Vnew.slice(2));
    for (int a = 0; a < Na; a++) {
        for (int b = 0; b < Nd; b++) {
            for (int c = 0; c < Ny; c++) {
                Mat<double> YY(Na, Nd);
                YY.fill(Y(c));
                Mat<double> AA(Na, Nd);
                AA.fill(A(a));
                Mat<double> DD(Na, Nd);
                DD.fill(D(b));
                Mat<double> CC = YY + AA - mg_A_v / R - (mg_D_v - (1 - delta) * DD);
                Mat<double> Val = 1 / (1 - 1 / sig) * pow(pow(CC, psi) % pow(mg_D_v, 1 - psi), (1 - 1 / sig)) + Vfuture;
                double max_val = Val.max();
                uword maxindex_val = Val.index_max();
                int index_column = maxindex_val / Na; // column
                int index_row = maxindex_val - index_column * Na; // row
                Vnew(a, b, c) = max_val;
                maxposition_a(a, b, c) = index_row;
                maxposition_d(a, b, c) = index_column;
            }
        }
    }
    // Howard improvement
    for (int h = 0; h < H; h++) {
        Vhoward = Vnew;
        for (int i = 0; i < Na; i++) {
            for (int j = 0; j < Nd; j++) {
                for (int k = 0; k < Ny; k++) {
                    temphoward(i, j) = beta * Vhoward(maxposition_a(i, j, k), maxposition_d(i, j, k), 0) * Ptrans(0) + beta * Vhoward(maxposition_a(i, j, k), maxposition_d(i, j, k), 1) * Ptrans(1) + beta * Vhoward(maxposition_a(i, j, k), maxposition_d(i, j, k), 2) * Ptrans(2);
                    Vnew(i, j, k) = temphoward(i, j) + utility(Y(k) + A(i) - A(maxposition_a(i, j, k)) / R - D(maxposition_d(i, j, k)) + (1 - delta) * D(j), D(maxposition_d(i, j, k)), sig, psi);
                }
            }
        }
    }
    tempdiff = abs(V - Vnew);
    dif = tempdiff.max();
    cout << dif << endl;
    cout << it << endl;
}
And this is the corresponding part of the Matlab code.
while dif >= tol && it <= itmax
    tic;
    it = it + 1;
    V = Vnew;
    vFuture = beta*reshape(V,Na*Nd,Ny)*P;
    for i_a = 1:Na %Loop over state variable a
        for i_d = 1:Nd %Loop over state variable d
            for i_y = 1:Ny %Loop over state variable y
                val = reshape(Utility(Y(i_y) + A(i_a) - mg_A_v/R - (mg_D_v - (1-delta)*D(i_d)),mg_D_v),Na*Nd,1) + vFuture;
                [Vnew(i_a,i_d,i_y), indpol(i_a,i_d,i_y)] = max(val);
                [indpol_ap(i_a,i_d,i_y),indpol_dp(i_a,i_d,i_y)] = ind2sub([Na,Nd],indpol(i_a,i_d,i_y));
            end
        end
    end
    % Howard improvement step
    for h = 1:H
        Vhoward = Vnew;
        for i_a = 1:Na %Loop over state variable a
            for i_d = 1:Nd %Loop over state variable d
                for i_y = 1:Ny %Loop over state variable y
                    Vnew(i_a,i_d,i_y) = Utility(Y(i_y) + A(i_a) - A(indpol_ap(i_a,i_d,i_y))/R - ...
                        (D(indpol_dp(i_a,i_d,i_y)) - (1-delta)*D(i_d)),D(indpol_dp(i_a,i_d,i_y))) ...
                        + beta*reshape(Vhoward(indpol_ap(i_a,i_d,i_y),indpol_dp(i_a,i_d,i_y),:),1,Ny)*P;
                end
            end
        end
    end
    dif = max(max(max(abs(V-Vnew))));
    disp([it dif toc])
end
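One thing that may be worth checking before switching libraries: the C++ loop above allocates and fills several Na x Nd temporaries (YY, AA, DD, CC, Val) for every state, and recomputes pow(mg_D_v, 1 - psi) each time although it never changes. The following is only a minimal sketch of the idea in plain C++ without Armadillo, assuming the choice grids are flattened into std::vector. The names best_choice, cash, cost and d_term are hypothetical, and skipping non-positive consumption is an added assumption that is not in the original code. Whether this is the actual bottleneck has not been measured here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Result of the maximisation for one state (a, b, c).
struct Best {
    double value;
    std::size_t index;  // flat, column-major index into the Na x Nd choice grid
};

// cash     : Y(c) + A(a) + (1 - delta) * D(b), one scalar per state
// cost     : mg_A_v / R + mg_D_v, flattened, computed once outside all loops
// d_term   : pow(mg_D_v, 1 - psi), flattened, computed once outside all loops
// v_future : Vfuture, flattened, computed once per outer iteration
Best best_choice(double cash,
                 const std::vector<double>& cost,
                 const std::vector<double>& d_term,
                 const std::vector<double>& v_future,
                 double psi, double sig) {
    assert(cost.size() == d_term.size() && cost.size() == v_future.size());
    const double expo = 1.0 - 1.0 / sig;
    Best best{-std::numeric_limits<double>::infinity(), 0};
    for (std::size_t k = 0; k < cost.size(); ++k) {
        const double cc = cash - cost[k];  // consumption for choice k
        if (cc <= 0.0) continue;           // assumption: treat as infeasible
        const double val = std::pow(std::pow(cc, psi) * d_term[k], expo) / expo + v_future[k];
        if (val > best.value) best = {val, k};  // track the maximum without storing Val
    }
    return best;
}
```

The row and column of the maximiser can then be recovered as index % Na and index / Na, as in the original code.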

Related

Reorganizing nested loops for multithreading

I'm trying to rewrite the main loop in a physics simulation and split the workload between more threads.
It calls dostuff on every unique pair of indices and looks like this:
for (int i = 0; i < n - 1; ++i)
{
    for (int j = i + 1; j < n; ++j)
    {
        dostuff(i, j);
    }
}
I came up with two options:
//#1
//sqrt is implemented as binary search on ints, floors the result
for (int x = 0; x < n * (n - 1) / 2; ++x)
{
    int i = (1 + sqrt(1 + 8 * x)) / 2;
    int j = x - i * (i - 1) / 2;
    dostuff(i, j);
}
//#2
for (int x = 0; x < n * n; ++x)
{
    int i = x % n;
    int j = x / n;
    if (i < j)
        dostuff(i, j);
}
And for each option, there is a corresponding thread loop using a shared atomic counter:
//#1
//the assignment is parenthesized; otherwise x would receive the bool result of the comparison
for (int x; (x = counter.fetch_add(1)) < n * (n - 1) / 2; )
{
    int i = (1 + sqrt(1 + 8 * x)) / 2;
    int j = x - i * (i - 1) / 2;
    dostuff(i, j);
}
//#2
for (int x; (x = counter.fetch_add(1)) < n * n; )
{
    int i = x % n;
    int j = x / n;
    if (i < j)
        dostuff(i, j);
}
My question is, what is the best way to share the workload of the main loop between threads for n < 10^6?
EDIT:
//dostuff
Element& a = elements[i];
Element& b = elements[j];
glm::dvec3 r = b.getPosition() - a.getPosition();
double rv = glm::length(r);
double base = G / (rv * rv);
glm::dvec3 dir = glm::normalize(r);
glm::dvec3 bd = dir * base;
accelerations[i] += bd * b.getMass();
accelerations[j] -= bd * a.getMass();

Your work is a triangle. You want to divide the triangle into k distinct pieces.
If k is a power of 2 you can do this:
a
a a
b c d
b c d d
Each of those regions is equal in size.
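A different way to cut the triangle into k pieces, shown here only as a rough sketch and not as the scheme drawn above, is to number the pairs linearly and hand each thread one contiguous slice of that range. The helper names pair_from_index and chunk_bounds are hypothetical; 64-bit indices are used on the assumption that n * (n - 1) / 2 can exceed the range of int for n near 10^6.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Map a linear index x in [0, n*(n-1)/2) to the unique pair (i, j) with 0 <= j < i.
std::pair<std::int64_t, std::int64_t> pair_from_index(std::int64_t x) {
    assert(x >= 0);
    // Floating-point first guess, then corrected so rounding cannot select a wrong row.
    auto i = static_cast<std::int64_t>(
        (1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(x))) / 2.0);
    while (i * (i - 1) / 2 > x) --i;
    while ((i + 1) * i / 2 <= x) ++i;
    return {i, x - i * (i - 1) / 2};
}

// Half-open range [first, last) of linear indices assigned to thread t out of k.
std::pair<std::int64_t, std::int64_t> chunk_bounds(std::int64_t n, int k, int t) {
    assert(k > 0 && t >= 0 && t < k);
    const std::int64_t total = n * (n - 1) / 2;
    return {total * t / k, total * (t + 1) / k};
}
```

Each thread would loop over its own range and call dostuff(i, j) for the pair returned by pair_from_index. Note that dostuff as posted updates both accelerations[i] and accelerations[j], so two threads can write to the same element; some form of per-thread accumulation (summed afterwards) or synchronization would be needed, which this sketch does not cover.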

Implementing modular Runge-kutta 4th order method for a n-dimension system

I'm trying to make my 4th-order Runge-Kutta code modular. I don't want to have to write and declare the code every time I use it, but rather declare it in a .hpp and a .cpp file and use it separately. But I'm having some problems. In general I want to solve an n-dimensional system of equations. For that I use two functions: one for the system of equations and another for the Runge-Kutta method, as follows:
double F(double t, double x[], int eq)
{
    // System equations
    if (eq == 0) { return (x[1]); }
    else if (eq == 1) { return (gama * sin(OMEGA*t) - zeta * x[1] - alpha * x[0] - beta * pow(x[0], 3) - chi * x[2]); }
    else if (eq == 2) { return (-kappa * x[1] - phi * x[2]); }
    else { return 0; }
}
void rk4(double &t, double x[], double step)
{
    double x_temp1[sistvar], x_temp2[sistvar], x_temp3[sistvar];
    double k1[sistvar], k2[sistvar], k3[sistvar], k4[sistvar];
    int j;
    for (j = 0; j < sistvar; j++)
    {
        x_temp1[j] = x[j] + 0.5*(k1[j] = step * F(t, x, j));
    }
    for (j = 0; j < sistvar; j++)
    {
        x_temp2[j] = x[j] + 0.5*(k2[j] = step * F(t + 0.5 * step, x_temp1, j));
    }
    for (j = 0; j < sistvar; j++)
    {
        x_temp3[j] = x[j] + (k3[j] = step * F(t + 0.5 * step, x_temp2, j));
    }
    for (j = 0; j < sistvar; j++)
    {
        k4[j] = step * F(t + step, x_temp3, j);
    }
    for (j = 0; j < sistvar; j++)
    {
        x[j] += (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j]) / 6.0;
    }
    t += step;
}
The above code works and has been validated. However, it depends on some global variables:
gama, OMEGA, zeta, alpha, beta, chi, kappa and phi are global variables that I want to read from a .txt file. I already managed to do that, but only in a single .cpp file with all the code included.
Also, sistvar is the system dimension and is also a global variable. I'm trying to pass it as an argument to F. But the way the code is written seems to give errors, as sistvar is a const and can't be changed like a variable, and I can't use a variable as an array size.
In addition, the two functions are interdependent: when I call F inside rk4, the equation number eq is needed.
Could you give me tips on how to do that? I have already searched and read books about this and could not find an answer. It is probably an easy task, but I'm relatively new to the C/C++ programming languages.
Thanks in advance!
* EDITED (Tried to implement using std::vector)*
double F(double t, std::vector<double> x, int eq)
{
    // System Equations
    if (eq == 0) { return (x[1]); }
    else if (eq == 1) { return (gama * sin(OMEGA*t) - zeta * x[1] - alpha * x[0] - beta * pow(x[0], 3) - chi * x[2]); }
    else if (eq == 2) { return (-kappa * x[1] - phi * x[2]); }
    else { return 0; }
}
double rk4(double &t, std::vector<double> &x, double step, const int dim)
{
    std::vector<double> x_temp1(dim), x_temp2(dim), x_temp3(dim);
    std::vector<double> k1(dim), k2(dim), k3(dim), k4(dim);
    int j;
    for (j = 0; j < dim; j++) {
        x_temp1[j] = x[j] + 0.5*(k1[j] = step * F(t, x, j));
    }
    for (j = 0; j < dim; j++) {
        x_temp2[j] = x[j] + 0.5*(k2[j] = step * F(t + 0.5 * step, x_temp1, j));
    }
    for (j = 0; j < dim; j++) {
        x_temp3[j] = x[j] + (k3[j] = step * F(t + 0.5 * step, x_temp2, j));
    }
    for (j = 0; j < dim; j++) {
        k4[j] = step * F(t + step, x_temp3, j);
    }
    for (j = 0; j < dim; j++) {
        x[j] += (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j]) / 6.0;
    }
    t += step;
    for (j = 0; j < dim; j++) {
        return x[j];
    }
}
          vector  |  array
         2.434 s  |  0.859 s
         2.443 s  |  0.845 s
         2.314 s  |  0.883 s
         2.418 s  |  0.884 s
         2.505 s  |  0.852 s
         2.428 s  |  0.923 s
         2.097 s  |  0.814 s
         2.266 s  |  0.922 s
         2.133 s  |  0.954 s
         2.266 s  |  0.868 s
         -------  |  -------
average  2.330 s  |  0.880 s

Using a vector-valued function, with the vector arithmetic taken from Eigen3
#include <eigen3/Eigen/Dense>
using namespace Eigen;
an implementation of the same parts as discussed in the question could look like this (inspired by function pointer with Eigen):
VectorXd Func(const double t, const VectorXd& x)
{ // right-hand side of the three-equation system from the question
    Vector3d dxdt;
    dxdt[0] = x[1];
    dxdt[1] = gama * sin(OMEGA*t) - zeta * x[1] - alpha * x[0] - beta * pow(x[0], 3) - chi * x[2];
    dxdt[2] = -kappa * x[1] - phi * x[2];
    return dxdt;
}
MatrixXd RK4(VectorXd Func(double t, const VectorXd& y), const Ref<const VectorXd>& y0, double t, double h, int step_num)
{
    MatrixXd y(y0.rows(), step_num );
    VectorXd k1, k2, k3, k4;
    y.col(0) = y0;
    for (int i=1; i<step_num; i++){
        k1 = Func(t, y.col(i-1));
        k2 = Func(t+0.5*h, y.col(i-1)+0.5*h*k1);
        k3 = Func(t+0.5*h, y.col(i-1)+0.5*h*k2);
        k4 = Func(t+h, y.col(i-1)+h*k3);
        y.col(i) = y.col(i-1) + (k1 + 2*k2 + 2*k3 + k4)*h/6;
        t = t+h;
    }
    return y.transpose();
}
Passing a vector to a function to be filled apparently requires some higher template contemplations in Eigen.
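For readers who would rather stay with the standard library, here is a minimal sketch of a modular step function, assuming the system is passed in as a callable that returns the whole derivative vector at once. The names State, Rhs and rk4_step are hypothetical and not taken from the question; the dimension is simply x.size(), so no global sistvar is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using State = std::vector<double>;
using Rhs = std::function<State(double, const State&)>;

// One classical 4th-order Runge-Kutta step; advances t and x in place.
void rk4_step(const Rhs& f, double& t, State& x, double h) {
    const std::size_t n = x.size();
    State tmp(n);

    const State k1 = f(t, x);
    assert(k1.size() == n);  // the system must return one derivative per variable
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * h * k1[i];

    const State k2 = f(t + 0.5 * h, tmp);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * h * k2[i];

    const State k3 = f(t + 0.5 * h, tmp);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + h * k3[i];

    const State k4 = f(t + h, tmp);
    for (std::size_t i = 0; i < n; ++i)
        x[i] += h * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0;

    t += h;
}
```

With this shape, the declaration of rk4_step could live in the .hpp and its definition in the .cpp. The parameters gama, OMEGA, zeta and so on could be read from the .txt file into a struct and captured by a lambda that is passed as f, instead of being globals.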

if statement runtime error

I originally had 3 equations: Pu, Pm & Pd. It ran fine.
Once I introduced the if statement, with variations on the 3 equations, depending on the loop iteration, I receive a runtime error.
Any help would be appreciated.
Cheers in advance.
#include <cmath>
#include <iostream>
#include <vector>
#include <iomanip>
int Rounding(double x)
{
    int Integer = (int)x;
    double Decimal = x - Integer;
    if (Decimal > 0.49)
    {
        return (Integer + 1);
    }
    else
    {
        return Integer;
    }
}
int main()
{
    double a = 0.1;
    double sigma = 0.01;
    int delta_t = 1;
    double M = -a * delta_t;
    double V = sigma * sigma * delta_t;
    double delta_r = sqrt(3 * V);
    int count;
    double PuValue;
    double PmValue;
    double PdValue;
    int j_max;
    int j_min;
    j_max = Rounding(-0.184 / M);
    j_min = -j_max;
    std::vector<std::vector<double>> Pu((20), std::vector<double>(20));
    std::vector<std::vector<double>> Pm((20), std::vector<double>(20));
    std::vector<std::vector<double>> Pd((20), std::vector<double>(20));
    std::cout << std::setprecision(10);
    for (int i = 0; i <= 2; i++)
    {
        count = 0;
        for (int j = i; j >= -i; j--)
        {
            count = count + 1;
            if (j = j_max) // Exhibit 1C
            {
                PuValue = 7.0/6.0 + (j * j * M * M + 3 * j * M)/2.0;
                PmValue = -1.0/3.0 - j * j * M * M - 2 * j * M;
                PdValue = 1.0/6.0 + (j * j * M * M + j * M)/2.0;
            }
            else if (j = j_min) // Exhibit 1B
            {
                PuValue = 1.0/6.0 + (j * j * M * M - j * M)/2.0;
                PmValue = -1.0/3.0 - j * j * M * M + 2 * j * M;
                PdValue = 7.0/6.0 + (j * j * M * M - 3 * j * M)/2.0;
            }
            else
            {
                PuValue = 1.0/6.0 + (j * j * M * M + j * M)/2.0;
                PmValue = 2.0/3.0 - j * j * M * M;
                PdValue = 1.0/6.0 + (j * j * M * M - j * M)/2.0;
            }
            Pu[count][i] = PuValue;
            Pm[count][i] = PmValue;
            Pd[count][i] = PdValue;
            std::cout << Pu[count][i] << ", ";
        }
        std::cout << std::endl;
    }
    return 0;
}

You are assigning j_max to j in your if statements instead of testing for equality:
if (j = j_max)
//    ^
else if (j = j_min)
//         ^
Change if (j = j_max) to if (j == j_max),
and else if (j = j_min) to else if (j == j_min).

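Correct the following if check, and every other instance of it, from
if(j=j_max)
to
if (j == j_max)
You want to test for equality, not assign.
Your code was going into an infinite loop: the assignment resets j to j_max (which is 2 here) on every pass, so the inner loop condition j >= -i never becomes false, count keeps growing, and Pu[count][i] eventually indexes past the 20 rows of the vector.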
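To make the fix concrete, here is a small sketch that moves the three cases into a function and uses == for the comparisons. The names Probs and branch_probs are hypothetical; the formulas are copied from the question's code, with j * M factored into a local variable.

```cpp
#include <cassert>

struct Probs {
    double up, mid, down;
};

// Branching probabilities for node j, using the three cases from the question.
Probs branch_probs(int j, int j_max, int j_min, double M) {
    assert(j_min <= j && j <= j_max);
    const double jm = j * M;
    if (j == j_max) {         // comparison, not assignment (Exhibit 1C)
        return {7.0 / 6.0 + (jm * jm + 3 * jm) / 2.0,
                -1.0 / 3.0 - jm * jm - 2 * jm,
                1.0 / 6.0 + (jm * jm + jm) / 2.0};
    } else if (j == j_min) {  // Exhibit 1B
        return {1.0 / 6.0 + (jm * jm - jm) / 2.0,
                -1.0 / 3.0 - jm * jm + 2 * jm,
                7.0 / 6.0 + (jm * jm - 3 * jm) / 2.0};
    }
    return {1.0 / 6.0 + (jm * jm + jm) / 2.0,
            2.0 / 3.0 - jm * jm,
            1.0 / 6.0 + (jm * jm - jm) / 2.0};
}
```

In each of the three cases the probabilities sum to 1, which makes a convenient sanity check. Compiling with warnings enabled (for example -Wall in GCC or Clang) usually flags an assignment used as a condition.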

Optimize log entropy calculation in sparse matrix

I have a 3007 x 1644 dimensional matrix of terms and documents. I am trying to assign weights to frequency of terms in each document so I'm using this log entropy formula http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix (See entropy formula in the last row).
I'm successfully doing this but my code is running for >7 minutes.
Here's the code:
int N = mat.cols();
for(int i=1;i<=mat.rows();i++){
double gfi = sum(mat(i,colon()))(1,1); //sum of occurrence of terms
double g =0;
if(gfi != 0){// to avoid divide by zero error
for(int j = 1;j<=N;j++){
double tfij = mat(i,j);
double pij = gfi==0?0.0:tfij/gfi;
pij = pij + 1; //avoid log0
double G = (pij * log(pij))/log(N);
g = g + G;
}
}
double gi = 1 - g;
for(int j=1;j<=N;j++){
double tfij = mat(i,j) + 1;//avoid log0
double aij = gi * log(tfij);
mat(i,j) = aij;
}
}
Anyone have ideas how I can optimize this to make it faster? Oh and mat is a RealSparseMatrix from amlpp matrix library.
UPDATE
Code runs on Linux mint with 4gb RAM and AMD Athlon II dual core
Running time before change: > 7 min
After @Kerek's answer: 4.1 s

Here's a very naive rewrite that removes some redundancies:
int const N = mat.cols();
double const logN = log(N);
for (int i = 1; i <= mat.rows(); ++i)
{
    double const gfi = sum(mat(i, colon()))(1, 1); // sum of occurrence of terms
    double g = 0;
    if (gfi != 0)
    {
        for (int j = 1; j <= N; ++j)
        {
            double const pij = mat(i, j) / gfi + 1;
            g += pij * log(pij);
        }
        g /= logN;
    }
    for (int j = 1; j <= N; ++j)
    {
        mat(i,j) = (1 - g) * log(mat(i, j) + 1);
    }
}
Also make sure that the matrix data structure is sane (e.g. a flat array accessed in strides; not a bunch of dynamically allocated rows).
Also, I think the first + 1 is a bit silly. You know that x -> x * log(x) is continuous at zero with limit zero, so you should write:
double const pij = mat(i, j) / gfi;
if (pij != 0) { g += pij * log(pij); }
In fact, you might even write the first inner for loop like this, avoiding a division when it isn't needed:
for (int j = 1; j <= N; ++j)
{
    if (double pij = mat(i, j))
    {
        pij /= gfi;
        g += pij * log(pij);
    }
}
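Putting the pieces together, the per-row work could be sketched as below. This is a self-contained illustration on a dense std::vector row, not amlpp code: log_entropy_row is a hypothetical name, and the 1 - g combination is kept exactly as in the loops above rather than checked against the linked formula.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Applies the weighting from the loops above to one row of term counts, in place.
void log_entropy_row(std::vector<double>& row) {
    assert(row.size() > 1);  // log(N) must be non-zero
    const double logN = std::log(static_cast<double>(row.size()));

    double gfi = 0.0;  // sum of occurrences in this row
    for (double tf : row) gfi += tf;

    double g = 0.0;
    if (gfi != 0.0) {
        for (double tf : row) {
            if (tf != 0.0) {  // x * log(x) tends to 0 as x -> 0, so zeros are skipped
                const double pij = tf / gfi;
                g += pij * std::log(pij);
            }
        }
        g /= logN;
    }
    for (double& tf : row) tf = (1.0 - g) * std::log(tf + 1.0);
}
```

For a sparse matrix, visiting only the stored non-zero entries of each row would avoid most of the work, since zero entries contribute nothing to g and map to log(1) = 0. How to iterate non-zeros depends on the matrix library and is not shown here.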

Laguerre interpolation algorithm, something's wrong with my implementation

This is a problem I have been struggling with for a week, coming back to it only to give up after wasted hours...
I am supposed to find coefficients for the following Laguerre polynomial:
P0(x) = 1
P1(x) = 1 - x
Pn(x) = ((2n - 1 - x) / n) * P(n-1) - ((n - 1) / n) * P(n-2)
I believe there is an error in my implementation, because for some reason the coefficients I get seem way too big. This is the output the program generates:
a1 = -190.234
a2 = -295.833
a3 = 378.283
a4 = -939.537
a5 = 774.861
a6 = -400.612
Description of code (given below):
If you scroll down the code a little to the part where I declare array, you'll find the given x's and y's.
The function polynomial just fills an array with the values of said polynomial for a certain x. It's a recursive function. I believe it works well, because I have checked the output values.
The gauss function finds the coefficients by performing Gaussian elimination on the output array. I think this is where the problems begin. I am wondering whether there's a mistake in this code, or perhaps my method of verifying the results is bad? I am trying to verify them like this:
-190.234 * 1.5 ^ 5 - 295.833 * 1.5 ^ 4 ... - 400.612 = -3017.817625 =/= 2
Code:
#include "stdafx.h"
#include <conio.h>
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;
double polynomial(int i, int j, double **tab)
{
    double n = i;
    double **array = tab;
    double x = array[j][0];
    if (i == 0) {
        return 1;
    } else if (i == 1) {
        return 1 - x;
    } else {
        double minusone = polynomial(i - 1, j, array);
        double minustwo = polynomial(i - 2, j, array);
        double result = (((2.0 * n) - 1 - x) / n) * minusone - ((n - 1.0) / n) * minustwo;
        return result;
    }
}
int gauss(int n, double tab[6][7], double results[7])
{
    double multiplier, divider;
    for (int m = 0; m <= n; m++)
    {
        for (int i = m + 1; i <= n; i++)
        {
            multiplier = tab[i][m];
            divider = tab[m][m];
            if (divider == 0) {
                return 1;
            }
            for (int j = m; j <= n; j++)
            {
                if (i == n) {
                    break;
                }
                tab[i][j] = (tab[m][j] * multiplier / divider) - tab[i][j];
            }
            for (int j = m; j <= n; j++) {
                tab[i - 1][j] = tab[i - 1][j] / divider;
            }
        }
    }
    double s = 0;
    results[n - 1] = tab[n - 1][n];
    int y = 0;
    for (int i = n-2; i >= 0; i--)
    {
        s = 0;
        y++;
        for (int x = 0; x < n; x++)
        {
            s = s + (tab[i][n - 1 - x] * results[n-(x + 1)]);
            if (y == x + 1) {
                break;
            }
        }
        results[i] = tab[i][n] - s;
    }
}
int _tmain(int argc, _TCHAR* argv[])
{
    int num;
    double **array;
    array = new double*[5];
    for (int i = 0; i <= 5; i++)
    {
        array[i] = new double[2];
    }
    //i 0 1 2 3 4 5
    array[0][0] = 1.5; //xi 1.5 2 2.5 3.5 3.8 4.1
    array[0][1] = 2; //yi 2 5 -1 0.5 3 7
    array[1][0] = 2;
    array[1][1] = 5;
    array[2][0] = 2.5;
    array[2][1] = -1;
    array[3][0] = 3.5;
    array[3][1] = 0.5;
    array[4][0] = 3.8;
    array[4][1] = 3;
    array[5][0] = 4.1;
    array[5][1] = 7;
    double W[6][7]; //n + 1
    for (int i = 0; i <= 5; i++)
    {
        for (int j = 0; j <= 5; j++)
        {
            W[i][j] = polynomial(j, i, array);
        }
        W[i][6] = array[i][1];
    }
    for (int i = 0; i <= 5; i++)
    {
        for (int j = 0; j <= 6; j++)
        {
            cout << W[i][j] << "\t";
        }
        cout << endl;
    }
    double results[6];
    gauss(6, W, results);
    for (int i = 0; i < 6; i++) {
        cout << "a" << i + 1 << " = " << results[i] << endl;
    }
    _getch();
    return 0;
}

I believe your interpretation of the recursive polynomial generation either needs revising or is a bit too clever for me.
given P[0][5] = {1,0,0,0,0,...}; P[1][5]={1,-1,0,0,0,...};
then P[2] is a*P[0] + convolution(P[1], { c, d });
where a = -((n - 1) / n)
c = (2n - 1)/n and d= - 1/n
This can be generalized: P[n] == a*P[n-2] + conv(P[n-1], { c,d });
Every step involves a polynomial multiplication of P[n-1] by (c + d*x), which increases the degree by one (just by one...), followed by adding P[n-2] multiplied by the scalar a.
Then most likely the interpolation factor x is in range [0..1].
(convolution means, that you should implement polynomial multiplication, which luckily is easy...)
       [a, b, c, d]
     *       [e, f]
------------------------
         af, bf, cf, df  +
     ae, be, ce, de,  0  +
------------------------
(= coefficients of the final polynomial)
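The recurrence P[n] == a*P[n-2] + conv(P[n-1], { c, d }) described above can be sketched as follows. This is an illustration only: Poly, multiply and laguerre_coefficients are hypothetical names, and the coefficients are stored lowest degree first, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Poly = std::vector<double>;  // coefficients, lowest degree first

// Polynomial multiplication (the "convolution" of the two coefficient lists).
Poly multiply(const Poly& p, const Poly& q) {
    assert(!p.empty() && !q.empty());
    Poly r(p.size() + q.size() - 1, 0.0);
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = 0; j < q.size(); ++j)
            r[i + j] += p[i] * q[j];
    return r;
}

// Coefficients of P[0..max_n] from P[n] = a*P[n-2] + (c + d*x)*P[n-1].
std::vector<Poly> laguerre_coefficients(std::size_t max_n) {
    std::vector<Poly> P;
    P.push_back({1.0});                         // P0(x) = 1
    if (max_n >= 1) P.push_back({1.0, -1.0});   // P1(x) = 1 - x
    for (std::size_t n = 2; n <= max_n; ++n) {
        const double nn = static_cast<double>(n);
        const double a = -(nn - 1.0) / nn;
        const double c = (2.0 * nn - 1.0) / nn;
        const double d = -1.0 / nn;
        Poly next = multiply(P[n - 1], {c, d}); // degree grows by exactly one
        for (std::size_t k = 0; k < P[n - 2].size(); ++k)
            next[k] += a * P[n - 2][k];
        P.push_back(next);
    }
    return P;
}
```

As a check, this gives P2(x) = 1 - 2x + 0.5x^2, which agrees with applying the recurrence from the question by hand.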

The definition of P1(x) = x - 1 is not implemented as stated. You have 1 - x in the computation.
I did not look any further.
