Eigen3 matrix multiplication performance - c++

Note: I've posted this also on Eigen forum here
I want to premultiply 3xN matrices by a 3x3 matrix, i.e., to transform 3D points, like
p_dest = T * p_source
after initializing the matrices:
Eigen::Matrix<double, 3, Eigen::Dynamic> points = Eigen::Matrix<double, 3, Eigen::Dynamic>::Random(3, NUMCOLS);
Eigen::Matrix<double, 3, Eigen::Dynamic> dest = Eigen::Matrix<double, 3, Eigen::Dynamic>(3, NUMCOLS);
int NT = 100;
I have evaluated these two versions:
// eigen direct multiplication
for (int i = 0; i < NT; i++){
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    dest.noalias() = T * points;
}
and
// col multiplication
for (int i = 0; i < NT; i++){
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    for (int c = 0; c < points.cols(); c++){
        dest.col(c) = T * points.col(c);
    }
}
The NT repetitions are done just to compute an average time.
I am surprised that the column-by-column multiplication is about 4-5 times faster than the direct multiplication
(and the direct multiplication is even slower if I do not use .noalias(), but that is expected, since it then creates a temporary copy).
I've tried to change NUMCOLS from 0 to 1000000 and the relation is linear.
I'm using Visual Studio 2013 and compiling in Release mode.
The next figure shows the number of columns of the matrix on the X axis and the average time for a single operation on the Y axis; the column-by-column multiplication is in blue and the full matrix multiplication in red.
Any suggestion why this happens?

Short answer
You're timing the lazy (and therefore never actually performed) evaluation in the col multiplication version, vs. the actually evaluated product in the direct version.
Long answer
Instead of code snippets, let's look at a full MCVE. First, your version:
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <Eigen/Dense>

using Eigen::Matrix3Xd;

void ColMult(Matrix3Xd& dest, Matrix3Xd& points)
{
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    for (int c = 0; c < points.cols(); c++){
        dest.col(c) = T * points.col(c);
    }
}

void EigenDirect(Matrix3Xd& dest, Matrix3Xd& points)
{
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    dest.noalias() = T * points;
}
int main(int argc, char *argv[])
{
    srand(time(NULL));
    int NUMCOLS = 100000 + rand();

    Matrix3Xd points = Matrix3Xd::Random(3, NUMCOLS);
    Matrix3Xd dest = Matrix3Xd(3, NUMCOLS);
    Matrix3Xd dest2 = Matrix3Xd(3, NUMCOLS);
    int NT = 200;

    // eigen direct multiplication
    auto beg1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < NT; i++)
    {
        EigenDirect(dest, points);
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end1-beg1;

    // col multiplication
    auto beg2 = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < NT; i++)
    {
        ColMult(dest2, points);
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds2 = end2-beg2;

    std::cout << "Direct time: " << elapsed_seconds.count() << "\n";
    std::cout << "Col time: " << elapsed_seconds2.count() << "\n";
    std::cout << "Eigen speedup: " << elapsed_seconds2.count() / elapsed_seconds.count() << "\n\n";
    return 0;
}
With this code (and SSE turned on), I get:
Direct time: 0.449301
Col time: 0.10107
Eigen speedup: 0.224949
That's the same 4-5x slowdown you complained of. Why?! Before we get to the answer, let's modify the code a bit so that the dest matrices are sent to an ostream. Add std::ostream outPut(0); to the beginning of main(), and before stopping each timer add outPut << dest << "\n\n"; and outPut << dest2 << "\n\n"; respectively. The std::ostream outPut(0); doesn't output anything (the badbit is set because it has no stream buffer), but it does cause Eigen's operator<< to be called, which forces the evaluation of the matrix.
NOTE: if we used outPut << dest(1,1) instead, then in the col multiplication method dest would be evaluated only enough to output that single element.
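Spelled out, the modified timing block for the direct version looks roughly like this (a sketch of the modification described above, not the originally posted code):
std::ostream outPut(0);  // no stream buffer attached: nothing is printed, badbit is set

auto beg1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < NT; i++)
{
    EigenDirect(dest, points);
}
outPut << dest << "\n\n";  // forces full evaluation of dest before the timer stops
auto end1 = std::chrono::high_resolution_clock::now();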
We then get
Direct time: 0.447298
Col time: 0.681456
Eigen speedup: 1.52349
which is the expected result. Note that the Eigen direct method took (almost) exactly the same time, meaning the evaluation took place even without the added ostream, whereas the col method suddenly took much longer.
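A related trick, if you want to keep the compiler and Eigen from skipping the work without streaming a whole matrix, is to fold the result into a value that is used after the timing loop. A minimal sketch (mine, not part of the original answer):
double checksum = 0.0;
auto beg2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < NT; i++)
{
    ColMult(dest2, points);
    checksum += dest2.sum();  // consumes every element, so the work cannot be skipped
}
auto end2 = std::chrono::high_resolution_clock::now();
std::cout << "checksum: " << checksum << "\n";  // keep the value observable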

Related

Matrix multiplication using three different methods gives different results, depending on amount of values

I would like to multiply two matrices A and B, and wanted to compare three different methods. One of them is simply iterating over the columns of B and multiplying them by the matrix A, the second one is using the function each_col() from armadillo, and applying a lambda, and the third one is simply the multiplication A * B. The resulting code is shown below:
#include <complex>
#include <iostream>
#include <chrono>
#include <armadillo>

constexpr int num_values = 2048;
constexpr int num_rows = 128;
constexpr int num_cols = num_values / num_rows;
constexpr int bench_rounds = 100;

void test_multiply_loop(const arma::mat &in_mat,
                        const arma::mat &init_mat,
                        arma::mat &out_mat) {
    for(size_t i = 0; i < in_mat.n_cols; ++i) {
        out_mat.col(i) = init_mat * in_mat.col(i);
    }
}

void test_multiply_matrix(const arma::mat &in_mat,
                          const arma::mat &init_mat,
                          arma::mat &out_mat) {
    out_mat = init_mat * in_mat;
}

void test_multiply_lambda(const arma::mat &in_mat,
                          const arma::mat &init_mat,
                          arma::mat &out_mat) {
    out_mat = in_mat;
    out_mat.each_col([init_mat](arma::colvec &a) {
        a = init_mat * a;
    });
}
int main()
{
    std::cout << "Hello World" << "\n";

    //Create matrix
    arma::colvec test_vec = arma::linspace(1, num_values, num_values);
    arma::mat init_mat = arma::reshape(test_vec, num_rows, num_cols);
    arma::mat out_mat_loop = arma::zeros(num_rows, num_cols),
              out_mat_lambda = arma::zeros(num_rows, num_cols),
              out_mat_matrix = arma::zeros(num_rows, num_cols);
    arma::mat test_mat = arma::eye(num_rows, num_rows);
    for(size_t i = 0; i < num_rows; ++i)
        for(size_t j = 0; j < num_rows; ++j)
            test_mat(i, j) *= (i + 1);

    auto t1 = std::chrono::high_resolution_clock::now();
    for(size_t i = 0; i < bench_rounds; ++i)
        test_multiply_loop(init_mat, test_mat, out_mat_loop);
    auto t2 = std::chrono::high_resolution_clock::now();

    auto t3 = std::chrono::high_resolution_clock::now();
    for(size_t i = 0; i < bench_rounds; ++i)
        test_multiply_lambda(init_mat, test_mat, out_mat_lambda);
    auto t4 = std::chrono::high_resolution_clock::now();

    auto t5 = std::chrono::high_resolution_clock::now();
    for(size_t i = 0; i < bench_rounds; ++i)
        test_multiply_matrix(init_mat, test_mat, out_mat_matrix);
    auto t6 = std::chrono::high_resolution_clock::now();

    std::cout << "Multiplication by loop:\t\t" << std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count() << '\n';
    std::cout << "Multiplication by lambda:\t" << std::chrono::duration_cast<std::chrono::microseconds>( t4 - t3 ).count() << '\n';
    std::cout << "Multiplication by internal:\t" << std::chrono::duration_cast<std::chrono::microseconds>( t6 - t5 ).count() << '\n';
    std::cout << "Loop and matrix are equal:\t" << arma::approx_equal(out_mat_loop, out_mat_matrix, "reldiff", 0.1) << '\n';
    std::cout << "Loop and lambda are equal:\t" << arma::approx_equal(out_mat_loop, out_mat_lambda, "reldiff", 0.1) << '\n';
    std::cout << "Matrix and lambda are equal:\t" << arma::approx_equal(out_mat_matrix, out_mat_lambda, "reldiff", 0.1) << '\n';
    return 0;
}
Now, for num_rows = 128 my output is
Multiplication by loop: 124525
Multiplication by lambda: 46690
Multiplication by internal: 1270
Loop and matrix are equal: 0
Loop and lambda are equal: 0
Matrix and lambda are equal: 0
but for num_rows = 64 my output is
Multiplication by loop: 32305
Multiplication by lambda: 6517
Multiplication by internal: 56344
Loop and matrix are equal: 1
Loop and lambda are equal: 1
Matrix and lambda are equal: 1
Why is the output so different when increasing the amount of columns? And why is the timing of the functions changing so much?
The three functions are indeed doing the same thing and the result should be the same, except for precision differences which should not matter since you compare the results with arma::approx_equal.
On my machine the output was correct for both sizes you mention, and for other, higher values that I tried. I could not reproduce the problem.
For reference, I tried with armadillo 9.870.2, linked with openblas and lapack.
How did you install armadillo?
Armadillo uses blas and lapack for much of its functionality. For matrix multiplication it is using some blas implementation. There are several implementations of blas, such as openblas, mkl, or even cublas (for running on the GPU), etc.
Armadillo can also work without a blas implementation, in which case it uses its own (slower) implementation of matrix multiplication. I haven't tried using it with its own implementation, without linking against blas.
Another point that might be related is that depending on the blas implementation the matrix multiplication might use multiple threads, but usually only for large matrices, since using multiple threads for small matrices would hurt performance. That is, the code path used to perform the multiplication could be different depending on the matrix size (but of course it would be a bug if both code paths do not produce the same answer).
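If you are not sure how your armadillo installation was configured, one quick check (a sketch of my own, not part of the answer above) is to look at the configuration macros armadillo defines when a BLAS/LAPACK backend was enabled at install time:
#include <armadillo>
#include <iostream>

int main()
{
#ifdef ARMA_USE_BLAS
    std::cout << "armadillo was configured to use an external BLAS\n";
#else
    std::cout << "armadillo falls back to its own matrix multiplication\n";
#endif
#ifdef ARMA_USE_LAPACK
    std::cout << "armadillo was configured to use LAPACK\n";
#endif
    return 0;
}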

Most efficient way to clear an image in opencv2/3

I'm currently porting my old OpenCV C code to the C++ interface of OpenCV 2/3 and I'm not quite sure about some equivalents of the old functions. Pretty early on I ran into an issue with cvZero. The only possibility I found was to set the matrix content via Mat::setTo. Now, since it has to handle multi-channel scalars and different data types, setTo iterates through all elements of the matrix and sets them one after another, while cvZero basically did a memset. I am wondering what the recommended way is when using the C++ interface, in case I just want to clear my image to black.
Thanks!
yourMat = cv::Mat::zeros(yourMat.size(), yourMat.type()) does not seem to allocate new memory but only overwrites the existing Mat object (memory was previously allocated, otherwise .size would be 0). I'm not sure whether memset is used internally, but the sample code below gives about 50% longer processing time for the version with .setTo compared to the version with cv::Mat::zeros. I didn't measure the overhead of the pixel manipulation separately, but it should be nearly identical in both versions.
#include <opencv2/opencv.hpp>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

int main(int argc, char* argv[])
{
    cv::Mat input = cv::imread("C:/StackOverflow/Input/Lenna.png");
    srand(time(NULL));

    cv::Mat a = input;
    cv::Mat b = input;
    cv::imshow("original", a);

    b = cv::Mat::zeros(a.size(), a.type());

    std::vector<int> randX;
    std::vector<int> randY;
    std::vector<cv::Vec3b> randC;

    int n = 500000;
    randX.resize(n);
    randY.resize(n);
    randC.resize(n);
    for (unsigned int i = 0; i < n; ++i)
    {
        randX[i] = rand() % input.cols;
        randY[i] = rand() % input.rows;
        randC[i] = cv::Vec3b(rand()%255, rand()%255, rand()%255);
    }

    clock_t start1 = clock();
    for (unsigned int i = 0; i < randX.size(); ++i)
    {
        b.at<cv::Vec3b>(randY[i], randX[i]) = randC[i];
        b = cv::Mat::zeros(b.size(), b.type());
    }
    clock_t end1 = clock();

    clock_t start2 = clock();
    for (unsigned int i = 0; i < randX.size(); ++i)
    {
        b.at<cv::Vec3b>(randY[i], randX[i]) = randC[i];
        b.setTo( cv::Scalar(0, 0, 0));
    }
    clock_t end2 = clock();

    std::cout << "time1 = " << ( (end1 - start1) / CLOCKS_PER_SEC ) << " seconds" << std::endl;
    std::cout << "time2 = " << ((end2 - start2) / CLOCKS_PER_SEC) << " seconds" << std::endl;

    cv::imshow("a", a);
    cv::imshow("b", b);
    cv::waitKey(0);
    return 0;
}
gives me output:
time1 = 14 seconds
time2 = 21 seconds
on my machine (Release mode) (no IPP).
and a black image for both a and b, which indicates that no new memory was allocated, but the existing Mat memory was used.
int n = 250000; will produce output
time1 = 6 seconds
time2 = 10 seconds
This is not an answer about whether or not memset is used internally, or whether it is as fast as cvZero, but at least you now know how to set an image to zero faster than with .setTo.
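If you want to get as close to the old cvZero behaviour as possible, and your Mat is continuous in memory, you could also clear it with a plain memset yourself. A small sketch (my addition, not part of the answer above):
#include <cstring>

// assumes b is a cv::Mat that should be cleared to all zeros
if (b.isContinuous())
{
    std::memset(b.data, 0, b.total() * b.elemSize());  // one flat write over the whole buffer
}
else
{
    b.setTo(cv::Scalar::all(0));  // fallback for non-continuous matrices
}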

C++ Advice on manipulating output Matrix data

I have the following code.
Essentially it is creating N random normal variables, and running through an equation M times for a simulation.
The output should be an NxM matrix of data; however, the only way I could do the calculation has the output as MxN, i.e., each of the M runs should be a column, not a row.
I have attempted in vain to follow some of the other suggestions that have been posted on previous similar topics.
Code:
#include <iostream>
#include <time.h>
#include <random>

int main()
{
    double T = 1;       // End time period for simulation
    int N = 4;          // Number of time steps
    int M = 2;          // Number of simulations
    double x0 = 1.00;   // Starting x value
    double mu = 0.00;   // mu(x,t) value
    double sig = 1.00;  // sigma(x,t) value
    double dt = T/N;
    double sqrt_dt = sqrt(dt);

    double** SDE_X = new double*[M]; // SDE Matrix setup

    // Random Number generation setup
    double RAND_N;
    srand ((unsigned int) time(NULL)); // Generator loop reset
    std::default_random_engine generator (rand());
    std::normal_distribution<double> distribution (0.0,1.0); // Mean = 0.0, Variance = 1.0 ie Normal

    for (int i = 0; i < M; i++)
    {
        SDE_X[i] = new double[N];
        for (int j=0; j < N; j++)
        {
            RAND_N = distribution(generator);
            SDE_X[i][0] = x0;
            SDE_X[i][j+1] = SDE_X[i][j] + mu * dt + sig * RAND_N * sqrt_dt; // The SDE we wish to plot the path for
            std::cout << SDE_X[i][j] << " ";
        }
        std::cout << std::endl;
    }

    std::cout << std::endl;
    std::cout << " The simulation is complete!!" << std::endl;
    std::cout << std::endl;
    system("pause");
    return 0;
}
Well, why can't you just create the transpose of your SDE_X matrix then? Isn't that what you want to get?
Keep in mind that presentation has nothing to do with implementation. Whether to access columns or rows is your decision, so you want an implementation of it transposed. Then, quick and dirty: create your matrix first, and then create your number series; swap i and j, and N and M.
I say quick and dirty because the program as a whole is problematic:
Why don't you keep it simple and use a better data structure for your matrix? If you know the size at compile time, use a fixed-size array; otherwise use dynamic vectors at runtime. Maybe there are nicer implementations of a 2D array.
There is a bug, I think: you allocate N doubles per row but access indices 0 to N inclusive.
In every iteration you set index 0 to x0, which is also needless.
I would change your code a bit to make it clearer (a sketch of this follows below):
create your matrix first
initialize the first value of each path
provide an algorithm function that computes a target cell from the matrix and the parameters
go through each cell and invoke your function for that cell
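A minimal sketch of that restructuring (the names and layout here are my illustration, not the poster's final code): store the paths so that each of the M simulations is a column and each of the N+1 time steps is a row.
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    const double T = 1.0;   // End time period for simulation
    const int N = 4;        // Number of time steps
    const int M = 2;        // Number of simulations
    const double x0 = 1.0, mu = 0.0, sig = 1.0;
    const double dt = T / N;
    const double sqrt_dt = std::sqrt(dt);

    // (N+1) rows (time steps, including t = 0) by M columns (simulations)
    std::vector<std::vector<double>> X(N + 1, std::vector<double>(M, x0));

    std::mt19937 generator{std::random_device{}()};
    std::normal_distribution<double> distribution(0.0, 1.0);

    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            X[n + 1][m] = X[n][m] + mu * dt + sig * distribution(generator) * sqrt_dt;

    // Printing row by row now gives one time step per line, one simulation per column
    for (int n = 0; n <= N; ++n)
    {
        for (int m = 0; m < M; ++m)
            std::cout << X[n][m] << (m + 1 < M ? ", " : "\n");
    }
    return 0;
}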
Thank you all for your input. I was able to implement my code and have it displayed as needed.
I added a second for loop to rearrange the matrix rows and columns.
Please feel free to let me know if you think there is any way I can improve it.
#include <iostream>
#include <time.h>
#include <random>
#include <vector>

int main()
{
    double T = 1;       // End time period for simulation
    int N = 3;          // Number of time steps
    int M = 2;          // Number of simulations
    int X = 100;        // Max number of matrix columns
    int Y = 100;        // Max number of matrix rows
    double x0 = 1.00;   // Starting x value
    double mu = 0.00;   // mu(x,t) value
    double sig = 1.00;  // sigma(x,t) value
    double dt = T/N;
    double sqrt_dt = sqrt(dt);

    std::vector<std::vector<double>> SDE_X((M*N), std::vector<double>((M*N))); // SDE Matrix setup

    // Random Number generation setup
    double RAND_N;
    srand ((unsigned int) time(NULL)); // Generator loop reset
    std::default_random_engine generator (rand());
    std::normal_distribution<double> distribution (0.0,1.0); // Mean = 0.0, Variance = 1.0 ie Normal

    for (int i = 0; i <= M; i++)
    {
        SDE_X[i][0] = x0;
        for (int j=0; j <= N; j++)
        {
            RAND_N = distribution(generator);
            SDE_X[i][j+1] = SDE_X[i][j] + mu * dt + sig * RAND_N * sqrt_dt; // The SDE we wish to plot the path for
        }
    }

    for (int j = 0; j <= N; j++)
    {
        for (int i = 0; i <= M; i++)
        {
            std::cout << SDE_X[i][j] << ", ";
        }
        std::cout << std::endl;
    }

    std::cout << std::endl;
    std::cout << " The simulation is complete!!" << std::endl;
    std::cout << std::endl;
    system("pause");
    return 0;
}

Why does MATLAB/Octave wipe the floor with C++ in Eigenvalue Problems?

I'm hoping that the answer to the question in the title is that I'm doing something stupid!
Here is the problem. I want to compute all the eigenvalues and eigenvectors of a real, symmetric matrix. I have implemented code in MATLAB (actually, I run it using Octave), and C++, using the GNU Scientific Library. I am providing my full code below for both implementations.
As far as I can understand, GSL comes with its own implementation of the BLAS API (hereafter I refer to this as GSLCBLAS), and to use this library I compile using:
g++ -O3 -lgsl -lgslcblas
GSL suggests here to use an alternative BLAS library, such as the self-optimizing ATLAS library, for improved performance. I am running Ubuntu 12.04, and have installed the ATLAS packages from the Ubuntu repository. In this case, I compile using:
g++ -O3 -lgsl -lcblas -latlas -lm
For all three cases, I have performed experiments with randomly-generated matrices of sizes 100 to 1000 in steps of 100. For each size, I perform 10 eigendecompositions with different matrices, and average the time taken. The results are these:
The difference in performance is ridiculous. For a matrix of size 1000, Octave performs the decomposition in under a second; GSLCBLAS and ATLAS take around 25 seconds.
I suspect that I may be using the ATLAS library incorrectly. Any explanations are welcome; thanks in advance.
Some notes on the code:
In the C++ implementation, there is no need to make the matrix symmetric, because the function only uses the lower triangular part of it.
In Octave, the line triu(A) + triu(A, 1)' enforces the matrix to be symmetric.
If you wish to compile the C++ code on your own Linux machine, you also need to add the flag -lrt, because of the clock_gettime function.
Unfortunately I don't think clock_gettime exists on other platforms. Consider changing it to gettimeofday.
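As a portable alternative to clock_gettime/gettimeofday, the timing could also be done with std::chrono; a small sketch of how the timed region might look (my suggestion, not part of the original code):
#include <chrono>

auto start = std::chrono::steady_clock::now();
gsl_eigen_symmv(A, Eigenvalues, Eigenvectors, EigendecompositionWorkspace);
auto end = std::chrono::steady_clock::now();
double TimeElapsed = std::chrono::duration<double>(end - start).count();  // seconds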
Octave Code
K = 10;
fileID = fopen('octave_out.txt','w');
for N = 100:100:1000
    AverageTime = 0.0;
    for k = 1:K
        A = randn(N, N);
        A = triu(A) + triu(A, 1)';
        tic;
        eig(A);
        AverageTime = AverageTime + toc/K;
    end
    disp([num2str(N), " ", num2str(AverageTime), "\n"]);
    fprintf(fileID, '%d %f\n', N, AverageTime);
end
fclose(fileID);
C++ Code
#include <iostream>
#include <fstream>
#include <time.h>

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_eigen.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>

int main()
{
    const int K = 10;

    gsl_rng * RandomNumberGenerator = gsl_rng_alloc(gsl_rng_default);
    gsl_rng_set(RandomNumberGenerator, 0);

    std::ofstream OutputFile("atlas.txt", std::ios::trunc);

    for (int N = 100; N <= 1000; N += 100)
    {
        gsl_matrix* A = gsl_matrix_alloc(N, N);
        gsl_eigen_symmv_workspace* EigendecompositionWorkspace = gsl_eigen_symmv_alloc(N);
        gsl_vector* Eigenvalues = gsl_vector_alloc(N);
        gsl_matrix* Eigenvectors = gsl_matrix_alloc(N, N);

        double AverageTime = 0.0;
        for (int k = 0; k < K; k++)
        {
            for (int i = 0; i < N; i++)
            {
                for (int j = 0; j < N; j++)
                {
                    gsl_matrix_set(A, i, j, gsl_ran_gaussian(RandomNumberGenerator, 1.0));
                }
            }

            timespec start, end;
            clock_gettime(CLOCK_MONOTONIC_RAW, &start);
            gsl_eigen_symmv(A, Eigenvalues, Eigenvectors, EigendecompositionWorkspace);
            clock_gettime(CLOCK_MONOTONIC_RAW, &end);

            double TimeElapsed = (double) ((1e9*end.tv_sec + end.tv_nsec) - (1e9*start.tv_sec + start.tv_nsec))/1.0e9;
            AverageTime += TimeElapsed/K;
            std::cout << "N = " << N << ", k = " << k << ", Time = " << TimeElapsed << std::endl;
        }
        OutputFile << N << " " << AverageTime << std::endl;

        gsl_matrix_free(A);
        gsl_eigen_symmv_free(EigendecompositionWorkspace);
        gsl_vector_free(Eigenvalues);
        gsl_matrix_free(Eigenvectors);
    }
    return 0;
}
I disagree with the previous post. This is not a threading issue; this is an algorithm issue. The reason matlab, R, and octave wipe the floor with the C++ libraries is that they use more sophisticated, better algorithms. If you read the octave page you can find out what they do[1]:
Eigenvalues are computed in a several step process which begins with a Hessenberg decomposition, followed by a Schur decomposition, from which the eigenvalues are apparent. The eigenvectors, when desired, are computed by further manipulations of the Schur decomposition.
Solving eigenvalue/eigenvector problems is non-trivial. In fact, it's one of the few things "Numerical Recipes in C" recommends you don't implement yourself (p. 461). GSL is often slow, which was my initial response. ALGLIB is also slow with its standard implementation (I'm getting about 12 seconds!):
#include <iostream>
#include <iomanip>
#include <ctime>

#include <linalg.h>

using std::cout;
using std::setw;
using std::endl;

const int VERBOSE = false;

int main(int argc, char** argv)
{
    int size = 0;
    if(argc != 2) {
        cout << "Please provide a size of input" << endl;
        return -1;
    } else {
        size = atoi(argv[1]);
        cout << "Array Size: " << size << endl;
    }

    alglib::real_2d_array mat;
    alglib::hqrndstate state;
    alglib::hqrndrandomize(state);
    mat.setlength(size, size);
    for(int rr = 0 ; rr < mat.rows(); rr++) {
        for(int cc = 0 ; cc < mat.cols(); cc++) {
            mat[rr][cc] = mat[cc][rr] = alglib::hqrndnormal(state);
        }
    }

    if(VERBOSE) {
        cout << "Matrix: " << endl;
        for(int rr = 0 ; rr < mat.rows(); rr++) {
            for(int cc = 0 ; cc < mat.cols(); cc++) {
                cout << setw(10) << mat[rr][cc];
            }
            cout << endl;
        }
        cout << endl;
    }

    alglib::real_1d_array d;
    alglib::real_2d_array z;

    auto t = clock();
    alglib::smatrixevd(mat, mat.rows(), 1, 0, d, z);
    t = clock() - t;
    cout << (double)t/CLOCKS_PER_SEC << "s" << endl;

    if(VERBOSE) {
        for(int cc = 0 ; cc < mat.cols(); cc++) {
            cout << "lambda: " << d[cc] << endl;
            cout << "V: ";
            for(int rr = 0 ; rr < mat.rows(); rr++) {
                cout << setw(10) << z[rr][cc];
            }
            cout << endl;
        }
    }
}
If you really need a fast library, you'll probably need to do some real hunting.
[1] http://www.gnu.org/software/octave/doc/interpreter/Basic-Matrix-Functions.html
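For what it's worth, one library worth including in such a hunt (my sketch, not part of the answer above) is Eigen, whose SelfAdjointEigenSolver computes both eigenvalues and eigenvectors of a symmetric matrix; whether it beats GSL on a given machine is something you would have to measure yourself.
#include <chrono>
#include <iostream>
#include <Eigen/Dense>

int main()
{
    const int N = 1000;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(N, N);
    A = (A + A.transpose()).eval();  // make the matrix symmetric

    auto start = std::chrono::steady_clock::now();
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> solver(A);  // eigenvalues + eigenvectors
    auto end = std::chrono::steady_clock::now();

    std::cout << "smallest eigenvalue: " << solver.eigenvalues()(0) << "\n"
              << "time: " << std::chrono::duration<double>(end - start).count() << " s\n";
    return 0;
}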
I have also encountered this problem. The real cause is that eig() in Matlab doesn't calculate the eigenvectors, but the C version of the code above does. The difference in time spent can be larger than one order of magnitude, as shown in the figure below. So the comparison is not fair.
In Matlab, the actual function called differs depending on the requested return values. To force the calculation of eigenvectors, [V,D] = eig(A) should be used (see the code below).
The actual time to solve an eigenvalue problem depends heavily on the matrix properties and the desired results, such as:
Real or complex
Hermitian/Symmetric or not
Dense or sparse
Eigenvalues only, Eigenvectors, Maximum eigenvalue only, etc
Serial or parallel
There are algorithms optimized for each of the above cases. In GSL, these algorithms are picked manually, so a wrong selection will decrease performance significantly. Some C++ wrapper classes, and languages such as Matlab and Mathematica, choose the optimized version automatically.
Also, Matlab and Mathematica use parallelization, which further broadens the gap you see by a few times, depending on the machine. Without parallelization, it is reasonable to say that calculating only the eigenvalues of a general complex 1000x1000 matrix takes about a second, and calculating the eigenvalues together with the eigenvectors takes about ten seconds.
Fig. Comparison of Matlab and C. The "+ vec" label means the code included the calculation of the eigenvectors. The CPU% is a very rough observation of CPU usage at N=1000, which is upper-bounded by 800%, although the runs are supposed to fully use all 8 cores. The gap between Matlab and C is smaller than a factor of 8.
Fig. Comparison of different matrix types in Mathematica. Algorithms are picked automatically by the program.
Matlab (WITH the calculation of eigenvectors)
K = 10;
fileID = fopen('octave_out.txt','w');
for N = 100:100:1000
    AverageTime = 0.0;
    for k = 1:K
        A = randn(N, N);
        A = triu(A) + triu(A, 1)';
        tic;
        [V,D] = eig(A);
        AverageTime = AverageTime + toc/K;
    end
    disp([num2str(N), ' ', num2str(AverageTime), '\n']);
    fprintf(fileID, '%d %f\n', N, AverageTime);
end
fclose(fileID);
C++ (WITHOUT the calculation of eigenvectors)
#include <iostream>
#include <fstream>
#include <time.h>

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_eigen.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>

int main()
{
    const int K = 10;

    gsl_rng * RandomNumberGenerator = gsl_rng_alloc(gsl_rng_default);
    gsl_rng_set(RandomNumberGenerator, 0);

    std::ofstream OutputFile("atlas.txt", std::ios::trunc);

    for (int N = 100; N <= 1000; N += 100)
    {
        gsl_matrix* A = gsl_matrix_alloc(N, N);
        gsl_eigen_symm_workspace* EigendecompositionWorkspace = gsl_eigen_symm_alloc(N);
        gsl_vector* Eigenvalues = gsl_vector_alloc(N);

        double AverageTime = 0.0;
        for (int k = 0; k < K; k++)
        {
            for (int i = 0; i < N; i++)
            {
                for (int j = i; j < N; j++)
                {
                    double rn = gsl_ran_gaussian(RandomNumberGenerator, 1.0);
                    gsl_matrix_set(A, i, j, rn);
                    gsl_matrix_set(A, j, i, rn);
                }
            }

            timespec start, end;
            clock_gettime(CLOCK_MONOTONIC_RAW, &start);
            gsl_eigen_symm(A, Eigenvalues, EigendecompositionWorkspace);
            clock_gettime(CLOCK_MONOTONIC_RAW, &end);

            double TimeElapsed = (double) ((1e9*end.tv_sec + end.tv_nsec) - (1e9*start.tv_sec + start.tv_nsec))/1.0e9;
            AverageTime += TimeElapsed/K;
            std::cout << "N = " << N << ", k = " << k << ", Time = " << TimeElapsed << std::endl;
        }
        OutputFile << N << " " << AverageTime << std::endl;

        gsl_matrix_free(A);
        gsl_eigen_symm_free(EigendecompositionWorkspace);
        gsl_vector_free(Eigenvalues);
    }
    return 0;
}
Mathematica
(* Symmetric real matrix + eigenvectors *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
AbsoluteTiming[Eigensystem[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Symmetric real matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Asymmetric real matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Hermitian matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[] + I Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Random complex matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[] + I Random[], {i, NN}, {j, NN}];
AbsoluteTiming[Eigenvalues[M]][[1]]
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
In the C++ implementation, there is no need to make the matrix symmetric, because the function only uses the lower triangular part of it.
This may not be the case. In the reference, it is stated that:
int gsl_eigen_symmv(gsl_matrix *A,gsl_vector *eval, gsl_matrix *evec, gsl_eigen_symmv_workspace * w)
This function computes the eigenvalues and eigenvectors of the real symmetric matrix
A. Additional workspace of the appropriate size must be provided in w.
The diagonal and lower triangular part of A are destroyed during the
computation, but the strict upper triangular part is not referenced.
The eigenvalues are stored in the vector eval and are unordered. The
corresponding eigenvectors are stored in the columns of the matrix
evec. For example, the eigenvector in the first column corresponds to
the first eigenvalue. The eigenvectors are guaranteed to be mutually
orthogonal and normalised to unit magnitude.
It seems that you also need to apply a similar symmetrization operation in the C++ code in order to get correct results, although doing so will not change the performance.
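A small sketch of one way to do that symmetrization in the questioner's C++ code, before calling gsl_eigen_symmv (my illustration, not from the answer): mirror the lower triangle into the strict upper triangle.
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < i; j++)
    {
        // copy the lower-triangular entry A(i,j) into its mirror position A(j,i)
        gsl_matrix_set(A, j, i, gsl_matrix_get(A, i, j));
    }
}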
On the MATLAB side, eigenvalue decomposition may be faster due to its multi-threaded execution, as stated in this reference:
Built-in Multithreading
Linear algebra and numerical functions such as fft, \ (mldivide), eig,
svd, and sort are multithreaded in MATLAB. Multithreaded computations
have been on by default in MATLAB since Release 2008a. These
functions automatically execute on multiple computational threads in a
single MATLAB session, allowing them to execute faster on
multicore-enabled machines. Additionally, many functions in Image
Processing Toolbox™ are multithreaded.
In order to test MATLAB's single-core performance, you can disable multithreading via File > Preferences > General > Multithreading in R2007a or newer, as stated here.

Red-Black Gauss Seidel and OpenMP

I was trying to prove a point with OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it was to do some high performance in OpenMP.
The Gauss-Seidel iteration is split into two separate runs, such that in each sweep every operation can be performed in any order, and there should be no dependency between each task. So in theory each processor should never have to wait for another process to perform any kind of synchronization.
The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower.
For many other parallelized linear algebra routines I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that the operations I perform on the array are thread-safe, so that it cannot be really effective.
See the example below.
Anyone has any clue on how to make this more effective with OpenMP?
void redBlackSmooth(std::vector<double> const & b,
                    std::vector<double> & x,
                    double h)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    int const N = static_cast<int>(x.size());

    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    // Red sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 1; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }

    // Black sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 2; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }
}
Addition:
I have now also tried a raw-pointer implementation, and it shows the same behavior as the version using the STL container, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.
First of all, make sure that the x vector is aligned to cache boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of memory:
double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
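On a C++17 compiler the same cache-line alignment can be requested portably with std::aligned_alloc instead of posix_memalign; a minimal sketch (my addition, with an assumed 64-byte cache line):
#include <cstdlib>

const std::size_t CACHE_LINE_SIZE = 64;  // typical x86 cache line; adjust for your CPU
std::size_t bytes = sizeof(double) * N;
bytes = (bytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE * CACHE_LINE_SIZE;  // size must be a multiple of the alignment

double* x = static_cast<double*>(std::aligned_alloc(CACHE_LINE_SIZE, bytes));
// ... run redBlackSmooth on x ...
std::free(x);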
Second, you can try to assign more computation to each thread (in this way you can keep cache-lines separated), but I suspect that openmp already does something like this under the hood, so it may be worthless with large N.
In my case this implementation is much faster when x is not cache-aligned.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); // Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base, k, i;

#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
    base = j * workGroupSize;
    for (int k = 0; k < workGroupSize; k+=2)
    {
        i = base + k + (redSweep ? 1 : 0);
        if ( i == 0 || i+1 == N) continue;
        sigma = -invh2* ( x[i-1] + x[i+1] );
        x[i] = ( h2/2.0 ) * ( b[i] - sigma );
    }
}
In conclusion, you definitely have a problem of cache-fighting, but given the way openmp works (sadly I am not familiar with it) it should be enough to work with properly allocated buffers.
I think the main problem is the type of array structure you are using. Let's compare results with vectors and arrays (arrays = C-arrays allocated with the new operator).
Vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to keep the runtime above 0.1 seconds.
Vector Time: 0.121007 Repeat: 1 MLUPS: 82.6399
Array Time: 0.164009 Repeat: 2 MLUPS: 121.945
MLUPS = ((N-2)*repeat/runtime)/1000000 (Million Lattice Points Update per second)
MFLOPS are misleading when it comes to grid calculations: a few changes in the basic equation can suggest high performance for the same runtime.
The modified code:
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    double runtime(0.0), wcs, wce;
    int repeat = 1;
    timing(&wcs);

    for(; runtime < 0.1; repeat*=2)
    {
        for(int r = 0; r < repeat; ++r)
        {
            // Red sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 1; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }

            // Black sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 2; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }
            // cout << "In Array: " << r << endl;
        }
        if(x[0] != 0) dummy(x[0]);
        timing(&wce);
        runtime = (wce-wcs);
    }
    // cout << "Before division: " << repeat << endl;
    repeat /= 2;

    cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
         << "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;

    return runtime;
}
I didn't change anything in the code except the array type. For better cache access and blocking you should look into data alignment (_mm_malloc).