How to speed up LU decomposition in Eigen C++? - c++

I am new to c++ and the Eigen library. I want to perform LU decomposition (partial pivoting) on a matrix of size 1815 X 1815, with complex entries. However, the performance of my code is bad, the LU decomposition is taking 77.2852 seconds, compared to MATLAB taking only 0.140946 seconds. Please find the attached code. Any advice on how I can improve the code? Please note that in the first part of the code, I am importing the matrix from a file with entries: a + bi, where a and b are complex numbers. The matrix file was generated from MATLAB. Thank you.
#include <iostream>
#include <Eigen/Dense>
#include <fstream>
#include <complex>
#include <string>
#include <chrono>
using namespace std;
using namespace std::chrono;
using namespace Eigen;
int main(){
int mat_sz = 1815; // size of matrix
MatrixXcd c_mat(mat_sz,mat_sz); // initialize eigen matrix
double re, im;
char sign;
string entry;
ifstream myFile("A_mat"); // format of entries : a + bi. 'a' and 'b' are complex numbers
//Import and assign matrix to an Eigen matrix
for (int i = 0; i < mat_sz; i++){
for (int j = 0; j < mat_sz; j++){
myFile >> entry;
stringstream stream(entry);
stream >> re >> sign >> im;
c_mat(i,j) = {re, (sign == '-') ? -im : im}; // Assigning matrix entries
}
}
// LU Decomposition
auto start = high_resolution_clock::now();
c_mat.partialPivLu(); // Solving equation through partial LU decomposition
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
double million = 1000000;
cout << "Time taken by function: " << duration.count()/million << " seconds" << endl;
}

I'll summarize the comments into an answer.
When you feel that Eigen is running slow there are a list of things that should be verified.
Are optimizations turned on?
Eigen is a template heavy library that does a lot of compile time checks and that should be optimized out. If optimizations are not on, none of it gets inlined and many pointless function calls are made. Turning on even the lowest level of optimizations usually alleviates most of this (-O1 or higher in gcc/clang, /O1 or higher in MSVC). General notes on optimizations can be found here.
Am I utilizing all the hardware options?
A lot of code in Eigen can be vectorized if allowed. Make sure that this is enabled with flags turning on SSE/AVX/etc. if the target hardware supports it. Enable FMA if available as well. There's a placeholder doc here.
Enable multithreading
If your process/hardware allow, consider enabling OpenMP to allow Eigen to utilize multiple cores for some of the operations.
Use the right precision
In many applications, only the first few digits matter. If this is the case in your application, consider using single precision instead of double precision.
Link to a fine tuned library
In the end, Eigen spits out some finely built C++ code and relies on the compiler to handle most of the optimizations itself. In some cases, a more finely tuned library such as MKL may improve performance. Eigen can link to MKL to squeeze a bit more speed out of the hardware.

Related

Why it is appropriate to use `std::uniform_real_distribution`?

I'm trying to write Metropolis Monte Carlo simulation code.
Since the simulation will be very long, I'd like to think seriously about the performance for generating random numbers in [0, 1].
So I decided to check the performance of two methods by the following code:
#include <cfloat>
#include <chrono>
#include <iostream>
#include <random>
int main()
{
constexpr auto Ntry = 5000000;
std::mt19937 mt(123);
std::uniform_real_distribution<double> dist(0.0, std::nextafter(1.0, DBL_MAX));
double test1, test2;
// method 1
auto start1 = std::chrono::system_clock::now();
for (int i=0; i<Ntry; i++) {
test1 = dist(mt);
}
auto end1 = std::chrono::system_clock::now();
auto elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
std::cout << elapsed1 << std::endl;
// method 2
auto start2 = std::chrono::system_clock::now();
for (int i=0; i<Ntry; i++) {
test2 = 1.0*mt() / mt.max();
}
auto end2 = std::chrono::system_clock::now();
auto elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
std::cout << elapsed2 << std::endl;
}
Then the result is
295489 micro sec for method 1
79884 micro sec for method 2
I understand that there are many posts that recommend to use std::uniform_real_distribution.
But performance-wise, it is tempting to use the latter as this result shows.
Would you tell me what is the point of using std::uniform_real_distribution?
What is the disadvantage of using 1.0*mt() / mt.max()?
And in the current purpose, is it acceptable to use 1.0*mt() / mt.max() instead?
Edit:
I compiled this code with g++-11 test.cpp.
When I compile with -O3 flag, the result is qualitatively same (the method 1 is approx. 1.8 times slower).
I would like to discuss what is the advantage of the widely-used method.
I do concern the trend of performances, but specific performance comparison is out of my scope.
You use the standard random library because it is extremely difficult to do numerical calculations correctly and you don't want the burden of proving and maintaining your own random library.
Case in point, your random distribution is wrong. std::mt19937 produces 32-bit integers, yet you're expecting a double, which has a 53-bit significand (usually). There are values in the range [0, 1] that you will never obtain from 1.0*mt() / mt::max().
Your testing methodology is flawed. You don't use the result that you produce, so a smart optimiser may simply skip producing a result.
Would you tell me what is the point of using std::uniform_real_distribution?
The clue is in the name. It produces a uniform distribution.
Furthermore, it allows you to specify the minimum and maximum between which you want the distribution to lie.
What is the disadvantage of using 1.0*mt() / mt.max()?
You cannot specify a minimum and a maximum.
It produces a less uniform distribution.
It produces less randomness.
is it acceptable to use 1.0*mt() / mt.max() instead?
In some use cases, it could be acceptable. In some other cases, it isn't acceptable. In the rest, it won't matter.

C++ is way slower than MATLAB

I am trying to generate 5000 by 5000 random number matrix. Here is what I do with MATLAB:
for i = 1:100
rand(5000)
end
And here is what I do in C++:
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
using namespace std;
int main(){
int N = 5000;
double ** A = new double*[N];
for (int i=0;i<N;i++)
A[i] = new double[N];
srand(time(NULL));
clock_t start = clock();
for (int k=0;k<100;k++){
for (int i=0;i<N;i++){
for (int j=0;j<N;j++){
A[i][j] = rand();
}
}
}
cout << "T="<< (clock()-start)/(double)(CLOCKS_PER_SEC/1000)<< "ms " << endl;
}
MATLAB takes around 38 seconds while C++ takes around 90 seconds.
In another question, people executed the same code and got same speeds for both C++ and MATLAB.
I am using visual C++ with the following optimizations
I would like to learn what I am missing here? Thank you for all the help.
EDIT: Here is the key thing though...
Why MATLAB is faster than C++ in creating random numbers?
In this question, people gave me answers where their C++ speeds are same as MATLAB. When I use the same code I get way worse speeds and I am trying to understand why.
Your test is flawed, as others have noted, and does not even address the statement made by the title. You are comparing an inbuilt Matlab function to C++, not Matlab code itself, which in fact executes 100x more slowly than C++. Matlab is just a wrapper around the BLAS/LAPACK libraries in C/Fortran so one would expect a Matlab script, and a competently written C++ to be approximately equivalent, and indeed they are: This code in Matlab 2007b
tic; A = rand(5000); toc
executes in 810ms on my machine and this
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
#define N 5000
int main()
{
srand(time(NULL));
clock_t start = clock();
int num_rows = N,
num_cols = N;
double * A = new double[N*N];
for (int i=0; i<N*N; ++i)
A[i] = rand();
std::cout << "T="<< (clock()-start)/(double)(CLOCKS_PER_SEC/1000)<< "ms " << std::endl;
return 0;
}
executes in 830ms. A slight advantage for Matlab's in-house RNG over rand() is not too surprising. Note also the single indexing. This is how Matlab does it, internally. It then uses a clever indexing system (developed by others) to give you a matrix-like interface to the data.
In your C++ code, you are doing 5000 allocations of double[5000] on the heap. You would probably get much better speed if you did a single allocation of a double[25000000], and then do your own arithmetic to convert your 2 indices to a single one.
I believe MATLAB utilize multiple cpu cores on your machine. Have you try to write a multi-threaded version and measure the difference?
Also, the quality of (pseudo) random would also make slightly difference (but not that much).
In my experience,
First check that you execute your C++ code in release mode instead of in Debug mode. (Although I see in the picture you are in release mode)
Consider MPI parallelization.
Bear in mind that MATLAB is highly optimized and compiled with the Intel compiler which produces faster executables. You can also try more advanced compilers if you can afford them.
Last you can make a loop aggregation by using a function to generate combinations of i, j in a single loop. (In python this is a common practice given by the function product from the itertools library, see this)
I hope it helps.

Make Sparse Matrix Multiply Fast

The code is written using C++11. Each Process got tow Matrix Data(Sparse). The test data can be downloaded from enter link description here
Test data contains 2 file : a0 (Sparse Matrix 0) and a1 (Sparse Matrix 1). Each line in file is "i j v", means the sparse matrix Row i, Column j has the value v. i,j,v are all integers.
Use c++11 unordered_map as the sparse matrix's data structure.
unordered_map<int, unordered_map<int, double> > matrix1 ;
matrix1[i][j] = v ; //means at row i column j of matrix1 is value v;
The following code took about 2 minutes. The compile command is g++ -O2 -std=c++11 ./matmult.cpp.
g++ version is 4.8.1, Opensuse 13.1. My computer's info : Intel(R) Core(TM) i5-4200U CPU # 1.60GHz, 4G memory.
#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>
using namespace std;
void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
ifstream input ;
input.open(fn);
int i, j ; double v;
while (input >> i >> j >> v) {
m[i][j] = v;
}
}
unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
//vector<vector<int> > keys(BLK_SIZE);
int main() {
load("./a0",m1);
load("./a1",m2);
for (auto r1 : m1) {
for (auto r2 : m2) {
double sim = 0.0 ;
for (auto c1 : r1.second) {
auto f = r2.second.find(c1.first);
if (f != r2.second.end()) {
sim += (f->second) * (c1.second) ;
}
}
}
}
return 0;
}
The code above is too slow. How can I make it run faster? I use multithread.
The new code is following, compile command is g++ -O2 -std=c++11 -pthread ./test.cpp. And it took about 1 minute. I want it to be faster.
How Can I make the task faster? Thank you!
#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>
#define BLK_SIZE 8
using namespace std;
void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
ifstream input ;
input.open(fn);
int i, j ; double v;
while (input >> i >> j >> v) {
m[i][j] = v;
}
}
unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
vector<vector<int> > keys(BLK_SIZE);
void thread_sim(int blk_id) {
for (auto row1_id : keys[blk_id]) {
auto r1 = m1[row1_id];
for (auto r2p : m2) {
double sim = 0.0;
for (auto col1 : r1) {
auto f = r2p.second.find(col1.first);
if (f != r2p.second.end()) {
sim += (f->second) * col1.second ;
}
}
}
}
}
int main() {
load("./a0",m1);
load("./a1",m2);
int df = BLK_SIZE - (m1.size() % BLK_SIZE);
int blk_rows = (m1.size() + df) / (BLK_SIZE - 1);
int curr_thread_id = 0;
int index = 0;
for (auto k : m1) {
keys[curr_thread_id].push_back(k.first);
index++;
if (index==blk_rows) {
index = 0;
curr_thread_id++;
}
}
cout << "ok" << endl;
std::thread t[BLK_SIZE];
for (int i = 0 ; i < BLK_SIZE ; ++i){
t[i] = std::thread(thread_sim,i);
}
for (int i = 0; i< BLK_SIZE; ++i)
t[i].join();
return 0 ;
}
Most times when working with sparse matrices one uses more efficient representations than the nested maps you have. Typical choices are Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC). See https://en.wikipedia.org/wiki/Sparse_matrix for details.
You haven't specified the time you expect your example to run in or the platform you hope to run on. These are important design contraints in this example.
There are several areas that I can think of for improving the efficeny of this:-
Improve the way the data is stored
Improve the multithreading
Improve the algorithm
The first point is geared toward the way the system stores the sparse arrays and the interfaces to enable the data to be read. Nested unordered_maps are a good option when speed isn't important but there may be more specific data structures available that are geared toward this problem. At best you may find a library that provides a better way to store the data than nested maps, at worst you may have to come up with something yourself.
The second point refers to the way the multithreading is supported in the language. The original spec for the multithreading system were meant to be platform independant and might miss out handy features some systems might have. Decide what system you want to target and use the OSs threading system. You'll have more control over the way the threading works, possibly reduce the overhead but will lose out on the cross platform support.
The third point will take a bit of work. Is the way you're multiplying the matricies really the most efficent way given the nature of the data. I'm no expert on these things but it is something to consider but it will take a bit of effort.
Lastly, you can always be very specific about the platform you're running on and head into the world of assembly programming. Modern CPUs are complicated beasts. They can sometimes perform operations in parallel. For example, you may be able to do SIMD operations or do parallel integer and floating point operations. Doing this does require a deep understanding of what's going on and there are useful tools to help you out. Intel did have a tool called VTune (it may be something else now) that would analyse code and highlight potential bottlenecks. Ultimately, you'll be wanting to eliminate areas of the algorithm where the CPU is idle waiting for something to happen (like waiting for data from RAM) either by finding something else for the CPU to do or improving the algorithm (or both).
Ultimately, in order to improve the overall speed, you'll need to know what is slowing it down. This generally means knowing how to analyse your code and understand the results. Profilers are the general tool for this but there are platform specific tools available as well.
I know this isn't quite what you want but making code fast is really hard and very time consuming.

Pre-compute once cos() and sin() in tables

I'd like to improve performance of my Dynamic Linked Library (DLL).
For that I want to use lookup tables of cos() and sin() as I use a lot of them.
As I want maximum performance, I want to create a table from 0 to 2PI that contains the resulting cos and sin computations.
For a good result in term of precision, I think tables of 1 mb for each function is a good trade between size and precision.
I would like to know how to create and uses these tables without using an external file (as it is a DLL) : I want to keep everything within one file.
Also I don't want to compute the sin and cos function when the plugin starts : they have to be computed once and put in a standard vector.
But how do I do that in C++?
EDIT1: code from jons34yp is very good to create the vector files.
I did a small benchmark and found that if you need good precision and good speed you can do a 250000 units vector and linear interpolate between them you will have a 7.89E-11 max error (!) and it is the fastest between all the approximations I tried (and it is more than 12x faster than sin() (13,296 x faster exactly)
Easiest solution is to write a separate program that creates a .cc file with definition of your vector.
For example:
#include <iostream>
#include <cmath>
int main()
{
std::ofstream out("values.cc");
out << "#include \"static_values.h\"\n";
out << "#include <vector>\n";
out << "std::vector<float> pi_values = {\n";
out << std::precision(10);
// We only need to compute the range from 0 to PI/2, and use trigonometric
// transformations for values outside this range.
double range = 3.141529 / 2;
unsigned num_results = 250000;
for (unsigned i = 0; i < num_results; i++) {
double value = (range / num_results) * i;
double res = std::sin(value);
out << " " << res << ",\n";
}
out << "};\n"
out.close();
}
Note that this is unlikely to improve performance, since a table of this size probably won't fit in your L2 cache. This means a large percentage of trigonometric computations will need to access RAM; each such access costs roughly several hundreds of CPU cycles.
By the way, have you looked at approximate SSE SIMD trigonometric libraries. This looks like a good use case for them.
You can use precomputation instead of storing them already precomputed in the executable:
double precomputed_sin[65536];
struct table_filler {
table_filler() {
for (int i=0; i<65536; i++) {
precomputed_sin[i] = sin(i*2*3.141592654/65536);
}
}
} table_filler_instance;
This way the table is computed just once at program startup and it's still at a fixed memory address. After that tsin and tcos can be implemented inline as
inline double tsin(int x) { return precomputed_sin[x & 65535]; }
inline double tcos(int x) { return precomputed_sin[(x + 16384) & 65535]; }
The usual answer to this sort of question is to write a small
program which generates a C++ source file with the values in
a table, and compile it into your DLL. If you're thinking of
tables with 128000 entries (128000 doubles are 1MB), however,
you might run up against some internal limits in your compiler.
In that case, you might consider writing the values out to
a file as a memory dump, and mmaping this file when you load
the DLL. (Under windows, I think you could even put this second
file into a second stream of your DLL file, so you wouldn't have
to distribute a second file.)

Benchmarking math.h square root and Quake square root

Okay so I was board and wondered how fast math.h square root was in comparison to the one with the magic number in it (made famous by Quake but made by SGI).
But this has ended up in a world of hurt for me.
I first tried this on the Mac where the math.h would win hands down every time then on Windows where the magic number always won, but I think this is all down to my own noobness.
Compiling on the Mac with "g++ -o sq_root sq_root_test.cpp" when the program ran it takes about 15 seconds to complete. But compiling in VS2005 on release takes a split second. (in fact I had to compile in debug just to get it to show some numbers)
My poor man's benchmarking? is this really stupid? cos I get 0.01 for math.h and 0 for the Magic number. (it cant be that fast can it?)
I don't know if this matters but the Mac is Intel and the PC is AMD. Is the Mac using hardware for math.h sqroot?
I got the fast square root algorithm from http://en.wikipedia.org/wiki/Fast_inverse_square_root
//sq_root_test.cpp
#include <iostream>
#include <math.h>
#include <ctime>
float invSqrt(float x)
{
union {
float f;
int i;
} tmp;
tmp.f = x;
tmp.i = 0x5f3759df - (tmp.i >> 1);
float y = tmp.f;
return y * (1.5f - 0.5f * x * y * y);
}
int main() {
std::clock_t start;// = std::clock();
std::clock_t end;
float rootMe;
int iterations = 999999999;
// ---
rootMe = 2.0f;
start = std::clock();
std::cout << "Math.h SqRoot: ";
for (int m = 0; m < iterations; m++) {
(float)(1.0/sqrt(rootMe));
rootMe++;
}
end = std::clock();
std::cout << (difftime(end, start)) << std::endl;
// ---
std::cout << "Quake SqRoot: ";
rootMe = 2.0f;
start = std::clock();
for (int q = 0; q < iterations; q++) {
invSqrt(rootMe);
rootMe++;
}
end = std::clock();
std::cout << (difftime(end, start)) << std::endl;
}
There are several problems with your benchmarks. First, your benchmark includes a potentially expensive cast from int to float. If you want to know what a square root costs, you should benchmark square roots, not datatype conversions.
Second, your entire benchmark can be (and is) optimized out by the compiler because it has no observable side effects. You don't use the returned value (or store it in a volatile memory location), so the compiler sees that it can skip the whole thing.
A clue here is that you had to disable optimizations. That means your benchmarking code is broken. Never ever disable optimizations when benchmarking. You want to know which version runs fastest, so you should test it under the conditions it'd actually be used under. If you were to use square roots in performance-sensitive code, you'd enable optimizations, so how it behaves without optimizations is completely irrelevant.
Also, you're not benchmarking the cost of computing a square root, but of the inverse square root.
If you want to know which way of computing the square root is fastest, you have to move the 1.0/... division down to the Quake version. (And since division is a pretty expensive operation, this might make a big difference in your results)
Finally, it might be worth pointing out that Carmacks little trick was designed to be fast on 12 year old computers. Once you fix your benchmark, you'll probably find that it's no longer an optimization, because today's CPU's are much faster at computing "real" square roots.