C++ is way slower than MATLAB

I am trying to generate a 5000-by-5000 matrix of random numbers. Here is what I do in MATLAB:
for i = 1:100
    rand(5000)
end
And here is what I do in C++:
#include <iostream>
#include <cstdlib>
#include <ctime>
using namespace std;

int main() {
    int N = 5000;
    // Allocate the matrix as N separate heap-allocated rows
    double **A = new double*[N];
    for (int i = 0; i < N; i++)
        A[i] = new double[N];
    srand(time(NULL));
    clock_t start = clock();
    for (int k = 0; k < 100; k++) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                A[i][j] = rand();
            }
        }
    }
    cout << "T=" << (clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << "ms" << endl;
}
MATLAB takes around 38 seconds while C++ takes around 90 seconds.
In another question, people ran the same code and got the same speeds for both C++ and MATLAB.
I am using Visual C++ with the optimization settings shown in the attached picture.
I would like to know what I am missing here. Thank you for all the help.
EDIT: Here is the key thing though...
Why is MATLAB faster than C++ at generating random numbers?
In that question, people gave answers where their C++ speed was the same as MATLAB's. When I run the same code I get much worse speeds, and I am trying to understand why.

Your test is flawed, as others have noted, and does not even address the claim made in the title. You are comparing a built-in MATLAB function to C++, not MATLAB code itself, which in fact executes on the order of 100x more slowly than C++. MATLAB is essentially a wrapper around the BLAS/LAPACK libraries in C/Fortran, so one would expect a MATLAB script and competently written C++ to be approximately equivalent, and indeed they are. This code in MATLAB 2007b
tic; A = rand(5000); toc
executes in 810 ms on my machine, and this
#include <iostream>
#include <cstdlib>
#include <ctime>

#define N 5000

int main()
{
    srand(time(NULL));
    clock_t start = clock();
    // A single flat allocation instead of N separate row allocations
    double *A = new double[N * N];
    for (int i = 0; i < N * N; ++i)
        A[i] = rand();
    std::cout << "T=" << (clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << "ms" << std::endl;
    return 0;
}
executes in 830 ms. A slight advantage for MATLAB's in-house RNG over rand() is not too surprising. Note also the single indexing: this is how MATLAB stores matrices internally. It then uses a clever indexing system (developed by others) to give you a matrix-like interface to the data.

In your C++ code, you are doing 5000 separate heap allocations of double[5000]. You would probably get much better speed with a single allocation of double[25000000], doing your own arithmetic to convert the two indices into one.
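A minimal sketch of that single-allocation approach (timing code omitted):

#include <cstdlib>
#include <ctime>

int main() {
    const int N = 5000;
    srand(time(NULL));
    // One contiguous block of N*N doubles instead of N separate rows
    double *A = new double[static_cast<size_t>(N) * N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[static_cast<size_t>(i) * N + j] = rand(); // row-major: (i, j) -> i*N + j
    delete[] A;
    return 0;
}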

I believe MATLAB utilizes multiple CPU cores on your machine. Have you tried writing a multi-threaded version and measuring the difference?
Also, the quality of the (pseudo)random numbers makes a slight difference, but not that much.
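A rough sketch of what such a multi-threaded version could look like, splitting rows across hardware threads (rand() is not guaranteed to be thread-safe, so each thread gets its own <random> engine):

#include <algorithm>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

int main() {
    const int N = 5000;
    std::vector<double> A(static_cast<size_t>(N) * N);
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Per-thread generator: avoids contention and thread-safety issues
            std::mt19937 gen(std::random_device{}() + t);
            std::uniform_real_distribution<double> dist(0.0, 1.0);
            for (int i = static_cast<int>(t); i < N; i += num_threads) // interleaved rows
                for (int j = 0; j < N; ++j)
                    A[static_cast<size_t>(i) * N + j] = dist(gen);
        });
    }
    for (auto &w : workers) w.join();
    std::cout << "A[0] = " << A[0] << std::endl;
    return 0;
}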

In my experience:
First, check that you are running your C++ code in Release mode instead of Debug mode. (Although I see in the picture that you are in Release mode.)
Consider MPI parallelization.
Bear in mind that MATLAB is highly optimized and compiled with the Intel compiler, which produces faster executables. You can also try more advanced compilers if you can afford them.
Finally, you can aggregate the loops by generating all combinations of i and j in a single loop, much like the single-index version shown above. (In Python this is a common practice using the product function from the itertools library.)
I hope it helps.

Related

Multithreading a larger C++ program on a CPU machine

Working with a small program can be easy (in terms of run time) even though I have only a few cores on my computer; for example, this C++ program:
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<int> g1;
    for (int i = 1; i <= 10; i++)
        g1.push_back(i * 10);
    // Valid indices are 0..9; looping from 1 to 10 would read past the end
    for (int i = 0; i < 10; i++) {
        cout << g1[i] << endl;
    }
    return 0;
}
But I'm going to work with a program that has a very large vector (more than a million elements). There are a lot of other processes running as well, which makes it hard to finish the program on my computer (a MacBook) in a short run time. Is there any way I can do this in parallel (I mean with multiple threads)? That is, run the same program but reduce the time by processing in multiple threads. I'm very new to parallel computing, so let me know if the question is not clear enough.
My computer's memory (MacBook): 8 GB 1600 MHz DDR3
Processor: 1.6 GHz Dual-Core Intel Core i5
If all threads are contending for the same resource (the vector g1), then unfortunately there will not be a significant time saving.
Threads are good when they run asynchronously on separate resources, for example when each thread works on its own disjoint range of the data, as in the sketch below.
Here is another question that goes into more depth on accessing an STL vector from multiple threads: C++ Access to vector from multiple threads
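A minimal sketch of that pattern, assuming the per-element work is independent (here each thread simply doubles the values in its own chunk):

#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    const size_t chunk = data.size() / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t == num_threads - 1) ? data.size() : begin + chunk;
        // Each thread writes only to its own disjoint range, so there is no data race
        workers.emplace_back([&data, begin, end] {
            for (size_t i = begin; i < end; ++i)
                data[i] *= 2;
        });
    }
    for (auto &w : workers) w.join();
    std::cout << "data[0] = " << data[0] << std::endl;
    return 0;
}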

How to speed up LU decomposition in Eigen C++?

I am new to C++ and the Eigen library. I want to perform LU decomposition (with partial pivoting) on a 1815 x 1815 matrix with complex entries. However, the performance of my code is bad: the LU decomposition takes 77.2852 seconds, compared to only 0.140946 seconds in MATLAB. Please find the code attached. Any advice on how I can improve it? Note that in the first part of the code I am importing the matrix from a file whose entries have the form a + bi, where a and b are real numbers. The matrix file was generated from MATLAB. Thank you.
#include <iostream>
#include <Eigen/Dense>
#include <fstream>
#include <sstream> // needed for stringstream
#include <complex>
#include <string>
#include <chrono>
using namespace std;
using namespace std::chrono;
using namespace Eigen;

int main(){
    int mat_sz = 1815; // size of matrix
    MatrixXcd c_mat(mat_sz, mat_sz); // initialize Eigen matrix
    double re, im;
    char sign;
    string entry;
    ifstream myFile("A_mat"); // format of entries: a + bi, where a and b are real numbers

    // Import and assign matrix to an Eigen matrix
    for (int i = 0; i < mat_sz; i++){
        for (int j = 0; j < mat_sz; j++){
            myFile >> entry;
            stringstream stream(entry);
            stream >> re >> sign >> im;
            c_mat(i, j) = {re, (sign == '-') ? -im : im}; // assign matrix entry
        }
    }

    // LU decomposition
    auto start = high_resolution_clock::now();
    c_mat.partialPivLu(); // partial-pivoting LU decomposition
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    double million = 1000000;
    cout << "Time taken by function: " << duration.count()/million << " seconds" << endl;
}
I'll summarize the comments into an answer.
When you feel that Eigen is running slowly, there is a list of things that should be verified.
Are optimizations turned on?
Eigen is a template-heavy library that does a lot of compile-time checks which should be optimized out. If optimizations are not on, none of it gets inlined and many pointless function calls are made. Turning on even the lowest level of optimizations usually alleviates most of this (-O1 or higher in gcc/clang, /O1 or higher in MSVC). General notes on optimizations can be found here.
Am I utilizing all the hardware options?
A lot of code in Eigen can be vectorized if allowed. Make sure this is enabled with flags turning on SSE/AVX/etc., if the target hardware supports them. Enable FMA if available as well. There's a placeholder doc here.
Enable multithreading
If your process/hardware allow, consider enabling OpenMP to allow Eigen to utilize multiple cores for some of the operations.
Use the right precision
In many applications, only the first few digits matter. If this is the case in your application, consider using single precision instead of double precision.
Link to a fine tuned library
In the end, Eigen spits out some finely built C++ code and relies on the compiler to handle most of the optimizations itself. In some cases, a more finely tuned library such as MKL may improve performance. Eigen can link to MKL to squeeze a bit more speed out of the hardware.
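As a quick way to check the vectorization and threading points at runtime, Eigen provides a couple of introspection helpers; a minimal sketch:

#include <iostream>
#include <Eigen/Core>

int main() {
    // Which SIMD instruction sets Eigen was compiled to use
    std::cout << "SIMD in use: " << Eigen::SimdInstructionSetsInUse() << std::endl;
    // How many threads Eigen will use (needs OpenMP enabled to be > 1)
    std::cout << "Threads: " << Eigen::nbThreads() << std::endl;
    return 0;
}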

C++ code doesn't run while using fftw library

I am trying to implement a DFT using the fftw library (link to FFTW documentation).
All the libraries have been correctly linked, and the project builds just fine. However, the code doesn't run the moment any function from the fftw library is called.
#include <iostream>
#include <fftw3.h>
using namespace std;

int main() {
    int vectorSize = 100;
    cout << vectorSize << endl;
    fftw_complex vec[vectorSize], vecOut[vectorSize];
    for (int i = 0; i < vectorSize; i++) {
        vec[i][0] = i;
        vec[i][1] = 1;
    }
    // Call to function to create an FFT plan
    fftw_plan plan = fftw_plan_dft_1d(vectorSize, vec, vecOut, FFTW_FORWARD, FFTW_ESTIMATE);
    cout << "test" << endl;
    return 0;
}
If I comment out the line where the fftw_plan is instantiated, the code outputs 100 and "test" as expected. There are no issues in the build, as far as I can tell. I haven't been able to find any post that describes a similar problem.
I am running this in Eclipse, using MinGW and the 32-bit version of the pre-compiled binary available for Windows (download link).
Any help would be really appreciated :)
FFTW requires input/output buffers to be 16-byte aligned. When you declare the arrays on the stack, this can't be guaranteed, so you need to call fftw_malloc or a similar function to allocate the arrays. Also, your code only creates the plan but doesn't execute it, so no FFT is actually carried out on the input data.
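A minimal corrected sketch along those lines, using fftw_malloc for aligned buffers and fftw_execute to actually run the transform:

#include <iostream>
#include <fftw3.h>

int main() {
    int vectorSize = 100;
    // fftw_malloc returns buffers aligned for FFTW's SIMD code paths
    fftw_complex *vec = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * vectorSize);
    fftw_complex *vecOut = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * vectorSize);
    for (int i = 0; i < vectorSize; i++) {
        vec[i][0] = i; // real part
        vec[i][1] = 1; // imaginary part
    }
    fftw_plan plan = fftw_plan_dft_1d(vectorSize, vec, vecOut, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan); // the plan does nothing until it is executed
    std::cout << "bin 0: " << vecOut[0][0] << " + " << vecOut[0][1] << "i" << std::endl;
    fftw_destroy_plan(plan);
    fftw_free(vec);
    fftw_free(vecOut);
    return 0;
}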

Finding maximum value in Python vs. C++

I am just curious as to why finding the maximum value in C++ is faster than in Python3. Here is a snippet of my code in both languages:
C++:
#include <iostream>
using namespace std;

int main() {
    int arr[] = {45, 67, 89};
    int temp = 0;
    for (int n = 0; n < 3; n++) {
        if (arr[n] > temp)
            temp = arr[n];
    }
    cout << "Biggest number: " << temp << endl;
}
Python:
def Main():
    numbers = [87, 67, 32, 43]  # integers, so max() compares numerically
    print(max(numbers))

if __name__ == "__main__":
    Main()
As illustrated in the code, I am finding the maximum value in C++ by looping over each element of the array, whereas in Python I use the built-in max() function.
I then ran the code from the terminal to measure the execution times and found that they take approximately 0.006 s (C++) and 0.032 s (Python). Is there a way to further shorten Python's execution time?
Python is an interpreted language: the interpreter has to read the text file with the Python code, parse it, and only then begin executing it.
By the time the C++ code executes, the C++ compiler already did all the heavy lifting of compiling C++ into native machine language code that gets directly executed by the CPU.
It is possible to precompile Python code; this will save some of that overhead, but the C++ code still gets the benefit of C++ compile-time optimization. With a small array size, an aggressive C++ compiler is likely to unroll the loop, and may even compute the maximum value at compile time instead of at runtime, so all you end up executing is:
cout << "Biggest number: " << 89 << endl;
This is something that, in theory, Python could also do; however, it would take even more CPU cycles to figure out at runtime.
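To illustrate that compile-time point, here is a small sketch (C++17 or later, where std::max_element is constexpr); an optimizing compiler folds the whole search into the constant 89:

#include <algorithm>
#include <iostream>

int main() {
    constexpr int arr[] = {45, 67, 89};
    // Evaluated entirely at compile time; the binary effectively just prints 89
    constexpr int biggest = *std::max_element(std::begin(arr), std::end(arr));
    std::cout << "Biggest number: " << biggest << std::endl;
}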
Assuming you're using a larger vector than the toy example above, I would give numpy a shot.
import numpy as np

# set up a vector with 50000 random elements
a = np.random.randint(0, 100000, 50000)
max_val = np.max(a)
Very fast relative to looping.
On my machine, np.max is about 12x faster than Python's built-in max(). C++ would be faster still, as it's a compiled language. (numpy wraps low-level packages that are optimized C code.)

C++ to CUDA Conversion/String Generation and Comparison

So I am in a basic high-school coding class. We had to think up one of our semester projects, and I chose to base mine on ideas and applications that aren't used in traditional code. This brought up the idea of using CUDA. One of the best ways I know to compare the speed of traditional methods against unconventional ones is string generation and comparison: one could demonstrate the generation and matching speed of traditional CPU generation with timers and output, and then show the increase (or decrease) in speed and output of GPU processing.
I wrote this C++ code to generate random characters that are put into a character array, and then to match that array against a predetermined string. However, like most CPU programming, it is incredibly slow compared to GPU programming. I've looked over the CUDA API and could not find anything that would lead me in the right direction for what I'm trying to do.
Below is the code I have written in C++. If anyone could point me toward things such as a random number generator whose output I can convert to chars using ASCII codes, that would be excellent.
#include <iostream>
#include <string>
#include <cstdlib>
using namespace std;

int sLength = 0;
int count = 0;
int stop = 0;
int maxValue = 0;
string inString = "aB1#";

static const char alphanum[] =
    "0123456789"
    "!@#$%^&*"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz";

int stringLength = sizeof(alphanum) - 1;

char genRandom()
{
    return alphanum[rand() % stringLength];
}

int main()
{
    cout << "Length of string to match?" << endl;
    cin >> sLength;
    string sMatch(sLength, ' ');
    while (true)
    {
        for (int x = 0; x < sLength; x++)
        {
            sMatch[x] = genRandom();
            //cout << sMatch[x];
            count++;
            if (count == 2147000000)
            {
                count = 0; // was `count == 0`, a comparison with no effect
                maxValue++;
            }
        }
        if (sMatch == inString)
        {
            // Widen before multiplying to avoid 32-bit integer overflow
            cout << "It took " << count + (long long)maxValue * 2147000000LL
                 << " randomly generated characters to match the strings." << endl;
            cin >> stop;
        }
        //cout << endl;
    }
}
If you want to implement a pseudorandom number generator using CUDA, have a look over here. If you want to generate chars from a predetermined set of characters, you can just put all possible chars into that array and create a random index (just as you are doing it right now).
But I think a more valuable comparison might be one that uses brute force. You could adapt your program to try not random strings, but one string after another in some meaningful order.
Then, on the other hand, you could implement the brute-force approach on the GPU using CUDA. This can be tricky, since you might want to stop all CUDA threads as soon as one of them finds a solution. I could imagine the brute-force process in CUDA working the following way: one thread tries aa as the first two letters and brute-forces all following characters, the next thread tries ab as the first two letters and brute-forces all following characters, the next tries ac, and so on. All these threads run in parallel. Of course, you could vary the number of predetermined chars so that, e.g., the first thread tries aaaa and the second aaab. Then you could compare different input values.
Anyway, if you have never dealt with CUDA, I recommend the vector-addition sample, a very basic CUDA example that serves very well for getting a basic understanding of what's going on with CUDA. Moreover, you should read the CUDA programming guide to make yourself familiar with CUDA's concept of a grid of thread blocks containing a grid of threads. Once you understand this, I think it becomes clearer how CUDA organizes things. In short, in CUDA you should replace loops with a kernel that is executed multiple times at once.
First off, I am not sure what your actual question is. Do you need a faster random number generator, or one with a greater period? In that case I would recommend boost::random; the Mersenne Twister is generally considered state of the art. It is a little hard to get started with, but Boost is a great library, so it is worth the effort.
I think the method you are using should be fairly efficient. Be aware that it could take on the order of (#characters)^(length of string) draws to reach the target string (here 70^4 = 24,010,000). The GPU should be at an advantage here, since this process is a Monte Carlo simulation and trivially parallelizable.
Have you compiled the code with optimizations enabled?
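As a side note, since C++11 a Mersenne Twister is also available in the standard library as std::mt19937, so Boost is not strictly required; a minimal sketch of the character generator using it (reusing the question's character set):

#include <iostream>
#include <random>
#include <string>

static const std::string alphanum =
    "0123456789"
    "!@#$%^&*"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz";

int main() {
    std::mt19937 gen(std::random_device{}()); // Mersenne Twister engine
    // uniform_int_distribution avoids the modulo bias of rand() % n
    std::uniform_int_distribution<size_t> dist(0, alphanum.size() - 1);
    std::string s(4, ' ');
    for (char &c : s)
        c = alphanum[dist(gen)];
    std::cout << s << std::endl;
    return 0;
}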