Why is MATLAB faster than C++ at creating random numbers? - c++

I have been using MATLAB for a while for my projects and I have almost no experience with C++.
I needed speed and I heard that C++ can be more efficient and faster than MATLAB, so I tried this:
I created a matrix of random numbers using rand(5000,5000) in MATLAB.
In C++, I initialized a 2D vector and wrote two nested for loops, each iterating 5000 times and assigning rand() on every pass. MATLAB was 4-5x faster, so I thought it was because MATLAB executes vectorized code in parallel, and I then rewrote the C++ code using parallel_for. Here is the code:
#include "stdafx.h"
#include <iostream>
#include <vector>
#include <fstream>
#include <ppl.h>
using namespace std;
using namespace concurrency;
int main();
{
int a = 5000, b = 5000, j, k;
vector< vector<int> > vec(a, vector<imt>(b));
parallel_for(int(0), a, [&](int i) {
for (j = 0; j <b; j++)
{
vec[i][j] = rand();
}
});
}
The code above is about 25% faster than MATLAB's rand(5000,5000), yet C++ uses 100% of the CPU while MATLAB uses only 30%.
So I forced MATLAB to use all of the CPU by running 3 instances of MATLAB with rand(5000,5000) and dividing the total time by 3. That made MATLAB twice as fast as C++.
What am I missing? I know this is a tiny example, but I need an answer before I commit to porting my code to C++.
Current status:
When I write the C++ code without parallel_for (see the serial sketch just below), I get half of MATLAB's speed at the same CPU usage. Yet the people who answered say the two are almost the same. I do not understand what I am missing.
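For reference, the serial version I describe is essentially this (a minimal sketch, not my exact code):

#include <cstdlib>
#include <vector>
using namespace std;

int main()
{
    const int a = 5000, b = 5000;
    vector< vector<int> > vec(a, vector<int>(b));
    // Plain nested loops, single-threaded: one rand() call per element
    for (int i = 0; i < a; i++)
        for (int j = 0; j < b; j++)
            vec[i][j] = rand();
}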
Here is a snapshot of the optimization menu:

This is maybe not an answer, but a little hint.
The comparison might be a bit unfair due to the usage of vectors.
Here is a comparison I've written. Both versions take up roughly 100% of one of the four available threads. In both cases I create 5000x5000 random numbers and repeat this 100 times for timing.
Matlab
function stackoverflow
tic
for i=1:100
    A = rand(5000);
end
toc
Runtime: ~27.9 sec
C++
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
using namespace std;

int main(){
    int N = 5000;
    // Allocate a 5000x5000 matrix as an array of row pointers
    double ** A = new double*[N];
    for (int i=0;i<N;i++)
        A[i] = new double[N];

    srand(time(NULL));
    clock_t start = clock();
    for (int k=0;k<100;k++){
        for (int i=0;i<N;i++){
            for (int j=0;j<N;j++){
                A[i][j] = rand();
            }
        }
    }
    cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
}
Runtime: ~28.7 sec
So both examples run almost equally fast.

When you call rand(5000,5000) in Matlab, Matlab executes the command by calling the Intel MKL library, which is a highly optimized library written in C/C++ with lots of hand-coded assembly.
MKL should be faster than any straightforward C++ implementation, but there is overhead for Matlab to call an external library. The net result is that for random number generation at smaller sizes (less than 1K elements, for instance), a plain C/C++ implementation will be faster, but at larger sizes Matlab benefits from the super-optimized MKL.
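If you want the C++ side to match what Matlab actually produces (Matlab's default generator is a Mersenne Twister returning uniform doubles, whereas rand() returns ints), a minimal sketch using the standard <random> header could look like this; the seed and size are just placeholders:

#include <random>
#include <vector>
#include <cstddef>

int main() {
    const std::size_t N = 5000;
    std::vector<double> A(N * N);

    std::mt19937 gen(42);                                   // Mersenne Twister, same family as Matlab's default rng
    std::uniform_real_distribution<double> dist(0.0, 1.0);  // uniform doubles in [0,1)

    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] = dist(gen);
}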

After looking at @sonystarmap's answer, I added a few types of containers: double*, vector<double> and vector<vector<double> >. I also added tests where the "pointer containers" are memset first, since vector initialises all of its memory.
The C++ code was compiled with these optimization flags: -O3 -march=native
The results:
Matlab: Elapsed time is 28.457788 seconds.
C++:
T=23844.2ms
T=25161.5ms
T=25154ms
T=24197.3ms
T=24235.2ms
T=24166.1ms
I essentially cannot reproduce the large gain you mention.
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
#include <vector>
#include <cstring>
using namespace std;

int main(){
    const int N = 5000;

    { // flat vector<double>, indexed as i*N+j
        vector<double> A(N*N);
        srand(0);
        clock_t start = clock();
        for (int k=0;k<100;k++){
            for (int i=0;i<N;i++){
                for (int j=0;j<N;j++){
                    A[i*N+j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    { // vector<vector<double> >
        vector<vector<double> > A(N);
        for (int i=0;i<N;i++)
            A[i] = vector<double>(N);
        srand(0);
        clock_t start = clock();
        for (int k=0;k<100;k++){
            for (int i=0;i<N;i++){
                for (int j=0;j<N;j++){
                    A[i][j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    { // raw double**, rows allocated separately, not initialised
        double ** A = new double*[N];
        for (int i=0;i<N;i++)
            A[i] = new double[N];
        srand(0);
        clock_t start = clock();
        for (int k=0;k<100;k++){
            for (int i=0;i<N;i++){
                for (int j=0;j<N;j++){
                    A[i][j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    { // raw double** with memset, so the memory is touched before timing
        double ** A = new double*[N];
        for (int i=0;i<N;i++) {
            A[i] = new double[N];
            memset(A[i], 0, sizeof(double) * N);
        }
        srand(0);
        clock_t start = clock();
        for (int k=0;k<100;k++){
            for (int i=0;i<N;i++){
                for (int j=0;j<N;j++){
                    A[i][j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    { // flat double*, indexed as i*N+j, not initialised
        double * A = new double[N * N];
        srand(0);
        clock_t start = clock();
        for (int k=0;k<100;k++){
            for (int i=0;i<N;i++){
                for (int j=0;j<N;j++){
                    A[i*N + j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }

    { // flat double* with memset
        double * A = new double[N * N];
        memset(A, 0, sizeof(double) * N * N);
        srand(0);
        clock_t start = clock();
        for (int k=0;k<100;k++){
            for (int i=0;i<N;i++){
                for (int j=0;j<N;j++){
                    A[i*N + j] = rand();
                }
            }
        }
        cout << "T=" << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms " << endl;
    }
}

#include <vector>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cstring>

int main() {
    const int N = 5000;
    std::vector<int> A(N*N);
    srand(0);
    clock_t start = clock();
    for(int k = 0; k < 100; ++k){
        for(int i = 0; i < N * N; ++i) {
            A[i] = rand();
        }
    }
    std::cout << (clock()-start)/(double)(CLOCKS_PER_SEC/1000) << "ms" << "\n";
    return 0;
}
Went from 25-27 seconds on my workstation with no compiler optimization flags to 21 seconds with
-O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native -ffast-math

Related

Timer for counting the total execution time of a program in C++?

I have the following simple program, and I would like to add a timer to it to count the exact time it takes to execute the loops n times.
for (int i=0; i<n; i++){
    for (int j=1; j>=i; j++){
        cout << "perfecto" << endl;
    }
}
I was thinking of using the ctime library to help me out with the timer.
#include <iostream>
#include <ctime>
using namespace std;
time_t time_1;
time_t time_2;
time ( &time_1);
int main(){
    int n=5;
    for (int i=0; i<n; i++){
        for (j=0; j<=i; j++){
            cout << 'test';
        }
    }
}
time (&time_2 );
cout<<'time taken for the algorithm :'<<time_2 - time_1 << seconds <<endl;
Will this work somehow? When I ran it, it showed me an error.
Is there any other way to do it, and is it possible to add a timer when starting the program?
Statements to be executed have to be inside function bodies.
#include <iostream>
#include <ctime>
using namespace std;
int main(){ // move here
    time_t time_1;
    time_t time_2;
    time ( &time_1);
    // move this above
    //int main(){
    int n=5;
    for (int i=0; i<n; i++){
        for (j=0; j<=i; j++){
            cout << 'test';
        }
    }
    // move this below
    //}
    time (&time_2 );
    cout<<'time taken for the algorithm :'<<time_2 - time_1 << seconds <<endl;
} // move here
Also there are some more errors:
Undeclared j and seconds are used.
Multi-character character literals are used where I think string literals should be used.
Do you mean this?
#include <iostream>
#include <ctime>
using namespace std;
int main(){
    time_t time_1;
    time_t time_2;
    time ( &time_1);
    int n=5;
    for (int i=0; i<n; i++){
        for (int j=0; j<=i; j++){
            cout << "test";
        }
    }
    time (&time_2 );
    cout<<"time taken for the algorithm :"<<time_2 - time_1 << "seconds" <<endl;
}

C++ Unhandled exception for large vector/array

I keep getting an unhandled exception in my code and it has me stumped.
I am sure it is in the way I have my variables declared.
Basically I am attempting to create 3 arrays of M rows by N columns of random variables.
If I set N = 1,000 and M = 10,000, there is no problem.
If I then change M to 100,000, I get an unhandled exception / memory allocation error.
Can someone please help me understand why this is happening?
Parts of the code were written in VS2010. I have now moved on to VS2013, so any additional advice on the usage of newer functions would also be appreciated.
cheers,
#include <cmath>
#include <iostream>
#include <random>
#include <vector>
#include <ctime>
#include <ratio>
#include <chrono>
int main()
{
    using namespace std::chrono;
    steady_clock::time_point Start_Time = steady_clock::now();

    unsigned int N;      // Number of time steps in a simulation
    unsigned long int M; // Number of simulations (paths)
    N = 1000;
    M = 10000;

    // Random number generation setup
    double RANDOM;
    srand((unsigned int)time(NULL));                         // Generator loop reset
    std::default_random_engine generator(rand());            // Seed with rand()
    std::normal_distribution<double> distribution(0.0, 1.0); // Mean = 0.0, variance = 1.0, i.e. standard normal

    std::vector<std::vector<double>> RandomVar_A(M, std::vector<double>(N)); // dw
    std::vector<std::vector<double>> RandomVar_B(M, std::vector<double>(N)); // uncorrelated dz
    std::vector<std::vector<double>> RandomVar_C(M, std::vector<double>(N)); // dz

    // Generate random variables for dw
    for (unsigned long int i = 0; i < M; i++)
    {
        for (unsigned int j = 0; j < N; j++)
        {
            RANDOM = distribution(generator);
            RandomVar_A[i][j] = RANDOM;
        }
    }

    // Generate random variables for uncorrelated dz
    for (unsigned long int i = 0; i < M; i++)
    {
        for (unsigned int j = 0; j < N; j++)
        {
            RANDOM = distribution(generator);
            RandomVar_B[i][j] = RANDOM;
        }
    }

    // Generate random variables for dz
    for (unsigned long int i = 0; i < M; i++)
    {
        for (unsigned int j = 0; j < N; j++)
        {
            RANDOM = distribution(generator);
            RandomVar_C[i][j] = RANDOM;
        }
    }

    steady_clock::time_point End_Time = steady_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(End_Time - Start_Time);

    // Clear matrices
    RandomVar_A.clear();
    RandomVar_B.clear();
    RandomVar_C.clear();

    std::cout << std::endl;
    std::cout << "its done";
    std::cout << std::endl << std::endl;
    std::cout << "Time taken : " << time_span.count() << " Seconds" << std::endl << std::endl;
    std::cout << "End Of Program" << std::endl << std::endl;

    system("pause");
    return 0;
}
// *************** END OF PROGRAM ***************
Three 100,000 x 1,000 arrays of doubles represents 300 million doubles. Assuming 8 byte doubles, that's around 2.3 GB of memory. Most likely your process is by default limited to 2 GB on Windows (even if you have much more RAM installed on the machine). However, there are ways to allow your process to access a larger address space: Memory Limits for Windows.
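As a quick sanity check, the arithmetic can be sketched like this (sizes taken from the question; the GiB conversion is just for illustration):

#include <iostream>

int main() {
    // 3 matrices x 100,000 rows x 1,000 columns of 8-byte doubles
    const long long matrices = 3, rows = 100000, cols = 1000, bytes_per_double = 8;
    long long total_bytes = matrices * rows * cols * bytes_per_double;  // 2.4e9 bytes
    std::cout << total_bytes / (1024.0 * 1024.0 * 1024.0) << " GiB\n";  // roughly 2.2 GiB, plus per-vector overhead
}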
I experienced something similar when my 32-bit application allocated more than 2 GB of memory.
Your vectors require about 2.1 GB of memory, so it might be the same problem.
Try changing your application's platform to x64. This may solve the problem.

C++ - Calculate execution time of a function always = 0?

I use clock() from the <ctime> library to calculate the execution time of a function, the BubbleSort(..) function in my code below. The problem is that the reported execution time is always 0 (and it shows no unit either).
This is my code:
#include <iostream>
#include <ctime>
using namespace std;

void BubbleSort(int arr[], int n)
{
    for (int i = 1; i < n; i++)
        for (int j = n-1; j >= i; j--)
            if (arr[j] < arr[j-1])
            {
                int temp = arr[j];
                arr[j] = arr[j-1];
                arr[j-1] = temp;
            }
    return;
}

int main()
{
    int arr[] = {4, 1, 7, 2, 6, 17, 3, 2, 8, 1};
    int len = sizeof(arr)/sizeof(int);

    cout << "Before Bubble Sort: \n";
    for (int i = 0; i < len; i++)
    {
        cout << arr[i] << " ";
    }

    clock_t start_s = clock(); // begin
    BubbleSort(arr, len);
    clock_t stop_s = clock();  // end

    cout << "\nAfter Bubble Sort: \n";
    for (int i = 0; i < len; i++)
    {
        cout << arr[i] << " ";
    }

    // calculate then print out execution time - currently always returns 0 and I don't know why
    cout << "\nExecution time: " << (double)(stop_s - start_s)/CLOCKS_PER_SEC << endl;
    //system("pause");
    return 0;
}
I don't know how to fix this problem yet, so I hope you guys can help me with this. Any comments would be much appreciated. Thanks so much in advance!
As you have only a very small array, the execution time is probably much shorter than the resolution of clock(), so you either have to call the sort algorithm repeatedly or use another time source.
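A minimal sketch of the "call it repeatedly" approach could look like this; the repeat count and the stand-alone harness are my own illustration, reusing the BubbleSort from the question:

#include <iostream>
#include <ctime>
#include <cstring>
using namespace std;

// BubbleSort copied from the question
void BubbleSort(int arr[], int n)
{
    for (int i = 1; i < n; i++)
        for (int j = n-1; j >= i; j--)
            if (arr[j] < arr[j-1])
            {
                int temp = arr[j];
                arr[j] = arr[j-1];
                arr[j-1] = temp;
            }
}

int main()
{
    int arr[] = {4, 1, 7, 2, 6, 17, 3, 2, 8, 1};
    int len = sizeof(arr)/sizeof(int);
    const int repeats = 1000000; // run the sort enough times for the total to exceed clock()'s resolution

    int work[10];
    clock_t start_s = clock();
    for (int r = 0; r < repeats; r++)
    {
        memcpy(work, arr, sizeof(arr)); // restore the unsorted input on every iteration
        BubbleSort(work, len);
    }
    clock_t stop_s = clock();

    cout << "Average time per sort: "
         << (double)(stop_s - start_s) / CLOCKS_PER_SEC / repeats << " seconds" << endl;
    return 0;
}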
I modified your code as follows and both start and stop have the value 0 (Ubuntu 13.10).
std::cout<<"start: "<<start_s<<std::endl;
BubbleSort(arr,len);
clock_t stop_s=clock(); // end
std::cout<<"stop: "<<stop_s<<std::endl;
You probably want something more like gettimeofday().
This http://www.daniweb.com/software-development/cpp/threads/120862/clock-always-returns-0 is an interesting discussion of the same thing. The poster concluded that clock() (on his machine) had a resolution of about 1/100 of a second, and your code is probably (almost certainly) running faster than that.
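If you are on a POSIX system, a minimal gettimeofday() sketch would be the following; the surrounding harness and the microsecond arithmetic are just an illustration:

#include <iostream>
#include <sys/time.h>

int main()
{
    timeval t1, t2;
    gettimeofday(&t1, NULL); // microsecond-resolution wall clock

    // ... code to be timed goes here ...

    gettimeofday(&t2, NULL);
    long long usec = (t2.tv_sec - t1.tv_sec) * 1000000LL + (t2.tv_usec - t1.tv_usec);
    std::cout << "Elapsed: " << usec << " microseconds\n";
    return 0;
}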

Translating pseudocode into C++

I am having a hard time translating this pseudocode into C++. The goal is to generate random numbers into A[], sort them using insertion sort, then get the execution time in milliseconds. Insertion sort would run m=5 times. Each n value should be 100, 200, 300, ..., 1000. So for example, if n=100, that would run 5 times with 5 different sets of random numbers; then do the same thing for n=200, etc.
I have already written my insertion sort, and it works, so I did not include it. I am really just having trouble translating this pseudocode into something I can work with. I included my attempt and the pseudocode so you can compare.
Pseudocode:
main()
    //generate elements using rand()
    for i=1 to 5
        for j=1 to 1000
            A[i,j] = rand()

    //insertion sort
    for (i=1; i<=5; i=i+1)
        for (n=100; n<=1000; n=n+100)
            B[1..n] = A[i,n]
            t1 = time()
            insertionSort(B,n)
            t2 = time()
            t_insort[i,n] = t2-t1

    //compute the avg time
    for (n=100; n<=1000; n=n+100)
        avgt_insort[n] = (t_insort[1,n]+t_insort[2,n]+t_insort[3,n]+...+t_insort[5,n])/5

    //plot graph with avgt_insort
This is my attempt:
I am confused by t_insort and avgt_insort; I did not translate them to C++. Do I make these into new arrays? Also, I am not sure if I am doing the timing correctly either. I am fairly new to measuring running time, so I have never actually written it into code before.
#include <iostream>
#include <stdlib.h>
#include <time.h>

int main()
{
    int A[100];
    for(int i=1; i<=5; i++)
    {
        for(int j=1; j<=1000; j++)
        {
            A[i,j] = rand();
        }
    }
    for(int i=0;i<=5; i++)
    {
        for(int n=100; n<=1000; n=n+100)
        {
            static int *B = new int[n];
            B[n] = A[i,n];
            cout << "\nLength\t: " << n << '\n';
            long int t1 = clock();
            insertionSort(B, n);
            long int t2 = clock();
            //t_insort
            cout << "Insertion Sort\t: " << (t2 - t1) << " ms.\n";
        }
    }
    for(int n=100; n<=1000; n=n+100)
    {
        //avt_insort[n]
    }
    return 0;
}
The pseudocode is relatively close to C++ code, up to some syntactic changes. Note that this C++ code is a straightforward "translation". A better solution would be to use containers from the C++ standard library.
int main()
{
    int A[6][1001], B[1001];        // C++ starts indexing from 0
    double t_insort[6][1001];       // should ideally be the return type of time(), so maybe not double
    double avgt_insort[1001] = {0};
    time_t t1, t2;                  // time() returns time_t
    int i, j, n;

    for (i=1; i<=5; i++)            // in C++ it is more common to start from 0: for(i=0;i<5;i++)
        for (j=1; j<=1000; j++)
            A[i][j] = rand();       // one has to include the appropriate header file with rand()
                                    // or define his/her own function

    for (i=1; i<=5; i++)
        for (n=100; n<=1000; n=n+100)
        {
            B[n] = A[i][n];
            t1 = time(NULL);
            insertionSort(B, n);    // this function has to be defined beforehand
            t2 = time(NULL);
            t_insort[i][n] = t2 - t1; // this may need to change depending on the exact time source used
        }

    for (n=100; n<=1000; n=n+100)
    {
        for (i=1; i<=5; i++)
            avgt_insort[n] += t_insort[i][n];
        avgt_insort[n] /= 5;
    }
    //plot graph with avgt_insort
}
A[i,j] is the same as A[j] (comma operator!), and wouldn't work.
You might want to declare a two dimensional array for A or even better an appropriate std::array:
int A[100][1000];
std::array<std::array<int,1000>, 100> A; // <- prefer this for c++
Also allocating B right away inside the for loop doesn't look right:
static int *B = new int[n];
and
B[n] = A[i,n];
won't work either as you intend (see above!).
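Putting these fixes together, a rough sketch of the whole translation could look like the following; the insertionSort body is only a stand-in for the one you already wrote, and the container choices and output format are my own suggestion:

#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;

// Stand-in insertion sort; your own version would be used instead
void insertionSort(int arr[], int n)
{
    for (int i = 1; i < n; i++)
    {
        int key = arr[i], j = i - 1;
        while (j >= 0 && arr[j] > key)
        {
            arr[j + 1] = arr[j];
            j--;
        }
        arr[j + 1] = key;
    }
}

int main()
{
    const int runs = 5, maxN = 1000;
    // A[i][j]: 5 runs of 1000 random numbers each
    vector<vector<int> > A(runs, vector<int>(maxN));
    for (int i = 0; i < runs; i++)
        for (int j = 0; j < maxN; j++)
            A[i][j] = rand();

    for (int n = 100; n <= maxN; n += 100)
    {
        double total_ms = 0;
        for (int i = 0; i < runs; i++)
        {
            // B holds the first n random numbers of run i
            vector<int> B(A[i].begin(), A[i].begin() + n);
            clock_t t1 = clock();
            insertionSort(B.data(), n);
            clock_t t2 = clock();
            total_ms += 1000.0 * (t2 - t1) / CLOCKS_PER_SEC;
        }
        cout << "n = " << n << ", average insertion sort time: " << total_ms / runs << " ms\n";
    }
    return 0;
}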

Why is accumulate faster than a simple for cycle?

I was testing algorithms and ran into this weird behavior, where std::accumulate is faster than a simple for loop.
Looking at the generated assembly I'm not much wiser :-) It seems that the for loop is optimized into MMX instructions, while accumulate expands into a plain loop.
This is the code. The behavior manifests at the -O3 optimization level with gcc 4.7.1.
#include <vector>
#include <chrono>
#include <iostream>
#include <random>
#include <algorithm>
#include <numeric>  // for std::accumulate
using namespace std;

int main()
{
    const size_t vsize = 100*1000*1000;
    vector<int> x;
    x.reserve(vsize);

    mt19937 rng;
    rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
    uniform_int_distribution<uint32_t> dist(0,10);
    for (size_t i = 0; i < vsize; i++)
    {
        x.push_back(dist(rng));
    }

    // dry run: touch all the data once before timing
    long long tmp = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        tmp += x[i];
    }
    cout << "dry run " << tmp << endl;

    auto start = chrono::high_resolution_clock::now();
    long long suma = accumulate(x.begin(), x.end(), 0);
    auto end = chrono::high_resolution_clock::now();
    cout << "Accumulate runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma << endl;

    start = chrono::high_resolution_clock::now();
    suma = 0;
    for (size_t i = 0; i < vsize; i++)
    {
        suma += x[i];
    }
    end = chrono::high_resolution_clock::now();
    cout << "Manual sum runtime " << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << " - " << suma << endl;
    return 0;
}
When you pass the 0 to accumulate, you are making it accumulate using an int instead of a long long.
If you code your manual loop like this, it will be equivalent:
int sumb = 0;
for (size_t i = 0; i < vsize; i++)
{
sumb += x[i];
}
suma = sumb;
or you can call accumulate like this:
long long suma = accumulate(x.begin(),x.end(),0LL);
I got somewhat different results using Visual Studio 2012.
// original code
Accumulate runtime 93600 ms
Manual sum runtime 140400 ms
Note that the original std::accumulate code isn't equivalent to the for loop because the third parameter to std::accumulate is an int 0 value. It performs the summation using an int and only at the end stores the result in a long long. Changing the third parameter to 0LL forces the algorithm to use a long long accumulator and results in the following times.
// change std::accumulate initial value -> 0LL
Accumulate runtime 265200 ms
Manual sum runtime 140400 ms
Since the final result fits in an int I changed suma and std::accumulate back to using only int values. After this change the MSVC 2012 compiler was able to auto-vectorize the for loop and resulted in the following times.
// change suma from long long to int
Accumulate runtime 93600 ms
Manual sum runtime 46800 ms
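For clarity, a minimal sketch of the int-only variant described above (not the exact benchmark harness) is:

#include <numeric>
#include <vector>
#include <cstddef>

// int accumulator on both sides, so the compiler can auto-vectorize the manual loop
int manual_sum(const std::vector<int>& x)
{
    int suma = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        suma += x[i];
    return suma;
}

// the equivalent std::accumulate call, with an int (0) initial value
int accumulate_sum(const std::vector<int>& x)
{
    return std::accumulate(x.begin(), x.end(), 0);
}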
After fixing the accumulate issue others noted, I tested with both Visual Studio 2008 & 2010, and accumulate was indeed faster than the manual loop.
Looking at the disassembly I saw some additional iterator checking being done in the manual loop so I switched to just a raw array to eliminate it.
Here's what I ended up testing with:
#include <Windows.h>
#include <iostream>
#include <numeric>
#include <stdlib.h>

int main()
{
    const size_t vsize = 100*1000*1000;
    int* x = new int[vsize];
    for (size_t i = 0; i < vsize; i++) x[i] = rand() % 1000;

    LARGE_INTEGER start, stop;
    long long suma = 0, sumb = 0, timea = 0, timeb = 0;

    QueryPerformanceCounter( &start );
    suma = std::accumulate(x, x + vsize, 0LL);
    QueryPerformanceCounter( &stop );
    timea = stop.QuadPart - start.QuadPart;

    QueryPerformanceCounter( &start );
    for (size_t i = 0; i < vsize; ++i) sumb += x[i];
    QueryPerformanceCounter( &stop );
    timeb = stop.QuadPart - start.QuadPart;

    std::cout << "Accumulate: " << timea << " - " << suma << std::endl;
    std::cout << "      Loop: " << timeb << " - " << sumb << std::endl;

    delete [] x;
    return 0;
}
Accumulate: 633942 - 49678806711
Loop: 292642 - 49678806711
Using this code, the manual loop easily beats accumulate. The big difference is that the compiler unrolled the manual loop 4 times; otherwise the generated code is almost identical.