Execution time of a function in C++


I want to write several functions that declare the same array in different ways (statically, on the stack, and on the heap) and display the execution time of each function. Finally, I want to call those functions several times.
I think I've managed to do everything, but the execution time I get for every function is always 0, and I don't know whether that is normal. Could somebody confirm it for me? Thanks.
Here's my code:
#include "stdafx.h"
#include <iostream>
#include <time.h>
#include <stdio.h>
#include <chrono>
#define size 100000
using namespace std;
void prem(){
auto start = std::chrono::high_resolution_clock::now();
static int array[size];
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed timefor static: " << elapsed.count() << " s\n";
}
void first(){
auto start = std::chrono::high_resolution_clock::now();
int array[size];
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time on the stack: " << elapsed.count() << " s\n";
}
void secon(){
auto start = std::chrono::high_resolution_clock::now();
int *array = new int[size];
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time dynamic: " << elapsed.count() << " s\n";
delete[] array;
}
int main()
{
for (int i = 0; i <= 1000; i++){
prem();
first();
secon();
}
return 0;
}

prem() - the static array is allocated outside of the function, before it is ever called, so the declaration costs nothing at run time.
first() - the stack array is reserved when the function is entered, before execution reaches the declaration.
You are calling all three functions inside a single loop. Why? Didn't you mean to loop 1000 times over each one separately, so that they (hopefully) don't affect each other? In practice they can still affect each other, though.
My suggestions:
Loop over each function separately.
Time the entire 1000 iterations at once: call now() before you enter the loop and after you exit it, then take the difference and divide it by the number of iterations (1000).
Dynamic allocation can be (trivially) reduced to just grabbing a block of the vast available address space (I assume you are running on a 64-bit platform); unless you actually use that memory, the OS doesn't even need to make sure it is in RAM. That would certainly skew your results significantly.
Write a "driver" function that takes a pointer to the function under test.
Possible implementation of that driver() function:
void driver( void(*_f)(), int _iter, std::string _name){
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < _iter; ++i){
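_f(); // call through the function pointer; *_f() parses as *(_f()) and does not compile for a void function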
}
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time " << _name << ": " << elapsed.count() / _iter << " s" << std::endl;
}
That way your main() looks like this:
int main(){
const int iterations = 1000;
driver(prem, iterations, "static allocation");
driver(first, iterations, "stack allocation");
driver(secon, iterations, "dynamic allocation");
}
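The point above about untouched heap memory can be illustrated with a small sketch. It is only an illustration under stated assumptions: touch_pages, allocate_and_touch and kAssumedPageBytes are made-up names, and the 4 KiB page size is an assumption that is common but not universal. The idea is to write roughly one int per page after the allocation, so that the OS has to back the block with real memory before the timer stops.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Assumption: 4 KiB pages. Hypothetical constant, not from the original posts.
constexpr std::size_t kAssumedPageBytes = 4096;

// Write one int per assumed page of the buffer and return the number of writes.
std::size_t touch_pages(int* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    const std::size_t step = kAssumedPageBytes / sizeof(int);
    volatile int* p = data;  // writes through a volatile pointer are not optimized away
    std::size_t touched = 0;
    for (std::size_t i = 0; i < count; i += step) {
        p[i] = 0;
        ++touched;
    }
    return touched;
}

// Allocate `count` ints on the heap, touch them, and release them again.
std::size_t allocate_and_touch(std::size_t count) {
    std::unique_ptr<int[]> buffer(new int[count]);
    return touch_pages(buffer.get(), count);
}
```

Passing allocate_and_touch (wrapped in a function with no parameters) to a timing driver would then measure the allocation plus the cost of first use, which is usually closer to what a real program pays.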

Avoid synthetic tests like this, because the compiler will optimize out everything that is not used.
As another answer suggests, you need to measure the time for the entire 1000 iterations. Even then, I do not think you will get meaningful results.
Let's run 1000000 iterations instead of 1000, and let's add another case where we just make two consecutive calls to chrono::high_resolution_clock::now() as a baseline:
#include <iostream>
#include <time.h>
#include <stdio.h>
#include <chrono>
#include <string>
#include <functional>
#define size 100000
using namespace std;
void prem() {
static int array[size];
}
void first() {
int array[size];
}
void second() {
int *array = new int[size];
delete[] array;
}
void PrintTime(std::chrono::duration<double> elapsed, int count, std::string msg)
{
std::cout << msg << elapsed.count() / count << " s\n";
}
int main()
{
int iterations = 1000000;
{
auto start = std::chrono::high_resolution_clock::now();
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed time for nothing: ");
}
{
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; i++) // '<' so the loop runs exactly `iterations` times
{
prem();
}
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed timefor static: ");
}
{
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; i++) // '<' so the loop runs exactly `iterations` times
{
first();
}
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed time on the stack: ");
}
{
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; i++) // '<' so the loop runs exactly `iterations` times
{
second();
}
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed time dynamic: ");
}
return 0;
}
With all optimisations on, I get this result:
Elapsed time for nothing: 3.11e-13 s
Elapsed timefor static: 3.11e-13 s
Elapsed time on the stack: 3.11e-13 s
Elapsed time dynamic: 1.88703e-07 s
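That basically means that the compiler optimized out prem() and first(): not just the calls but the entire loops, because they have no side effects. Note that every figure is divided by the iteration count, so the "nothing" baseline and the two empty loops each took about 0.3 microseconds in total.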
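One way to keep such loops from disappearing is to make the work observable, for example by storing to the array through a volatile pointer. The following is a minimal sketch of that idea, not a definitive benchmark: fill_stack, fill_heap, seconds_per_call and kSize are hypothetical names, and the array is smaller than in the question to keep stack use modest.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>

constexpr std::size_t kSize = 10000;  // assumption: smaller than the question's 100000

// Fill a stack array through a volatile pointer, so every store must be performed.
void fill_stack() {
    int a[kSize];
    volatile int* p = a;
    for (std::size_t i = 0; i < kSize; ++i) p[i] = static_cast<int>(i);
}

// Same work on a heap array; the allocation and release are part of the timed call.
void fill_heap() {
    std::unique_ptr<int[]> a(new int[kSize]);
    volatile int* p = a.get();
    for (std::size_t i = 0; i < kSize; ++i) p[i] = static_cast<int>(i);
}

// Average wall-clock seconds per call of f over `iterations` calls.
template <typename F>
double seconds_per_call(F f, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) f();
    const auto finish = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(finish - start).count() / iterations;
}
```

Comparing seconds_per_call(fill_stack, 1000) with seconds_per_call(fill_heap, 1000) then measures allocation plus a full write of the array. The volatile stores also block vectorization, so the absolute numbers say little about how fast an ordinary fill would be.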

Related

How to tell the compiler to optimize array access?

I have the following fragment of code. It contains three sections in which I measure the run time of memory accesses. The first is a plain iteration over the array. The second is almost the same, except that the array address is obtained from a function call. The third is the same as the second, but manually optimized.
#include <map>
#include <cstdlib>
#include <chrono>
#include <iostream>
std::map<void*, void*> cache;
constexpr int elems = 1000000;
double x[elems] = {};
template <typename T>
T& find_in_cache(T& var) {
void* key = &var;
void* value = nullptr;
if (cache.count(key)) {
value = cache[key];
} else {
value = malloc(sizeof(T));
cache[key] = value;
}
return *(T*)value;
}
int main() {
std::chrono::duration<double> elapsed_seconds1, elapsed_seconds2, elapsed_seconds3;
for (int k = 0; k < 100; k++) { // account for cache effects
// first section
auto start = std::chrono::steady_clock::now();
for (int i = 1; i < elems; i++) {
x[i] = (x[i-1] + 1.0) * 1.001;
}
auto end = std::chrono::steady_clock::now();
elapsed_seconds1 = end-start;
// second section
start = std::chrono::steady_clock::now();
for (int i = 1; i < elems; i++) {
find_in_cache(x)[i] = (find_in_cache(x)[i-1] + 1.0) * 1.001;
}
end = std::chrono::steady_clock::now();
elapsed_seconds2 = end-start;
// third section
start = std::chrono::steady_clock::now();
double* y = find_in_cache(x);
for (int i = 1; i < elems; i++) {
y[i] = (y[i-1] + 1.0) * 1.001;
}
end = std::chrono::steady_clock::now();
elapsed_seconds3 = end-start;
}
std::cout << "elapsed time 1: " << elapsed_seconds1.count() << "s\n";
std::cout << "elapsed time 2: " << elapsed_seconds2.count() << "s\n";
std::cout << "elapsed time 3: " << elapsed_seconds3.count() << "s\n";
return x[elems - 1]; // prevent optimizing away
}
The timings of these sections are as follows:
elapsed time 1: 0.0018678s
elapsed time 2: 0.00423903s
elapsed time 3: 0.00189678s
Is it possible to change the interface of find_in_cache() without changing the body of the second iteration section to make its performance the same as section 3?

template <typename T>
[[gnu::const]]
T& find_in_cache(T& var) { ... }
lets clang optimize the code the way you want, but gcc fails to treat the call as loop-invariant, even with gnu::noinline added to make sure the attribute is not lost (maybe worth a bug report?).
How safe this is depends on the rest of your code. The attribute is a lie, since the function reads and writes memory, but that may be fine if the memory is private enough to the function. Preventing inlining of find_in_cache may help reduce the risk.
You can also convince gcc to optimize with
template <typename T>
[[gnu::const,gnu::noinline]]
T& find_in_cache(T& var) noexcept { ... }
The catch is that noexcept makes your program terminate if an exception escapes the function, for example when there isn't enough memory to add an element to the cache.

Performance of C++ containers during thread execution

I am working on a project that uses multiple threads to parallelize tasks. During development, I noticed that operations on a std::list container, e.g. pop_front() or push_back(), are significantly slower when executed in a thread than in single-threaded execution. See the code snippet below:
#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <thread>
#include <list>
using namespace std;
void SingleThreadedList()
{
auto t1 = std::chrono::high_resolution_clock::now();
for(int i=0; i<2; i++)
{
list<char> l;
for(int j=0; j<10000; j++)
{
l.push_back('c');
l.pop_front();
}
}
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
std::cout << "duration single thread: " << duration << endl;
}
void MultiThreadedList()
{
auto t1 = std::chrono::high_resolution_clock::now();
auto lambda_fkt = []() {
list<char> l;
for(int i=0; i<10000; i++)
{
l.push_back('c');
l.pop_front();
}
};
vector<thread*> thread_array;
for (int i=0; i<2; ++i)
{
thread *th = new thread(lambda_fkt);
thread_array.push_back(th);
}
for(auto t : thread_array)
{
t->join();
}
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
std::cout << "duration multi thread: " << duration << endl;
for(auto t : thread_array)
{
delete t;
}
}
int main() {
SingleThreadedList();
MultiThreadedList();
}
The code produces the following output:
duration single thread: 4589
duration multi thread: 245483
The single-threaded variant takes about 4 ms, but as soon as threads are created, execution takes more than 200 ms! I cannot imagine that standard library containers show such a performance difference depending on the execution context. Can someone explain what happens in this code and how I can avoid the slowdown inside threads? Thank you!
PS: When I remove the list operations from this sample and add e.g. some simple math instead, the code quickly shows the expected behavior: the multi-threaded variant gets faster when the computation is split among multiple threads, using multiple cores to get the result.

Does the Armadillo library slow down the execution of matrix operations?

I've converted some MATLAB code to C++ to speed it up, using the Armadillo library to handle matrix operations, but surprisingly it is 10 times slower than the MATLAB code!
So I tested the Armadillo library to see if it's the cause. The code below is a simple test that initializes two matrices, adds them together and saves the result to a new matrix. One section of the code uses the Armadillo library and the other one doesn't. The section using Armadillo is too slow (notice the elapsed times).
Does it really slow down the execution (though it is supposed to speed it up), or am I missing something?
#include<iostream>
#include<math.h>
#include<chrono>
#include<armadillo>
using namespace std;
using namespace arma;
int main()
{
auto start = std::chrono::high_resolution_clock::now();
double a[100][100];
double b[100][100];
double c[100][100];
for (int i = 0; i < 100; i++)
{
for (int j = 0; j < 100; j++)
{
a[i][j] = 1;
b[i][j] = 1;
c[i][j] = a[i][j] + b[i][j];
}
}
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time: " << elapsed.count() << " s\n";
auto start1 = std::chrono::high_resolution_clock::now();
mat a1=ones(100,100);
mat b1=ones(100,100);
mat c1(100,100);
c1 = a1 + b1;
auto finish1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed1 = finish1 - start1;
std::cout << "Elapsed time: " << elapsed1.count() << " s\n";
return 0;
}
Here is the output I get:
Elapsed time: 5.1729e-05 s
Elapsed time: 0.00025536 s
As you see, Armadillo is significantly slower! Is it better not to use the Armadillo library?

First of all, make sure that the BLAS and LAPACK libraries are enabled; the Armadillo documentation has instructions.
Second, the difference might come from more extensive memory allocation in Armadillo. If you restructure your code so that the memory initialisation comes first, as in
#include<iostream>
#include<math.h>
#include<chrono>
#include<armadillo>
using namespace std;
using namespace arma;
int main()
{
double a[100][100];
double b[100][100];
double c[100][100];
mat a1=ones(100,100);
mat b1=ones(100,100);
mat c1(100,100);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 100; i++)
{
for (int j = 0; j < 100; j++)
{
a[i][j] = 1;
b[i][j] = 1;
c[i][j] = a[i][j] + b[i][j];
}
}
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time: " << elapsed.count() << " s\n";
auto start1 = std::chrono::high_resolution_clock::now();
c1 = a1 + b1;
auto finish1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed1 = finish1 - start1;
std::cout << "Elapsed time: " << elapsed1.count() << " s\n";
return 0;
}
With this I got the result:
Elapsed time: 0.000647521 s
Elapsed time: 0.000353198 s
I compiled it with (in Ubuntu 17.10):
g++ prog.cpp -larmadillo

I believe the problem comes from you not using Armadillo at all. You only used it to create variables that are a bit more complex than plain C++ 2D arrays, but really nothing more. What Armadillo can do for you is give you very fast matrix operations, as in c1=a1+b1;, without loops.
But if you instead write it as an elementwise operation, you are just not using Armadillo. It's the same as using MATLAB for matrix multiplication but writing the matrix multiplication yourself. You are not using MATLAB's libraries then!

Timing a for loop with clock

Hi, I am trying to write a program that sums 20 consecutive numbers and measures the time it took to do so. The problem is that when I run the program the time is always 0. Any ideas?
This is what I have so far. Thanks!
#include <iostream>
#include <cstdlib> // for system()
#include <time.h>
using namespace std;
int main()
{
int finish = 20;
int start = 1;
int result = 0;
double msecs;
clock_t init, end;
init = clock();
for (int i = start; i <= finish; i++)
{
result += i;
}
end = clock();
cout << ((float)(end - init)) *1000 / (CLOCKS_PER_SEC);
system ("PAUSE");
return 0;
}

No matter what technique you use for timing, it has finite precision. This code simply executes so fast that your timer doesn't register any time as having passed.
Aside #1: Use high_resolution_clock. Maybe that will register something non-zero, but probably not.
Aside #2: Don't name your variable null, in C++ that implies 0 or a null pointer
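A common workaround, sketched below under stated assumptions, is to repeat the work many times so that the total exceeds the timer's resolution, and then divide by the repetition count. The names sum_range and average_ns are hypothetical. The inputs and the result go through volatile variables so that the compiler cannot precompute the sum or drop the loop.

```cpp
#include <cassert>
#include <chrono>
#include <ratio>

// Sum of the integers start..finish, as in the question.
int sum_range(int start, int finish) {
    int result = 0;
    for (int i = start; i <= finish; ++i) result += i;
    return result;
}

// Average time of one sum_range(1, 20) call in nanoseconds,
// measured over `repetitions` calls.
double average_ns(int repetitions) {
    assert(repetitions > 0);
    volatile int start = 1;    // volatile inputs: re-read on every repetition
    volatile int finish = 20;
    volatile int sink = 0;     // volatile output: the result must be stored
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) {
        sink = sum_range(start, finish);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / repetitions;
}
```

Calling average_ns(1000000) gives a per-call figure that is far below what a single clock() measurement can resolve. The figure includes the loop overhead and the volatile accesses, so it is an upper bound rather than an exact cost.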

You can try this, but it requires C++11 or later.
It can resolve times down to about 0.000001 seconds, depending on the platform's clock.
#include <iostream>
#include <cstdlib> // for system()
#include <chrono>
int main()
{
    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    int finish = 20;
    int start = 1;
    int result = 0; // was not declared in the original snippet
    for (int i = start; i <= finish; i++)
    {
        result += i;
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    std::cout << "sum: " << result << "\n"; // print the result so the sum is actually used
    std::cout << time_span.count() << " seconds" << std::endl;
    system("PAUSE"); // Windows-only, kept from the question
    return 0;
}

Timing the Thrash

I feel confident that my nested for loops will thrash memory, but I would like to know how long it takes. I'm assuming time.h can help, but I don't know which functions to use or how to display the result. Can someone help?
I have updated my code with the suggestions made and I believe it worked. I got a slow output of 4 (thrashTime). Is this in seconds? Also, perhaps my method could be refactored. I take a timestamp before and after the for loops.
// Updated
#include <iostream>
#include <time.h>
using namespace std;
int array[1 << 14][1 << 14];
int main() {
time_t beforeThrash = 0;
time_t afterThrash = 0;
time_t thrashTime;
int i, j;
beforeThrash = time(NULL);
for (i = 0; i<16384; i++)
for (j = 0; j<16384; j++)
array[i][j] = i*j;
afterThrash = time(NULL);
thrashTime = afterThrash - beforeThrash;
cout << thrashTime << endl;
system("pause");
return 0;
}

You can just follow the documentation for time and clock, as Joe Z mentioned.
A quick demo for printing current time:
#include <ctime>
time_t start = time(0);
const char* tstart = ctime(&start);
// std::cout << tstart; will give you local time Fri Dec 06 11:53:46 2013
For time difference:
#include <ctime>
clock_t t = clock();
do_something();
t = clock() - t;
// std::cout << (float)t / CLOCKS_PER_SEC; will give you elapsed time in seconds
You can simply replace do_something(); with the operations to be measured.
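If whole seconds from time() are too coarse, std::chrono (C++11) can report fractional seconds. The sketch below is one possible way to time the fill from the question, not the asker's actual code: timed_fill is a hypothetical name, and the grid is a flat std::vector supplied by the caller rather than the global 2D array.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Fill a rows x cols grid (stored row by row in `grid`) with i*j and
// return the elapsed wall-clock time in seconds, including fractions.
double timed_fill(std::vector<int>& grid, std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    const auto before = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            grid[i * cols + j] = static_cast<int>(i * j);
    const auto after = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(after - before).count();
}
```

With a std::vector<int> of 16384 * 16384 elements this corresponds to the loop in the question. Swapping the two loops, so that the inner loop walks down a column, gives the cache-unfriendly variant to compare against.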
