Threads slowing each other down - C++

I have some expensive computation I want to divide and distribute over a set of threads.
I dumbed down my code to a minimal example where this is still happening.
In short:
I have N tasks that I want to divide into "Threads" threads.
Each task is the following simple function that runs a bunch of simple mathematical operations.
(In practice I verify asymmetric signatures here, but I excluded that for the sake of simplification)
while (i++ < 100000)
{
    for (int y = 0; y < 1000; y++)
    {
        sqrt(y);
    }
}
Running the above code with 1 thread results in 0.36 seconds per operation (outermost for loop), and thus in around 36 seconds overall execution time.
Thus, parallelization seemed like an obvious way to speed it up. However, with two threads the operation time rises to 0.72 seconds, completely destroying any speed-up.
Adding more threads usually results in increasingly worse performance.
I have an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz with 6 physical cores.
So I'd expect a performance boost at least when going from 1 to 2 threads. But in fact each operation slows down as the number of threads increases.
Am I doing something wrong?
Full code:
#include <atomic>
#include <cmath>
#include <ctime>
#include <iostream>
#include <pthread.h>
#include <thread>
#include <vector>

using namespace std;
using namespace std::chrono_literals;

const size_t N = 100;
const size_t Threads = 1;
atomic_int counter(0);

struct ThreadData
{
    int index;
    int count;
    ThreadData(const int index, const int count): index(index), count(count){};
};

void *executeSlave(void *threadarg)
{
    struct ThreadData *my_data;
    my_data = static_cast<ThreadData *>(threadarg);
    for( int x = my_data->index; x < my_data->index + my_data->count; x++ )
    {
        cout << "Thread: " << my_data->index << ": " << x << endl;
        clock_t start, end;
        start = clock();
        int i = 0;
        while (i++ < 100000)
        {
            for (int y = 0; y < 1000; y++)
            {
                sqrt(y);
            }
        }
        counter.fetch_add(1);
        end = clock();
        cout << end - start << ':' << CLOCKS_PER_SEC << ':'
             << (((float) end - start) / CLOCKS_PER_SEC) << endl;
    }
    pthread_exit(NULL);
}

int main()
{
    clock_t start, end;
    start = clock();
    pthread_t threads[Threads];
    vector<ThreadData> td;
    td.reserve(Threads);
    int each = N / Threads;
    cout << each << endl;
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        td[x] = ThreadData(x * each, each);
        int rc = pthread_create(&threads[x], NULL, executeSlave, (void *) &td[x]);
        if (rc) {
            cout << "Error:unable to create thread," << rc << endl;
            exit(-1);
        }
    }
    while (counter < N) {
        std::this_thread::sleep_for(10ms);
    }
    end = clock();
    cout << "Final:" << endl;
    cout << end - start << ':' << CLOCKS_PER_SEC << ':'
         << (((float) end - start) / CLOCKS_PER_SEC) << endl;
}

clock() returns approximate CPU time for the entire process.
The outermost loop does a fixed amount of work per iteration:
int i = 0;
while (i++ < 100000)
{
    for (int y = 0; y < 1000; y++)
    {
        sqrt(y);
    }
}
Therefore, process CPU time reported around this loop will be proportional to the number of running threads (it still takes the same amount of time per thread, times N threads).
Use std::chrono::steady_clock to measure wall clock time instead. Note also that I/O such as std::cout takes a lot of wall clock time and is unstable. So the measured total elapsed time will be skewed due to the I/O inside.
Some additional remarks:
The return value of sqrt() is never used; the compiler may eliminate the call entirely. It would be prudent to use the value in some way to be sure.
void* executeSlave() isn't returning a void* pointer value (UB). It should probably be declared simply void if it returns nothing.
td.reserve(Threads) reserves memory but does not allocate objects. td[x] then accesses nonexistent objects (UB). Use td.emplace_back(x * each, each) instead of td[x] = ....
Not technically an issue, but it is recommended to use the standard C++ std::thread instead of pthread, for better portability.
With the following I'm seeing correct speedup proportional to the # of threads:
#include <string>
#include <iostream>
#include <vector>
#include <atomic>
#include <cmath>
#include <thread>
#include <chrono>

using namespace std;
using namespace std::chrono_literals;

const size_t N = 12;
const size_t Threads = 2;
std::atomic<int> counter(0);
std::atomic<int> xx{ 0 };

void executeSlave(int index, int count, int n)
{
    double sum = 0;
    for (int x = index; x < index + count; x++)
    {
        cout << "Thread: " << index << ": " << x << endl;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100000; i++)
        {
            for (int y = 0; y < n; y++)
            {
                sum += sqrt(y);
            }
        }
        counter++;
        auto end = std::chrono::steady_clock::now();
        cout << 1e-6 * (end - start) / 1us << " s" << endl;
    }
    xx += (int)sum; // prevent optimization
}

int main()
{
    std::thread threads[Threads];
    int each = N / Threads;
    cout << each << endl;
    auto start = std::chrono::steady_clock::now();
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        threads[x] = std::thread(executeSlave, x * each, each, 100);
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::steady_clock::now();
    cout << "Final:" << endl;
    cout << 1e-6 * (end - start) / 1us << " s" << endl;
}

Related

The function called by std::async is not executed immediately?

#include <iostream>
#include <future>
#include <thread>
#include <ctime>

auto gClock = clock();

char threadPool(char c) {
    std::cout << "enter thread :" << c << " cost time:" << clock() - gClock << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(2));
    for (int i = 0; i < 10; i++)
        std::cout << c;
    std::cout << std::endl;
    return c;
}

void fnTestAsync() {
    auto begin = clock();
    std::future<char> futures[10];
    for (int i = 0; i < 10; ++i) {
        futures[i] = std::async(std::launch::async, threadPool, 'a' + i);
    }
    for (int i = 0; i < 10; ++i) {
        std::cout << futures[i].get() << " back ,cost time: " << clock() - begin << std::endl;
    }
    std::cout << "fnTestAsync: " << clock() - begin << std::endl;
}

int main() {
    std::thread testAsync(fnTestAsync);
    testAsync.detach();
    std::this_thread::sleep_for(std::chrono::seconds(10));
    return 0;
}
I'm trying to get these 10 threads to execute together and all return immediately after a two-second delay, but when I print the time spent I find that it takes about 2900 ms, much larger than the 2000 ms I expected.
What is the cause of this increase?
How should I fix it?

C++ thread request

I'm new to C++. In my application, there is a method getOnlineStatus():
int getOnlineStatus(int num);
This method comes from a third-party DLL and can't be modified.
I call this method to check number status, like this:
int num = 123456;
for (int i = 0; i < 10000000; i++) {
    num = num + 1;
    int nRet = getOnlineStatus(num);
    if (nRet > 0) {
        cout << num << "status online" << endl;
    }
    else if (nRet == 0) {
        cout << num << "status offline" << endl;
    }
    else {
        cout << num << "check fail" << endl;
    }
}
But every time, it takes 2 seconds to return nRet. So if I check lots of numbers, it will take a long time.
I also tried to use async, but it doesn't help; it still takes 2 seconds to return each result, one by one.
int num = 123456;
for (int i = 0; i < 10000000; i++) {
    num = num + 1;
    future<int> fuRes = std::async(std::launch::async, getOnlineStatus, num);
    int result = fuRes.get();
    if (result > 0) {
        cout << num << "status online" << endl;
    }
    else if (result == 0) {
        cout << num << "status offline" << endl;
    }
    else {
        cout << num << "check fail" << endl;
    }
}
Is there any way to open multiple threads to make it show results faster?
This largely depends on your third-party DLL: does it even support requests from multiple threads? And if it does, do those requests use shared resources, like the same internet connection or socket?
If you simplify your question and assume that getOnlineStatus() just sleeps, then yes, you can benefit greatly from issuing multiple requests on different threads and waiting in parallel.
Here is how you can set up a reasonable number of threads to share the workload:
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <cstdlib>

int status[10'000]{};

int getOnlineStatus(int n) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    return rand();
}

void getStatus(int low, int high) {
    for (int i = low; i < high; i++) {
        status[i] = getOnlineStatus(i);
    }
}

int main()
{
    srand(0);
    const int count = std::thread::hardware_concurrency();
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0, low = 0, high = 10; i < count; ++i, low += 10, high += 10)
        threads.emplace_back(std::thread(getStatus, low, high));
    for (auto& thread : threads)
        thread.join();
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << count << " threads: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() << " ms" << std::endl;
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10 * count; ++i)
        status[i] = getOnlineStatus(i);
    stop = std::chrono::high_resolution_clock::now();
    std::cout << "single thread: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() << " ms" << std::endl;
}
I get this result:
12 threads: 10075 ms
single thread: 120720 ms
NOTE: if those worker threads really do nothing but wait, you can run many more of them, reducing the total time significantly.

pthread execution time worse than sequential

I was learning to use pthreads in the hope that it would make some of the slowest pieces of my code a bit faster. As a warm-up example, I tried to write a Monte Carlo integrator using threads. I wrote a code that compares three approaches:
Single-thread pthread evaluation of the integral with NEVALS integrand evaluations.
Multiple-thread evaluation of the integral NTHREADS times, each with NEVALS integrand evaluations.
Multiple threads committed to different cores of my CPU, again totalling NEVALS*NTHREADS integrand evaluations.
Upon running, the fastest per integrand evaluation is the single-core version, between 2 and 3 times faster than the others. The other two seem roughly equivalent, except that their CPU usage is very different: the second spreads the threads across all the (8) cores of my CPU, while the third (unsurprisingly) concentrates the job in NTHREADS cores and leaves the rest unoccupied.
Here is the source:
#include <iostream>
#define __USE_GNU
#include <sched.h>
#include <pthread.h>
#include <thread>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <unistd.h>

using namespace std;

double aleatorio(double a, double b){
    double r = double(rand())/RAND_MAX;
    return a + r * (b - a);
}

double funct(double* a){
    return pow(a[0],6);
}

void EstimateBounds(int ndim, double (*f)(double*), double* bounds){
    double x[ndim];
    for(int i=1;i<=1000;i++){
        for(int j=0;j<ndim;j++) x[j] = aleatorio(0,1);
        if ( f(x) > bounds[1]) bounds[1] = f(x);
        if ( f(x) < bounds[0]) bounds[0] = f(x);
    }
}
void Integrate(double (*f)(double*), int ndim, double* integral, int verbose, int seed){
    int nbatch = 5000000;
    const int maxeval = 25*nbatch;
    double x[ndim];
    srand(seed);
    /// Algorithm to estimate the maxima and minima ///
    for(int j=0;j<ndim;j++) x[j] = 0.5;
    double bounds[2] = {f(x),f(x)};
    EstimateBounds(ndim,f,bounds);
    /// Integral initialization ///
    int niter = int(maxeval/nbatch);
    for(int k=1;k<=niter;k++)
    {
        double loc_min = bounds[0];
        double loc_max = bounds[1];
        int count = 0;
        for (int i=1; i<=nbatch; i++)
        {
            for(int j=0;j<ndim;j++) x[j] = aleatorio(0,1);
            double y = aleatorio(bounds[0],bounds[1]);
            if ( f(x) > loc_max ) loc_max = f(x);
            if ( f(x) < loc_min ) loc_min = f(x);
            if ( f(x) > y && y > 0 ) count++;
            if ( f(x) < y && y < 0 ) count--;
        }
        double delta = (bounds[1]-bounds[0])*double(count)/nbatch;
        integral[0] += delta;
        integral[1] += pow(delta,2);
        bounds[0] = loc_min;
        bounds[1] = loc_max;
        if(verbose>0){
            cout << "Iteration["<<k<<"]: " << k*nbatch;
            cout << " integrand evaluations so far" <<endl;
            if(verbose>1){
                cout << "The bounds for this iteration were = ["<<bounds[0]<<","<<bounds[1]<<"]"<<endl;
            }
            cout << "Integral = ";
            cout << integral[0]/k << " +- ";
            cout << sqrt((integral[1]/k - pow(integral[0]/k,2)))/(k) << endl;
            cout << endl;
        }
    }
    integral[0] /= niter;
    integral[1] = sqrt((integral[1]/niter - pow(integral[0],2)))/niter;
}
struct IntegratorArguments{
    double (*Integrand)(double*);
    int NumberOfVariables;
    double* Integral;
    int VerboseLevel;
    int Seed;
};

void LayeredIntegrate(IntegratorArguments IA){
    Integrate(IA.Integrand,IA.NumberOfVariables,IA.Integral,IA.VerboseLevel,IA.Seed);
}

void ThreadIntegrate(void * IntArgs){
    IntegratorArguments *IA = (IntegratorArguments*)IntArgs;
    LayeredIntegrate(*IA);
    pthread_exit(NULL);
}
#define NTHREADS 5

int main(void)
{
    cout.precision(16);
    bool execute_single_core = true;
    bool execute_multi_core = true;
    bool execute_multi_core_2 = true;

    ///////////////////////////////////////////////////////////////////////////
    ///
    /// Single Thread Execution
    ///
    ///////////////////////////////////////////////////////////////////////////
    if(execute_single_core){
        pthread_t thr0;
        double integral_value0[2] = {0,0};
        IntegratorArguments IntArg0;
        IntArg0.Integrand = funct;
        IntArg0.NumberOfVariables = 2;
        IntArg0.VerboseLevel = 0;
        IntArg0.Seed = 1;
        IntArg0.Integral = integral_value0;
        int t = time(NULL);
        cout << "Now Attempting to create thread "<<0<<endl;
        int rc0 = 0;
        rc0 = pthread_create(&thr0, NULL, ThreadIntegrate,&IntArg0);
        if (rc0) {
            cout << "Error:unable to create thread," << rc0 << endl;
            exit(-1);
        }
        else cout << "Thread "<<0<<" has been successfully created" << endl;
        pthread_join(thr0,NULL);
        cout << "Thread 0 has finished, it took " << time(NULL)-t <<" secs to finish" << endl;
        cout << "Integral Value = "<< integral_value0[0] << "+/-" << integral_value0[1] <<endl;
    }
    ////////////////////////////////////////////////////////////////////////////////
    ///
    /// Multiple Threads Creation
    ///
    ///////////////////////////////////////////////////////////////////////////////
    if(execute_multi_core){
        pthread_t threads[NTHREADS];
        double integral_value[NTHREADS][2];
        IntegratorArguments IntArgs[NTHREADS];
        int rc[NTHREADS];
        for(int i=0;i<NTHREADS;i++){
            integral_value[i][0]=0;
            integral_value[i][1]=0;
            IntArgs[i].Integrand = funct;
            IntArgs[i].NumberOfVariables = 2;
            IntArgs[i].VerboseLevel = 0;
            IntArgs[i].Seed = i;
            IntArgs[i].Integral = integral_value[i];
        }
        int t = time(NULL);
        for(int i=0;i<NTHREADS;i++){
            cout << "Now Attempting to create thread "<<i<<endl;
            rc[i] = pthread_create(&threads[i], NULL, ThreadIntegrate,&IntArgs[i]);
            if (rc[i]) {
                cout << "Error:unable to create thread," << rc[i] << endl;
                exit(-1);
            }
            else cout << "Thread "<<i<<" has been successfully created" << endl;
        }
        /// Thread Waiting Phase ///
        for(int i=0;i<NTHREADS;i++) pthread_join(threads[i],NULL);
        cout << "All threads have now finished" <<endl;
        cout << "This took " << time(NULL)-t << " secs to finish" <<endl;
        cout << "Or " << (time(NULL)-t)/NTHREADS << " secs per core" <<endl;
        for(int i = 0; i < NTHREADS; i++ ) {
            cout << "Thread " << i << " has as the value for the integral" << endl;
            cout << "Integral = ";
            cout << integral_value[i][0] << " +- ";
            cout << integral_value[i][1] << endl;
        }
    }
    ////////////////////////////////////////////////////////////////////////
    ///
    /// Multiple Cores Execution
    ///
    ///////////////////////////////////////////////////////////////////////
    if(execute_multi_core_2){
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        pthread_t threads[NTHREADS];
        double integral_value[NTHREADS][2];
        IntegratorArguments IntArgs[NTHREADS];
        int rc[NTHREADS];
        for(int i=0;i<NTHREADS;i++){
            integral_value[i][0]=0;
            integral_value[i][1]=0;
            IntArgs[i].Integrand = funct;
            IntArgs[i].NumberOfVariables = 2;
            IntArgs[i].VerboseLevel = 0;
            IntArgs[i].Seed = i;
            IntArgs[i].Integral = integral_value[i];
        }
        int t = time(NULL);
        for(int i=0;i<NTHREADS;i++){
            cout << "Now Attempting to create thread "<<i<<endl;
            rc[i] = pthread_create(&threads[i], NULL, ThreadIntegrate,&IntArgs[i]);
            if (rc[i]) {
                cout << "Error:unable to create thread," << rc[i] << endl;
                exit(-1);
            }
            else cout << "Thread "<<i<<" has been successfully created" << endl;
            CPU_SET(i, &cpuset);
        }
        cout << "Now attempting to commit different threads to different cores" << endl;
        for(int i=0;i<NTHREADS;i++){
            const int set_result = pthread_setaffinity_np(threads[i], sizeof(cpu_set_t), &cpuset);
            if(set_result) cout << "Error: Thread "<<i<<" could not be committed to a new core"<<endl;
            else cout << "Thread reassignment successful" << endl;
        }
        /// Thread Waiting Phase ///
        for(int i=0;i<NTHREADS;i++) pthread_join(threads[i],NULL);
        cout << "All threads have now finished" <<endl;
        cout << "This took " << time(NULL)-t << " secs to finish" <<endl;
        cout << "Or " << (time(NULL)-t)/NTHREADS << " secs per core" <<endl;
        for(int i = 0; i < NTHREADS; i++ ) {
            cout << "Thread " << i << " has as the value for the integral" << endl;
            cout << "Integral = ";
            cout << integral_value[i][0] << " +- ";
            cout << integral_value[i][1] << endl;
        }
    }
    pthread_exit(NULL);
}
I compile with
g++ -std=c++11 -w -fpermissive -O3 SOURCE.cpp -lpthread
It seems to me that my threads are actually being executed sequentially, because the time seems to grow with NTHREADS; it actually takes roughly NTHREADS times longer than a single thread.
Does anyone have an idea of where the bottleneck is?
You are using rand(), which is a global random number generator. First of all it is not thread-safe, so using it in multiple threads, potentially in parallel, causes undefined behavior.
Even if we set that aside, rand() is using one global instance, shared by all threads. If one thread wants to call it, the processor core needs to check whether the other cores modified its state and needs to refetch that state from the main memory or other caches each time it is used. This is why you observe the drop in performance.
Use the <random> facilities for pseudo-random number generators instead. They offer much better quality random number generators, random number distributions, and the ability to create multiple independent random number generator instances. Make these thread_local, so the threads do not interfere with one another:
double aleatorio(double a, double b){
    thread_local std::mt19937 rng{/*seed*/};
    return std::uniform_real_distribution<double>{a, b}(rng);
}
Please note, though, that this is not using proper seeding for std::mt19937 (see this question for details), and that uniform_real_distribution<double>{a, b} returns a uniformly distributed number between a inclusive and b exclusive. Your original code gave a number between a and b inclusive (potential rounding errors aside). I assume that neither is particularly relevant to you.
Also note my unrelated comments under your question for other things you should improve.

Using OpenMP in Visual Studio 2012 -> the code gets even slower than without

I'm trying to use OpenMP in my Visual Studio 2012 project to accelerate some for-loops, but I'm observing some strange behaviour: OpenMP does not improve performance, and even leads to slower runtimes...
I've created the following simple demo with OpenMP:
#include "stdafx.h"
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <time.h>
#include <omp.h>

int _tmain(int argc, _TCHAR* argv[])
{
    // get the number of processors in this system
    int iCPU = omp_get_num_procs();
    std::cout << "Number of processors: " << iCPU << std::endl;

    int n_threads = omp_get_num_threads();
    std::cout << "Number of threads (before): " << n_threads << std::endl;

    //#pragma omp parallel
    // set the number of threads
    omp_set_num_threads(/*iCPU*/5);
    #pragma omp parallel
    n_threads = omp_get_num_threads();
    std::cout << "Number of threads (after): " << n_threads << std::endl;

    const int length = 10000000;
    unsigned short *pSrc = new unsigned short[length];
    unsigned short *pDst = new unsigned short[length];
    memset(pSrc, '\0', length*sizeof(unsigned short));
    for (int i = 0; i < length; ++i)
        *(pSrc + i) = (unsigned short)(((double)rand() / (double)RAND_MAX)*255);

    int n = 20;
    double time_acc = 0.0;
    for (int ind = 0; ind < n; ++ind)
    {
        memset(pDst, '\0', length*sizeof(unsigned short));
        double stime = omp_get_wtime();
        int i;
        #pragma omp parallel for
        for (i = 0; i < length; ++i)
        {
            //std::cout << "thread: " << omp_get_thread_num() << std::endl;
            *(pDst + i) = *(pSrc + i) + *(pSrc + i);
        }
        double etime = omp_get_wtime();
        double elapsed = etime - stime;
        std::cout << "elapsed = " << elapsed*1e3 << " [ms]" << std::endl;
        time_acc += elapsed*1e3;
    }
    std::cout << "Runtime (averaged) = " << time_acc/n << " [ms]" << std::endl;

    delete[] pSrc;
    delete[] pDst;
    std::cout << "Press any button to exit" << std::endl;
    getchar();
    return 0;
}
OpenMP support is activated in the Visual Studio settings. Running this code multiple times without and with OpenMP, I observe the following behaviour:
Sometimes the runtimes are more or less equal;
But very often using OpenMP slows down the computation...
Does anybody have any idea what is going on here? I have an 8-core CPU (i7-3720QM). Setting the number of threads seems to work fine, and omp_get_thread_num() prints the actual thread number...
Any advice is kindly appreciated!
Best,
Alexey

What's wrong with this threads pool / multi-core simulation

Hi everyone. I've been trying to make a kind of thread pool meant to simulate a multi-core processor, where I have a number of threads running all the time (the "cores") that I later dispatch to process a (fixed, for now) function. The idea behind keeping the threads running at all times is to avoid the thread creation/destruction overhead.
There are three problems with what I'm doing now.
First, the results are all wrong.
Second, the function measuring the time reports 0 ms.
Third, the program calls abort at exit.
Here's the code I'm using:
auto fakeExpensiveOperation = [](int i) -> float
{
    Sleep(10);
    return sqrt(i);
};

// Performance test
int size = 4000;
float* out = new float[size];

#define THREAD_RUNNING 0
#define THREAD_FINISHED (1 << 0)
#define THREAD_EXIT (1 << 1)
#define THREAD_STALL (1 << 2)

const int coreCount = 8;
thread cores[coreCount];
atomic<unsigned int> msgArray[coreCount];
for (auto& msg : msgArray) msg.store(THREAD_STALL);

auto kernel = [out, &fakeExpensiveOperation](int idx) { out[idx] = fakeExpensiveOperation(idx); };

for (int i = 0; i < coreCount; i++)
{
    cores[i] = thread([&msgArray, i, kernel]()
    {
        while (true)
        {
            unsigned int msg = msgArray[i].load();
            if ((msg & THREAD_STALL) == THREAD_STALL)
                continue;
            if ((msg & THREAD_EXIT) == THREAD_EXIT)
                break;
            if ((msg & THREAD_RUNNING) == THREAD_RUNNING)
            {
                int idx = (msg >> 3) + i;
                // Do the function
                kernel(idx);
                msgArray[i].store(THREAD_FINISHED);
            }
        }
    });
}

auto t2 = time_call([&]()
{
    for (int i = 0; i < size; i += coreCount)
    {
        for (int n = 0; n < coreCount; n++)
        {
            if ((msgArray[n].load() & THREAD_RUNNING) == THREAD_RUNNING)
                continue; // The core is still working
            unsigned int msg = THREAD_RUNNING;
            msg |= (i << 3);
            msgArray[n].store(msg);
        }
    }
});

for (int n = 0; n < coreCount; n++) msgArray[n].store(THREAD_EXIT);

cout << "sqrt 0 : " << out[0] << endl;
cout << "sqrt 1 : " << out[1] << endl;
cout << "sqrt 2 : " << out[2] << endl;
cout << "sqrt 4 : " << out[4] << endl;
cout << "sqrt 16 : " << out[16] << endl;
cout << "Parallel : " << t2 << endl;

system("pause");
delete[] out;
return 0;
I'm really out of ideas. Can anyone point out what's wrong here?
EDIT: I made the changes I mentioned and still get wrong values. I changed the values of the flags and detached the threads after creating them.
I could be wrong (I'm not greatly familiar with the C++11 threading stuff), but it looks like you're running a thread with cores[i] = thread([&msgArray, i, kernel]() ... and then waiting until that thread is done before creating the next one, essentially making it single-threaded.
I also like using this for timing with C++11:
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
// Do Stuff
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds_d1 = end-start;
std::time_t end_time_d1 = std::chrono::system_clock::to_time_t(end);
std::cout << "elapsed time: " << elapsed_seconds_d1.count() << "s\n";