Wait in C++0x multithreading - c++

I'm playing around with the new C++ standard. I wrote a test to observe the behavior of scheduling algorithms and see what happens with threads. Considering context-switch time, I expected the real waiting time for a specific thread to be a bit more than the value specified by std::this_thread::sleep_for(). But surprisingly it's sometimes even less than the sleep time! I can't figure out why this happens, or what I'm doing wrong...
#include <iostream>
#include <thread>
#include <random>
#include <vector>
#include <functional>
#include <math.h>
#include <unistd.h>
#include <sys/time.h>
void heavy_job()
{
    // here we're doing some kind of time-consuming job..
    int j=0;
    while(j<1000)
    {
        int* a=new int[100];
        for(int i=0; i<100; ++i)
            a[i] = i;
        delete[] a;
        for(double x=0;x<10000;x+=0.1)
            sqrt(x);
        ++j;
    }
    std::cout << "heavy job finished" << std::endl;
}
void light_job(const std::vector<int>& wait)
{
    struct timeval start, end;
    long utime, seconds, useconds;
    std::cout << std::showpos;
    for(std::vector<int>::const_iterator i = wait.begin();
        i!=wait.end();++i)
    {
        gettimeofday(&start, NULL);
        std::this_thread::sleep_for(std::chrono::microseconds(*i));
        gettimeofday(&end, NULL);
        seconds = end.tv_sec - start.tv_sec;
        useconds = end.tv_usec - start.tv_usec;
        utime = ((seconds) * 1000 + useconds/1000.0);
        double delay = *i - utime*1000;
        std::cout << "delay: " << delay/1000.0 << std::endl;
    }
}
int main()
{
    std::vector<int> wait_times;
    std::uniform_int_distribution<unsigned int> unif;
    std::random_device rd;
    std::mt19937 engine(rd());
    std::function<unsigned int()> rnd = std::bind(unif, engine);
    for(int i=0;i<1000;++i)
        wait_times.push_back(rnd()%100000+1); // random sleep time between 1 µs and 100,000 µs
    std::thread heavy(heavy_job);
    std::thread light(light_job,wait_times);
    light.join();
    heavy.join();
    return 0;
}
Output on my Intel Core-i5 machine:
.....
delay: +0.713
delay: +0.509
delay: -0.008 // !
delay: -0.043 // !!
delay: +0.409
delay: +0.202
delay: +0.077
delay: -0.027 // ?
delay: +0.108
delay: +0.71
delay: +0.498
delay: +0.239
delay: +0.838
delay: -0.017 // also !
delay: +0.157

Your timing code is causing integral truncation.
utime = ((seconds) * 1000 + useconds/1000.0);
double delay = *i - utime*1000;
Suppose your wait time was 888888 microseconds and you sleep for exactly that amount. seconds will be 0 and useconds will be 888888. After dividing by 1000.0, you get 888.888. Then you add 0*1000, still yielding 888.888. That then gets assigned to the long utime, truncating it to 888, and the program prints an apparent delay of 888.888 - 888 = 0.888 milliseconds even though the sleep was exact.
You should update utime to actually store microseconds so that you don't get the truncation, and also because the name implies that the unit is in microseconds, just like useconds. Something like:
long utime = seconds * 1000000 + useconds;
You've also got your delay calculation backwards. Ignoring the effects of the truncation, it should be:
double delay = utime*1000 - *i;
std::cout << "delay: " << delay/1000.0 << std::endl;
The way you've got it, all the positive delays you're outputting are actually the result of the truncation, and the negative ones represent actual delays.
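For comparison, here is a minimal sketch of my own (not part of the original answer) that measures the same overshoot with std::chrono::steady_clock; the duration arithmetic stays in one unit, so no manual second/microsecond bookkeeping is needed. The requested sleep value is just an example.
#include <chrono>
#include <iostream>
#include <thread>
int main()
{
    using clock = std::chrono::steady_clock;
    const auto requested = std::chrono::microseconds(888888); // example sleep request
    const auto start = clock::now();
    std::this_thread::sleep_for(requested);
    const auto slept = std::chrono::duration_cast<std::chrono::microseconds>(clock::now() - start);
    // the overshoot is how much longer we actually slept than requested; it should not come out negative
    std::cout << "overshoot: " << (slept - requested).count() << " us" << std::endl;
}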

Related

measuring elapsed seconds using chrono (stop watch) C++

I am currently trying to create a way to display the elapsed seconds (not the difference between cycles). My code is the following:
#include <iostream>
#include <vector>
#include <chrono>
#include <Windows.h>
typedef std::chrono::high_resolution_clock::time_point TIME;
#define TIMENOW() std::chrono::high_resolution_clock::now()
#define TIMECAST(x) std::chrono::duration_cast<std::chrono::duration<double>>(x).count()
int main()
{
    std::chrono::duration<double> ms;
    double t = 0;
    while (1)
    {
        TIME begin = TIMENOW();
        int c = 0;
        for (int i = 0; i < 10000000; i++)
        {
            c += i*100000;
        }
        TIME end = TIMENOW();
        ms = std::chrono::duration_cast<std::chrono::duration<double>>(end - begin);
        t = t + ms.count();
        std::cout << t << std::endl;
    }
}
I expected adding the delta time over and over again to roughly give me the elapsed time in seconds, but I noticed that it is only fairly accurate if I use i < some big number. If it's only 10,000 or so, t seems to accumulate more slowly and only gradually speeds up. Maybe I am missing something, but isn't the difference my delta time (the elapsed time between this cycle and the last one), and if I keep adding the delta times up, shouldn't it spit out seconds? Any help is appreciated.
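As a quick sanity check (a sketch of my own, not code from the question), you can keep a single fixed start point alongside the accumulated deltas; whatever the per-iteration measurements miss, such as the std::cout call between iterations, then shows up as a growing gap between the two numbers.
#include <chrono>
#include <iostream>
int main()
{
    using clock = std::chrono::high_resolution_clock;
    const auto programStart = clock::now();
    double accumulated = 0.0;                       // sum of per-iteration deltas, as in the question
    while (true)
    {
        const auto begin = clock::now();
        volatile long long c = 0;                   // volatile so the busy loop is not optimized away
        for (long long i = 0; i < 10000000; ++i)
            c = c + i;
        const auto end = clock::now();
        accumulated += std::chrono::duration<double>(end - begin).count();
        const double wall = std::chrono::duration<double>(end - programStart).count();
        std::cout << "accumulated: " << accumulated << "  wall: " << wall << std::endl;
    }
}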

How to measure wallclock time in C++ instead of cpu time?

I would like to measure wallclock time taken by my algorithm in C++. Many articles point to this code.
clock_t begin_time, end_time;
begin_time = clock();
Algorithm();
end_time = clock();
cout << ((double)(end_time - begin_time)/CLOCKS_PER_SEC) << endl;
But this measures only cpu time taken by my algorithm.
Some other article pointed out this code.
double getUnixTime(void)
{
    struct timespec tv;
    if(clock_gettime(CLOCK_REALTIME, &tv) != 0) return 0;
    return (tv.tv_sec + (tv.tv_nsec / 1000000000.0));
}
double begin_time, end_time;
begin_time = getUnixTime();
Algorithm();
end_time = getUnixTime();
cout << (double) (end_time - begin_time) << endl;
I thought it would print the wallclock time taken by my algorithm. But surprisingly, the time printed by this code is much lower than the cpu time printed by the previous code. So I am confused. Please provide code for printing wallclock time.
Those times are probably down in the noise. To get a reasonable time measurement, try executing your algorithm many times in a loop:
const int loops = 1000000;
double begin_time, end_time;
begin_time = getUnixTime();
for (int i = 0; i < loops; ++i)
Algorithm();
end_time = getUnixTime();
cout << (double) (end_time - begin_time) / loops << endl;
I'm getting approximately the same times in a single threaded program:
#include <time.h>
#include <stdio.h>
__attribute((noinline)) void nop(void){}
void loop(unsigned long Cnt) { for(unsigned long i=0; i<Cnt;i++) nop(); }
int main()
{
    clock_t t0,t1;
    struct timespec ts0,ts1;
    t0=clock();
    clock_gettime(CLOCK_REALTIME,&ts0);
    loop(1000000000);
    t1=clock();
    clock_gettime(CLOCK_REALTIME,&ts1);
    printf("clock-diff: %lu\n", (unsigned long)((t1 - t0)/CLOCKS_PER_SEC));
    printf("clock_gettime-diff: %lu\n", (unsigned long)((ts1.tv_sec - ts0.tv_sec)));
}
//prints 2 and 3 or 2 and 2 on my system
But the clock() manpage only describes it as returning an approximation. There's no indication that this approximation is comparable to what clock_gettime returns.
Where I get drastically different results is where I throw in multiple threads:
#include <time.h>
#include <stdio.h>
#include <pthread.h>
__attribute((noinline)) void nop(void){}
void loop(unsigned long Cnt) {
    for(unsigned long i=0; i<Cnt;i++) nop();
}
void *busy(void *A){ (void)A; for(;;) nop(); }
int main()
{
    pthread_t ptids[4];
    for(size_t i=0; i<sizeof(ptids)/sizeof(ptids[0]); i++)
        pthread_create(&ptids[i], 0, busy, 0);
    clock_t t0,t1;
    struct timespec ts0,ts1;
    t0=clock();
    clock_gettime(CLOCK_REALTIME,&ts0);
    loop(1000000000);
    t1=clock();
    clock_gettime(CLOCK_REALTIME,&ts1);
    printf("clock-diff: %lu\n", (unsigned long)((t1 - t0)/CLOCKS_PER_SEC));
    printf("clock_gettime-diff: %lu\n", (unsigned long)((ts1.tv_sec - ts0.tv_sec)));
}
//prints 18 and 4 on my 4-core linux system
That's because both musl and glibc on Linux implement clock() with clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts), and the nonstandard CLOCK_PROCESS_CPUTIME_ID clock is described in the clock_gettime manpage as returning the time consumed by all threads of the process together.
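Since the question is about C++, here is a minimal sketch of my own (not from the answers above) showing the same divergence with standard facilities: std::chrono::steady_clock for wall-clock time next to std::clock() for CPU time, with a few busy threads running.
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>
#include <vector>
static void spin(std::chrono::seconds d)
{
    const auto until = std::chrono::steady_clock::now() + d;
    while (std::chrono::steady_clock::now() < until) { }   // burn CPU until the deadline
}
int main()
{
    const auto wallStart = std::chrono::steady_clock::now();
    const std::clock_t cpuStart = std::clock();
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)                             // four busy threads, about 2 s of wall time each
        workers.emplace_back(spin, std::chrono::seconds(2));
    for (auto& t : workers) t.join();
    const double wall = std::chrono::duration<double>(std::chrono::steady_clock::now() - wallStart).count();
    const double cpu  = static_cast<double>(std::clock() - cpuStart) / CLOCKS_PER_SEC;
    std::printf("wall: %.2f s  cpu: %.2f s\n", wall, cpu);  // on Linux (glibc/musl), cpu comes out roughly 4x wall here
}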

Converting seconds (clocks) to decimal format

I'm making a stopwatch, and I need to output the seconds out like so: "9.743 seconds".
I have the start time, the end time, and the difference measured out in clocks, and was planning on achieving the decimal by dividing the difference by 1000. However, no matter what I try, it will always output as a whole number. It's probably something small I'm overlooking, but I haven't a clue what.
Here's my code:
#include "Stopwatch.h"
#include <iostream>
#include <iomanip>
using namespace std;
Stopwatch::Stopwatch(){
    clock_t startTime = 0;
    clock_t endTime = 0;
    clock_t elapsedTime = 0;
    long miliseconds = 0;
}
void Stopwatch::Start(){
    startTime = clock();
}
void Stopwatch::Stop(){
    endTime = clock();
}
void Stopwatch::DisplayTimerInfo(){
    long formattedSeconds;
    setprecision(4);
    seconds = (endTime - startTime) / CLOCKS_PER_SEC;
    miliseconds = (endTime - startTime) / (CLOCKS_PER_SEC / 1000);
    formattedSeconds = miliseconds / 1000;
    cout << formattedSeconds << endl;
    system("pause");
}
Like I said, the output is an integer. Say it timed 5892 clocks: the output would be "5".
Division between two integers is still an integer. Cast one of the operands of the division to a real type (double or float) and assign the result to a variable that is also a real type.
double elapsedSeconds = (endTime - startTime) / (double)(CLOCKS_PER_SEC);
cout << elapsedSeconds << endl;
formattedSeconds = (double) miliseconds / 1000;
will give you real-number output, provided formattedSeconds is itself declared as a double; if it stays a long (as in the question), the result is truncated again on assignment.
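Putting the advice together, here is a sketch of how DisplayTimerInfo could look; it assumes the Stopwatch members from the question (startTime, endTime) and that <iostream> and <iomanip> are included.
void Stopwatch::DisplayTimerInfo(){
    // Do the division in floating point, then ask the stream for fixed notation;
    // note that setprecision() only has an effect when applied to a stream.
    double elapsedSeconds = static_cast<double>(endTime - startTime) / CLOCKS_PER_SEC;
    std::cout << std::fixed << std::setprecision(3) << elapsedSeconds << " seconds" << std::endl;
}
With CLOCKS_PER_SEC equal to 1000, as it is with MSVC, 5892 clocks would then print as 5.892 seconds.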

Why is my C++ code so much slower than R?

I have written the following code in R and C++, which performs the same algorithm:
a) To simulate the random variable X 500 times. (X has value 0.9 with prob 0.5 and 1.1 with prob 0.5)
b) Multiply these 500 simulated values together to get a value. Save that value in a container
c) Repeat 10000000 times such that the container has 10000000 values
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>
const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;
void generatereturns(size_t steps, int RUNS){
    mutex2.lock();
    // setting seed
    try{
        std::mt19937 tmpgenerator(seed_);
        seed_ = tmpgenerator();
        std::cout << "SEED : " << seed_ << std::endl;
    }catch(int exception){
        mutex2.unlock();
    }
    mutex2.unlock();
    // Creating generator
    std::binomial_distribution<int> distribution(steps,0.5);
    std::mt19937 generator(seed_);
    for(int i = 0; i!= RUNS; ++i){
        double power;
        double returns;
        power = distribution(generator);
        returns = pow(0.9,power) * pow(1.1,(double)steps - power);
        std::lock_guard<std::mutex> guard(mutex1);
        cache.push_back(returns);
    }
}
int main(){
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    size_t steps = 500;
    seed_ = 777;
    unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(),(unsigned)1);
    int remainder = MCsize % concurentThreadsSupported;
    std::vector<std::thread> threads;
    // starting sub-thread simulations
    if(concurentThreadsSupported != 1){
        for(int i = 0 ; i != concurentThreadsSupported - 1; ++i){
            if(remainder != 0){
                threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported + 1));
                remainder--;
            }else{
                threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported));
            }
        }
    }
    //starting main thread simulation
    if(remainder != 0){
        generatereturns(steps, MCsize / concurentThreadsSupported + 1);
        remainder--;
    }else{
        generatereturns(steps, MCsize / concurentThreadsSupported);
    }
    for (auto& th : threads) th.join();
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now() ;
    typedef std::chrono::duration<int,std::milli> millisecs_t ;
    millisecs_t duration( std::chrono::duration_cast<millisecs_t>(end-start) ) ;
    std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n" ;
    return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s) even though I have used four threads in the C++ code. Can anyone enlighten me, please? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i!= RUNS; ++i){
    double power;
    double returns;
    power = distribution(generator);
    returns = pow(0.9,power) * pow(1.1,(double)steps - power);
    tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(),tmpvec.end(),currit);
currit += RUNS;
Instead of locking every time, I created a temporary vector and then used std::move to shift the elements of that tmpvec into cache. Now the elapsed time has dropped to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15 s to ~4.5 s on my laptop (Windows 7, i5 3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I just have 2 cores but with hyperthreading) further reduced the running time to ~2.4s.
Changing the variable power to int (as jimifiki also suggested) also offered a slight boost, reducing the time to ~2.3s.
I really enjoyed your question and tried the code at home. I tried changing the random number generator; my implementation of std::binomial_distribution requires on average about 9.6 calls to generator().
I know the question is more about comparing R and C++ performance, but since you ask "How should I improve my C++ code to make it run faster?", I'll insist on the pow optimization. You can avoid half of the pow calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9,steps);
double ratio = 1.1/0.9;
for(int i = 0; i!= RUNS; ++i){
    ...
    returns = power1 * pow(ratio, (double)steps - power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- rep(pow1,times=MCsize) * rep(ratio,times=MCsize)^a
...
Probably doesn't help you that much, but
start by using pow(double,int) when your exponent is an int.
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?
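Purely as an illustration (my own sketch, not code from any of the answers), the pow suggestions can be combined inside generatereturns(): hoist the constant 0.9^steps factor out of the loop so that only one pow call per iteration remains, then fill a per-thread buffer as in the edit above. It reuses the names distribution, generator, steps, RUNS and tmpvec from the question; everything else is illustrative.
// inside generatereturns(), after distribution and generator have been set up
const double base  = std::pow(0.9, static_cast<double>(steps)); // hoisted out of the loop
const double ratio = 1.1 / 0.9;
std::vector<double> tmpvec(RUNS);
for (int i = 0; i != RUNS; ++i) {
    const int power = distribution(generator);                              // number of 0.9 factors drawn
    tmpvec[i] = base * std::pow(ratio, static_cast<double>(steps) - power); // equals 0.9^power * 1.1^(steps-power)
}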

Check how long it takes for conhost to print

I have an array of booleans, each representing a number. I am printing each one that is true with a for loop: for(unsigned long long l = 0; l<numt; l++) if(primes[l]) cout << l << endl; where numt is the size of the array and is over 1000000. The console window takes 30 seconds to print out all the values, but a timer I put in my program says 37 ms. How do I wait for all the values to finish printing on the screen in my program so I can include that in my time?
Try this:
#include <windows.h>
...
int main() {
    //init code
    double startTime = GetTickCount();
    //your loop
    double timeNeededinSec = (GetTickCount() - startTime) / 1000.0;
}
Just in defense of ctime, because it gives the same result as GetTickCount:
#include <ctime>
int main()
{
...
clock_t start = clock();
...
clock_t end = clock();
double timeNeededinSec = static_cast<double>(end - start) / CLOCKS_PER_SEC;
...
}
Update:
And here is the one with time(), but in this case we can lose some precision (~1 sec) because the result is in seconds.
#include <ctime>
int main()
{
    time_t start;
    time_t end;
    ...
    time(&start);
    ...
    time(&end);
    int timeNeededinSec = static_cast<int>(end-start);
}
Combining both of them in a simple example will show you the difference in the results. In my tests I saw a difference only in the digits after the decimal point.
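As a rough sketch of my own (not from the answers above): make sure the timer covers everything your program actually hands to the console by flushing before you stop the clock, and avoid std::endl inside the loop so you don't force a flush on every line. Even then, the flush only guarantees the data reached the OS; conhost may still be rendering it afterwards. The array below is placeholder data.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>
int main()
{
    const std::size_t numt = 1000000;            // size used in the question
    std::vector<bool> primes(numt, true);        // placeholder data for the sketch
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t l = 0; l < numt; ++l)
        if (primes[l])
            std::cout << l << '\n';              // '\n' instead of std::endl: no flush per line
    std::cout.flush();                           // hand everything to the OS before stopping the timer
    const auto end = std::chrono::steady_clock::now();
    std::cerr << std::chrono::duration<double>(end - start).count() << " s\n";
}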