Negligible chronometric measurement in C++?

Consider the following code:
#include <iostream>
#include <string>
#include <chrono>
#include <cstdlib> // rand, srand
using namespace std;

int main()
{
    int iter = 1000000;
    int loops = 10;
    while (loops)
    {
        int a = 0, b = 0, c = 0, f = 0, m = 0, q = 0;
        auto begin = chrono::high_resolution_clock::now();
        auto end = chrono::high_resolution_clock::now();
        auto deltaT = end - begin;
        auto accumT = end - begin;
        accumT = accumT - accumT;
        auto controlT = accumT;
        srand(chrono::duration_cast<chrono::nanoseconds>(begin.time_since_epoch()).count());

        for (int i = 0; i < iter; i++) {
            begin = chrono::high_resolution_clock::now();
            // No arithmetic operation
            end = chrono::high_resolution_clock::now();
            deltaT = end - begin;
            accumT += deltaT;
        }
        controlT = accumT;        // Control duration
        accumT = accumT - accumT; // Reset to zero

        for (int i = 0; i < iter; i++) {
            auto n1 = rand() % 100;
            auto n2 = rand() % 100;
            begin = chrono::high_resolution_clock::now();
            c += i * 2 * n1 * n2; // Some arbitrary arithmetic operation
            end = chrono::high_resolution_clock::now();
            deltaT = end - begin;
            accumT += deltaT;
        }

        // Print the difference in time between the loop with no arithmetic
        // operation and the loop with one
        cout << " c = " << c << "\t\t" << " | ";
        cout << "difference between the 1st and 2nd loop: "
             << chrono::duration_cast<chrono::nanoseconds>(accumT - controlT).count()
             << endl;
        loops--;
    }
    return 0;
}
It tries to isolate the time measurement of an operation. The first loop is a control to establish a baseline and the second loop has an arbitrary arithmetic operation.
Then it outputs to the console. Here's sample output:
c = 2116663282 | difference between 1st and 2nd loop: -8620916
c = 112424882 | difference between 1st and 2nd loop: -1197927
c = -1569775878 | difference between 1st and 2nd loop: -5226990
c = 1670984684 | difference between 1st and 2nd loop: 4394706
c = -1608171014 | difference between 1st and 2nd loop: 676683
c = -1684897180 | difference between 1st and 2nd loop: 2868093
c = 112418158 | difference between 1st and 2nd loop: 5846887
c = 2019014070 | difference between 1st and 2nd loop: -951609
c = 656490372 | difference between 1st and 2nd loop: 997815
c = 263579698 | difference between 1st and 2nd loop: 2371088
Here's the very interesting part: sometimes the loop with the arithmetic operation finishes faster than the loop with no arithmetic operation (a negative difference), which means the operation of recording the current time is slower than the arithmetic operation itself, and thus not negligible.
Is there a way around this?
PS: Yes, I understand you can wrap the whole loop between begin and end.
Setup machine: Core i7 architecture, Windows 10 64 bit, and Visual Studio 2015

Your problem is that you measure time, not the number of instructions processed. Time can be influenced by a lot of things that are not really what you would expect or wish to measure.
Instead, you should measure the number of clock cycles. There is a library for this on Agner Fog's website, which also has a lot of useful information about optimization:
http://www.agner.org/optimize/#manuals
Even using clock cycles, you can still see peculiarities in the results. This can happen if the processor uses out-of-order execution, which lets it reorder the operations it runs.
If you have compiled your code with debugging symbols, the compiler may have injected additional code, which may impact the result. When performing tests like this, you should always compile without debugging information.
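As a rough illustration of cycle counting (this is not Agner Fog's library, just a minimal sketch using the raw __rdtsc intrinsic, which MSVC exposes in <intrin.h> and GCC/Clang in <x86intrin.h>):
#include <intrin.h>   // __rdtsc on MSVC; on GCC/Clang include <x86intrin.h> instead
#include <iostream>

int main()
{
    volatile int c = 0;                 // volatile so the loop is not optimized away
    unsigned long long t0 = __rdtsc();  // cycle counter before the work
    for (int i = 0; i < 1000000; ++i)
        c = c + i * 2;                  // work under test
    unsigned long long t1 = __rdtsc();  // cycle counter after the work
    std::cout << "approx. cycles: " << (t1 - t0) << '\n';
    return 0;
}
Even this is only approximate: out-of-order execution can move work across the two reads of the counter, which is exactly the caveat above.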

You should use a steady clock, std::steady_clock.
std::system_clock (and std::high_resolution_clock, which is often an alias for it) can be adjusted by the OS while your program runs, e.g. when the system time is corrected.
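A minimal sketch of the same kind of measurement with std::steady_clock (the loop body is just a stand-in for the work under test):
#include <chrono>
#include <iostream>

int main()
{
    volatile long long sum = 0;   // volatile so the work is not optimized away
    auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i)
        sum = sum + i;            // work under test
    auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
              << " ns\n";
    return 0;
}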

Related

Calculating Running Time of Binary Search

The following binary search program returns a running time of 0 milliseconds using GetTickCount(), no matter how large the search item in the given list of values is.
Is there any other way to get the running time for comparison?
Here's the code :
#include <iostream>
#include <windows.h>
using namespace std;

int main(int argc, char **argv)
{
    long int i = 1, max = 10000000;
    long int *data = new long int[max];
    long int initial = 1;
    long int final = max, mid, loc = -5;
    for (i = 1; i <= max; i++)
    {
        data[i] = i;
    }
    int range = final - initial + 1;
    long int search_item = 8800000;
    cout << "Search Item :- " << search_item << "\n";
    cout << "-------------------Binary Search-------------------\n";
    long int start = GetTickCount();
    cout << "Start Time : " << start << "\n";
    while (initial <= final)
    {
        mid = (initial + final) / 2;
        if (data[mid] == search_item)
        {
            loc = mid;
            break;
        }
        if (search_item < data[mid])
            final = mid - 1;
        if (search_item > data[mid])
            initial = mid + 1;
    }
    long int end = GetTickCount();
    cout << "End Time : " << end << "\n";
    cout << "time: " << double(end - start) << " milliseconds \n";
    if (loc == -5)
        cout << " Required number not found " << endl;
    else
        cout << " Required number is found at index " << loc << endl;
    return 0;
}
Your code looks like this:
int main()
{
    // Some code...
    while (some_condition)
    {
        // Some more code...
        // Print timing result
        return 0;
    }
}
That's why your code prints zero time: you only do one iteration of the loop and then exit the program.
Try using the clock_t type from the time.h header:
clock_t START, END;
START = clock();
// YOUR CODE GOES HERE
END = clock();
float clocks = END - START;
cout << "running time : " << clocks / CLOCKS_PER_SEC << " seconds" << endl;
CLOCKS_PER_SEC is a macro used to convert clock ticks to seconds.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms724408(v=vs.85).aspx
This article says that the result of GetTickCount will wrap to zero if your system runs for 49.7 days.
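If the wrap is a concern, GetTickCount64 returns a 64-bit tick count instead; a sketch of the same timing with it (the search itself is elided):
#include <windows.h>
#include <iostream>

int main()
{
    ULONGLONG start = GetTickCount64();   // 64-bit count, so no 49.7-day wrap
    // ... binary search goes here ...
    ULONGLONG end = GetTickCount64();
    std::cout << "time: " << (end - start) << " milliseconds\n";
    return 0;
}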
The question Easily measure elapsed time shows how to measure time in C++.
You can use the time.h header and do something like this in your code:
clock_t Start, Stop;
double sec;
Start = clock();
// call your binary search function
Stop = clock();
sec = ((double)(Stop - Start)) / CLOCKS_PER_SEC;
and print sec! I hope this helps you!
The complexity of binary search is log2(N), which is about 23 for N = 10000000.
That is far too little work to measure on a real-time scale, or even with clock().
In this case you should use unsigned long long __rdtsc(), which returns the number of processor ticks since the last reset. Put it before and after your binary search, and move cout << start; to after the end time is obtained; otherwise the time spent on output is included in the measurement.
There is also memory corruption around the data array. Indices in C run from 0 to size - 1, so there is no data[max] element.
And call delete[] data; before return.
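Putting those suggestions together, a sketch might look like this (it assumes MSVC's __rdtsc from <intrin.h>, fills indices 0 to max - 1 to avoid the overrun, and prints only after both timestamps are taken):
#include <intrin.h>   // __rdtsc (MSVC)
#include <iostream>
using namespace std;

int main()
{
    const long int max = 10000000;
    long int *data = new long int[max];
    for (long int i = 0; i < max; i++)   // valid indices: 0 to max - 1
        data[i] = i + 1;

    long int initial = 1, final = max, mid, loc = -5;
    long int search_item = 8800000;

    unsigned long long start = __rdtsc();
    while (initial <= final)
    {
        mid = (initial + final) / 2;
        if (data[mid - 1] == search_item) { loc = mid; break; }
        if (search_item < data[mid - 1]) final = mid - 1;
        else initial = mid + 1;
    }
    unsigned long long end = __rdtsc();

    cout << "processor ticks: " << (end - start) << "\n";  // printed only after timing ends
    cout << (loc == -5 ? " not found" : " found") << "\n";
    delete[] data;
    return 0;
}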

Why is my C++ code so much slower than R?

I have written the following code in R and C++, which performs the same algorithm:
a) To simulate the random variable X 500 times. (X has value 0.9 with prob 0.5 and 1.1 with prob 0.5)
b) Multiply these 500 simulated values together to get a value. Save that value in a container
c) Repeat 10000000 times such that the container has 10000000 values
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>

const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;

void generatereturns(size_t steps, int RUNS){
    mutex2.lock();
    // setting seed
    try{
        std::mt19937 tmpgenerator(seed_);
        seed_ = tmpgenerator();
        std::cout << "SEED : " << seed_ << std::endl;
    }catch(int exception){
        mutex2.unlock();
    }
    mutex2.unlock();
    // Creating generator
    std::binomial_distribution<int> distribution(steps, 0.5);
    std::mt19937 generator(seed_);
    for(int i = 0; i != RUNS; ++i){
        double power;
        double returns;
        power = distribution(generator);
        returns = pow(0.9, power) * pow(1.1, (double)steps - power);
        std::lock_guard<std::mutex> guard(mutex1);
        cache.push_back(returns);
    }
}

int main(){
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    size_t steps = 500;
    seed_ = 777;
    unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(), (unsigned)1);
    int remainder = MCsize % concurentThreadsSupported;
    std::vector<std::thread> threads;
    // starting sub-thread simulations
    if(concurentThreadsSupported != 1){
        for(int i = 0; i != concurentThreadsSupported - 1; ++i){
            if(remainder != 0){
                threads.push_back(std::thread(generatereturns, steps, MCsize / concurentThreadsSupported + 1));
                remainder--;
            }else{
                threads.push_back(std::thread(generatereturns, steps, MCsize / concurentThreadsSupported));
            }
        }
    }
    // starting main thread simulation
    if(remainder != 0){
        generatereturns(steps, MCsize / concurentThreadsSupported + 1);
        remainder--;
    }else{
        generatereturns(steps, MCsize / concurentThreadsSupported);
    }
    for (auto& th : threads) th.join();
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    typedef std::chrono::duration<int, std::milli> millisecs_t;
    millisecs_t duration(std::chrono::duration_cast<millisecs_t>(end - start));
    std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n";
    return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s), even though I have used four threads in the C++ code. Can anyone enlighten me, please? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps, 0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i != RUNS; ++i){
    double power;
    double returns;
    power = distribution(generator);
    returns = pow(0.9, power) * pow(1.1, (double)steps - power);
    tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(), tmpvec.end(), currit);
currit += RUNS;
Instead of locking every time, I created a temporary vector and then used std::move to shift the elements in that tmpvec into cache. The elapsed time has now dropped to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15s to ~4.5s on my laptop (windows 7, i5 3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I just have 2 cores but with hyperthreading) further reduced the running time to ~2.4s.
Changing the variable power to int (as jimifiki also suggested) also offered a slight boost, reducing the time to ~2.3s.
I really enjoyed your question and I tried the code at home. I tried changing the random number generator; my implementation of std::binomial_distribution requires on average about 9.6 calls to generator().
I know the question is more about comparing R and C++ performance, but since you ask "How should I improve my C++ code to make it run faster?" I'll insist on the pow optimization. You can easily avoid half of the calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9, steps);
double ratio = 1.1 / 0.9;
for(int i = 0; i != RUNS; ++i){
    ...
    returns = power1 * pow(ratio, (double)power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...
Probably doesn't help you that much, but start by using pow(double, int) when your exponent is an int:
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?
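For what it's worth, the precomputation above works because 0.9^steps * (1.1/0.9)^a == 1.1^a * 0.9^(steps - a). A quick sanity check of that identity (the exponent 237 is an arbitrary stand-in for a binomial draw):
#include <cmath>
#include <iostream>

int main()
{
    const int steps = 500;
    const double base = std::pow(0.9, steps);   // hoisted out of the loop
    const double ratio = 1.1 / 0.9;
    int a = 237;                                // an arbitrary binomial draw
    double direct = std::pow(1.1, a) * std::pow(0.9, steps - a);
    double hoisted = base * std::pow(ratio, a);
    std::cout << direct << " vs " << hoisted << '\n';   // agree up to rounding
    return 0;
}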

Random Number Generator - Histogram Construction (Poisson Distribution and Counting Variables)

This Problem Has Now Been Resolved - Revised Code is Shown Below
I have a problem here which I'm sure will only require a small amount of tweaking the code but I do not seem to have been able to correct the program.
So, basically what I want to do is write a C++ program to construct a histogram with nbin = 20 (number of bins) for the number of counts of a Geiger counter in 10000 intervals of duration dt (delta t) = 1 s, assuming an average count rate of 5 s^(-1). To determine the number of counts in some time interval deltat, I use a while statement of the form shown below:
while((t-=tau*log(zscale*double(iran=IM*iran+IC)))<deltat)count++;
As a bit of background to this problem I should mention that the total number of counts is given by n*mu, which is proportional to the total counting time T = n*deltat. Obviously, in this problem n has been chosen to be 10000 and deltat is 1s; giving T = 10000s.
The issue I am having is that the output of my code (shown below) simply gives 10000 "hits" for element 0 (corresponding to 0 counts in the time deltat) and then 0 "hits" for every subsequent element of the hist[] array, whereas the output I am expecting is a Poisson distribution with the peak at 5 counts (per second).
Thank you in advance for any help you can offer, and I apologise for my poor explanation of the problem at hand! My code is shown below:
#include <iostream> // Pre-processor directives to include
#include <ctime>    //... input/output, time,
#include <fstream>  //... file streaming and
#include <cmath>    //... mathematical function headers
using namespace std;

int main(void) {
    const unsigned IM = 1664525;            // Integer constants for
    const unsigned IC = 1013904223;         //... the RNG algorithm
    const double zscale = 1.0/0xFFFFFFFF;   // Scaling factor for random double between 0 and 1
    const double lambda = 5;                // Count rate = 5s^-1
    const double tau = 1/lambda;            // Average time tau is inverse of count rate
    const int deltat = 1;                   // Time intervals of 1s
    const int nbin = 20;                    // Number of bins in histogram
    const int nsteps = 1E4;
    clock_t start, end;
    int count(0);
    double t = 0;                           // Time variable declaration
    unsigned iran = time(0);                // Seeds the random-number generator from the system time
    int hist[nbin];                         // Declare array of size nbin for histogram

    // Create output stream and open output file
    ofstream rout;
    rout.open("geigercounterdata.txt");

    // Initialise the hist[] array, each element is given the value of zero
    for ( int i = 0 ; i < nbin ; i++ )
        hist[i] = 0;

    start = clock();

    // Construction of histogram using RNG process
    for ( int i = 1 ; i <= nsteps ; i++ ) {
        t = 0;
        count = 0;
        while((t -= tau*log(zscale*double(iran=IM*iran+IC))) < deltat)
            count++;     // Increase count variable by 1
        hist[count]++;   // Increase element "count" of hist array by 1
    }

    // Print histogram to console window and save to output file
    for ( int i = 0 ; i < nbin ; i++ ) {
        cout << i << "\t" << hist[i] << endl;
        rout << i << "\t" << hist[i] << endl;
    }

    end = clock();
    cout << "\nTime taken for process completion = "
         << (end - start)/double(CLOCKS_PER_SEC)
         << " seconds.\n";
    rout.close();
    return 1;
} // End of main() routine
I do not entirely follow the mathematics of your while loop; however, the problem is indeed in the condition of the while loop. I broke it down as follows:
count--;
do
{
    iran = IM * iran + IC;          // LCG step: next pseudo-random value
    double mulTmp = zscale*iran;    // Pseudo-random double, 0 to 1
    double logTmp = log(mulTmp);    // Always negative (see the graph of ln(x))
    t -= tau*logTmp;                // t always grows, as we subtract a negative
    count++;
} while(t < deltat);
From the code it is apparent that you will always end up with count = 0 when t > 1, and a run-time error when t < 1, as you will be corrupting memory.
Unfortunately, I do not entirely follow the mathematics behind your calculation, and I don't understand why a Poisson distribution should be expected. With the issue mentioned above, you should either go ahead and solve your problem (and then share your answer with the community) or provide me with more mathematical background and references, and I will edit my answer with corrected code. If you decide on the former, keep in mind that the Poisson distribution's domain is [0, infinity), so you will need to check whether the value of count is smaller than 20 (or your nbin, for that matter).
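For that last point, a guard like this (using nbin from the question) keeps an unexpectedly large count from writing past the end of the array:
if (count < nbin)
    hist[count]++;   // counts of nbin or more would fall outside the histogram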

Not Finding Times of Prime Generation / Limited Generation

This C++ program finds primes using the sieve of Eratosthenes. It is then supposed to store the time this takes and re-run the calculation 100 times, storing the time for each run. There are two things I need help with in this program:
Firstly, I can only test numbers up to 480 million; I would like to get higher than that.
Secondly, when I time the program it only gets the first timing and then prints zeros as the time. This is not correct and I don't know what the problem with the clock is. Thanks for the help.
Here is my code.
#include <iostream>
#include <ctime>
#include <algorithm> // std::fill_n
using namespace std;

int main ()
{
    int long MAX_NUM = 1000000;
    int long MAX_NUM_ARRAY = MAX_NUM+1;
    int long sieve_prime = 2;
    int time_store = 0;

    while (time_store <= 100)
    {
        int long sieve_prime_constant = 0;
        int *Num_Array = new int[MAX_NUM_ARRAY];
        std::fill_n(Num_Array, MAX_NUM_ARRAY, 3);
        Num_Array[0] = 1;
        Num_Array[1] = 1;

        clock_t time1, time2;
        time1 = clock();

        while (sieve_prime_constant <= MAX_NUM_ARRAY)
        {
            if (Num_Array[sieve_prime_constant] == 1)
            {
                sieve_prime_constant++;
            }
            else
            {
                Num_Array[sieve_prime_constant] = 0;
                sieve_prime = sieve_prime_constant;
                while (sieve_prime <= MAX_NUM_ARRAY - sieve_prime_constant)
                {
                    sieve_prime = sieve_prime + sieve_prime_constant;
                    Num_Array[sieve_prime] = 1;
                }
                if (sieve_prime_constant <= MAX_NUM_ARRAY)
                {
                    sieve_prime_constant++;
                    sieve_prime = sieve_prime_constant;
                }
            }
        }

        time2 = clock();
        delete[] Num_Array;

        cout << "It took " << (float(time2 - time1)/(CLOCKS_PER_SEC)) << " seconds to execute this loop." << endl;
        cout << "This loop has already been executed " << time_store << " times." << endl;

        float Time_Array[100];
        Time_Array[time_store] = (float(time2 - time1)/(CLOCKS_PER_SEC));
        time_store++;
    }
    return 0;
}
I think the problem is that you don't reset the starting prime:
int long sieve_prime = 2;
Currently that is outside your loop. On second thoughts... That's not the problem. Has this code been edited to incorporate the suggestions in Mats Petersson's answer? I just corrected the bad indentation.
Anyway, for the other part of your question, I suggest you use char instead of int for Num_Array. There is no point in using an int to store a boolean. With char you should be able to store about 4 times as many values in the same amount of memory (assuming your int is 32-bit, which it probably is).
That means you could handle numbers up to almost 2 billion. Since you are using signed long as your type instead of unsigned long, that is approaching the numeric limits for your calculation anyway.
If you want to use even less memory, you could use std::bitset, but be aware that performance could be significantly impaired.
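A sketch of that bitset variant, assuming the 480 million limit from the question (a bitset of that size takes about 57 MB instead of roughly 1.8 GB for an int array; it is declared static because it is far too large for the stack):
#include <bitset>
#include <iostream>

const long MAX_NUM = 480000000;
static std::bitset<MAX_NUM> composite;   // one bit per candidate, zero-initialized

int main()
{
    long count = 0;
    for (long i = 2; i < MAX_NUM; i++)
    {
        if (composite[i]) continue;      // already marked as a multiple
        count++;                         // i is prime
        for (long j = 2 * i; j < MAX_NUM; j += i)
            composite[j] = true;         // mark multiples of i
    }
    std::cout << count << " primes below " << MAX_NUM << '\n';
    return 0;
}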
By the way, you should declare your timing array at the top of main:
float Time_Array[100];
Putting it inside the loop just before it is used is a bit whack.
Oh, and just in case you're interested, here is my own implementation of the sieve which, personally, I find easier to read than yours....
std::vector<char> isPrime( N, 1 );
for( int i = 2; i < N; i++ )
{
if( !isPrime[i] ) continue;
for( int x = i*2; x < N; x+=i ) isPrime[x] = 0;
}
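For reference, a self-contained version of that snippet (N is picked arbitrarily here):
#include <vector>
#include <iostream>

int main()
{
    const int N = 1000000;                // arbitrary upper bound for the sketch
    std::vector<char> isPrime(N, 1);
    for (int i = 2; i < N; i++)
    {
        if (!isPrime[i]) continue;
        for (int x = i * 2; x < N; x += i)
            isPrime[x] = 0;
    }
    int total = 0;
    for (int i = 2; i < N; i++)
        total += isPrime[i];
    std::cout << total << " primes below " << N << '\n';   // 78498 for N = 1000000
    return 0;
}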
This section of code is supposed to go inside your loop:
int *Num_Array = new int[MAX_NUM_ARRAY];
std::fill_n(Num_Array, MAX_NUM_ARRAY, 3);
Num_Array [0] = 1;
Num_Array [1] = 1;
Edit: and this one needs be in the loop too:
int long sieve_prime_constant = 0;
When I run this on my machine, it takes 0.2 s per loop. If I add two zeros to MAX_NUM_ARRAY, it takes 4.6 seconds per iteration (I gave up after the 20th loop, not wanting to wait longer than 1.5 minutes).
Agree with the earlier comments. If you really want to juice things up, you don't store an array of all possible values (as int or char) but only keep the primes. Then you test each subsequent number for divisibility by all primes found so far. Now you are only limited by the number of primes you can store. Of course, that's not really the algorithm you wanted to implement any more... but since it uses integer division, it's quite fast. Something like this:
int myPrimes[MAX_PRIME];
int pCount, ii, jj;
ii = 3;
myPrimes[0] = 2;
for (pCount = 1; pCount < MAX_PRIME; pCount++) {
    for (jj = 1; jj < pCount; jj++) {
        if (ii % myPrimes[jj] == 0) {
            // not a prime
            ii += 2;  // never test even numbers...
            jj = 0;   // restart the checks (the loop increment brings jj back to 1)
        }
    }
    myPrimes[pCount] = ii;
}
Not really what you were asking for, but maybe it is useful.

C++ performance, for versus while

In general (or from your experience), is there a difference in performance between for and while loops?
What if they are doubly/triply nested?
Is vectorization (SSE) affected by loop variant in g++ or Intel compilers?
Thank you
Here is a nice paper on the subject.
Any intelligent compiler won't really show a difference between them. A for loop is really just syntactic sugar for a certain form of while loop, anyway.
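To make the "syntactic sugar" point concrete, here is a minimal sketch of the desugaring (the extra scope keeps i local, matching the for loop; the one behavioural difference is continue, which in the while form would skip the ++i):
#include <iostream>

int main()
{
    // for form
    for (int i = 0; i < 3; ++i)
        std::cout << i << ' ';
    std::cout << '\n';

    // the while form it desugars to
    {
        int i = 0;
        while (i < 3)
        {
            std::cout << i << ' ';
            ++i;
        }
    }
    std::cout << '\n';
    return 0;
}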
VS2015, Intel Xeon CPU
long long n = 1000000000;
int *v = new int[n];
int *v1 = new int[2 * n];
int *p, *pe, *p1;
clock_t start, end;

start = clock();
for (long long i = 0, j = 0; i < n; i++, j += 2)
    v[i] = v1[j];
end = clock();
std::cout << "for1 - CPU time = " << (double)(end - start) / CLOCKS_PER_SEC << std::endl;

p = v; pe = p + n; p1 = v1;
start = clock();
while (p < pe)
{
    *p++ = *p1;
    p1 += 2;
}
end = clock();
std::cout << "while3 - CPU time = " << (double)(end - start) / CLOCKS_PER_SEC << std::endl;
for1 - CPU time = 4.055
while3 - CPU time = 1.271
This is something easily ascertained by looking at disassembly. For most loops, they will be the same assuming you do the same work.
int i = 0;
while (i < 10)
    ++i;
is the same as
for (int i = 0; i < 10; ++i)
;
As for nesting, it really depends on how you configure it but same setups should yield same code.
Should be zero difference, but do check, as I've seen really crappy older versions of GCC create different ARM/Thumb code between the two: one optimized away a compare after a subtract to set the zero flag, whereas the other did not. That was very lame.
Nesting again should make no difference. Not sure about the SSE/vectorization stuff, but again I'd expect no difference.
It should be negligible; an optimizing compiler should make the distinction nonexistent.