Timing operator+, operator-, operator*, and operator/ - c++

I have a simple math vector class with operators overloaded. I would like to get some timing results for my operators. I can easily time my +=, -=, *=, and /= by timing the following code:
Vector sum;
for (size_t i = 0; i < iter; ++i)
    sum += RandVector();
cout << sum << endl;
Then I can subtract the time it takes to generate iter random vectors. In my tests, Vector is 3 dimensional, iter = 10,000,000.
I tried to do a similar thing with +,-,*,/:
Vector sum;
for (size_t i = 0; i < iter; ++i)
    sum = sum + RandVector();
cout << sum << endl;
Then I subtract the time it takes to generate iter random vectors and perform iter assignments. However, this gives a "negative" time, leading me to believe that either the compiler is optimizing the operation away somehow or something strange is going on.
I am using gcc-4.7.2 using -O3 on a Fedora Linux machine.
Here is my timing code:
clock_t start, k = clock();
do start = clock();
while (start == k);   // spin until the next clock tick so timing starts on a tick boundary
F()();                // run the code under test
clock_t end = clock();
double time = double(end - start) / double(CLOCKS_PER_SEC);
cout << time - time_iter_rand_v - time_iter_ass;
Here F is a function object which performs the code above, time_iter_rand_v is the time it takes to create iter random vectors, and time_iter_ass is the time it takes for iter assignment operations.
My question is then how to get accurate timing of just the operator+ function not any assignments or random vector generation?

You really can't get accurate timing for something like that when optimization is on. The reason is that the compiler is free to move code around.
If you make the time storage variables volatile, their positions relative to each other are not subject to reordering. However, the code around them still is, unless it assigns to volatile variables or calls functions that take volatile variables (this includes a volatile member function, which makes *this volatile).
Optimisation can do a lot of odd things to the code if you are expecting linear execution.
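As a rough sketch of the volatile idea only (not a guarantee of accurate numbers): reads and writes of volatile objects cannot be reordered relative to each other, so marking the clock_t variables and the accumulator volatile at least pins the two clock() reads around the work. The double accumulator here just stands in for the Vector sum from the question.
#include <ctime>
#include <iostream>

int main() {
    volatile double sink = 0.0;          // stands in for the Vector sum in the question

    volatile clock_t start = clock();    // volatile accesses keep their relative order
    for (int i = 0; i < 10000000; ++i)
        sink = sink + 1.0;               // every iteration touches a volatile, so it can't be elided
    volatile clock_t end = clock();

    std::cout << double(end - start) / CLOCKS_PER_SEC << " s\n";
}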

Just create a vector of RandVector() results up front and iterate through it. That removes the random-vector generation from the measured time.
As for the assignment, I think it comes down to how the compiler optimizes it.
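A minimal sketch of that idea, assuming the Vector class and RandVector() factory from the question: generate the inputs first, time only the loop that uses operator+, and print the accumulated sum so the compiler cannot discard the additions.
#include <ctime>
#include <iostream>
#include <vector>

// Assumes the question's Vector class and RandVector() are declared elsewhere.
int main() {
    const size_t iter = 10000000;

    std::vector<Vector> inputs;
    inputs.reserve(iter);
    for (size_t i = 0; i < iter; ++i)
        inputs.push_back(RandVector());   // generation happens outside the timed region

    clock_t start = clock();
    Vector sum;
    for (size_t i = 0; i < iter; ++i)
        sum = sum + inputs[i];            // only operator+ (plus the assignment) is inside the timer
    clock_t end = clock();

    std::cout << sum << '\n';             // use the result so the loop is not optimized away
    std::cout << double(end - start) / CLOCKS_PER_SEC << " s\n";
}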

One basic benchmarking method is to use gettimeofday:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>       // strftime, localtime
#include <sys/time.h>   // gettimeofday, struct timeval
#include <sys/types.h>
#include <cstring>

//------------------- Handle time in microseconds ----------------------//
/*
 * Subtract t1 from t2, storing the result in 'result'.
 * Return 1 if the difference is negative, otherwise 0.
 */
int timeval_subtract(struct timeval *result, struct timeval *t2, struct timeval *t1)
{
    long int diff = (t2->tv_usec + 1000000 * t2->tv_sec) - (t1->tv_usec + 1000000 * t1->tv_sec);
    result->tv_sec = diff / 1000000;
    result->tv_usec = diff % 1000000;
    return (diff < 0);
}

void timeval_print(struct timeval *tv)
{
    char buffer[30];
    time_t curtime;

    printf("%ld.%06ld", tv->tv_sec, tv->tv_usec);
    curtime = tv->tv_sec;
    strftime(buffer, 30, "%m-%d-%Y %T", localtime(&curtime));
    printf(" = %s.%06ld\n", buffer, tv->tv_usec);
}
// usage :
/*
    struct timeval tvBegin, tvEnd, tvDiff;

    // begin
    gettimeofday(&tvBegin, NULL);

    // lengthy operation
    int i, j;
    for (i = 0; i < 999999L; ++i) {
        j = sqrt(i);
    }

    // end
    gettimeofday(&tvEnd, NULL);

    // diff
    timeval_subtract(&tvDiff, &tvEnd, &tvBegin);
    printf("%ld.%06ld\n", tvDiff.tv_sec, tvDiff.tv_usec);
*/
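For completeness, here is that usage as a self-contained sketch that actually compiles (it assumes the timeval_subtract() above is in the same file; sqrt comes from <cmath>, and j is volatile here only so the loop is not optimized away):
#include <cmath>
#include <cstdio>
#include <sys/time.h>

// timeval_subtract() from above is assumed to be defined in this file.
int main()
{
    struct timeval tvBegin, tvEnd, tvDiff;

    gettimeofday(&tvBegin, NULL);

    // lengthy operation
    volatile double j = 0;
    for (long i = 0; i < 999999L; ++i)
        j = sqrt((double)i);

    gettimeofday(&tvEnd, NULL);

    timeval_subtract(&tvDiff, &tvEnd, &tvBegin);
    printf("%ld.%06ld\n", (long)tvDiff.tv_sec, (long)tvDiff.tv_usec);
    return 0;
}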

Related

Why does the runtime of high_resolution_clock increase with the greater frequency I call it?

In the following code, I repeatedly call std::chrono::high_resolution_clock::now twice and measure the time between the two calls. I would expect this time to be very small, since no other code runs between the calls. However, I observe strange behavior.
For small N, the max element is within a few nanoseconds, as expected. However, the more I increase N, the larger the outliers get; I have seen values of up to a few milliseconds. Why does this happen?
In other words, why does the max element of v increase as I increase N in the following code?
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
#include <cstdint>   // uint64_t

int main()
{
    using ns = std::chrono::nanoseconds;
    uint64_t N = 10000000;
    std::vector<uint64_t> v(N, 0);
    for (uint64_t i = 0; i < N; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        v[i] = std::chrono::duration_cast<ns>(std::chrono::high_resolution_clock::now() - start).count();
    }
    std::cout << "max: " << *std::max_element(v.begin(), v.end()) << std::endl;
    return 0;
}
The longer you run your loop, the more likely it is that your OS will decide that your thread has consumed enough resources for the moment and suspend it. And the longer you run your loop, the more likely it is that this suspension will happen between those calls.
Since you're only looking at the "max" time, this only has to happen once to cause the max time to spike into the millisecond range.
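A quick way to see that this is a rare event rather than the clock getting slower (a sketch, not part of the original answer): sort the samples and print a high percentile next to the max. The percentile stays in the nanosecond range while the max spikes.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    using ns = std::chrono::nanoseconds;
    const uint64_t N = 10000000;
    std::vector<uint64_t> v(N, 0);

    for (uint64_t i = 0; i < N; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        v[i] = std::chrono::duration_cast<ns>(
                   std::chrono::high_resolution_clock::now() - start).count();
    }

    std::sort(v.begin(), v.end());
    std::cout << "99.9th percentile: " << v[size_t(0.999 * (N - 1))] << " ns\n"
              << "max:               " << v.back() << " ns\n";
    return 0;
}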

Time measurement not working in C++

I'm trying to make run-time measurements of simple algorithms like linear search. The problem is that no matter what I do, the time measurement won't work as intended: I get the same search time no matter what problem size I use. Both I and the other people who have tried to help me are equally confused.
I have a linear search function that looks like this:
// Search the N first elements of 'data'.
int linearSearch(vector<int> &data, int number, const int N) {
    if (N < 1 || N > data.size()) return 0;
    for (int i = 0; i < N; i++) {
        if (data[i] == number) return 1;
    }
    return 0;
}
I've tried to take time measurements with both time_t and chrono from C++11, without any luck except for more decimals. This is how it looks right now when I'm searching:
vector<int> listOfNumbers = large list of numbers;
for (int i = 15000; i <= 5000000; i += 50000) {
    const clock_t start = clock();
    for (int a = 0; a < NUMBERS_TO_SEARCH; a++) {
        int randNum = rand() % INT_MAX;
        linearSearch(listOfNumbers, randNum, i);
    }
    cout << float(clock() - start) / CLOCKS_PER_SEC << endl;
}
The result?
0.126, 0.125, 0.125, 0.124, 0.124, ... (same values?)
I have tried the code with both VC++, g++ and on different computers.
First I thought it was my implementation of the search algorithm that was at fault. But a linear search like the one above can't get any simpler; it's clearly O(N). How can the time be the same even when the problem size is increased by so much? I'm at a loss for what to do.
Edit 1:
Someone else might have an explanation why this is the case. But it actually worked in release mode after changing:
if (data[i] == number)
To:
if (data.at(i) == number)
I have no idea why this is the case, but linear search could be time measured correctly after that change.
The reason for the near-constant execution times is that the compiler is able to optimize away parts of the code.
Specifically, look at this part of the code:
for (int a = 0; a < NUMBERS_TO_SEARCH; a++) {
    int randNum = rand() % INT_MAX;
    linearSearch(listOfNumbers, randNum, i);
}
When compiling with g++ 5.2 at optimization level -O3, the compiler can optimize away the call to linearSearch() completely, because the observable behavior of the code is the same with or without that call.
The return value of linearSearch is not used anywhere, and the function has no side effects, so the compiler can remove it.
You can cross-check and modify the inner loop as follows. The execution times shouldn't change:
for (int a = 0; a < NUMBERS_TO_SEARCH; a++) {
    int randNum = rand() % INT_MAX;
    // linearSearch(listOfNumbers, randNum, i);
}
What remains in the loop is the call to rand(), and that is what you seem to be measuring. When data[i] == number is changed to data.at(i) == number, the call to linearSearch is no longer free of side effects, since at(i) may throw an out-of-range exception, so the compiler does not optimize the linearSearch code away completely. However, with g++ 5.2 it will still inline it rather than emit a function call.
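A sketch of another way to keep the call alive, using the same names as in the question: feed the return values into a total that is printed afterwards, so the result of every call is observable.
int found = 0;
for (int a = 0; a < NUMBERS_TO_SEARCH; a++) {
    int randNum = rand() % INT_MAX;
    found += linearSearch(listOfNumbers, randNum, i);  // result now contributes to output
}
cout << "found " << found << " of " << NUMBERS_TO_SEARCH << endl;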
clock() measures CPU time; maybe you want time(NULL) instead? See the related question on that.
The start should be taken before the outer for loop. As written, start is re-initialized on every iteration of the outer loop; it is only constant within the inner { ... } block.
const clock_t start = clock();
for (int i = 15000; i <= 5000000; i += 50000) {
    ...
}

Benchmarking a pure C++ function

How do I prevent GCC/Clang from inlining and optimizing out multiple invocations of a pure function?
I am trying to benchmark code of this form
int __attribute__ ((noinline)) my_loop(int const* array, int len) {
    // Use array to compute result.
}
My benchmark code looks something like this:
int main() {
    const int number = 2048;

    // My own aligned_malloc implementation.
    int* input = (int*)aligned_malloc(sizeof(int) * number, 32);

    // Fill the array with some random numbers.
    make_random(input, number);

    const int num_runs = 10000000;
    for (int i = 0; i < num_runs; i++) {
        const int result = my_loop(input, number); // Call pure function.
    }

    // Since the program exits I don't free input.
}
As expected Clang seems to be able to turn this into a no-op at O2 (perhaps even at O1).
A few things I tried to actually benchmark my implementation are:
Accumulate the intermediate results in an integer and print the results at the end:
const int num_runs = 10000000;
uint64_t total = 0;
for (int i = 0; i < num_runs; i++) {
    total += my_loop(input, number); // Call pure function.
}
printf("Total is %llu\n", total);
Sadly this doesn't seem to work. Clang at least is smart enough to realize that this is a pure function and transforms the benchmark to something like this:
int result = my_loop();
uint64_t total = num_runs * result;
printf("Total is %llu\n", total);
Set an atomic variable using release semantics at the end of every loop iteration:
const int num_runs = 10000000;
std::atomic<uint64_t> result_atomic(0);
for (int i = 0; i < num_runs; i++) {
    int result = my_loop(input, number); // Call pure function.
    // Tried std::memory_order_release too.
    result_atomic.store(result, std::memory_order_seq_cst);
}
printf("Result is %llu\n", result_atomic.load());
My hope was that since atomics introduce a happens-before relationship, Clang would be forced to execute my code. But sadly it still performed the optimization above, setting the value of the atomic to num_runs * result in one shot instead of running num_runs iterations of the function.
Set a volatile int at the end of every loop iteration, along with summing the total:
const int num_runs = 10000000;
uint64_t total = 0;
volatile int trigger = 0;
for (int i = 0; i < num_runs; i++) {
    total += my_loop(input, number); // Call pure function.
    trigger = 1;
}
// If I take this printf out, Clang optimizes the code away again.
printf("Total is %llu\n", total);
This seems to do the trick and my benchmarks appear to work, but it is not ideal for a number of reasons.
Per my understanding of the C++11 memory model, volatile stores do not establish a happens-before relationship, so I can't be sure that some compiler won't decide to do the same num_runs * result_of_1_run optimization.
Also, this method seems undesirable since I now have the overhead (however tiny) of a volatile store on every run of my loop.
Is there a canonical way of preventing Clang/GCC from optimizing this result away? Maybe with a pragma or something? Bonus points if the method works across compilers.
You can insert instructions directly into the assembly. I sometimes use a macro for splitting up the assembly, e.g. separating loads from calculations and branching.
#define GCC_SPLIT_BLOCK(str) __asm__( "//\n\t// " str "\n\t//\n" );
Then in the source you insert
GCC_SPLIT_BLOCK("Keep this please")
before and after your functions
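Another widely used trick, offered here only as a sketch (it is the kind of thing benchmarking libraries such as Google Benchmark do internally, not something from the answer above): an empty extended-asm statement that claims to read a value, which forces GCC/Clang to materialize that value on every iteration.
#include <cstdint>
#include <cstdio>

int my_loop(int const* array, int len);   // the function under test, from the question

// Pretend to the optimizer that 'value' is read, so the computation that
// produced it cannot be discarded. GCC/Clang extended asm; not standard C++.
template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(value) : "memory");
}

void benchmark(int const* input, int number) {
    const int num_runs = 10000000;
    uint64_t total = 0;
    for (int i = 0; i < num_runs; i++) {
        const int result = my_loop(input, number);  // call the pure function
        do_not_optimize(result);                    // keep each call's result "live"
        total += result;
    }
    printf("Total is %llu\n", (unsigned long long)total);
}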

How to calculate GFLOPs for a function in a C++ program?

I have C++ code which computes the factorial of an int, the sum of two floats, and the execution time of each function, as follows:
long Sample_C::factorial(int n)
{
    int counter;
    long fact = 1;
    for (int counter = 1; counter <= n; counter++)
    {
        fact = fact * counter;
    }
    Sleep(100);
    return fact;
}

float Sample_C::add(float a, float b)
{
    return a + b;
}

int main()
{
    Sample_C object;

    clock_t start = clock();
    object.factorial(6);
    clock_t end = clock();
    double time = (double)(end - start);   // finding execution time of factorial()
    cout << time;

    clock_t starts = clock();
    object.add(1.1, 5.5);
    clock_t ends = clock();
    double total_time = (double)(ends - starts);   // finding execution time of add()
    cout << total_time;
    return 0;
}
Now I want to measure GFLOPs for the add function, so kindly suggest how I would calculate it. I am completely new to GFLOPs, so kindly tell me whether GFLOPs can be calculated for functions that use only float data types, and whether the GFLOPs value varies between different functions.
If I was interested in estimating the execution time of the addition operation I might start with the following program. However, I would still only trust the number this program produced to within a factor of 10 to 100 at best (i.e. I don't really trust the output of this program).
#include <iostream>
#include <ctime>

int main(int argc, char** argv)
{
    // Declare these as volatile so the compiler (hopefully) doesn't
    // optimise them away.
    volatile float a = 1.0;
    volatile float b = 2.0;
    volatile float c;

    // Perform the calculation multiple times to account for a clock()
    // implementation that doesn't have a sufficient timing resolution to
    // measure the execution time of a single addition.
    const int iter = 1000;

    // Estimate the execution time of adding a and b and storing the
    // result in the variable c.
    // Depending on the compiler we might need to count this as 2 additions
    // if we count the loop variable.
    clock_t start = clock();
    for (unsigned i = 0; i < iter; ++i)
    {
        c = a + b;
    }
    clock_t end = clock();

    // Write the average time per addition for the user.
    std::cout << (end - start) / ((double) CLOCKS_PER_SEC * iter)
              << " seconds" << std::endl;
    return 0;
}
If you knew how your particular architecture was executing this code you could then try and estimate FLOPS from the execution time but the estimate for FLOPS (on this type of operation) probably wouldn't be very accurate.
An improvement to this program might be to replace the for loop with a macro implementation or ensure your compiler expands for loops inline. Otherwise you may also be including the addition operation for the loop index in your measurement.
I think it is likely that the error wouldn't scale linearly with problem size. For example if the operation you were trying to time took 1e9 to 1e15 times longer you might be able to get a decent estimate for GFLOPS. But, unless you know exactly what your compiler and architecture are doing with your code I wouldn't feel confident trying to estimate GFLOPS in a high level language like C++, perhaps assembly might work better (just a hunch).
I'm not saying it can't be done, but for accurate estimates there are a lot of things you might need to consider.
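For what it's worth, the arithmetic itself is simple. Here is a sketch that extends the program above (all of the accuracy caveats still apply, and it assumes the loop really executed one float addition per iteration):
// Continuing from the variables in the program above.
// Note: with only 1000 iterations clock() may report 0 elapsed time; increase iter if so.
double seconds = (end - start) / (double) CLOCKS_PER_SEC;  // total elapsed CPU time
double flops   = iter / seconds;                           // float additions per second
double gflops  = flops / 1e9;
std::cout << gflops << " GFLOPS (very rough estimate)" << std::endl;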

Vector vs Deque Insertion Time

I read this nice experiment comparing, in particular, the performance of calling insert() on both a vector and a deque container. The result from that particular experiment (Experiment 4) was that deque is vastly superior for this operation.
I implemented my own test using a short sorting function I wrote, which I should note uses the [] operator along with other member functions, and found vastly different results. For example, for inserting 100,000 elements, vector took 24.88 seconds, while deque took 374.35 seconds.
How can I explain this? I imagine it has something to do with my sorting function, but would like the details!
I'm using g++ 4.6 with no optimizations.
Here's the program:
#include <iostream>
#include <vector>
#include <deque>
#include <cstdlib>
#include <ctime>
using namespace std;

size_t InsertionIndex(vector<double>& vec, double toInsert) {
    for (size_t i = 0; i < vec.size(); ++i)
        if (toInsert < vec[i])
            return i;
    return vec.size(); // return last index+1 if toInsert is largest yet
}

size_t InsertionIndex(deque<double>& deq, double toInsert) {
    for (size_t i = 0; i < deq.size(); ++i)
        if (toInsert < deq[i])
            return i;
    return deq.size(); // return last index+1 if toInsert is largest yet
}

int main() {
    vector<double> vec;
    deque<double> deq;
    size_t N = 100000;

    clock_t tic = clock();
    for (int i = 0; i < N; ++i) {
        double val = rand();
        vec.insert(vec.begin() + InsertionIndex(vec, val), val);
        // deq.insert(deq.begin() + InsertionIndex(deq, val), val);
    }
    float total = (float)(clock() - tic) / CLOCKS_PER_SEC;
    cout << total << endl;
}
The special case where deque can be much faster than vector is when you're inserting at the front of the container. In your case you're inserting at random locations, which will actually give the advantage to vector.
Also unless you're using an optimized build, it's quite possible that there are bounds checks in the library implementation. Those checks can add significantly to the time. To do a proper benchmark comparison you must run with all normal optimizations turned on and debug turned off.
Your code is performing an insertion sort, which is O(n^2). Iterating over a deque is slower than iterating over a vector.
I suspect the reason you are not seeing the same result as in the posted link is that the run time of your program is dominated by the loop in InsertionIndex, not by the call to deque::insert (or vector::insert).
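A quick way to test that claim (a sketch, not from the answer above): replace the linear scan with std::upper_bound, so finding the insertion point is O(log n) and the measured time mostly reflects the insert itself.
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<double> vec;
    const size_t N = 100000;

    clock_t tic = clock();
    for (size_t i = 0; i < N; ++i) {
        double val = rand();
        // Binary search for the insertion point instead of the linear scan,
        // so the timing is dominated by vector::insert rather than the search.
        vec.insert(upper_bound(vec.begin(), vec.end(), val), val);
    }
    cout << float(clock() - tic) / CLOCKS_PER_SEC << endl;
}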