How to calculate GFLOPs for a funtion in c++ program? - c++

I have a c++ code which calculates factorial of int data type, addition of float data type and execution time of each function as follows:
long Sample_C:: factorial(int n)
{
int counter;
long fact = 1;
for (int counter = 1; counter <= n; counter++)
{
fact = fact * counter;
}
Sleep(100);
return fact;
}
float Sample_C::add(float a, float b)
{
return a+b;
}
int main(){
Sample_C object;
clock_t start = clock();
object.factorial(6);
clock_t end = clock();
double time =(double)(end - start);// finding execution time of factorial()
cout<< time;
clock_t starts = clock();
object.add(1.1,5.5);
clock_t ends = clock();
double total_time = (double)(ends -starts);// finding execution time of add()
cout<< total_time;
return 0;
}
Now , i want to have the mesure GFLOPs for "add " function. So, kindly suggest how will i caculate it. As, i am completly new to GFLOPs so kindly tell me wether we can have GFLOPs calculated for functions having only foat data types? and also GFLOPs value vary with different functions?

If I was interested in estimating the execution time of the addition operation I might start with the following program. However, I would still only trust the number this program produced to within a factor of 10 to 100 at best (i.e. I don't really trust the output of this program).
#include <iostream>
#include <ctime>
int main (int argc, char** argv)
{
// Declare these as volatile so the compiler (hopefully) doesn't
// optimise them away.
volatile float a = 1.0;
volatile float b = 2.0;
volatile float c;
// Preform the calculation multiple times to account for a clock()
// implementation that doesn't have a sufficient timing resolution to
// measure the execution time of a single addition.
const int iter = 1000;
// Estimate the execution time of adding a and b and storing the
// result in the variable c.
// Depending on the compiler we might need to count this as 2 additions
// if we count the loop variable.
clock_t start = clock();
for (unsigned i = 0; i < iter; ++i)
{
c = a + b;
}
clock_t end = clock();
// Write the time for the user
std::cout << (end - start) / ((double) CLOCKS_PER_SEC * iter)
<< " seconds" << std::endl;
return 0;
}
If you knew how your particular architecture was executing this code you could then try and estimate FLOPS from the execution time but the estimate for FLOPS (on this type of operation) probably wouldn't be very accurate.
An improvement to this program might be to replace the for loop with a macro implementation or ensure your compiler expands for loops inline. Otherwise you may also be including the addition operation for the loop index in your measurement.
I think it is likely that the error wouldn't scale linearly with problem size. For example if the operation you were trying to time took 1e9 to 1e15 times longer you might be able to get a decent estimate for GFLOPS. But, unless you know exactly what your compiler and architecture are doing with your code I wouldn't feel confident trying to estimate GFLOPS in a high level language like C++, perhaps assembly might work better (just a hunch).
I'm not saying it can't be done, but for accurate estimates there are a lot of things you might need to consider.

Related

What is causing the threads to execute slower than the serial case?

I have a simple function which computes the sum of "n" numbers.
I am attempting to use threads to implement the sum in parallel. The code is as follows,
void Add(double &sum, const int startIndex, const int endIndex)
{
sum = 0.0;
for (int i = startIndex; i < endIndex; i++)
{
sum = sum + 0.1;
}
}
int main()
{
int n = 100'000'000;
double sum1;
double sum2;
std::thread t1(Add, std::ref(sum1), 0, n / 2);
std::thread t2(Add, std::ref(sum2), n / 2, n);
t1.join();
t2.join();
std::cout << "sum: " << sum1 + sum2 << std::endl;
// double serialSum;
// Add(serialSum, 0, n);
// std::cout << "sum: " << serialSum << std::endl;
return 0;
}
However, the code runs much slower than the serial version. If I modify the function such that it does not take in the sum variable, then I obtain the desired speed-up (nearly 2x).
I read several resources online but all seem to suggest that variables must not be accessed by multiple threads. I do not understand why that would be the case for this example.
Could someone please clarify my mistake?.
The problem here is hardware.
You probably know that CPUs have caches to speed up operations. These caches are many times faster then memory but they work in units called cachelines. Probably 64 byte on your system. Your 2 doubles are each 8 byte large and near certainly will end up being in the same 64 byte region on the stack. And each core in a cpu generally have their own L1 cache while larger caches may be shared between cores.
Now when one thread accesses sum1 the core will load the relevant cacheline into the cache. When the second thread accesses sum2 the other core attempts to load the same cacheline into it's own cache. And the x86 architecture is so nice trying to help you it will ask the first cache to hand over the cacheline so both threads always see the same data.
So while you have 2 separate variables they are in the same cache line and on every access that cacheline bounces from one core to the other and back. Which is a rather slow operation. This is called false sharing.
So you need to put some separation between sum1 and sum2 to make this work fast. See std::hardware_destructive_interference_size for what distance you need to achieve.
Another, and probably way simpler, way is to modify the worker function to use local variables:
void Add(double &sum, const int startIndex, const int endIndex)
{
double t = 0.0;
for (int i = startIndex; i < endIndex; i++)
{
t = t + 0.1;
}
sum = t;
}
You still have false sharing and the two threads will fight over access to sum1 and sum2. But now it only happens once and becomes irrelevant.

Why is serial execution taking less time than parallel? [duplicate]

This question already has answers here:
OpenMP time and clock() give two different results
(3 answers)
Closed 3 years ago.
I have to add two vectors and compare serial performance against parallel performance.
However, my parallel code seems to take longer to execute than the serial code.
Could you please suggest changes to make the parallel code faster?
#include <iostream>
#include <time.h>
#include "omp.h"
#define ull unsigned long long
using namespace std;
void parallelAddition (ull N, const double *A, const double *B, double *C)
{
ull i;
#pragma omp parallel for shared (A,B,C,N) private(i) schedule(static)
for (i = 0; i < N; ++i)
{
C[i] = A[i] + B[i];
}
}
int main(){
ull n = 100000000;
double* A = new double[n];
double* B = new double[n];
double* C = new double[n];
double time_spent = 0.0;
for(ull i = 0; i<n; i++)
{
A[i] = 1;
B[i] = 1;
}
//PARALLEL
clock_t begin = clock();
parallelAddition(n, &A[0], &B[0], &C[0]);
clock_t end = clock();
time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
cout<<"time elapsed in parallel : "<<time_spent<<endl;
//SERIAL
time_spent = 0.0;
for(ull i = 0; i<n; i++)
{
A[i] = 1;
B[i] = 1;
}
begin = clock();
for (ull i = 0; i < n; ++i)
{
C[i] = A[i] + B[i];
}
end = clock();
time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
cout<<"time elapsed in serial : "<<time_spent;
return 0;
}
These are results:
time elapsed in parallel : 0.824808
time elapsed in serial : 0.351246
I've read on another thread that there are factors like spawning of threads, allocation of resources. But I don't know what to do to get the expected result.
EDIT:
Thanks! #zulan and #Daniel Langr 's answers actually helped!
I used omp_get_wtime() instead of clock().
It happens to be that clock() measures cumulative time of all threads as against omp_get_wtime() which can be used to measure the time elasped from an arbitrary point to some other arbitrary point
This answer too answers this query pretty well: https://stackoverflow.com/a/10874371/4305675
Here's the fixed code:
void parallelAddition (ull N, const double *A, const double *B, double *C)
{
....
}
int main(){
....
//PARALLEL
double begin = omp_get_wtime();
parallelAddition(n, &A[0], &B[0], &C[0]);
double end = omp_get_wtime();
time_spent += (double)(end - begin);
cout<<"time elapsed in parallel : "<<time_spent<<endl;
....
//SERIAL
begin = omp_get_wtime();
for (ull i = 0; i < n; ++i)
{
C[i] = A[i] + B[i];
}
end = omp_get_wtime();
time_spent += (double)(end - begin);
cout<<"time elapsed in serial : "<<time_spent;
return 0;
}
RESULT AFTER CHANGES:
time elapsed in parallel : 0.204763
time elapsed in serial : 0.351711
There are multiple factors that influence your measurements:
Use omp_get_wtime() as #zulan suggested, otherwise, you may actually calculate combined CPU time, instead of wall time.
Threading has some overhead and typically does not pay off for short calculations. You may want to use higher n.
"Touch" data in C array before running parallelAddition. Otherwise, the memory pages are actually allocated from OS inside parallelAddition. Easy fix since C++11: double* C = new double[n]{};.
I tried your program for n being 1G and the last change reduced runtime of parallelAddition from 1.54 to 0.94 [s] for 2 threads. Serial version took 1.83 [s], therefore, the speedup with 2 threads was 1.95, which was pretty close to ideal.
Other considerations:
Generally, if you profile something, make sure that the program has some observable effect. Otherwise, a compiler may optimize a lot of code away. Your array addition has no observable effect.
Add some form of restrict keyword to the C parameter. Without it, a compiler might not be able to apply vectorization.
If you are on a multi-socket system, take care about affinity of threads and NUMA effects. On my dual-socket system, runtime of a parallel version for 2 threads took 0.94 [s] (as mentioned above) when restricting threads to a single NUMA node (numactl -N 0 -m 0). Without numactl, it took 1.35 [s], thus 1.44 times more.

Time measurement not working in C++

I'm trying to make run time measurements of simple algorithms like linear sort. The problem is that no matter what I do, the time measurement won't work as intended. I get the same search time no matter what problem size I use. Both me and other people who've tried to help me are equally confused.
I have a linear sort function that looks like this:
// Search the N first elements of 'data'.
int linearSearch(vector<int> &data, int number, const int N) {
if (N < 1 || N > data.size()) return 0;
for (int i=0;i<N;i++) {
if (data[i] == number) return 1;
}
return 0;
}
I've tried to take time measurement with both time_t and chrono from C++11 without any luck, except more decimals. This is how it looks like right now when i'm searching.
vector<int> listOfNumbers = large list of numbers;
for (int i = 15000; i <= 5000000; i += 50000) {
const clock_t start = clock();
for (int a=0; a<NUMBERS_TO_SEARCH; a++) {
int randNum = rand() % INT_MAX;
linearSearch(listOfNumbers, randNum, i);
}
cout << float(clock() - start) / CLOCKS_PER_SEC << endl;
}
The result?
0.126, 0.125, 0.125, 0.124, 0.124, ... (same values?)
I have tried the code with both VC++, g++ and on different computers.
First I thought it was my implementation of the search algorithms that was at fault. But a linear sort like the one above can't become any simpler, it's clearly O(N). How can the time be the same even when the problem size is increased by so much? I'm at loss what to do.
Edit 1:
Someone else might have an explanation why this is the case. But it actually worked in release mode after changing:
if (data[i] == number)
To:
if (data.at(i) == number)
I have no idea why this is the case, but linear search could be time measured correctly after that change.
The reason for the about-constant execution times is that the compiler is able to optimize away parts of the code.
Specifically looking at this part of the code:
for (int a=0; a<NUMBERS_TO_SEARCH; a++) {
int randNum = rand() % INT_MAX;
linearSearch(listOfNumbers, randNum, i);
}
When compiling with g++5.2 and optimization level -O3, the compiler can optimize away the call to linearSearch() completely. This is because the result of the code is the same with or without that function being called.
The return value of linearSearch is not used anywhere, and the function does not seem to have side-effects. So the compiler can remove it.
You can cross-check and modify the inner loop as follows. The execution times shouldn't change:
for (int a=0; a<NUMBERS_TO_SEARCH; a++) {
int randNum = rand() % INT_MAX;
// linearSearch(listOfNumbers, randNum, i);
}
What remains in the loop is the call to rand(), and this is what you seem to be measuring. When changing the data[i] == number to data.at(i) == number, the call to linearSearch is not side-effects-free as at(i) may throw an out-of-range exception. So the compiler does not completely optimize the linearSearch code away. However, with g++5.2, it will still inline it and not make a function call.
clock() is measuring CPU time, maybe you want time(NULL)? check this issue
The start should be before the for loop. In your case the start is different for each iteration, it is constant between the { ... }.
const clock_t start = clock();
for (int i = 15000; i <= 5000000; i += 50000){
...
}

Timing operator+, operator-, operator*, and operator/

I have a simple math vector class with operators overloaded. I would like to get some timing results for my operators. I can easily time my +=, -=, *=, and /= by timing the following code:
Vector sum;
for(size_t i = 0; i<iter; ++i)
sum += RandVector();
cout << sum << endl;
Then I can subtract the time it takes to generate iter random vectors. In my tests, Vector is 3 dimensional, iter = 10,000,000.
I tried to do a similar thing with +,-,*,/:
Vector sum;
for(size_t i = 0; i<iter; ++i)
sum = sum + RandVector();
cout << sum << endl;
Then subtract the time it takes to generate iter random vectors and perform iter assignments, however this gives a "negative" time, leading me to believe either the compiler is optimizing the operation somehow, or something strange is going on.
I am using gcc-4.7.2 using -O3 on a Fedora Linux machine.
Here is my timing code:
clock_t start, k = clock();
do start = clock();
while(start == k);
F()();
clock_t end = clock();
double time = double(end-start)/double(CLOCKS_PER_SEC);
cout << time - time_iter_rand_v - time_iter_ass;
Here F is a function object which performs the code above. time_iter_rand_v is the time it takes to create iter random vectors and time_iter_ass is the time it took for iter assignment operations.
My question is then how to get accurate timing of just the operator+ function not any assignments or random vector generation?
You really can't get an accurate timing for something like that when optimization is on. The reason is due to the fact that the compiler has the ability to move code around.
If you make the time storage variables volatile, the position of them relative to each other are not subject to optimisation due to moving. However, the code around them are, unless they are assigning or calling functions that take volatile variables (this includes a volatile member function which makes *this volatile).
Optimisation can do a lot of odd things to the code if you are expecting linear execution.
Just create a vector of RandVector()s and iterate through them. It will solve the problem of measuring the time of generation.
As for assignment I think that it comes down to how compiler optimizes it.
One basic benchmarking method is to use gettimeofday :
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <cstring>
//------------------- Handle time in milliseconds ----------------------//
/*
* Return 1 if the difference is negative, otherwise 0.
*/
int timeval_subtract(struct timeval *result, struct timeval *t2, struct timeval *t1)
{
long int diff = (t2->tv_usec + 1000000 * t2->tv_sec) - (t1->tv_usec + 1000000 * t1->tv_sec);
result->tv_sec = diff / 1000000;
result->tv_usec = diff % 1000000;
return (diff<0);
}
void timeval_print(struct timeval *tv)
{
char buffer[30];
time_t curtime;
printf("%ld.%06ld", tv->tv_sec, tv->tv_usec);
curtime = tv->tv_sec;
strftime(buffer, 30, "%m-%d-%Y %T", localtime(&curtime));
printf(" = %s.%06ld\n", buffer, tv->tv_usec);
}
// usage :
/*
struct timeval tvBegin, tvEnd, tvDiff;
// begin
gettimeofday(&tvBegin, NULL);
// lengthy operation
int i,j;
for(i=0;i<999999L;++i) {
j=sqrt(i);
}
//end
gettimeofday(&tvEnd, NULL);
// diff
timeval_subtract(&tvDiff, &tvEnd, &tvBegin);
printf("%ld.%06ld\n", tvDiff.tv_sec, tvDiff.tv_usec);
*/

Puzzling behaviour of async

This might be some weird Linux quirk, but I'm observing very strange behavior.
The following code should compare a synchronized version of summing numbers with an async version. The thing is that I'm seeing a performance increase (it's not caching, it happens even when I split the code into two separate programs), while still observing the program as single-threaded (only one core is used).
strace does show some thread activity, but monitoring tools like top clones still show only one used core.
Second problem I'm observing is that if I increase the spawn ratio, the memory usage just explodes. What is the memory overhead of a thread? With 5000 threads I get ~10GB memory usage.
#include <iostream>
#include <random>
#include <chrono>
#include <future>
using namespace std;
long long sum2(const vector<int>& v, size_t from, size_t to)
{
const size_t boundary = 5*1000*1000;
if (to-from <= boundary)
{
long long rsum = 0;
for (;from < to; from++)
{
rsum += v[from];
}
return rsum;
}
else
{
size_t mid = from + (to-from)/2;
auto s2 = async(launch::async,sum2,cref(v),mid,to);
long long rsum = sum2(v,from,mid);
rsum += s2.get();
return rsum;
}
}
long long sum2(const vector<int>& v)
{
return sum2(v,0,v.size());
}
long long sum(const vector<int>& v)
{
long long rsum = 0;
for (auto i : v)
{
rsum += i;
}
return rsum;
}
int main()
{
const size_t vsize = 100*1000*1000;
vector<int> x;
x.reserve(vsize);
mt19937 rng;
rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
uniform_int_distribution<uint32_t> dist(0,10);
for (auto i = 0; i < vsize; i++)
{
x.push_back(dist(rng));
}
auto start = chrono::high_resolution_clock::now();
long long suma = sum(x);
auto end = chrono::high_resolution_clock::now();
cout << "Sum is " << suma << endl;
cout << "Duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;
start = chrono::high_resolution_clock::now();
suma = sum2(x);
end = chrono::high_resolution_clock::now();
cout << "Async sum is " << suma << endl;
cout << "Async duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;
return 0;
}
Maybe you observe one core being used because the overlap between threads doing work simultaneously is too short to be noticeable. Summing 5mln values from a continuous area of memory should be very fast on modern hardware, so by the time parent finishes summing, child may have barely started and parent may be spending most or all of the time waiting for the result from the child. Have you tried to increase work unit to see if the overlap becomes noticeable?
Regarding increased performance: even if there is 0 overlap between threads because of too small work unit, multithreaded version can still benefit from additional L1 cache memory. For such a test, memory will likely be a bottleneck and sequential version will use only one L1 cache while multithreaded version will use as many as there are cores.
Have you checked the times that are being printed? On my machine, the serial time is under 1s at -O2, whilst the parallel sum time is several times faster. It's therefore entirely possible that the CPU usage is not enough for long enough for things like "top" to register, since they typically only refresh once per second.
If you increase the number of threads by reducing the count-per-thread, then you effectively increase the overhead of the thread management. If you have 5000 threads active, then your task will take 5000* min-thread-stack-size in additional memory. On my machine that's 20Gb!
Why don't you try increasing the size of the source container? If you make the parallel section take long enough, you'll see the corresponding parallel CPU usage. However, be prepared: summing integers is fast, and the time taken generating the random numbers can take an order of magnitude or two longer than the time to add the numbers together.