Sequential is faster than multi-threaded - OpenMP - C++

I am using C++ and OpenMP to parallelize an algorithm that finds the convex hull.
But I am not able to get the expected speedup; in fact, the sequential algorithm is faster.
The input and output sets of points are stored in arrays.
Could you please look into the code and let me know what corrections are needed?
Point *points = new Point[inp_size]; // contains the input
int th_id;
omp_set_num_threads(nthreads);
clock_t t1, t2;
t1 = clock();
#pragma omp parallel private(th_id)
{
    th_id = omp_get_thread_num();
    // ... the only function called ...
    findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size);
}
t2 = clock();
float diff((float)t2 - (float)t1);
float seconds = diff / CLOCKS_PER_SEC;
std::cout << "Time Elapsed in seconds:" << seconds << '\n';

int findParallelUCHWOUP(Point iv[], int id, int thread_num, int inp_size) {
    int numElems = inp_size / thread_num;
    int first = (id - 1) * numElems;
    int last;
    if (id == thread_num) {
        last = inp_size - 1;
    } else {
        last = id * numElems - 1;
    }

    output[first] = iv[first];
    std::stack<int> s;
    s.push(first);

    int i = first + 1;
    while (i < last) {
        if (crossProduct(iv, i, first, last) > 0) {
            s.push(i);
            i++;
            break;
        } else {
            i++;
        }
    }
    if (i == last) {
        s.push(last);
        return 0;
    }

    for (; i <= last; i++) {
        if (crossProduct(iv, i, first, last) >= 0) {
            while (s.size() > 1 && crossProduct(iv, s.top(), second(s), i) <= 0) {
                s.pop();
            }
            s.push(i);
        }
    }

    int count = s.size();
    sizes[id - 1] = count;
    while (!s.empty()) {
        output[first + count - 1] = iv[s.top()];
        s.pop();
        count--;
    }
    return 0;
}
Tested on this machine:

Sequential Time: 0.016466
Using two threads: 0.022979
Using four threads: 0.035213
Using 8 threads: 0.03315

Machine used: MacBook Pro
Processor: 2.5 GHz Intel Core i5 (at least 4 logical cores)
Memory: 4 GB 1600 MHz
Compiler: Mac OSX Compiler

The problem is the way you measure time. clock() reports CPU time used by the process (summed over all threads on most platforms), not wall-clock time, so the more threads you run, the larger the value it returns. As a rough correction you could write something like:

diff / (float) (CLOCKS_PER_SEC * nthreads)

but this is only an approximation (and not always accurate). You'd better use the OpenMP timing functions instead.
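For example, the same measurement with omp_get_wtime(), which returns wall-clock seconds, would look roughly like this (a minimal sketch using the variables from the code above):

double start = omp_get_wtime();               // wall-clock time before the parallel region
#pragma omp parallel private(th_id)
{
    th_id = omp_get_thread_num();
    findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size);
}
double end = omp_get_wtime();                 // wall-clock time after the parallel region
std::cout << "Time Elapsed in seconds:" << (end - start) << '\n';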

Related

How to benchmark CUDA programs?

I was trying to benchmark my first CUDA application that adds two arrays first using the CPU and then using the GPU.
Here is the program.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include<iostream>
#include<chrono>
using namespace std;
using namespace std::chrono;
// add two arrays
void add(int n, float *x, float *y) {
for (int i = 0; i < n; i++) {
y[i] += x[i];
}
}
__global__ void addParallel(int n, float *x, float *y) {
int i = threadIdx.x;
if (i < n)
y[i] += x[i];
}
void printElapseTime(std::chrono::microseconds elapsed_time) {
cout << "completed in " << elapsed_time.count() << " microseconds" << endl;
}
int main() {
// generate two arrays of million float values each
cout << "Generating two lists of a million float values ... ";
int n = 1 << 28;
float *x, *y;
cudaMallocManaged(&x, sizeof(float)*n);
cudaMallocManaged(&y, sizeof(float)*n);
// begin benchmark array generation
auto begin = high_resolution_clock::now();
for (int i = 0; i < n; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// end benchmark array generation
auto end = high_resolution_clock::now();
auto elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition cpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using CPU ... ";
add(n, x, y);
// end benchmark addition cpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition gpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using GPU ... ";
addParallel << <1, 1024 >> > (n, x, y);
cudaDeviceSynchronize();
// end benchmark addition gpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
cudaFree(x);
cudaFree(y);
return 0;
}
Surprisingly though, the program is generating the following output.
Generating two lists of a million float values ... completed in 13343211 microseconds
Adding both arrays using CPU ... completed in 543994 microseconds
Adding both arrays using GPU ... completed in 3030147 microseconds
I wonder where exactly I am going wrong. Why is the GPU computation taking 6 times longer than the one running on the CPU?
For reference, I'm running Windows 10 on an Intel i7 8750H with an Nvidia GTX 1060.
Note that each of your unified-memory arrays contains 268 million floats, so about 1 GB per array has to be migrated to the device when you invoke your kernel. Run a GPU profiler (nvprof, nvvp, or Nsight) and you should see a HtoD transfer taking the bulk of your computation time.
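As a rough illustration of separating the migration cost from the kernel cost (a sketch, assuming a CUDA 8+ toolkit and a GPU that supports managed-memory prefetching; x, y and n as in the program above), the managed arrays could be prefetched to the device before the timed launch:

// Prefetch the managed arrays to the GPU so the host-to-device migration
// is not attributed to the kernel in the timing below.
int device = 0;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, sizeof(float) * n, device);
cudaMemPrefetchAsync(y, sizeof(float) * n, device);
cudaDeviceSynchronize();                      // wait for the transfers to finish

begin = high_resolution_clock::now();
addParallel<<<1, 1024>>>(n, x, y);            // same launch as in the question
cudaDeviceSynchronize();
end = high_resolution_clock::now();
printElapseTime(duration_cast<microseconds>(end - begin));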

C++ OpenMP Fibonacci: 1 thread performs much faster than 4 threads

I'm trying to understand why the following runs much faster on 1 thread than on 4 threads with OpenMP. The code is based on a similar question, OpenMP recursive tasks, but when I try to implement one of the suggested answers I don't get the intended speedup, which suggests I've done something wrong (and I'm not sure what it is). Do people get better speed when running the code below on 4 threads than on 1 thread? I'm getting a 10x slowdown when running on 4 cores (I should be getting a moderate speedup rather than a significant slowdown).
#include <iostream>
#include <omp.h>
using namespace std;

int fib(int n)
{
    if (n == 0 || n == 1)
        return n;
    if (n < 20) // EDITED CODE TO INCLUDE CUTOFF
        return fib(n - 1) + fib(n - 2);
    int res, a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main()
{
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            cout << fib(25) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}
Have you tried it with a larger number?

In multi-threading, it takes some time to set up the work on the CPU cores. For small jobs, which finish very quickly on a single core, threading slows the job down because of this overhead. Multi-threading pays off when the job normally takes longer than a second, not milliseconds.

There is also another bottleneck for threading: if your code tries to create too many threads or tasks, typically through recursion, this can delay all running threads and cause a massive slowdown.

In this OpenMP/Tasks wiki page this is mentioned, and a manual cutoff is suggested. You need two versions of the function: when the recursion goes too deep, it continues with the single-threaded version.

EDIT: the cutoff variable needs to be increased before entering the OMP zone.

The following code is for the OP to test:
#include <iostream>
#include <omp.h>
using namespace std;

#define CUTOFF 5

int fib_s(int n)
{
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    a = fib_s(n - 1);
    b = fib_s(n - 2);
    res = a + b;
    return res;
}

int fib_m(int n, int co)
{
    if (co >= CUTOFF) return fib_s(n);
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    co++;
    #pragma omp task shared(a)
    a = fib_m(n - 1, co);
    #pragma omp task shared(b)
    b = fib_m(n - 2, co);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main()
{
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            cout << fib_m(25, 1) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}
RESULT:

With the CUTOFF value set to 10, it took under 8 seconds to calculate the 45th term.

co = 1     14.5 s
co = 2      9.5 s
co = 3      6.4 s
co = 10     7.5 s
co = 15     7.0 s
co = 20     8.5 s
co = 21    >18.0 s
co = 22    >40.0 s
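A possible alternative to keeping two copies of the function: OpenMP 3.1 and later also provide a final clause on the task construct, which can express the same depth cutoff. Once a task is final, every task created inside it is executed immediately by the encountering thread, so the recursion below the cutoff runs serially. A minimal sketch along those lines, reusing the CUTOFF macro and the co depth counter from above (the hand-written fib_s cutoff can still be slightly faster, because the pragma is still encountered at every level):

int fib_f(int n, int co)
{
    if (n == 0 || n == 1)
        return n;
    int a, b;
    co++;
    // Once co >= CUTOFF the task is 'final': every task it spawns is run
    // immediately by the same thread, so the subtree below runs serially.
    #pragma omp task shared(a) final(co >= CUTOFF)
    a = fib_f(n - 1, co);
    #pragma omp task shared(b) final(co >= CUTOFF)
    b = fib_f(n - 2, co);
    #pragma omp taskwait
    return a + b;
}

It would be called the same way as fib_m, e.g. fib_f(45, 1) inside the single region.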
I could not figure out how to tell the compiler not to create parallel tasks after a certain depth: omp_set_max_active_levels seems to have no effect, and omp_set_nested is deprecated (and also has no effect).

So I have to specify manually after which level no more tasks should be created, which IMHO is sad. I still believe there should be a way to do this (if somebody knows, kindly let me know). Here is how I attempted it; for input sizes above 20 the parallel version runs a bit faster than the serial one (taking roughly 70-80% of the time).

Ref: the code is taken from an assignment in this course (the solution was not provided, so I don't know how to do it efficiently): https://www.cs.iastate.edu/courses/2018/fall/com-s-527x
#include <stdio.h>
#include <omp.h>
#include <math.h>

int fib(int n, int rec_height)
{
    int x = 1, y = 1;
    if (n < 2)
        return n;
    if (rec_height > 0) // Surprisingly, without this check the parallel code is slower than the serial one (I believe it should not be needed, I just don't know how to use OpenMP)
    {
        rec_height -= 1;
        #pragma omp task shared(x)
        x = fib(n - 1, rec_height);
        #pragma omp task shared(y)
        y = fib(n - 2, rec_height);
        #pragma omp taskwait
    }
    else {
        x = fib(n - 1, rec_height);
        y = fib(n - 2, rec_height);
    }
    return x + y;
}

int main()
{
    int tot_thread = 16;
    int recDepth = (int)log2f(tot_thread);
    if (((int)pow(2, recDepth)) < tot_thread) recDepth += 1;
    printf("\nrecDepth: %d\n", recDepth);
    omp_set_max_active_levels(recDepth);
    omp_set_nested(recDepth - 1);

    int n, fibonacci;
    double starttime;
    printf("\nPlease insert n, to calculate fib(n): ");
    scanf("%d", &n);
    omp_set_num_threads(tot_thread);
    starttime = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            fibonacci = fib(n, recDepth);
        }
    }
    printf("\n\nfib(%d)=%d \n", n, fibonacci);
    printf("calculation took %lf sec\n", omp_get_wtime() - starttime);
    return 0;
}

optimize c++ query to calculate Nmin

I have run into a problem while trying to optimize my code, which calculates Nmin values for increasing values of N and an error approximation.
I am not from a programming background and have just started to take it up.
The calculation below is inefficient because it keeps computing even after Nmin has been found.
To reduce the time I made the changes below, reducing function calls, but with no improvement:
#include <iostream>
#include <cmath>
#include <time.h>
#include <iomanip>
using namespace std;

double f(int);

int main(void)
{
    double err;
    double pi = 4.0 * atan(1.0);
    cout << fixed << setprecision(7);
    clock_t start = clock();
    for (int n = 1;; n++)
    {
        if ((f(n) - pi) >= 1e-6)
        {
            cout << "n_min is " << n << "\t" << f(n) - pi << endl;
        }
        else
        {
            break;
        }
    }
    clock_t stop = clock();
    //double elapsed = (double)(stop - start) * 1000.0 / CLOCKS_PER_SEC; // this one in ms
    cout << "time: " << (stop - start) / double(CLOCKS_PER_SEC) * 1000 << endl; // this one is in ms as well
    return 0;
}

double f(int n)
{
    double sum = 0;
    for (int i = 1; i <= n; i++)
    {
        sum += 1 / (1 + pow((i - 0.5) / n, 2));
    }
    return (4.0 / n) * sum;
}
Is there any way to reduce the time and make this version more efficient?
Any help would be greatly appreciated.
I do not see any immediate way of optimizing the algorithm itself. You could however reduce the time significantly by not writing to the standard output for every iteration. Also, do not calculate f(n) more than once per iteration.
for (int n = 1;; n++)
{
    double val = f(n);
    double diff = val - pi;
    if (diff < 1e-6)
    {
        cout << "n_min is " << n << "\t" << diff << endl;
        break;
    }
}
Note however that this will yield a higher n_min (increased by 1 compared to the result of your version) since we changed the condition to diff < 1e-6.
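For completeness, a minimal self-contained sketch with both suggestions applied (the same f as in the question) might look like this:

#include <iostream>
#include <cmath>
#include <iomanip>
#include <ctime>
using namespace std;

// same series as in the question: a midpoint-rule sum approximating
// pi = integral of 4/(1+x^2) over [0,1]
double f(int n)
{
    double sum = 0;
    for (int i = 1; i <= n; i++)
        sum += 1 / (1 + pow((i - 0.5) / n, 2));
    return (4.0 / n) * sum;
}

int main()
{
    const double pi = 4.0 * atan(1.0);
    cout << fixed << setprecision(7);

    clock_t start = clock();
    for (int n = 1;; n++)
    {
        double diff = f(n) - pi;   // evaluate f(n) only once per iteration
        if (diff < 1e-6)
        {
            cout << "n_min is " << n << "\t" << diff << endl;  // print only the final result
            break;
        }
    }
    clock_t stop = clock();
    cout << "time: " << (stop - start) / double(CLOCKS_PER_SEC) * 1000 << " ms" << endl;
    return 0;
}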

Are loops really faster than recursion?

According to my professor, loops are faster and more efficient than recursion, yet I came up with this C++ code that calculates the Fibonacci series using both recursion and loops, and the results show they are very similar. So I maxed out the possible input to see if there was a difference in performance, and for some reason recursion clocked in better than the loop. Anyone know why? Thanks in advance.
Here's the code:
#include "stdafx.h"
#include "iostream"
#include <time.h>
using namespace std;
double F[200000000];
//double F[5];
/*int Fib(int num)
{
if (num == 0)
{
return 0;
}
if (num == 1)
{
return 1;
}
return Fib(num - 1) + Fib(num - 2);
}*/
double FiboNR(int n) // array of size n
{
for (int i = 2; i <= n; i++)
{
F[i] = F[i - 1] + F[i - 2];
}
return (F[n]);
}
double FibMod(int i,int n) // array of size n
{
if (i==n)
{
return F[i];
}
F[i] = F[i - 1] + F[i - 2];
return (F[n]);
}
int _tmain(int argc, _TCHAR* argv[])
{
/*cout << "----------------Recursion--------------"<<endl;
for (int i = 0; i < 36; i=i+5)
{
clock_t tStart = clock();
cout << Fib(i);
printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
cout << " : Fib(" << i << ")" << endl;
}*/
cout << "----------------Linear--------------"<<endl;
for (int i = 0; i < 200000000; i = i + 20000000)
//for (int i = 0; i < 50; i = i + 5)
{
clock_t tStart = clock();
F[0] = 0; F[1] = 1;
cout << FiboNR(i);
printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
cout << " : Fib(" << i << ")" << endl;
}
cout << "----------------Recursion Modified--------------" << endl;
for (int i = 0; i < 200000000; i = i + 20000000)
//for (int i = 0; i < 50; i = i + 5)
{
clock_t tStart = clock();
F[0] = 0; F[1] = 1;
cout << FibMod(0,i);
printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
cout << " : Fib(" << i << ")" << endl;
}
std::cin.ignore();
return 0;
}
If you go by the conventional programming approach, loops are faster. But there is a category of languages called functional programming languages which do not contain loops. I am a big fan of functional programming and an avid Haskell user; Haskell is a functional programming language. In it, instead of loops you use recursion. To implement fast recursion there is something known as tail recursion. Basically, to avoid pushing a lot of extra information onto the system stack, you write the function in such a way that the whole computation is carried in the function parameters, so nothing needs to be stored on the stack other than the function call pointer. Then, once the final recursive call has been made, instead of unwinding the stack the program just needs to return to the first function call's stack entry. Functional-language compilers are built to handle this, and nowadays even non-functional languages implement tail-call optimization.

For example, consider the recursive solution for finding the factorial of a positive number. The basic implementation in C would be
int fact(int n)
{
    if (n == 1 || n == 0)
        return 1;
    return n * fact(n - 1);
}
In the above approach, each time the function is called, n is stored on the stack so that it can later be multiplied with the result of fact(n-1); this happens while the stack unwinds. Now check out the following implementation.
int fact(int n, int result)
{
    if (n == 1 || n == 0)
        return result;
    return fact(n - 1, n * result);
}
In this approach we pass the running computation along in the parameter result, so at the end the answer is directly available in result. The only thing you have to do is pass 1 for result in the initial call. The stack can then be unwound directly to its first entry. Of course, I am not sure whether C or C++ compilers detect tail recursion, but functional programming languages do.
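Conceptually, the accumulator version above is what a tail-call-optimizing compiler can rewrite into a plain loop; GCC and Clang typically do this at -O2, although neither the C nor the C++ standard requires it. A rough sketch of the loop it effectively becomes:

int fact_loop(int n)
{
    int result = 1;      // plays the role of the 'result' parameter
    while (n > 1)
    {
        result *= n;     // n * result, as in the recursive call
        --n;             // the recursive call becomes the next iteration
    }
    return result;
}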
Your "recursion modified" version doesn't have recursion at all.
In fact, the only thing enabling a non-recursive version that fills in exactly one new entry of the array is the for-loop in your main function -- so it is actually a solution using iteration also (props to immibis and BlastFurnace for noticing that).
But your version doesn't even do that correctly. Rather, since it is always called with i == 0, it illegally reads F[-1] and F[-2]. You are lucky (?)[1] the program didn't crash.
The reason you are getting correct results is that the entire F array is prefilled by the correct version.
Your attempt to calculate Fib(2000....) isn't successful anyway, since you overflow a double. Did you even try running that code?
Here's a version that works correctly (to the precision of double, anyway) and doesn't use a global array (it really is iteration vs recursion and not iteration vs memoization).
#include <cstdio>
#include <ctime>
#include <utility>
double FiboIterative(int n)
{
double a = 0.0, b = 1.0;
if (n <= 0) return a;
for (int i = 2; i <= n; i++)
{
b += a;
a = b - a;
}
return b;
}
std::pair<double,double> FiboRecursive(int n)
{
if (n <= 0) return {};
if (n == 1) return {0, 1};
auto rec = FiboRecursive(n-1);
return {rec.second, rec.first + rec.second};
}
int main(void)
{
const int repetitions = 1000000;
const int n = 100;
volatile double result;
std::puts("----------------Iterative--------------");
std::clock_t tStart = std::clock();
for( int i = 0; i < repetitions; ++i )
result = FiboIterative(n);
std::printf("[%d] = %f\n", n, result);
std::printf("Time taken: %.2f us\n", (std::clock() - tStart) / 1.0 / CLOCKS_PER_SEC);
std::puts("----------------Recursive--------------");
tStart = std::clock();
for( int i = 0; i < repetitions; ++i )
result = FiboRecursive(n).second;
std::printf("[%d] = %f\n", n, result);
std::printf("Time taken: %.2f us\n", (std::clock() - tStart) / 1.0 / CLOCKS_PER_SEC);
return 0;
}
--
[1] Arguably anything that hides a bug is actually unlucky.
I don't think this is a bad question. But maybe the answer to "why" is somewhat interesting.

At first let me say that generally the statement is probably true. But...

Questions about the performance of C++ programs are very localized; it's never possible to give a good general answer. Every example should be profiled and analyzed separately, and that involves lots of technicalities. C++ compilers are allowed to modify the program practically as they wish, as long as they don't produce visible side effects (whatever precisely that means); as long as your computation gives the same result, it's fine. This technically allows transforming one version of your program into an equivalent one, even from the recursive version into the loop-based one and vice versa. So it depends on compiler optimizations and compiler effort.

Also, to compare one version to another you would need to prove that the versions you compare are actually equivalent.

It might also happen that a recursive implementation of an algorithm is faster than a loop-based one if it's easier for the compiler to optimize. Usually iterative versions are more complex, and generally the simpler the code is, the easier it is for the compiler to optimize, because it can make assumptions about invariants, etc.

OpenMP C++ Not able to get linear speedup with number of processors [closed]

Please see the below results and let me know where I can optimise my code further to get a better speedup.
Result

Machine I: MacBook Pro
Processor: 2.5 GHz Intel Core i5 (at least 4 logical cores)
Memory: 4 GB 1600 MHz
Compiler: Mac OSX Compiler

Sequential Time: 0.016466
Using two threads: 0.022979 -> 0.0120111
Using four threads: 0.0109911 (Speed Up ~ 1.5)
Using 8 threads: 0.0111289

Machine II:
OS: Linux
Hardware: Intel(R) Core™ i5-3550 CPU @ 3.30GHz × 4
Memory: 7.7 GiB
Compiler: G++ Version 4.6

Sequential Time: 0.0128901
Using two threads: 0.00838804
Using four threads: 0.00612688 (Speed up = 2)
Using 8 threads: 0.0101049
Please let me know what overhead in my code is preventing a linear speedup. There is not much to it: I am calling the function "findParallelUCHWOUP" in the main function like this:

#pragma omp parallel for private(th_id)
for (th_id = 0; th_id < nthreads; th_id++)
    findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size, first[th_id], last[th_id]);
Code:
class Point {
    double i, j;
public:
    Point() {
        i = 0;
        j = 0;
    }
    Point(double x, double y) {
        i = x;
        j = y;
    }
    double x() const {
        return i;
    }
    double y() const {
        return j;
    }
    void setValue(double x, double y) {
        i = x;
        j = y;
    }
};

typedef std::vector<Point> Vector;

int second(std::stack<int> &s);
double crossProduct(Point v[], int a, int b, int c);

bool myfunction(Point a, Point b) {
    return ((a.x() < b.x()) || (a.x() == b.x() && a.y() < b.y()));
}

class CTPoint {
    int i, j;
public:
    CTPoint() {
        i = 0;
        j = 0;
    }
    CTPoint(int x, int y) {
        i = x;
        j = y;
    }
    double getI() const {
        return i;
    }
    double getJ() const {
        return j;
    }
};

const int nthreads = 4;
const int inp_size = 1000000;
Point output[inp_size];
int numElems = inp_size / nthreads;
int sizes[nthreads];
CTPoint ct[nthreads][nthreads];

// function that is called from different threads
int findParallelUCHWOUP(Point* iv, int id, int thread_num, int inp_size, int first, int last) {
    output[first] = iv[first];
    std::stack<int> s;
    s.push(first);

    int i = first + 1;
    while (i < last) {
        if (crossProduct(iv, i, first, last) > 0) {
            s.push(i);
            i++;
            break;
        } else {
            i++;
        }
    }
    if (i == last) {
        s.push(last);
        return 0;
    }

    for (; i <= last; i++) {
        if (crossProduct(iv, i, first, last) >= 0) {
            while (s.size() > 1 && crossProduct(iv, s.top(), second(s), i) <= 0) {
                s.pop();
            }
            s.push(i);
        }
    }

    int count = s.size();
    sizes[id - 1] = count;
    while (!s.empty()) {
        output[first + count - 1] = iv[s.top()];
        s.pop();
        count--;
    }
    return 0;
}

double crossProduct(Point* v, int a, int b, int c) {
    return (v[c].x() - v[b].x()) * (v[a].y() - v[b].y())
         - (v[a].x() - v[b].x()) * (v[c].y() - v[b].y());
}

int second(std::stack<int> &s) {
    int temp = s.top();
    s.pop();
    int sec = s.top();
    s.push(temp);
    return sec;
}

// reads points from a file and divides the array of points among the threads
int main(int argc, char *argv[]) {
    // read points from a file and assign them to the input array.
    Point *points = new Point[inp_size];
    unsigned i = 0;
    while (i < Points.size()) {
        points[i] = Points[i];
        i++;
    }

    numElems = inp_size / nthreads;
    int first[nthreads];
    int last[nthreads];
    for (int i = 1; i <= nthreads; i++) {
        first[i - 1] = (i - 1) * numElems;
        if (i == nthreads) {
            last[i - 1] = inp_size - 1;
        } else {
            last[i - 1] = i * numElems - 1;
        }
    }

    /* Parallel code starts here */
    int th_id;
    omp_set_num_threads(nthreads);
    double start = omp_get_wtime();

    #pragma omp parallel for private(th_id)
    for (th_id = 0; th_id < nthreads; th_id++)
        findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size, first[th_id], last[th_id]);
    /* Parallel code ends here */

    double end = omp_get_wtime();
    double diff = end - start;
    std::cout << "Time Elapsed in seconds:" << diff << '\n';
    return 0;
}
Threading in general, and OpenMP in your particular case, does introduce a certain amount of overhead that essentially prevents you from getting a "real" linear speedup. You have to account for that.
Second, the runtime of your test is extremely short (I assume the times measured are seconds?). At that level you also run into issues with the precision of timing, because a very small amount of overhead has a large impact on the measured result.
Last, you're also dealing with memory access here, and if both the chunks you are processing and the stack you're creating don't fit into the processor cache, you also have to account for the overhead of fetching data from memory. The latter gets worse when multiple threads read from, and possibly write to, the same area of memory: that results in invalidated cache lines, which means your cores end up waiting for data to be fetched into the cache or written back to main memory.
I would massively increase the size of your data so the runtimes are in the seconds range, for starters, then measure again. The longer your test code runs, the better, because the startup and general overhead of the threading will play less of a role as you do more processing.
Once you have established a better baseline, you'll probably need a good profiler that gives you deeper insight into threading, to see where the hotspots in your code are. It's not unusual to have to roll custom data structures for the parallelized part to improve the performance.
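To illustrate the cache-line point with the code above (a sketch, assuming a C++11 compiler and 64-byte cache lines, not a measured fix): all threads store their result count into neighbouring elements of the shared sizes array, so those stores hit the same cache line. Padding each per-thread slot onto its own cache line removes that particular false-sharing pattern:

// Hypothetical padded replacement for "int sizes[nthreads];":
// alignas(64) puts each element on its own 64-byte cache line, so threads
// no longer invalidate each other's line when they store their counts.
struct alignas(64) PaddedCount {
    int value;
};
PaddedCount sizes_padded[nthreads];

// In findParallelUCHWOUP the store would then become
//     sizes_padded[id - 1].value = count;    // instead of sizes[id - 1] = count;

Whether this matters much here depends on how often that line is written; the bigger wins are likely to come from a larger input and longer-running measurements as described above.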