C++ OpenMP Fibonacci: 1 thread performs much faster than 4 threads

I'm trying to understand why the following code runs much faster on 1 thread than on 4 threads with OpenMP. The code is based on a similar question: OpenMP recursive tasks. When I try to implement one of the suggested answers, I don't get the intended speedup, which suggests I've done something wrong (I'm just not sure what). Do you get better speed running the code below on 4 threads than on 1? I'm seeing a 10x slowdown on 4 cores, where I would expect a moderate speedup rather than a significant slowdown.
#include <iostream>
#include <omp.h>
using namespace std;

int fib(int n)
{
    if (n == 0 || n == 1)
        return n;
    if (n < 20) // EDITED CODE TO INCLUDE CUTOFF
        return fib(n - 1) + fib(n - 2);
    int res, a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main()
{
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            cout << fib(25) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}

Have you tried it with a larger number?
In multi-threading, it takes some time to initialize work on the CPU cores. For small jobs that finish very quickly on a single core, threading slows the job down because of this overhead.
Multi-threading gives a speedup when the job normally takes longer than a second, not milliseconds.
There is also another bottleneck for threading. If your code tries to create too many threads (usually via recursion), this can delay all running threads and cause a massive setback.
On this OpenMP/Tasks wiki page the issue is mentioned and a manual cutoff is suggested: you need two versions of the function, and when the recursion goes too deep it continues with single threading.
EDIT: the cutoff variable needs to be increased before entering the OMP zone.
The following code is for the OP to test:
#include <iostream>
#include <omp.h>
using namespace std;

#define CUTOFF 5

// Serial version: used once the recursion is deeper than CUTOFF.
int fib_s(int n)
{
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    a = fib_s(n - 1);
    b = fib_s(n - 2);
    res = a + b;
    return res;
}

// Task version: spawns tasks until the depth counter co reaches CUTOFF.
int fib_m(int n, int co)
{
    if (co >= CUTOFF) return fib_s(n);
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    co++;
    #pragma omp task shared(a)
    a = fib_m(n - 1, co);
    #pragma omp task shared(b)
    b = fib_m(n - 2, co);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main()
{
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            cout << fib_m(25, 1) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}
RESULT: With the CUTOFF value set to 10, it took under 8 seconds to calculate the 45th term.
co=1: 14.5 s
co=2: 9.5 s
co=3: 6.4 s
co=10: 7.5 s
co=15: 7.0 s
co=20: 8.5 s
co=21: >18.0 s
co=22: >40.0 s

I believe I don't know how to tell the compiler not to create parallel tasks below a certain depth: omp_set_max_active_levels seems to have no effect, and omp_set_nested is deprecated (and also has no effect).
So I have to specify manually after which level no more tasks should be created, which IMHO is sad. I still believe there should be a way to do this (if somebody knows, kindly let me know — see also the note after the code below). Here is how I attempted it; with an input size above 20, the parallel version runs a bit faster than the serial one (taking roughly 70-80% of its time).
Ref: code taken from an assignment from a course (the solution was not provided, so I don't know how to do it efficiently): https://www.cs.iastate.edu/courses/2018/fall/com-s-527x
#include <stdio.h>
#include <omp.h>
#include <math.h>

int fib(int n, int rec_height)
{
    int x = 1, y = 1;
    if (n < 2)
        return n;
    if (rec_height > 0) // Surprisingly, without this check the parallel code is slower than the serial one (I believe it should not be needed; I just don't know how to use OpenMP)
    {
        rec_height -= 1;
        #pragma omp task shared(x)
        x = fib(n - 1, rec_height);
        #pragma omp task shared(y)
        y = fib(n - 2, rec_height);
        #pragma omp taskwait
    }
    else {
        x = fib(n - 1, rec_height);
        y = fib(n - 2, rec_height);
    }
    return x + y;
}

int main()
{
    int tot_thread = 16;
    int recDepth = (int)log2f(tot_thread);
    if (((int)pow(2, recDepth)) < tot_thread) recDepth += 1;
    printf("\nrecDepth: %d\n", recDepth);
    omp_set_max_active_levels(recDepth);
    omp_set_nested(recDepth - 1);

    int n, fibonacci;
    double starttime;
    printf("\nPlease insert n, to calculate fib(n): ");
    scanf("%d", &n);
    omp_set_num_threads(tot_thread);
    starttime = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            fibonacci = fib(n, recDepth);
        }
    }
    printf("\n\nfib(%d)=%d \n", n, fibonacci);
    printf("calculation took %lf sec\n", omp_get_wtime() - starttime);
    return 0;
}
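For what it's worth, OpenMP does have a clause aimed at exactly this: final. Since OpenMP 3.1, a task generated with a true final() expression is executed immediately by the encountering thread, and every task it creates is also final, so the recursion below that depth effectively runs sequentially. A minimal sketch of how the function above might use it (an untested variation, not the assignment's intended solution):

int fib(int n, int rec_height)
{
    if (n < 2)
        return n;
    int x, y;
    // Once rec_height reaches 0 the tasks become final: they run
    // immediately in the encountering thread, and so do all their
    // descendants, which removes the need for the if/else split.
    #pragma omp task shared(x) final(rec_height <= 0)
    x = fib(n - 1, rec_height - 1);
    #pragma omp task shared(y) final(rec_height - 1 <= 0 || rec_height <= 0)
    y = fib(n - 2, rec_height - 1);
    #pragma omp taskwait
    return x + y;
}

Note that final tasks still pay the task-construct overhead on every call, so for very fine-grained leaves a separate serial function (as in the answer above) can still win.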

Related

OpenMP parallelized C++ code exhibits higher execution times on more threads?

I'm trying to learn parallelization of C++ using OpenMP, and I'm using the following example. But for some reason, when I increase the number of threads, the code runs slower. I'm compiling it with the -fopenmp flag. It would be nice if I could get your expert opinion.
#include <omp.h>
#include <iostream>

static long num_steps = 100000000;
#define NUM_THREADS 4
double step;

int main() {
    int i, nthreads;
    double pi, sum[NUM_THREADS]; // should be shared: hence promoted scalar sum into an array
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        //if(id==0) nthreads = nthrds; // This is done because the number of threads can be different
                                       // ie the environment can give you a different number of threads
                                       // than requested
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    double t2 = omp_get_wtime();
    std::cout << "Time : ";
    double elapsed_seconds = t2 - t1; // omp_get_wtime() returns seconds
    std::cout << elapsed_seconds << " s\n";
    for (i = 0, pi = 0.0; i < nthreads; i++) {
        pi += sum[i] * step;
    }
}
Minor complaints aside, your big problem is the loop update i = i + nthrds. This means that each cache line is accessed by all 4 of your threads. (Btw, use the OMP_NUM_THREADS environment variable to set the number of threads; do not hardcode it.) This is called false sharing, and it's really bad for performance: you want each cache line to live exclusively in one core's cache.
The main advantage of OpenMP is that you do not have to do the reduction manually: you just add an extra line to the serial code. So your code should be something like this (which is free of false sharing):
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (unsigned long i = 0; i < num_steps; ++i) {
    const double x = (i + 0.5) * step;
    sum += 4.0 / (1.0 + x * x);
}
double pi = sum * step;
Note that your code had an uninitialized variable (nthreads — the line that sets it is commented out) and did not properly handle the case where you get fewer threads than requested.
What @Victor Eijkhout called "minor complaints" might not be so minor. It is only normal that using a new API (OMP) for the first time can be confusing, and more often than not that reflects on the coding style of the application code as well. But especially in such cases, special attention should be paid to readability.
The code below is the "prettied-up" version of your attempt. Next to the OMP parallel integration it also has a single-threaded and a multi-threaded (std::thread) version, so you can compare them to each other.
#include <omp.h>
#include <iostream>
#include <thread>

constexpr int MAX_PARALLEL_THREADS = 4; // long is wrong - is it an isize_t or a int32_t or an int64_t???

// the function we want to integrate
double f(double x) {
    return 4.0 / (1.0 + x * x);
}

// performs the summation of function values on the interval [left,right[
double sum_interval(double left, double right, double step) {
    double sum = 0.0;
    for (double x = left; x < right; x += step) {
        sum += f(x);
    }
    return sum;
}

double integrate_single_threaded(double left, double right, double step) {
    return sum_interval(left, right, step) / (right - left);
}

double integrate_multi_threaded(double left, double right, double step) {
    double sums[MAX_PARALLEL_THREADS];
    std::thread threads[MAX_PARALLEL_THREADS];
    for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
        threads[i] = std::thread([&sums, left, right, step, i]() {
            double ileft = left + (right - left) / MAX_PARALLEL_THREADS * i;
            double iright = left + (right - left) / MAX_PARALLEL_THREADS * (i + 1);
            sums[i] = sum_interval(ileft, iright, step);
        });
    }
    double total_sum = 0.0;
    for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
        threads[i].join();
        total_sum += sums[i];
    }
    return total_sum / (right - left);
}

double integrate_parallel(double left, double right, double step) {
    double sums[MAX_PARALLEL_THREADS];
    int thread_count = 0;
    omp_set_num_threads(MAX_PARALLEL_THREADS);
    #pragma omp parallel
    {
        thread_count = omp_get_num_threads(); // 0 is impossible, there is always 1 thread minimum...
        int interval_index = omp_get_thread_num();
        double ileft = left + (right - left) / thread_count * interval_index;
        double iright = left + (right - left) / thread_count * (interval_index + 1);
        sums[interval_index] = sum_interval(ileft, iright, step);
    }
    double total_sum = 0.0;
    for (int i = 0; i < thread_count; i++) {
        total_sum += sums[i];
    }
    return total_sum / (right - left);
}

int main(int argc, const char* argv[]) {
    double left = -1.0;
    double right = 1.0;
    double step = 1.0E-9;

    // run single threaded calculation
    std::cout << "single" << std::endl;
    double tstart = omp_get_wtime();
    double i_single = integrate_single_threaded(left, right, step);
    double tend = omp_get_wtime();
    double st_time = tend - tstart;

    // run multi threaded calculation
    std::cout << "multi" << std::endl;
    tstart = omp_get_wtime();
    double i_multi = integrate_multi_threaded(left, right, step);
    tend = omp_get_wtime();
    double mt_time = tend - tstart;

    // run omp calculation
    std::cout << "omp" << std::endl;
    tstart = omp_get_wtime();
    double i_omp = integrate_parallel(left, right, step);
    tend = omp_get_wtime();
    double omp_time = tend - tstart;

    std::cout
        << "i_single: " << i_single
        << " st_time: " << st_time << std::endl
        << "i_multi: " << i_multi
        << " mt_time: " << mt_time << std::endl
        << "i_omp: " << i_omp
        << " omp_time: " << omp_time << std::endl;
    return 0;
}
When I compile this on my Debian box with g++ --std=c++17 -Wall -O3 -lpthread -fopenmp -o para para.cpp -pthread, I get the following results:
single
multi
omp
i_single: 3.14159e+09 st_time: 2.37662
i_multi: 3.14159e+09 mt_time: 0.635195
i_omp: 3.14159e+09 omp_time: 0.660593
So, at least my conclusion is that it is not worth the effort to learn OpenMP, given that the (more generally useful) std::thread version looks just as nice and performs at least equally well.
I am not really trusting the computed integral result in either case, though (the sum is divided by the interval length but never scaled by step; multiplying the printed value by step, 3.14159e+09 * 1.0E-9 ≈ 3.14159, recovers the expected π). But I did not really focus on that. They all produce the same value; that is the important part.

Why doesn't my OpenMP program scale with number of threads?

I wrote a program to calculate the sum of an array of 1M numbers, where all elements = 1, using OpenMP for multithreading. However, the run time doesn't scale with the number of threads. Here is the code:
#include <iostream>
#include <omp.h>

#define SIZE 1000000
#define N_THREADS 4

using namespace std;

int main() {
    int* arr = new int[SIZE];
    long long sum = 0;
    int n_threads = 0;
    omp_set_num_threads(N_THREADS);
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }
        #pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++) {
            arr[i] = 1;
        }
        #pragma omp for schedule(static, 16) reduction(+:sum)
        for (int i = 0; i < SIZE; i++) {
            sum += arr[i];
        }
    }
    double t2 = omp_get_wtime();
    cout << "n_threads " << n_threads << endl;
    cout << "time " << (t2 - t1) * 1000 << endl;
    cout << sum << endl;
}
The run time (in milliseconds) for different values of N_THREADS is as follows:
n_threads 1
time 3.6718
n_threads 2
time 2.5308
n_threads 3
time 3.4383
n_threads 4
time 3.7427
n_threads 5
time 2.4621
I used schedule(static, 16), so each thread works on chunks of 16 iterations, to avoid the false sharing problem. I thought the performance issue was related to false sharing, but now I think it's not. What could the problem be?
Your code is memory-bound, not compute-bound. Its speed depends on the speed of memory access (cache utilization, number of memory channels, etc.), therefore it is not expected to scale well with the number of threads.
UPDATE: I ran this code with a 100x bigger SIZE (i.e. #define SIZE 100000000), compiled with g++ -fopenmp -O3 -mavx2.
Here are the results; it still scales badly with the number of threads:
n_threads 1
time 652.656
time 657.207
time 608.838
time 639.168
1000000000
n_threads 2
time 422.621
time 373.995
time 425.819
time 386.511
time 466.632
time 394.198
1000000000
n_threads 3
time 394.419
time 391.283
time 470.925
time 375.833
time 442.268
time 449.611
time 370.12
time 458.79
1000000000
n_threads 4
time 421.89
time 402.363
time 424.738
time 414.368
time 491.843
time 429.757
time 431.459
time 497.566
1000000000
n_threads 8
time 414.426
time 430.29
time 494.899
time 442.164
time 458.576
time 449.313
time 452.309
1000000000
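For a rough sanity check of the memory-bound claim (using the numbers above and assuming 4-byte ints): SIZE = 100000000 means the array is about 400 MB; the first loop writes it and the second reads it back, so each run moves at least 800 MB. At ~650 ms on 1 thread that is roughly 1.2 GB/s, at ~400 ms on 2 threads roughly 2 GB/s, and beyond 2 threads the times stop improving — the classic signature of saturating memory bandwidth rather than running out of compute.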
Five threads contending for the same reduction accumulator, or a chunk size of only 16 iterations, must be inhibiting efficient pipelining of the loop iterations. Try a coarser region per thread.
Maybe more importantly, you need multiple repeats of the benchmark, done programmatically, to get an average and to heat the CPU caches/cores into higher frequencies for a better measurement.
The benchmark results work out to about 1 MB/s. Surely even the worst RAM will do 1000 times better than that, so memory is not the bottleneck (for now). One million elements per ~4 seconds looks like locking contention or an unwarmed benchmark; normally even a Pentium 1 would manage more bandwidth than that. Are you sure you are compiling with O3 optimization?
I have reimplemented the test as a Google Benchmark with different parameter values:
#include <benchmark/benchmark.h>
#include <memory>
#include <omp.h>

constexpr int SCALE{32};
constexpr int ARRAY_SIZE{1000000};
constexpr int CHUNK_SIZE{16};

void original_benchmark(benchmark::State& state)
{
    // state.range() returns int64_t, so cast to avoid narrowing.
    const int num_threads{static_cast<int>(state.range(0))};
    const int array_size{static_cast<int>(state.range(1))};
    const int chunk_size{static_cast<int>(state.range(2))};

    auto arr = std::make_unique<int[]>(array_size);
    long long sum = 0;
    int n_threads = 0;
    omp_set_num_threads(num_threads);
    // double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }
        #pragma omp for schedule(static, chunk_size)
        for (int i = 0; i < array_size; i++) {
            arr[i] = 1;
        }
        #pragma omp for schedule(static, chunk_size) reduction(+:sum)
        for (int i = 0; i < array_size; i++) {
            sum += arr[i];
        }
    }
    // double t2 = omp_get_wtime();
    // cout << "n_threads " << n_threads << endl;
    // cout << "time " << (t2 - t1)*1000 << endl;
    // cout << sum << endl;
    state.counters["n_threads"] = n_threads;
}

static void BM_original_benchmark(benchmark::State& state) {
    for (auto _ : state) {
        original_benchmark(state);
    }
}

BENCHMARK(BM_original_benchmark)
    ->Args({1, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({1, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({1, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({2, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({2, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({2, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({4, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({4, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({4, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({8, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({8, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({8, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({16, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({16, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({16, ARRAY_SIZE, SCALE * CHUNK_SIZE});

BENCHMARK_MAIN();
I only have access to Compiler Explorer at the moment, which will not execute the complete suite of benchmarks. However, it looks like increasing the chunk size improves performance. Obviously, benchmark and optimize for your own system.
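Along those lines, it may also be worth omitting the chunk argument entirely — this is my suggestion, not something from the answers above. With plain schedule(static), the runtime hands each thread a single contiguous block of iterations, the coarsest possible chunking, which also sidesteps false sharing at chunk boundaries:

// Coarsest static schedule: each thread gets one contiguous block
// of roughly array_size / n_threads iterations.
#pragma omp for schedule(static) reduction(+:sum)
for (int i = 0; i < array_size; i++) {
    sum += arr[i];
}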

Thread-safe parallel RNG slower than sequential rand()

I use this version of a Pi calculation with the thread-safe function rand_r. But it appears to be slower (and the answer is wrong) when run in parallel, compared to the sequential program that uses rand(), which is not thread-safe. It seems that this way of using rand_r is also not thread-safe. But I do not understand why: I have read many questions about thread-safe PRNGs and learned that rand_r should be safe enough.
#include <iostream>
#include <random>
#include <ctime>
#include "omp.h"
#include <stdlib.h>
using namespace std;

unsigned seed;

int main()
{
    double start = time(0);
    int i, n, N;
    double x, y;
    N = 1 << 30;
    n = 0;
    double pi;
    #pragma omp threadprivate(seed)
    #pragma omp parallel private(x, y) reduction(+:n)
    {
        for (i = 0; i < N; i++) {
            seed = 25234 + 17 * omp_get_thread_num();
            x = rand_r(&seed) / (double) RAND_MAX;
            y = rand_r(&seed) / (double) RAND_MAX;
            if (x*x + y*y <= 1)
                n++;
        }
    }
    pi = 4. * n / (double) (N);
    cout << pi << endl;
    double stop = time(0);
    cout << (stop - start) << endl;
    return 0;
}
P.S. By the way, what are the magic numbers in seed = 25234 + 17 * omp_get_thread_num(); ? I stole them from some answer.
EDIT: The comment by Gilles helped me. The resolution was:
1. To swap the seed initialization and the for loop, so the seed is set once per thread rather than on every iteration.
2. To add #pragma omp for.
The modified code reads:
#pragma omp parallel private(x, y, seed)
{
    seed = 25234 + 17 * omp_get_thread_num();
    #pragma omp for reduction(+:n)
    for (int i = 0; i < N; i++) {
        x = (double) rand_r(&seed) / (double) RAND_MAX;
        y = (double) rand_r(&seed) / (double) RAND_MAX;
        if (x*x + y*y <= 1)
            n++;
    }
}
The problem is resolved.
Apparently there are more instructions in rand_r() than in rand(); below is a copy from one implementation. So it's reasonable that rand_r() takes more time per call than rand().
int
rand_r(unsigned int *ctx)
{
    u_long val = (u_long) *ctx;
    int r = do_rand(&val);
    *ctx = (unsigned int) val;
    return (r);
}

static u_long next = 1;

int
rand()
{
    return (do_rand(&next));
}
And since rand() is not thread-safe, the output could be incorrect if you use rand() in parallel. The worse part is that you would still get a result, and in a small-scale test you wouldn't know whether it is correct.
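As an aside (not part of the original answers): if C++11 is available, a common alternative to rand_r is to give every thread its own standard-library engine, seeded per thread. A minimal sketch of the same Pi estimate with std::mt19937:

#include <omp.h>
#include <random>
#include <iostream>

int main()
{
    const long N = 1 << 30;
    long n = 0;
    #pragma omp parallel reduction(+:n)
    {
        // One engine per thread (declared inside the parallel region,
        // so it is private by construction), seeded per thread with
        // the same scheme as the question.
        std::mt19937 gen(25234 + 17 * omp_get_thread_num());
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        #pragma omp for
        for (long i = 0; i < N; i++) {
            double x = dist(gen);
            double y = dist(gen);
            if (x * x + y * y <= 1.0)
                n++;
        }
    }
    std::cout << 4.0 * n / N << std::endl;
    return 0;
}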

OpenMP broken by Visual Studio C++ optimization

I've been using OpenMP with Visual Studio 2010 for quite some time now, but today I encountered yet another baffling quirk of VS. After ruling out all the possible suspects, I was left with the program below. It simply counts in a loop, occasionally doing some calculation and printing the counters.
#include "stdafx.h"
#include "omp.h"
#include <string>
#include <iostream>
#include <time.h>
int _tmain(int argc, _TCHAR* argv[])
{
int count = 0;
double a = 1;
double b = 2;
double c = 3, mean_tau = 1, r_w = 1, weights = 1, r0 = 1, tau = 1, sq_tau = 1,
r_sw = 1;
#pragma omp parallel num_threads(3) shared(count)
{
int tid = omp_get_thread_num();
int pers_count = 0;
std::string someline;
for (int i = 0; i < 100000; i++)
{
pers_count++;
#pragma omp critical
{
count++;
if ((count%10000 == 0))
{
sq_tau = (r_sw / weights) * pow( 1/ r0 * tau, 2);
std::cout << count << " " << pers_count << std::endl;
}
}
}
}
std::getchar();
return 0;
}
Now, if I compile it with optimization disabled (/Od), it works just as it should, printing the shared counter alongside each thread's private counter (which is roughly three times smaller), something along the lines of:
10000 3890
20000 6523
...
300000 100000
If I turn on optimization (I tried all the options, but for clarity's sake let's say /O2), however, the shared count seems to become private for some reason, as I start getting something like:
10000 10000
10000 10000
10000 10000
...
60000 60000
50000 50000
...
100000 100000
And now that I have encountered this quirk, somehow everything that worked before gets rebuilt into an incorrect version, even if I don't change a thing. What could be the cause of this, and what can I do? Thanks.
I don't know why the shared count is behaving this way, but I can provide a workaround (assuming you only use atomic operations on the shared variable):
#pragma omp critical
{
    #pragma omp atomic
    count++;
    if ((count % 10000 == 0))
    {
        sq_tau = (r_sw / weights) * pow(1 / r0 * tau, 2);
        std::cout << count << " " << pers_count << std::endl;
    }
}
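If the optimizer really is caching the shared counter, another possible workaround is to bypass the OpenMP constructs for the counter entirely and use the Win32 interlocked primitive. This is an untested sketch (and without the critical section the std::cout lines may interleave); testing the returned local value also means the check never re-reads the shared variable:

#include <windows.h>

volatile LONG count = 0; // shared across all threads

// ... inside the per-thread loop:
LONG my_count = InterlockedIncrement(&count); // atomic increment; returns the new value
if (my_count % 10000 == 0)
{
    sq_tau = (r_sw / weights) * pow(1 / r0 * tau, 2);
    std::cout << my_count << " " << pers_count << std::endl;
}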

Sequential is faster than Multi threaded - OpenMp - C++

I am using C++ and OpenMP to parallelize an algorithm that finds the convex hull. But I am not getting the expected speedup; in fact, the sequential algorithm is faster. The input and output sets of points are stored in arrays. Could you please look into the code and let me know what needs correcting?
Point *points = new Point[inp_size]; // contains the input
int th_id;
omp_set_num_threads(nthreads);
clock_t t1, t2;
t1 = clock();
#pragma omp parallel private(th_id)
{
    th_id = omp_get_thread_num();
    ///////////// .... Only Function called .... /////////////
    findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size);
}
t2 = clock();
float diff((float)t2 - (float)t1);
float seconds = diff / CLOCKS_PER_SEC;
std::cout << "Time Elapsed in seconds:" << seconds << '\n';

///////////////////////////////////////////////////////////////
int findParallelUCHWOUP(Point iv[], int id, int thread_num, int inp_size) {
    int numElems = inp_size / thread_num;
    int first = (id - 1) * numElems;
    int last;
    if (id == thread_num) {
        last = inp_size - 1;
    } else {
        last = id * numElems - 1;
    }
    output[first] = iv[first];
    std::stack<int> s;
    s.push(first);

    int i = first + 1;
    while (i < last) {
        if (crossProduct(iv, i, first, last) > 0) {
            s.push(i);
            i++;
            break;
        } else {
            i++;
        }
    }
    if (i == last) {
        s.push(last);
        return 0;
    }

    for (; i <= last; i++) {
        if (crossProduct(iv, i, first, last) >= 0) {
            while (s.size() > 1 && crossProduct(iv, s.top(), second(s), i) <= 0) {
                s.pop();
            }
            s.push(i);
        }
    }

    int count = s.size();
    sizes[id - 1] = count;
    while (!s.empty()) {
        output[first + count - 1] = iv[s.top()];
        s.pop();
        count--;
    }
    return 0;
}
Tested on this machine:
Sequential time: 0.016466
Using two threads: 0.022979
Using four threads: 0.035213
Using 8 threads: 0.03315
Machine used: MacBook Pro
Processor: 2.5 GHz Intel Core i5 (at least 4 logical cores)
Memory: 4 GB 1600 MHz
Compiler: Mac OS X compiler
The problem is the way you measure time. clock() counts CPU time accumulated across all of your threads, not elapsed wall-clock time. Actually, you could write something like:
diff / (float) (CLOCKS_PER_SEC * nthreads)
and even this is only an approximation (and not always true), since CLOCKS_PER_SEC ends up covering the clocks of all the cores doing work.
You'd better use OpenMP's own timing function, omp_get_wtime(), which returns wall-clock time.
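A sketch of the OP's timing block with the measurement switched to omp_get_wtime():

double t1 = omp_get_wtime(); // wall-clock seconds, independent of thread count
#pragma omp parallel private(th_id)
{
    th_id = omp_get_thread_num();
    findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size);
}
double t2 = omp_get_wtime();
std::cout << "Time Elapsed in seconds:" << (t2 - t1) << '\n';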