OpenMP: parallelization of a loop across N threads - C++

I am trying to parallelize a loop of 50 million iterations across several thread counts - first 1, then 4, 8 and 16. Below is the code implementing this.
#include <iostream>
#include <omp.h>

using namespace std;

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    long sum = 0;
    int numOfThreads[] = {1, 4, 8, 16};
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        start = omp_get_wtime();
        #pragma omp parallel for
        for(int i = 0; i < 50000000; i++) {
            sum += i * 10;
        }
        #pragma omp end parallel
        end = omp_get_wtime();
        cout << "Result: " << sum << ". Spent time: " << (end - start) << "\n";
    }
}
It is expected that the program will run faster with 4 threads than with 1, faster with 8 threads than with 4, and faster with 16 threads than with 8, but in practice this is not the case - the timings vary chaotically and there is almost no difference between them. Also, Task Manager does not show that the program is running in parallel. I have a computer with 8 logical processors and 4 cores.
Please tell me where I made a mistake and how to properly parallelize the loop across N threads.

There is a race condition in your code because sum is read and written by multiple threads at the same time, which causes wrong results. You can fix this with a reduction, using the directive #pragma omp parallel for reduction(+:sum). Note that OpenMP does not check whether your loop can safely be parallelized; that is your responsibility.
Additionally, the parallel computation might be slower than the sequential one, since a clever compiler can see that sum = 50000000*(50000000-1)/2*10 = 12499999750000000 and compute it at compile time (AFAIK, Clang does that). As a result, the benchmark is certainly flawed. Note that this value is bigger than what the type long can hold on your platform (long is 32 bits on Windows), so there is an overflow in your code.
Moreover, AFAIK, there is no such directive as #pragma omp end parallel.
Finally, note that you can control the number of threads with the OMP_NUM_THREADS environment variable, which is generally more convenient than setting it inside the application (hard-wiring a given number of threads in the application code is generally not a good idea, even for benchmarks).
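Putting the first two fixes together, a minimal sketch (my own rewrite, not the asker's exact code): reduction(+:sum) removes the race, and long long is guaranteed to be at least 64 bits, so the exact result 12499999750000000 fits on every platform.
#include <iostream>
#include <omp.h>

int main() {
    long long sum = 0;
    double start = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 50000000; i++)
        sum += i * 10LL;   // 10LL keeps the multiplication in 64-bit arithmetic
    double end = omp_get_wtime();
    std::cout << "Result: " << sum << ". Spent time: " << (end - start) << "\n";
}
Compiled with g++ -fopenmp and run as, for example, OMP_NUM_THREADS=4 ./a.out, this lets you change the thread count without recompiling.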

Please tell me where I made a mistake and how to properly parallelize the loop across N threads.
First you need to fix some of the compilation issues in your code example, like removing invalid pragmas such as #pragma omp end parallel, declaring the variables correctly, and so on. Second, you need to fix the race condition on the update of the variable sum: that variable is shared among threads and updated concurrently. The easiest way is to use OpenMP's reduction clause; your code would look like the following:
#include <stdio.h>
#include <omp.h>

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    int numOfThreads[] = {1, 4, 8, 16};
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        double start = omp_get_wtime();
        double sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < 50000000; i++) {
            sum += i * 10;
        }
        double end = omp_get_wtime();
        printf("Result: '%.0f' : '%f'\n", sum, (end - start));
    }
}
With that you should get some speedup when running on multiple cores.
NOTE: To avoid the overflow first mentioned by @Jérôme Richard, I changed the sum variable from long to double.

Related

OpenMP doesn't give the right result and the time differs between runs

I am new to OpenMP and I'm currently studying the use of atomic. I get a different result and a different run time on each run - sometimes about a minute, sometimes about 19 seconds.
Below is my code:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <omp.h>
#include "KiTimer.h"

int main()
{
    using namespace std;
    const int NUM_REPEAT = 100000000;

    KiTimer timer;
    timer.MakeTimer(0, "ADD");
    timer.Start();

    double sum = 0., x = 0.;
    #pragma omp parallel
    {
        #pragma omp single
        cout << "Thread num:" << omp_get_num_threads() << endl;

        #pragma omp for private(x)
        for (int i = 0; i < NUM_REPEAT; i++) {
            x = sqrt(i);
            #pragma omp atomic
            sum += x;
        }
    }
    cout << setprecision(20) << "total:" << sum << endl;

    timer.Stop();
    timer.Print();
    return 0;
}
Here are the results from three different test runs (the timing screenshots are not reproduced here; each run gave a different total and run time).
The correct way of doing a sum with OpenMP is:
#pragma omp for reduction(+:sum)
instead of
#pragma omp for private(x)
...
#pragma omp atomic
The atomic construct, as far as I remember, adds a big overhead when it is executed very often (as is the case here).
Also, the scope of x can be greatly reduced by declaring it inside the loop, which simplifies the code: there is no need for private at all.
About the different results: that is normal, since you are adding floating-point numbers in a different order in each execution. This matters most when combining big numbers with small numbers, because they have to be aligned to a common exponent before being added and precision is lost. A double has about 15-16 significant decimal digits, so if you add 10000000000000000 + 1, the result is still 10000000000000000, even if you do it many times.
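For completeness, a minimal sketch of the fix described above (my own condensed version, without the KiTimer class): the atomic update is replaced by a reduction, and x becomes a plain per-iteration value.
#include <cmath>
#include <iomanip>
#include <iostream>
#include <omp.h>

int main() {
    const int NUM_REPEAT = 100000000;
    double sum = 0.;
    double start = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < NUM_REPEAT; i++)
        sum += std::sqrt(double(i));   // local value, no shared x and no atomic needed
    std::cout << std::setprecision(20) << "total:" << sum << "\n"
              << "time: " << (omp_get_wtime() - start) << " s" << std::endl;
}
The total can still differ slightly between different thread counts, for the summation-order reason explained above.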

Why does gcc's implementation of OpenMP fail to parallelise a recursive function inside another recursive function

I am trying to parallelise these recursive functions with OpenMP tasks. When I compile with gcc, the program runs on only 1 thread; when I compile it with clang, it runs on multiple threads.
The second function calls the first one, which doesn't generate new tasks, to avoid wasting time.
gcc does work when there is only one function that calls itself.
Why is this?
Am I doing something wrong in the code?
Then why does it work with clang?
I am using gcc 9.3 on Windows with MSYS2.
The code was compiled with -O3 -fopenmp
// the program compiled by gcc only runs on one thread
#include <vector>
#include <omp.h>
#include <iostream>
#include <ctime>

using namespace std;

vector<int> vec;
thread_local double steps;

void excalibur(int current_node, int current_depth) {
    #pragma omp simd
    for (int i = 0; i < current_node; i++) {
        ++steps;
        excalibur(i, current_depth);
    }
    if (current_depth > 0) {
        int new_depth = current_depth - 1;
        #pragma omp simd
        for (int i = current_node; i <= vec[current_node]; i++) {
            ++steps;
            excalibur(i + 1, new_depth);
        }
    }
}

void mario(int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node,current_depth)
    {
        if (current_depth > 0) {
            int new_depth = current_depth - 1;
            for (int i = current_node; i <= vec[current_node]; i++) {
                ++steps;
                mario(i + 1, new_depth);
            }
        }
    }
    #pragma omp simd
    for (int i = 0; i < current_node; i++) {
        ++steps;
        excalibur(i, current_depth);
    }
}

int main() {
    double total = 0;
    clock_t tim = clock();
    omp_set_dynamic(0);
    int nodes = 10;
    int timesteps = 3;
    omp_set_num_threads(4);
    vec.assign(nodes, nodes - 2);
    #pragma omp parallel
    {
        steps = 0;
        #pragma omp single
        {
            mario(nodes - 1, timesteps - 1);
        }
        #pragma omp atomic
        total += steps;
    }
    double time_taken = (double)(tim) / CLOCKS_PER_SEC;
    cout << fixed << total << " steps, " << fixed << time_taken << " seconds" << endl;
    return 0;
}
while this works with gcc
#include <vector>
#include <omp.h>
#include <iostream>
#include <ctime>

using namespace std;

vector<int> vec;
thread_local double steps;

void mario(int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node,current_depth)
    {
        if (current_depth > 0) {
            int new_depth = current_depth - 1;
            for (int i = current_node; i <= vec[current_node]; i++) {
                ++steps;
                mario(i + 1, new_depth);
            }
        }
    }
    #pragma omp simd
    for (int i = 0; i < current_node; i++) {
        ++steps;
        mario(i, current_depth);
    }
}

int main() {
    double total = 0;
    clock_t tim = clock();
    omp_set_dynamic(0);
    int nodes = 10;
    int timesteps = 3;
    omp_set_num_threads(4);
    vec.assign(nodes, nodes - 2);
    #pragma omp parallel
    {
        steps = 0;
        #pragma omp single
        {
            mario(nodes - 1, timesteps - 1);
        }
        #pragma omp atomic
        total += steps;
    }
    double time_taken = (double)(tim) / CLOCKS_PER_SEC;
    cout << fixed << total << " steps, " << fixed << time_taken << " seconds" << endl;
    return 0;
}
Your program doesn't run in parallel because there is simply nothing to run in parallel. Upon first entry in mario, current_node is 9 and vec is all 8s, so this loop in the first and only task never executes:
for (int i = current_node; i <= vec[current_node]; i++) {
    ++steps;
    mario(i + 1, new_depth);
}
Hence, no recursive creation of new tasks. How and what runs in parallel when you compile it with Clang is well beyond me, since when I compile it with Clang 9, the executable behaves exactly the same as the one produced by GCC.
The second code runs in parallel because of the recursive call in the loop after the task region. But it also isn't a correct OpenMP program - the specification forbids nesting task regions inside a simd construct (see under Restrictions here):
The only OpenMP constructs that can be encountered during execution of a simd region are the atomic construct, the loop construct, the simd construct and the ordered construct with the simd clause.
Neither of the two compilers catches that problem, though, when the nesting is in the dynamic rather than the lexical scope of the simd construct.
Edit: I actually looked into it a bit closer and I have a suspicion about what might have caused your confusion. I guess you determine whether your program works in parallel or not by looking at the CPU utilisation while it runs. This often leads to confusion. The Intel OpenMP runtime that Clang uses has a very aggressive waiting policy. When the parallel region in the main() function spawns a team of four threads, one of them goes on to execute mario() and the other three hit the implicit barrier at the end of the region. There they spin, waiting for new tasks to eventually be assigned to them. They never get any, but keep on spinning anyway, and that's what you see in the CPU utilisation. If you want to replicate the same with GCC, set OMP_WAIT_POLICY to ACTIVE and you'll see the CPU usage soar while the program runs. Still, if you profile the program's execution, you'll see that CPU time is spent inside your code in one thread only.
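A quick way to see whether tasks are really being distributed, rather than judging by CPU utilisation, is to print which thread executes each task. A small standalone sketch of my own (not the asker's code), in which every level really does spawn two child tasks:
#include <cstdio>
#include <omp.h>

void recurse(int depth) {
    if (depth == 0) return;
    std::printf("depth %d handled by thread %d\n", depth, omp_get_thread_num());
    // Two child tasks per level, so there is genuine work to hand out.
    #pragma omp task
    recurse(depth - 1);
    #pragma omp task
    recurse(depth - 1);
    #pragma omp taskwait
}

int main() {
    omp_set_num_threads(4);
    #pragma omp parallel
    #pragma omp single
    recurse(4);
}
If the runtime distributes the tasks, the printed thread IDs will vary; if everything is executed by thread 0, the recursion is effectively serial no matter how busy the cores look.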

C++ OpenMP: Writing to a matrix inside a for loop slows down the loop significantly

I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function replicates, in a simplified form, something I am doing in a more complicated piece of code: writing to a matrix inside the for loop slows the loop down significantly. I am trying to figure out why it does so, and whether there is a solution.
#include <vector>
#include <cmath>
#include <cstdint>
#include <omp.h>

// Count the number of set bits
inline int bitCount(uint64_t n) {
    int count = 0;
    while (n) {
        n &= (n - 1);
        count++;
    }
    return count;
}
void test() {
    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);
    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000, -INFINITY));
    std::vector<uint64_t> vals(100, 1);
    #pragma omp parallel for shared(mat, vals)
    for (int i = 0; i < 100000000; i++) {
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for (unsigned int j = 0; j < vals.size(); j++) {
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.
If you comment the line:
tid_vec[j] = total_count;
The compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away too. If the main kernel of your application is optimized away, it makes sense that the program runs much faster.
On the other hand, I would not implement a bit-count function myself but rather rely on functionality that is already provided to you. For example, GCC builtin functions include __builtin_popcount (and __builtin_popcountll for 64-bit values), which do exactly what you are trying to do; see the sketch below.
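For instance, a possible drop-in replacement (assuming GCC or Clang; __builtin_popcountll handles the full 64-bit value of a uint64_t):
#include <cstdint>

inline int bitCount(uint64_t n) {
    return __builtin_popcountll(n);   // intrinsic population count of a 64-bit integer
}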
As a bonus: it is way better to work on private data than to work on a common array using different array elements. It improves locality (especially important when access to memory is not uniform, a.k.a. NUMA) and may reduce access contention:
#pragma omp parallel shared(mat, vals)
{
    std::vector<double> local_vec(1000, -INFINITY);
    #pragma omp for
    for (int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for (unsigned int j = 0; j < vals.size(); j++) {
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }
    // Copy local_vec to mat[omp_get_thread_num()]
}

No speedup with OpenMP

I am working with OpenMP in order to obtain an algorithm with a near-linear speedup.
Unfortunately I noticed that I could not get the desired speedup.
So, in order to understand the error in my code, I wrote another, much simpler program, just to double-check that the speedup was obtainable in principle on my hardware.
This is the toy example I wrote:
#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main() {
    int number_of_threads = 1;
    int n = 600;
    int m = 50;
    int N = n / number_of_threads;
    int time_limit = 600;
    double total_clock = omp_get_wtime();
    int time_flag = 0;

    #pragma omp parallel num_threads(number_of_threads)
    {
        int thread_id = omp_get_thread_num();
        int iteration_number_local = 0;
        double *C  = new double[n]; std::fill(C,  C + n,  3.0);
        double *D  = new double[n]; std::fill(D,  D + n,  3.0);
        double *CD = new double[n]; std::fill(CD, CD + n, 0.0);

        while (time_flag == 0) {
            for (int i = 0; i < N; i++)
                for (int z = 0; z < m; z++)
                    for (int x = 0; x < n; x++)
                        for (int c = 0; c < n; c++) {
                            CD[c] = C[z] * D[x];
                            C[z]  = CD[c] + D[x];
                        }
            iteration_number_local++;
            if ((omp_get_wtime() - total_clock) >= time_limit)
                time_flag = 1;
        }
        #pragma omp critical
        std::cout << "I am " << thread_id << " and I got" << iteration_number_local << "iterations." << std::endl;
    }
}
I want to highlight again that this code is only a toy example written to check the speedup: the first for loop becomes shorter as the number of parallel threads increases (since N decreases).
However, when I go from 1 to 2-4 threads the number of iterations doubles as expected; this is not the case when I use 8-10-20 threads: the number of iterations does not increase linearly with the number of threads.
Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?
Results
Running the code above I got the following results.
1 thread: 23 iterations.
20 threads: 397-401 iterations per thread (instead of 420-460).
Your measurement methodology is wrong, especially for small numbers of iterations. Take, for example, results reported for a run with a 120 s time limit:
1 thread: 3 iterations.
3 reported iterations actually means that 2 iterations finished in less than 120 s and the third one took longer. The time of one iteration is therefore between 40 and 60 s.
2 threads: 5 iterations per thread (instead of 6).
4 iterations finished in less than 120 s. The time of one iteration is between 24 and 30 s.
20 threads: 40-44 iterations per thread (instead of 60).
40 iterations finished in less than 120 s. The time of one iteration is between 2.9 and 3 s.
As you can see, your results do not actually contradict linear speedup.
It would be much simpler and more accurate to execute and time one single outer loop; you would likely see almost perfect linear speedup (a sketch of this follows after the list below).
Some reasons (non-exhaustive) why you don't see linear speedup are:
Memory-bound performance. Not the case in your toy example, where the arrays are tiny. More generally speaking: contention for a shared resource (main memory, caches, I/O).
Synchronization between threads (e.g. critical sections). Not the case in your toy example.
Load imbalance between threads. Not the case in your toy example.
Turbo mode will use lower frequencies when all cores are utilized. This can happen in your toy example.
From your toy example I would say that your approach to OpenMP can be improved by making better use of the high-level abstractions, e.g. the for worksharing construct.
More general advice would be too broad for this format and would require more specific information about the non-toy example.
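A minimal sketch of that simpler measurement (my own illustration, reusing the toy kernel): run the outer loop once as a worksharing loop, keep the scratch arrays private to each thread, and time the whole thing with omp_get_wtime(). Running it with different OMP_NUM_THREADS values then gives directly comparable wall-clock times.
#include <omp.h>
#include <algorithm>
#include <iostream>

int main() {
    const int n = 600, m = 50;
    double checksum = 0.0;
    double start = omp_get_wtime();

    #pragma omp parallel
    {
        // Each thread owns its scratch arrays, so the i-iterations are independent.
        double *C  = new double[n]; std::fill(C,  C + n,  3.0);
        double *D  = new double[n]; std::fill(D,  D + n,  3.0);
        double *CD = new double[n]; std::fill(CD, CD + n, 0.0);

        #pragma omp for reduction(+:checksum)
        for (int i = 0; i < n; i++) {
            for (int z = 0; z < m; z++)
                for (int x = 0; x < n; x++)
                    for (int c = 0; c < n; c++) {
                        CD[c] = C[z] * D[x];
                        C[z]  = CD[c] + D[x];
                    }
            checksum += C[0];   // keep the result live so the kernel is not optimized away
        }

        delete[] C; delete[] D; delete[] CD;
    }

    std::cout << "checksum " << checksum << ", elapsed "
              << (omp_get_wtime() - start) << " s" << std::endl;
}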
You make some declarations inside the parallel region, which means you will allocate the memory and fill it number_of_threads times. Instead I recommend:
double *C  = new double[n]; std::fill(C,  C + n,  3.0);
double *D  = new double[n]; std::fill(D,  D + n,  3.0);
double *CD = new double[n]; std::fill(CD, CD + n, 0.0);

#pragma omp parallel firstprivate(C, D, CD) num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
}
Your hardware has a limited number of hardware threads, which depends on the number of cores in your processor. You may have 2 or 4 cores.
A parallel region by itself doesn't speed up your code. With OpenMP you should use #pragma omp parallel for to speed up a for loop, or
#pragma omp parallel
{
    #pragma omp for
    for (...)
    {
    }
}
This notation is equivalent to #pragma omp parallel for. It will use several threads (depending on your hardware) to process the for loop faster.
Be careful:
#pragma omp parallel
{
    for (...)
    {
    }
}
will execute the entire for loop in each thread, which will not speed up your program (see the small illustration below).
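As a small illustration of that difference (my own sketch, not part of the original answer), counting how many iterations are executed in total by each variant:
#include <cstdio>
#include <omp.h>

int main() {
    long long total = 0;

    // Every thread runs the whole loop: with T threads, 'total' ends up T * 1000.
    #pragma omp parallel
    {
        for (int i = 0; i < 1000; i++) {
            #pragma omp atomic
            total++;
        }
    }
    std::printf("parallel + plain for: %lld iterations executed\n", total);

    total = 0;
    // The iteration space is divided among the threads: 'total' ends up exactly 1000.
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        #pragma omp atomic
        total++;
    }
    std::printf("parallel for        : %lld iterations executed\n", total);
}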
You should try
int iteration_number = 0;   // shared result, declared before the parallel region

#pragma omp parallel num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
    double *C  = new double[n]; std::fill(C,  C + n,  3.0);
    double *D  = new double[n]; std::fill(D,  D + n,  3.0);
    double *CD = new double[n]; std::fill(CD, CD + n, 0.0);

    while (time_flag == 0) {
        #pragma omp for
        for (int i = 0; i < N; i++)
            for (int z = 0; z < m; z++)
                for (int x = 0; x < n; x++)
                    for (int c = 0; c < n; c++)
                        CD[c] = C[z] * D[x];
        iteration_number_local++;
        if ((omp_get_wtime() - total_clock) >= time_limit)
            time_flag = 1;
    }
    if (thread_id == 0)
        iteration_number = iteration_number_local;
}
std::cout << "Iterations= " << iteration_number << std::endl;
}

OpenMP latency: for inside for

I have a piece of code that I want to parallelize, and the OpenMP version is much slower than the serial version, so what is wrong with my implementation? This is the code of the program:
#include <iostream>
#include <gsl/gsl_math.h>
#include "Chain.h"

using namespace std;

int main() {
    int const N = 1000;
    int timeSteps = 100;
    double delta = 0.0001;
    double qq[N];
    Chain ch(N);
    ch.initCond();
    for (int t = 0; t < timeSteps; t++) {
        ch.changeQ(delta * t);
        ch.calMag_i();
        ch.calForce001();
    }
    ch.printSomething();
}
The Chain.h is
class Chain {
public:
    int N;
    double *q;
    double *mx;
    double *my;
    double *force;

    Chain(int const Np);
    void initCond();
    void changeQ(double delta);
    void calMag_i();
    void calForce001();
};
And the Chain.cpp is
#include <cmath>
#include <iostream>
#include <omp.h>
#include "Chain.h"

using namespace std;

Chain::Chain(int const Np) {
    this->N = Np;
    this->q = new double[Np];
    this->mx = new double[Np];
    this->my = new double[Np];
    this->force = new double[Np];
}

void Chain::initCond() {
    for (int i = 0; i < N; i++) {
        q[i] = 0.0;
        force[i] = 0.0;
    }
}

void Chain::changeQ(double delta) {
    int i = 0;
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++) {
            q[i] = q[i] + delta * i + 1.0 * i / N;
        }
    }
}

void Chain::calMag_i() {
    int i = 0;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            mx[i] = cos(q[i]);
            my[i] = sin(q[i]);
        }
    }
}

void Chain::calForce001() {
    int i;
    int j;
    double fij = 0.0;
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp for private(j, fij)
        for (i = 0; i < N; i++) {
            force[i] = 0.0;
            for (j = 0; j < i; j++) {
                fij = my[i] * mx[j] - mx[i] * my[j];
                #pragma omp critical
                {
                    force[i] += fij;
                    force[j] += -fij;
                }
            }
        }
    }
    double time = omp_get_wtime() - start_time;
    cout << "time = " << time << endl;
}
So the methods changeQ() and calMag_i() are in fact faster than the serial code, but my problem is calForce001(). The execution times are:
with OpenMP: 3.939 s
without OpenMP: 0.217 s
Now, clearly I'm doing something wrong or the code can't be parallelized. Any help would be useful.
Thanks in advance.
Carlos
Edit:
In order to clarify the question, I added omp_get_wtime() calls to measure the execution time of the function calForce001(); the times for one execution are:
with OpenMP: 0.0376656
without OpenMP: 0.00196766
So with OpenMP the method is about 20 times slower.
I also measured the time for the calMag_i() method:
with OpenMP: 3.3845e-05
without OpenMP: 9.9516e-05
For this method OpenMP is about 3 times faster.
I hope this confirms that the latency problem is in the calForce001() method.
There are three reasons why you don't benefit from any speedup.
you have #pragma omp parallel all over your code. What this pragma does is start a "team of threads". At the end of the block, this team is disbanded. This is quite costly. Removing those and using #pragma omp parallel for instead of #pragma omp for will start the team upon first encounter and put it to sleep after each block. This made the application 4x faster for me.
you use #pragma omp critical. On most platforms, this will force the use of a mutex - which is heavily contended because all threads want to write to that variable at the same time. So, don't use a critical section here. You could use atomic updates, but in this case, that won't make much of a difference - see third item. Just removing the critical section improved the speed by another 3x.
Parallelism only makes sense when you have an actual workload. All of your code is too small to benefit from parallelism. There's simply too little workload to win back the time lost on starting/waking/destroying the threads. If your workload were ten times this, some of the parallel for statements would make sense. But especially Chain::calForce001() will never be worth it if you have to do atomic updates.
With respect to programming style: you're programming in C++. Please use local-scope variables wherever you can - e.g. in Chain::calForce001(), use a local double fij inside the inner loop. That saves you from having to write private clauses. Compilers are smart enough to optimize that; correct scoping allows for better optimizations.
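Building on the last two points, one possible race-free restructuring of Chain::calForce001() (my own sketch, not taken from the answer above): because fij is antisymmetric (the j,i term is the negative of the i,j term) and the i == j term is zero, the accumulated force[i] equals my[i]*sum(mx) - mx[i]*sum(my). Each iteration then writes only its own force[i], so neither critical nor atomic is needed.
void Chain::calForce001() {
    double start_time = omp_get_wtime();
    double sum_mx = 0.0, sum_my = 0.0;

    #pragma omp parallel
    {
        // First accumulate the two global sums with a reduction ...
        #pragma omp for reduction(+:sum_mx, sum_my)
        for (int i = 0; i < N; i++) {
            sum_mx += mx[i];
            sum_my += my[i];
        }
        // ... then every iteration touches only force[i]: no synchronization required.
        #pragma omp for
        for (int i = 0; i < N; i++)
            force[i] = my[i] * sum_mx - mx[i] * sum_my;
    }

    double time = omp_get_wtime() - start_time;
    std::cout << "time = " << time << std::endl;
}
Note that this also reduces the work from O(N^2) to O(N), so its timing is not a like-for-like comparison with the original loop; the point is only that restructuring the algorithm can remove the need for synchronization entirely.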