OpenMP: parallel for(i;...) and i value - c++

I have the following parallel snippet:
#include <omp.h>
#include <stdio.h>

int main()
{
    omp_set_num_threads(4);
    int i;
    #pragma omp parallel private(i)
    {
        #pragma omp for
        for (i = 0; i < 10; i++) {
            printf("A %d: %d\n", omp_get_thread_num(), i);
        }
        #pragma omp critical
        printf("i %d: %d\n", omp_get_thread_num(), i);
    }
}
I thought that after the loop, each thread would be left with i equal to the last value of i from that thread's chunk of the loop. My desired output would be:
A 0: 0
A 0: 1
A 0: 2
A 3: 9
A 2: 6
A 2: 7
A 2: 8
A 1: 3
A 1: 4
A 1: 5
i 0: 3
i 3: 10
i 2: 9
i 1: 6
whereas what I get is:
A 0: 0
A 0: 1
A 0: 2
A 3: 9
A 2: 6
A 2: 7
A 2: 8
A 1: 3
A 1: 4
A 1: 5
i 0: -1217085452
i 3: -1217085452
i 2: -1217085452
i 1: -1217085452
How can I make i hold the last iteration's value in each thread? lastprivate(i) makes i = 10 for all threads, which is not what I want.

It turns out you can't. OpenMP alters the program's semantics here.
Parallel for loops are rewritten by the compiler according to a well-defined set of rules.
This also implies that you cannot break from or return from such a loop, and you cannot directly manipulate the loop variable. The loop condition cannot call arbitrary functions or contain an arbitrary conditional expression. In short: an omp parallel for loop is not an ordinary for loop.
#include <omp.h>
#include <stdio.h>

int main()
{
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int i;
        #pragma omp for
        for (i = 0; i < 10; i++) {
            printf("A %d: %d\n", omp_get_thread_num(), i);
        }
        #pragma omp critical
        printf("i %d: %d\n", omp_get_thread_num(), i);
    }
}

Thanks to sehe's post, I figured out the following dirty trick that solves the problem:
int i, last_i;
#pragma omp parallel private(i, last_i)  // last_i must be private too, or the threads race on it
{
    #pragma omp for
    for (i = 0; i < 10; i++) {
        printf("A %d: %d\n", omp_get_thread_num(), i);
        last_i = i;
    }
    #pragma omp critical
    printf("i %d: %d\n", omp_get_thread_num(), last_i);
}

Related

OpenMP parallelise std::next_permutation

I have written the function scores_single, which generates every permutation of a board and scores it.
I tried to follow this SO answer to parallelise the function using OpenMP, and came up with scores_parallel. The problem is that the parallel version visits the same permutations in multiple tasks.
The code follows:
#include <algorithm>
#include <iostream>
#include <omp.h>
#include <vector>

// Single threaded version
int scores_single(const int tokens, const int board_length) {
    std::vector<int> scores;
    // Generate boards
    std::vector<char> board(board_length);
    std::fill(board.begin(), board.end() - tokens, '-');
    std::fill(board.end() - tokens, board.end(), 'T');
    do {
        // printf("Board: %s\n", std::string(board.data(), board.size()).c_str());
        // Score board
        auto value = 3;
        scores.push_back(value);
    } while (std::next_permutation(board.begin(), board.end()));
    return scores.size();
}

// OpenMP parallel version
int scores_parallel(const int tokens, const int board_length) {
    std::vector<std::vector<int>*> score_lists(board_length);
    // Generate boards
    std::vector<char> board(board_length);
    std::fill(board.begin(), board.end() - tokens, '-');
    std::fill(board.end() - tokens, board.end(), 'T');
    printf("Starting\n");
    #pragma omp parallel default(none) shared(board, board_length, score_lists)
    {
        #pragma omp single nowait
        for (int i = 0; i < board_length; ++i) {
            #pragma omp task untied
            {
                auto *scores = new std::vector<int>;
                // Make a copy of the board for this task
                auto board_thread = board;
                // Subset for this task, see: https://stackoverflow.com/questions/30865231/parallel-code-for-next-permutation
                std::rotate(board_thread.begin(), board_thread.begin() + i, board_thread.begin() + i + 1);
                do {
                    printf("[%02d] board: %s\n", i, std::string(board_thread.data(), board_thread.size()).c_str());
                    // Score board
                    auto value = 3;
                    scores->push_back(value);
                } while (std::next_permutation(board_thread.begin() + 1, board_thread.end()));
                score_lists[i] = scores;
                printf("[%02d] Finished on thread %d with %zu values\n", i, omp_get_thread_num(), scores->size());
            }
        }
    }
    std::vector<int> scores;
    for (auto &list : score_lists) {
        for (int j : *list) {
            scores.push_back(j);
        }
    }
    printf("Finished, size: %zu\n", scores.size());
    return scores.size();
}

int main() {
    int p = scores_parallel(2, 4);
    int s = scores_single(2, 4);
    std::cout << p << " != " << s << std::endl;
    return 0;
}
Output:
Starting
[01] board: --TT
[03] board: T--T
[03] board: T-T-
[02] board: T--T
[03] board: TT--
[01] board: -T-T
[02] board: T-T-
[00] board: --TT
[03] Finished on thread 10 with 3 values
[00] board: -T-T
[00] board: -TT-
[02] board: TT--
[00] Finished on thread 11 with 3 values
[01] board: -TT-
[02] Finished on thread 12 with 3 values
[01] Finished on thread 4 with 3 values
Finished, size: 12
12 != 6
I think I understand the SO answer I am copying, but I am not sure what I have done wrong.
6 is the expected answer, since 4C2 = 6.
Figured it out: I was enumerating the same permutations multiple times. I fixed it with the if statement below.
#include <algorithm>
#include <iostream>
#include <omp.h>
#include <vector>

// scores_single is unchanged from the question
int scores_single(const int tokens, const int board_length);

// OpenMP parallel version
int scores_parallel(const int tokens, const int board_length) {
    std::vector<std::vector<int>*> score_lists(board_length);
    // Generate boards
    std::vector<char> board(board_length);
    std::fill(board.begin(), board.end() - tokens, '-');
    std::fill(board.end() - tokens, board.end(), 'T');
    printf("Starting\n");
    #pragma omp parallel default(none) shared(board, board_length, score_lists)
    {
        #pragma omp single nowait
        for (int i = 0; i < board_length; ++i) {
            #pragma omp task untied
            {
                auto *scores = new std::vector<int>;
                // No need to process this branch if it starts from the same board
                // as another branch (the i + 1 == board_length check also keeps
                // the last index from reading past the end of the vector).
                if (i + 1 == board_length || board[i] != board[i + 1]) {
                    // Make a copy of the board for this task
                    auto board_thread = board;
                    printf("[%02d] evaluating: %s\n", i, std::string(board_thread.data(), board_thread.size()).c_str());
                    // Subset for this task, see: https://stackoverflow.com/questions/30865231/parallel-code-for-next-permutation
                    std::rotate(board_thread.begin(), board_thread.begin() + i, board_thread.begin() + i + 1);
                    do {
                        printf("[%02d] board: %s\n", i, std::string(board_thread.data(), board_thread.size()).c_str());
                        // Score board
                        auto value = 3;
                        scores->push_back(value);
                    } while (std::next_permutation(board_thread.begin() + 1, board_thread.end()));
                }
                score_lists[i] = scores;
                printf("[%02d] Finished on thread %d with %zu values\n", i, omp_get_thread_num(), scores->size());
            }
        }
    }
    std::vector<int> scores;
    for (auto &list : score_lists) {
        for (int j : *list) {
            scores.push_back(j);
        }
    }
    printf("Finished, size: %zu\n", scores.size());
    return scores.size();
}

int main() {
    int p = scores_parallel(2, 4);
    int s = scores_single(2, 4);
    std::cout << p << " == " << s << std::endl;
    return p != s;
}

Avoid calling omp_get_thread_num() in parallel for loop with simd

What is the performance cost of calling omp_get_thread_num(), compared to looking up the value of a variable?
How can I avoid calling omp_get_thread_num() many times in an OpenMP simd loop?
I can use #pragma omp parallel, but will that still give me a simd loop?
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    const int a_size = static_cast<int>(a.size());
    #pragma omp for simd
    for (int i = 0; i < a_size; ++i) {
        a[i] = omp_get_thread_num();
    }
}
I wouldn't be too worried about the cost of the call, but for code clarity you can do:
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    const int a_size = static_cast<int>(a.size());
    #pragma omp parallel
    {
        const auto threadId = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < a_size; ++i) {
            a[i] = threadId;
        }
    }
}
As long as you use #pragma omp for (and don't put an extra `parallel` in there! Otherwise each of your n threads will spawn n more threads, which is bad), it will ensure that inside your parallel region the for loop is split up amongst the n threads. Make sure the OpenMP compiler flag (e.g. -fopenmp) is enabled.

How does pragma and omp make a difference in these two codes producing same output?

Initially the value of ab is 10; then, after some delay created by the first for loop, ab is set to 55 and printed in this code:
#include <iostream>
using namespace std;

int main()
{
    long j, i;
    int ab = 10;
    for (i = 0; i < 1000000000; i++) ;  // busy-wait delay
    ab = 55;
    cout << "\n----------------\n";
    for (j = 0; j < 100; j++)
        cout << endl << ab;
    return 0;
}
The purpose of this code is the same, but what I expected was that the second block would start printing 10 while the other thread was still in its delay loop, and only later print 55 (multithreading). Instead, the second block prints only after the delay created by the first loop has elapsed, and it prints only 55.
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    long j, i;
    int ab = 10;
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (i = 0; i < 1000000000; i++) ;
            ab = 55;
        }
        #pragma omp barrier
        cout << "\n----------------\n";
        #pragma omp single
        {
            for (j = 0; j < 100; j++)
                cout << endl << ab;
        }
    }
    return 0;
}
So you want to "observe the race condition" by changing the value of a variable in a first region and printing the value from a second region.
There are a couple of things that prevent you from achieving this.
The first (and explicitly stated) is the #pragma omp barrier. This construct tells the runtime that the threads executing the #pragma omp parallel region must wait until all threads of the team have arrived. This barrier forces both threads to reach that point before continuing, so by then ab already has the value 55.
The second (stated implicitly) is the #pragma omp single, which carries an implicit barrier at its end (no nowait clause was given), so the team of threads running the parallel region waits until the single region has finished. Again, this means that ab will have the value 55 once the first single region is done.
To try to observe the race (and note the "try": whether you see it varies from run to run, depending on several factors such as OS thread scheduling, OpenMP thread scheduling and available hardware resources), you can give this alternative version a try:
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    long j, i;
    int ab = 10;
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            for (i = 0; i < 1000000000; i++) ;
            ab = 55;
        }
        cout << "\n----------------\n";
        #pragma omp single
        {
            for (j = 0; j < 100; j++)
                cout << endl << ab;
        }
    }
    return 0;
}
BTW, rather than iterating over a long trip count in your loops, you could use calls such as sleep/usleep.

openMP exercise omp_bug2.c

This is an exercise (omp_bug2.c) from the LLNL OpenMP tutorial:
https://computing.llnl.gov/tutorials/openMP/exercise.html
#include "stdafx.h"
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int _tmain(int argc, _TCHAR* argv[])
{
    int nthreads, i, tid;
    float total;

    /*** Spawn parallel region ***/
    #pragma omp parallel private(i, tid) // i changed this line
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d is starting...\n", tid);
        #pragma omp barrier

        /* do some work */
        total = 0.0;
        #pragma omp for schedule(dynamic,10)
        for (i = 0; i < 1000000; i++)
            total = total + i*1.0;
        printf("Thread %d is done! Total= %e\n", tid, total);
    }
}
the output for this is
Number of threads = 4
Thread 0 is starting...
Thread 3 is starting...
Thread 2 is starting...
Thread 1 is starting...
Thread 0 is done! Total= 0.000000e+000
Thread 3 is done! Total= 0.000000e+000
Thread 2 is done! Total= 0.000000e+000
Thread 1 is done! Total= 0.000000e+000
which means we have a problem with the variable total.
Here is my solution: do you think this is the correct way to do it?
#include "stdafx.h"
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int _tmain(int argc, _TCHAR* argv[])
{
    int nthreads, i, tid;
    float total;

    /*** Spawn parallel region ***/
    #pragma omp parallel private(total, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        total = 0.0;
        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d is starting...\n", tid);
        #pragma omp parallel for schedule(static,10) \
            private(i) \
            reduction(+:total)
        for (i = 0; i < 1000000; i++)
            total = total + i*1.0;
        printf("Thread %d is done! Total= %e\n", tid, total);
    } /*** End of parallel region ***/
}
Here is my new output:
Number of threads = 4
Thread 0 is starting...
Thread 1 is starting...
Thread 0 is done! Total= 4.999404e+011
Thread 2 is starting...
Thread 1 is done! Total= 4.999404e+011
Thread 2 is done! Total= 4.999404e+011
Thread 3 is starting...
Thread 3 is done! Total= 4.999404e+011
Yes, you certainly want total to be a thread-private variable. One thing you would presumably do in a real example is reduce the thread-private totals into a single global total at the end (and only let one thread print the result then). One way to do that is a simple
#pragma omp atomic
global_total += total;
at the end (though there are better ways, using reductions).
PS: Loop counters of an omp for are private by default, so you don't actually have to specify that explicitly.

Couldn't get acceleration OpenMP

I am writing a simple parallel program in C++ using OpenMP.
I am working on Windows 7 with Microsoft Visual Studio 2010 Ultimate.
I set the project's Language property "OpenMP Support" to "Yes (/openmp)" to enable OpenMP.
#include <iostream>
#include <cstdlib>
#include <omp.h>
using namespace std;

double sum;
int i;
int n = 800000000;

int main(int argc, char *argv[])
{
    omp_set_dynamic(0);
    omp_set_num_threads(4);
    sum = 0;
    #pragma omp for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += i/(n/10);
    cout << "sum=" << sum << endl;
    return EXIT_SUCCESS;
}
But I couldn't get any speedup by changing the x in omp_set_num_threads(x);
whether I use OpenMP or not, the computation time is the same, about 7 seconds.
Does someone know what the problem is?
Your pragma statement is missing the parallel specifier:
#include <iostream>
#include <cstdlib>
#include <omp.h>
using namespace std;

double sum;
int i;
int n = 800000000;

int main(int argc, char *argv[])
{
    omp_set_dynamic(0);
    omp_set_num_threads(4);
    sum = 0;
    #pragma omp parallel for reduction(+:sum) // add "parallel"
    for (i = 0; i < n; i++)
        sum += i/(n/10);
    cout << "sum=" << sum << endl;
    return EXIT_SUCCESS;
}
Sequential:
sum=3.6e+009
2.30071
Parallel:
sum=3.6e+009
0.618365
Here's a version that gets some additional speedup from Hyperthreading. I had to increase the number of iterations by 10x and bump the datatypes to long long:
#include <iostream>
#include <cstdlib>
#include <omp.h>
using namespace std;

double sum;
long long i;
long long n = 8000000000LL;

int main(int argc, char *argv[])
{
    omp_set_dynamic(0);
    omp_set_num_threads(8);
    double start = omp_get_wtime();
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += i/(n/10);
    cout << "sum=" << sum << endl;
    double end = omp_get_wtime();
    cout << end - start << endl;
    system("pause");
    return EXIT_SUCCESS;
}
Threads: 1
sum=3.6e+010
13.0541
Threads: 2
sum=3.6e+010
6.62345
Threads: 4
sum=3.6e+010
3.85687
Threads: 8
sum=3.6e+010
3.285
Apart from the error pointed out by Mystical, you seem to assume that OpenMP can just do magic. At best it can use all the cores on your machine. If you have 2 cores, it may cut the execution time in half when you call omp_set_num_threads(np) with np >= 2, but for np much larger than the number of cores the code becomes inefficient due to parallelization overheads.
The example from Mystical was apparently run on at least 4 cores with np = 4.