I use OpenMP to parallelize my C++ program. My parallel code has a very simple form:
#pragma omp parallel for shared(a, b, c) private(i, result)
for (i = 0; i < N; i++) {
    result = F(a, b, c, i); // do some calculation
    cout << i << " " << result << endl;
}
If two threads try to write to the file simultaneously, the data gets mixed up.
How can I solve this problem?
OpenMP provides pragmas to help with synchronisation. #pragma omp critical allows only one thread to execute the attached statement at any time (a mutual exclusion critical region). The #pragma omp ordered pragma ensures that the threads running the loop iterations enter the region in iteration order.
// g++ -std=c++11 -Wall -Wextra -pedantic -fopenmp critical.cpp
#include <iostream>

int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
        std::cout << "unsynchronized(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
        #pragma omp critical
        std::cout << "critical(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for ordered
    for (int i = 0; i < 20; ++i)
        #pragma omp ordered
        std::cout << "ordered(" << i << ") ";
    std::cout << std::endl;

    return 0;
}
Example output (different each time in general):
unsynchronized(unsynchronized(unsynchronized(05) unsynchronized() 6unsynchronized() 1unsynchronized(7) ) unsynchronized(unsynchronized(28) ) unsynchronized(unsynchronized(93) ) unsynchronized(4) 10) unsynchronized(11) unsynchronized(12) unsynchronized(15) unsynchronized(16unsynchronized() 13unsynchronized() 17) unsynchronized(unsynchronized(18) 14unsynchronized() 19)
critical(5) critical(0) critical(6) critical(15) critical(1) critical(10) critical(7) critical(16) critical(2) critical(8) critical(17) critical(3) critical(9) critical(18) critical(11) critical(4) critical(19) critical(12) critical(13) critical(14)
ordered(0) ordered(1) ordered(2) ordered(3) ordered(4) ordered(5) ordered(6) ordered(7) ordered(8) ordered(9) ordered(10) ordered(11) ordered(12) ordered(13) ordered(14) ordered(15) ordered(16) ordered(17) ordered(18) ordered(19)
The problem is: you have a single resource all threads try to access. Such a single resource must be protected against concurrent access (thread-safe resources do this, too, just transparently for you; by the way, here is a nice answer about thread safety of std::cout). You could protect this single resource, e.g. with a std::mutex. The problem then is that the threads have to wait for the mutex until the other thread gives it back again, so you will only profit from parallelisation if F is a sufficiently expensive function.
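A minimal sketch of that mutex approach (F is not shown in the question, so it is stubbed here with a trivial calculation):

// g++ -std=c++11 -fopenmp mutex_cout.cpp
#include <iostream>
#include <mutex>

std::mutex cout_mutex;   // guards std::cout

int main()
{
    const int N = 20;
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        double result = i * 0.5;                      // stand-in for F(a, b, c, i)
        std::lock_guard<std::mutex> lock(cout_mutex); // only one thread prints at a time
        std::cout << i << " " << result << "\n";
    }
    return 0;
}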
A further drawback: as the threads work in parallel, even with a mutex protecting std::cout, the results can be printed in arbitrary order, depending on which thread happens to run earlier.
If I may assume that you want the results of F(..., i) for smaller i before those for greater i, you should either drop parallelisation entirely or do it differently:
Provide an array of size N and let each thread store its result there (array[i] = F(..., i);). Then iterate over the array in a separate, non-parallel loop. Again, doing so is only worth the effort if F is an expensive function (and N is large).
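A sketch of that variant (again with F stubbed, since its real signature is not given):

// g++ -std=c++11 -fopenmp results_array.cpp
#include <iostream>
#include <vector>

int main()
{
    const int N = 1000;
    std::vector<double> results(N);

    // parallel phase: each iteration writes only to its own slot
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        results[i] = 2.0 * i;        // stand-in for F(a, b, c, i)

    // serial phase: print the results in order of i
    for (int i = 0; i < N; ++i)
        std::cout << i << " " << results[i] << "\n";

    return 0;
}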
Additionally: be aware that threads must be created, which causes some overhead (setting up thread infrastructure and stacks, registering threads with the OS, ... unless you can reuse threads already created in a thread pool earlier). Consider this, too, when deciding whether to parallelise. Sometimes non-parallel calculations can be faster...
Related
OpenMP used to run my project on 6 threads, and now (I have no idea why) the program is single-threaded.
My code is pretty simple; I only use OpenMP in one cpp file, where I declared
#include <omp.h>
The function to parallelize is:
#pragma omp parallel for collapse(2) num_threads(IntervalMapEstimator::m_num_thread)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        //use for debug
        omp_set_num_threads (5);
        std::cout << "omp_get_num_threads = " << omp_get_num_threads() << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads() << std::endl;

        if(split_points) {
            extract_relevant_points_from_angle_lists(relevant_points, pointcloud_ff_polar_angle_lists, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        } else {
            extract_relevant_points_multithread_with_localvector(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        }
    }
}
omp_get_num_threads returns 1
omp_get_max_threads returns 5
IntervalMapEstimator::m_num_thread is set to 6
Any lead would be greatly appreciated.
EDIT 1:
I modified my code but the problem remains; the program still runs on a single thread.
omp_get_num_threads returns 1
omp_get_max_threads returns 8
Is there a way to know how many threads are available at run time?
#pragma omp parallel for collapse(2)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        std::cout << "omp_get_num_threads = " << omp_get_num_threads() << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads() << std::endl;
        extract_relevant_points(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
    }
}
I just saw that my computer is beginning to run low on memory; could that be part of the problem?
According to https://msdn.microsoft.com/en-us/library/bx15e8hb.aspx:
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads requested for the parallel region exceeds the number that the run-time system can supply, the behavior of the program is implementation-defined. An implementation may, for example, interrupt the execution of the program, or it may serialize the parallel region.
You request 6 threads, the implementation can only provide 5, so it is free to do what it wants.
I'm also pretty sure you are not supposed to change the number of threads while inside a parallel region, so your omp_set_num_threads will do nothing at best and blow up in your face at worst.
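If the goal is to control the thread count, a sketch of the usual pattern (with placeholder loop bounds and a dummy body) is to request it before the parallel region, or directly on the construct:

#include <iostream>
#include <omp.h>

int main()
{
    omp_set_num_threads(5);                  // request the count *before* the region

    #pragma omp parallel for collapse(2)     // or: #pragma omp parallel for collapse(2) num_threads(5)
    for (int x = 0; x < 4; x++)              // placeholder bounds
        for (int y = 0; y < 4; y++)
        {
            #pragma omp critical
            std::cout << "thread " << omp_get_thread_num()
                      << " of " << omp_get_num_threads()
                      << " handles (" << x << "," << y << ")\n";
        }
    return 0;
}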
I found the answer thanks to another post: Why does the compiler ignore OpenMP pragmas?
In the end it was simply a compiler flag I hadn't added, and I didn't notice it because I compile with cmake, so I don't type the compiler command directly. Also, I compile with catkin_make, so only errors (not warnings) show up in the console.
So basically, to use OpenMP you have to pass -fopenmp to your compiler, and if you don't... well, the OpenMP lines are just ignored by the compiler.
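For reference, if you build with CMake, one way to get the flag added is via the FindOpenMP module (the target name my_target below is just a placeholder):

# CMakeLists.txt fragment
find_package(OpenMP REQUIRED)
target_link_libraries(my_target PUBLIC OpenMP::OpenMP_CXX)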
Please consider the following simple code for summing up values in a parallel for loop:
int nMaxThreads = omp_get_max_threads();
int nTotalSum = 0;
#pragma omp parallel for num_threads(nMaxThreads) \
reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
nTotalSum += i;
cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}
When I run this on a two-core machine, the output I get is
0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5
This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on every loop iteration. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding and then update nTotalSum with this 'local' sum once it has done so.
Is my interpretation of the output correct, and if so, how can I make it more efficient? Note I tried the following:
#pragma omp parallel for num_threads(nMaxThreads) \
reduction(+:nTotalSum)
int nLocalSum = 0;
for (int i = 0; i < 4; i++)
{
nLocalSum += i;
}
nTotalSum += nLocalSum;
...but the compiler complained stating that it was expecting a for loop following the pragma omp parallel for statement...
Your output does not, in fact, indicate a critical section during the loop. Each thread has its own zero-initialized copy, with thread 0 working on i = 0,1 and thread 1 working on i = 2,3. At the end, OpenMP takes care of adding the local copies to the original.
You should not try to implement it yourself unless you have specific evidence that you can do it more efficiently. See for example this question / answer.
Your manual version would work if you split the parallel / for into two directives:
int nTotalSum = 0;
#pragma omp parallel
{
    // Declare the local variable here!
    // Then it's implicitly private and properly initialized
    int localSum = 0;
    #pragma omp for
    for (int i = 0; i < 4; i++) {
        localSum += i;
        cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
    }
    // Do not forget the atomic, or it would be a race condition!
    // An alternative would be a critical section, but that's less efficient
    #pragma omp atomic
    nTotalSum += localSum;
}
I think it's likely that your OpenMP implementation does the reduction just like that.
Each OMP thread has its own copy of nTotalSum. At the end of the OMP section these are combined back into the original nTotalSum. The output you're seeing comes from running loop iterations (0,1) in one thread, and (2,3) in another thread. If you output nTotalSum at the end of your loop, you should see the expected result of 6.
In your nLocalSum example, move the declaration of nLocalSum to before the #pragma omp line. The for loop must be on the line immediately following the pragma.
From my parallel programming in OpenMP book:
The reduction clause can be trickier to understand; it has both private and shared storage behavior. The reduction attribute is used on objects that are the target of an arithmetic reduction. This can be important in many applications... reduction allows it to be implemented by the compiler efficiently... this is such a common operation that OpenMP has the reduction data scope clause just to handle it... the most common example is the final summation of temporary local variables at the end of the parallel construct.
A correction to your second example:
total_sum = 0; /* do all variable initialization prior to the omp pragma */

#pragma omp parallel for reduction(+:total_sum)
for (int i = 0; i < 4; i++)
{
    total_sum += i; /* you used nLocalSum here */
}
/* The loop variable i declared in the for statement is private automatically,
   so no private(i) clause is needed, and in C/C++ there is no
   "#pragma omp end parallel for"; the construct ends with the loop.

   At this point in the code all threads have done their share of the `for` loop,
   each with its own private copy of total_sum. OpenMP then '+'s together the
   per-thread values because we used reduction; do not do an explicit
   nTotalSum += nLocalSum after the omp for loop, the reduction clause takes
   care of this. */
In your first example, I'm not sure what num_threads(nMaxThreads) is doing in #pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum), but I suspect the weird output might be caused by print buffering.
In any case, the reduction clause is very useful and very efficient if used properly. It would be more obvious in a more complicated, real-world example.
Your posted example is so simple that it doesn't show off the usefulness of the reduction clause. Strictly speaking, for your example, since all threads are just doing a summation, the most straightforward way would be to make total_sum a shared variable in the parallel section and have all threads add into it; at the end the answer would still be correct, provided the update is guarded by a critical directive.
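For illustration, a sketch of that shared-variable-plus-critical variant (reduction remains the idiomatic way):

#include <iostream>

int main()
{
    int total_sum = 0;             // shared by all threads
    #pragma omp parallel for
    for (int i = 0; i < 4; i++)
    {
        #pragma omp critical
        total_sum += i;            // only one thread updates at a time
    }
    std::cout << "total_sum is " << total_sum << std::endl;  // prints 6
    return 0;
}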
I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>

mpz_class collatz(mpz_class n) {
    if (mpz_odd_p(n.get_mpz_t())) {
        n *= 3;
        n += 1;
    } else {
        n /= 2;
    }
    return n;
}

int main() {
    mpz_class x = 1;
    #pragma omp parallel
    while (true) {
        //std::cout << x.get_str(10);
        while (true) {
            if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
            x = collatz(x);
        }
        x++;
        //std::cout << " OK" << std::endl;
    }
}
Given that I do not get this error when I uncomment the outputs to the screen, which are slow, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?
I assume what you want to do is to check if the collatz conjecture holds for all numbers. The program you posted is wrong on many levels both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
Means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You have to have two variables anyway, one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for that:
void check_collatz(mpz_class n) {
    while (n != 1) {
        n = collatz(n);
    }
}

int main() {
    mpz_class x = 1;
    while (true) {
        std::cout << x.get_str(10);
        check_collatz(x);
        x++;
    }
}
The while (true) loop is bad to reason about and parallelize, so let's just make an equivalent for loop:
for (mpz_class x = 1;; x++) {
check_collatz(x);
}
Now, we can talk about parallelizing the code. The basis for OpenMP parallelizing is a worksharing construct. You cannot just slap #pragma omp parallel on a while loop. Fortunately you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
check_collatz(check);
}
Note that check is implicitly private, there is a copy for each thread working on it. Also OpenMP will take care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz a new, private, mpz_class object will be created for it.
Now, you might notice, that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can reuse that (by breaking check_collatz again) and creating a thread-private mpz_class working object. For this, you split the compound parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>

// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
    if (mpz_odd_p(n.get_mpz_t()))
    {
        n *= 3;
        n += 1;
    }
    else
    {
        n /= 2;
    }
}

int main()
{
    #pragma omp parallel
    {
        mpz_class x;
        #pragma omp for
        for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
        {
            // Note: The structure of this fits perfectly in a for loop.
            for (x = check; x != 1; collatz(x));
        }
    }
}
Note that declaring x inside the parallel region will make sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private. This often leads to confusion because explicitly private variables from an outside scope are uninitialized.
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
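For example (a sketch), the worksharing loop in the listing above could be given an explicit schedule; the chunk size of 4096 is an arbitrary choice:

// idle threads grab the next chunk of 4096 iterations as they finish their current one
#pragma omp for schedule(dynamic, 4096)
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
    for (x = check; x != 1; collatz(x));
}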
Edit: For the sake of academic interest, here is one idea of how to implement the worksharing directly on GMP objects:
#pragma omp parallel
{
    // Note: this is not a "parallel" loop;
    // each thread runs its own loop over a distinct, strided set of numbers
    int nthreads = omp_get_num_threads();

    mpz_class check = 1;
    // we already checked those in the other program
    check += std::numeric_limits<long>::max();
    check += omp_get_thread_num();

    mpz_class x;
    for (; ; check += nthreads)
    {
        // Note: The structure of this fits perfectly in a for loop.
        for (x = check; x != 1; collatz(x));
    }
}
You could well be right about collisions with x. You can mark x as private by:
#pragma omp parallel private(x)
This way each thread gets its own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are shared, so there is one instance shared between all of the threads.
You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
This ensures that all threads see the same value of x without requiring mutexes or other synchronization techniques.
I'm new to OpenMP and from what I have read about OpenMP 2.0, which comes standard with Microsoft Visual Studio 2010, global variables are considered troublesome and error prone when used in parallel programming. I have also been adopting this feeling since I have found very little on how to deal with global variables and static global variables efficiently, or at all for that matter.
I have this snippet of code which runs, but because of the local variable created in the parallel block I don't get the answer I'm looking for. I get 8 different printouts (because that's how many threads I have on my PC) instead of 1 answer. I know that it's because of the local variable "list" created in the parallel block, but this code will not run if I move the "list" variable and make it a global variable. Actually, the code does run, but it never gives me an answer back. This is the sample code that I would like to modify to use a global "list" variable:
#pragma omp parallel
{
    vector<int> list;
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        list.push_back(i);
    }
    cout << list.size() << endl;
}
Output:
6250
6250
6250
6250
6250
6250
6250
6250
They add up to 50000, but I did not get one answer of 50000; instead it's divided up.
Solution:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        cout << i << endl;
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;
According to the MSDN Documentation the parallel clause
Defines a parallel region, which is code that will be executed by
multiple threads in parallel.
And since the list variable is declared inside this section every thread will have its own list.
On the other hand, the for pragma
Causes the work done in a for loop inside a parallel region to be
divided among threads.
So the 50000 iterations will be split among threads but each thread will have its own list.
I think what you are trying to do can be achieved by:
Taking the list definition outside the "parallel" section.
Protect the list.push_back statement with a critical section.
Try this:
vector<int> list;
#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < 50000; i++)
    {
        #pragma omp critical
        {
            list.push_back(i);
        }
    }
}
cout << list.size() << endl;
I don't think you should get any speedup from OpenMP in this case because there will be contention for the critical section. A faster solution for this (if you don't care about the order of elements) would be for every thread to have its own list, and get those lists merged after the loop finishes. The implementation using std::list instead of std::vector would look cleaner in this case (because you wouldn't have to copy arrays).
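A sketch of that per-thread-list-then-merge idea (shown here with std::vector; the order of elements is not preserved):

#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<int> list;

    #pragma omp parallel
    {
        vector<int> local;                 // one private vector per thread
        #pragma omp for nowait
        for(int i = 0; i < 50000; i++)
        {
            local.push_back(i);
        }
        #pragma omp critical               // only the merge is serialized
        list.insert(list.end(), local.begin(), local.end());
    }

    cout << list.size() << endl;           // 50000
    return 0;
}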
Some apps are memory bound and not compute bound. Bottom line: check if you actually get a speedup from OpenMP.
Why do you need the first pragma here (#pragma omp parallel)? I think that's the issue.
I'm trying to parallelize a very simple for loop, but this is my first attempt at using OpenMP in a long time. I'm getting baffled by the run times. Here is my code:
#include <cmath>
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main ()
{
    int n=400000, m=1000;
    double x=0,y=0;
    double s=0;
    vector< double > shifts(n,0);

    #pragma omp parallel for
    for (int j=0; j<n; j++) {
        double r=0.0;
        for (int i=0; i < m; i++){
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}
I compile it with
g++ -O3 testMP.cc -o testMP -I /opt/boost_1_48_0/include
that is, no "-fopenmp", and I get these timings:
real 0m18.417s
user 0m18.357s
sys 0m0.004s
when I do use "-fopenmp",
g++ -O3 -fopenmp testMP.cc -o testMP -I /opt/boost_1_48_0/include
I get these numbers for the times:
real 0m6.853s
user 0m52.007s
sys 0m0.008s
which doesn't make sense to me. How can using eight cores result in only a 3-fold performance increase? Am I coding the loop correctly?
You should make use of the OpenMP reduction clause for x and y:
#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
With reduction each thread accumulates its own partial sum in x and y and in the end all partial values are summed together in order to obtain the final values.
Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total
See - superlinear speed-up :)
Let's try to understand how to parallelize a simple for loop using OpenMP.
#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
{
c[i] = a[i] + b[i];
}
Assume that we have 3 available threads; this is what will happen: first, the threads are assigned independent sets of iterations, and finally, the threads must wait at the end of the work-sharing construct.
Because this question is highly viewed, I decided to add a bit of OpenMP background to help those visiting it.
The #pragma omp parallel creates a parallel region with a team of threads, where each thread executes the entire block of code that the parallel region encloses.
From the OpenMP 5.1 standard one can read a more formal description:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region (..). The
thread that encountered the parallel construct becomes the primary
thread of the new team, with a thread number of zero for the duration
of the new parallel region. All threads in the new team, including the
primary thread, execute the region. Once the team is created, the
number of threads in the team remains constant for the duration of
that parallel region.
The #pragma omp parallel for creates a parallel region (as described before), and to the threads of that region the iterations of the loop that it encloses will be assigned, using the default chunk size and the default schedule, which is typically static. Bear in mind, however, that the default schedule might differ among different concrete implementations of the OpenMP standard.
From the OpenMP 5.1 standard you can read a more formal description:
The worksharing-loop construct specifies that the iterations of one or
more associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks. The iterations are
distributed across threads that already exist in the team that is
executing the parallel region to which the worksharing-loop region
binds.
Moreover,
The parallel loop construct is a shortcut for specifying a parallel
construct containing a loop construct with one or more associated
loops and no other statements.
Or informally, #pragma omp parallel for is a combination of the constructor #pragma omp parallel with #pragma omp for. In your case, this would mean that:
#pragma omp parallel for
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
A team of threads will be created, and to each of those threads will be assigned chunks of the iterations of the outermost loop.
To make it more illustrative, with 4 threads the #pragma omp parallel for with a chunk_size=1 and a static schedule would result in something like:
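Iterations are handed out round-robin:
thread 0: iterations 0, 4, 8, 12, ...
thread 1: iterations 1, 5, 9, 13, ...
thread 2: iterations 2, 6, 10, 14, ...
thread 3: iterations 3, 7, 11, 15, ...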
Code-wise the loop would be transformed to something logically similar to:
for(int i=omp_get_thread_num(); i < n; i+=omp_get_num_threads())
{
c[i]=a[i]+b[i];
}
where omp_get_thread_num()
The omp_get_thread_num routine returns the thread number, within the
current team, of the calling thread.
and omp_get_num_threads()
Returns the number of threads in the current team. In a sequential
section of the program omp_get_num_threads returns 1.
or in other words, for(int i = THREAD_ID; i < n; i += TOTAL_THREADS). With THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads of the team created on the parallel region.
Armed with this knowledge, and looking at your code, one can see that you have a race condition on the updates of the variables 'x' and 'y'. Those variables are shared among threads and updated inside the parallel region, namely:
x += rand_g1;
y += rand_g2;
To solve this race-condition you can use OpenMP' reduction clause:
Specifies that one or more variables that are private to each thread
are the subject of a reduction operation at the end of the parallel
region.
Informally, the reduction clause will create for each thread a private copy of the variables 'x' and 'y', and at the end of the parallel region perform the summation of all those 'x' and 'y' copies into the original 'x' and 'y' variables of the initial thread.
#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
What you can achieve at most(!) is a linear speedup.
I don't remember which is which with the times from Linux, but I'd suggest you use time.h or (in C++11) <chrono> and measure the runtime directly from the program. It's best to pack the entire code into a loop, run it 10 times, and average to get an approximate runtime.
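A minimal sketch of timing with <chrono> (wrap whatever work you want to measure):

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();

    // ... the work you want to time goes here ...

    auto end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();
    std::cout << "elapsed: " << seconds << " s\n";
    return 0;
}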
Furthermore, IMO you've got a problem with x and y, which do not adhere to the paradigm of data locality in parallel programming.