Reduction(op:var) has the same effect as shared(var) - c++

I tried this code snippet as a proof of concept for reduction(op:var). It worked fine and gave result = 656700:
int i, n, chunk;
float a[100], b[100], result;
/* Some initializations */
n = 100; chunk = 10; result = 0.0;
for (i=0; i < n; i++) {
a[i] = i * 1.0;
b[i] = i * 2.0;
}
//Fork has only for loop
#pragma omp parallel for default(shared) private(i) schedule(static,chunk) reduction(+:result)
for (i=0; i < n; i++)
result = result + (a[i] * b[i]);
printf("Final result= %f\n",result);
When I tried the same code without reduction(+:result), it gave me the same result, 656700!
I think this makes sense, since reduction relies on a shared variable; in other words, the shared clause would be sufficient for such an operation.
I am confused!

Reduction uses a shared variable visible to you, but private copies of the variable internally. When you forget the reduction clause, several threads may try to update the value of the reduction variable at the same time. That is a race condition. The result is likely to be wrong, and the code will also be slow because of the competition for the same resource.
With reduction, every thread has a private copy of the variable and works with it. When the reduction region finishes, the private copies are reduced using the reduction operator to the final shared variable.
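As a rough illustration of the private-copy-then-combine behaviour described above, here is a minimal sketch that writes the reduction out by hand and compares it with the clause. The function names dot_manual_reduction and dot_reduction are made up for this example; the real runtime may combine the partial sums differently.

```cpp
// Hand-written equivalent of reduction(+:result): each thread accumulates
// into its own private partial sum, and the partial sums are combined
// once per thread under mutual exclusion.
float dot_manual_reduction(const float* a, const float* b, int n) {
    float result = 0.0f;              // shared: holds the final value
    #pragma omp parallel
    {
        float local = 0.0f;           // private copy, one per thread
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += a[i] * b[i];
        #pragma omp critical
        result += local;              // combine step, serialized
    }
    return result;
}

// Same computation using the reduction clause from the question.
float dot_reduction(const float* a, const float* b, int n) {
    float result = 0.0f;
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; i++)
        result += a[i] * b[i];
    return result;
}
```

With only shared(result) and no reduction, the `result = result + ...` update is an unsynchronized read-modify-write, so a correct answer on a small input like this one is a matter of timing rather than a guarantee.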

shared clause would be sufficient for such operation.
Nope.
When you remove reduction(+:result), the program has a data race on the result variable and the outcome is unstable.
This means you may get a wrong result, or occasionally a correct one.

Related

How to run all threads in sequence as static without using OpenMP for?

I'm new to OpenMP and multi-threading.
I have been given a task to run a method with static, dynamic, and guided scheduling without using the OpenMP for construct, which means I can't use the schedule clause.
I could create parallel threads with parallel and assign loop iterations to threads equally,
but how do I make it static, dynamic (blocks of 1000), and guided?
void static_scheduling_function(const int start_count,
const int upper_bound,
int *results)
{
int i, tid, numt;
#pragma omp parallel private(i,tid)
{
int from, to;
tid = omp_get_thread_num();
numt = omp_get_num_threads();
from = (upper_bound / numt) * tid;
to = (upper_bound / numt) * (tid + 1) - 1;
if (tid == numt - 1)
to = upper_bound - 1;
for (i = from; i < to; i++)
{
//compute one iteration (i)
int start = i;
int end = i + 1;
compute_iterations(start, end, results);
}
}
}
======================================
For dynamic I have tried something like this:
void chunk_scheduling_function(const int start_count, const int upper_bound, int* results) {
int numt, shared_lower_iteration_counter=start_count;
for (int shared_lower_iteration_counter=start_count; shared_lower_iteration_counter<upper_bound;){
#pragma omp parallel shared(shared_lower_iteration_counter)
{
int tid = omp_get_thread_num();
int from,to;
int chunk = 1000;
#pragma omp critical
{
from= shared_lower_iteration_counter; // 10, 1010
to = ( shared_lower_iteration_counter + chunk ); // 1010,
shared_lower_iteration_counter = shared_lower_iteration_counter + chunk; // 1100 // critical is important while incrementing shared variable which decides next iteration
}
for(int i = from ; (i < to && i < upper_bound ); i++) { // 10 to 1009 , i< upperbound prevents other threads from executing call
int start = i;
int end = i + 1;
compute_iterations(start, end, results);
}
}
}
}
This looks like a university assignment (and a very good one IMO), so I will not provide the complete solution; instead I will point out what you should be looking for.
The static scheduler looks okay. Nevertheless, it can be improved by taking the chunk size into account as well.
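One detail in the question's static version may be worth double-checking: `to` is computed as an inclusive bound (`... - 1`) while the loop condition is `i < to`, so the last iteration of each block appears to be skipped, and `start_count` is never used. A minimal sketch with half-open ranges follows. The helper block_range and the hits vector are hypothetical names introduced for this example, and the `hits[i] += 1` line stands in for the call to compute_iterations.

```cpp
#include <utility>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: half-open range [from, to) owned by thread `tid`
// when [start, upper_bound) is split into `numt` contiguous blocks.
// The last thread also takes the remainder.
std::pair<int, int> block_range(int start, int upper_bound, int tid, int numt) {
    int per_thread = (upper_bound - start) / numt;
    int from = start + per_thread * tid;
    int to = (tid == numt - 1) ? upper_bound : from + per_thread;
    return {from, to};
}

// Marks every iteration it executes so that coverage can be checked.
void static_schedule(int start, int upper_bound, std::vector<int>& hits) {
    #pragma omp parallel
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();
        int numt = omp_get_num_threads();
#else
        int tid = 0, numt = 1;  // serial fallback when OpenMP is disabled
#endif
        std::pair<int, int> r = block_range(start, upper_bound, tid, numt);
        for (int i = r.first; i < r.second; i++)
            hits[i] += 1;  // stands in for compute_iterations(i, i + 1, results)
    }
}
```

Declaring tid and numt inside the parallel region makes them private without any clause, which also avoids every thread writing to a shared numt.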
The dynamic and guided schedulers can be implemented by using a variable (let us name it shared_iteration_counter) that marks the next loop iteration to be picked up by the threads. Therefore, when a thread needs to request a new task to work on (i.e., a new loop iteration), it queries that variable. In pseudo code it would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while(thread_current_iteration < MAX_SIZE)
{
// do work
thread_current_iteration = shared_iteration_counter++;
}
The pseudo code assumes a chunk size of 1 (i.e., shared_iteration_counter++); you will have to adapt it to your use case. Now, because that variable will be shared among threads, and every thread will be updating it, you need to ensure mutual exclusion during the updates of that variable. Fortunately, OpenMP offers several means to achieve that, for instance #pragma omp critical, explicit locks, and atomic operations. The latter is the better option for your use case:
#pragma omp atomic
shared_iteration_counter = shared_iteration_counter + 1;
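A plain atomic update protects the increment but does not hand the previous value back to the thread, which the pseudo code relies on. One way to get both in a single step is `atomic capture` (available since OpenMP 3.1). The following is only a sketch under that assumption; dynamic_schedule and hits are names invented for the example, and `hits[i] += 1` stands in for the real per-iteration work.

```cpp
#include <algorithm>
#include <vector>

// Sketch of dynamic scheduling without `omp for`: threads repeatedly claim
// the next `chunk` iterations from a shared counter until none are left.
void dynamic_schedule(int start, int upper_bound, int chunk, std::vector<int>& hits) {
    int shared_iteration_counter = start;  // next unclaimed iteration
    #pragma omp parallel shared(shared_iteration_counter)
    {
        while (true) {
            int from;
            // Atomically read the old value and advance the counter.
            #pragma omp atomic capture
            {
                from = shared_iteration_counter;
                shared_iteration_counter += chunk;
            }
            if (from >= upper_bound) break;
            int to = std::min(from + chunk, upper_bound);  // clamp the last chunk
            for (int i = from; i < to; i++)
                hits[i] += 1;  // stands in for the real per-iteration work
        }
    }
}
```

Because each thread keeps looping inside a single parallel region, the region is created once rather than once per chunk.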
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies the minimum size chunk to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case, you not only have to guarantee mutual exclusion for the variable that tracks the next loop iteration to be picked up by threads, but also for the chunk size variable, since it changes as well.
Without giving away too much, bear in mind that you may need to consider how to deal with edge cases, such as thread_current_iteration = 1000 and chunks_size = 1000 with MAX_SIZE = 1500. Here thread_current_iteration + chunks_size > MAX_SIZE, but there are still 500 iterations to be computed.
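For orientation only, here is one possible shape of the guided variant, assuming the chunk size is taken as remaining iterations divided by the number of threads and clamped both to a minimum and to what is left. The names guided_chunk, guided_schedule and hits are hypothetical, and real OpenMP runtimes are free to use a different formula.

```cpp
#include <algorithm>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: size of the next guided chunk given how many
// iterations remain. It shrinks as work runs out, never drops below
// min_chunk, and never reaches past the end of the loop.
int guided_chunk(int remaining, int num_threads, int min_chunk) {
    int size = std::max(remaining / num_threads, min_chunk);
    return std::min(size, remaining);
}

void guided_schedule(int start, int upper_bound, int min_chunk, std::vector<int>& hits) {
    int next = start;  // shared: next unclaimed iteration
    #pragma omp parallel shared(next)
    {
#ifdef _OPENMP
        int numt = omp_get_num_threads();
#else
        int numt = 1;  // serial fallback when OpenMP is disabled
#endif
        while (true) {
            int from, to;
            // The counter and the chunk size are read and updated together,
            // so a critical section is used instead of a single atomic.
            #pragma omp critical(guided_claim)
            {
                from = next;
                to = from + guided_chunk(upper_bound - from, numt, min_chunk);
                next = to;
            }
            if (from >= upper_bound) break;
            for (int i = from; i < to; i++)
                hits[i] += 1;  // stands in for the real per-iteration work
        }
    }
}
```

The clamp in guided_chunk covers the edge case mentioned above: with 500 iterations left and a minimum chunk of 1000, the thread claims exactly the remaining 500.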

OpenMP: Is array reduction always needed for updating an array in parallel?

I am quite new to OpenMP. I have the following simple loop that I want to run in parallel with OpenMP:
double rij[3];
double r;
#ifdef _OPENMP
#pragma omp parallel for private(rij,r)
#endif
for (int i=0; i<n; ++i)
{
for (int j=0; j<n; ++j)
{
if (i != j)
{
distance(X,rij,r,i,j);
V[i] += ke * Q[j] / r;
for (int k=0; k<3; ++k)
{
F[3*i+k] += ke * Q[j] * rij[k] / pow(r,3);
}
}
}
}
From what I understood, variables are shared by default which is why I only declared private(rij,r). But according to these questions (first second third), I should do array reduction in this case.
It's clear to me that if many threads need to sum to the same variable, this has to be done with #pragma omp parallel for reduction(+:A[:n]) for summing to array A of size n. This is what I do in another part of my code, and it works as expected.
However, in this case workers never have to sum to the same variable: every worker performs the sum on its own index i. Is it correct to do as I do in this case, i.e. not doing any array reduction and not using any critical section?
If my implementation is correct, I believe it would avoid the overhead of the critical section while being simpler code. Feel free to give your advice on how this could be better optimized.
Thank you
You don't need a reduction. Reductions are a convenience feature that saves you from rewriting the same code for a recurring problem (try to think of how you would implement a sum reduction without OpenMP).
What you are doing right now is working on parallel data (V[i]) that does not overlap between iterations (as you state in the question), because the work is divided over i itself. Furthermore, the writes to F[...] do not overlap either, because the index depends only on i and k.
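To make the "no overlap" argument concrete, here is a simplified sketch of the same access pattern. It is not the code from the question: positions are one-dimensional, the force array is left out, and the function name potential is made up for the example.

```cpp
#include <cmath>
#include <vector>

// Simplified stand-in for the loop in the question: each outer iteration i
// writes only to V[i], so iterations are independent and neither a
// reduction nor a critical section is needed.
void potential(const std::vector<double>& X, const std::vector<double>& Q,
               std::vector<double>& V, double ke) {
    const int n = static_cast<int>(X.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;  // declared inside the loop, hence private
        for (int j = 0; j < n; ++j) {
            if (i != j) {
                double r = std::fabs(X[i] - X[j]);
                sum += ke * Q[j] / r;
            }
        }
        V[i] += sum;  // only the thread handling i touches V[i]
    }
}
```

Declaring the temporaries inside the loop body, as with sum and r here, gives the same effect as the private(rij,r) clause in the question without needing the clause.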

Thread safety while looping with OpenMP

I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>
mpz_class collatz(mpz_class n) {
if (mpz_odd_p(n.get_mpz_t())) {
n *= 3;
n += 1;
} else {
n /= 2;
}
return n;
}
int main() {
mpz_class x = 1;
#pragma omp parallel
while (true) {
//std::cout << x.get_str(10);
while (true) {
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
x = collatz(x);
}
x++;
//std::cout << " OK" << std::endl;
}
}
Given that I did not get this error when I uncomment the outputs to screen, which are slow, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?
I assume what you want to do is to check if the collatz conjecture holds for all numbers. The program you posted is wrong on many levels both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
Means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You have to have two variables anyway, one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for that:
void check_collatz(mpz_class n) {
while (n != 1) {
n = collatz(n);
}
}
int main() {
mpz_class x = 1;
while (true) {
std::cout << x.get_str(10);
check_collatz(x);
x++;
}
}
The while (true) loop is bad to reason about and parallelize, so let's just make an equivalent for loop:
for (mpz_class x = 1;; x++) {
check_collatz(x);
}
Now, we can talk about parallelizing the code. The basis for OpenMP parallelizing is a worksharing construct. You cannot just slap #pragma omp parallel on a while loop. Fortunately you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check < std::numeric_limits<long>::max(); check++)
{
check_collatz(check);
}
Note that check is implicitly private, there is a copy for each thread working on it. Also OpenMP will take care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz a new, private, mpz_class object will be created for it.
Now, you might notice, that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can reuse that (by breaking check_collatz again) and creating a thread-private mpz_class working object. For this, you split the compound parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>
// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
if (mpz_odd_p(n.get_mpz_t()))
{
n *= 3;
n += 1;
}
else
{
n /= 2;
}
}
int main()
{
#pragma omp parallel
{
mpz_class x;
#pragma omp for
for (long check = 1; check < std::numeric_limits<long>::max(); check++)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
}
Note that declaring x in the parallel region will make sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private, which often leads to confusion because explicitly private variables from an outside scope are uninitialized.
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
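A small sketch of the dynamic scheduling idea, under two simplifying assumptions that are not part of the answer above: it uses 64-bit integers instead of GMP, and it scans a finite range while recording the longest chain. The names collatz_steps and max_steps_up_to are invented for the example.

```cpp
#include <cstdint>

// Number of Collatz steps needed to reach 1, using 64-bit integers instead
// of GMP (an assumption made to keep the sketch self-contained; it is only
// valid while intermediate values fit in 64 bits).
int collatz_steps(std::uint64_t n) {
    int steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

// Longest chain among 1..limit. Dynamic scheduling hands out small chunks
// on demand, so threads that draw short chains pick up more work.
int max_steps_up_to(long limit) {
    int best = 0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(max:best)
    for (long check = 1; check <= limit; check++) {
        int s = collatz_steps(static_cast<std::uint64_t>(check));
        if (s > best) best = s;
    }
    return best;
}
```

The chunk size of 64 is an arbitrary choice; smaller chunks balance load better but cost more scheduling overhead.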
Edit: For academic sake, here is one idea how to implement the worksharing directly on GMP objects:
#pragma omp parallel
{
// Note this is not a "parallel" loop
// these are just separate loops on distinct strided
int nthreads = omp_get_num_threads();
mpz_class check = 1;
// we already checked those in the other program
check += std::numeric_limits<long>::max();
check += omp_get_thread_num();
mpz_class x;
for (; ; check += nthreads)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
You could well be right about collisions with x. You can mark x as private by:
#pragma omp parallel private(x)
This way each thread gets its own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are shared, so there is one instance shared between all of the threads.
You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
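This ensures that all threads see the same value of x without requiring mutexes or other synchronization techniques. Note that omp atomic only applies to scalar types, so x would have to be a plain integer rather than an mpz_class.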

OpenMP parallel code has not the same output as the serial code

I had to change and extend my algorithm for some signal analysis (using the polyfilterbank technique) and couldn't use my old OpenMP code, but in the new code the results are not as expected (the results in the beginning positions in the array are somehow incorrect in comparison with a serial run [serial code shows the expected result]).
So in the first loop, tFFTin holds some FFT data, which I'm multiplying with a window function.
The goal is that a thread runs the inner loops for each polyphase factor. To avoid locks I use the reduction pragma (no complex reduction is defined by the standard, so I use my own, where each thread's omp_priv variable gets initialized with omp_orig [i.e. with tFFTin]). The reason I'm using the ordered pragma is that the results should be added to the output vector in an ordered way.
typedef std::complex<float> TComplexType;
typedef std::vector<TComplexType> TFFTContainer;
#pragma omp declare reduction(complexMul:TFFTContainer:\
transform(omp_in.begin(), omp_in.end(),\
omp_out.begin(), omp_out.begin(),\
std::multiplies<TComplexType>()))\
initializer (omp_priv(omp_orig))
void ConcreteResynthesis::ApplyPolyphase(TFFTContainer& tFFTin, TFFTContainer& tFFTout, TWindowContainer& tWindow, *someparams*) {;
#pragma omp parallel for shared(tWindow) firstprivate(sFFTParams) reduction(complexMul: tFFTin) ordered if(iFFTRawDataLen>cMinParallelSize)
for (int p = 0; p < uPolyphase; ++p) {
int iPolyphaseOffset = p * uFFTLength;
for (int i = 0; i < uFFTLength; ++i) {
tFFTin[i] *= tWindow[iPolyphaseOffset + i]; ///< get FFT input data from raw data
}
#pragma omp ordered
{
//using the overlap and add method
for (int i = 0; i < sFFTParams.uFFTLength; ++i) {
pDataPool->GetFullSignalData(workSignal)[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
}
}
}
mSignalPos = mSignalPos + mStep;
}
Is there a race condition or something, which makes wrong outputs at the beginning? Or do I have some logic error?
Another issue is that I don't really like my solution using the ordered pragma. Is there a better approach? (I tried to use the reduction model for this as well, but the compiler doesn't allow me to use a pointer type for that.)
I think your problem is that you have implemented a very cool custom reduction for tFFTin, but this reduction is applied at the end of the parallel region, which is after you use the data in tFFTin.
Another thing, which H. Iliev mentions, is that the second iteration of the outer loop relies on data computed in the previous iteration: a classic dependency.
I think you should try parallelizing the inner loops.
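A rough sketch of what parallelizing the inner loops could look like, with heavy simplifications: the window is assumed to be a std::vector<float>, `out` stands in for the signal buffer obtained from pDataPool, and the function name apply_polyphase is made up. The two inner loops from the question are fused because both touch a distinct element per i.

```cpp
#include <complex>
#include <vector>

typedef std::complex<float> TComplexType;
typedef std::vector<TComplexType> TFFTContainer;

// Sketch: the outer polyphase loop stays serial because iteration p uses
// the tFFTin values produced by iteration p - 1. Inside one p, every i
// touches its own elements, so the inner loop can be shared among threads.
void apply_polyphase(TFFTContainer& tFFTin, const std::vector<float>& tWindow,
                     TFFTContainer& out, int uPolyphase, int uFFTLength) {
    for (int p = 0; p < uPolyphase; ++p) {
        const int offset = p * uFFTLength;
        #pragma omp parallel for
        for (int i = 0; i < uFFTLength; ++i) {
            tFFTin[i] *= tWindow[offset + i];  // apply the window for factor p
            out[offset + i] += tFFTin[i];      // overlap-and-add into the output
        }
    }
}
```

This needs neither the custom reduction nor the ordered clause, at the cost of opening a parallel region per polyphase factor; whether that pays off depends on uFFTLength.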

c++ multithreading shared resources

I am trying to multithread a piece of code using the boost library. The problem is that each thread has to access and modify a couple of global variables. I am using a mutex to lock the shared resources, but the program ends up taking more time than when it was not multithreaded. Any advice on how to optimize the shared access?
Thanks a lot!
In the example below, the *choose_ecount* variable has to be locked, and I cannot take it out of the loop and lock it for only an update at the end of the loop because it is needed with the newest values by the inside function.
for(int sidx = startStep; sidx <= endStep && sidx < d.sents[lang].size(); sidx ++){
sentence s = d.sents[lang][sidx];
int senlen = s.words.size();
int end_symb = s.words[senlen-1].pos;
inside(s, lbeta);
outside(s,lbeta, lalpha);
long double sen_prob = lbeta[senlen-1][F][NO][0][senlen-1];
if (lambda[0] == 0){
mtx_.lock();
d.sents[lang][sidx].prob = sen_prob;
mtx_.unlock();
}
for(int size = 1; size <= senlen; size++)
for(int i = 0; i <= senlen - size ; i++)
{
int j = i + size - 1;
for(int k = i; k < j; k++)
{
int hidx = i; int head = s.words[hidx].pos;
for(int r = k+1; r <=j; r++)
{
int aidx = r; int arg = s.words[aidx].pos;
mtx_.lock();
for(int kids = ONE; kids <= MAX; kids++)
{
long double num = lalpha[hidx][R][kids][i][j] * get_choose_prob(s, hidx, aidx) *
lbeta[hidx][R][kids - 1][i][k] * lbeta[aidx][F][NO][k+1][j];
long double gen_right_prob = (num / sen_prob);
choose_ecount[lang][head][arg] += gen_right_prob; //LOCK
order_ecount[lang][head][arg][RIGHT] += gen_right_prob; //LOCK
}
mtx_.unlock();
}
}
From the code you have posted I can see only writes to choose_ecount and order_ecount. So why not use local per thread buffers to compute the sum and then add them up after the outermost loop and only sync this operation?
Edit:
If you need to access the intermediate values of choose_ecount how do you assure the correct intermediate value is present? One thread might have finished 2 iterations of its loop in the meantime producing different results in another thread.
It kind of sounds like you need to use a barrier for your computation instead.
It's unlikely you're going to get acceptable performance using a mutex in an inner loop. Concurrent programming is difficult, not just for the programmer but also for the computer. A large portion of the performance of modern CPUs comes from being able to treat blocks of code as sequences independent of external data. Algorithms that are efficient for single-threaded execution are often unsuitable for multi-threaded execution.
You might want to have a look at boost::atomic, which can provide lock-free synchronization, but the memory barriers required for atomic operations are still not free, so you may still run into problems, and you will probably have to re-think your algorithm.
I guess that you divide your complete problem into chunks ranging from startStep to endStep to get processed by each thread.
Since you hold that mutex inside the loop, you're effectively serializing all threads:
you divide your problem into chunks, which are then processed serially in an unspecified order.
That is, the only thing you gain is the overhead of multithreading.
Since you're operating on doubles, using atomic operations is not an option for you: they're typically implemented for integral types only.
The only possible solution is to follow Kratz' suggestion to have a copy of choose_ecount and order_ecount for each thread and reduce them to a single one after your threads have finished.
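A minimal sketch of the per-thread copy idea, using std::thread and std::mutex in place of the boost equivalents. The flat ecount vector, the contributions input and the function name accumulate are simplifications invented for this example, not the data structures from the question.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each worker sums into a private buffer and takes the mutex only once,
// to merge its buffer into the shared table at the end.
void accumulate(const std::vector<double>& contributions, int num_threads,
                std::vector<double>& ecount) {
    std::mutex mtx;
    const std::size_t bins = ecount.size();
    auto worker = [&](std::size_t begin, std::size_t end) {
        std::vector<double> local(bins, 0.0);     // private per-thread buffer
        for (std::size_t i = begin; i < end; ++i)
            local[i % bins] += contributions[i];  // no lock in the hot loop
        std::lock_guard<std::mutex> lock(mtx);    // one lock per thread
        for (std::size_t b = 0; b < bins; ++b)
            ecount[b] += local[b];
    };
    std::vector<std::thread> threads;
    const std::size_t n = contributions.size();
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, n * t / num_threads, n * (t + 1) / num_threads);
    for (auto& th : threads) th.join();
}
```

This only applies if the inner computation does not need the other threads' partial sums while it runs; if get_choose_prob really depends on up-to-date shared counts, the algorithm itself has to be restructured, as the answers above suggest.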