What is the difference between private, firstprivate and lastprivate in OpenMP? [duplicate] - c++

I've looked at the official definitions, but I'm still quite confused.
firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value it has before the parallel construct.
To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.
lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
I feel like I understand this one a bit better because of the following example:
#pragma omp parallel
{
    #pragma omp for lastprivate(i)
    for (i = 0; i < n - 1; i++)
        a[i] = b[i] + b[i+1];
}
a[i] = b[i];
So, in this example, I understand that lastprivate allows i to retain, outside the loop, the value it had in the last iteration.
I just started learning OpenMP today.

private variables are not initialised, i.e. they start with indeterminate values, like any other local automatic variable (and they are often implemented as automatic variables on the stack of each thread). Take this simple program as an example:
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i = 10;

    #pragma omp parallel private(i)
    {
        printf("thread %d: i = %d\n", omp_get_thread_num(), i);
        i = 1000 + omp_get_thread_num();
    }

    printf("i = %d\n", i);
    return 0;
}
With four threads it outputs something like:
thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10
(another run of the same program)
thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10
This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).
If i is made firstprivate, then it is initialised with the value that it has before the parallel region:
thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10
Still modifications to the value of i inside the parallel region are not visible after it.
You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).
So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.
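To make lastprivate concrete as well, here is a minimal sketch (mine, not part of the original answer) that adds the worksharing loop the demonstration program lacks; the array size of 8 is an arbitrary choice:

#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i = -1;
    int a[8];

    /* lastprivate(i) copies the private i of whichever thread executes
       the final iteration (n == 7) back into the outer i after the loop. */
    #pragma omp parallel for lastprivate(i)
    for (int n = 0; n < 8; n++)
    {
        i = 100 + n;
        a[n] = i;
    }

    printf("i = %d\n", i);   /* prints 107, the value from the last iteration */
    return 0;
}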

Reading local variable i before it has been initialized is undefined behaviour (the C++14 standard tightened the wording around indeterminate values), but a compiler is not required to reject the program with an error.

Related

Question regarding race conditions while multithreading

So I'm reading through a book, and in the chapter on multithreading and concurrency there is a question that doesn't really make sense to me.
I'm supposed to create 3 functions with a parameter x that simply calculate x * x: one using a mutex, one using atomic types, and one using neither. I also create 3 global variables holding the values.
The first two functions will prevent race conditions but the third might not.
After that I create N threads and then loop through and tell each thread to calculate its x function (3 separate loops, one for each function, so I'm creating N threads 3 times).
Now the book tells me that using functions 1 & 2 I should always get the correct answer, but using function 3 I won't always get the right answer. However, I am always getting the right answer for all of them. I assume this is because I am just calculating x * x, which is all the function does.
As an example, when N=3, the correct value is 0 * 0 + 1 * 1 + 2 * 2 = 5.
This is the atomic function:
void squareAtomic(atomic<int> x)
{
    accumAtomic += x * x;
}
And this is how I call the function
thread threadsAtomic[N];
for (int i = 0; i < N; i++) // i will be the current thread that represents x
{
    threadsAtomic[i] = thread(squareAtomic, i);
}
for (int i = 0; i < N; i++)
{
    threadsAtomic[i].join();
}
This is the function that should sometimes create race conditions:
void squareNormal(int x)
{
    accumNormal += x * x;
}
Here's how I call that:
thread threadsNormal[N];
for (int i = 0; i < N; i++) // i will be the current thread that represents x
{
    threadsNormal[i] = thread(squareNormal, i);
}
for (int i = 0; i < N; i++)
{
    threadsNormal[i].join();
}
This is all my own code so I might not be doing this question correctly, and in that case I apologize.
One problem with race conditions (and with undefined behavior in general) is that their presence doesn't guarantee that your program will behave incorrectly. Rather, undefined behavior only voids the guarantee that your program will behave according to the rules of the C++ language spec. That can make undefined behavior very difficult to detect via empirical testing. (Every multithreading programmer's worst nightmare is the bug that was never seen once during the program's intensive three-month testing period, and only appears in the form of a mysterious crash during the big on-stage demo in front of a live audience.)
In this case your racy program's race condition comes in the form of multiple threads reading and writing accumNormal simultaneously; in particular, you might get an incorrect result if thread A reads the value of accumNormal, and then thread B writes a new value to accumNormal, and then thread A writes a new value to accumNormal, overwriting thread B's value.
If you want to be able to demonstrate to yourself that race conditions really can cause incorrect results, you'd want to write a program where multiple threads hammer on the same shared variable for a long time. For example, you might have half the threads increment the variable 1 million times, while the other half decrement the variable 1 million times, and then check afterwards (i.e. after joining all the threads) to see if the final value is zero (which is what you would expect it to be), and if not, run the test again, and let that test run all night if necessary. (and even that might not be enough to detect incorrect behavior, e.g. if you are running on hardware where increments and decrements are implemented in such a way that they "just happen to work" for this use case)
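As a rough sketch of such a stress test (all names, thread counts, and iteration counts here are my own arbitrary choices, not from the book), half the threads increment a deliberately non-atomic counter while the other half decrement it; the final value is usually nonzero:

#include <thread>
#include <vector>
#include <cstdio>

int counter = 0;   // deliberately NOT atomic, so the data race (UB) is exposed

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)   // 4 incrementing threads
        threads.emplace_back([] { for (int i = 0; i < 1000000; ++i) ++counter; });
    for (int t = 0; t < 4; ++t)   // 4 decrementing threads
        threads.emplace_back([] { for (int i = 0; i < 1000000; ++i) --counter; });
    for (auto &th : threads)
        th.join();
    std::printf("counter = %d (expected 0)\n", counter);  // typically nonzero
    return 0;
}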

Will fetch_add with relaxed memory order return unique values?

Imagine N threads running following simple code:
int res = num.fetch_add(1, std::memory_order_relaxed);
where num is:
std::atomic<int> num = 0;
Is it completely safe to assume that res will be different for each thread running the code, or is it possible that it will be the same for some threads?
Yes. All threads will agree on the order in which the various threads modified the variable num; the kth thread (counting from zero) to execute that line of code will definitely obtain the value k. The use of std::memory_order_relaxed, however, implies that accesses to num don't synchronize with each other; thus, for example, one thread may modify some other atomic variable x before it modifies num, and another thread may see the modification to num made by the former thread but subsequently see the old value of x.
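As an illustrative sketch of that uniqueness guarantee (the thread count of 8 is arbitrary), each thread can record the value it received; after joining, every recorded value is distinct:

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic<int> num{0};

int main()
{
    const int N = 8;
    std::vector<int> res(N, -1);
    std::vector<std::thread> threads;
    for (int t = 0; t < N; ++t)
        threads.emplace_back([t, &res] {
            // Each fetch_add returns the previous value; no two threads
            // can observe the same one, even with relaxed ordering.
            res[t] = num.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto &th : threads)
        th.join();
    for (int t = 0; t < N; ++t)
        std::printf("thread %d got %d\n", t, res[t]);  // 0..7, each exactly once
    return 0;
}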

This code prints the value of x around 5000 but not 10000, why is that?

This code that I have written creates 2 threads and a for loop that iterates 10000 times, but the value of x at the end comes out near 5000 instead of 10000. Why is that happening?
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>
#include "omp.h"

using namespace std;

int x = 0;

int main() {
    omp_set_num_threads(2);
    #pragma omp parallel for
    for (int i = 0; i < 10000; i++) {
        x += 1;
    }
    printf("x is: %d\n", x);
}
x is not an atomic type and is read and written in different threads. (Thinking that int is an atomic type is a common misconception.)
The behaviour of your program is therefore undefined.
Using std::atomic<int> x; is the fix.
The reason is that when multiple threads access the same variable, race conditions can occur.
The operation x += 1 can be understood as x = x + 1: you first read the value of x and then write x + 1 back to x. When two threads run and operate on the same x, the following can happen: thread A reads the value of x, which is 0. Thread B reads the value of x, which is still 0. Then thread A writes 0 + 1 to x. And then thread B writes 0 + 1 to x. Now you have missed one increment and x is just 1 instead of 2. A fix for this problem might be to use an atomic_int.
Modifying one (shared) value by multiple threads is a race condition and leads to wrong results. If multiple threads work with one value, all of them must only read the value.
The idiomatic solution is to use a OpenMP reduction as follows
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < 10000; i++) {
    x += 1;
}
Internally, each thread has its own x, and the per-thread values are added together after the loop.
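As a rough sketch of that idea (not the code a compiler actually generates), the reduction behaves roughly like a private accumulator per thread plus a single combine at the end:

#include <stdio.h>
#include <omp.h>

int x = 0;

int main() {
    #pragma omp parallel
    {
        int x_local = 0;        // private accumulator per thread
        #pragma omp for
        for (int i = 0; i < 10000; i++)
            x_local += 1;
        #pragma omp atomic      // one combine per thread, not per iteration
        x += x_local;
    }
    printf("x is: %d\n", x);    // 10000
}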
Using atomics is an alternative, but it will perform significantly worse. Atomic operations are more costly in themselves and are also very bad for caches.
If you use atomics, you should use OpenMP atomics which are applied to the operation, not the variable. I.e.
#pragma omp parallel for
for (int i = 0; i < 10000; i++) {
    #pragma omp atomic
    x += 1;
}
You should not, as other answers suggest, use C++11 atomics. Using them is explicitly unspecified behavior in OpenMP. See this question for details.

OpenMP odd behaviour with SIMD linear and parallel for linear directives

I am learning how to use OpenMP with C++, using the GNU compiler (g++ 6.2.1), and I tested the following code:
#include <stdio.h>
#include <omp.h>
#include <iostream>

int b = 10;

int main()
{
    int array[8];

    std::cout << "Test with #pragma omp simd linear:\n";
    #pragma omp simd linear(b)
    for (int n = 0; n < 8; ++n) array[n] = b;
    for (int n = 0; n < 8; ++n) printf("Iteration %d: %d\n", n, array[n]);

    std::cout << "Test with #pragma omp parallel for linear:\n";
    #pragma omp parallel for linear(b)
    for (int n = 0; n < 8; ++n) array[n] = b;
    for (int n = 0; n < 8; ++n) printf("Iteration %d: %d\n", n, array[n]);
}
In both cases I expected a list of numbers going from 10 to 17, however, this was not the case. The #pragma omp simd linear(b) is outright ignored, printing only 10 for each value in array. For #pragma omp parallel for linear(b) the program outputs 10,10,12,12,14,14,16,16.
I compile the file using g++ -fopenmp -Wall main.cpp -o main.o. How can I fix this?
EDIT: Reading the specification more carefully, I found that the linear clause overwrites the initial value with the last value obtained (i.e. if we start with b=10, then after the first loop we have b=17).
However, the program runs correctly if I add schedule(dynamic) to the parallel for cycles. Why would I have to specify that parameter in order to have a correct execution?
The OpenMP specification says:
The linear clause declares one or more list items to be private and to
have a linear relationship with respect to the iteration space of a
loop associated with the construct on which the clause appears.
This is information for the compiler only, indicating the linear behaviour of a variable in a loop; but in your code b is not increased at all. That is the reason you always get 10 in the first loop. So the strange results obtained are not the compiler's fault. To correct it you have to use
array[n]=b++;
On the other hand, for the #pragma omp parallel for linear(b) loop, OpenMP calculates the starting b value for each thread (based on the linear relationship), but the value is still not increased within a given thread. So, depending on the number of threads used, you will see a different number of "steps".
In the case of the schedule(dynamic) clause, the chunk_size is 1, so each loop iteration runs in a different thread. In this case the initial b value is always calculated by OpenMP, so you get correct values only.
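Putting the correction together, a fixed version of the first loop might look like this (a sketch based on the array[n]=b++ fix above; the output is then 10 through 17):

#include <stdio.h>
#include <omp.h>

int b = 10;

int main()
{
    int array[8];

    /* b now really does increase linearly with the iteration,
       matching the linear(b) declaration. */
    #pragma omp simd linear(b)
    for (int n = 0; n < 8; ++n)
        array[n] = b++;

    for (int n = 0; n < 8; ++n)
        printf("Iteration %d: %d\n", n, array[n]);
    return 0;
}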

How many times may a loop be executed in a multithreaded C++ program?

This is an interview question.
class X
{
    int i = 0 ;
public:
    Class *foo()
    {
        for ( ; i < 1000 ; ++i )
        {
            // some code but do not change value of i
        }
    }
}
int main()
{
    X myX ;
    Thread t1 = create_thread( myX.foo() ) ;
    Thread t2 = create_thread( myX.foo() ) ;
    Start t1 ...
    Start t2 ...
    join(t1)
    joint(t2)
}
Q1: If the code runs on a single-CPU machine, how many times can the for-loop run in the worst case?
Q2: What if the code runs on a 2-CPU machine, how many times can the for-loop run in the worst case?
My ideas:
The loop may run an infinite number of times, because a thread can run through it many times before the other thread updates the value of i.
Or, when t1 is suspended, t2 runs 1000 times, and then we have 1000 x 1000 times?
Is this correct?
create_thread( myX.foo() ) calls create_thread with the return value of myX.foo(). myX.foo() is run on the main thread, so myX.i will eventually have a value of 1000 (which is the value it has after two calls to myX.foo()).
If the code was actually meant to run myX.foo() twice on two different threads concurrently, then the code would have undefined behaviour (due to the race condition in the access to myX.i). So yes, the loop could run an infinite number of times (or zero times, or the program could decide to get up and eat a bagel).
This is a bad interview question if the code is transcribed accurately.
class X
{
    int i = 0;
This notation was not valid C++ before C++11 introduced non-static data member initializers. Compiling in pre-C++11 mode, G++ says:
3:13: error: ISO C++ forbids initialization of member ‘i’ [-fpermissive]
3:13: error: making ‘i’ static [-fpermissive]
3:13: error: ISO C++ forbids in-class initialization of non-const static member ‘i’
We'll ignore this, assuming that the code was written as something more like:
class X
{
    int i;
public:
    X() : i(0) { }
The original code continues:
public:
    Class *foo()
    {
        for ( ; i < 1000 ; ++i )
        {
            // some code but do not change value of i
        }
        return 0; // Added to remove undefined behaviour
    }
}
It is not clear what a Class * is - the type Class is unspecified in the example.
int main()
{
    X myX;
    Thread t1 = create_thread( myX.foo() );
Since foo() is called here and its return value is passed to create_thread(), the loop will be executed 1000 times here - it matters not whether it is a multi-core system. After the loops are done, the return value is passed to create_thread().
Since we don't have a specification for create_thread(), it is not possible to predict what it will do with the Class * that is returned from myX.foo(), any more than it is possible to tell how myX.foo() actually generates an appropriate Class * or what a Class object is capable of doing. The chances are that the null pointer will cause problems - however, for the sake of the question, we'll assume that the Class * is valid and a new thread is created and placed on hold waiting for the 'start' operation to let it run.
Thread t2 = create_thread( myX.foo() );
Here we have to make some assumptions. We may assume that the Class * returned by myX.foo() does not give access to the member variable i that is in myX. Therefore, even if thread t1 is running before t2 is created, there is no interference from t1 in the value of myX, and when the main thread executes this statement, the loop will execute 0 more times. The result from myX.foo() will be used to create thread t2, which cannot interfere with i in myX any more either. We'll discuss variations on these assumptions below.
Start t1 ...
Start t2 ...
The threads are allowed to run; they do whatever is implied by the Class * returned from myX.foo(). But the threads can neither reference nor (therefore) modify myX; they have not been given access to it unless the Class * somehow provides that access.
join(t1)
joint(t2)
The threads complete...
}
So, the body of the loop executes 1000 times before t1 is created, and is executed an additional 0 times before t2 is created. And it does not matter whether it is a single-core or multi-core machine.
Indeed, even if you assume that the Class * gives the thread access to the i and you assume that t1 starts running immediately (possibly before create_thread() returns to the main thread), as long as it does not modify i, the behaviour is guaranteed to be '1000 and 0 times'.
Clearly, if t1 starts running when create_thread() is called and modifies the i in myX, then the behaviour is indeterminate. However, while the threads are in suspended animation until the 'start' operations, there is no indeterminacy and '1000 and 0 times' remains the correct answer.
Alternative Scenario
If the create_thread() calls have been misremembered and the code was:
Thread t1 = create_thread(myX.foo);
Thread t2 = create_thread(myX.foo);
where a pointer to member function is being passed to create_thread(), then the answer is quite different. Now the function is not executed until the threads are started, and the answer is indeterminate whether there is one CPU or are several CPUs on the machine. It comes down to thread scheduling issues and also depends on how the code is optimized. Almost any answer between 1000 and 2000 is plausible.
Under sufficiently weird circumstances, the answer might even be larger. For example, suppose t2 executed and read i as 0, then got suspended to let t1 run; t1 processes iterations 0..900, and then writes back i, and transfers control to t2, which increments its internal copy of i to 1 and writes this back, then gets suspended, and t1 runs again and reads i and runs from 1 to 900 again, and then lets t2 have another go... etc. Under this implausible scenario (implausible because the code for t1 and t2 to execute is probably the same - though it all hinges on what the Class * really is), there could be a lot of iterations.
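For concreteness, here is what that misremembered version might look like in standard C++ (a sketch only: Thread and create_thread are replaced by std::thread, and foo is made void for simplicity, both of which are assumptions about what the interviewer meant):

#include <thread>

class X
{
    int i = 0;   // valid since C++11 (non-static data member initializer)
public:
    void foo()
    {
        for ( ; i < 1000; ++i)
        {
            // some work that does not change i
        }
    }
};

int main()
{
    X myX;
    // Both threads run foo() concurrently on the same object; the
    // unsynchronized reads and writes of myX.i are a data race (UB).
    std::thread t1(&X::foo, &myX);
    std::thread t2(&X::foo, &myX);
    t1.join();
    t2.join();
    return 0;
}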
It does not matter what type of system this code runs on; switching between threads works the same way.
Worst case for all cases is:
1000 + 1000 * 999 * 998 * 997 * ... * 2 * 1 times. (incorrect!!! correct one is in update)
When the first thread tries to increase the variable (it has already read the value, but not yet written it back), the second thread can run the whole loop from the current value of i; but as the second thread is finishing its last iteration, the first thread increases the value of i, and the second thread starts its long job again :)
Update (a little more detail)
Sorry, the real formula is:
1000 + 1000 + 999 + 998 + ... + 2 + 1 times,
or 501500
Each iteration of the loop looks like this:
1. Check the condition.
2. Do the work.
3. Read the value from i.
4. Increase the read value.
5. Write the value to i.
Nobody said that step 2 takes constant time, so I suppose it takes a varying amount of time, suitable for my worst case.
Here is this worst case:
Iteration 1:
[1st thread] Makes steps 1-4 of the first loop iteration (very long work time)
[2nd thread] Makes all the loop (1000 times), but doesn't check a condition last time
Iteration 2:
[1] Makes step 5, so now i == 1, and makes steps 1-4 of the next loop iteration
[2] Makes all the loop from current i (999 times)
Iteration 3: the same as before, but i == 2
...
Iteration 1000: the same as before, but i == 999
In the end we will have 1000 iterations, and each iteration will have 1 execution of the loop code from the first thread and (1001 - iteration number) executions from the second thread.
Worst case: 2000 times, assuming main lives until t1 and t2 finish.
It can't be infinite, because even if only a single thread is running, it will increment the value of i.