OpenMP integer copied after tasks finish - c++

I do not know if this is documented anywhere (if so, I would love a reference to it), but I have found some unexpected behaviour when using OpenMP. I have a simple program below to illustrate the issue. Here, in point form, is what I expect the program to do:
I want to have 2 threads
They both share an integer
The first thread increments the integer
The second thread reads the integer
After incrementing once, an external process must tell the first thread to continue incrementing (via a mutex lock)
The second thread is in charge of unlocking this mutex
As you will see, the counter that is shared between the threads does not appear updated in the second thread. However, if I turn the counter into an integer reference instead, I get the expected result. Here is a simple code example:
#include <mutex>
#include <thread>
#include <chrono>
#include <iostream>
#include <omp.h>
using namespace std;
using std::this_thread::sleep_for;
using std::chrono::milliseconds;
const int sleep_amount = 2000;
int main() {
    int counter = 0; // if I comment this and uncomment the 2 lines below, I get the expected results
    /* int c = 0; */
    /* int &counter = c; */

    omp_lock_t mut;
    omp_init_lock(&mut);

    int counter_1, counter_2;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task default(shared)
        {
            // The first task just increments the counter 3 times
            while (counter < 3) {
                omp_set_lock(&mut);
                counter += 1;
                cout << "increasing: " << counter << endl;
            }
        }
        #pragma omp task default(shared)
        {
            sleep_for(milliseconds(sleep_amount));
            // While sleeping, counter is increased to 1 in the first task
            counter_1 = counter;
            cout << "counter_1: " << counter << endl;
            omp_unset_lock(&mut);

            sleep_for(milliseconds(sleep_amount));
            // While sleeping, counter is increased to 2 in the first task
            counter_2 = counter;
            cout << "counter_2: " << counter << endl;
            omp_unset_lock(&mut);
            // Release one last time to increment the counter to 3
        }
    }
    omp_destroy_lock(&mut);

    cout << "expected: 1, actual: " << counter_1 << endl;
    cout << "expected: 2, actual: " << counter_2 << endl;
    cout << "expected: 3, actual: " << counter << endl;
}
Here is my output:
increasing: 1
counter_1: 0
increasing: 2
counter_2: 0
increasing: 3
expected: 1, actual: 0
expected: 2, actual: 0
expected: 3, actual: 3
gcc version: 9.4.0
Additional discoveries:
If I use OpenMP 'sections' instead of 'tasks', I get the expected result as well. The problem seems to be with 'tasks' specifically.
If I use POSIX semaphores, the problem also persists.

It is not permitted to unlock a mutex from another thread; doing so causes undefined behaviour. The general solution in such cases is to use semaphores. Condition variables can also help (for real-world use cases). To quote the OpenMP documentation (note that this constraint is shared by nearly all mutex implementations, including pthreads):
A program that accesses a lock that is not in the locked state or that is not owned by the task that contains the call through either routine is non-conforming.
A program that accesses a lock that is not in the uninitialized state through either routine is non-conforming.
Moreover, the two tasks can be executed on the same thread or on different threads. You should not assume anything about their scheduling unless you tell OpenMP to do so with dependencies. Here, it is completely compliant for a runtime to execute the tasks serially. If you want multiple threads to execute different blocks, you need OpenMP sections. Besides, it is generally considered bad practice to use locks in tasks, as the runtime scheduler is not aware of them.
Finally, you do not need a lock in this case: an atomic operation is sufficient. Fortunately, OpenMP supports atomic operations (as does C++); a sketch follows.
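As an illustration, here is a minimal sketch (mine, not the poster's full handshake logic) combining both suggestions: sections instead of tasks, and omp atomic instead of a lock:

// g++ -fopenmp sections_atomic.cpp
#include <omp.h>
#include <iostream>

int main() {
    int counter = 0;
    // With at least two threads, sections hands each block to a different
    // thread, unlike tasks, which a runtime may legally execute serially.
    #pragma omp parallel sections num_threads(2) shared(counter)
    {
        #pragma omp section
        {
            for (int i = 0; i < 3; ++i) {
                #pragma omp atomic       // lock-free atomic update
                counter += 1;
            }
        }
        #pragma omp section
        {
            int snapshot;
            #pragma omp atomic read      // atomic read of the shared value
            snapshot = counter;
            // Without further ordering, snapshot may be anything from 0 to 3.
            std::cout << "observed: " << snapshot << std::endl;
        }
    }
    std::cout << "final: " << counter << std::endl;  // always prints 3
}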
Additional notes
Note that locks guarantee the consistency of memory accesses across threads thanks to memory barriers. Indeed, an unlock operation on a mutex causes a release memory barrier that makes writes visible to other threads, and a subsequent lock from another thread performs an acquire memory barrier that forces its reads to happen after the lock. When lock/unlock pairs are not used correctly, memory accesses are no longer safe, which can, for example, leave a variable looking stale in other threads. More generally, this tends to create race conditions. Put shortly: don't do that.
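For readers unfamiliar with the release/acquire pairing described above, here is a minimal C++ sketch (using std::atomic rather than OpenMP locks, purely for illustration):

#include <atomic>
#include <iostream>
#include <thread>

int payload = 0;                 // plain, non-atomic shared data
std::atomic<bool> ready{false};  // flag carrying the release/acquire pair

int main() {
    std::thread producer([] {
        payload = 42;                                  // write happens before the release
        ready.store(true, std::memory_order_release);  // release: publish prior writes
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) // acquire: sees published writes
            ;                                          // spin until the flag flips
        std::cout << payload << std::endl;             // guaranteed to print 42
    });
    producer.join();
    consumer.join();
}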

Related

Do spinlocks guarantee context switching when compared to mutexes

Consider the following code snippet
#include <chrono>
#include <iostream>
#include <thread>

int index = 0;
// av::utils::Lock is the asker's own wrapper class (definition not shown).
av::utils::Lock lock(av::utils::Lock::EStrategy::eMutex); // Uses a mutex or a spin lock based on the specified strategy.

void fun()
{
    for (int i = 0; i < 100; ++i)
    {
        lock.aquire();
        ++index;
        std::cout << "thread " << std::this_thread::get_id() << " index = " << index << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        lock.release();
    }
}

int main()
{
    std::thread t1(fun);
    std::thread t2(fun);
    t1.join();
    t2.join();
}
The output that I get with a mutex used for synchronization is that thread 1 first executes completely, followed by thread 2.
With a spinlock (implemented using std::atomic_flag), the execution order between the threads is interleaved (one iteration of thread 1 followed by one iteration of thread 2). The latter happens irrespective of the delay I add within an iteration.
I understand that a mutex only guarantees mutual exclusion and not the order of execution. My question is: if I want the two threads to execute in an interleaved manner, is using spinlocks a recommended strategy or not?
The output that I get with a mutex ... is first thread 1 [runs through the whole loop] followed by thread 2.
That's because of how your loop uses the lock: The very last thing the loop body does is, it unlocks the lock. The very next thing it does at the start of the next iteration is, it locks the lock again.
The other thread can be blocked, effectively sleeping, waiting for the mutex. When your thread 1 releases the lock, the OS scheduler may still be running its algorithms, trying to figure out how to respond to that, when thread 1 comes 'round and locks the lock again.
It's like a race to lock the mutex, and thread 1 is on the starting line when the gun goes off, while thread 2 is sitting on the bench, tying its shoes.
While using a spinlock...the order of execution between the threads which is interleaved
That's because the "blocked" thread isn't really blocked. It's still actively running on a different processor while it waits. It has a much better chance at winning the lock when the first thread releases it.
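The question's Lock class is not shown; a typical test-and-set spinlock (an assumption about what its spin-lock strategy wraps) looks like this, and makes the "actively running while it waits" point concrete:

#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void acquire() {
        // Busy-wait: the thread stays runnable on its core, so it is already
        // "at the starting line" the instant the holder releases the lock.
        while (flag.test_and_set(std::memory_order_acquire))
            ; // spin
    }
    void release() {
        flag.clear(std::memory_order_release);
    }
};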

Deadlock using std::mutex to protect cout in multiple threads

Using cout in multiple threads might result in interleaved output.
So I tried to protect cout with a mutex.
The following code starts 10 background threads with std::async. When a thread starts, it prints "Started thread ...".
The main thread iterates over the futures of the background threads in the order in which they were created and prints out "Done thread ..." when the corresponding thread finished.
The output is synchronized correctly, but after some threads have started and some have finished (see output below), a deadlock occurs. All remaining background threads and the main thread wait for the mutex.
What is the reason for the deadlock?
When the print function is left, or one iteration of the for loop ends, the lock_guard should unlock the mutex, so that one of the waiting threads can proceed.
Why are all the remaining threads starving?
Code
#include <future>
#include <iostream>
#include <vector>
using namespace std;
std::mutex mtx; // mutex for critical section

int print_start(int i) {
    lock_guard<mutex> g(mtx);
    cout << "Started thread" << i << "(" << this_thread::get_id() << ") " << endl;
    return i;
}

int main() {
    vector<future<int>> futures;
    for (int i = 0; i < 10; ++i) {
        futures.push_back(async(print_start, i));
    }
    // retrieve and print the value stored in the future
    for (auto &f : futures) {
        lock_guard<mutex> g(mtx);
        cout << "Done thread" << f.get() << "(" << this_thread::get_id() << ")" << endl;
    }
    cin.get();
    return 0;
}
Output
Started thread0(352)
Started thread1(14944)
Started thread2(6404)
Started thread3(16884)
Done thread0(16024)
Done thread1(16024)
Done thread2(16024)
Done thread3(16024)
Your problem lies in the use of future::get:
Returns the value stored in the shared state (or throws its exception)
when the shared state is ready.
If the shared state is not yet ready (i.e., the provider has not yet
set its value or exception), the function blocks the calling thread
and waits until it is ready.
http://www.cplusplus.com/reference/future/future/get/
So if the thread behind the future hasn't run yet, the function blocks until that thread finishes. However, you take ownership of the mutex before calling future::get, so the thread you're waiting for will never be able to acquire the mutex for itself.
This should fix your deadlock problem:
int value = f.get();
lock_guard<mutex> g(mtx);
cout << "Done thread" << value << "(" << this_thread::get_id() << ")" << endl;
You lock the mutex and then wait for one of the futures, which in turn requires a lock on the mutex itself. Simple rule: Don't wait with locked mutexes.
BTW: Locking output streams is not very effective, because it can easily be circumvented by code you don't even control. Rather than using those globals, give a stream to the code that needs to output something (dependency injection) and then collect the data from that stream in a thread-safe way. Or use a logging library, because that's probably what you wanted anyway.
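A minimal sketch of that dependency-injection idea, returning the text instead of passing a stream for brevity (an adaptation, not the answerer's exact design):

#include <future>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Each task writes into its own ostringstream; only main touches std::cout,
// so no mutex around the stream is needed at all.
std::string print_start(int i) {
    std::ostringstream os;
    os << "Started thread" << i << "(" << std::this_thread::get_id() << ")\n";
    return os.str();
}

int main() {
    std::vector<std::future<std::string>> futures;
    for (int i = 0; i < 10; ++i)
        futures.push_back(std::async(print_start, i));
    for (auto &f : futures)
        std::cout << f.get();  // single consumer: inherently ordered and race-free
}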
It is good that the cause could be spotted in the source. Quite often, however, the error is not so easy to locate, and the cause may differ as well. Fortunately, in the case of a deadlock you can use a debugger to investigate it.
I compiled and ran your example, then attached gdb to it (gcc 4.9.2/Linux). Here is the backtrace (noisy implementation details skipped):
#0 __lll_lock_wait ()
...
#5 0x0000000000403140 in std::lock_guard<std::mutex>::lock_guard (
this=0x7ffe74903320, __m=...) at /usr/include/c++/4.9/mutex:377
#6 0x0000000000402147 in print_start (i=0) at so_deadlock.cc:9
...
#23 0x0000000000409e69 in ....::_M_complete_async() (this=0xdd4020)
at /usr/include/c++/4.9/future:1498
#24 0x0000000000402af2 in std::__future_base::_State_baseV2::wait (
this=0xdd4020) at /usr/include/c++/4.9/future:321
#25 0x0000000000404713 in std::__basic_future<int>::_M_get_result (
this=0xdd47e0) at /usr/include/c++/4.9/future:621
#26 0x0000000000403c48 in std::future<int>::get (this=0xdd47e0)
at /usr/include/c++/4.9/future:700
#27 0x000000000040229b in main () at so_deadlock.cc:24
This is just what is explained in the other answers: the code in the locked section (so_deadlock.cc:24) calls future::get(), which in turn (by forcing the result) tries to acquire the lock again.
It might not be that simple in other cases (there are usually several threads), but it's all there.

OpenMP doesn't launch threads

OpenMP used to run my project on 6 threads and now (I have no idea why) the program is single-threaded.
My code is pretty simple. I only use OpenMP in one cpp file, where I included
#include <omp.h>
The function to parallelize is:
#pragma omp parallel for collapse(2) num_threads(IntervalMapEstimator::m_num_thread)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        // used for debug
        omp_set_num_threads(5);
        std::cout << "omp_get_num_threads = " << omp_get_num_threads() << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads() << std::endl;
        if (split_points) {
            extract_relevant_points_from_angle_lists(relevant_points, pointcloud_ff_polar_angle_lists, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        } else {
            extract_relevant_points_multithread_with_localvector(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
        }
    }
}
omp_get_num_threads returns 1
omp_get_max_threads returns 5
IntervalMapEstimator::m_num_thread is set to 6
Any lead would be greatly appreciated.
EDIT 1:
I modified my code, but the problem remains: the program still runs on a single thread.
omp_get_num_threads returns 1
omp_get_max_threads returns 8
Is there a way to know how many threads are available at run time?
#pragma omp parallel for collapse(2)
for (int cell_index_x = m_min_cell_index_sensor_rot_x; cell_index_x <= m_max_cell_index_sensor_rot_x; cell_index_x++)
{
    for (int cell_index_y = m_min_cell_index_sensor_rot_y; cell_index_y <= m_max_cell_index_sensor_rot_y; cell_index_y++)
    {
        std::cout << "omp_get_num_threads = " << omp_get_num_threads() << std::endl;
        std::cout << "omp_get_max_threads = " << omp_get_max_threads() << std::endl;
        extract_relevant_points(relevant_points, pointcloud, cell_min_angle_sensor_rot, cell_max_angle_sensor_rot);
    }
}
I just noticed that my computer is beginning to run low on memory; could that be part of the problem?
According to https://msdn.microsoft.com/en-us/library/bx15e8hb.aspx:
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads requested for the parallel region exceeds the number that the run-time system can supply, the behavior of the program is implementation-defined. An implementation may, for example, interrupt the execution of the program, or it may serialize the parallel region.
You request 6 threads, but the implementation can only provide 5, so it is free to do what it wants.
I'm also pretty sure you are not supposed to change the number of threads while inside a parallel region, so your omp_set_num_threads call will do nothing at best and blow up in your face at worst. The sketch below shows the usual placement.
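A minimal sketch of setting the team size correctly, before the region is entered:

// g++ -fopenmp team_size.cpp
#include <omp.h>
#include <iostream>

int main() {
    omp_set_num_threads(5);  // fix the team size *before* entering the region
    #pragma omp parallel
    {
        #pragma omp single   // let one thread report the actual team size
        std::cout << "team size: " << omp_get_num_threads() << std::endl;
    }
}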
I found the answer thanks to another post: Why does the compiler ignore OpenMP pragmas?
In the end it was simply a flag I hadn't passed to the compiler. I didn't notice it because I compile with CMake, so I don't type the compiler invocation directly. Also, I compile with catkin_make, so I only get errors, not warnings, in the console.
So basically, to use OpenMP you have to pass -fopenmp to your compiler, and if you don't... well, the OpenMP lines are just ignored by the compiler.
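A quick way to verify the flag actually reached the compiler (with CMake, find_package(OpenMP) plus linking against OpenMP::OpenMP_CXX has the same effect as -fopenmp; that part is an assumption about the poster's build setup):

// g++ -fopenmp omp_check.cpp
#include <iostream>

int main() {
#ifdef _OPENMP
    // The compiler defines _OPENMP only when OpenMP support is enabled.
    std::cout << "OpenMP enabled, version " << _OPENMP << std::endl;
#else
    std::cout << "OpenMP NOT enabled - all omp pragmas are ignored" << std::endl;
#endif
}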

How to write to file from different threads, OpenMP, C++

I use OpenMP to parallelize my C++ program. My parallel code has a very simple form:
#pragma omp parallel for shared(a, b, c) private(i, result)
for (i = 0; i < N; i++) {
    result = F(a, b, c, i); // do some calculation
    cout << i << " " << result << endl;
}
If two threads try to write to the file simultaneously, the data gets mixed up.
How can I solve this problem?
OpenMP provides pragmas to help with synchronisation. #pragma omp critical allows only one thread to be executing the attached statement at any time (a mutual-exclusion critical region). The #pragma omp ordered pragma ensures that loop iterations enter the region in iteration order.
// g++ -std=c++11 -Wall -Wextra -pedantic -fopenmp critical.cpp
#include <iostream>
int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
        std::cout << "unsynchronized(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
    #pragma omp critical
        std::cout << "critical(" << i << ") ";
    std::cout << std::endl;

    #pragma omp parallel for ordered
    for (int i = 0; i < 20; ++i)
    #pragma omp ordered
        std::cout << "ordered(" << i << ") ";
    std::cout << std::endl;

    return 0;
}
Example output (different each time in general):
unsynchronized(unsynchronized(unsynchronized(05) unsynchronized() 6unsynchronized() 1unsynchronized(7) ) unsynchronized(unsynchronized(28) ) unsynchronized(unsynchronized(93) ) unsynchronized(4) 10) unsynchronized(11) unsynchronized(12) unsynchronized(15) unsynchronized(16unsynchronized() 13unsynchronized() 17) unsynchronized(unsynchronized(18) 14unsynchronized() 19)
critical(5) critical(0) critical(6) critical(15) critical(1) critical(10) critical(7) critical(16) critical(2) critical(8) critical(17) critical(3) critical(9) critical(18) critical(11) critical(4) critical(19) critical(12) critical(13) critical(14)
ordered(0) ordered(1) ordered(2) ordered(3) ordered(4) ordered(5) ordered(6) ordered(7) ordered(8) ordered(9) ordered(10) ordered(11) ordered(12) ordered(13) ordered(14) ordered(15) ordered(16) ordered(17) ordered(18) ordered(19)
Problem is: you have a single resource all threads try to access. Such single resources must be protected against concurrent access (thread-safe resources do this too, just transparently for you; by the way, here is a nice answer about the thread safety of std::cout). You could protect this single resource, e.g. with a std::mutex. The problem then is that the threads will have to wait for the mutex until the other thread gives it back, so you will only profit from parallelisation if F is a sufficiently complex function.
A further drawback: as the threads work in parallel, even with a mutex protecting std::cout, the results can be printed in arbitrary order, depending on which thread happens to get there first.
If I may assume that you want the results of F(..., i) for smaller i before the results of greater i, you should either drop the parallelisation entirely or do it differently:
Provide an array of size N and let each thread store its result there (array[i] = F(..., i);). Then iterate over the array in a separate, non-parallel loop, as in the sketch below. Again, this is only worth the effort if F is a complex function (and for large N).
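A minimal sketch of this array approach (F and its arguments are stand-ins for the question's names):

// g++ -fopenmp array_results.cpp
#include <iostream>
#include <vector>

// Stand-in for the question's F(a, b, c, i); any pure function works here.
double F(double a, double b, double c, int i) { return a * i * i + b * i + c; }

int main() {
    const int N = 100;
    const double a = 1.0, b = 2.0, c = 3.0;
    std::vector<double> results(N);

    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        results[i] = F(a, b, c, i); // each iteration writes only its own slot: no race

    for (int i = 0; i < N; ++i)     // separate serial loop: output comes out in order
        std::cout << i << " " << results[i] << "\n";
}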
Additionally: be aware that threads must be created, too, which causes some overhead somewhere (creating the thread infrastructure and stack, registering the thread with the OS, and so on), unless you can reuse threads already created in a thread pool earlier. Consider this, too, when deciding whether to parallelise. Sometimes non-parallel calculations are faster...

boost::threads execution ordering

I have a problem with the order of execution of threads that are created consecutively.
Here is the code:
#include <iostream>
#include <Windows.h>
#include <boost/thread.hpp>
using namespace std;
boost::mutex mutexA;
boost::mutex mutexB;
boost::mutex mutexC;
boost::mutex mutexD;

void SomeWork(char letter, int index)
{
    boost::mutex::scoped_lock lock;
    switch (letter)
    {
        case 'A': lock = boost::mutex::scoped_lock(mutexA); break;
        case 'B': lock = boost::mutex::scoped_lock(mutexB); break;
        case 'C': lock = boost::mutex::scoped_lock(mutexC); break;
        case 'D': lock = boost::mutex::scoped_lock(mutexD); break;
    }
    cout << letter << index << " started" << endl;
    Sleep(800);
    cout << letter << index << " finished" << endl;
}

int main(int argc, char *argv[])
{
    for (int i = 0; i < 16; i++)
    {
        char x = rand() % 4 + 65;
        boost::thread tha = boost::thread(SomeWork, x, i);
        Sleep(10);
    }
    Sleep(6000);
    system("PAUSE");
    return 0;
}
Each time, a letter (from A to D) and a generation id (i) are passed to the function SomeWork as a thread. I do not care about the execution order between letters, but for a particular letter, say A, Ax has to start before Ay if x < y.
A random part of a random output of the code is:
B0 started
D1 started
C2 started
A3 started
B0 finished
B12 started
D1 finished
D15 started
C2 finished
C6 started
A3 finished
A9 started
B12 finished
B11 started --> B11 started after B12 finished.
D15 finished
D13 started
C6 finished
C7 started
A9 finished
How can I avoid such conditions?
Thanks.
I solved the problem using condition variables, though I changed the problem a bit: the solution is to keep track of the index of the for loop, so each thread knows whether it is its turn to run. As far as this code is concerned, there are two other things I would like to ask about.
First, on my computer, when I set the for-loop bound to 350, I got an access violation; 310 iterations were still OK. So I realized there is a maximum number of threads that can be created. How can I determine this number?
Second, in Visual Studio 2008, the release version of the code showed really strange behaviour: without using condition variables (lines 1 to 3 commented out), the threads were still ordered. How could that happen?
Here is the code:
#include <iostream>
#include <Windows.h>
#include <boost/thread.hpp>
using namespace std;
boost::mutex mutexA;
boost::mutex mutexB;
boost::mutex mutexC;
boost::mutex mutexD;

class cl
{
public:
    boost::condition_variable con;
    boost::mutex mutex_cl;
    char Letter;
    int num;

    cl(char letter) : Letter(letter), num(0)
    {
    }

    void doWork(int index, int tracknum)
    {
        boost::unique_lock<boost::mutex> lock(mutex_cl);
        while (num != tracknum)   // line 1
            con.wait(lock);       // line 2
        Sleep(10);
        num = index;
        cout << Letter << index << endl;
        con.notify_all();         // line 3
    }
};

int main(int argc, char *argv[])
{
    cl A('A');
    cl B('B');
    cl C('C');
    cl D('D');
    for (int i = 0; i < 100; i++)
    {
        boost::thread(&cl::doWork, &A, i + 1, i);
        boost::thread(&cl::doWork, &B, i + 1, i);
        boost::thread(&cl::doWork, &C, i + 1, i);
        boost::thread(&cl::doWork, &D, i + 1, i);
    }
    cout << "************************************************************************" << endl;
    Sleep(6000);
    system("PAUSE");
    return 0;
}
If you have two different threads waiting for the lock, it's entirely non-deterministic which one will acquire it once the lock is released by the previous holder. I believe this is what you are experiencing. Assume B10 is holding the lock, and in the meantime threads are spawned for B11 and B12. When B10 releases the lock, it's down to a coin toss whether B11 or B12 acquires it next, irrespective of which thread was created first, or even which thread started waiting first.
Perhaps you should implement a work queue for each letter, such that you spawn exactly four threads, each of which consumes work units; a sketch follows below. This is the only way to easily guarantee ordering of this kind. A simple mutex will not guarantee ordering when multiple threads are waiting for the lock.
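A minimal sketch of that per-letter work-queue idea (class and function names are mine, not from the question), using std::thread for brevity:

#include <condition_variable>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// One queue plus one worker per letter: a single thread consumes its letter's
// work units FIFO, so A3 can never run before A1.
class LetterWorker {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    char letter;
    std::thread worker;  // declared last: starts only after the members above exist

    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return done || !q.empty(); });
            if (q.empty()) return;  // done and fully drained
            int index = q.front();
            q.pop();
            lock.unlock();
            std::cout << letter << index << std::endl;  // stands in for SomeWork
        }
    }

public:
    explicit LetterWorker(char c) : letter(c), worker(&LetterWorker::run, this) {}

    void post(int index) {
        { std::lock_guard<std::mutex> lock(m); q.push(index); }
        cv.notify_one();
    }

    ~LetterWorker() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
        worker.join();  // drain remaining work, then stop
    }
};

int main() {
    LetterWorker a('A'), b('B'), c('C'), d('D');
    LetterWorker *workers[] = { &a, &b, &c, &d };
    for (int i = 0; i < 16; i++)
        workers[std::rand() % 4]->post(i);  // enqueue in creation order
}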
Even though B11 is started before B12, it is not guaranteed a CPU time slice to execute SomeWork() prior to B12. This decision is up to the OS and its scheduler.
Mutexes are typically used to synchronize access to data between threads, whereas the concern raised here is the sequence of thread execution (i.e. of data access).
If the threads in group 'A' are executing the same code on the same data, then just use one thread. This eliminates context switching between the threads in the group and yields the same result. If the data is changing, consider a producer/consumer pattern. Paul Bridger gives an easy-to-understand producer/consumer example here.
Your threads have dependencies that must be satisfied before they start execution. In your example, B12 depends on B0 and B11. Somehow you have to track that dependency information, and threads with unfinished dependencies must be made to wait.
I would look into condition variables. Each time a thread finishes SomeWork(), it would call the condition variable's notify_all() method. Then all of the waiting threads check whether they still have unfinished dependencies: if so, they go back to waiting; otherwise, they go ahead and call SomeWork().
You need some way for each thread to determine whether it has unfinished dependencies. This will probably be some globally available entity. You should only modify it while you hold the mutex (in SomeWork()); reading it from multiple threads should be safe for simple data structures. A compact sketch of this scheme follows.
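A compact sketch of that scheme using std::thread and std::condition_variable (the asker used boost, but the pattern is identical); per-letter progress counters are the "globally available entity":

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
int next_index[4] = {0, 0, 0, 0};  // per-letter progress, guarded by m

// The thread for (letter, index) waits until all earlier indices of its
// letter have finished, then does its work and wakes the other waiters.
void SomeWork(int letter, int index) {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return next_index[letter] == index; });  // dependency check
    std::cout << char('A' + letter) << index << std::endl;       // the actual work
    ++next_index[letter];  // work is done under the lock here purely for brevity
    cv.notify_all();       // wake everyone so they re-check their dependencies
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i)
        for (int letter = 0; letter < 4; ++letter)
            threads.emplace_back(SomeWork, letter, i);
    for (auto &t : threads) t.join();
}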