Using cout in multiple threads might result in interleaved output.
So I tried to protect cout with a mutex.
The following code starts 10 background threads with std::async. When a thread starts, it prints "Started thread ...".
The main thread iterates over the futures of the background threads in the order in which they were created and prints out "Done thread ..." when the corresponding thread finished.
The output is synchronized correctly, but after some threads have started and some have finished (see output below), a deadlock occurs. All remaining background threads and the main thread are waiting for the mutex.
What is the reason for the deadlock?
When the print function is left or one iteration of the for loop ends, the lock_guard should unlock the mutex, so that one of the waiting threads would be able to proceed.
Why are all the threads left starving?
Code
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
using namespace std;
std::mutex mtx; // mutex for critical section
int print_start(int i) {
lock_guard<mutex> g(mtx);
cout << "Started thread" << i << "(" << this_thread::get_id() << ") " << endl;
return i;
}
int main() {
vector<future<int>> futures;
for (int i = 0; i < 10; ++i) {
futures.push_back(async(print_start, i));
}
//retrieve and print the value stored in the future
for (auto &f : futures) {
lock_guard<mutex> g(mtx);
cout << "Done thread" << f.get() << "(" << this_thread::get_id() << ")" << endl;
}
cin.get();
return 0;
}
Output
Started thread0(352)
Started thread1(14944)
Started thread2(6404)
Started thread3(16884)
Done thread0(16024)
Done thread1(16024)
Done thread2(16024)
Done thread3(16024)
Your problem lies in the use of future::get:
Returns the value stored in the shared state (or throws its exception)
when the shared state is ready.
If the shared state is not yet ready (i.e., the provider has not yet
set its value or exception), the function blocks the calling thread
and waits until it is ready.
http://www.cplusplus.com/reference/future/future/get/
So if the thread behind the future has not run yet, the function blocks until that thread finishes. However, you take ownership of the mutex before calling future::get, so whichever thread you're waiting for will not be able to acquire the mutex for itself.
This should fix your deadlock problem:
int value = f.get();
lock_guard<mutex> g(mtx);
cout << "Done thread" << value << "(" << this_thread::get_id() << ")" << endl;
You lock the mutex and then wait for one of the futures, which in turn requires a lock on the mutex itself. Simple rule: Don't wait with locked mutexes.
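As a minimal sketch of the rule above, the following wraps the corrected loop in a function. The names print_mtx and run_tasks are made up for this illustration, and passing std::launch::async explicitly is an assumption that is not in the original code (which used the default launch policy). The point is only the ordering: get() first, lock second.

```cpp
#include <future>
#include <iostream>
#include <mutex>
#include <vector>

std::mutex print_mtx;  // guards std::cout (hypothetical name)

int print_start(int i) {
    std::lock_guard<std::mutex> g(print_mtx);
    std::cout << "Started thread " << i << '\n';
    return i;
}

// Launches n tasks and returns their results in creation order.
// run_tasks is a hypothetical helper, not part of the original post.
std::vector<int> run_tasks(int n) {
    std::vector<std::future<int>> futures;
    for (int i = 0; i < n; ++i) {
        // std::launch::async is an assumption; the original used the default policy.
        futures.push_back(std::async(std::launch::async, print_start, i));
    }

    std::vector<int> done;
    for (auto& f : futures) {
        int value = f.get();                        // wait WITHOUT holding the mutex
        std::lock_guard<std::mutex> g(print_mtx);   // lock only while printing
        std::cout << "Done thread " << value << '\n';
        done.push_back(value);
    }
    return done;
}
```

Calling run_tasks(10) from main should return the values 0 through 9 in order, regardless of how the "Started" lines interleave.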
BTW: Locking output streams is not very effective, because it can easily be circumvented by code you don't even control. Rather than using those globals, give a stream to code that needs to output something (dependency injection) and then collect the data from that stream in a threadsafe way. Or use a logging library, because that's probably what you wanted to do anyway.
It is good that the cause could be spotted from the source. Often, though, the error is not so easy to locate, and the cause may be different. Fortunately, you can investigate a deadlock with a debugger.
I compiled and ran your example, then attached to it with gdb (gcc 4.9.2/Linux). Here is the backtrace (noisy implementation details skipped):
#0 __lll_lock_wait ()
...
#5 0x0000000000403140 in std::lock_guard<std::mutex>::lock_guard (
this=0x7ffe74903320, __m=...) at /usr/include/c++/4.9/mutex:377
#6 0x0000000000402147 in print_start (i=0) at so_deadlock.cc:9
...
#23 0x0000000000409e69 in ....::_M_complete_async() (this=0xdd4020)
at /usr/include/c++/4.9/future:1498
#24 0x0000000000402af2 in std::__future_base::_State_baseV2::wait (
this=0xdd4020) at /usr/include/c++/4.9/future:321
#25 0x0000000000404713 in std::__basic_future<int>::_M_get_result (
this=0xdd47e0) at /usr/include/c++/4.9/future:621
#26 0x0000000000403c48 in std::future<int>::get (this=0xdd47e0)
at /usr/include/c++/4.9/future:700
#27 0x000000000040229b in main () at so_deadlock.cc:24
This is just what the other answers explain: the code in the locked section (so_deadlock.cc:24) calls future::get(), which in turn (by forcing the result) tries to acquire the lock again.
Other cases might not be that simple, since there are usually several threads involved, but the information is all there.
Related
I do not know if this is documented anywhere, if so I would love a reference to it, however I have found some unexpected behaviour when using OpenMP. I have a simple program below to illustrate the issue. Here in point form I will tell what I expect the program to do:
I want to have 2 threads
They both share an integer
The first thread increments the integer
The second thread reads the integer
After incrementing once, an external process must tell the first thread to continue incrementing (via a mutex lock)
The second thread is in charge of unlocking this mutex
As you will see, the counter which is shared between the threads is not altered properly for the second thread. However, if I turn the counter into an integer reference instead, I get the expected result. Here is a simple code example:
#include <mutex>
#include <thread>
#include <chrono>
#include <iostream>
#include <omp.h>
using namespace std;
using std::this_thread::sleep_for;
using std::chrono::milliseconds;
const int sleep_amount = 2000;
int main() {
int counter = 0; // if I comment this and uncomment the 2 lines below, I get the expected results
/* int c = 0; */
/* int &counter = c; */
omp_lock_t mut;
omp_init_lock(&mut);
int counter_1, counter_2;
#pragma omp parallel
#pragma omp single
{
#pragma omp task default(shared)
// The first task just increments the counter 3 times
{
while (counter < 3) {
omp_set_lock(&mut);
counter += 1;
cout << "increasing: " << counter << endl;
}
}
#pragma omp task default(shared)
{
sleep_for(milliseconds(sleep_amount));
// While sleeping, counter is increased to 1 in the first task
counter_1 = counter;
cout << "counter_1: " << counter << endl;
omp_unset_lock(&mut);
sleep_for(milliseconds(sleep_amount));
// While sleeping, counter is increased to 2 in the first task
counter_2 = counter;
cout << "counter_2: " << counter << endl;
omp_unset_lock(&mut);
// Release one last time to increment the counter to 3
}
}
omp_destroy_lock(&mut);
cout << "expected: 1, actual: " << counter_1 << endl;
cout << "expected: 2, actual: " << counter_2 << endl;
cout << "expected: 3, actual: " << counter << endl;
}
Here is my output:
increasing: 1
counter_1: 0
increasing: 2
counter_2: 0
increasing: 3
expected: 1, actual: 0
expected: 2, actual: 0
expected: 3, actual: 3
gcc version: 9.4.0
Additional discoveries:
If I use OpenMP 'sections' instead of 'tasks', I get the expected result as well. The problem seems to be with 'tasks' specifically
If I use posix semaphores, this problem also persists.
Unlocking a mutex from another thread is not permitted; doing so causes undefined behavior. The general solution in this case is to use semaphores. Condition variables can also help (depending on the real-world use case). To quote the OpenMP documentation (note that this constraint is shared by nearly all mutex implementations, including pthreads):
A program that accesses a lock that is not in the locked state or that is not owned by the task that contains the call through either routine is non-conforming.
A program that accesses a lock that is not in the uninitialized state through either routine is non-conforming.
Moreover, the two tasks can be executed on the same thread or different threads. You should not assume anything about their scheduling unless you tell OpenMP to do so with dependencies. Here, it is completely compliant for a runtime to execute the tasks serially. You need to use OpenMP sections so multiple threads execute different sections. Besides, it is generally considered as a bad practice to use locks in tasks as the runtime scheduler is not aware of them.
Finally, you do not need a lock in this case: an atomic operation is sufficient. Fortunately, OpenMP supports atomic operations (as well as C++).
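To illustrate the atomic approach on the C++ side, here is a minimal sketch that uses std::atomic and std::thread rather than OpenMP tasks. The function name count_with_atomic is made up for this example, and it does not reproduce the original hand-shake between the two tasks. It only shows that a reader can observe the writer's increments without any lock.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: one thread increments an atomic counter `target`
// times while the calling thread polls it until it sees the final value.
int count_with_atomic(int target) {
    std::atomic<int> counter{0};

    std::thread writer([&counter, target] {
        for (int i = 0; i < target; ++i) {
            counter.fetch_add(1);  // atomic read-modify-write, no lock needed
        }
    });

    int seen = 0;
    while ((seen = counter.load()) < target) {
        std::this_thread::yield();  // let the writer make progress
    }

    writer.join();
    return seen;
}
```

With this sketch, count_with_atomic(3) returns 3 once the reader has seen the last increment.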
Additional notes
Note that locks guarantee the consistency of memory accesses across multiple threads thanks to memory barriers. An unlock operation on a mutex causes a release memory barrier that makes earlier writes visible to other threads. A lock operation in another thread performs an acquire memory barrier that forces reads to happen after the lock. When locks and unlocks are not used correctly, memory accesses are no longer safe, which can, for example, cause a variable not to appear updated in other threads. More generally, this also tends to create race conditions. In short: don't do that.
Consider the following code snippet
int index = 0;
av::utils::Lock lock(av::utils::Lock::EStrategy::eMutex); // Uses a mutex or a spin lock based on specified strategy.
void fun()
{
for (int i = 0; i < 100; ++i)
{
lock.aquire();
++index;
std::cout << "thread " << std::this_thread::get_id() << " index = " << index << std::endl;
std::this_thread::sleep_for(std::chrono::milliseconds(500));
lock.release();
}
}
int main()
{
std::thread t1(fun);
std::thread t2(fun);
t1.join();
t2.join();
}
The output that I get with a mutex used for synchronization is first thread 1 gets executed completely followed by thread 2.
While using a spinlock(implemented using std::atomic_flag), I get the order of execution between the threads which is interleaved (one iteration of thread 1 followed by another iteration of thread 2). The latter case happens irrespective of the delay I add in execution of the iteration.
I understand that a mutex only guarantees mutual exclusion and not the order of execution. The question I have is if I want to have an execution order such that two threads are executed in an interleaved manner, is using spinlocks a recommended strategy or not?
The output that I get with a mutex ... is first thread 1 [runs through the whole loop] followed by thread 2.
That's because of how your loop uses the lock: The very last thing the loop body does is, it unlocks the lock. The very next thing it does at the start of the next iteration is, it locks the lock again.
The other thread can be blocked, effectively sleeping, waiting for the mutex. When your thread 1 releases the lock, the OS scheduler may still be running its algorithms, trying to figure out how to respond to that, when thread 1 comes 'round and locks the lock again.
It's like a race to lock the mutex, and thread 1 is on the starting line when the gun goes off, while thread 2 is sitting on the bench, tying its shoes.
While using a spinlock...the order of execution between the threads which is interleaved
That's because the "blocked" thread isn't really blocked. It's still actively running on a different processor while it waits. It has a much better chance at winning the lock when the first thread releases it.
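For reference, a spinlock built on std::atomic_flag might look roughly like the sketch below. This is not the av::utils::Lock implementation from the question, which is not shown; the class name SpinLock and the helper count_with_spinlock are assumptions made for illustration. The waiting thread keeps executing test_and_set in a loop instead of sleeping, which is why it tends to grab the lock the moment it is released.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spinlock; NOT the av::utils::Lock from the question.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Busy-wait: the thread stays on its core until the flag is clear.
        while (flag.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};

// Two threads each increment a shared counter per_thread times under the lock.
int count_with_spinlock(int per_thread) {
    SpinLock lock;
    int index = 0;
    auto work = [&lock, &index, per_thread] {
        for (int i = 0; i < per_thread; ++i) {
            lock.lock();
            ++index;
            lock.unlock();
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return index;
}
```

The sketch guarantees mutual exclusion only. Like the mutex, it makes no promise about which thread gets the lock next, so any interleaving it produces is a scheduling tendency rather than a guarantee.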
I want to create over 500 threads in C++ on a BeagleBone Black,
but the program fails with an error.
Could you explain why the error occurs and how I can fix it?
The thread function, call_from_thread(int tid):
void call_from_thread(int tid)
{
cout << "thread running : " << tid << std::endl;
}
The main function:
int main() {
thread t[500];
for(int i=0; i<500; i++) {
t[i] = thread(call_from_thread, i);
usleep(100000);
}
std::cout << "main fun start" << endl;
return 0;
}
I expect:
...
...
thread running : 495
thread running : 496
thread running : 497
thread running : 498
thread running : 499
main fun start
but I get:
...
...
thread running : 374
thread running : 375
thread running : 376
thread running : 377
thread running : 378
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
Aborted
Could you help me?
The BeagleBone Black appears to have a maximum of 512MB of DRAM.
The default stack size of a thread according to the pthread_create() man page is 2MB (the exact value is platform-dependent).
That is, 2^29 / 2^21 = 2^8 = 256 such stacks fit in physical memory. So what you're probably seeing around thread 374 is that the allocator cannot free memory fast enough to meet the demand, which is handled by throwing an exception.
If you really want to see this explode, try moving that sleep call inside your thread function. :)
You could try preallocating the stack to 1MB or less (pthreads), but that has its own set of problems.
The questions to really ask yourself are:
Is my application io bound or compute bound?
What's my memory budget to run this application? If you spend your entire physical memory
on thread stacks, you'll have nothing left for the shared program heap.
Do I really need this much parallelism to do the job? The A8 is a single core machine BTW.
Could I solve the problem using a thread pool? Or not use threads at all?
Finally, you can't set the stack size in std::thread api, but you can in
boost::thread.
Or just write a thin wrapper around pthreads (assuming Linux).
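A thin pthreads wrapper of that kind could look roughly like this. It is only a sketch: the names thread_entry and run_with_stack_size are hypothetical, and the work done in the thread is a placeholder. The pthread_attr_* calls are the standard POSIX ones, and the requested size must be at least PTHREAD_STACK_MIN.

```cpp
#include <pthread.h>
#include <cstddef>

// Placeholder thread body (hypothetical): increments the int it is given.
static void* thread_entry(void* arg) {
    int* value = static_cast<int*>(arg);
    *value += 1;
    return nullptr;
}

// Hypothetical helper: runs thread_entry on a thread whose stack is
// stack_bytes large and waits for it. Returns 0 on success, otherwise
// the error code reported by the failing pthread call.
int run_with_stack_size(std::size_t stack_bytes, int* value) {
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        return rc;
    }

    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) {
        pthread_t tid;
        rc = pthread_create(&tid, &attr, thread_entry, value);
        if (rc == 0) {
            rc = pthread_join(tid, nullptr);  // releases the thread's resources
        }
    }

    pthread_attr_destroy(&attr);
    return rc;
}
```

For example, run_with_stack_size(256 * 1024, &v) asks for a 256 KiB stack instead of the platform default. Whether a stack that small is enough depends entirely on what the thread function does.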
Whenever you use threads, there are three parts.
Start the threads
Do the work
Release the thread
You're starting the threads and doing the work, but you're not releasing them.
Releasing threads. There are two options for releasing a thread.
You can join the thread (which basically waits for it to finish)
You can detach the thread, and let it execute independently.
In this particular case, you don't want the program to finish until all threads are done executing, so you should join them.
#include <iostream>
#include <thread>
#include <vector>
#include <string>
auto call_from_thread = [](int i) {
// I create the entire message before printing it, so that there's no interleaving of messages between threads
std::string message = "Calling from thread " + std::to_string(i) + '\n';
// Because I only call print once, everything gets printed together
std::cout << message;
};
using std::thread;
int main() {
thread t[500];
for(int i=0; i<500; i++) {
// Here, I don't have to start the thread with any delay
t[i] = thread(call_from_thread, i);
}
std::cout << "main fun start\n";
// I join each thread (which waits for them to finish before closing the program)
for(auto& item : t) {
item.join();
}
return 0;
}
I am having a hard time understanding why the following code blocks:
{
std::async(std::launch::async, [] { std::this_thread::sleep_for(5s); });
// this line will not execute until above task finishes?
}
I suspect that std::async returns a std::future as a temporary, which joins the task thread in its destructor. Is that possible?
Full code is below:
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
int main() {
using namespace std::literals;
{
auto fut1 = std::async(std::launch::async, [] { std::this_thread::sleep_for(5s); std::cout << "work done 1!\n"; });
// here fut1 in its destructor will force a join on a thread associated with above task.
}
std::cout << "Work done - implicit join on fut1 associated thread just ended\n\n";
std::cout << "Test 2 start" << std::endl;
{
std::async(std::launch::async, [] { std::this_thread::sleep_for(5s); std::cout << "work done 2!" << std::endl; });
// no future so it should not join - but - it does join somehow.
}
std::cout << "This should show before work done 2!?" << std::endl;
}
Yes, the std::future returned by async has the special property of waiting in its destructor for the task to complete.
This is because loose threads are bad news, and the future's destructor is the only place left where that thread can be waited for.
To fix this, store the resulting futures until you either need the result or, in extreme cases, until the end of the program.
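A minimal sketch of storing the futures, assuming a made-up helper run_concurrently that launches n sleeping tasks and reports the elapsed wall-clock time in milliseconds. Because the futures live in a vector, no destructor runs until the vector goes out of scope, so the tasks overlap instead of running one after another.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper: starts n tasks that each sleep for `each`,
// keeps their futures alive, and returns the total elapsed milliseconds.
long long run_concurrently(int n, std::chrono::milliseconds each) {
    auto start = std::chrono::steady_clock::now();
    {
        std::vector<std::future<void>> pending;
        for (int i = 0; i < n; ++i) {
            // The future is stored, so its destructor does not block here.
            pending.push_back(std::async(std::launch::async, [each] {
                std::this_thread::sleep_for(each);
            }));
        }
        // Leaving this scope destroys the futures; each destructor waits
        // for its task, but by now all tasks are already running.
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

With three tasks of 200 ms each, the total should be close to 200 ms rather than 600 ms, provided the system can actually run the three threads at the same time.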
Writing your own thread pool system is also a good idea; I find C++ threading primitives to be sufficient to write a threading system, but use in the raw is not something I'd encourage outside of tiny programs.
For example I want each thread to not start running until the previous one has completed, is there a flag, something like thread.isRunning()?
#include <iostream>
#include <vector>
#include <thread>
using namespace std;
void hello() {
cout << "thread id: " << this_thread::get_id() << endl;
}
int main() {
vector<thread> threads;
for (int i = 0; i < 5; ++i)
threads.push_back(thread(hello));
for (thread& thr : threads)
thr.join();
cin.get();
return 0;
}
I know that the threads are meant to run concurrently, but what if I want to control the order?
There is no thread.isRunning(). You need some synchronization primitive to do it.
Consider std::condition_variable for example.
One approachable way is to use std::async. As std::async is currently defined, the shared state of an operation launched by std::async can cause the returned std::future's destructor to block until the operation is complete. This can limit composability and result in code that appears to run in parallel but in reality runs sequentially.
{
std::async(std::launch::async, []{ hello(); });
std::async(std::launch::async, []{ hello(); }); // does not run until hello() completes
}
If the second thread needs to start only after the first one has completed, is a thread really needed?
As a solution, I think setting a global flag should work: set the value in the first thread, and check the flag before starting the second thread.
You can't simply control the order by saying "First thread 1, then thread 2, ..."; you will need to use synchronization (e.g. std::mutex and condition variables such as std::condition_variable_any).
You can create events to block one thread until a certain event has happened.
See cppreference for an overview of threading mechanisms in C++11.
You will need to use a semaphore or a lock.
If you initialize the semaphore to 0:
Call wait after thread.start() and call signal/release at the end of the thread's execution function (e.g. the run function in Java, an OnExit function, etc.)
The main thread will then keep waiting until the thread started in the loop has completed its execution.
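The semaphore idea can be sketched in C++11 as follows. Since std::counting_semaphore only arrived in C++20, this example builds a small counting semaphore from std::mutex and std::condition_variable. The class Semaphore and the function run_in_order are assumptions made for illustration, not part of any library.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal counting semaphore for C++11.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;
public:
    explicit Semaphore(int initial) : count(initial) {}
    void signal() {
        {
            std::lock_guard<std::mutex> g(m);
            ++count;
        }
        cv.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [this] { return count > 0; });
        --count;
    }
};

// Starts n threads one at a time; returns the order in which they ran.
std::vector<int> run_in_order(int n) {
    std::vector<int> order;
    std::vector<std::thread> threads;
    Semaphore done(0);  // initialized to 0, as described above
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&order, &done, i] {
            order.push_back(i);  // safe: only one worker runs at a time
            done.signal();       // signal at the end of the thread function
        });
        done.wait();  // main thread blocks until worker i has signalled
    }
    for (auto& t : threads) {
        t.join();
    }
    return order;
}
```

Here run_in_order(5) yields 0, 1, 2, 3, 4. Because each thread only starts after the previous one has signalled, the work is effectively sequential, which is the trade-off the other answers point out.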
Task-based parallelism can achieve this, but C++ does not currently offer a task model as part of its threading libraries. If you have TBB or PPL you can use their task-based facilities.
I think you can achieve this by using std::mutex and std::condition_variable from C++11. To run the threads sequentially, an array of booleans is used: when a thread is done with its work, it writes true at its own index of the array.
For example:
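#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <thread>
using namespace std;
mutex mtx;
condition_variable cv;
bool ids[10] = { false }; // ids[i] becomes true once thread i has finished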
void shared_method(int id) {
unique_lock<mutex> lock(mtx);
if (id != 0) {
while (!ids[id - 1]) {
cv.wait(lock);
}
}
int delay = rand() % 4;
cout << "Thread " << id << " will finish in " << delay << " seconds." << endl;
this_thread::sleep_for(chrono::seconds(delay));
ids[id] = true;
cv.notify_all();
}
void test_condition_variable() {
thread threads[10];
for (int i = 0; i < 10; ++i) {
threads[i] = thread(shared_method, i);
}
for (thread &t : threads) {
t.join();
}
}
Output:
Thread 0 will finish in 3 seconds.
Thread 1 will finish in 1 seconds.
Thread 2 will finish in 1 seconds.
Thread 3 will finish in 2 seconds.
Thread 4 will finish in 2 seconds.
Thread 5 will finish in 0 seconds.
Thread 6 will finish in 0 seconds.
Thread 7 will finish in 2 seconds.
Thread 8 will finish in 3 seconds.
Thread 9 will finish in 1 seconds.