Is there a race condition in the `latch` sample in N3600?

Is there a race condition in the `latch` sample in N3600? - c++

Proposed for inclusion in C++14 (aka C++1y) are some new thread synchronization primitives: latches and barriers. The proposal is
N3600: C++ Latches and Barriers
N3666: C++ Latches and Barriers, revised
It sounds like a good idea and the samples make it look very programmer-friendly. Unfortunately, I think the sample code invokes undefined behavior. The proposal says of latch::~latch():
Destroys the latch. If the latch is destroyed while other threads are in wait(), or are invoking count_down(), the behaviour is undefined.
Note that it says "in wait()" and not "blocked in wait()", as the description of count_down() uses.
Then the following sample is provided:
An example of the second use case is shown below. We need to load data and then process it using a number of threads. Loading the data is I/O bound, whereas starting threads and creating data structures is CPU bound. By running these in parallel, throughput can be increased.
void DoWork()
{
latch start_latch(1);
vector<thread*> workers;
for (int i = 0; i < NTHREADS; ++i) {
workers.push_back(new thread([&] {
// Initialize data structures. This is CPU bound.
...
start_latch.wait();
// perform work
...
}));
}
// Load input data. This is I/O bound.
...
// Threads can now start processing
start_latch.count_down();
}
Isn't there a race condition between the threads waking and returning from wait(), and destruction of the latch when it leaves scope? Besides that, all the thread objects are leaked. If the scheduler doesn't run all worker threads before count_down returns and the start_latch object leaves scope, then I think undefined behavior will result. Presumably the fix is to iterate the vector and join() and delete all the worker threads after count_down but before returning.
Is there a problem with the sample code?
Do you agree that a proposal should show a complete correct example, even if the task is extremely simple, in order for reviewers to see what the use experience will be like?
Note: It appears possible that one or more of the worker threads haven't yet begun to wait, and will therefore call wait() on a destroyed latch.
Update: There's now a new version of the proposal, but the representative example is unchanged.

Thanks for pointing this out. Yes, I think that the sample code (which, in its defense, was intended to be concise) is broken. It should probably wait for the threads to finish.
Any implementation that allows threads to be blocked in wait() is almost certainly going to involves some kind of condition variable, and destroying the latch while a thread has not yet exited wait() is potentially undefined.
I don't know if there's time to update the paper, but I can make sure that the next version is fixed.
Alasdair

Related

C++ condition variables vs new threads for vectorization

I have a block of code that goes through a loop. A section of the code operates on a vector of data and I would like to vectorize this operation. The idea is to split the elaboration of the array on multiple threads that will work on subsections of the array. I have to decide between two possibilities. The first one is to create the threads each time this section is encountered an rejoin them at the end with the main thread:
for(....)
{
//serial stuff
//crate threads
for(i = 0; i < num_threads; ++i)
{
threads_vect.push_back(std::thread(f, sub_array[i]));
}
//join them
for(auto& t : threads_vect)
{
t.join();
}
//serial stuff
}
This is similar at what it is done with OpenMP, but since the problem is simple I'd like to use std::threads instead of OpenMP (unless there are good reasons against this).
The second option is to create the threads beforehand to avoid the overhead of creation and destruction, and use condition variables for synchronization (I omitted a lot of stuff for the synchronization. It is just the general idea):
std::condition_variable cv_threads;
std::condition_variable cv_main;
//create threads, the will be to sleep on cv_threads
for(....)
{
//serial stuff
//wake up threads
cv_threads.notify_all();
//sleep until the last thread finishes, that will notify.
main_thread_lock.lock();
cv_main.wait(main_lock);
//serial stuff
}
To allow for parallelism the threads will have to unlock the thread_lock as soon as they wake up at the beginning of the computation, then acquire it again at to go to sleep and synchronize between them to notify the main thread.
My question is which of this solutions is usually preferred in a context like this, and if the avoided overhead of thread creation and destruction is usually worth the added complexity (or worth at all given that the added synchronization also adds time).
Obviously this also depends on how long the computation is for each thread, but this could vary a lot since the length of the data vector could also be very short (to about two element per thread, that would led to a computation time of about 15 milliseconds).

The biggest disadvantage of creating new threads is that thread creation and shutdown is usually quite expensive. Just think of all the things an operating system has to do to get a thread off the ground, compared to what it takes to notify a condition variable.
Note that synchronization is always required, also with thread creation. The C++11 std::thread for instances introduces a synchronizes-with relationship with the creating thread upon construction. So you can safely assume that thread creation will always be significantly more expensive than condition variable signalling, regardless of your implementation.
A framework like OpenMP will typically attempt to amortize these costs in some way. For instance, an OpenMP implementation is not required to shut down the worker threads after every loop and many implementations will not do this.

Safely Destroying a Thread Pool

Consider the following implementation of a trivial thread pool written in C++14.
threadpool.h
threadpool.cpp
Observe that each thread is sleeping until it's been notified to awaken -- or some spurious wake up call -- and the following predicate evaluates to true:
std::unique_lock<mutex> lock(this->instance_mutex_);
this->cond_handle_task_.wait(lock, [this] {
return (this->destroy_ || !this->tasks_.empty());
});
Furthermore, observe that a ThreadPool object uses the data member destroy_ to determine if its being destroyed -- the destructor has been called. Toggling this data member to true will notify each worker thread that it's time to finish its current task and any of the other queued tasks then synchronize with the thread that's destroying this object; in addition to prohibiting the enqueue member function.
For your convenience, the implementation of the destructor is below:
ThreadPool::~ThreadPool() {
{
std::lock_guard<mutex> lock(this->instance_mutex_); // this line.
this->destroy_ = true;
}
this->cond_handle_task_.notify_all();
for (auto &worker : this->workers_) {
worker.join();
}
}
Q: I do not understand why it's necessary to lock the object's mutex while toggling destroy_ to true in the destructor. Furthermore, is it only necessary for setting its value or is it also necessary for accessing its value?
BQ: Can this thread pool implementation be improved or optimized while maintaining it's original purpose; a thread pool that can pool N amount of threads and distribute tasks to them to be executed concurrently?
This thread pool implementation is forked from Jakob Progsch's C++11 thread pool repository with a thorough code step through to understand the purpose behind its implementation and some subjective style changes.
I am introducing myself to concurrent programming and there is still much to learn -- I am a novice concurrent programmer as it stands right now. If my questions are not worded correctly then please make the appropriate correction(s) in your provided answer. Moreover, if the answer can be geared towards a client who is being introduced to concurrent programming for the first time then that would be best -- for myself and any other novices as well.

If the owning thread of the ThreadPool object is the only thread that atomically writes to the destroy_ variable, and the worker threads only atomically read from the destroy_ variable, then no, a mutex is not needed to protect the destroy_ variable in the ThreadPool destructor. Typically a mutex is necessary when an atomic set of operations must take place that can't be accomplished through a single atomic instruction on a platform, (i.e., operations beyond an atomic swap, etc.). That being said, the author of the thread pool may be trying to force some type of acquire semantics on the destroy_ variable without restoring to atomic operations (i.e. a memory fence operation), and/or the setting of the flag itself is not considered an atomic operation (platform dependent)... Some other options include declaring the variable as volatile to prevent it from being cached, etc. You can see this thread for more info.
Without some sort of synchronization operation in place, the worst case scenario could end up with a worker that won't complete due to the destroy_ variable being cached on a thread. On platforms with weaker memory ordering models, that's always a possibility if you allowed a benign memory race condition to exist ...

C++ defines a data race as multiple threads potentially accessing an object simultaneously with at least one of those accesses being a write. Programs with data races have undefined behavior. If you were to write to destroy in your destructor without holding the mutex, your program would have undefined behavior and we cannot predict what would happen.
If you were to read destroy elsewhere without holding the mutex, that read could potentially happen while the destructor is writing to it which is also a data race.

Is it necessary to use a std::atomic to signal that a thread has finished execution?

I would like to check if a std::thread has finished execution. Searching stackoverflow I found the following question which addresses this issue. The accepted answer proposes having the worker thread set a variable right before exiting and having the main thread check this variable. Here is a minimal working example of such a solution:
#include <unistd.h>
#include <thread>
void work( bool* signal_finished ) {
sleep( 5 );
*signal_finished = true;
}
int main()
{
bool thread_finished = false;
std::thread worker(work, &thread_finished);
while ( !thread_finished ) {
// do some own work until the thread has finished ...
}
worker.join();
}
Someone who commented on the accepted answer claims that one cannot use a simple bool variable as a signal, the code was broken without a memory barrier and using std::atomic<bool> would be correct. My initial guess is that this is wrong and a simple bool is sufficient, but I want to make sure I'm not missing something. Does the above code need a std::atomic<bool> in order to be correct?
Let's assume the main thread and the worker are running on different CPUs in different sockets. What I think would happen is, that the main thread reads thread_finished from its CPU's cache. When the worker updates it, the cache coherency protocol takes care of writing the workers change to global memory and invalidating the main thread's CPU's cache so it has to read the updated value from global memory. Isn't the whole point of cache coherence to make code like the above just work?

Someone who commented on the accepted answer claims that one cannot use a simple bool variable as a signal, the code was broken without a memory barrier and using std::atomic would be correct.
The commenter is right: a simple bool is insufficient, because non-atomic writes from the thread that sets thread_finished to true can be re-ordered.
Consider a thread that sets a static variable x to some very important number, and then signals its exit, like this:
x = 42;
thread_finished = true;
When your main thread sees thread_finished set to true, it assumes that the worker thread has finished. However, when your main thread examines x, it may find it set to a wrong number, because the two writes above have been re-ordered.
Of course this is only a simplified example to illustrate the general problem. Using std::atomic for your thread_finished variable adds a memory barrier, making sure that all writes before it are done. This fixes the potential problem of out-of-order writes.
Another issue is that reads to non-volatile variables can be optimized out, so the main thread would never notice the change in the thread_finished flag.
Important note: making your thread_finished volatile is not going to fix the problem; in fact, volatile should not be used in conjunction with threading - it is intended for working with memory-mapped hardware.

Using a raw bool is not sufficient.
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. § 1.10 p21
Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location. § 1.10 p4
Your program contains a data race where the worker thread writes to the bool and the main thread reads from it, but there is no formal happens-before relation between the operations.
There are a number of different ways to avoid the data race, including using std::atomic<bool> with appropriate memory orderings, using a memory barrier, or replacing the bool with a condition variable.

It's not ok. Optimizer can optimize
while ( !thread_finished ) {
// do some own work until the thread has finished ...
}
to:
if(!thread_finished)
while (1) {
// do some own work until the thread has finished ...
}
assuming it can prove, that "some own work" doesn't change thread_finished.

Cache coherency algorithms are not present everywhere, nor are they perfect. The issue surrounding thread_finished is that one thread tries to write a value to it while another thread tries to read it. This is a data race, and if the accesses are not sequenced, it results in undefined behavior.

Cheapest way to wake up multiple waiting threads without blocking

I use boost::thread to manage threads. In my program i have pool of threads (workers) that are activated sometimes to do some job simultaneously.
Now i use boost::condition_variable: and all threads are waiting inside boost::condition_variable::wait() call on their own conditional_variableS objects.
Can i AVOID using mutexes in classic scheme, when i work with conditional_variables? I want to wake up threads, but don't need to pass some data to them, so don't need a mutex to be locked/unlocked during awakening process, why should i spend CPU on this (but yes, i should remember about spurious wakeups)?
The boost::condition_variable::wait() call trying to REACQUIRE the locking object when CV received the notification. But i don't need this exact facility.
What is cheapest way to awake several threads from another thread?

If you don't reacquire the locking object, how can the threads know that they are done waiting? What will tell them that? Returning from the block tells them nothing because the blocking object is stateless. It doesn't have an "unlocked" or "not blocking" state for it to return in.
You have to pass some data to them, otherwise how will they know that before they had to wait and now they don't? A condition variable is completely stateless, so any state that you need must be maintained and passed by you.
One common pattern is to use a mutex, condition variable, and a state integer. To block, do this:
Acquire the mutex.
Copy the value of the state integer.
Block on the condition variable, releasing the mutex.
If the state integer is the same as it was when you coped it, go to step 3.
Release the mutex.
To unblock all threads, do this:
Acquire the mutex.
Increment the state integer.
Broadcast the condition variable.
Release the mutex.
Notice how step 4 of the locking algorithm tests whether the thread is done waiting? Notice how this code tracks whether or not there has been an unblock since the thread decided to block? You have to do that because condition variables don't do it themselves. (And that's why you need to reacquire the locking object.)
If you try to remove the state integer, your code will behave unpredictably. Sometimes you will block too long due to missed wakeups and sometimes you won't block long enough due to spurious wakeups. Only a state integer (or similar predicate) protected by the mutex tells the threads when to wait and when to stop waiting.
Also, I haven't seen how your code uses this, but it almost always folds into logic you're already using. Why did the threads block anyway? Is it because there's no work for them to do? And when they wakeup, are they going to figure out what to do? Well, finding out that there's no work for them to do and finding out what work they do need to do will require some lock since it's shared state, right? So there almost always is already a lock you're holding when you decide to block and need to reacquire when you're done waiting.

For controlling threads doing parallel jobs, there is a nice primitive called a barrier.
A barrier is initialized with some positive integer value N representing how many threads it holds. A barrier has only a single operation: wait. When N threads call wait, the barrier releases all of them. Additionally, one of the threads is given a special return value indicating that it is the "serial thread"; that thread will be the one to do some special job, like integrating the results of the computation from the other threads.
The limitation is that a given barrier has to know the exact number of threads. It's really suitable for parallel processing type situations.
POSIX added barriers in 2003. A web search indicates that Boost has them, too.
http://www.boost.org/doc/libs/1_33_1/doc/html/barrier.html

Generally speaking, you can't.
Assuming the algorithm looks something like this:
ConditionVariable cv;
void WorkerThread()
{
for (;;)
{
cv.wait();
DoWork();
}
}
void MainThread()
{
for (;;)
{
ScheduleWork();
cv.notify_all();
}
}
NOTE: I intentionally omitted any reference to mutexes in this pseudo-code. For the purposes of this example, we'll suppose ConditionVariable does not require a mutex.
The first time through MainTnread(), work is queued and then it notifies WorkerThread() that it should execute its work. At this point two things can happen:
WorkerThread() completes DoWork() before MainThread() can complete ScheduleWork().
MainThread() completes ScheduleWork() before WorkerThread() can complete DoWork().
In case #1, WorkerThread() comes back around to sleep on the CV, and is awoken by the next cv.notify() and all is well.
In case #2, MainThread() comes back around and notifies... nobody and continues on. Meanwhile WorkerThread() eventually comes back around in its loop and waits on the CV but it is now one or more iterations behind MainThread().
This is known as a "lost wakeup". It is similar to the notorious "spurious wakeup" in that the two threads now have different ideas about how many notify()s have taken place. If you are expecting the two threads to maintain synchrony (and usually you are), you need some sort of shared synchronization primitive to control it. This is where the mutex comes in. It helps avoid lost wakeups which, arguably, are a more serious problem than the spurious variety. Either way, the effects can be serious.
UPDATE: For further rationale behind this design, see this comment by one of the original POSIX authors: https://groups.google.com/d/msg/comp.programming.threads/cpJxTPu3acc/Hw3sbptsY4sJ
Spurious wakeups are two things:
Write your program carefully, and make sure it works even if you
missed something.
Support efficient SMP implementations
There may be rare cases where an "absolutely, paranoiacally correct"
implementation of condition wakeup, given simultaneous wait and
signal/broadcast on different processors, would require additional
synchronization that would slow down ALL condition variable operations
while providing no benefit in 99.99999% of all calls. Is it worth the
overhead? No way!
But, really, that's an excuse because we wanted to force people to
write safe code. (Yes, that's the truth.)

boost::condition_variable::notify_*(lock) does NOT require that the caller hold the lock on the mutex. THis is a nice improvement over the Java model in that it decouples the notification of threads with the holding of the lock.
Strictly speaking, this means the following pointless code SHOULD DO what you are asking:
lock_guard lock(mutex);
// Do something
cv.wait(lock);
// Do something else
unique_lock otherLock(mutex);
//do something
otherLock.unlock();
cv.notify_one();
I do not believe you need to call otherLock.lock() first.

C++ Thread question - setting a value to indicate the thread has finished

Is the following safe?
I am new to threading and I want to delegate a time consuming process to a separate thread in my C++ program.
Using the boost libraries I have written code something like this:
thrd = new boost::thread(boost::bind(&myclass::mymethod, this, &finished_flag);
Where finished_flag is a boolean member of my class. When the thread is finished it sets the value and the main loop of my program checks for a change in that value.
I assume that this is okay because I only ever start one thread, and that thread is the only thing that changes the value (except for when it is initialised before I start the thread)
So is this okay, or am I missing something, and need to use locks and mutexes, etc

You never mentioned the type of finished_flag...
If it's a straight bool, then it might work, but it's certainly bad practice, for several reasons. First, some compilers will cache the reads of the finished_flag variable, since the compiler doesn't always pick up the fact that it's being written to by another thread. You can get around this by declaring the bool volatile, but that's taking us in the wrong direction. Even if reads and writes are happening as you'd expect, there's nothing to stop the OS scheduler from interleaving the two threads half way through a read / write. That might not be such a problem here where you have one read and one write op in separate threads, but it's a good idea to start as you mean to carry on.
If, on the other hand it's a thread-safe type, like a CEvent in MFC (or equivilent in boost) then you should be fine. This is the best approach: use thread-safe synchronization objects for inter-thread communication, even for simple flags.

Instead of using a member variable to signal that the thread is done, why not use a condition? You are already are using the boost libraries, and condition is part of the thread library.
Check it out. It allows the worker thread to 'signal' that is has finished, and the main thread can check during execution if the condition has been signaled and then do whatever it needs to do with the completed work. There are examples in the link.
As a general case I would neve make the assumption that a resource will only be modified by the thread. You might know what it is for, however someone else might not - causing no ends of grief as the main thread thinks that the work is done and tries to access data that is not correct! It might even delete it while the worker thread is still using it, and causing the app to crash. Using a condition will help this.
Looking at the thread documentation, you could also call thread.timed_join in the main thread. timed_join will wait for a specified amount for the thread to 'join' (join means that the thread has finsihed)

I don't mean to be presumptive, but it seems like the purpose of your finished_flag variable is to pause the main thread (at some point) until the thread thrd has completed.
The easiest way to do this is to use boost::thread::join
// launch the thread...
thrd = new boost::thread(boost::bind(&myclass::mymethod, this, &finished_flag);
// ... do other things maybe ...
// wait for the thread to complete
thrd.join();

If you really want to get into the details of communication between threads via shared memory, even declaring a variable volatile won't be enough, even if the compiler does use appropriate access semantics to ensure that it won't get a stale version of data after checking the flag. The CPU can issue reads and writes out of order as long (x86 usually doesn't, but PPC definitely does) and there is nothing in C++9x that allows the compiler to generate code to order memory accesses appropriately.
Herb Sutter's Effective Concurrency series has an extremely in depth look at how the C++ world intersects the multicore/multiprocessor world.

Having the thread set a flag (or signal an event) before it exits is a race condition. The thread has not necessarily returned to the OS yet, and may still be executing.
For example, consider a program that loads a dynamic library (pseudocode):
lib = loadLibrary("someLibrary");
fun = getFunction("someFunction");
fun();
unloadLibrary(lib);
And let's suppose that this library uses your thread:
void someFunction() {
volatile bool finished_flag = false;
thrd = new boost::thread(boost::bind(&myclass::mymethod, this, &finished_flag);
while(!finished_flag) { // ignore the polling loop, it's besides the point
sleep();
}
delete thrd;
}
void myclass::mymethod() {
// do stuff
finished_flag = true;
}
When myclass::mymethod() sets finished_flag to true, myclass::mymethod() hasn't returned yet. At the very least, it still has to execute a "return" instruction of some sort (if not much more: destructors, exception handler management, etc.). If the thread executing myclass::mymethod() gets pre-empted before that point, someFunction() will return to the calling program, and the calling program will unload the library. When the thread executing myclass::mymethod() gets scheduled to run again, the address containing the "return" instruction is no longer valid, and the program crashes.
The solution would be for someFunction() to call thrd->join() before returning. This would ensure that the thread has returned to the OS and is no longer executing.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js