Are there any implicit memory barriers in C++?

In the following code, is using atomics necessary to guarantee race-free semantics on all platforms, or does the use of promise.set_value()/future.wait() imply some kind of implicit memory barrier that would allow me to rely on the flag write having become visible to the outer thread?
std::atomic_bool flag{false}; // <- does this need to be atomic?

runInThreadPoolBlocking([&]() {
    // do something
    flag.store(true);
});

if (flag.load()) // do something
// simplified runInThreadPoolBlocking implementation
template <typename Callable>
void runInThreadPoolBlocking(Callable func)
{
    std::promise<void> prom;
    auto fut = prom.get_future();

    enqueueToThreadPool([&]() {
        func();
        prom.set_value();
    });

    fut.get();
}
In general, are there any "implicit" memory barriers guaranteed by the standard for things like thread.join() or futures?

thread.join() and promise.set_value()/future.wait() are guaranteed to imply memory barriers.
Using atomic_bool is necessary if you don't want the compiler to reorder the boolean check or assignment with other code. But in this particular case a non-atomic bool will do: the flag is guaranteed to be true at the moment of the check (provided you don't use it anywhere else), because the assignment and the check sit on opposite sides of a synchronisation point, fut.get() (which forces the compiler to load the real flag value), and runInThreadPoolBlocking() is guaranteed to finish only after the lambda has executed.
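A minimal sketch (ours, not the asker's code) of that non-atomic variant, relying only on the fut.get() synchronisation inside runInThreadPoolBlocking() as defined in the question:

bool flag = false; // plain bool: safe here only because fut.get() synchronises

runInThreadPoolBlocking([&]() {
    // do something
    flag = true; // sequenced before prom.set_value() inside the pool thread
});

if (flag) { /* guaranteed to read true here */ }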
Quoting from cplusplus.com for future::get(), for example:
Data races
The future object is modified. The shared state is accessed as an
atomic operation (causing no data races).
The same is for promise::set_value(). Besides other stuff
... atomic operation (causing no data races) ...
means that no two conflicting evaluations are left unordered with respect to each other (strict memory ordering).
The same goes for all std:: multithreading synchronisation primitives and tools where you expect some operations to occur only before or after the synchronisation point (such as std::mutex::lock()/unlock(), thread::join(), etc.).
Note that any operations on the thread object itself are not synchronized with thread::join() (unlike the operations within the thread it represents).
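For illustration, a minimal self-contained sketch (our own, not from the question) of the thread::join() guarantee:

#include <cassert>
#include <thread>

int main() {
    int x = 0;                      // plain int, no atomics
    std::thread t([&] { x = 42; }); // write performed inside the thread
    t.join();                       // completion of t synchronises-with join()'s return
    assert(x == 42);                // guaranteed: the write happens-before this read
}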

std::atomic_bool flag{false}; // <- does this need to be atomic?
Yes.
The call:
prom.get_future()
returns a std::future<void> object.
For the future, the reference says the following:
The class template std::future provides a mechanism to access the result of asynchronous operations:

- An asynchronous operation (created via std::async, std::packaged_task, or std::promise) can provide a std::future object to the creator of that asynchronous operation.
- The creator of the asynchronous operation can then use a variety of methods to query, wait for, or extract a value from the std::future. These methods may block if the asynchronous operation has not yet provided a value.
- When the asynchronous operation is ready to send a result to the creator, it can do so by modifying shared state (e.g. std::promise::set_value) that is linked to the creator's std::future.

Note that std::future references shared state that is not shared with any other asynchronous return objects (as opposed to std::shared_future).
You don't store a 'return' value here, so that point is somewhat moot; and since there are no other guarantees (the whole idea is that the threads may run in parallel anyway!), you need to keep the bool atomic if it is shared.
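As a hedged aside: if you control runInThreadPoolBlocking(), another way out is to ship the result through the future itself instead of a shared flag. A sketch for non-void callables, assuming the same enqueueToThreadPool() as in the question (the name runInThreadPoolBlockingWithResult is ours):

template <typename Callable>
auto runInThreadPoolBlockingWithResult(Callable func) {
    std::promise<decltype(func())> prom;
    auto fut = prom.get_future();

    enqueueToThreadPool([&]() {
        prom.set_value(func()); // the result travels through the shared state
    });

    return fut.get(); // synchronises-with set_value(); no extra flag needed
}

// usage:
// bool done = runInThreadPoolBlockingWithResult([] { /* do something */ return true; });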

Related

Is c++ std::future::wait() always cache-safe? [duplicate]

If I do the following:
std::promise<void> p;
int a = 1;

std::thread t([&] {
    a = 2;
    p.set_value();
});

p.get_future().wait();
// Is the value of `a` guaranteed to be 2 here?
cppreference has this to say about set_value(), but I am not sure what it means:
Calls to this function do not introduce data races with calls to get_future (but they need not synchronize with each other).
Do set_value() and wait() provide an acquire/release synchronization (or some other form)?
From my reading I believe a is guaranteed to be 2 at the end. Notice the information about the promise itself (emphasis mine):
The promise is the "push" end of the promise-future communication channel: the operation that stores a value in the shared state synchronizes-with (as defined in std::memory_order) the successful return from any function that is waiting on the shared state (such as std::future::get). Concurrent access to the same shared state may conflict otherwise: for example multiple callers of std::shared_future::get must either all be read-only or provide external synchronization.
I encourage you to read up on what it means for one operation to synchronize-with another. For this situation it means that set_value inter-thread happens-before the return from wait(), and therefore the write to a is a visible side effect. You can find more here.
What does your quote about get_future mean? Simply that you can safely call get_future and set_value from different threads and nothing will break; but it does not necessarily introduce any memory fences by itself. The only synchronization points that are sure and safe are set_value on the std::promise side and get()/wait() on the std::future side.
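To make that concrete, a compact runnable sketch (ours) of the guarantee being claimed:

#include <cassert>
#include <future>
#include <thread>

int main() {
    std::promise<void> p;
    int a = 1;

    std::thread t([&] {
        a = 2;         // plain write, sequenced before...
        p.set_value(); // ...the store into the shared state
    });

    p.get_future().wait(); // synchronizes-with set_value()
    assert(a == 2);        // guaranteed; no data race
    t.join();
}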

How to synchronize threads/CPUs without mutexes if sequence of access is known to be safe?

Consider the following:
// these services are running on different threads that were started a long time ago
std::vector<io_service*> io_services; // note: an actual vector of references is ill-formed

struct A {
    std::unique_ptr<Object> u;
} a;

io_services[0]->post([&io_services, &a] {
    std::unique_ptr<Object> o{new Object};
    a.u = std::move(o);

    io_services[1]->post([&a] {
        // as far as I know, changes to `u` aren't guaranteed to be seen in this thread
        a.u->...;
    });
});
The actual code passes a struct to a bunch of different boost::asio::io_service objects, and each field of the struct is filled in by a different service object (the struct is never accessed from different io_service objects/threads at the same time; it is passed between the services by reference until the process is done).
As far as I know, I always need some kind of explicit synchronization/memory flushing when I pass anything between threads, even if there is no read/write race (as in simultaneous access). What is the correct way of doing that in this case?
Note that Object does not belong to me and is not trivially copyable or movable. I could use a std::atomic<Object*> (if I am not mistaken), but I would rather use the smart pointer. Is there a way to do that?
Edit:
It seems like std::atomic_thread_fence is the tool for the job, but I cannot really wrap my head around the 'memory model' concepts well enough to use it safely.
My understanding is that the following lines are needed for this code to work correctly. Is that really the case?
// these services are running on different threads that were started a long time ago
std::vector<io_service*> io_services; // note: an actual vector of references is ill-formed

struct A {
    std::unique_ptr<Object> u;
} a;

io_services[0]->post([&io_services, &a] {
    std::unique_ptr<Object> o{new Object};
    a.u = std::move(o);
    std::atomic_thread_fence(std::memory_order_release);

    io_services[1]->post([&a] {
        std::atomic_thread_fence(std::memory_order_acquire);
        a.u->...;
    });
});
Synchronisation is only needed where there would otherwise be a data race, and a data race requires unsequenced conflicting access by different threads.
You have no such unsequenced access here. t.join() guarantees that all statements that follow it are sequenced strictly after all statements that ran as part of t, so no synchronisation is required.
ELABORATION: (To explain why thread::join has the properties claimed above.) First, the description of thread::join from the standard, [thread.thread.member]:
void join();

Requires: joinable() is true.

Effects: Blocks until the thread represented by *this has completed.

Synchronization: The completion of the thread represented by *this synchronizes with (1.10) the corresponding successful join() return.
(a) The above shows that join() provides synchronisation (specifically: the completion of the thread represented by *this synchronises with the outer thread's successful return from join()). Next, [intro.multithread]:
An evaluation A inter-thread happens before an evaluation B if
(13.1) — A synchronizes with B, or ...
Which shows that, because of (a), the completion of t inter-thread happens before the return of the join() call.
Finally, [intro.multithread]:
Two actions are potentially concurrent if
(23.1) — they are performed by different threads, or
(23.2) — they are unsequenced, and at least one is performed by a signal handler.

The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other ...
The above describes the conditions required for a data race. The situation with t.join() does not meet them because, as shown, the completion of t does in fact happen-before the return of join().
So there is no data race, and all of the data accesses are guaranteed to have well-defined behaviour.
(I'd like to remark that you appear to have changed your question in some significant way since @Smeeheey answered it; essentially, he answered your originally-worded question but cannot get credit for it since you asked two different questions. This is poor form – in the future, please just post a new question so the original answerer can get the credit due.)
If multiple threads read/write a variable, even if you know said variable is accessed in a defined sequence, you must still inform the compiler of that. The correct way to do this necessarily involves synchronization, atomics, or something documented to perform one of those itself (such as std::thread::join). Presuming the synchronization route is both obvious in implementation and undesirable:
Addressing this with atomics may simply consist of std::atomic_thread_fence; however, an acquire fence in C++ cannot synchronize-with a release fence alone – an actual atomic object must be modified. Consequently, if you want to use fences alone you'll need to specify std::memory_order_seq_cst; with that done, your code will otherwise work as shown.
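For reference, a sketch (with hypothetical names of our own) of release/acquire fences synchronising through an atomic object, which is the requirement described above:

#include <atomic>

std::atomic<bool> ready{false}; // the atomic object the fences synchronize through
int payload = 0;                // plain, non-atomic data

void producer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release); // orders payload before the store
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { }   // spin until the store is visible
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
    // payload is now guaranteed to read 42
}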
If you want to stick with release/acquire semantics, fortunately even the simplest atomic will do – std::atomic_flag:
std::vector<io_service*> io_services;

struct A {
    std::unique_ptr<Object> u;
} a;

std::atomic_flag a_initialized = ATOMIC_FLAG_INIT;

io_services[0]->post([&io_services, &a, &a_initialized] {
    std::unique_ptr<Object> o{new Object};
    a.u = std::move(o); // must be sequenced before the release head so it happens-before the consumer's read
    a_initialized.clear(std::memory_order_release); // initiates release sequence (RS)
    a_initialized.test_and_set(std::memory_order_relaxed); // continues RS

    io_services[1]->post([&a, &a_initialized] {
        while (!a_initialized.test_and_set(std::memory_order_acquire)) ; // completes RS
        a.u->...;
    });
});
For information on release sequences, see here.

Safely Destroying a Thread Pool

Consider the following implementation of a trivial thread pool written in C++14.
threadpool.h
threadpool.cpp
Observe that each thread sleeps until it has been notified to awaken -- or receives a spurious wake-up call -- and the following predicate evaluates to true:
std::unique_lock<mutex> lock(this->instance_mutex_);

this->cond_handle_task_.wait(lock, [this] {
    return (this->destroy_ || !this->tasks_.empty());
});
Furthermore, observe that a ThreadPool object uses the data member destroy_ to determine whether it is being destroyed -- that is, whether the destructor has been called. Toggling this data member to true notifies each worker thread that it is time to finish its current task and any other queued tasks, then synchronize with the thread that is destroying this object; it also prohibits the enqueue member function.
For your convenience, the implementation of the destructor is below:
ThreadPool::~ThreadPool() {
    {
        std::lock_guard<mutex> lock(this->instance_mutex_); // this line.
        this->destroy_ = true;
    }

    this->cond_handle_task_.notify_all();

    for (auto &worker : this->workers_) {
        worker.join();
    }
}
Q: I do not understand why it is necessary to lock the object's mutex while toggling destroy_ to true in the destructor. Furthermore, is the lock only necessary for setting its value, or is it also necessary for accessing its value?
BQ: Can this thread pool implementation be improved or optimized while maintaining its original purpose: a thread pool that can pool N threads and distribute tasks to them to be executed concurrently?
This thread pool implementation is forked from Jakob Progsch's C++11 thread pool repository with a thorough code step through to understand the purpose behind its implementation and some subjective style changes.
I am introducing myself to concurrent programming and there is still much to learn -- I am a novice concurrent programmer as it stands right now. If my questions are not worded correctly then please make the appropriate correction(s) in your provided answer. Moreover, if the answer can be geared towards a client who is being introduced to concurrent programming for the first time then that would be best -- for myself and any other novices as well.
If the owning thread of the ThreadPool object is the only thread that atomically writes to the destroy_ variable, and the worker threads only atomically read from it, then no, a mutex is not needed to protect destroy_ in the ThreadPool destructor. Typically a mutex is necessary when an atomic set of operations must take place that can't be accomplished through a single atomic instruction on a platform (i.e., operations beyond an atomic swap, etc.). That being said, the author of the thread pool may be trying to force some type of acquire semantics on the destroy_ variable without resorting to atomic operations (i.e., a memory fence operation), and/or the setting of the flag itself may not be an atomic operation (this is platform-dependent). Some other options include declaring the variable volatile to prevent it from being cached, etc. You can see this thread for more info.
Without some sort of synchronization operation in place, the worst-case scenario could be a worker that never completes because the destroy_ variable is cached on its thread. On platforms with weaker memory-ordering models, that is always a possibility if you allow a benign memory race condition to exist ...
C++ defines a data race as multiple threads potentially accessing an object simultaneously with at least one of those accesses being a write. Programs with data races have undefined behavior. If you were to write to destroy_ in your destructor without holding the mutex, your program would have undefined behavior and we could not predict what would happen.
If you were to read destroy_ elsewhere without holding the mutex, that read could happen while the destructor is writing to it, which is also a data race.
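To make the pairing concrete, here is a minimal sketch (our names, not the pool's) of why the flag is written under the mutex: the condition variable's predicate is evaluated under the same mutex, so the write cannot slip into the window between a worker checking the predicate and blocking, which is what would make the notification go missing.

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool destroy = false; // stands in for destroy_

void worker() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return destroy; }); // predicate is evaluated under the mutex
    // ... shut down ...
}

void destroyer() {
    {
        std::lock_guard<std::mutex> lock(m); // cannot interleave with the predicate check
        destroy = true;
    }
    cv.notify_all(); // any waiter either saw destroy == true or receives this wakeup
}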

Avoiding data race of boolean variables with pthreads

In my code I have the following structure:
Parent thread
somedatatype thread1_continue, thread2_continue; // Does bool guarantee no data race?
Thread 1:
while (thread1_continue) {
    // Do some work
}

Thread 2:

while (thread2_continue) {
    // Do some work
}
So I wonder which data type thread1_continue and thread2_continue should be to avoid a data race, and whether there is any data type or technique in pthreads that solves this problem.
There is no built-in basic type that guarantees thread safety, no matter how small. Even if you are working with bool or unsigned char, neither reading nor writing is guaranteed to be atomic. In other words, there is a chance that if multiple threads work independently with the same memory, one thread could overwrite that memory only partially while another reads a trash value -- in that case the behavior is undefined.
You could use a mutex to wrap the critical section with lock and unlock calls to ensure mutual exclusion -- only one thread at a time will be able to execute that code. For more sophisticated synchronization there are semaphores, condition variables, and even patterns/idioms that describe how synchronization can be handled using these (lightswitch, turnstile, etc.). Just study more about them; some simple examples can be found here :)
Note that there might be some more complex types / wrappers available that wrap the way the object is being accessed - such as std::atomic template in C++11, which does nothing but internally handles the synchronization for you so that you don't need to do that explicitly. With std::atomic there is a guarantee that: "if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined".
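In C++11 terms, the loops from the question could use std::atomic<bool> directly; a sketch reusing the question's name thread1_continue:

#include <atomic>
#include <thread>

std::atomic<bool> thread1_continue{true};

void thread1_work() {
    while (thread1_continue.load()) { // atomic read: well-defined even during a concurrent write
        // Do some work
    }
}

int main() {
    std::thread t(thread1_work);
    // ...
    thread1_continue.store(false); // atomic write: the worker will eventually observe it
    t.join();
}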
For booleans (and other types), be sure to avoid the following pattern:
thread 1 loop
{
    do actions1;
    myFlag = true;
    do more1;
}

thread 2 loop
{
    do actions2;
    if (myFlag)
    {
        myFlag = false;
        do flagged actions;
    }
    do more2;
}
This nearly always works, until myFlag is set by thread 1 while thread 2 is in between checking and resetting myFlag. There are CPU-dependent primitives to handle test-and-set, but the normal solution is to lock when accessing shared resources, even booleans.
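With C++11 atomics, the check-and-reset can also be collapsed into a single atomic step, closing exactly the window described above; a sketch (our function name) using exchange:

#include <atomic>

std::atomic<bool> myFlag{false};

void thread2_iteration() {
    // do actions2;
    if (myFlag.exchange(false)) { // read and clear the flag in one atomic step
        // do flagged actions;
    }
    // do more2;
}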