C++20 coroutines and happens-before relation

Do C++20 coroutines provide some guarantees about synchronizes-with relation?
I would hope for something more or less like "suspension synchronizes-with resumption", but that's probably too much. So far I have only been able to find this statement on cppreference.com:
Note that because the coroutine is fully suspended before entering awaiter.await_suspend(), that function is free to transfer the coroutine handle across threads, with no additional synchronization. For example, it can put it inside a callback, scheduled to run on a threadpool when async I/O operation completes.[...]
I also found this in [coroutine.handle.resumption]
Resuming a coroutine via resume, operator(), or destroy on an execution agent other than the one on which it was suspended has implementation-defined behavior unless each execution agent either is an instance of std::thread or std::jthread, or is the thread that executes main.
[ Note: A coroutine that is resumed on a different execution agent should avoid relying on consistent thread identity throughout, such as holding a mutex object across a suspend point.
— end note
]
[ Note: A concurrent resumption of the coroutine may result in a data race.
— end note
]
I also realize that the linked documentation of handle.resume() does not say that it happens-after some other operation. Still, the quoted statements seem to indicate that there are some synchronization guarantees when we pass the handle between different std::threads. So what exactly is guaranteed? [Links to the relevant fragments of the standard for further reading would be deeply appreciated.]
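For concreteness, here is a minimal sketch (my own illustration, not from any of the quoted sources) of the scenario in question: an awaiter whose await_suspend hands the handle to a freshly created thread, which then resumes the coroutine there.

#include <chrono>
#include <coroutine>
#include <thread>

struct task {
    struct promise_type {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

struct resume_on_new_thread {
    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) {
        // The coroutine is fully suspended at this point, so per the
        // cppreference note the handle may cross threads freely.
        std::thread([h] { h.resume(); }).detach();
    }
    void await_resume() const noexcept {}
};

task demo() {
    // runs on the caller's thread
    co_await resume_on_new_thread{};
    // resumed on the new thread; the question is what, if anything,
    // is guaranteed to happen-before this point
}

int main() {
    demo();
    std::this_thread::sleep_for(std::chrono::milliseconds(100)); // crude: let the detached thread finish
}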

Related

C++20: How is the returning from atomic::wait() guaranteed by the standard?

This is a language-lawyer question.
First of all, is the a.wait() in the following code guaranteed to return?
#include <atomic>
#include <thread>

std::atomic_int a{ 0 };

void f()
{
    a.store(1, std::memory_order_relaxed);
    a.notify_one();
}

int main()
{
    std::thread thread(f);
    a.wait(0, std::memory_order_relaxed); // always returns?
    thread.join();
}
I believe the standard's intention is that a.wait() always gets to return. (Otherwise atomic::wait/notify would be useless, wouldn't it?) But I think the current standard text cannot guarantee this.
The relevant part of the standard is in §31.6 [atomics.wait] paragraph 4:
A call to an atomic waiting operation on an atomic object M is eligible to be unblocked by a call to an atomic notifying operation on M if there exist side effects X and Y on M such that:
(4.1) — the atomic waiting operation has blocked after observing the result of X,
(4.2) — X precedes Y in the modification order of M, and
(4.3) — Y happens before the call to the atomic notifying operation.
and §31.8.2 [atomics.types.operations] paragraphs 29-33:
void wait(T old, memory_order order = memory_order::seq_cst) const volatile noexcept;
void wait(T old, memory_order order = memory_order::seq_cst) const noexcept;
Effects: Repeatedly performs the following steps, in order:
(30.1) — Evaluates load(order) and compares its value representation for equality against that of old.
(30.2) — If they compare unequal, returns.
(30.3) — Blocks until it is unblocked by an atomic notifying operation or is unblocked spuriously.
void notify_one() volatile noexcept;
void notify_one() noexcept;
Effects: Unblocks the execution of at least one atomic waiting operation that is eligible to be unblocked (31.6) by this call, if any such atomic waiting operations exist.
With the above wording, I see two problems:
If the wait() thread saw the value in step (30.1), compared it equal to old in step (30.2), and got scheduled out; then in another thread notify_one() stepped in and saw no blocking thread, doing nothing; the subsequent blocking in step (30.3) would never be unblocked. Here isn't it necessary for the standard to say "wait() function atomically performs the evaluation-compare-block operation", similar to what is said about condition_variable::wait()?
There's no synchronization between notify_*() and unblocking of wait(). If in step (30.3), the thread was unblocked by an atomic notifying operation, it would repeat step (30.1) to evaluate load(order). Here there is nothing preventing it from getting the old value. (Or is there?) Then it would block again. Now no one would wake it.
Is the above concern just nit-picking, or defect of the standard?
#1 is pretty much addressed by C++20 thread possibly waiting on std::atomic forever. The wait() operation is clearly eligible to be unblocked by the notify(), and is the only such operation, so the notify() must unblock it. The eligible "wait operation" is the entire call, not only step 30.3.
If an implementation performs steps 30.1-3 in a non-atomic fashion, such that the notify can happen "between" steps 1 and 3, then it has to somehow ensure that step 3 unblocks anyway.
#2 is stickier. At this point I think you are right: the standard doesn't guarantee that the second load gets the value 1; and if it doesn't, then it will presumably block again and never be woken up.
The use of the relaxed memory ordering makes it pretty clear in this example. If we wanted to prove that the second load must see 1, the only way I can see is to invoke write-read coherence (intro.races p18) which requires that we prove the store happens before the load, in the sense of intro.races p10. This in turn requires that somewhere along the way, we have some operation in one thread that synchronizes with some operation in the other (you can't get inter-thread happens before without a synchronizes with, unless there are consume operations which is not the case here). Usually you get synchronizes with from a pairing of an acquire load with a release store (atomics.order p2), and here we have no such thing; nor, as far as I can tell, anything else that would synchronize. So we don't have a proof.
In fact, I think the problem persists even if we upgrade to seq_cst operations. We could then have both loads coherence-ordered before the store, and the total order S of atomics.order p4 would go "first load, second load, store". I don't see that contradicting anything. We would still have to show a synchronizes with to rule this out, and again we can't. There might appear to be a better chance than in the relaxed case, since seq_cst loads and stores are acquire and release respectively. But the only way to use this would be if one of the loads were to take its value from the store, i.e. if one of the loads were to return 1, and we are assuming that is not the case. So again this undesired behavior seems consistent with all the rules.
It does make you wonder if the Standard authors meant to require the notification to synchronize with the unblocking. That would fix the problem, and I would guess real-life implementations already include the necessary barriers.
But indeed, I am not seeing this specified anywhere.
The only possible way out that I can see is that "eligible to be unblocked" applies to the entire wait operation, not just to a single iteration of it. But it seems clear that the intent was that if you are unblocked by a notify and the value has not changed, then you block again until a second notify occurs (or spurious wakeup).
It's starting to look to me like a defect.
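To spell out the problematic interleaving in #2 (my own annotation; step numbers refer to the (30.x) steps quoted above):

waiting thread                         notifying thread
--------------                         ----------------
30.1: load(relaxed) reads 0
30.2: equal to old, so continue
30.3: block
                                       a.store(1, relaxed)
                                       a.notify_one()   // unblocks step 30.3
30.1: load(relaxed) reads 0 again      // nothing quoted above forbids the
                                       //   stale value: no release/acquire
                                       //   pairing applies
30.3: block again                      // no further notify ever comes: hang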
1. wait atomicity: Since the wait on M "is eligible to be unblocked by a call to an atomic notifying operation" in the situation you described (after the 30.2 logical step but before 30.3), the implementation has to comply.
If there is "a waiting operation that is eligible to be unblocked", then notify_one has to unblock at least one wait call - not its internal block - regardless of whether that call has reached step (30.3) or not.
So the implementation must ensure that the notification is delivered and that steps (30.1)...(30.3) are repeated in the described case.
2. store/notify order: First clarify the terms the standard uses:
"the value of an atomic object M" is used to refer the underlying T object.
"the value pointed to by this" is used to refer the underlying T object too. It is misleading I think, as the internal representation of the std::atomic<T> object is not required to be identical with a T object, so it refers something like - just an example -: this->m_underlying_data (because this can have more members, as sizeof(T) != sizeof(std::atomic<T>) can be true, cannot refer simply *this)
"an atomic object M" term is used to refer the whole std::atomic<T> typed object, the real *this. As explained in this thread.
It is guaranteed by the standard that, even with relaxed ordering, the sequence of modifications on the same atomic object has to be consistent among different threads:
C++20 standard 6.9.2.1 (19):
[Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note]
As the standard doesn't say how atomic has to be implemented, notify could modify the atomic object. This is why it is not a const function, and I think the following applies:
C++20 standard 6.9.2.1 (15):
If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M.
This is why I believe the following statement is wrong:
Here there is nothing preventing it from getting the old value.
As the store and notify_one operate on the same atomic object (and modify it), their order has to be preserved. So it is guaranteed that there will be a notify_one after the value is 1.
This question was asked on std-discussions, so I'll repost my answer from that discussion here.
If the wait() thread saw the value in step (30.1), compared it equal to old in step (30.2), and got scheduled out; then in another thread notify_one() stepped in and saw no blocking thread, doing nothing; the subsequent blocking in step (30.3) would never be unblocked. Here isn't it necessary for the standard to say "wait() function atomically performs the evaluation-compare-block operation", similar to what is said about condition_variable::wait()?
The operation is called an "atomic waiting operation". I agree, this can be read as a "waiting operation on an atomic", but practically speaking, it is clear that the steps described in the operation description need to be performed atomically (or have the effect of being atomic). You could argue that the standard wording could be better, but I don't see that as a logical error in the standard.
There's no synchronization between notify_*() and unblocking of wait(). If in step (30.3), the thread was unblocked by an atomic notifying operation, it would repeat step (30.1) to evaluate load(order). Here there is nothing preventing it from getting the old value. (Or is there?) Then it would block again. Now no one would wake it.
There is no synchronization between notify_* and wait because such synchronization would be redundant.
The synchronizes-with relation is used to determine the order of events happening in different threads, including events on different objects. It is used in definition of inter-thread happens before and, by induction, of happens before. In short, "a release operation on an atomic in one thread synchronizes with an acquire operation on the atomic in another thread" means that effects on objects (including other than the atomic) that were made prior to the release operation will be observable by operations that are performed following the acquire operation.
This release-acquire memory ordering semantics is redundant and irrelevant for the notify_* and wait operations because you already can achieve the synchronizes-with relation by performing a store(release) in the notifying thread and wait(acquire) in the waiting thread. There is no reasonable use case for notify_* without a prior store or read-modify-write operation. In fact, as you quoted yourself, an atomic waiting operation is eligible to be unblocked by an atomic notifying operation only if there was a side effect on the atomic that was not first observed by the waiting operation.
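To make that concrete, here is a minimal sketch (my own illustration, assuming C++20 atomic wait/notify) of the release/acquire pairing the answer describes; payload stands in for arbitrary non-atomic data:

#include <atomic>
#include <thread>

std::atomic_int ready{0};
int payload = 0; // non-atomic data published via the release store

void producer()
{
    payload = 42;                              // A: plain write
    ready.store(1, std::memory_order_release); // release store publishes A
    ready.notify_one();                        // wakes the waiter; adds no ordering itself
}

void consumer()
{
    ready.wait(0, std::memory_order_acquire);  // returns once it observes 1
    // The acquire load that saw 1 synchronizes with the release store,
    // so this read of payload is data-race-free and must see 42.
    int v = payload;
    (void)v;
}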
What gives the guarantee that the load in the wait observes the effect of the store in the notifying thread is the sequential execution guarantee in the notifying thread. The store call is sequenced before the notify_one call because those are two full expressions. Consequently, in the waiting thread, store happens before notify_one (following from the definitions of happens before and inter-thread happens before). This means that the waiting thread will observe the effects of store and notify_one in that order, never in reverse.
So, to recap, there are two scenarios possible:
The notifying thread stores the new value to the atomic first, and the waiting thread observes that store on entry into wait. The waiting thread does not block, and the notify_one call is ignored.
The waiting thread does not observe the store and blocks. The notifying thread issues store and notify_one, in that order. At some point, the waiting thread observes the notify_one (due to the eligible to be unblocked relation between notify_one and wait). At this point, it must also observe the effect of store (due to the happens before relation between store and notify_one) and return.
If the waiting thread observed notify_one but not store, it would violate the happens before relation between those operations, and therefore it is not possible.
Update 2022-07-28:
As was pointed out in the comments by Broothy, the happens before relation between store and notify_one may be guaranteed only in the notifying thread, but not in the waiting thread. Indeed, the standard is unclear in this regard: it requires the happens before relation, but it doesn't specify in which thread this relation has to be observed. My answer above assumed that the relation has to be maintained in every thread, but this may not be the correct interpretation of the standard. So, in the end, I agree the standard needs to be clarified in this regard. Specifically, it needs to require that store inter-thread happens before notify_one.

Is there a way to check if std::future state is ready in a guaranteed wait-free manner?

I know that I can check the state of the std::future the following way:
my_future.wait_for(std::chrono::seconds(0)) == std::future_status::ready
But according to cppreference.com std::future::wait_for may block in some cases:
This function may block for longer than timeout_duration due to scheduling or resource contention delays.
Is it still the case when timeout_duration is 0? If so, is there another way to query the state in a guaranteed wait-free manner?
The quote from cppreference is simply there to remind you that the OS scheduler is a factor here and that other tasks requiring platform resources could be using the CPU-time your thread needs in order to return from wait_for() -- regardless of the specified timeout duration being zero or not. That's all. You cannot technically be guaranteed to get more than that on a non-realtime platform. As such, the C++ Standard says nothing about this, but you can see other interesting stuff there -- see the paragraph for wait_for() under [futures.unique_future¶21]:
Effects: None if the shared state contains a deferred function ([futures.async]), otherwise blocks until the shared state is ready or until the relative timeout ([thread.req.timing]) specified by rel_time has expired.
No such mention of the additional delay here, but it does say that you are blocked, and it remains implementation dependent whether wait_for() is yield()ing the thread1 first thing upon such blocking or immediately returning if the timeout duration is zero. In addition, it might also be necessary for an implementation to synchronize access to the future's status in a locking manner, which would have to be applied prior to checking if a potential immediate return is to take place. Hence, you don't even have the guarantee for lock-freedom here, let alone wait-freedom.
Note that the same applies for calling wait_until with a time in the past.
Is it still the case when timeout_duration is 0? If so, is there another way to query the state in a guaranteed wait-free manner?
So yes, the implementation of wait_for() notwithstanding, this is still the case. As such, this is the closest to wait-free you're going to get for checking the state.
1 In simple terms, this means "releasing" the CPU and putting your thread at the back of the scheduler's queue, giving other threads some CPU-time.
To answer your second question, there is currently no way to check if the future is ready other than waiting. We will likely get this at some point: https://en.cppreference.com/w/cpp/experimental/future/is_ready. If your runtime library supports the concurrency extensions and you don't mind using experimental in your code, then you can use is_ready() now. That being said, I know of few cases where you must check a future's state. Are you sure it's necessary?
Is it still the case when timeout_duration is 0 ?
Yes. That's true for any operation. The OS scheduler could pause the thread (or the whole process) to allow another thread to run on the same CPU.
If so, is there another way to query the state in a guaranteed wait-free manner ?
No. Using a zero timeout is the correct way.
There's not even a guarantee that the shared state of a std::future doesn't lock a mutex to check if it's ready, so it would be impossible to guarantee it was wait-free.
For GCC's implementation the ready flag is an atomic so there's no mutex lock needed, and if it's ready then wait_for returns immediately. If it's not ready then there are some more atomic operations and then a check to see if the timeout has passed already, then a system call. So for a zero timeout there are just some atomic loads and function calls (no system call).
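For completeness, the zero-timeout idiom from the question is often wrapped in a small helper like the following (a sketch, not a standard facility; note that a deferred future reports future_status::deferred, not ready):

#include <chrono>
#include <future>

template <typename T>
bool is_ready(const std::future<T>& f)
{
    // Polls without waiting; not wait-free in any guaranteed sense,
    // for the reasons discussed above.
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}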

Can long-running std::asyncs starve other std::asyncs?

As I understand it, usual implementations of std::async schedule these jobs on threads from a pre-allocated thread pool.
So let's say I first create and schedule enough long-running std::asyncs to keep all threads from that thread pool occupied. Directly afterwards (long before they have finished executing) I also create and schedule some short-running std::asyncs. Could it happen that the short-running ones aren't executed at all until at least one of the long-running ones has finished? Or is there some guarantee in the standard (specifically C++11) that prevents this kind of situation (like spawning more threads so that the OS can schedule them in a round-robin fashion)?
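A sketch of the scenario (my own illustration; the task counts and sleep lengths are arbitrary):

#include <chrono>
#include <future>
#include <thread>
#include <vector>

int main()
{
    // Enough long-running tasks to occupy any plausible pool.
    std::vector<std::future<void>> longs;
    for (int i = 0; i < 64; ++i)
        longs.push_back(std::async(std::launch::async, [] {
            std::this_thread::sleep_for(std::chrono::seconds(10));
        }));

    // A short task submitted while the long ones are still running.
    auto quick = std::async(std::launch::async, [] { return 42; });

    // On a conforming as-if-a-new-thread implementation this returns
    // promptly; on a fixed-size pool it could wait ~10 seconds.
    quick.get();
}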
The standard reads:
[futures.async#3.1] If launch::async is set in policy, calls INVOKE(DECAY_COPY(std::forward<F>(f)), DECAY_COPY(std::forward<Args>(args))...) ([func.require], [thread.thread.constr]) as if in a new thread of execution represented by a thread object with the calls to DECAY_COPY being evaluated in the thread that called async. [...]
so, under the as-if rule, new threads must be spawned when async() is invoked with the async launch policy. Of course, an implementation may use a thread pool internally but, usual thread creation overhead aside, no special 'starving' can occur. Moreover, things like the initialization of thread locals should always happen.
In fact, the clang libc++ trunk async implementation reads:
unique_ptr<__async_assoc_state<_Rp, _Fp>, __release_shared_count>
    __h(new __async_assoc_state<_Rp, _Fp>(_VSTD::forward<_Fp>(__f)));
_VSTD::thread(&__async_assoc_state<_Rp, _Fp>::__execute, __h.get()).detach();
return future<_Rp>(__h.get());
as you can see, no 'explicit' thread pool is used internally.
Moreover, as you can read here also the libstdc++ implementation shipping with gcc 5.4.0 just invokes a plain thread.
Yes, MSVC's std::async seems to have exactly that property, at least as of MSVC 2015.
I don't know if they fixed it in a 2017 update.
This is against the spirit of the standard. However, the standard is extremely vague about thread forward progress guarantees (at least as of C++14). So while std::async must behave as if it wraps a std::thread, the guarantees on std::thread forward progress are sufficiently weak that this isn't much of a guarantee under the as-if rule.
In practice, this has led me to replace std::async in my thread pool implementations with raw calls to std::thread, as raw use of std::thread in MSVC2015 doesn't appear to have that problem.
I find that a thread pool (with a task queue) is far more practical than raw calls to either std::async or std::thread, and as it is really easy to write a thread pool with either std::thread or std::async, I'd advise writing one with std::thread.
Your thread pool can return std::futures just like std::async does (but without the auto-blocking on destruction feature, as the pool itself manages the thread lifetimes).
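As a rough illustration of that advice, here is a minimal sketch of a fixed-size pool built on std::thread that hands back std::futures via std::packaged_task (names and structure are my own, not from any particular library):

#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class Pool {
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    explicit Pool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m_);
                        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                        if (done_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task(); // run outside the lock
                }
            });
    }
    template <typename F>
    auto enqueue(F f) -> std::future<decltype(f())> {
        // packaged_task is move-only, so wrap it in a shared_ptr to store
        // the task in a std::function.
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        auto fut = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.emplace([task] { (*task)(); });
        }
        cv_.notify_one();
        return fut;
    }
    ~Pool() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
};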
I have read that C++17 added better forward progress guarantees, but I lack sufficient understanding to conclude if MSVC's behavior is now against the standard requirements.

Thread Safety: Multiple threads reading from a single const source

What should I be concerned about as far as thread safety and undefined behavior goes in a situation where multiple threads are reading from a single source that is constant?
I am working on a signal processing model that allows for parallel execution of independent processes. These processes may share an input buffer, but the process that fills the input buffer will always be complete before the next stage of possibly parallel processes executes.
Do I need to worry about thread safety issues in this situation, and what could I do about it?
I would like to note that a lock-free solution would be best, if possible.
but the process that fills the input buffer will always be complete before the next stage of possibly parallel processes will execute
If this is guaranteed then there is not a problem having multiple reads from different threads for const objects.
I don't have the official standard so the following is from n4296:
17.6.5.9 Data race avoidance
3 A C++ standard library function shall not directly or indirectly modify objects (1.10) accessible by threads other than the current thread unless the objects are accessed directly or indirectly via the function's non-const arguments, including this.
4 [Note: This means, for example, that implementations can't use a static object for internal purposes without synchronization because it could cause a data race even in programs that do not explicitly share objects between threads. —end note]
Here is the Herb Sutter video where I first learned about the meaning of const in the C++11 standard. (see around 7:00 to 10:30)
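To illustrate, here is a minimal sketch under the question's assumption that the buffer is completely filled before the readers start:

#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

int main()
{
    std::vector<float> input(1024, 1.0f); // fully filled before any reader runs

    auto reader = [&input](std::size_t begin, std::size_t end) {
        // Reads only; no thread writes to input while readers are running,
        // so these concurrent accesses are not a data race.
        float sum = std::accumulate(input.begin() + begin,
                                    input.begin() + end, 0.0f);
        (void)sum;
    };

    // Thread creation synchronizes-with the start of each new thread, so
    // the writes that filled input happen-before the reads.
    std::thread t1(reader, 0, 512);
    std::thread t2(reader, 512, 1024);
    t1.join();
    t2.join();
}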
No, you are OK. Multiple reads from the same constant source are OK and do not pose any risks in all threading models I know of (namely, Posix and Windows).
However,
but the process that fills the input buffer will always be complete
What are the guarantees here? How do you really know this is the case? Do you have a synchronization?

Safely Destroying a Thread Pool

Consider the following implementation of a trivial thread pool written in C++14.
threadpool.h
threadpool.cpp
Observe that each thread sleeps until it has been notified to awaken -- or woken by some spurious wake-up -- and the following predicate evaluates to true:
std::unique_lock<mutex> lock(this->instance_mutex_);
this->cond_handle_task_.wait(lock, [this] {
    return (this->destroy_ || !this->tasks_.empty());
});
Furthermore, observe that a ThreadPool object uses the data member destroy_ to determine if it's being destroyed -- that is, whether the destructor has been called. Toggling this data member to true notifies each worker thread that it's time to finish its current task and any other queued tasks, then synchronize with the thread that's destroying this object; it also prohibits the enqueue member function.
For your convenience, the implementation of the destructor is below:
ThreadPool::~ThreadPool() {
    {
        std::lock_guard<mutex> lock(this->instance_mutex_); // this line.
        this->destroy_ = true;
    }
    this->cond_handle_task_.notify_all();
    for (auto &worker : this->workers_) {
        worker.join();
    }
}
Q: I do not understand why it's necessary to lock the object's mutex while toggling destroy_ to true in the destructor. Furthermore, is it only necessary for setting its value or is it also necessary for accessing its value?
BQ: Can this thread pool implementation be improved or optimized while maintaining its original purpose: a thread pool that can pool N threads and distribute tasks to them to be executed concurrently?
This thread pool implementation is forked from Jakob Progsch's C++11 thread pool repository with a thorough code step through to understand the purpose behind its implementation and some subjective style changes.
I am introducing myself to concurrent programming and there is still much to learn -- I am a novice concurrent programmer as it stands right now. If my questions are not worded correctly then please make the appropriate correction(s) in your provided answer. Moreover, if the answer can be geared towards a client who is being introduced to concurrent programming for the first time then that would be best -- for myself and any other novices as well.
If the owning thread of the ThreadPool object is the only thread that atomically writes to the destroy_ variable, and the worker threads only atomically read from the destroy_ variable, then no, a mutex is not needed to protect the destroy_ variable in the ThreadPool destructor. Typically a mutex is necessary when an atomic set of operations must take place that can't be accomplished through a single atomic instruction on a platform (i.e., operations beyond an atomic swap, etc.). That being said, the author of the thread pool may be trying to force some type of acquire semantics on the destroy_ variable without resorting to atomic operations (i.e., a memory fence operation), and/or the setting of the flag itself is not considered an atomic operation (platform dependent)... Some other options include declaring the variable as volatile to prevent it from being cached, etc. You can see this thread for more info.
Without some sort of synchronization operation in place, the worst case scenario could end up with a worker that never completes, due to the destroy_ variable being cached on a thread. On platforms with weaker memory ordering models, that's always a possibility if you allow a benign memory race condition to exist...
C++ defines a data race as multiple threads potentially accessing an object simultaneously with at least one of those accesses being a write. Programs with data races have undefined behavior. If you were to write to destroy in your destructor without holding the mutex, your program would have undefined behavior and we cannot predict what would happen.
If you were to read destroy elsewhere without holding the mutex, that read could potentially happen while the destructor is writing to it which is also a data race.
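For a concrete picture of why the lock matters, here is the lost-wakeup interleaving that holding instance_mutex_ rules out (my own illustration using the question's names, not code from the original repository):

worker thread                          destructor thread
-------------                          -----------------
lock(instance_mutex_)
check predicate: destroy_ is false,
                 tasks_ is empty
                                       destroy_ = true   // imagine no lock held
                                       notify_all()      // nobody is waiting yet
wait(lock) blocks                      worker.join() never returns

Because the condition-variable wait atomically releases the mutex and blocks, writing destroy_ while holding instance_mutex_ cannot fall into the window between the worker's predicate check and its block, so the notification cannot be lost. This also answers the second half of the question: the workers read destroy_ under the same mutex (inside the wait predicate), which is what makes both the write and the reads race-free.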