Why do stackless coroutines require dynamic allocation? - c++

This question is not about coroutines in C++20 but coroutines in general.
I'm learning C++20 coroutines these days. I've learnt about stackful and stackless coroutines from Coroutines Introduction, and I've also searched SO for more information.
Here's my understanding about stackless coroutines:
A stackless coroutine does have a stack on the caller's stack when it's running.
When it suspends itself (a stackless coroutine can only suspend in its top-level function), its stack layout is predictable, and the useful data is stored in a known area.
When it's not running, it doesn't have a stack. It's bound with a handle, by which the client can resume the coroutine.
The Coroutines TS specifies that the non-array operator new is called when allocating storage for coroutine frames. However, I think this is unnecessary, hence my question.
Some explanation/consideration:
Where should the coroutine's state be put instead? In the handle, which currently stores just a pointer.
Dynamic allocation doesn't mean storing on the heap. But my intent is to elide calls to operator new, no matter how it is implemented.
From cppreference:
The call to operator new can be optimized out (even if custom allocator is used) if
The lifetime of the coroutine state is strictly nested within the lifetime of the caller, and
the size of coroutine frame is known at the call site
For the first requirement, storing the state directly in the handle is still okay even if the coroutine outlives the caller.
For the other: if the caller doesn't know the size, how can it form the argument for the call to operator new? Actually, I can't even imagine a situation in which the caller wouldn't know the size.
Rust seems to have a different implementation, according to this question.

A stackless coroutine does have a stack on the caller's stack when it's running.
That right there is the source of your misunderstanding.
Continuation-based coroutines (which is what a "stackless coroutine" is) are a coroutine mechanism designed to let you hand a coroutine to some other code, which will resume its execution after some asynchronous process completes. This resumption may take place on some other thread.
As such, the stack cannot be assumed to be "on the caller's stack", since the caller and the process that schedules the coroutine's resumption are not necessarily on the same thread. The coroutine needs to be able to outlive the caller, so the coroutine's stack cannot be on the caller's stack (in general; in certain co_yield-style cases, it can be).
The coroutine handle represents the coroutine's stack. So long as that handle exists, so too does the coroutine's stack.
When it's not running, it doesn't have a stack. It's bound with a handle, by which the client can resume the coroutine.
And how does this "handle" store all of the local variables for the coroutine? Obviously they are preserved (it'd be a bad coroutine mechanism if they weren't), so they have to be stored somewhere. The name for the place where a function's local variables live is the "stack".
Calling it a "handle" doesn't change what it is.
But my intent is to elide calls to operator new, no matter how it is implemented.
Well... you can't. If never calling new is a vital component of writing whatever software you're writing, then you can't use co_await-style coroutine continuations. There is no set of rules you can use that guarantees elision of new in coroutines. If you're using a specific compiler, you can do some tests to see what it elides and what it doesn't, but that's it.
The rules you cite are merely cases that make it possible to elide the call.
For the other, if the caller doesn't know the size, how can it compose the argument to call operator new?
Remember: co_await coroutines in C++ are effectively an implementation detail of a function. The caller has no idea if any function it calls is or is not a coroutine. All coroutines look like regular functions from the outside.
The code for creating a coroutine stack happens within the function call, not outside of it.

The fundamental difference between stackful and stackless coroutines is whether the coroutine owns a full, theoretically unbounded (but practically bounded) stack, like a thread does.
In a stackful coroutine, the local variables of the coroutine are stored on the stack it owns, like anything else, both during execution and when suspended.
In a stackless coroutine, the local variables of the coroutine may or may not be on the stack while the coroutine is running. They are stored in a fixed-size buffer that the stackless coroutine owns.
In theory, a stackless coroutine can be stored on someone else's stack. There is, however, no way to guarantee within C++ code that this happens.
Elision of operator new in the creation of a coroutine is sort of about doing that. If your coroutine object is stored on someone's stack, and new was elided because there is enough room in the coroutine object itself for its state, then the stackless coroutine that lives completely on someone else's stack is possible.
There is no way to guarantee this in the current implementation of C++ coroutines. Attempts to get such a guarantee in were met with resistance from compiler developers, because the exact minimal capture that a coroutine performs is determined "later" than the point at which their compilers need to know how big the coroutine frame is.
This leads to the difference in practice. A stackful coroutine acts more like a thread. You can call normal functions, and those normal functions can interact within their bodies with coroutine operations like suspend.
A stackless coroutine cannot call a function which then interacts with the coroutine machinery. Interacting with the coroutine machinery is only permitted within the stackless coroutine itself.
A stackful coroutine has all of the machinery of a thread without being scheduled on the OS. A stackless coroutine is an augmented function object that has goto labels in it that let it be resumed part way through its body.
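That "augmented function object with goto labels" description can be made concrete without any language support at all. Below is a hand-rolled sketch of the transformation (the type name count_to_three and its resume interface are invented for illustration): the locals move into the object, and a switch label marks the resume point, in the style of Simon Tatham's Duff's-device coroutines:

```cpp
// Hand-rolled "stackless coroutine": an augmented function object whose
// locals live in the object and whose resume point is a switch label.
struct count_to_three {
    int state = 0;  // which label to jump to on the next resume
    int i = 0;      // "local variable", preserved across suspensions

    // Produces 1, 2, 3 on successive calls; returns false when finished.
    bool resume(int& out) {
        switch (state) {
        case 0:
            for (i = 1; i <= 3; ++i) {
                out = i;
                state = 1;    // remember where to resume
                return true;  // "suspend"
        case 1:;              // resumption jumps back into the loop body
            }
        }
        return false; // control flowed off the end: the coroutine is done
    }
};
```

Each call to resume picks up inside the loop where the previous call left off; this is essentially what the compiler generates for a stackless coroutine, minus the promise and awaiter machinery.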
There are theoretical implementations of stackless coroutines that don't have the "could call new" feature. The C++ standard doesn't require such a type of stackless coroutine.
Some people proposed them. Their proposals lost out to the current one, in part because the current one was far more polished and closer to being shipped than the alternative proposals were. Some of the syntax of the alternative proposals ended up in the successful proposal.
I believe there was a convincing argument that the "stricter" fixed-size no-new coroutine implementations were not ruled out by the current proposal and could be added on afterwards, and that helped kill the alternative proposals.

Consider this hypothetical case:
void foo(int);

task coroutine() {
    int a[100] {};
    int * p = a;
    while (true) {
        co_await awaitable{};
        foo(*p);
    }
}
p points to the first element of a; if a's memory location changed between two resumptions, p would no longer hold the right address.
Memory for what would be the function's stack must be allocated in such a way that it is preserved between a suspension and the following resumption. But this memory cannot be moved or copied if some objects within it refer to other objects within it (or at least not without adding complexity). This is a reason why, sometimes, compilers need to allocate this memory on the heap.

Related

How is seastar::thread better than a stackless coroutine?

I looked at this question: Is seastar::thread a stackful coroutine?
In a comment to the answer, Botond Dénes wrote:
That said, seastar::thread still has its (niche) uses in the post-coroutine world as it supports things like waiting for futures in destructors and catch clauses, allowing for using RAII and cleaner error handling. This is something that coroutines don't and can't support.
Could someone elaborate on this? What are the cases when 'things like waiting for futures in destructors and catch clauses' are impossible with stackless coroutines, but possible with seastar::thread (and alike)?
And more generally, what are the advantages of seastar::thread over C++20 stackless coroutines? Do all stackful coroutine implementations (e.g. those in Boost) have these same advantages?
You have probably heard about the RAII (resource acquisition is initialization) idiom in C++, which is very useful for ensuring that resources get released even in case of exceptions. In the classic, synchronous, C++ world, here is an example:
{
someobject o = get_an_o();
// At this point, "o" is holding some resource (e.g., an open file)
call_some_function(o);
return;
// "o" is destroyed if call_some_function() throws, or before the return.
// When it does, the file it was holding is automatically closed.
}
Now, let's consider asynchronous code using Seastar. Imagine that get_an_o() takes some time to do its work (e.g., open a disk file), and returns a future<someobject>. Well, you might think that you can just use stackless coroutines like this:
someobject o = co_await get_an_o();
// At this point, "o" is holding some resource (e.g., an open file)
call_some_function(o);
return;
But there's a problem here... Although the someobject constructor can be made asynchronous (by using a factory function, get_an_o() in this example), the someobject destructor is still synchronous... It gets called by C++ when unwinding the stack, and it can't return a future! So if the object's destructor does need to wait (e.g., to flush the file), you can't use RAII.
Using a seastar::thread allows you to still use RAII, because the destructor now can wait. This is because in a seastar::thread any code, including a destructor, may use get() on a future. This doesn't return a future from the destructor (which we can't do); instead it just "pauses" the stackful coroutine (switches to other tasks, and later returns back to this stack when the future resolves).

Is it necessary to call destroy on a std::coroutine_handle?

The std::coroutine_handle is an important part of the new coroutines of C++20. Generators for example often (always?) use it. The handle is manually destroyed in the destructor of the coroutine in all examples that I have seen:
struct Generator {
    // Other stuff...
    std::coroutine_handle<promise_type> ch;

    ~Generator() {
        if (ch) ch.destroy();
    }
};
Is this really necessary? If yes, why isn't this already done by the coroutine_handle, is there a RAII version of the coroutine_handle that behaves that way, and what would happen if we would omit the destroy call?
Examples:
https://en.cppreference.com/w/cpp/coroutine/coroutine_handle (Thanks 463035818_is_not_a_number)
The C++20 standard also mentions it in 9.5.4.10 Example 2 (checked on N4892).
(German) https://www.heise.de/developer/artikel/Ein-unendlicher-Datenstrom-dank-Coroutinen-in-C-20-5991142.html
https://www.scs.stanford.edu/~dm/blog/c++-coroutines.html - Mentions that it would leak if it weren't called, but does not cite a passage from the standard or explain why it isn't called in the destructor of std::coroutine_handle.
This is because you want to be able to have a coroutine outlive its handle, a handle should be non-owning. A handle is merely a "view" much like std::string_view -> std::string. You wouldn't want the std::string to destruct itself if the std::string_view goes out of scope.
If you do want this behaviour though, creating your own wrapper around it would be trivial.
That being said, the standard specifies:
The coroutine state is destroyed when control flows off the end of the
coroutine or the destroy member function
([coroutine.handle.resumption]) of a coroutine handle
([coroutine.handle]) that refers to the coroutine is invoked.
The coroutine state will clean up after itself after it has finished running and thus it won't leak unless control doesn't flow off the end.
Of course, in the generator case control typically doesn't flow off the end and thus the programmer has to destroy the coroutine manually. Coroutines have multiple uses though and the standard thus can't really unconditionally mandate the handle destructor call destroy().

boost::coroutine2 vs CoroutineTS

Boost::Coroutine2 and the Coroutines TS (C++20) are popular coroutine implementations in C++. Both provide suspend and resume, but the two implementations follow quite different approaches.
CoroutineTS(C++20)
Stackless
Suspend by return
Uses special keywords
generator<int> Generate()
{
    co_yield 0;
}
boost::coroutine2
Stackful
Suspend by call
Do not use special keywords
pull_type source([](push_type& sink)
{
    sink();
});
Are there any specific use cases where I should select only one of them?
The main technical distinction is whether you want to be able to yield from within a nested call. This cannot be done using stackless coroutines.
Another thing to consider is that stackful coroutines have a stack and context (such as signal masks, the stack pointer, the CPU registers, etc.) of their own, so they have a larger memory footprint than stackless coroutines. This can be an issue especially if you have a resource constrained system or massive amounts of coroutines existing simultaneously.
I have no idea how they compare performance-wise in the real world, but in general, stackless coroutines are more efficient, as they have less overhead (stackless task switches do not have to swap stacks, store/load registers, and restore the signal mask, etc.).
For an example of a minimal stackless coroutine implementation, see Simon Tatham's coroutines using Duff's Device. It is pretty intuitive that they are as efficient as you can get.
Also, this question has nice answers that go more into details about the differences between stackful and stackless coroutines.
How to yield from a nested call in stackless coroutines? 
Even though I said it's not possible, that was not 100% true: You can use (at least two) tricks to achieve this, each with some drawbacks:
First, you have to convert every call that should be able to yield your calling coroutine into a coroutine as well. Now, there are two ways:
The trampoline approach: You simply call the child coroutine from the parent coroutine in a loop, until it returns. Every time you notify the child coroutine, if it does not finish, you also yield the calling coroutine. Note that this approach forbids calling the child coroutine directly, you always have to call the outermost coroutine, which then has to re-enter the whole callstack. This has a call and return complexity of O(n) for nesting depth n. If you are waiting for an event, the event simply has to notify the outermost coroutine.
The parent link approach: You pass the parent coroutine address to the child coroutine, yield the parent coroutine, and the child coroutine manually resumes the parent coroutine once it finishes. Note that this approach forbids calling any coroutine besides the inner-most coroutine directly. This approach has a call and return complexity of O(1), so it is generally preferable. The drawback is that you have to manually register the innermost coroutine somewhere, so that the next event that wants to resume the outer coroutine knows which inner coroutine to directly target.
Note: By call and return complexity I mean the number of steps taken when notifying a coroutine to resume it, and the steps taken after notifying it to return to the calling notifier again.
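The trampoline approach can be sketched with plain function objects standing in for stackless coroutines (all names here are invented for illustration). The parent re-enters the child each time it is itself resumed and yields whenever the child yields, which is where the O(n) re-entry cost for nesting depth n comes from:

```cpp
// Child "coroutine": yields 10, then 20, then finishes.
struct child_coro {
    int state = 0;
    bool resume(int& out) {
        switch (state) {
        case 0: out = 10; state = 1; return true;
        case 1: out = 20; state = 2; return true;
        }
        return false;
    }
};

// Parent "coroutine": yields 1, then trampolines the child's yields
// (10, 20) through itself, then yields 99.
struct parent_coro {
    int state = 0;
    child_coro child;
    bool resume(int& out) {
        switch (state) {
        case 0:
            out = 1; state = 1; return true;   // parent's own yield
        case 1:
            // "Nested call": re-enter the child on every resumption,
            // and yield ourselves whenever the child yields.
            if (child.resume(out)) return true;
            state = 2;
            [[fallthrough]];
        case 2:
            out = 99; state = 3; return true;
        }
        return false;
    }
};
```

Note that the caller only ever resumes parent_coro; every value produced by child_coro travels up through the parent's resume, one stack frame per nesting level.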

C++ threads stack address range

Does the C++ standard provide a guarantee about the non-overlapping nature of thread stacks (as in started by an std::thread)? In particular, is there a guarantee that threads will have their own exclusive, allocated range in the process's address space for the thread stack? Where is this described in the standard?
For example
std::uintptr_t foo() {
    auto integer = int{0};
    return std::bit_cast<std::uintptr_t>(&integer);
    ...
}

void bar(std::uint64_t id, std::atomic<std::uint64_t>& atomic) {
    while (atomic.load() != id) {}
    cout << foo() << endl;
    atomic.fetch_add(1);
}

int main() {
    auto atomic = std::atomic<std::uint64_t>{0};
    auto one = std::thread{[&]() { bar(0, atomic); }};
    auto two = std::thread{[&]() { bar(1, atomic); }};
    one.join();
    two.join();
}
Can this ever print the same value twice? It feels like the standard should be providing this guarantee somewhere. But not sure..
The C++ standard does not even require that function calls are implemented using a stack (or that threads have stack in this sense).
The current C++ draft says this about overlapping objects:
Two objects with overlapping lifetimes that are not bit-fields may have the same address if one is nested within the other, or if at least one is a subobject of zero size and they are of different types; otherwise, they have distinct addresses and occupy disjoint bytes of storage.
And in the (non-normative) footnote:
Under the “as-if” rule an implementation is allowed to store two objects at the same machine address or not store an object at all if the program cannot observe the difference ([intro.execution]).
In your example, I do not think the threads synchronize properly, as probably intended, so the lifetimes of the integer objects do not necessarily overlap, so both objects can be put at the same address.
If the code were fixed to synchronize properly and foo were manually inlined into bar, in such a way that the integer object still exists when its address is printed, then there would have to be two objects allocated at different addresses because the difference is observable.
However, none of this tells you whether stackful coroutines can be implemented in C++ without compiler help. Real-world compilers make assumptions about the execution environment that are not reflected in the C++ standard and are only implied by the ABI standards. Particularly relevant to stack-switching coroutines is the fact that the address of the thread descriptor and thread-local variables does not change while executing a function (because they can be expensive to compute and the compiler emits code to cache them in registers or on the stack).
This is what can happen:
Coroutine runs on thread A and accesses errno.
Coroutine is suspended from thread A.
Coroutine resumes on thread B.
Coroutine accesses errno again.
At this point, thread B will access the errno value of thread A, which might well be in use for something completely different at this point.
This problem is avoided if a coroutine is only ever resumed on the same thread on which it was suspended, which is very restrictive and probably not what most coroutine library authors have in mind. The worst part is that resuming on the wrong thread is likely to appear to work most of the time, because some widely used thread-local variables (such as errno) which are not quite thread-local do not immediately result in obviously buggy programs.
For all the Standard cares, implementations call new __StackFrameFoo when foo() needs a stack frame. Where those end up, who knows.
The chief rule is that different objects have different addresses, and that includes objects which "live on the stack". But the rule only applies to two objects which exist at the same time, and then only as far as the comparison is done with proper thread synchronization. And of course, comparing addresses does hinder the optimizer, which might then need to assign an address to an object that could otherwise be optimized out.

What are coroutines in C++20?

What are coroutines in C++20?
In what ways is it different from "Parallelism" and/or "Concurrency" (see the image below)?
The below image is from ISOCPP.
https://isocpp.org/files/img/wg21-timeline-2017-03.png
At an abstract level, coroutines split the idea of having an execution state off of the idea of having a thread of execution.
SIMD (single instruction multiple data) has multiple "threads of execution" but only one execution state (it just works on multiple data). Arguably parallel algorithms are a bit like this, in that you have one "program" run on different data.
Threading has multiple "threads of execution" and multiple execution states. You have more than one program, and more than one thread of execution.
Coroutines have multiple execution states, but do not own a thread of execution. You have a program, and the program has state, but it has no thread of execution.
The easiest example of coroutines are generators or enumerables from other languages.
In pseudo code:
function Generator() {
    for (i = 0 to 100)
        produce i
}
The Generator is called, and the first time it is called it returns 0. Its state is remembered (how much state is kept varies with the implementation of coroutines), and the next time you call it, it continues where it left off. So it returns 1 the next time. Then 2.
Finally it reaches the end of the loop and falls off the end of the function; the coroutine is finished. (What happens here varies based on the language we are talking about; in Python, it throws an exception.)
Coroutines bring this capability to C++.
There are two kinds of coroutines: stackful and stackless.
A stackless coroutine only stores local variables in its state and its location of execution.
A stackful coroutine stores an entire stack (like a thread).
Stackless coroutines can be extremely light weight. The last proposal I read involved basically rewriting your function into something a bit like a lambda; all local variables go into the state of an object, and labels are used to jump to/from the location where the coroutine "produces" intermediate results.
The process of producing a value is called "yield", as coroutines are a bit like cooperative multithreading; you are yielding the point of execution back to the caller.
Boost has an implementation of stackful coroutines; it lets you call a function to yield for you. Stackful coroutines are more powerful, but also more expensive.
There is more to coroutines than a simple generator. You can await a coroutine in a coroutine, which lets you compose coroutines in a useful manner.
Coroutines, like if, loops and function calls, are another kind of "structured goto" that lets you express certain useful patterns (like state machines) in a more natural way.
The specific implementation of Coroutines in C++ is a bit interesting.
At its most basic level, it adds a few keywords to C++: co_return co_await co_yield, together with some library types that work with them.
A function becomes a coroutine by having one of those in its body. So from their declaration they are indistinguishable from functions.
When one of those three keywords is used in a function body, some standard-mandated examination of the return type and arguments occurs, and the function is transformed into a coroutine. This examination tells the compiler where to store the function state when the function is suspended.
The simplest coroutine is a generator:
generator<int> get_integers( int start=0, int step=1 ) {
    for (int current=start; true; current+= step)
        co_yield current;
}
co_yield suspends the function's execution, stores that state in the generator<int>, then returns the value of current through the generator<int>.
You can loop over the integers returned.
co_await, meanwhile, lets you splice one coroutine onto another. If you are in one coroutine and you need the results of an awaitable thing (often a coroutine) before progressing, you co_await on it. If it is ready, you proceed immediately; if not, you suspend until the awaitable you are waiting on is ready.
std::future<std::expected<std::string>> load_data( std::string resource )
{
    auto handle = co_await open_resource(resource);
    while( auto line = co_await read_line(handle)) {
        if (std::optional<std::string> r = parse_data_from_line( line ))
            co_return *r;
    }
    co_return std::unexpected( resource_lacks_data(resource) );
}
load_data is a coroutine whose std::future becomes ready once the named resource is opened and we manage to parse to the point where we find the data requested.
open_resource and read_line are probably async coroutines that open a file and read lines from it. The co_await connects the suspending and ready states of load_data to their progress.
C++ coroutines are much more flexible than this, as they were implemented as a minimal set of language features on top of user-space types. The user-space types effectively define what co_return co_await and co_yield mean -- I've seen people use it to implement monadic optional expressions such that a co_await on an empty optional automatically propagates the empty state to the outer optional:
modified_optional<int> add( modified_optional<int> a, modified_optional<int> b ) {
    co_return (co_await a) + (co_await b);
}
instead of
std::optional<int> add( std::optional<int> a, std::optional<int> b ) {
    if (!a) return std::nullopt;
    if (!b) return std::nullopt;
    return *a + *b;
}
A coroutine is like a C function which has multiple return statements and, when called a second time, does not start execution at the beginning of the function but at the first instruction after the previously executed return. This execution location is saved together with all automatic variables that would live on the stack in non-coroutine functions.
A previous experimental coroutine implementation from Microsoft used copied stacks, so you could even return from deeply nested functions. But this version was rejected by the C++ committee. You can get this kind of implementation, for example, with Boost's fiber library.
Coroutines are supposed to be (in C++) functions that are able to "wait" for some other routine to complete, and to provide whatever is needed for the suspended, paused, waiting routine to go on. The feature that is most interesting to C++ folks is that coroutines would ideally take no stack space. C# can already do something like this with await and yield, but C++ might have to be rebuilt to get it in.
Concurrency is heavily focused on the separation of concerns, where a concern is a task that the program is supposed to complete. This separation of concerns may be accomplished by a number of means, usually by delegation of some sort. The idea of concurrency is that a number of processes could run independently (separation of concerns), and a 'listener' would direct whatever is produced by those separated concerns to wherever it is supposed to go. This is heavily dependent on some sort of asynchronous management. There are a number of approaches to concurrency, including aspect-oriented programming and others. C# has the 'delegate' operator, which works quite nicely.
Parallelism sounds like concurrency and may be involved, but it is actually a physical construct: many processors arranged in a more or less parallel fashion, with software that is able to direct portions of code to different processors where they will be run, and the results received back synchronously.