boost::coroutine2 vs CoroutineTS - c++

Boost::Coroutine2 and the Coroutines TS (C++20) are popular coroutine implementations in C++. Both support suspend and resume, but the two implementations follow quite different approaches.
CoroutineTS(C++20)
Stackless
Suspend by return
Uses special keywords
generator<int> Generate()
{
    co_yield 1;
}
boost::coroutine2
Stackful
Suspend by call
Do not use special keywords
pull_type source([](push_type& sink)
{
    sink();
});
Are there any specific use cases where I should select only one of them?

The main technical distinction is whether you want to be able to yield from within a nested call. This cannot be done using stackless coroutines.
Another thing to consider is that stackful coroutines have a stack and context (such as signal masks, the stack pointer, the CPU registers, etc.) of their own, so they have a larger memory footprint than stackless coroutines. This can be an issue especially if you have a resource constrained system or massive amounts of coroutines existing simultaneously.
I have no idea how they compare performance-wise in the real world, but in general, stackless coroutines are more efficient, as they have less overhead (stackless task switches do not have to swap stacks, store/load registers, and restore the signal mask, etc.).
For an example of a minimal stackless coroutine implementation, see Simon Tatham's coroutines using Duff's device. It also makes it intuitively clear why they are about as efficient as you can get.
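The flavor of that approach can be sketched in a few lines (all names below are invented for illustration): the coroutine's only state is an integer recording where to resume plus its hoisted local variable, and the switch jumps back into the middle of the loop on each call.

```cpp
#include <cassert>

// A coroutine in the style of Simon Tatham's Duff's-device trick
// (illustrative names): the only state is the resume point and the
// hoisted loop variable, and the switch jumps back into the loop body.
struct CountingCoroutine {
    int state = 0;   // which case label to resume at
    int i = 0;       // local variable hoisted into the object

    int next() {     // returns the next value, or -1 when finished
        switch (state) {
        case 0:
            for (i = 0; i < 3; ++i) {
                state = 1;
                return i;            // "yield i"
        case 1:;                     // resumption re-enters the loop here
            }
        }
        return -1;                   // fell off the end: finished
    }
};
```

Calling next() repeatedly yields 0, 1, 2 and then -1 once the loop falls off the end, which shows why resuming such a coroutine costs no more than an ordinary function call plus a switch.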
Also, this question has nice answers that go more into details about the differences between stackful and stackless coroutines.
How to yield from a nested call in stackless coroutines? 
Even though I said it's not possible, that was not 100% true: You can use (at least two) tricks to achieve this, each with some drawbacks:
First, you have to convert every call that should be able to yield your calling coroutine into a coroutine as well. Now, there are two ways:
The trampoline approach: You simply call the child coroutine from the parent coroutine in a loop, until it returns. Every time you notify the child coroutine, if it does not finish, you also yield the calling coroutine. Note that this approach forbids calling the child coroutine directly, you always have to call the outermost coroutine, which then has to re-enter the whole callstack. This has a call and return complexity of O(n) for nesting depth n. If you are waiting for an event, the event simply has to notify the outermost coroutine.
The parent link approach: You pass the parent coroutine address to the child coroutine, yield the parent coroutine, and the child coroutine manually resumes the parent coroutine once it finishes. Note that this approach forbids calling any coroutine besides the inner-most coroutine directly. This approach has a call and return complexity of O(1), so it is generally preferable. The drawback is that you have to manually register the innermost coroutine somewhere, so that the next event that wants to resume the outer coroutine knows which inner coroutine to directly target.
Note: By call and return complexity I mean the number of steps taken when notifying a coroutine to resume it, and the steps taken after notifying it to return to the calling notifier again.
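As a rough sketch of the trampoline approach, here are two hand-rolled state machines standing in for coroutines (all names are invented for illustration; this is not C++20 coroutine syntax). The parent resumes the child once per resumption and forwards the child's yield as its own, which is exactly the O(n) re-entry of the whole call chain described above.

```cpp
#include <cassert>

// Hand-rolled state machines standing in for coroutines (illustrative
// names, not C++20 coroutine syntax). Child yields 10 and then 20;
// Parent trampolines: each resume of Parent resumes Child once and
// forwards the child's yield as its own.
struct Child {
    int state = 0;
    bool resume(int& out) {          // true while there is a value to yield
        switch (state) {
        case 0: state = 1; out = 10; return true;
        case 1: state = 2; out = 20; return true;
        default: return false;       // finished
        }
    }
};

struct Parent {
    Child child;
    bool done = false;
    bool resume(int& out) {
        if (done) return false;
        if (child.resume(out))       // re-enter the nested "call"
            return true;             // child yielded, so the parent yields too
        done = true;                 // child finished, so the parent finishes
        return false;
    }
};
```

Driving Parent yields 10, then 20, then reports completion; notice that every resumption goes through Parent first, which is the cost the parent-link approach avoids.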

Related

Why do stackless coroutines require dynamic allocation?

This question is not about coroutines in C++20 but coroutines in general.
I'm learning C++20 coroutines these days. I've learnt about stackful and stackless coroutines from Coroutines Introduction. I've also searched SO for more information.
Here's my understanding about stackless coroutines:
A stackless coroutine has its stack on the caller's stack when it's running.
When it suspends itself, as stackless coroutines can only suspend at the top-level function, its stack is predictable and useful data are stored in a certain area.
When it's not running, it doesn't have a stack. It's bound with a handle, by which the client can resume the coroutine.
The Coroutines TS specifies that the non-array operator new is called when allocating storage for coroutine frames. However, I think this is unnecessary, hence my question.
Some explanation/consideration:
Where to put the coroutine's status instead? In the handle, which originally stores the pointer.
Dynamic allocation doesn't mean storing on the heap. But my intent is to elide calls to operator new, no matter how it is implemented.
From cppreference:
The call to operator new can be optimized out (even if custom allocator is used) if
The lifetime of the coroutine state is strictly nested within the lifetime of the caller, and
the size of the coroutine frame is known at the call site
For the first requirement, storing the state directly in the handle is still okay if the coroutine outlives the caller.
For the other, if the caller doesn't know the size, how can it compose the argument to call operator new? Actually, I can't even imagine a situation in which the caller doesn't know the size.
Rust seems to have a different implementation, according to this question.
A stackless coroutine has its stack on the caller's stack when it's running.
That right there is the source of your misunderstanding.
Continuation-based coroutines (which is what a "stackless coroutine" is) are a coroutine mechanism designed for providing a coroutine to some other code, which will resume its execution after some asynchronous process completes. This resumption may take place on some other thread.
As such, the stack cannot be assumed to be "on the caller's stack", since the caller and the process that schedules the coroutine's resumption are not necessarily on the same thread. The coroutine needs to be able to outlive the caller, so the coroutine's stack cannot be on the caller's stack (in general; in certain co_yield-style cases, it can be).
The coroutine handle represents the coroutine's stack. So long as that handle exists, so too does the coroutine's stack.
When it's not running, it doesn't have a stack. It's bound with a handle, by which the client can resume the coroutine.
And how does this "handle" store all of the local variables for the coroutine? Obviously they are preserved (it'd be a bad coroutine mechanism if they weren't), so they have to be stored somewhere. The name given for where a function's local variables are is called the "stack".
Calling it a "handle" doesn't change what it is.
But my intent is to elide calls to operator new, no matter how it is implemented.
Well... you can't. If never calling new is a vital component of writing whatever software you're writing, then you can't use co_await-style coroutine continuations. There is no set of rules you can use that guarantees elision of new in coroutines. If you're using a specific compiler, you can do some tests to see what it elides and what it doesn't, but that's it.
The rules you cite are merely cases that make it possible to elide the call.
For the other, if the caller doesn't know the size, how can it compose the argument to call operator new?
Remember: co_await coroutines in C++ are effectively an implementation detail of a function. The caller has no idea if any function it calls is or is not a coroutine. All coroutines look like regular functions from the outside.
The code for creating a coroutine stack happens within the function call, not outside of it.
The fundamental difference between stackful and stackless coroutines is whether the coroutine owns a full, theoretically unbounded (but practically bounded) stack, like a thread does.
In a stackful coroutine, the local variables of the coroutine are stored on the stack it owns, like anything else, both during execution and when suspended.
In a stackless coroutine, the local variables of the coroutine are stored in a fixed-size buffer that the coroutine owns, whether the coroutine is running or not. That buffer may or may not itself live on a stack.
In theory, a stackless coroutine can be stored on someone else's stack. There is, however, no way to guarantee within C++ code that this happens.
Elision of operator new in the creation of a coroutine is sort of about doing that. If your coroutine object is stored on someone's stack, and new was elided because there is enough room in the coroutine object itself for its state, then the stackless coroutine that lives completely on someone else's stack is possible.
There is no way to guarantee this in the current implementation of C++ coroutines. Attempts to get that guarantee in were met with resistance from compiler developers, because the exact minimal capture that a coroutine performs is determined "later" than the point at which their compilers need to know how big the coroutine frame is.
This leads to the difference in practice. A stackful coroutine acts more like a thread. You can call normal functions, and those normal functions can interact within their bodies with coroutine operations like suspend.
A stackless coroutine cannot call a function which then interacts with the coroutine machinery. Interacting with the coroutine machinery is only permitted within the stackless coroutine itself.
A stackful coroutine has all of the machinery of a thread without being scheduled on the OS. A stackless coroutine is an augmented function object that has goto labels in it that let it be resumed part way through its body.
There are theoretical implementations of stackless coroutines that don't have the "could call new" feature. The C++ standard doesn't require such a type of stackless coroutine.
Some people proposed them. Their proposals lost out to the current one, in part because the current one was far more polished and closer to being shipped than the alternative proposals were. Some of the syntax of the alternative proposals ended up in the successful proposal.
I believe there was a convincing argument that the "stricter" fixed-size no-new coroutine implementations were not ruled out by the current proposal and could be added on afterwards, and that helped kill the alternative proposals.
Consider this hypothetical case:
void foo(int);
task coroutine() {
    int a[100] {};
    int * p = a;
    while (true) {
        co_await awaitable{};
        foo(*p);
    }
}
p points to the first element of a; if a's memory location changed between two resumptions, p would no longer hold the right address.
Memory for what would be the function stack must be allocated in such a way that it is preserved between a suspension and the following resumption. But this memory cannot be moved or copied if some objects within it refer to other objects within it (or at least not without adding complexity). This is a reason why, sometimes, compilers need to allocate this memory on the heap.
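A rough sketch of the frame a compiler might build for coroutine() above makes this concrete (illustrative only, not what any real compiler emits): a and p live in the same allocation, and copying that allocation leaves p pointing into the old one.

```cpp
#include <cassert>
#include <cstring>

// A rough sketch of the frame a compiler might build for coroutine()
// above (illustrative, not an actual compiler layout): the locals
// `a` and `p` must live in the frame to survive suspension.
struct CoroutineFrame {
    int resume_point = 0;
    int a[100] = {};
    int* p = nullptr;
};

bool interior_pointer_breaks_on_copy() {
    CoroutineFrame frame;
    frame.p = frame.a;   // `int * p = a;` captured into the frame

    // Moving/copying the frame does not update the interior pointer:
    CoroutineFrame copy;
    std::memcpy(&copy, &frame, sizeof frame);
    return copy.p == frame.a   // still points into the original frame
        && copy.p != copy.a;   // not into the copy's own array
}
```

This is why the frame must stay at a stable address between suspension and resumption, which heap allocation provides.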

Scheduling a coroutine with a context

There are plenty of tutorials that explain how it's easy to use coroutines in C++, but I've spent a lot of time getting how to schedule "detached" coroutines.
Assume, I have the following definition of coroutine result type:
struct task {
    struct promise_type {
        auto initial_suspend() const noexcept { return std::suspend_never{}; }
        auto final_suspend() const noexcept { return std::suspend_never{}; }
        void return_void() const noexcept { }
        void unhandled_exception() const { std::terminate(); }
        task get_return_object() const noexcept { return {}; }
    };
};
And there is also a method that runs "detached" coroutine, i.e. runs it asynchronously.
/// Handler should have overloaded operator() returning task.
template<class Handler>
void schedule_coroutine(Handler &&handler) {
    std::thread([handler = std::forward<Handler>(handler)]() { handler(); }).detach();
}
Obviously, I cannot pass lambda functions or any other functional object that has a state into this method, because once the coroutine is suspended, the lambda passed into the std::thread constructor will be destroyed along with all the captured variables.
task coroutine_1() {
    std::vector<object> objects;
    // ...
    schedule_coroutine([objects]() -> task {
        // ...
        co_await something;
        // ...
        co_return;
    });
    // ...
    co_return;
}

int main() {
    // ...
    schedule_coroutine(coroutine_1);
    // ...
}
I think there should be a way to save the handler somehow (preferably near or within the coroutine promise) so that the next time the coroutine is resumed it won't try to access the destroyed object's data. But unfortunately I have no idea how to do it.
I think your problem is a general (and common) misunderstanding of how co_await coroutines are intended to work.
When a function performs co_await <expr>, this (generally) means that the function suspends execution until expr resumes its execution. That is, your function is waiting until some process completes (and typically returns a value). That process, represented by expr, is the one who is supposed to resume the function (generally).
The whole point of this is to make code that executes asynchronously look like synchronous code as much as possible. In synchronous code, you would do something like <expr>.wait(), where wait is a function that waits for the task represented by expr to complete. Instead of "waiting" on it, you "a-wait" or "asynchronously wait" on it. The rest of your function executes asynchronously relative to your caller, based on when expr completes and how it decides to resume your function's execution. In this way, co_await <expr> looks and appears to act very much like <expr>.wait().
Compiler Magictm then goes in behind the scenes to make it asynchronous.
So the idea of launching a "detached coroutine" doesn't make sense within this framework. The caller of a coroutine function (usually) isn't the one who determines where the coroutine executes; it's the processes the coroutine invokes during its execution that decides that.
Your schedule_coroutine function really ought to just be a regular "execute a function asynchronously" operation. It shouldn't have any particular association with coroutines, nor any expectation that the given functor is or represents some asynchronous task or if it happens to invoke co_await. The function is just going to create a new thread and execute a function on it.
Just as you would have done pre-C++20.
If your task type represents an asynchronous task, then in proper RAII style, its destructor ought to wait until the task is completed before exiting (this includes any resumptions of coroutines scheduled by that task, throughout the entire execution of said task. The task isn't done until it is entirely done). Therefore, if handler() in your schedule_coroutine call returns a task, then that task will be initialized and immediately destroyed. Since the destructor waits for the asynchronous task to complete, the thread will not die until the task is done. And since the thread's functor is copied/moved from the function object given to the thread constructor, any captures will continue to exist until the thread itself exits.
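A minimal sketch of that RAII style, using std::async as a stand-in for the questioner's scheduling machinery (joining_task is an invented name, not the questioner's task type): the destructor blocks until the wrapped asynchronous work is done, so anything the work captures outlives every resumption.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Sketch of the RAII style described above: joining_task is an invented
// wrapper whose destructor blocks until the wrapped asynchronous work
// has finished -- "the task isn't done until it is entirely done".
class joining_task {
    std::future<void> fut_;
public:
    explicit joining_task(std::future<void> f) : fut_(std::move(f)) {}
    ~joining_task() {
        if (fut_.valid())
            fut_.wait();   // block until the asynchronous work completes
    }
};

int run_and_join() {
    int x = 0;
    {
        joining_task t(std::async(std::launch::async, [&x] { x = 42; }));
    }   // destructor waits here, so the asynchronous write has completed
    return x;
}
```

Because the destructor synchronizes with the worker's completion, reading x after the scope closes is safe.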
I hope I got you right, but I think there might be a couple of misconceptions here. First off, you clearly cannot detach a coroutine; that would not make any sense at all. But you can execute asynchronous tasks inside a coroutine for sure, even though in my opinion this defeats its purpose entirely.
But let's take a look at the second block of code you posted. Here you invoke std::async and forward a handler to it. Now, in order to prevent any kind of early destruction you should use std::move instead and pass the handler to the lambda so it will be kept alive for as long as the scope of the lambda function is valid. This should probably already answer your final question as well, because the place where you want this handler to be stored would be the lambda capture itself.
Another thing that bothers me is the usage of std::async. The call will return a std::future-kind of type that will block until the lambda has been executed. But this will only happen if you set the launch type to std::launch::async, otherwise you will need to call .get() or .wait() on the future as the default launch type is std::launch::deferred and this will only lazy fire (meaning: when you actually request the result).
So, in your case and if you really wanted to use coroutines that way, I would suggest to use a std::thread instead and store it for a later join() somewhere pseudo-globally. But again, I don't think you would really want to use coroutines mechanics that way.
Your question makes perfect sense; the misunderstanding is that C++20 coroutines are actually generators mistakenly occupying the coroutine header name.
Let me explain how generators work and then answer how to schedule a detached coroutine.
How generators work
Your question "Scheduling a detached coroutine" then becomes "How to schedule a detached generator", and the answer is: it is not possible, because a special convention transforms a regular function into a generator function.
What is not obvious there is that yielding a value must take place inside the generator function body. When you want to call a helper function that yields a value for you, you can't. Instead you also turn the helper function into a generator, and then await it instead of just calling it. This effectively chains generators and might feel like writing synchronous code that executes asynchronously.
In JavaScript the special convention is the async keyword. In Python the special convention is the yield keyword in place of return.
C++20 coroutines are a low-level mechanism that allows implementing JavaScript-like async/await.
There is nothing wrong with including this low-level mechanism in the C++ language, except for placing it in a header named coroutine.
How to schedule detached coroutine
This question makes sense if you want to have green threads or fibers and you are writing scheduler logic that uses symmetric or asymmetric coroutines to accomplish this.
Now others might ask: why should anyone bother with fibers (not Windows fibers) when you have generators? The answer is that you can have encapsulated concurrency and parallelism logic, meaning the rest of your team isn't required to learn and apply additional mental gymnastics while working on the project.
The result is true asynchronous programming, where the rest of the team writes linear code without callbacks and such, with a simple concept of concurrency, for example a single spawn() library function, avoiding any locks/mutexes and other multithreading complexity.
The beauty of encapsulation is seen when all details are hidden in low-level I/O methods. All context switching, scheduling, etc. happens deep inside I/O classes like Channel, Queue or File.
Everyone involved in async programming should experience working like this. The feeling is intense.
To accomplish this, instead of C++20 coroutines use Boost.Fiber, which includes a scheduler, or Boost.Context, which allows symmetric coroutines. Symmetric coroutines allow you to suspend and switch to any other coroutine, while asymmetric coroutines suspend and resume the calling coroutine.
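The Channel encapsulation idea can be illustrated even with plain std::thread in standard C++ (a sketch of the programming model, not of an actual fiber scheduler): all blocking and wakeup logic is hidden inside the Channel class, so the producer and consumer read as linear code with no callbacks.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// A sketch of the encapsulation idea: the synchronization lives entirely
// inside Channel. Plain std::thread stands in for fibers here; a real
// Boost.Fiber version would suspend fibers instead of blocking threads.
template <class T>
class Channel {
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
public:
    void send(T v) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(v));
        }
        cv_.notify_one();
    }
    T receive() {  // blocks the caller until a value is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

int sum_three() {
    Channel<int> ch;
    std::thread producer([&ch] {
        for (int i = 1; i <= 3; ++i)
            ch.send(i);            // linear producer code, no callbacks
    });
    int sum = 0;
    for (int i = 0; i < 3; ++i)
        sum += ch.receive();       // linear consumer code, no callbacks
    producer.join();
    return sum;
}
```

Neither side sees a lock, a callback, or a scheduler; that is the payoff the answer describes.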

What are coroutines in C++20?

What are coroutines in C++20?
In what ways is it different from "Parallelism2" and/or "Concurrency2" (see the image below)?
The below image is from ISOCPP.
https://isocpp.org/files/img/wg21-timeline-2017-03.png
At an abstract level, Coroutines split the idea of having an execution state off of the idea of having a thread of execution.
SIMD (single instruction multiple data) has multiple "threads of execution" but only one execution state (it just works on multiple data). Arguably parallel algorithms are a bit like this, in that you have one "program" run on different data.
Threading has multiple "threads of execution" and multiple execution states. You have more than one program, and more than one thread of execution.
Coroutines has multiple execution states, but does not own a thread of execution. You have a program, and the program has state, but it has no thread of execution.
The easiest example of coroutines are generators or enumerables from other languages.
In pseudo code:
function Generator() {
    for (i = 0 to 100)
        produce i
}
The Generator is called, and the first time it is called it returns 0. Its state is remembered (how much state is remembered varies with the coroutine implementation), and the next time you call it, it continues where it left off. So it returns 1 the next time. Then 2.
Finally it reaches the end of the loop and falls off the end of the function; the coroutine is finished. (What happens here varies based on language we are talking about; in python, it throws an exception).
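A minimal C++ object with the behaviour the pseudocode describes, as a sketch: the remembered state is just the counter i, and -1 stands in here for "the coroutine is finished".

```cpp
#include <cassert>

// A minimal stand-in for the pseudocode Generator above: the remembered
// state is just `i`, and -1 signals that the coroutine has finished.
struct Generator {
    int i = 0;
    int next() { return i <= 100 ? i++ : -1; }  // resume where we left off
};
```

Each call to next() picks up where the previous one left off, which is the essence of a generator.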
Coroutines bring this capability to C++.
There are two kinds of coroutines; stackful and stackless.
A stackless coroutine only stores local variables in its state and its location of execution.
A stackful coroutine stores an entire stack (like a thread).
Stackless coroutines can be extremely light weight. The last proposal I read involved basically rewriting your function into something a bit like a lambda; all local variables go into the state of an object, and labels are used to jump to/from the location where the coroutine "produces" intermediate results.
The process of producing a value is called "yield", as coroutines are a bit like cooperative multithreading; you are yielding the point of execution back to the caller.
Boost has an implementation of stackful coroutines; it lets you call a function to yield for you. Stackful coroutines are more powerful, but also more expensive.
There is more to coroutines than a simple generator. You can await a coroutine in a coroutine, which lets you compose coroutines in a useful manner.
Coroutines, like if, loops and function calls, are another kind of "structured goto" that lets you express certain useful patterns (like state machines) in a more natural way.
The specific implementation of Coroutines in C++ is a bit interesting.
At its most basic level, it adds a few keywords to C++ (co_return, co_await and co_yield), together with some library types that work with them.
A function becomes a coroutine by having one of those in its body. So from their declaration they are indistinguishable from functions.
When one of those three keywords is used in a function body, some standard-mandated examination of the return type and arguments occurs, and the function is transformed into a coroutine. This examination tells the compiler where to store the function state when the function is suspended.
The simplest coroutine is a generator:
generator<int> get_integers( int start=0, int step=1 ) {
    for (int current=start; true; current+= step)
        co_yield current;
}
co_yield suspends the function's execution, stores that state in the generator<int>, then returns the value of current through the generator<int>.
You can loop over the integers returned.
co_await meanwhile lets you splice one coroutine onto another. If you are in one coroutine and you need the results of an awaitable thing (often a coroutine) before progressing, you co_await on it. If they are ready, you proceed immediately; if not, you suspend until the awaitable you are waiting on is ready.
std::future<std::expected<std::string>> load_data( std::string resource )
{
    auto handle = co_await open_resource(resource);
    while( auto line = co_await read_line(handle)) {
        if (std::optional<std::string> r = parse_data_from_line( line ))
            co_return *r;
    }
    co_return std::unexpected( resource_lacks_data(resource) );
}
load_data is a coroutine that produces a std::future which becomes ready when the named resource has been opened and we have managed to parse to the point where we found the data requested.
open_resource and read_line are probably async coroutines that open a file and read lines from it. The co_await connects the suspended and ready states of load_data to their progress.
C++ coroutines are much more flexible than this, as they were implemented as a minimal set of language features on top of user-space types. The user-space types effectively define what co_return, co_await and co_yield mean -- I've seen people use it to implement monadic optional expressions, such that a co_await on an empty optional automatically propagates the empty state to the outer optional:
modified_optional<int> add( modified_optional<int> a, modified_optional<int> b ) {
    co_return (co_await a) + (co_await b);
}
instead of
std::optional<int> add( std::optional<int> a, std::optional<int> b ) {
    if (!a) return std::nullopt;
    if (!b) return std::nullopt;
    return *a + *b;
}
A coroutine is like a C function that has multiple return statements and, when called a second time, does not start execution at the beginning of the function but at the first instruction after the previously executed return. This execution location is saved together with all automatic variables that would live on the stack in non-coroutine functions.
A previous experimental coroutine implementation from Microsoft did use copied stacks, so you could even return from deeply nested functions. But this version was rejected by the C++ committee. You can get that kind of implementation, for example, with the Boost.Fiber library.
Coroutines are supposed to be (in C++) functions that are able to "wait" for some other routine to complete, and to provide whatever is needed for the suspended, paused, waiting routine to go on. The feature that is most interesting to C++ folks is that coroutines would ideally take no stack space. C# can already do something like this with await and yield, but C++ might have to be rebuilt to get it in.
Concurrency is heavily focused on the separation of concerns, where a concern is a task that the program is supposed to complete. This separation of concerns may be accomplished by a number of means, usually by delegation of some sort. The idea of concurrency is that a number of processes could run independently (separation of concerns), and a 'listener' would direct whatever is produced by those separated concerns to wherever it is supposed to go. This is heavily dependent on some sort of asynchronous management. There are a number of approaches to concurrency, including aspect-oriented programming and others. C# has the 'delegate' operator, which works quite nicely.
Parallelism sounds like concurrency and may be involved, but it is actually a physical construct involving many processors arranged in a more or less parallel fashion, with software that is able to direct portions of code to different processors, where they will be run, and the results will be received back synchronously.

C++1z coroutine threading context and coroutine scheduling

Per this latest C++ TS: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4628.pdf, and based on the understanding of C# async/await language support, I'm wondering what is the "execution context" (terminology borrowed from C#) of the C++ coroutines?
My simple test code in Visual C++ 2017 RC reveals that coroutines seem to always execute on a thread-pool thread, and little control is given to the application developer over which threading context the coroutines are executed in - e.g. could an application force all the coroutines (with the compiler-generated state machine code) to be executed on the main thread only, without involving any thread-pool thread?
In C#, SynchronizationContext is a way to specify the "context" where all the coroutine "halves" (compiler generated state machine code) will be posted and executed, as illustrated in this post: https://blogs.msdn.microsoft.com/pfxteam/2012/01/20/await-synchronizationcontext-and-console-apps/, while the current coroutine implementation in Visual C++ 2017 RC seems to always rely on the concurrency runtime, which by default executes the generated state machine code on a thread pool thread. Is there a similar concept of synchronization context that the user application can use to bind coroutine execution to a specific thread?
Also, what's the current default "scheduler" behavior of the coroutines as implemented in Visual C++ 2017 RC? i.e. 1) how a wait condition is exactly specified? and 2) when a wait condition is satisfied, who invokes the "bottom half" of the suspended coroutine?
My (naive) speculation regarding Task scheduling in C# is that C# "implements" the wait condition purely by task continuation - a wait condition is synthesized by a TaskCompletionSource owned task, and any code logic that needs to wait will be chained as a continuation to it, so if the wait condition is satisfied, e.g. if a full message is received from the low level network handler, it does TaskCompletionSource.SetValue, which transitions the underlying task to the completed state, effectively allowing the chained continuation logic to start execution (putting the task into the ready state/list from the previous created state) - In C++ coroutine, I'm speculating that std::future and std::promise would be used as similar mechanism (std::future being the task, while std::promise being the TaskCompletionSource, and the usage is surprisingly similar too!) - so does the C++ coroutine scheduler, if any, relies on some similar mechanism to carry out the behavior?
[EDIT]: after doing some further research, I was able to code a very simple yet very powerful abstraction called awaitable that supports single threaded and cooperative multitasking, and features a simple thread_local based scheduler, which can execute coroutines on the thread the root coroutine is started. The code can be found from this github repo: https://github.com/llint/Awaitable
Awaitable is composable in a way that it maintains correct invocation ordering at nested levels, and it features primitive yielding, timed wait, and setting ready from somewhere else, and very complex usage pattern can be derived from this (such as infinite looping coroutines that only get woken up when certain events happen), the programming model follows C# Task based async/await pattern closely. Please feel free to give your feedbacks.
The opposite!
C++ coroutines are all about control. The key point here is the void await_suspend(std::experimental::coroutine_handle<> handle) function.
Every co_await expects an awaitable type. In a nutshell, an awaitable type is a type which provides these three functions:
bool await_ready() - should the program halt the execution of the coroutine?
void await_suspend(handle) - the program passes you a continuation context for that coroutine frame. If you activate the handle (for example, by calling the operator() that the handle provides), the current thread resumes the coroutine immediately.
T await_resume() - tells the thread which resumes the coroutine what to do when resuming the coroutine and what to return from co_await.
So when you call co_await on an awaitable type, the program asks the awaitable whether the coroutine should be suspended (if await_ready returns false) and, if so, you get a coroutine handle with which you can do whatever you like.
For example, you can pass the coroutine handle to a thread pool. In this case a thread-pool thread will resume the coroutine.
You can pass the coroutine handle to a plain std::thread - your own created thread will resume the coroutine.
You can attach the coroutine handle to a derived class of OVERLAPPED and resume the coroutine when the asynchronous IO finishes.
As you can see, you can control where and when the coroutine is suspended and resumed by managing the coroutine handle passed to await_suspend. There is no "default scheduler" - how you implement your awaitable type decides how the coroutine is scheduled.
So, what happens in VC++? Unfortunately, std::future still doesn't have a then function, so you can't pass the coroutine handle to a std::future. If you await on a std::future, the program will just open a new thread. Look at the source code given by the future header:
template<class _Ty>
void await_suspend(future<_Ty>& _Fut,
        experimental::coroutine_handle<> _ResumeCb)
{ // change to .then when future gets .then
    thread _WaitingThread([&_Fut, _ResumeCb]{
        _Fut.wait();
        _ResumeCb();
    });
    _WaitingThread.detach();
}
So why did you see a Win32 thread-pool thread if the coroutines are launched in a regular std::thread? Because it wasn't the coroutine. std::async calls concurrency::create_task behind the scenes, and a concurrency::task is launched on the Win32 thread pool by default. After all, the whole purpose of std::async is to launch the callable on another thread.

Why should I use std::async?

I'm trying to explore all the options of the new C++11 standard in depth. While using std::async and reading its definition, I noticed two things, at least under Linux with GCC 4.8.1:
it's called async, but it has a really "sequential behaviour": basically, on the line where you call the future associated with your async function foo, the program blocks until the execution of foo is completed.
it depends on the exact same external library as other, better, non-blocking solutions, which means pthread: if you want to use std::async you need pthread.
at this point it's natural to ask: why choose std::async over even a simple set of functors? It's a solution that doesn't even scale; the more futures you call, the less responsive your program will be.
Am I missing something? Can you show an example that is guaranteed to be executed in an async, non-blocking way?
it's called async, but it has a really "sequential behaviour",
No, if you use the std::launch::async policy then it runs asynchronously in a new thread. If you don't specify a policy it might run in a new thread.
basically, on the line where you call the future associated with your async function foo, the program blocks until the execution of foo is completed.
It only blocks if foo hasn't completed, but if it was run asynchronously (e.g. because you use the std::launch::async policy) it might have completed before you need it.
it depends on the exact same external library as other, better, non-blocking solutions, which means pthread: if you want to use std::async you need pthread.
Wrong, it doesn't have to be implemented using Pthreads (and on Windows it isn't, it uses the ConcRT features.)
at this point it's natural to ask: why choose std::async over even a simple set of functors?
Because it guarantees thread-safety and propagates exceptions across threads. Can you do that with a simple set of functors?
It's a solution that doesn't even scale; the more futures you call, the less responsive your program will be.
Not necessarily. If you don't specify the launch policy then a smart implementation can decide whether to start a new thread, or return a deferred function, or return something that decides later, when more resources may be available.
Now, it's true that with GCC's implementation, if you don't provide a launch policy then with current releases it will never run in a new thread (there's a bugzilla report for that) but that's a property of that implementation, not of std::async in general. You should not confuse the specification in the standard with a particular implementation. Reading the implementation of one standard library is a poor way to learn about C++11.
Can you show an example that is guaranteed to be executed in an async, non-blocking way?
This shouldn't block:
auto fut = std::async(std::launch::async, doSomethingThatTakesTenSeconds);
auto result1 = doSomethingThatTakesTwentySeconds();
auto result2 = fut.get();
By specifying the launch policy you force asynchronous execution, and if you do other work while it's executing then the result will be ready when you need it.
If you need the result of an asynchronous operation, then you have to block, no matter what library you use. The idea is that you get to choose when to block and, hopefully, when you do that, you block for a negligible time because all the work has already been done.
Note also that std::async can be launched with policies std::launch::async or std::launch::deferred. If you don't specify it, the implementation is allowed to choose, and it could well choose to use deferred evaluation, which would result in all the work being done when you attempt to get the result from the future, resulting in a longer block. So if you want to make sure that the work is done asynchronously, use std::launch::async.
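A small sketch (the function names are illustrative) of the observable difference between the two policies: a deferred task runs lazily on whichever thread calls get(), while an async task runs eagerly on a separate thread.

```cpp
#include <future>
#include <thread>

// With launch::deferred, the callable runs on the thread that calls get().
std::thread::id run_deferred() {
    auto f = std::async(std::launch::deferred,
                        [] { return std::this_thread::get_id(); });
    return f.get();  // the lambda executes here, on the calling thread
}

// With launch::async, the callable runs eagerly on a new thread.
std::thread::id run_async() {
    auto f = std::async(std::launch::async,
                        [] { return std::this_thread::get_id(); });
    return f.get();  // the lambda already ran (or runs) on another thread
}
```

Calling run_deferred() returns the caller's own thread id; run_async() returns a different one.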
I think your problem is with std::future saying that it blocks on get. It only blocks if the result isn't already ready.
If you can arrange for the result to be already ready, this isn't a problem.
There are many ways to know that the result is already ready. You can poll the future and ask it (relatively simple), you could use locks or atomic data to relay the fact that it is ready, you could build up a framework to deliver "finished" future items into a queue that consumers can interact with, you could use signals of some kind (which is just blocking on multiple things at once, or polling).
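The polling option is the simplest: wait_for with a zero timeout reports the future's state without blocking, so the final get() is guaranteed not to wait. A minimal sketch (the function name is illustrative):

```cpp
#include <chrono>
#include <future>

// Poll the future between chunks of other work; get() only after it is ready.
int poll_then_get() {
    auto fut = std::async(std::launch::async, [] { return 42; });
    // A zero-duration wait_for checks readiness without blocking.
    while (fut.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        // ...do other useful work between polls...
    }
    return fut.get();  // the result is ready, so this does not block
}
```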
Or, you could finish all the work you can do locally, and then block on the remote work.
As an example, imagine a parallel recursive merge sort. It splits the array into two chunks, then does an async sort on one chunk while sorting the other chunk. Once it is done sorting its half, the originating thread cannot progress until the second task is finished. So it does a .get() and blocks. Once both halves have been sorted, it can then do a merge (in theory, the merge can be done at least partially in parallel as well).
This task behaves like a linear task to those interacting with it on the outside -- when it is done, the array is sorted.
We can then wrap this in a std::async task, and have a future sorted array. If we want, we could add in a signally procedure to let us know that the future is finished, but that only makes sense if we have a thread waiting on the signals.
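The merge sort described above can be sketched as follows (the function name and the depth cutoff are illustrative choices, not from the answer): one half is sorted via std::async while the calling thread sorts the other half, and .get() blocks only if the asynchronous half is not yet done.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Parallel recursive merge sort: spawn at most 2^depth async tasks.
void parallel_merge_sort(std::vector<int>& v,
                         std::size_t lo, std::size_t hi, int depth = 2) {
    if (hi - lo < 2) return;
    std::size_t mid = lo + (hi - lo) / 2;
    if (depth > 0) {
        // Sort the left half on another thread while we sort the right half.
        auto left = std::async(std::launch::async,
                               [&] { parallel_merge_sort(v, lo, mid, depth - 1); });
        parallel_merge_sort(v, mid, hi, depth - 1);
        left.get();  // blocks only if the left half is not finished yet
    } else {
        parallel_merge_sort(v, lo, mid, 0);
        parallel_merge_sort(v, mid, hi, 0);
    }
    // Both halves are sorted; merge them in place.
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}
```

From the outside this behaves like an ordinary sort: when the call returns, the array is sorted.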
In the reference: http://en.cppreference.com/w/cpp/thread/async
If the async flag is set (i.e. policy & std::launch::async != 0), then async executes the function f on a separate thread of execution as if spawned by std::thread(f, args...), except that if the function f returns a value or throws an exception, it is stored in the shared state accessible through the std::future that async returns to the caller.
It is a nice property to keep a record of exceptions thrown.
http://www.cplusplus.com/reference/future/async/
There are three launch policies:
launch::async
launch::deferred
launch::async | launch::deferred
By default, launch::async | launch::deferred is passed to std::async.