Control degree of parallelism with std::async - c++

Is there a way to explicitly set/limit the degree of parallelism (= the number of separate threads) used by std::async and related classes?
Perusing the thread support library hasn’t turned up anything promising.
As far as I could figure out, std::async implementations (usually?) use a thread pool internally. Is there a standardised API to control this?
For background: I’m in a setting (shared cluster) where I have to manually limit the number of cores used. If I fail to do this, the load sharing scheduler throws a fit and I’m penalised. In particular, std::thread::hardware_concurrency() holds no useful information, since the number of physical cores is irrelevant for the constraints I’m under.
Here’s a relevant piece of code (which, in C++17 with the parallelism TS, would probably be written using a parallel std::transform):
auto read_data(std::string const&) -> std::string;

auto multi_read_data(std::vector<std::string> const& filenames, int ncores = 2) -> std::vector<std::string> {
    auto futures = std::vector<std::future<std::string>>{};
    // Haha, I wish.
    std::thread_pool::set_max_parallelism(ncores);
    for (auto const& filename : filenames) {
        futures.push_back(std::async(std::launch::async, read_data, filename));
    }
    auto ret = std::vector<std::string>(filenames.size());
    std::transform(futures.begin(), futures.end(), ret.begin(),
                   [](std::future<std::string>& f) { return f.get(); });
    return ret;
}
From a design point of view I’d have expected the std::execution::parallel_policy class (from the parallelism TS) to allow specifying that (in fact, this is how I did it in the framework I designed for my master’s thesis). But this doesn’t seem to be the case.
Ideally I’d like a solution for C++11 but if there’s one for later versions I would still like to know about it (though I can’t use it).

No. std::async is opaque, and you have no control over its use of threads, thread pools, or anything else. As a matter of fact, you do not even have any guarantee that it will use a thread at all: it might just as well execute in the same thread (potentially; note @T.C.'s comment below), and such an implementation would still be conformant.
The C++ threading library was never meant to handle fine-tuning of OS- or hardware-specific thread management, so I am afraid that in your case you will have to code the proper support yourself, potentially using OS-provided thread control primitives.
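For the code in the question, one C++11-only workaround is to invert the loop: launch exactly ncores tasks and have each walk a strided slice of the input. A minimal sketch, assuming (as mainstream implementations do) that each std::launch::async call gets its own thread:

#include <future>
#include <string>
#include <vector>

auto read_data(std::string const&) -> std::string;

auto multi_read_data(std::vector<std::string> const& filenames, int ncores = 2)
    -> std::vector<std::string> {
    auto ret = std::vector<std::string>(filenames.size());
    auto workers = std::vector<std::future<void>>{};
    for (int c = 0; c < ncores; ++c) {
        // Worker c handles indices c, c + ncores, c + 2 * ncores, ...
        workers.push_back(std::async(std::launch::async, [&, c] {
            for (auto i = static_cast<std::size_t>(c); i < filenames.size(); i += ncores)
                ret[i] = read_data(filenames[i]);
        }));
    }
    for (auto& w : workers)
        w.get(); // join all workers and propagate any exception
    return ret;
}

Each worker writes to disjoint elements of ret, so no further synchronisation is needed.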

As other people have noted, std::async doesn't let you do this.
Yet (but see the May 2022 update at the end of the answer)
You're describing one of the simpler use-cases of Executors which are currently still bouncing around the design-space of C++ Standardisation, specifically right now in Study Group 1: Concurrency.
Since reading WG21 standards proposals can be a slog, the authors have helpfully linked to both a prototype header-only reference implementation and some example code.
It even includes a static thread pool, and an example of almost exactly what you want:
async_1.cpp
#include <experimental/thread_pool>
#include <iostream>
#include <tuple>

namespace execution = std::experimental::execution;
using std::experimental::static_thread_pool;

template <class Executor, class Function>
auto async(Executor ex, Function f)
{
    return execution::require(ex, execution::twoway).twoway_execute(std::move(f));
}

int main()
{
    static_thread_pool pool{1};
    auto f = async(pool.executor(), []{ return 42; });
    std::cout << "result is " << f.get() << "\n";
}
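Note that static_thread_pool pool{1} is what bounds the parallelism in this example; constructing the pool with ncores instead would give exactly the throttle the question asks for.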
Thank you to @jared-hoberock for pointing me at P0668R0 as the much simpler follow-up to P0443R1, which I had referenced in an earlier version of this answer.
This simplification has been applied, and now there's both a paper describing the rationale (P0761R0), and a much simpler version of the standard wording in P0443R2.
As of July 2017, the only actual guess I've seen on delivery of this is: Michael Wong, editor of the Concurrency TS (the standardisation vehicle for Executors), feels "confident that it will make it into C++20".
I'm still getting Stack Overflow Points™ for this answer, so here's a May 2022 update:
Executors didn't land in C++20.
"A Unified Executors Proposal for C++" reached revision 14 (P0443R14) in 2020, and a new paper std::execution (P2300R5) is proposed as a follow-on; See sections 1.8 and 1.9 for the reasons for the new paper and differences from P0443.
Notably:
A specific thread pool implementation is omitted, as per LEWG direction.
The "Do in a thread-pool" example from std::execution looks like:
using namespace std::execution;

scheduler auto sch = thread_pool.scheduler();

sender auto begin = schedule(sch);
sender auto hi = then(begin, []{
    std::cout << "Hello world! Have an int.";
    return 13;
});
sender auto add_42 = then(hi, [](int arg) { return arg + 42; });

auto [i] = this_thread::sync_wait(add_42).value();
There's a lot to process here. And the last decade of work has pretty much abandoned "std::async and related classes", so perhaps the actual answer to this question is no longer
Yet
but
No, and there never will be. There'll be a different model where you can do that instead.
cf. the Motivation section of P2300R5:
std::async/std::future/std::promise, C++11’s intended exposure for asynchrony, is inefficient, hard to use correctly, and severely lacking in genericity, making it unusable in many contexts.
P2453R0 records the rough feelings of the LEWG participants around this current approach, and also how it interacts with the existing Networking TS, i.e. Asio, which has its own concurrency model. My reading of the polls and comments says neither is likely to land in C++23.

Related

Implement Asynchronous Lazy Generator in C++

My intention is to use a generic interface for iterating over files from a variety of I/O sources. For example, I might want an iterator that, authorization permitting, will lazily open every file on my file system and return the open file handle. I'd then want to use the same interface for iterating over, perhaps, objects from an AWS S3 bucket. In this latter case, the iterator would download each object/file from S3 to the local file system, then open that file, and again return a file handle. Obviously the implementation behind both iterator interfaces would be very different.
I believe the three most important design goals are these:
For each iter++ invocation, a std::future or PPL pplx::task is returned representing the requested file handle. I need the ability to do the equivalent of the PPL choice(when_any), because I expect to have multiple iterators running simultaneously.
The custom iterator implementation must be durable / restorable. That is, it periodically records where it is in a file system scan (or S3 bucket scan, etc.) so that it can attempt to resume scanning from the last known position in the event of an application crash and restart.
Best effort to not go beyond C++11 (and possibly C++14).
I assume I'd make the STL input_iterator my point of departure for an interface. After all, I see this 2014 SO post with a simple example. It does not involve IO, but I see another article from 2001 that allegedly does incorporate IO into a custom STL iterator. So far so good.
Where I start to get concerned is when I read an article like "Generator functions in C++". Ack! That article gives me the impression that I can't achieve my intent of creating a generator function disguised as an iterator, at least not without waiting for C++20. Likewise, this other 2016 SO post makes it sound like creating generator functions in C++ is a hornet's nest.
While the implementation of my custom iterators will be complex, perhaps what those last two links were tackling was something beyond what I'm trying to achieve. In other words, perhaps my plan is not flawed? I'd like to know what barriers I'm up against if I set out to build a lazy-generator implementation behind a custom input_iterator. If I should be using something else, like Boost iterator_facade, I'd appreciate a bit of explanation around "why". Also, I'd like to know whether what I'm doing has already been implemented elsewhere. Perhaps the PPL, which I've only just started to learn, already has a solution for this?
p.s. I gave the example of an S3 iterator that lazily downloads each requested file and then returns an open file handle. Yes, I know this means the iterator produces a side effect, which normally I would want to avoid. However, for my intended purpose, I'm not sure of a cleaner way to do this.
Have you looked at the Coroutines TS? It is coming with C++20 and allows what you are looking for.
Some compilers (GCC 10, MSVC) already have some support.
Specific library features on top of standard coroutines that may interest you:
generator<T>
cppcoro::generator<const std::uint64_t> fibonacci()
{
    std::uint64_t a = 0, b = 1;
    while (true)
    {
        co_yield b;
        auto tmp = a;
        a = b;
        b += tmp;
    }
}

void usage()
{
    for (auto i : fibonacci())
    {
        if (i > 1'000'000) break;
        std::cout << i << std::endl;
    }
}
A generator represents a coroutine type that produces a sequence of values of type T, where the values are produced lazily and synchronously.
The coroutine body is able to yield values of type T using the co_yield keyword. Note, however, that the coroutine body is not able to use the co_await keyword; values must be produced synchronously.
async_generator<T>
An async_generator represents a coroutine type that produces a sequence of values of type T, where the values are produced lazily and possibly asynchronously.
The coroutine body is able to use both co_await and co_yield expressions.
Consumers of the generator can use a for co_await range-based for-loop to consume the values.
Example
cppcoro::async_generator<int> ticker(int count, threadpool& tp)
{
    for (int i = 0; i < count; ++i)
    {
        co_await tp.delay(std::chrono::seconds(1));
        co_yield i;
    }
}

cppcoro::task<> consumer(threadpool& tp)
{
    auto sequence = ticker(10, tp);
    for co_await(std::uint32_t i : sequence)
    {
        std::cout << "Tick " << i << std::endl;
    }
}
Sidenote: Boost Asio has had experimental support for the Coroutines TS for several releases, so if you want you can combine them.

Safe cross platform coroutines

All coroutine implementations I've encountered use assembly or inspect the contents of jmp_buf. The problem with this is that it is inherently not cross-platform.
I think the following implementation doesn't go off into undefined behavior or rely on implementation details. But I've never encountered a coroutine written like this.
Is there some inherent flaw in using long jump with threads?
Is there some hidden gotcha in this code?
#include <setjmp.h>
#include <condition_variable>
#include <mutex>
#include <thread>

class Coroutine
{
public:
    Coroutine( void ) :
        m_done( false ),
        m_thread( [&](){ this->start(); } )
    { }

    ~Coroutine( void )
    {
        std::lock_guard<std::mutex> lock( m_mutex );
        m_done = true;
        m_condition.notify_one();
        m_thread.join();
    }

    void start( void )
    {
        if( setjmp( m_resume ) == 0 )
        {
            std::unique_lock<std::mutex> lock( m_mutex );
            m_condition.wait( lock, [&](){ return m_done; } );
        }
        else
        {
            routine();
            longjmp( m_yield, 1 );
        }
    }

    void resume( void )
    {
        if( setjmp( m_yield ) == 0 )
        {
            longjmp( m_resume, 1 );
        }
    }

    void yield( void )
    {
        if( setjmp( m_resume ) == 0 )
        {
            longjmp( m_yield, 1 );
        }
    }

private:
    virtual void routine( void ) = 0;

    jmp_buf m_resume;
    jmp_buf m_yield;
    bool m_done;
    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::thread m_thread;
};
UPDATE 2013-05-13 These days there is Boost Coroutine (built on Boost Context, which is not implemented on all target platforms yet, but likely to be supported on all major platforms sooner rather than later).
I don't know whether stackless coroutines fit the bill for your intended use, but I suggest you have a look at them here:
Boost Asio: The Proactor Design Pattern: Concurrency Without Threads
Asio also has a co-procedure 'emulation' model based on a single (IIRC) simple preprocessor macro, combined with some cunningly designed template facilities, that comes eerily close to compiler support for stackless coroutines.
The sample HTTP Server 4 is an example of the technique.
The author of Boost Asio (Kohlhoff) explains the mechanism and the sample on his Blog here: A potted guide to stackless coroutines
Be sure to look for the other posts in that series!
There is a C++ standard proposal for coroutine support - N3708 which is written by Oliver Kowalke (who is an author of Boost.Coroutine) and Goodspeed.
I suppose this would be the ultimate clean solution eventually (if it happens…)
Because we don't have stack-switching support from the C++ compiler, coroutines currently need low-level hacks (usually at the assembly level, or setjmp/longjmp), and that's outside the abstraction range of C++. The implementations are therefore fragile and need help from the compiler to be robust.
For example, it's really hard to set the stack size of a coroutine context, and if you overflow the stack, your program will be corrupted silently, or crash if you're lucky. Segmented stacks seem to help with this, but again, they need compiler-level support.
Once it becomes standard, compiler writers will take care of it. But until that day, Boost.Coroutine is the only practical solution in C++ to me.
In C, there's libtask written by Russ Cox (who is a member of Go team). libtask works pretty well, but doesn't seem to be maintained anymore.
P.S. If someone knows how to support the standard proposal, please let me know. I really support this proposal.
There is no generalized cross-platform way of implementing co-routines. Although some implementations can fudge co-routines using setjmp/longjmp, such practices are not standards-compliant. If routine1 uses setjmp() to create jmp_buf1, and then calls routine2() which uses setjmp() to create jmp_buf2, any longjmp() to jmp_buf1 will invalidate jmp_buf2 (if it hasn't been invalidated already).
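To make that failure mode concrete, here is a small illustrative sketch (the routine names are made up; nothing here actually performs the forbidden jump):

#include <csetjmp>

static std::jmp_buf buf1, buf2;

void routine2() {
    if (setjmp(buf2) == 0) {
        // Suppose a coroutine "switch" did longjmp(buf1, 1) right here:
        // that would abandon this frame, and buf2 would be invalidated,
        // exactly as described above. (Left commented out: doing it for
        // real is undefined behaviour.)
        // longjmp(buf1, 1);
    }
}

void routine1() {
    if (setjmp(buf1) == 0) {
        routine2();
    }
}

int main() {
    routine1(); // runs straight through; the comments mark the hazard
}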
I've done my share of co-routine implementations on a wide variety of CPUs; I've always used at least some assembly code. It often doesn't take much (e.g. four instructions for a task-switch on the 8x51) but using assembly code can help ensure that a compiler won't apply creative optimizations that would break everything.
I don't believe you can fully implement co-routines with long jump. Co-routines are natively supported in WinAPI, where they are called fibers; see, for example, CreateFiber(). I don't think other operating systems have native co-routine support. If you look at the SystemC library, for which co-routines are a central part, they are implemented in assembly for each supported platform, except Windows. The GBL library also uses co-routines for event-driven simulation, based on Windows fibers. It's very easy to create hard-to-debug errors when trying to implement co-routines and event-driven designs, so I suggest using existing libraries, which are already thoroughly tested and offer higher-level abstractions for this concept.

Coroutines in C or C++?

It seems like coroutines are normally found in higher level languages.
There seem to be several different definitions of them as well. I am trying to find a way to have coroutines in C, specifically like the ones we have in Lua:
function foo()
    print("foo", 1)
    coroutine.yield()
    print("foo", 2)
end
There's no language level support for coroutines in either C or C++.
You could implement them using assembler or fibres, but the result would not be portable and in the case of C++ you'd almost certainly lose the ability to use exceptions and be unable to rely on stack unwinding for cleanup.
In my opinion you should either use a language the supports them or not use them - implementing your own version in a language that doesn't support them is asking for trouble.
There is a new (as of version 1.53.0) coroutine library in the Boost C++ library: http://www.boost.org/doc/libs/1_53_0/libs/coroutine/doc/html/index.html
I'm unaware of a C library--I came across this question looking for one.
Sorry - neither C nor C++ has support for coroutines. However, a simple search for "C coroutine" yields the following fascinating treatise on the problem: http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html, although you may find his solution a bit, um, impractical.
Nowadays, C++ provides coroutines natively as part of C++20.
Concerning the C language, they are not supported natively, but several libraries provide them. Some are not portable, as they rely on architecture-dependent assembly instructions, but others are portable, as they use standard library functions like setjmp()/longjmp() or getcontext()/setcontext()/makecontext()/swapcontext(). There are also some original propositions, like this one, which uses a C language trick based on Duff's device.
N.B.: On my side, I designed this library.
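To give a flavour of the Duff's-device trick mentioned above, here is a minimal sketch (the CO_* macro names are made up for illustration; real libraries such as protothreads are more elaborate). The coroutine keeps its resume point in a variable, and a switch jumps back to the last yield on re-entry:

#include <iostream>

// Hypothetical macros, protothreads-style: __LINE__ doubles as the case label.
#define CO_BEGIN(state) switch (state) { case 0:
#define CO_YIELD(state) do { state = __LINE__; return true; \
                             case __LINE__:; } while (0)
#define CO_END(state)   } return false

struct Counter {
    int line = 0; // resume point; 0 means "not started yet"
    int i = 0;    // "locals" must live outside the function body to survive yields
    bool next(int& out) {
        CO_BEGIN(line);
        for (i = 0; i < 3; ++i) {
            out = i;
            CO_YIELD(line); // returns true; the next call resumes right here
        }
        CO_END(line);       // returns false once the sequence is exhausted
    }
};

int main() {
    Counter c;
    int v = 0;
    while (c.next(v))
        std::cout << v << '\n'; // prints 0, 1, 2
}

The obvious limitations follow directly from the trick: there is no real stack (so no yielding from nested calls), and at most one yield may appear per source line.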
There's a bunch of coroutine libraries for C++. Here's one from RethinkDB.
There's also my header-only library, which is tailored to be used with callbacks. I've tried Boost coroutines but I don't use them yet because of an incompatibility with valgrind. My implementation uses ucontext.h and works fine under valgrind so far.
With "standard" coroutines you have to jump through some hoops to use them with callbacks. For example, here is how a working thread-safe (but leaking) Cherokee handler looks with Boost coroutines:
typedef coroutine<void()> coro_t;

auto lock = make_shared<std::mutex>();
coro_t* coro = new coro_t ([handler,buffer,&coro,lock](coro_t::caller_type& ca)->void {
    p1: ca(); // Pass the control back in order for the `coro` to initialize.
    coro_t* coro_ = coro; // Obtain a copy of the self-reference in order to call self from callbacks.
    cherokee_buffer_add (buffer, "hi", 2); handler->sent += 2;
    lock->lock(); // Prevents the thread from calling the coroutine while it still runs.
    std::thread later ([coro_,lock]() {
        //std::this_thread::sleep_for (std::chrono::milliseconds (400));
        lock->lock(); // Wait for the coroutine to cede before resuming it.
        (*coro_)(); // Continue from p2.
    }); later.detach();
    p2: ca(); // Relinquish control to `cherokee_handler_frople_step` (returning ret_eagain).
    cherokee_buffer_add (buffer, ".", 1); handler->sent += 1;
});
(*coro)(); // Back to p1.
lock->unlock(); // Now the callback can run.
and here is how it looks with mine:
struct Coro: public glim::CBCoro<128*1024> {
    cherokee_handler_frople_t* _handler; cherokee_buffer_t* _buffer;
    Coro (cherokee_handler_frople_t *handler, cherokee_buffer_t* buffer): _handler (handler), _buffer (buffer) {}
    virtual ~Coro() {}
    virtual void run() override {
        cherokee_buffer_add (_buffer, "hi", 2); _handler->sent += 2;
        yieldForCallback ([&]() {
            std::thread later ([this]() {
                //std::this_thread::sleep_for (std::chrono::milliseconds (400));
                invokeFromCallback();
            }); later.detach();
        });
        cherokee_buffer_add_str (_buffer, "."); _handler->sent += 1;
    }
};

c++ threads - parallel processing

I was wondering how to execute two processes in a dual-core processor in c++.
I know threads (or multi-threading) is not a built-in feature of c++.
There is threading support in Qt, but I did not understand anything from their reference. :(
So, does anyone know a simple way for a beginner to do it? Cross-platform support (like Qt) would be very helpful, since I am on Linux.
Try the Multithreading in C++0x part 1: Starting Threads as a 101. If your compiler does not have C++0x support, then stay with Boost.Thread.
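For reference, the C++0x/C++11 way of starting threads that the tutorial covers is only a few lines; a minimal sketch:

#include <iostream>
#include <thread>

void work(int id) {
    std::cout << "hello from thread " << id << "\n";
}

int main() {
    std::thread t1(work, 1);
    std::thread t2(work, 2);
    t1.join(); // wait for both threads before main returns
    t2.join();
}

(The two output lines may interleave, as with any unsynchronised use of std::cout.)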
Take a look at Boost.Thread. This is cross-platform and a very good library to use in your C++ applications.
What specifically would you like to know?
The POSIX thread (pthreads) library is probably your best bet if you just need a simple threading library, it has implementations both on Windows and Linux.
A guide can be found e.g. here. A Win32 implementation of pthreads can be downloaded here.
Edit: Didn't see you were on Linux. In that case I'm not 100% sure but I think the libraries are probably already bundled in with your GCC installation.
I'd recommend using the Boost libraries Boost.Thread instead. This will wrap platform specifics of Win32 and Posix, and give you a solid set of threading and synchronization objects. It's also in very heavy use, so finding help on any issues you encounter on SO and other sites is easy.
You can search for a free PDF book "C++-GUI-Programming-with-Qt-4-1st-ed.zip" and read Chapter 18 about Multi-threading in Qt.
Concurrent programming features supported by Qt includes (not limited to) the following:
Mutex
Read Write Lock
Semaphore
Wait Condition
Thread Specific Storage
However, be aware of the following trade-offs with Qt:
Performance penalties vs. native threading libraries. POSIX threads (pthreads) have been native to Linux since kernel 2.4, and may not substitute for <process.h> in the W32API in all situations.
Inter-thread communication in Qt is implemented with SIGNAL and SLOT constructs. These are NOT part of the C++ language and are implemented as macros which require proprietary code generators provided by Qt to be fully compiled.
If you can live with the above limitations, just follow these recipes for using QThread:
#include <QtCore>
Derive your own class from QThread. You must implement a public function run() that returns void to contain instructions to be executed.
Instantiate your own class and call start() to kick off a new thread.
Sample Code:
#include <QtCore>

class MyThread : public QThread {
public:
    void run() {
        // do something
    }
};

int main(int argc, char** argv) {
    MyThread t1, t2;
    t1.start(); // default implementation from QThread::start() is fine
    t2.start(); // another thread
    t1.wait();  // wait for thread to finish
    t2.wait();
    return 0;
}
As an important note: since C++11 (and thus in C++14), concurrent threading is available in the standard library:
#include <future>
#include <string>
#include <thread>

class Example
{
public:
    auto DoStuff() -> std::string
    {
        return "Doing Stuff";
    }

    auto DoStuff2() -> std::string
    {
        return "Doing Stuff 2";
    }
};

int main()
{
    Example EO;

    std::string (Example::*func_pointer)();
    func_pointer = &Example::DoStuff;
    std::future<std::string> thread_one = std::async(std::launch::async, func_pointer, &EO); // Launching upon declaring

    std::string (Example::*func_pointer_2)();
    func_pointer_2 = &Example::DoStuff2;
    std::future<std::string> thread_two = std::async(std::launch::deferred, func_pointer_2, &EO);

    thread_two.get(); // Launching upon calling
}
Both std::async (std::launch::async, std::launch::deferred) and std::thread are fully compatible with Qt, and in some cases may work better across different OS environments.
For parallel processing, see this.

Fastest Multi-Reader / Single Writer Protection for Shared Resources - C++

I would like confirmation that my approach is extremely fast and appropriate for cross platform protection of a shared resource for a mostly multiple reader, single writer approach using C++. It favors writers such that when they enter all current threads are allowed to finish, but all new threads of any type must wait. The reverse of these two functions should be obvious.
The reading I've done suggests that Boost's shared_mutex and other rwlock types are not implemented very well and should be avoided. In fact, shared_mutex will not make it into C++0x, I take it. See this response by Anthony Williams.
It seems it might even be possible to write to an integer and not need locking of any kind, if it is aligned correctly. There are so many articles out there; is there any good reading on this subject, so I don't have to sort the wheat from the chaff?
void AquireReadLock(void)
{
    mutex::enter();
    if(READ_STATE == true)
    {
        iReaders++;
        mutex::leave();
        return;
    }
    else
    {
        mutex::leave();
        sleep(1);
        AquireReadLock();
        return;
    }
}
}
void AquireWriteLock(void)
{
    mutex::enter();
    READ_STATE = false;
    if (iReaders != 0)
    {
        mutex::leave();
        sleep(1);
        AquireWriteLock();
        return;
    }
    else
    {
        mutex::leave();
        return;
    }
}
The decision to leave shared_mutex was made independently of any quality issues. The decision was part of the "Kona compromise" made in the Fall of 2007. This compromise was aimed at reducing the feature set of C++0x so as to ship a standard by 2009. It didn't work, but nevertheless, that is the rationale.
shared_mutex will be discussed for inclusion in a technical report (i.e. tr2) after the committee has completed C++0x. The chairman of the library working group has already contacted me on this very subject. That is not to say that shared_mutex will be in tr2, just that it will be discussed.
Your implementations of AquireReadLock and AquireWriteLock have the disadvantage that they eat a stack frame at the rate of one per second when under contention. And when contention is over, they delay up to a second before reacting. This makes them both stack-hungry and poorly performing (sorry).
If you are interested, there is a full description and implementation of shared_mutex here:
http://home.roadrunner.com/~hinnant/mutexes/locking.html
The code is not part of boost, but does carry the boost open source license. Feel free to use it, just keep the copyright with the source code. No other strings attached. Here is its analog to your AquireReadLock:
void
shared_mutex::lock_shared()
{
    std::unique_lock<mutex_t> lk(mut_);
    while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_)
        gate1_.wait(lk);
    count_t num_readers = (state_ & n_readers_) + 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
}
And here is its analog to your AquireWriteLock:
void
shared_mutex::lock()
{
    std::unique_lock<mutex_t> lk(mut_);
    while (state_ & write_entered_)
        gate1_.wait(lk);
    state_ |= write_entered_;
    while (state_ & n_readers_)
        gate2_.wait(lk);
}
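For completeness, here is a sketch of the matching unlock paths in the same two-gate style (reconstructed for illustration; see the link above for the real code). The last reader to leave wakes the writer waiting at gate2_, and a departing writer reopens gate1_:

void
shared_mutex::unlock()
{
    std::lock_guard<mutex_t> _(mut_);
    state_ = 0;          // clear write_entered_ and the reader count
    gate1_.notify_all(); // readers and writers may compete again
}

void
shared_mutex::unlock_shared()
{
    std::lock_guard<mutex_t> _(mut_);
    count_t num_readers = (state_ & n_readers_) - 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
    if (state_ & write_entered_)
    {
        if (num_readers == 0)
            gate2_.notify_one(); // last reader out; the writer may proceed
    }
    else
    {
        if (num_readers == n_readers_ - 1)
            gate1_.notify_one(); // a slot opened up under the reader cap
    }
}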
I consider this a well-tested and high-performing, fair read/write mutex implementation for C++. If you have ideas on how to improve it, I would welcome them.
I would like confirmation that my approach is extremely fast and appropriate for cross platform protection of a shared resource for a mostly multiple reader, single writer approach using C++.
Faster than what? Your question seems like a micro-optimization; any solution will require profiling to reach an adequate conclusion.
The reading I've done suggests that Boost's shared_mutex and other rwlock types are not implemented very well and should be avoided.
What are your sources for this statement?