Parallel programming with C++ async

Is there any way to set the maximum number of threads that can be created by the async function (from <future>)?
I prefer to use async/future.get because it can be translated to the sync/spawn multitasking model, which is common
in textbooks on algorithms (e.g. Cormen). I want to be able to obtain T[p] (the time to finish the program using p processors/threads).

Unfortunately no. std::async is rather notoriously limited in the control it provides you over how threads are created.
You might consider using a Boost thread pool instead. This is (somewhat counterintuitively) part of Boost.Asio, and uses an io_service object (renamed io_context in newer Boost releases), even when/if you're not actually using it for I/O.
With this it's fairly easy to control the number of threads used, including using only one.
Of course you could build your own thread pool class from the standard components. Certainly possible, but not an entirely trivial task.
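As a rough illustration of the "build your own" route, here is a minimal sketch of a fixed-size pool made only from standard components. The class name ThreadPool and its submit() member are hypothetical, chosen for this example; submit() plays the role of std::async and returns a std::future, so future.get() still maps onto sync.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool; name and interface are illustrative only.
class ThreadPool {
public:
    // Starts exactly n worker threads; no further threads are ever created.
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Lets the workers drain the queue, then joins them.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Queues f and returns a future for its result ("spawn").
    template <class F>
    auto submit(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutting down and nothing left
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

Constructing the pool with p workers and timing the whole computation gives one way to estimate T[p]. One caveat with this simple design: a task that calls get() on another task queued in the same pool blocks its worker, so deeply nested spawn/sync code can deadlock once every worker is waiting.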

Related

Implementing asynchronous delays

What would be a smart way to implement something like the following?
// Plain C function for example purposes.
void sleep_async(delay_t delay, void (* callback)(void *), void * data);
That is, a means of asynchronously executing a callback after a delay. POSIX, for example, has a few functions that do something like this, but they are mostly for asynchronous I/O (see this for what I mean). What interests me about those functions is how they are executed "as if" on a new thread, according to that manual page, where an implementation may choose to spawn "a single thread...to receive all notifications". I am aware that some may nonetheless choose to spawn a whole thread for each of them, and that stuff like this may require support from the OS itself, so this is just an example.
I already have a couple of ways I could implement this (e.g. priority queue of events sorted by wake time on a timer loop, with no need to start a thread at all), but I am wondering whether there already exists smart[er] or [more] complete implementations of what I want to accomplish. For example, maybe implementations of Task.Delay() from C♯ (and coroutines like it in other language environments) do something smart in minimizing the amount of thread spawning for getting asynchronous delays.
Why am I looking for something like this? As implied by the title, I'm looking for something asynchronous. The above signature is just a simple C example to illustrate roughly what POSIX does. I am implementing some C++20 coroutines for use with co_await and friends, with thread pools and whatnot. Scheduling anything that would end up synchronously waiting on something is probably a bad idea, as it would prevent otherwise free threads from doing any work. Spawning [and potentially immediately detaching] a new thread just to add in an asynchronous delay doesn't seem like a very smart idea, either. My timer loop idea could be okay, but that implies needing a predefined timer granularity, and overhead from the priority queue.
Edit
I neglected to mention any real set of target platforms, as a commenter mentioned. I don't expect to target anything outside the "usual" desktop platforms, so the quirks of embedded development are ignored. The way I plan to use asynchronous delays themselves this way does not necessarily require threading support (everything could just be on a timer loop), but threading will nonetheless be required and used in accord (namely thread pools on which coroutines would be scheduled).

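The simple but inefficient way would be to spawn a thread, have it sleep for delay, and then call the callback. This can be done in just a few lines using std::async():
// Capture by value so the task stays valid if the future is moved out of this scope.
auto delayed_call = std::async(std::launch::async, [=] {
    std::this_thread::sleep_for(delay);
    callback(data);
});
Note that the destructor of a future returned by std::async blocks until the task has finished, so delayed_call has to be kept alive somewhere for the call to remain asynchronous.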
As mentioned by Thomas Matthews, this requires support for threads. While it's fine for a one-off call, it's not efficient if you have many such delayed calls. Having a priority queue and an event loop or a dedicated thread to handle events in this queue, as you already mentioned, is probably the most efficient way to do it. If you are looking for a library that implements this, then have a look at boost::asio.
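The priority-queue-plus-dedicated-thread idea can be sketched with standard components as follows. The names TimerQueue and call_after are hypothetical, made up for this example. The single worker sleeps with wait_until() until the earliest deadline, so under this design no fixed timer granularity is needed, and one thread serves any number of pending delays.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical timer queue; one thread services all delayed callbacks.
class TimerQueue {
public:
    using Clock = std::chrono::steady_clock;

    TimerQueue() : worker_([this] { run(); }) {}

    // Stops the worker; callbacks that are not yet due are dropped.
    ~TimerQueue() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Schedules cb to run on the worker thread once delay has elapsed.
    void call_after(Clock::duration delay, std::function<void()> cb) {
        {
            std::lock_guard<std::mutex> lock(m_);
            events_.push(Event{Clock::now() + delay, std::move(cb)});
        }
        cv_.notify_one();  // the new event may now be the earliest one
    }

private:
    struct Event {
        Clock::time_point due;
        std::function<void()> cb;
        bool operator>(const Event& other) const { return due > other.due; }
    };

    void run() {
        std::unique_lock<std::mutex> lock(m_);
        while (!stop_) {
            if (events_.empty()) {
                cv_.wait(lock);  // woken by call_after() or the destructor
            } else if (events_.top().due <= Clock::now()) {
                auto cb = events_.top().cb;  // top() is const, so copy
                events_.pop();
                lock.unlock();  // never run user code while holding the lock
                cb();
                lock.lock();
            } else {
                auto due = events_.top().due;  // copy: the heap may change
                cv_.wait_until(lock, due);
            }
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    // Min-heap ordered by due time.
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the other members exist first
};
```

In a coroutine setting, the callback would typically just hand the suspended coroutine back to a thread pool rather than do real work on the timer thread, since a slow callback delays every later event.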
As for using C++20 coroutines, I do not think that this will make something like your sleep_async() any easier. However, an event loop could be implemented on top of it.

A smart way? You mean really, really smart? That would be my own implementation, of course. You know about POSIX timers, you probably know about Linux timers and the various hacks involving std::thread. But, more seriously, what you require sounds mostly to the tune of something like libeio or libuv; both of these provide callbacks. It depends on what you can afford in binary size and whether you like the particular abstractions a library offers. The two libraries seem to be evolved versions of libevent and libev, libevent being the progenitor of them all.
Creating a std::thread instance involves allocating a stack for the new thread, at the very least, which is by no means cheap.

Node C++ module vs libuv thread pool size

I've written a Nodejs C++ module that makes use of NAN's AsyncWorker to expose async module functionality. Works great. However, I understand that AsyncWorker makes use of libuv's thread pool, which defaults to just 4 threads.
While this (or a #-of-cores based limitation) might make sense for CPU-heavy functions, some of my exposed functions may run relatively long, even though they don't use the CPU (network activity, etc). Therefore the thread pool might get all used up even though no computation-intensive work is going on.
The easy solution is to increase the thread pool size (UV_THREADPOOL_SIZE). However, I am concerned that this thread pool is used for other things as well, which might suffer from a performance hit due to too much parallelization (the libuv documentation states, "The threadpool is global and shared across all event loops...").
Is my concern valid? Is there a way to make use of a separate, larger thread pool only for certain AsyncWorkers that are long-running but not CPU-intensive, while leaving the common thread pool untouched?

What libraries should I use for better OCaml Threading?

I have asked a related question before Why OCaml's threading is considered as `not enough`?
No matter how "bad" ocaml's threading is, I notice some libraries say they can do real threading.
For example, Lwt
Lwt offers a new alternative. It provides very light-weight
cooperative threads; ``launching'' a thread is a very fast operation,
it does not require a new stack, a new process, or anything else.
Moreover context switches are very fast. In fact, it is so easy that
we will launch a thread for every system call. And composing
cooperative threads will allow us to write highly asynchronous
programs.
Also, Jane Street's async_core provides similar things, if I am right.
But I am quite confused. Do Lwt or async_core provide threading like Java threading?
If I use them, can I utilise multiple CPUs?
In what way can I get "real threading" (just like in Java) in OCaml?
Edit
I am still confused.
Let me add a scenario:
I have a server (16 CPU cores) and a server application.
What the server application does is:
It listens for requests
For each request, it starts a computational task (let's say it takes 2 minutes to finish)
When each task finishes, the task will either return the result to the main thread or just send the result to the client directly
In Java, it is very easy. I create a thread pool, then for each request, I create a thread in that pool. That thread will run the computational task. This is mature in Java and it can utilize the 16 CPU cores. Am I right?
So my question is: can I do the same thing in OCaml?

The example of a parallelized server that you cite is one of those embarrassingly parallel problems that are well solved with a simple multiprocessing model, using fork. This has been doable in OCaml for decades, and yes, you will get an almost linear speedup using all the cores of your machine if you need to.
To do that using the simple primitives of the standard library, see this Chapter of the online book "Unix system programming in OCaml" (first released in 2003), and/or this chapter of the online book "Developing Applications with OCaml" (first released in 2000).
You may also want to use higher-level libraries such as Gerd Stolpmann's OCamlnet library mentioned by rafix, which provides a lot of stuff from direct helper for the usual client/server design, to lower-level multiprocess communication libraries; see the documentation.
The library Parmap is also interesting, but maybe for a slightly different use case (it's more that you have a large array of data available all at the same time, that you want to process with the same function in parallel): it is a drop-in replacement for Array.map or List.map (or fold) that parallelizes computations.

The closest thing you will find to real (preemptive) threading is the built-in threading library. By that I mean that your programming model will be the same, but with 2 important differences:
OCaml's native threads are not lightweight like Java's.
Only a single thread executes at a time, so you cannot take advantage of multiple processors.
This makes OCaml's threads a pretty bad solution to either concurrency or parallelism, so in general people avoid using them. But they still do have their uses.
Lwt and Async are very similar and provide you with a different flavour of threading: a cooperative style. Cooperative threads differ from preemptive ones in that context switching between threads is explicit in the code and blocking calls are always apparent from the type signature. The cooperative threads provided are very cheap and so very well suited for concurrency, but again they will not help you with parallelism (due to the limitations of OCaml's runtime).
See this for a good introduction to cooperative threading: http://janestreet.github.io/guide-async.html
EDIT: for your particular scenario I would use Parmap. If the tasks are as computationally intensive as in your example, then the overhead of starting the processes from Parmap should be negligible.

Using asynchronous method vs thread wait

I have 2 versions of a function in a C++ library which do the same task. One is a synchronous function, and the other is asynchronous and allows a callback function to be registered.
Which of the strategies below is preferable in terms of memory use and performance?
1. Call the synchronous function in a worker thread, and use mutex synchronization to wait until I get the result
2. Do not create a thread, but call the asynchronous version and get the result in the callback
I am aware that worker thread creation in option 1 will cause more overhead. I want to know about the overhead caused by thread synchronization objects, and how it compares to the overhead caused by the asynchronous call. Does the asynchronous version of a function internally spin off a thread and use a synchronization object, or does it use some other technique, like talking directly to the kernel?

"Profile, don't speculate." (DJB)
The answer to this question depends on too many things, and there is no general answer. The role of the developer is to be able to make these decisions. If you don't know, try the options and measure. In many cases, the difference won't matter and non-performance concerns will dominate.
"Premature optimisation is the root of all evil, say 97% of the time" (DEK)
Update in response to the question edit:
C++ libraries, in general, don't get to use magic to avoid synchronisation primitives. The asynchronous vs. synchronous interfaces are likely to be wrappers around things you would do anyway. Processing must happen in a context, and if completion is to be signalled to another context, a synchronisation primitive will be necessary to do that.
Of course, there might be other considerations. If your C++ library is talking to some piece of hardware that can do processing, things might be different. But you haven't told us about anything like that.
The answer to this question depends on context you haven't given us, including information about the library interface and the structure of your code.

Use the asynchronous function, because it will probably do what you would otherwise do manually with the synchronous one, but in a less error-prone way.
Asynchronous: it will create a thread, do the work, and when done, call the callback.
Synchronous: create an event to wait for, create a thread for the work, wait for the event; on the thread, call the sync version, transfer the result, and signal the event.
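The two strategies can be sketched side by side as below. The helper names call_on_worker and call_with_callback are hypothetical, and the library functions are passed in as parameters because the question does not name them. The sketch assumes the functions take and return an int. In both variants the result reaches the waiting context through the shared state of a std::future, which is itself a synchronisation object.

```cpp
#include <future>
#include <memory>

// Option 1: run the synchronous function on a worker thread and wait.
// std::async with launch::async creates the thread; get() blocks until
// the result is ready.
template <class SyncFn>
int call_on_worker(SyncFn sync_fn, int arg) {
    std::future<int> result = std::async(std::launch::async, sync_fn, arg);
    return result.get();
}

// Option 2: call the callback-based function and hand the result over
// through a promise. Assumes async_fn(arg, callback) invokes the callback
// exactly once with the result, on whatever thread the library chooses.
template <class AsyncFn>
std::future<int> call_with_callback(AsyncFn async_fn, int arg) {
    // shared_ptr keeps the promise alive until the callback has run.
    auto promise = std::make_shared<std::promise<int>>();
    std::future<int> result = promise->get_future();
    async_fn(arg, [promise](int value) { promise->set_value(value); });
    return result;
}
```

Whether option 2 avoids a thread entirely depends on how the library implements its asynchronous function, which is why measuring both remains the safest guide.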

You might consider that threads each have their own environment so they use more memory than a non threaded solution when all other things are equal.
Depending on your threading library there can also be significant overhead to starting and stopping threads.
If you need interprocess synchronization there can also be a lot of pain debugging threaded code.
If you're comfortable writing non threaded code (i.e. you won't burn a lot of time writing and debugging it) then that might be the best choice.

Is there a disadvantage to using boost::interprocess::interprocess_semaphore within a single multithreaded c++ process?

The disadvantage would be in comparison to a technique that was specialized to work on threads that are running within the same process. For example, does wait/post cause the whole process to yield, rather than just the executing thread, even though anyone waiting for a post would be within the same process?
The semaphore would be used, for example, to solve a producer/consumer problem in a shared buffer between two threads in the same process.
Are there any reasonable alternatives?

Use Boost.Thread condition variables as shown here. The accompanying article has a good summary of Boost.Thread features.
Using interprocess semaphores will work but it's likely to place a tax on your execution due to use of unnecessarily heavyweight underlying OS locking primitives (named kernel objects in Windows, for example).
