I'm trying to learn more about the async abstractions used by this codebase I'm working on.
I'm reading Folly's documentation for two async executor pools in the library: IOThreadPoolExecutor for IO-bound tasks and CPUThreadPoolExecutor for CPU-bound tasks (https://github.com/facebook/folly/blob/master/folly/docs/Executors.md).
I'm reading through the descriptions, but I don't understand the main difference. It seems like IOThreadPoolExecutor is built around an eventfd and an epoll loop, while CPUThreadPoolExecutor uses a queue and a semaphore.
But that doesn't tell me that much about the benefits and trade-offs.
At a high level, IOThreadPoolExecutor should be used only if you need a pool of EventBases. If you just need a pool of workers, use CPUThreadPoolExecutor.
CPUThreadPoolExecutor
Contains a set of priority queues that are continuously drained by a pool of worker threads. Each worker thread runs threadRun() after it is created. threadRun() is essentially an infinite loop that pulls one task at a time from the task queue and executes it. If a task has already expired when it is fetched, its expiry callback is executed instead of the task itself.
IOThreadPoolExecutor
Each IO thread runs its own EventBase. Instead of pulling tasks from a task queue like the CPUThreadPoolExecutor, the IOThreadPoolExecutor registers an event with the EventBase of the next IO thread. Each IO thread then calls loopForever() on its EventBase, which essentially calls epoll() to perform async IO.
So most of the time you should probably be using a CPUThreadPoolExecutor, as that is the usual use case for having a pool of workers.
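For a sense of how the two are used, here is a minimal sketch, assuming the folly/executors headers and the usual thread-count constructors; the calls below (add(), getEventBase(), join()) are how the pools are commonly driven, but treat the snippet as illustrative rather than canonical:

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/executors/IOThreadPoolExecutor.h>
#include <folly/io/async/EventBase.h>

int main() {
  // CPU-bound work: a plain pool of workers draining priority queues.
  folly::CPUThreadPoolExecutor cpuPool(4);
  cpuPool.add([] { /* heavy computation */ });

  // IO-bound work: a pool of EventBase loops; tasks run on an event loop thread.
  folly::IOThreadPoolExecutor ioPool(4);
  ioPool.add([] { /* non-blocking IO callbacks */ });

  // If you actually need an EventBase (e.g. to attach an AsyncSocket):
  folly::EventBase* evb = ioPool.getEventBase();
  (void)evb;

  cpuPool.join();
  ioPool.join();
  return 0;
}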
TLDR;
I need to run the gRPC C++ client library on a single thread. From what I can tell, initializing grpc creates two threads for Executors (default executor and resolver executor) and one to two threads for Timers (from timer_manager). I can turn these threads off after creation, but I can't figure out how to prevent them from being created. Is there a way to stop their creation using any of the APIs?
Explanation
Threading in Completion Queue, Executors, and Execution Contexts
Let's say we have a .cpp file with a completion queue:
using grpc::CompletionQueue;
CompletionQueue* globalCompletionQueuePtr;
int main()
{
globalCompletionQueuePtr = new CompletionQueue;
}
Having done this, the following sequence then kicks off:
Creating a CompletionQueue in this way initializes grpc (grpc_init() in init.cc)
grpc_init will then call grpc_iomgr_init, which then calls InitAll on grpc_core::Executor
In executor.cc, InitAll creates the default and resolver executors and then calls Init() on each.
Init then calls SetThreading(true), which starts up an execution thread for each executor.
Now we have two threads spun up separate from the main thread, one for the default executor and one for the resolver executor. Without digging any further into this, I can then remove the threads by calling grpc_core::Executor::SetThreadingAll(false); after creating the completion queue, but this means that the threads are created, start work, and are then terminated.
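As a sketch of that workaround (the SetThreadingAll call is an internal grpc_core API taken from the steps above, and the internal header path is an assumption):

#include <grpcpp/grpcpp.h>
#include "src/core/lib/iomgr/executor.h"  // internal header, path assumed; not part of the public API

int main() {
  // Creating the queue triggers grpc_init(), which spins up the
  // default and resolver executor threads.
  grpc::CompletionQueue cq;

  // Workaround described above: tear the executor threads back down.
  // They were still created and briefly ran before this call.
  grpc_core::Executor::SetThreadingAll(false);
  return 0;
}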
Questions about executors, the completion queue, and execution contexts:
How do the executors relate to the poll engine? I see that the executors run closures, but are they responsible for executing all closures? I can run closures when they are turned off, so I must assume that's happening on the main thread. Is that right?
Calling AsyncNext on the completion queue above drives the operations on the queue to finish as the documentation says. I can push operations onto the queue (with grpc_cq_begin_op and grpc_cq_end_op) and I can grab the underlying pollset, create a pollent, and use that to schedule calls myself. In this way, it looks like the queue tracks the state of operations but is not itself responsible for the operations doing work. Is that right?
I know that certain calls into grpc need the grpc_core::ExecCtx exec_ctx; context object created on the stack. How does the stack ctx interact with the resolver and default executors? Does it?
Is it possible to init grpc without the executors? Calling SetThreading(false) seems to keep the library working, but I don't want to create threads and then kill them.
Threading in Timers
Separate from the completion queue, after the iomgr init:
grpc_init in init.cc later calls grpc_iomgr_start in iomgr.cc which calls grpc_timer_manager_init in timer_manager.cc
The last thing grpc_timer_manager_init does is call start_threads()
start_threads() checks g_threaded to see if it needs to start some threads and then does so by calling start_timer_thread_and_unlock
Now there's a timer thread which will figure out how long until the next timer fires, sleep until that time, then wake up and fire the timers. If we run out of threads, we will start up another thread as long as we are in threaded mode (g_threaded). The code basically puts us in threaded mode no matter what, but there is a call, grpc_timer_manager_set_threading(false); in timer_manager, that will stop all the timer threads.
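And the equivalent sketch for the timer threads (grpc_timer_manager_set_threading is an internal function named above, and its header path is an assumption):

#include <grpc/grpc.h>
#include "src/core/lib/iomgr/timer_manager.h"  // internal header, path assumed

int main() {
  grpc_init();  // starts the timer threads via grpc_iomgr_start()

  // Workaround described above: stop the timer threads after they were created.
  grpc_timer_manager_set_threading(false);

  // ... use gRPC on this thread only ...

  grpc_shutdown();
  return 0;
}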
Questions about timers:
For these timer threads, what is their main use relative to grpc calls? Are the timers mostly an internal construct or are they used by the public API in some way? Are they responsible for enforcing the deadlines on closures?
Is there a way to init grpc without the timer threads? I can turn them off as stated above, but it's got the same problem as the executors in that they get created and then I destroy them afterwards.
Will turning off the timer threads have any negative impact on the operation of gRPC, such as deadlines no longer working? Will gRPC spin up new threads even after calling grpc_timer_manager_set_threading? Are the timers resolved on the main thread in a coroutine-like way, similar to the closures, by calling AsyncNext on the queue without threads? Is it already doing that?
Extra Context
Finally, is there anything else in the library that will spin up threads that I'm blind to seeing?
The reason I ask all these questions is that I need to run grpc inside an application where the application provides a single thread for the library to run on. Performance degradation from the lack of threads is not a concern.
If anything I have said here is inaccurate, please do correct me. I know I am working with an imperfect understanding of the grpc cpp library.
Thanks in advance for any answers and anyone who takes the time to read through this and provide support. I greatly appreciate it!
UPDATE: Why do I need a single thread?
I have a specific hardware environment where the actual application that will run the gRPC client will be managing several threads. Each thread will get a time slice to run in and must be done at the end of that time slice. While extra threads can spin up during that thread's time slice, they must all be finished when the time slice is over so that the next thread, when given its time slice, has all the hardware resources available to it.
gRPC does not have any support for single threading. At the very least, we use threads for running timers and the resolver, along with a few other tasks, as you noticed.
You can avoid thread blowup by using the async server API rather than the sync one, which creates a new thread per RPC.
But nothing is single-threaded.
I intend to implement a thread pool to manage threads in my project. The basic structure of the thread pool that comes to mind is a queue: some threads push tasks into this queue, and the threads managed by the pool wait to handle those tasks. I think this is the classic producer-consumer problem. But when I google thread pool implementations on the web, I find that they seldom use this classic model, so my question is: why don't they use this classic model? Does this model have any drawbacks? Why don't they use a "full" semaphore and an "empty" semaphore to synchronize?
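For reference, here is a minimal sketch of the classic model described here, with a bounded shared queue guarded by an "empty slots" and a "filled slots" counting semaphore (C++20 <semaphore>); all names are illustrative:

#include <functional>
#include <mutex>
#include <queue>
#include <semaphore>
#include <thread>
#include <vector>

// Classic bounded producer-consumer pool: one shared queue, an "empty
// slots" semaphore and a "filled slots" semaphore.
class ClassicPool {
public:
  explicit ClassicPool(std::size_t nThreads) {
    for (std::size_t i = 0; i < nThreads; ++i)
      workers_.emplace_back([this] { run(); });
  }

  ~ClassicPool() {
    { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
    filled_.release(static_cast<std::ptrdiff_t>(workers_.size()));  // wake every worker
    for (auto& w : workers_) w.join();
  }

  void push(std::function<void()> task) {
    empty_.acquire();                               // wait for a free slot (the "empty" semaphore)
    { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
    filled_.release();                              // announce one available task (the "full" semaphore)
  }

private:
  void run() {
    for (;;) {
      filled_.acquire();                            // wait for a task
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lk(m_);
        if (tasks_.empty()) { if (stop_) return; else continue; }
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      empty_.release();                             // free the slot
      task();                                       // every worker contends on the same queue
    }
  }

  static constexpr std::ptrdiff_t kCapacity = 64;   // bound on queued tasks
  std::counting_semaphore<> empty_{kCapacity};
  std::counting_semaphore<> filled_{0};
  std::mutex m_;
  std::queue<std::function<void()>> tasks_;
  bool stop_ = false;
  std::vector<std::thread> workers_;
};

Every push() and every worker pop go through the same mutex and the same pair of semaphores, which is exactly the single point of contention the answer below describes.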
If you have multiple threads waiting on a single resource (in this case the semaphores and queue) then you are creating a bottleneck. You are forcing all tasks through one queue, even though you have multiple workers. Logically this might make sense if the workers are usually idle, but the whole point of a thread pool is to deal with a heavily loaded scenario where the workers are kept busy (for maximum throughput). Using a single input queue will be particularly bad on a multi-processor system where all workers read and write the head of the queue when they are trying to get the next task. Even though the lock contention might be low, the queue head pointer will still need to be shared/communicated from one CPU cache to another each time it is updated.
Think about the ideal case: all workers are always busy. When a new task is enqueued you want it to be dispatched to the worker that will complete its current/pending task(s) first.
If, as a client, you had a contention-free oracle that could tell you which worker to enqueue a new task to, and each worker had its own queue, then you could implement each worker with its own multi-writer-single-reader queue and always dispatch new tasks to the best queue, thus eliminating worker contention on a single shared input queue. Of course you don't have such an oracle, but this mechanism still works pretty well until a worker runs out of tasks or the queues get imbalanced. "Work stealing" deals with these cases, while still reducing contention compared to the single queue case.
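A much-simplified sketch of that per-worker-queue idea (one mutex per deque instead of a lock-free structure, round-robin dispatch standing in for the "oracle", and pending tasks dropped on shutdown; all names are illustrative):

#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

class WorkStealingPool {
public:
  explicit WorkStealingPool(std::size_t n) : queues_(n) {
    for (std::size_t i = 0; i < n; ++i)
      workers_.emplace_back([this, i] { run(i); });
  }

  ~WorkStealingPool() {
    stop_ = true;                                // note: tasks still queued are dropped
    for (auto& w : workers_) w.join();
  }

  // Round-robin dispatch: a crude stand-in for the "oracle" that picks the best queue.
  void push(std::function<void()> task) {
    std::size_t i = next_++ % queues_.size();
    std::lock_guard<std::mutex> lk(queues_[i].m);
    queues_[i].d.push_back(std::move(task));
  }

private:
  struct Queue {
    std::mutex m;
    std::deque<std::function<void()>> d;
  };

  std::optional<std::function<void()>> popFrom(std::size_t i, bool stealing) {
    std::lock_guard<std::mutex> lk(queues_[i].m);
    if (queues_[i].d.empty()) return std::nullopt;
    std::function<void()> t;
    if (stealing) { t = std::move(queues_[i].d.front()); queues_[i].d.pop_front(); }  // steal from the front
    else          { t = std::move(queues_[i].d.back());  queues_[i].d.pop_back();  }  // own work from the back
    return t;
  }

  void run(std::size_t me) {
    while (!stop_) {
      auto task = popFrom(me, /*stealing=*/false);
      // Own queue empty: try to steal from the other workers.
      for (std::size_t k = 1; !task && k < queues_.size(); ++k)
        task = popFrom((me + k) % queues_.size(), /*stealing=*/true);
      if (task) (*task)();
      else std::this_thread::yield();            // nothing anywhere, back off briefly
    }
  }

  std::vector<Queue> queues_;
  std::atomic<std::size_t> next_{0};
  std::atomic<bool> stop_{false};
  std::vector<std::thread> workers_;
};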
See also:
Is Work Stealing always the most appropriate user-level thread scheduling algorithm?
Why there's no Producer and Consumer model implementation
This model is very generic and can have lots of different implementations; one of the implementations could be a queue:
Try Apache APR Queue:
It's documented as Thread Safe FIFO bounded queue.
http://apr.apache.org/docs/apr-util/1.3/apr__queue_8h.html
We use a PPL Concurrency::TaskScheduler to dispatch events from our media pipeline to subscribed clients (typically a GUI app).
These events are C++ lambdas passed to Concurrency::TaskScheduler::ScheduleTask().
But, under load, the pipeline can generate events at a greater rate than the client can consume them.
Is there a PPL strategy I can use to cause the event dispatcher to not queue an event (in reality, a scheduled task) if the 'queue' of scheduled tasks is greater than N? And if not, how would I roll my own?
Looking at the API, it appears that there's no way to know whether the scheduler is under heavy load, nor is there a way to tell it how to behave in such circumstances. My understanding is that while it is possible to set limits on how many concurrent threads may run within a scheduler using policies, the protocol by which the scheduler may accept or refuse new tasks isn't clear to me.
My bet is that you will have to implement that mechanism yourself, by counting how many tasks are already in the scheduler, and by having a size-limited queue ahead of the scheduler that helps you mitigate the flow of incoming tasks.
I suppose that you could use a simple std::queue for your lambdas, and each time you have a new event, you check how many tasks are running, and add as many from the queue as possible to reach your max running task count.
If the queue is still full after that, then you refuse the new task.
To handle the running tasks accounting, you could wrap your tasks with a function decrementing the counter at completion time (use a mutex to avoid races), and increment the counter when scheduling a new task.
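A rough sketch of that accounting (BoundedDispatcher and scheduleOnScheduler are made-up names; scheduleOnScheduler here just detaches a thread so the sketch stands alone, and you would forward it to whatever PPL scheduling call you actually use, assuming that call runs tasks asynchronously):

#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class BoundedDispatcher {
public:
  BoundedDispatcher(std::size_t maxRunning, std::size_t maxQueued)
      : maxRunning_(maxRunning), maxQueued_(maxQueued) {}

  // Returns false if the event has to be dropped (running budget and queue both full).
  bool submit(std::function<void()> event) {
    std::lock_guard<std::mutex> lk(m_);
    drainLocked();
    if (running_ < maxRunning_) { scheduleLocked(std::move(event)); return true; }
    if (pending_.size() < maxQueued_) { pending_.push(std::move(event)); return true; }
    return false;  // refuse the new task
  }

private:
  void drainLocked() {
    while (running_ < maxRunning_ && !pending_.empty()) {
      scheduleLocked(std::move(pending_.front()));
      pending_.pop();
    }
  }

  void scheduleLocked(std::function<void()> event) {
    ++running_;
    // Wrap the task so it decrements the counter and pulls the next
    // queued event when it finishes.
    scheduleOnScheduler([this, event = std::move(event)] {
      event();
      std::lock_guard<std::mutex> lk(m_);
      --running_;
      drainLocked();
    });
  }

  // Hypothetical hook: forward to the real scheduling call you use.
  // Detaching a thread keeps the sketch self-contained.
  void scheduleOnScheduler(std::function<void()> fn) {
    std::thread(std::move(fn)).detach();
  }

  std::mutex m_;
  std::queue<std::function<void()>> pending_;
  std::size_t running_ = 0;
  const std::size_t maxRunning_;
  const std::size_t maxQueued_;
};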
I need a threadpool for my application, and I'd like to rely on standard (C++11 or boost) stuff as much as possible. I realize there is an unofficial(!) boost thread pool class, which basically solves what I need, however I'd rather avoid it because it is not in the boost library itself -- why is it still not in the core library after so many years?
In some posts on this page and elsewhere, people suggested using boost::asio to achieve a threadpool like behavior. At first sight, that looked like what I wanted to do, however I found out that all implementations I have seen have no means to join on the currently active tasks, which makes it useless for my application. To perform a join, they send stop signal to all the threads and subsequently join them. However, that completely nullifies the advantage of threadpools in my use case, because that makes new tasks require the creation of a new thread.
What I want to do is:
ThreadPool pool(4);
for (...)
{
for (int i=0;i<something;i++)
pool.pushTask(...);
pool.join();
// do something with the results
}
Can anyone suggest a solution (except for using the existing unofficial thread pool on sourceforge)? Is there anything in C++11 or core boost that can help me here?
At first sight, that looked like what I wanted to do, however I found out that all implementations I have seen have no means to join on the currently active tasks, which makes it useless for my application. To perform a join, they send stop signal to all the threads and subsequently join them. However, that completely nullifies the advantage of threadpools in my use case, because that makes new tasks require the creation of a new thread.
I think you might have misunderstood the asio example:
IIRC (and it's been a while) each thread running in the thread pool has called io_service::run, which means that effectively each thread has an event loop and a scheduler. To then get asio to complete tasks, you post tasks to the io_service using the io_service::post method and asio's scheduling mechanism takes care of the rest. As long as you don't call io_service::stop, the thread pool will continue running using as many threads as you started (assuming that each thread has work to do or has been assigned an io_service::work object).
So you don't need to create new threads for new tasks, that would go against the concept of a threadpool.
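A minimal sketch of that pattern against the older boost::asio io_service API the answer refers to:

#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

int main() {
  boost::asio::io_service io;
  // The work object keeps io_service::run() from returning while the
  // task queue is momentarily empty, so the pool threads stay alive.
  auto keepAlive = std::make_unique<boost::asio::io_service::work>(io);

  std::vector<std::thread> pool;
  for (int i = 0; i < 4; ++i)
    pool.emplace_back([&io] { io.run(); });   // each thread runs the event loop / scheduler

  // New tasks are posted; no new threads are created for them.
  for (int i = 0; i < 100; ++i)
    io.post([i] { /* do work for task i */ });

  keepAlive.reset();              // let run() return once all posted tasks have drained
  for (auto& t : pool) t.join();
  return 0;
}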
Have each task class derive from a Task that has an 'OnCompletion(task)' method/event. The threadpool threads can then call that after calling the main run() method of the task.
Waiting for a single task to complete is then easy. The OnCompletion() can perform whatever is required to signal the originating thread, signaling a condvar, queueing the task to a producer-consumer queue, calling SendMessage/PostMessage API's, Invoke/BeginInvoke, whatever.
If an oringinating thread needs to wait for several tasks to all complete, you could extend the above and issue a single 'Wait task' to the pool. The wait task has its own OnCompletion to communicate the completion of other tasks and has a thread-safe 'task counter', (atomic ops or lock), set to the number of 'main' tasks to be issued. The wait task is issued to the pool first and the thread that runs it waits on a private 'allDone' condvar in the wait task. The 'main' tasks are then issued to the pool with their OnCompletion set to call a method of the wait task that decrements the task counter towards zero. When the task counter reaches zero, the thread that achieves this signals the allDone condvar. The wait task OnCompletion then runs and so signals the completion of all the main tasks.
Such a mechanism does not require the continual create/terminate/join/delete of threadpool threads, places no restriction on how the originating task needs to be signaled, and you can issue as many such task-groups as you wish. You should note, however, that each wait task blocks one threadpool thread, so make sure you create a few extra threads in the pool (not usually any problem).
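A condensed sketch of the wait-task idea (WaitTask, Run and OnMainTaskDone are illustrative names; the pool itself and the per-task OnCompletion wiring are assumed to exist as described above):

#include <condition_variable>
#include <mutex>

// Counts completions of N "main" tasks and blocks one pool thread
// until all of them have finished.
class WaitTask {
public:
  explicit WaitTask(int mainTaskCount) : remaining_(mainTaskCount) {}

  // Called from each main task's OnCompletion handler.
  void OnMainTaskDone() {
    std::lock_guard<std::mutex> lk(m_);
    if (--remaining_ == 0) allDone_.notify_one();
  }

  // The body of the wait task itself: runs on (and blocks) one pool thread.
  void Run() {
    std::unique_lock<std::mutex> lk(m_);
    allDone_.wait(lk, [this] { return remaining_ == 0; });
    // ... signal the originating thread here (condvar, message, callback, ...)
  }

private:
  std::mutex m_;
  std::condition_variable allDone_;
  int remaining_;
};

The originating thread submits the WaitTask first, then the N main tasks, each with its OnCompletion wired to call OnMainTaskDone(); as noted above, the wait task pins one pool thread while it blocks.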
This seems like a job for boost::futures. The example in the docs seems to demonstrate exactly what you're looking to do.
Joining a thread means waiting for it until it stops, and once it has stopped, if you want to assign a new task to it, you must create a new thread. So in your case you should instead wait on a condition (for example a boost::condition_variable) that indicates the end of the tasks. Using this technique it is very easy to implement with boost::asio and boost::condition_variable. Each thread calls boost::asio::io_service::run, tasks are scheduled and executed on the different threads, and at the end each task sets a boost::condition_variable or decrements a std::atomic to indicate the end of the job. That's really easy, isn't it?
I am developing a C++ application that needs to process large amount of data. I am not in position to partition data so that multi-processes can handle each partition independently. I am hoping to get ideas on frameworks/libraries that can manage threads and work allocation among worker threads.
Thread management should include at least the functionality below.
1. Decide how many worker threads are required. We may need to provide a user-defined function to calculate the number of threads.
2. Create required number of threads.
3. Kill/stop unnecessary threads to reduce resource wastage.
4. Monitor healthiness of each worker thread.
Work allocation should include the functionality below.
1. Using callback functionality, the library should get a piece of work.
2. Allocate the work to available worker thread.
3. Master/slave configuration or pipeline-of-worker-threads should be possible.
Many thanks in advance.
Your question essentially boils down to "how do I implement a thread pool?"
Writing a good thread pool is tricky. I recommend hunting for a library that already does what you want rather than trying to implement it yourself. Boost has a thread-pool library in the review queue, and both Microsoft's concurrency runtime and Intel's Threading Building Blocks contain thread pools.
With regard to your specific questions, most platforms provide a function to obtain the number of processors. In C++0x this is std::thread::hardware_concurrency(). You can then use this in combination with information about the work to be done to pick a number of worker threads.
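For example (the fallback value is just an illustrative guess, since hardware_concurrency() may return 0 when the count cannot be determined):

#include <algorithm>
#include <thread>

std::size_t pickWorkerCount(std::size_t pendingTasks) {
  std::size_t hw = std::thread::hardware_concurrency();
  if (hw == 0) hw = 2;  // the call may return 0; fall back to a guess
  // Never spin up more workers than there are tasks to run.
  return std::min<std::size_t>(hw, std::max<std::size_t>(pendingTasks, 1));
}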
Since creating threads is actually quite time consuming on many platforms, and blocked threads do not consume significant resources beyond their stack space and thread info block, I would recommend that you just block worker threads with no work to do on a condition variable or similar synchronization primitive rather than killing them in the first instance. However, if you end up with a large number of idle threads, it may be a signal that your pool has too many threads, and you could reduce the number of waiting threads.
Monitoring the "healthiness" of each thread is tricky, and typically platform dependent. The simplest way is just to check that (a) the thread is still running, and hasn't unexpectedly died, and (b) the thread is processing tasks at an acceptable rate.
The simplest means of allocating work to threads is just to use a single shared job queue: all tasks are added to the queue, and each thread takes a task when it has completed the previous task. A more complex alternative is to have a queue per thread, with a work-stealing scheme that allows a thread to take work from others if it has run out of tasks.
If your threads can submit tasks to the work queue and wait for the results then you need to have a scheme for ensuring that your worker threads do not all get stalled waiting for tasks that have not yet been scheduled. One option is to spawn a new thread when a task gets blocked, and another is to run the not-yet-scheduled task that is blocking a given thread on that thread directly in a recursive manner. There are advantages and disadvantages with both these schemes, and with other alternatives.