How to implement a back-off with Microsoft PPL lightweight task scheduler? - c++

We use a PPL Concurrency::TaskScheduler to dispatch events from our media pipeline to subscribed clients (typically a GUI app).
These events are C++ lambdas passed to Concurrency::TaskScheduler::ScheduleTask().
But, under load, the pipeline can generate events at a greater rate than the client can consume them.
Is there a PPL strategy I can use to cause the event dispatcher to not queue an event (in reality, a scheduled task) if the 'queue' of scheduled tasks is greater than N? And if not, how would I roll my own?

Looking at the API, it appears that there's no way to know whether the scheduler is under heavy load, nor is there a way to tell it how to behave in such circumstances. My understanding is that while it is possible to set limits on how many concurrent threads may run within a scheduler using policies, the protocol by which the scheduler may accept or refuse new tasks isn't clear to me.
My bet is that you will have to implement that mechanism yourself, by counting how many tasks are already in the scheduler and putting a size-limited queue in front of the scheduler to throttle the flow of incoming tasks.
I suppose that you could use a simple std::queue for your lambdas, and each time you have a new event, you check how many tasks are running, and add as many from the queue as possible to reach your max running task count.
If the queue is still full after that, then you refuse the new task.
To handle the running tasks accounting, you could wrap your tasks with a function decrementing the counter at completion time (use a mutex to avoid races), and increment the counter when scheduling a new task.
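A minimal sketch of that counting wrapper, in standard C++ only. `BoundedDispatcher`, `tryDispatch` and the `Scheduler` callable are hypothetical names and not part of PPL; the callable stands in for whatever call hands a task to your real scheduler. It uses a `std::atomic` counter rather than a mutex, and it simply drops a task when the limit is reached instead of parking it in a secondary queue.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper (not a PPL type): forwards tasks to a scheduler-like
// callable and refuses new ones once `limit` tasks are pending or running.
// Assumes the dispatcher outlives every task it has scheduled and that the
// scheduler callable itself does not throw.
class BoundedDispatcher {
public:
    using Task = std::function<void()>;
    using Scheduler = std::function<void(Task)>;

    BoundedDispatcher(Scheduler schedule, std::size_t limit)
        : schedule_(std::move(schedule)), limit_(limit) {}

    // Returns false, and drops the task, when the limit is already reached.
    bool tryDispatch(Task task) {
        // Reserve a slot first so concurrent callers cannot exceed the limit.
        if (inFlight_.fetch_add(1) >= limit_) {
            inFlight_.fetch_sub(1);
            return false;
        }
        // Wrap the task so its slot is released when it finishes or throws.
        schedule_([this, task = std::move(task)] {
            SlotRelease release{inFlight_};
            task();
        });
        return true;
    }

    std::size_t inFlight() const { return inFlight_.load(); }

private:
    struct SlotRelease {
        std::atomic<std::size_t>& counter;
        ~SlotRelease() { counter.fetch_sub(1); }
    };

    Scheduler schedule_;
    std::size_t limit_;
    std::atomic<std::size_t> inFlight_{0};
};
```

With a real scheduler, the callable passed to the constructor would be the place to forward the wrapped task. The caller decides what a `false` return means: drop the event, coalesce it with a later one, or retry after a delay.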

Related

folly IOThreadPoolExecutor vs CPUThreadPoolExecutor

I'm trying to learn more about the async abstractions used by this codebase I'm working on.
I'm reading Folly's documentation for two async executor pools in the library, IOThreadPoolExecutor for io bound tasks, and CPUThreadPoolExecutor for cpu bound tasks (https://github.com/facebook/folly/blob/master/folly/docs/Executors.md).
I'm reading through the descriptions but I don't understand the main difference. It seems like IOThreadPoolExecutor is built around event_fd and epoll loop and CPUThreadPoolExecutor uses a queue and semaphore.
But that doesn't tell me that much about the benefits and trade-offs.
At a high level, IOThreadPoolExecutor should be used only if you need a pool of EventBases. If you need a pool of workers, then use CPUThreadPoolExecutor.
CPUThreadPoolExecutor
Contains a series of priority queues that are constantly drained by a set of worker threads. Each worker thread executes threadRun() after it is created. threadRun() is essentially an infinite loop that pulls one task from the task queue and executes it. If the task has already expired when it is fetched, the expire callback is executed instead of the task itself.
IOThreadPoolExecutor
Each IO thread runs its own EventBase. Instead of pulling tasks from a task queue like the CPUThreadPoolExecutor, the IOThreadPoolExecutor registers an event with the EventBase of the next IO thread. Each IO thread then calls loopForever() on its EventBase, which essentially calls epoll() to perform async IO.
So most of the time you should probably be using a CPUThreadPoolExecutor, as that is the usual use case for having a pool of workers.
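To make the "pull one task, run it or its expire callback" step concrete, here is a rough standard-C++ sketch of a single iteration of such a worker loop. This is not Folly's code: `ExpiringTask` and `runOne` are hypothetical names, and a plain `std::queue` stands in for the executor's priority queues.

```cpp
#include <chrono>
#include <functional>
#include <queue>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical task record: the work itself, a fallback to run if the task
// is picked up too late, and the deadline that decides between the two.
struct ExpiringTask {
    std::function<void()> run;
    std::function<void()> onExpire;
    Clock::time_point deadline;
};

// One iteration of a worker loop: take the next task and run either the
// task or its expire callback. Returns false when the queue is empty.
bool runOne(std::queue<ExpiringTask>& tasks, Clock::time_point now) {
    if (tasks.empty()) {
        return false;
    }
    ExpiringTask task = std::move(tasks.front());
    tasks.pop();
    if (now > task.deadline) {
        task.onExpire();  // fetched too late: report expiry instead
    } else {
        task.run();
    }
    return true;
}
```

A real worker would call something like this in a loop, blocking while the queue is empty, whereas an IO thread would sit in its event loop instead.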

One asynchronous thread per task?

My application currently has a list of "tasks" (Each task consists of a function - That function can be as simple as printing something out but also way more complex) which gets looped through. (Additional note: Most tasks send a packet after having been executed) As some of these tasks could take quite some time, I thought about using a different, asynchronous thread for each task, thus letting run all the tasks concurrently.
Would that be a smart thing to do or not? One problem is that I can't possibly know the number of tasks beforehand, so it could result in quite a few threads being created, and I read somewhere that each piece of hardware has its limitations. I'm planning to run my application on a Raspberry Pi, and I think that I will always have to run between 5 and 20 tasks.
Also, some of the tasks have a lower "priority" of running than others.
Should I just run the important tasks first and then the less important ones? (Problem here is that if the sum of the time needed for all tasks together exceeds the time that some specific, important task should be run, my application won't be accurate anymore) Or implement everything in asynchronous threads? Or just try to make everything a little bit faster by only having the "packet-sending" in an asynchronous thread, thus not having to wait until the packets actually get sent?
There are a number of questions you will need to ask yourself before you can reasonably design a solution.
Do you wish to limit the number of concurrent tasks?
Is it possible that in future the number of concurrent tasks will increase in a way that you cannot predict today?
... and probably a whole lot more.
If the answer to any of these is "yes" then you have a few options:
A producer/consumer queue with a fixed number of threads draining the queue (not recommended IMHO)
Write your tasks as asynchronous state machines around an event dispatcher such as boost::asio::io_service (this is much more scalable).
If you know it's only going to be 20 concurrent tasks, you'll probably get away with std::async, but it's a sloppy way to write code.

task delegation scheduler

I implemented a task delegation scheduler instead of a task stealing scheduler. The basic idea of this method is that each thread has its own private local queue. Whenever a task is produced, before it gets enqueued, the queues are searched and the one with the minimum size is found by comparing their sizes. The task is then enqueued on that minimum-size queue. This is a way of diverting work pressure away from a busy thread's queue and delegating jobs to the least busy thread's queue.
The problem with this scheduling technique is that we don't know how much time each task takes to complete, i.e. a queue may have a minimal count while its current task is still running for a long time, whereas a queue with a higher count may contain tasks that will complete very soon. Any ideas to solve this problem?
I am working on Linux, in C++, in our own multithreading library implementing a multi-rate synchronous data flow paradigm.
It seems that your scheduling policy doesn't fit the job at hand. Usually this type of naive-scheduling which ignores task completion times is only relevant when tasks are relatively equal in execution time.
I'd recommend doing some research. A good place to start would be Wikipedia's Scheduling article but that is of course just the tip of the iceberg.
I'd also give a second (and third) thought to the task-delegation requirement since timeslicing task operations allows you to fine grain queue management by considering the task's "history". However, if clients are designed so that each client consistently sends the same "type" of task, then you can achieve similar results with this knowledge.
As far as I remember from my Queueing Theory class the fairest (of them all;) system is the one which has a single queue and multiple servers. Using such system ensures the lowest expected average execution time for all tasks and the largest utilization factor (% of time it works, I'm not sure the term is correct).
In other words, unless you have some priority tasks, please reconsider your task delegation scheduler implementation.

Find minimum queue size among threads

I am trying to implement a new scheduling technique with multiple threads. Each thread has its own private local queue. The idea is that each time a task is created from the program thread, it should search for the minimum-size queue (the queue with the fewest tasks) among the queues and enqueue the task in it.
This is a way of load balancing among threads, where less busy queues receive more tasks.
Can you please suggest some logic or ideas for finding the minimum-size queue among the given queues dynamically, from a programming point of view?
I am working with Visual Studio 2008, in C++, in our own multithreading library implementing a multi-rate synchronous data flow paradigm.
As you can see, trying to find the least loaded queue is cumbersome and could be inefficient, as you may add more work to queues holding only one heavy task, whereas queues with small tasks will have no more jobs and quickly become inactive.
You'd better use a work-stealing heuristic: when a thread is done with its own jobs, it looks at the other threads' queues and "steals" some work instead of remaining idle or being terminated.
Then the system will be auto-balanced with each thread being active until there is not enough work for everyone.
You should not have a situation with idle threads and work waiting for processing.
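A simplified sketch of the work-stealing idea, assuming one mutex-protected deque per worker (production schedulers typically use lock-free deques instead). `StealingQueues`, `push` and `next` are hypothetical names. Each worker takes the newest task from its own queue and, when that is empty, steals the oldest task from another worker.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical container: one task deque per worker, with stealing.
class StealingQueues {
public:
    using Task = std::function<void()>;

    explicit StealingQueues(std::size_t workers) : queues_(workers) {}

    // Adds a task to the queue owned by `worker`.
    void push(std::size_t worker, Task task) {
        Queue& q = queues_[worker];
        std::lock_guard<std::mutex> lock(q.mutex);
        q.tasks.push_back(std::move(task));
    }

    // Called by `worker` when it wants more work: its own queue first,
    // then the other queues in round-robin order. Empty result = no work.
    std::optional<Task> next(std::size_t worker) {
        if (auto task = take(queues_[worker], /*fromBack=*/true)) {
            return task;
        }
        for (std::size_t i = 1; i < queues_.size(); ++i) {
            Queue& victim = queues_[(worker + i) % queues_.size()];
            if (auto task = take(victim, /*fromBack=*/false)) {
                return task;
            }
        }
        return std::nullopt;
    }

private:
    struct Queue {
        std::mutex mutex;
        std::deque<Task> tasks;
    };

    static std::optional<Task> take(Queue& q, bool fromBack) {
        std::lock_guard<std::mutex> lock(q.mutex);
        if (q.tasks.empty()) {
            return std::nullopt;
        }
        Task task = std::move(fromBack ? q.tasks.back() : q.tasks.front());
        if (fromBack) {
            q.tasks.pop_back();
        } else {
            q.tasks.pop_front();
        }
        return task;
    }

    std::vector<Queue> queues_;
};
```

Each worker thread would loop on `next(myIndex)`, run whatever it gets, and only sleep or exit when the call returns an empty optional.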
If you really want to try this, can each queue not just keep a public 'int count' member, updated with atomic inc/dec as tasks are pushed/popped?
Whether such a design is worth the management overhead and the occasional 'mistakes' when a task is queued to a thread that happens to be running a particularly lengthy job when another thread is just about to dequeue a very short job, is another issue.
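If you do want to try the counter approach, a minimal sketch could look like this. `LoadCounters` and its member names are hypothetical; the counters are kept beside the queues rather than inside them to keep the example short. The result of `leastLoaded()` is only a snapshot and may already be stale when the task is enqueued, which is exactly the kind of occasional "mistake" mentioned above.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical per-worker task counters, updated with atomic inc/dec.
// Assumes at least one worker.
class LoadCounters {
public:
    explicit LoadCounters(std::size_t workers) : counts_(workers) {}

    // Call when a task is pushed to / popped from worker w's queue.
    void onPush(std::size_t w) { counts_[w].fetch_add(1, std::memory_order_relaxed); }
    void onPop(std::size_t w) { counts_[w].fetch_sub(1, std::memory_order_relaxed); }

    // Index of the queue with the fewest tasks; ties go to the lowest index.
    std::size_t leastLoaded() const {
        std::size_t best = 0;
        int bestCount = counts_[0].load(std::memory_order_relaxed);
        for (std::size_t w = 1; w < counts_.size(); ++w) {
            const int count = counts_[w].load(std::memory_order_relaxed);
            if (count < bestCount) {
                best = w;
                bestCount = count;
            }
        }
        return best;
    }

private:
    std::vector<std::atomic<int>> counts_;  // value-initialized to zero
};
```

The producer would call `leastLoaded()`, push the task to that worker's queue, and call `onPush()`; each worker calls `onPop()` when it dequeues a task.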
Why aren't the threads fetching their work from a 'master' work queue?
If you are really trying to distribute work items from a master source, to a set of workers, you are then doing load balancing, as you say. In that case, you really are talking about scheduling, unless you simply do round-robin style balancing. Scheduling is a very deep subject in Computing, you can easily spend weeks, or months learning about it.
You could synchronise a counter among the threads. But I guess this isn't what you want.
Since you want to implement everything using dataflow, everything should be queues.
Your first option is to query the number of jobs inside a queue. I think this is not easy if you want a single reader/writer pattern, because you probably have to use a lock for this operation, which is not what you want. Note: I'm just guessing that you can't use lock-free queues here; either you have a counter or you take the difference of two pointers, and either way you have a lock.
Your second option (which can be done with lock-free code) is to send a command back to the dispatcher thread, telling it that worker thread x has consumed a job. Using this approach you have n more queues, each from one worker thread to the dispatcher thread.
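A rough sketch of the second option, under stated assumptions: `SpscRing` is a small single-producer/single-consumer ring buffer written for this example, and `Dispatcher`, `reportDone` and `assign` are hypothetical names. Each worker owns one return ring, and only the dispatcher thread ever touches the load counts, so the counts themselves need no synchronization.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Lock-free ring for exactly one producer thread and one consumer thread.
// Holds at most Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    bool push(const T& value) {  // producer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {  // consumer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return std::nullopt;
        T value = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Dispatcher-side bookkeeping: one "job done" ring per worker.
// Assumes at least one worker.
class Dispatcher {
public:
    explicit Dispatcher(std::size_t workers)
        : outstanding_(workers, 0), done_(workers) {}

    // Called by worker w after it has finished a job.
    // Returns false if its ring is full and the report must be retried.
    bool reportDone(std::size_t w) { return done_[w].push(1); }

    // Dispatcher thread only: apply pending reports, pick the worker with
    // the fewest outstanding jobs, and count the new job against it.
    std::size_t assign() {
        std::size_t best = 0;
        for (std::size_t w = 0; w < outstanding_.size(); ++w) {
            while (done_[w].pop()) --outstanding_[w];
            if (outstanding_[w] < outstanding_[best]) best = w;
        }
        ++outstanding_[best];
        return best;
    }

private:
    std::vector<int> outstanding_;
    std::vector<SpscRing<char, 64>> done_;  // capacity 64 is arbitrary
};
```

The dispatcher would call `assign()` for every new job and push the job onto the chosen worker's input queue. The counts still say nothing about how long each job takes, so this shares the limitation discussed in the other answers.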

How to add tasks to a tango (D) ThreadPool asynchronously?

I am comparing a task queue/thread pool pattern system to an n-threads system in D. I'm really new to the D programming language but have worked with threads in C, Java, and Python before. I'm using the Tango library, and I'm building a webserver as an example.
I decided to use tango.core.ThreadPool as my thread pool, as my project is focused on ease of use and performance between traditional threading and task queues.
The documentation shows that I have 3 options:
ThreadPool.wait() - Blocks the current thread while the pool consumes tasks from the queue.
ThreadPool.shutdown() - Finishes the tasks in the pool but not the ones in the queue.
ThreadPool.finish() - Finishes all tasks in the pool and queue, but then accepts no more.
None of these things are what I want. It is my understanding that your list of tasks should be able to grow in these systems. The web server is very simple and naïve; I just want it to try its best at scaling to many concurrent requests, even if its resource management only consists of consuming things in the task queue as quickly as possible.
I suspect that it's because the main thread needs to join the other threads, but I'm a bit rusty on my threading knowledge.
What about void append(JobD job, Args args)? From the docs, it works like Executor.execute(Runnable) from Java (submit a task to be run some time in the future).
Note that here it is a LIFO queue instead of the expected FIFO queue, so allocate enough workers.
I discovered that the way I was constructing my delegate contributed to blocking in some part of the code. Instead of closing over the object returned by SocketServer.accept, I now pass that object as a parameter to my delegate. I don't know why this was the solution, but the program now works as expected. I heard that closures in D version 1 are broken; maybe this has something to do with it.