I want to schedule and run tasks in parallel in a multi-node distributed memory cluster.
I have many different tasks with dependencies on other tasks, but I also want to run the same task in parallel with different parameters.
For example, say I have a task whose desired optimal parameter is 10, but there is a chance that it might fail (due to certain complexities within the specific task, which cannot be known beforehand). To hedge the risk of failure, I also want to run the same task with the next best parameter. To continue hedging the risk, I want to run even more instances of the task with successively worse parameters.
There is an implied hierarchy amongst the tasks based upon the optimal parameter: all instances with a lesser parameter can be considered hedges for the better value.
My main question is:
Since I'm running the same task in parallel with different parameters, the instances are prioritized by the optimality of their parameter. As soon as any instance of the task completes successfully, I want to kill all instances with lower priority (worse parameters) and remove them from the process queue.
In other words, I want to kill off the hedge processes as soon as I know that a better-parameter instance has completed successfully.
Does DAGuE allow this, i.e. the removal of tasks from the queue? If not, can another C++ scheduler be suggested?
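I don't know yet whether DAGuE can do this, but to make the desired behaviour concrete, here is a rough single-process sketch of what I'm after, using std::async and a shared cancellation flag (do_work, the parameter values, and the success condition are placeholders, not my real task):

    #include <atomic>
    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <vector>

    // Placeholder task: runs with one parameter and polls 'cancel' so that
    // hedge instances can be abandoned once a better instance succeeds.
    bool do_work(int parameter, std::atomic<bool>& cancel) {
        for (int step = 0; step < 100; ++step) {
            if (cancel.load()) return false;   // a higher-priority instance already succeeded
            // ... perform one unit of work with 'parameter' ...
        }
        return parameter >= 8;                 // placeholder success condition
    }

    int main() {
        std::vector<int> params = {10, 9, 8, 7};   // optimal parameter first, hedges after
        std::atomic<bool> cancel{false};

        std::vector<std::future<bool>> hedges;
        for (int p : params)
            hedges.push_back(std::async(std::launch::async, do_work, p, std::ref(cancel)));

        // Wait in priority order; the first success cancels all remaining hedges.
        for (std::size_t i = 0; i < hedges.size(); ++i) {
            if (hedges[i].get()) {
                cancel.store(true);            // tell the lower-priority instances to stop
                std::cout << "parameter " << params[i] << " succeeded\n";
                break;
            }
        }
    }

In the distributed case, the question is whether the scheduler lets me do the equivalent of cancel.store(true) on tasks that are still sitting in its queue.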
Context:
I have a long running process returning a result. This result may need further processing. There can also be multiple results ready simultaneously. Each one of these results that needs further processing may take a while too.
I need to take each one of those results, throw each into a queue of sorts and process each one, with the ability to start/stop this process.
Problem:
At the moment, I am limited to using a QFuture with a QFutureWatcher and either QtConcurrent::mapped() or QtConcurrent::run() (the latter of which doesn't support direct pause/stop functionality, see note below).
The problem with the approaches mentioned above is that they require all results to be known up front; however, to decrease the overall processing time, I would like to process each result as it comes in.
How can I effectively create a thread pool with a queue of tasks?
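What I am after is roughly the following: each result is wrapped in a QRunnable and handed to QThreadPool as soon as it arrives, so nothing has to be known up front (Result and processResult() below are placeholders for my real types):

    #include <QRunnable>
    #include <QThreadPool>
    #include <utility>

    // Placeholder payload and processing function.
    struct Result { int id; };
    void processResult(const Result& r) { /* lengthy post-processing */ }

    class ResultTask : public QRunnable {
    public:
        explicit ResultTask(Result r) : m_result(std::move(r)) { setAutoDelete(true); }
        void run() override { processResult(m_result); }
    private:
        Result m_result;
    };

    // Called every time the long-running process produces a result;
    // QThreadPool queues the task and runs it when a worker thread is free.
    void enqueueResult(Result r) {
        QThreadPool::globalInstance()->start(new ResultTask(std::move(r)));
    }

Pausing/stopping would presumably still have to be done cooperatively, e.g. by having run() check an atomic flag, since a task that has already started cannot be stopped from outside.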
Referring to Java's Fork/Join vs ExecutorService - when to use which?, a traditional thread pool is usually used to process many independent requests, while a ForkJoinPool is used to process coherent/recursive tasks, where a task may spawn another subtask and join on it later.
So why does Java 8's parallelStream use a ForkJoinPool by default and not a traditional executor?
In many cases, we use forEach() after stream() or parallelStream() and then submit a functional interface as an argument. From my point of view, these tasks are independent, aren't they?
One important thing is that a ForkJoinPool can execute "normal" tasks (e.g. Runnable, Callable) as well, so it's not just meant to be used with recursively-created tasks.
Another (important) thing is that a ForkJoinPool has multiple task queues, one for each worker thread, whereas a normal executor (e.g. ThreadPoolExecutor) has just one. This has a big impact on what kind of tasks they should run.
The smaller and more numerous the tasks a normal executor has to execute, the higher the overhead of synchronization for distributing tasks to the workers. If most of the tasks are small, the workers access the single internal task queue often, which leads to synchronization overhead.
Here's where the ForkJoinPool shines with its multiple queues. Every worker takes tasks from its own queue, which most of the time doesn't require blocking synchronization, and if its own queue is empty, it can steal a task from another worker - but from the other end of that worker's queue, which also rarely causes synchronization overhead since work-stealing is supposed to be rather rare.
Now what does that have to do with parallel streams? The streams framework is designed to be easy to use. Parallel streams are meant for splitting work into many concurrent tasks easily, where all the tasks are rather small and simple. That is exactly where the ForkJoinPool is the reasonable choice: it provides better performance on huge numbers of small tasks, and it can handle longer tasks as well if it has to.
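To make the "one deque per worker" point concrete, here is a rough sketch of that queue discipline (in C++ for brevity, and with a mutex where a real work-stealing pool such as ForkJoinPool uses lock-free deques):

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <utility>

    struct WorkerDeque {
        std::deque<std::function<void()>> tasks;
        std::mutex m;

        // The owning worker pushes and pops at the back (LIFO): cheap,
        // cache-friendly, and almost never contended.
        void push(std::function<void()> t) {
            std::lock_guard<std::mutex> lock(m);
            tasks.push_back(std::move(t));
        }
        std::optional<std::function<void()>> pop() {
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return std::nullopt;
            auto t = std::move(tasks.back());
            tasks.pop_back();
            return t;
        }
        // An idle worker steals from the *front* of another worker's deque,
        // so owner and thief rarely touch the same end at the same time.
        std::optional<std::function<void()>> steal() {
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return std::nullopt;
            auto t = std::move(tasks.front());
            tasks.pop_front();
            return t;
        }
    };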
My problem is running a job after thousands of jobs finish running on AWS Batch.
I have tried running the job in a job queue with lower priority, and running it in the same queue but submitting it after all the others (the documentation says that jobs are executed in approximately the order in which they are submitted). My question is whether either of these (or some other approach) guarantees that it will run after the others.
I wouldn't rely on a guarantee using the above methods. Execution order is explicitly not guaranteed to match submission order. Priority "should" work, but at large scale it's likely at some point something will delay your high priority execution and cause the scheduler to decide it has resources to spare for the low priority queue.
You can rely on job dependencies. They allow you to specify that one job depends on another N jobs, and therefore must wait until they all finish to begin running. This can even be chained - A depends on B, B depends on C, guarantees order C -> B -> A. Unfortunately, N <= 20.
The best way to depend on a lot of jobs (more than 20) is to depend on a single array job with all of those jobs inside it. On a related note, an array job can also be configured to make all of its jobs serially dependent (the jobs execute in array order). The only caveat is that you have to put all your jobs into an array. On the off chance that the thousands of jobs you want to depend on aren't already in an array, there are ways of manipulating them into one - for example, if you're processing 1000 files, you can put the files in a list and have each array job index into the list using its job index.
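To illustrate the indexing trick: AWS Batch sets the AWS_BATCH_JOB_ARRAY_INDEX environment variable on every child job of an array job, so each child can pick its own input from a shared list. A minimal sketch (the inputs.txt file name is made up):

    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // AWS Batch sets this variable on every child job of an array job.
        const char* idx = std::getenv("AWS_BATCH_JOB_ARRAY_INDEX");
        if (!idx) return 1;
        const int index = std::stoi(idx);

        // Hypothetical shared list of inputs, one file name per line.
        std::ifstream list("inputs.txt");
        std::string line, chosen;
        for (int i = 0; std::getline(list, line); ++i)
            if (i == index) { chosen = line; break; }
        if (chosen.empty()) return 1;

        std::cout << "processing " << chosen << "\n";   // process the selected input
        return 0;
    }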
My application currently has a list of "tasks" (each task consists of a function; that function can be as simple as printing something out, but also way more complex) which it loops through. (Additional note: most tasks send a packet after having been executed.) As some of these tasks could take quite some time, I thought about using a separate, asynchronous thread for each task, thus letting all the tasks run concurrently.
Would that be a smart thing to do or not? One problem is that I can't possibly know the number of tasks beforehand, so it could result in quite a few threads being created, and I read somewhere that each piece of hardware has its limitations. I'm planning to run my application on a Raspberry Pi, and I think I will always have to run between 5 and 20 tasks.
Also, some of the tasks have a lower "priority" of running than others.
Should I just run the important tasks first and then the less important ones? (The problem here is that if the total time needed for all tasks exceeds the time by which some specific, important task should have run, my application won't be accurate anymore.) Or should I implement everything in asynchronous threads? Or should I just try to make everything a little bit faster by only having the packet-sending in an asynchronous thread, so I don't have to wait until the packets actually get sent?
There are a number of questions you will need to ask yourself before you can reasonably design a solution.
Do you wish to limit the number of concurrent tasks?
Is it possible that in future the number of concurrent tasks will increase in a way that you cannot predict today?
... and probably a whole lot more.
If the answer to any of these is "yes" then you have a few options:
A producer/consumer queue with a fixed number of threads draining the queue (not recommended IMHO)
Write your tasks as asynchronous state machines around an event dispatcher such as boost::asio::io_service (this is much more scalable); a minimal sketch follows below.
If you know it's only going to be 20 concurrent tasks, you'll probably get away with std::async, but it's a sloppy way to write code.
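For the io_service option, the skeleton looks roughly like this (using the older io_service name; recent Boost versions call it io_context, and the handlers here are empty placeholders):

    #include <boost/asio.hpp>
    #include <memory>
    #include <thread>
    #include <vector>

    int main() {
        boost::asio::io_service io;
        // The 'work' object keeps io.run() from returning while tasks may still arrive.
        auto work = std::make_unique<boost::asio::io_service::work>(io);

        // A small, fixed pool of worker threads drains the dispatcher.
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([&io] { io.run(); });

        // Post tasks as they appear; each handler is one step of an asynchronous state machine.
        io.post([] { /* task A */ });
        io.post([] { /* task B */ });

        work.reset();              // no more tasks coming; run() returns once the queue is empty
        for (auto& t : workers) t.join();
    }

Resetting the work object is what lets run() return once the queue drains; while it exists, the pool stays alive waiting for more posts.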
I implemented a task-delegation scheduler instead of a task-stealing scheduler. The basic idea of this method is that each thread has its own private local queue. Whenever a task is produced, before it gets enqueued to a local queue, a search is done among the queues and the minimum-size queue is found by comparing the sizes of the queues. That minimum-size queue is then used to enqueue the task. This is a way of diverting the pressure of the work away from a busy thread's queue and delegating jobs to the least busy thread's queue.
The problem with this scheduling technique is that we don't know how much time each task takes to complete. I.e., a queue may have a minimal count but its current task may still be running for a long time; on the other hand, a queue may have a higher count but its tasks may complete very soon. Any ideas on how to solve this problem?
I am working on Linux, in C++, on our own multithreading library implementing a multi-rate synchronous data flow paradigm.
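For reference, the delegation step described above boils down to something like this (simplified: one mutex per queue, and only the enqueue side is shown):

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <utility>
    #include <vector>

    struct LocalQueue {
        std::deque<std::function<void()>> tasks;
        mutable std::mutex m;

        std::size_t size() const {
            std::lock_guard<std::mutex> lock(m);
            return tasks.size();
        }
        void push(std::function<void()> t) {
            std::lock_guard<std::mutex> lock(m);
            tasks.push_back(std::move(t));
        }
    };

    // Delegation step: scan all per-thread queues and enqueue into the shortest one.
    void delegate(std::vector<LocalQueue>& queues, std::function<void()> task) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < queues.size(); ++i)
            if (queues[i].size() < queues[best].size()) best = i;
        queues[best].push(std::move(task));
    }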
It seems that your scheduling policy doesn't fit the job at hand. Usually this type of naive scheduling, which ignores task completion times, is only appropriate when tasks are relatively equal in execution time.
I'd recommend doing some research. A good place to start would be Wikipedia's Scheduling article but that is of course just the tip of the iceberg.
I'd also give a second (and third) thought to the task-delegation requirement, since time-slicing task operations allows you to fine-tune queue management by considering each task's "history". However, if clients are designed so that each client consistently sends the same "type" of task, then you can achieve similar results with that knowledge, for example as in the sketch below.
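If each task carries a known "type", the scheduler can keep a running estimate of outstanding work per queue and delegate to the queue with the least estimated remaining cost rather than the fewest tasks. A rough sketch (the cost numbers are made up; in practice they would be running averages of observed completion times per type):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Per-queue estimate of outstanding work, maintained alongside the task queue.
    struct QueueLoad {
        std::atomic<std::uint64_t> estimatedCost{0};
    };

    // Made-up per-type costs; replace with measured averages per task type.
    std::uint64_t estimateCost(int taskType) {
        switch (taskType) {
            case 0:  return 1;     // cheap task
            case 1:  return 10;    // medium task
            default: return 100;   // expensive task
        }
    }

    // Delegation picks the queue with the least estimated remaining work.
    std::size_t pickQueue(const std::vector<QueueLoad>& loads) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < loads.size(); ++i)
            if (loads[i].estimatedCost.load() < loads[best].estimatedCost.load())
                best = i;
        return best;
    }

    // On enqueue:    loads[q].estimatedCost += estimateCost(type);
    // On completion: loads[q].estimatedCost -= estimateCost(type);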
As far as I remember from my Queueing Theory class, the fairest (of them all ;)) system is the one that has a single queue and multiple servers. Using such a system ensures the lowest expected average execution time for all tasks and the highest utilization factor (the percentage of time it is busy; I'm not sure the term is correct).
In other words, unless you have some priority tasks, please reconsider your task delegation scheduler implementation.