Thread-level parallelism for a set of mutually exclusive tasks - C++

I've encountered a problem while processing streaming data with C++. The data comes in entries; each entry is comparatively small, and the task that processes an entry does not take much time. But each entry, as well as the task processing it, is assigned to a class (a concept, not a C++ class), and only one of the tasks belonging to the same class may execute at a time.
Besides, there are billions of entries and ten million classes, and entries arrive in random class order.
I found it difficult to parallelize these tasks. Any suggestion on how to speed up the process would be a great help!
Thanks!

Put the work entries into a set of class-specific work queues. Use the queue size as a priority, with larger sizes having priority over smaller ones. Set up a priority queue (of queues) to hold the class-specific work queues, using their size as the priority.
Work entries go into their class's queue; that queue sits in the priority queue only while nobody is processing that class.
Set up a thread pool of about the same size as the number of CPUs you have.
Each thread asks the priority queue for the highest priority work queue, which is the one that arguably holds the most work. The thread removes the class queue from the priority queue. It then processes all the elements in that queue, without locking it; this keeps the overhead per unit small.
If new members of that class show up in the meantime, they are added to a new queue for that class but not placed in the priority queue. On using up its current queue, the worker thread checks for a new queue of the same class and processes that if present.
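A minimal sketch of that scheme in C++ might look like the following. All the names here are placeholders, and the mutex/condition_variable plumbing is just one way to wire it up; nothing below is prescribed by the answer above.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <queue>
#include <unordered_map>
#include <utility>

using ClassId = int;
struct Entry { ClassId cls; /* payload */ };

class ClassScheduler {
public:
    void submit(Entry e) {
        std::lock_guard<std::mutex> lk(m_);
        auto& q = pending_[e.cls];
        q.push_back(std::move(e));
        // A class sits in the priority queue only while nobody is processing it
        // and it is not already queued; the size is a snapshot used as a hint.
        if (!busy_[e.cls] && !queued_[e.cls]) {
            ready_.push({q.size(), e.cls});
            queued_[e.cls] = true;
            cv_.notify_one();
        }
    }

    // Run one of these per hardware thread, e.g. std::thread(&ClassScheduler::worker, &s).
    void worker() {
        for (;;) {
            ClassId cls;
            std::deque<Entry> batch;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !ready_.empty(); });
                cls = ready_.top().second;   // class with (roughly) the most work
                ready_.pop();
                queued_[cls] = false;
                busy_[cls] = true;
                batch.swap(pending_[cls]);
            }
            for (;;) {
                for (Entry& e : batch) process(e);            // lock-free inner loop
                std::lock_guard<std::mutex> lk(m_);
                if (pending_[cls].empty()) { busy_[cls] = false; break; }
                batch.clear();
                batch.swap(pending_[cls]);   // entries that arrived while we were busy
            }
        }
    }

private:
    void process(Entry&) { /* the per-entry task body goes here */ }

    std::mutex m_;
    std::condition_variable cv_;
    std::unordered_map<ClassId, std::deque<Entry>> pending_;
    std::unordered_map<ClassId, bool> busy_, queued_;
    std::priority_queue<std::pair<std::size_t, ClassId>> ready_;
};

Because a class is never in ready_ while it is marked busy, only one worker can be draining a given class at any time, which gives the required per-class exclusivity.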

You will want one task processor per hardware processor or so, assuming CPU is at least somewhat involved.
Each processor locks a set of active classes, slurps a bunch of jobs off the front of the queue that can be run given the current set state, marks those as in use in the set, unlocks the set and queue, and processes.
How many jobs per slurp varies with the rate of jobs, the number of processors, and how clumped and latency sensitive they are.
Starvation should be impossible, as the unslurped elements should be high priority once the class is free to be processed.
Work in batches to keep the contention for set and queue low: ideally dynamically change the batch size based off contention, idle, and latency information.
This keeps the overhead proportional to the number of processors, not the number of task classes.
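A rough sketch of that batching idea, assuming one shared FIFO of jobs plus a set of in-flight classes; the fixed batch size and the "put skipped jobs back at the front" policy are my own choices, not something this answer specifies.

#include <cstddef>
#include <deque>
#include <mutex>
#include <unordered_set>
#include <utility>
#include <vector>

using ClassId = int;
struct Job { ClassId cls; /* payload */ };

std::mutex mtx;
std::deque<Job> incoming;              // shared FIFO of incoming jobs
std::unordered_set<ClassId> active;    // classes currently owned by some worker

void process(Job&) { /* task body */ }

void worker_pass(std::size_t batch_size) {
    std::vector<Job> batch;
    std::unordered_set<ClassId> mine;  // classes claimed by this batch
    {
        std::lock_guard<std::mutex> lk(mtx);
        std::deque<Job> skipped;
        while (!incoming.empty() && batch.size() < batch_size) {
            Job j = std::move(incoming.front());
            incoming.pop_front();
            if (active.count(j.cls) && !mine.count(j.cls)) {
                skipped.push_back(std::move(j));   // another worker owns this class
            } else {
                active.insert(j.cls);
                mine.insert(j.cls);
                batch.push_back(std::move(j));
            }
        }
        // Skipped jobs go back to the front so they stay first in line
        // once their class is released.
        for (auto it = skipped.rbegin(); it != skipped.rend(); ++it)
            incoming.push_front(std::move(*it));
    }
    for (Job& j : batch) process(j);   // run the whole batch outside the lock
    {
        std::lock_guard<std::mutex> lk(mtx);
        for (ClassId c : mine) active.erase(c);
    }
}

A real implementation would adapt batch_size from contention, idle and latency measurements, as suggested above.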

Related

Multirate threads

I recently ran into a requirement for a multithreaded application whose threads run at different rates.
The questions then become (since I am still learning multithreading):
A scenario is given to put things into perspective:
Say the 1st thread runs at 100 Hz "real time",
the 2nd runs at 10 Hz,
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread? Is the common practice to just read whatever is available from the first thread, or does there need to be some kind of decimation to reduce the rate?
Does myData need to be some kind of Singleton with a locking mechanism? Note that myData isn't really shared; it is updated by the first thread and used in the second thread.
How about the opposite case, when the data produced in one thread needs to be used at a higher rate in a different thread?
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::deque or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops the data items off of the other end. Be sure to serialize all accesses to the FIFO queue (using a mutex or similar locking mechanism) to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
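For illustration, a minimal sketch of such a mutex-protected FIFO; the class and method names are mine, and std::deque is just one possible backing container.

#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class SharedQueue {
public:
    void push(T value) {                      // called by the producer thread
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(value)); }
        cv_.notify_one();
    }
    T pop() {                                 // blocks until an item is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop_front();
        return value;
    }
    bool try_pop(T& out) {                    // non-blocking: drain whatever is there
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
};

The 100 Hz thread would call push(myData) and the 10 Hz thread would call try_pop() in a loop each time it wakes up, consuming everything that has accumulated since its last wake-up.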
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
Does myData need to be some kind of Singleton with a locking mechanism?
Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!)
How about the opposite case, when the data produced in one thread needs to be used at a higher rate in a different thread?
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up don't matter, because the communications mechanism will do the right thing regardless of when or how often the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
a producer-consumer container that lets the first thread enqueue data and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on the size, after which either data would be lost or the 1st thread would be forced to slow down and wait before enqueuing further values
there are libraries available (e.g. boost), or if you want to implement it yourself google some tutorials/docs on mutex and condition variables
do something conceptually similar to the above but where the size limit is 1, so there's just the single myData variable rather than a "container" - but all the synchronisation and delay choices remain the same (see the sketch after this list)
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer argument to the function(s) run in the threads. Singleton's easily overused and best avoided unless reasons stack up high....
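For the "size limit is 1" option above, the shared object can be as simple as a mutex-protected slot that the producer overwrites and the consumer samples; again the names are illustrative, not a standard API.

#include <mutex>

template <typename T>
class LatestValue {
public:
    void set(const T& v) {                    // 100 Hz producer overwrites the slot
        std::lock_guard<std::mutex> lk(m_);
        value_ = v;
    }
    T get() const {                           // 10 Hz consumer samples the latest value
        std::lock_guard<std::mutex> lk(m_);
        return value_;
    }
private:
    mutable std::mutex m_;
    T value_{};
};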

CFQ scheduling algorithm

The CFQ algorithm uses an ordered set of queues based on the I/O priority of the processes that made the requests. That means there's a queue for processes of priority, let's say, 1, another one for priority 2, and so on.
I understand that the algorithm takes the first request from each queue, sorts them (to avoid unnecessary head movements) and places them into a dispatch queue to be handled. But since a single request can have many blocks to read (not necessarily contiguous), how is this sort possible? I mean, if I have:
Request1 = [1,2,345,6,423]
and
Request2 = [3,4,2344,664]
where [a,b,c] is a list of the blocks a, b and c, how are requests 1 and 2 placed into the dispatch queue? As you can see, their block numbers interleave (for example, block 6 comes after blocks 3 and 4).
The other thing I don't get is, again, since a request can have multiple blocks to read, what kind of scheduling is done within it? FCFS, or does it order the blocks?
For example, let's say we have a request that contains the following list of blocks to read:
[1,23,5,76,3]
How would the algorithm handle this?
by FCFS:
[1,23,5,76,3]
or by sorting the blocks:
[1,3,5,23,76]
Maybe I didn't understand the algorithm; I couldn't find enough documentation. If anyone has a link or paper with a more detailed explanation, please refer me to it.
In my understanding, CFQ does not schedule individual requests, but time slices for a number of requests.
Quote from IA64wiki:
The CFQ scheduler aims to distribute disk time fairly amongst processes competing for access to the disk. In addition, it uses anticipation and deadlines to improve performance, and attempts to bound latency.
Each process issuing I/O gets a time slice during which it has exclusive access for synchronous requests. The time slice is bounded by two parameters: slice_sync adjusted by the process's I/O priority gives the length in milliseconds of each slice; and quantum gives the number of requests that can be issued.
All processes share a set of 17 queues for asynchronous I/O, one for each effective I/O priority. Each queue gets a time slice (based on slice_async adjusted by priority), and the queues are treated round-robin.
Within each slice, requests are merged (both front and back), and issued according to SCAN, with backward seeks up to back_seek_max permitted, but biased by back_seek_penalty compared with forward seeks. CFQ will wait up to slice_idle ms within a slice for a process to issue more I/O. This allows anticipation, and also improves fairness when a process issues bursty I/O; but in general it reduces performance.
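Just to illustrate the SCAN part of that quote (a toy only, nothing like the real kernel code): once the pending block numbers from both requests are merged, they can be issued in sorted order relative to the current head position, sweeping forward first and then picking up the blocks behind the head.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

std::vector<long> scan_order(std::vector<long> blocks, long head) {
    std::sort(blocks.begin(), blocks.end());
    auto split = std::lower_bound(blocks.begin(), blocks.end(), head);
    std::vector<long> order(split, blocks.end());                 // forward sweep
    order.insert(order.end(),
                 std::make_reverse_iterator(split),
                 std::make_reverse_iterator(blocks.begin()));     // then the blocks behind
    return order;
}

int main() {
    // Blocks of Request1 and Request2 merged, with the head at block 100.
    std::vector<long> pending{1, 2, 345, 6, 423, 3, 4, 2344, 664};
    for (long b : scan_order(pending, 100)) std::cout << b << ' ';
    // Prints: 345 423 664 2344 6 4 3 2 1
}

The real scheduler of course also applies back_seek_max / back_seek_penalty and the per-process time slices described above; this only shows why requests with multiple, interleaved blocks pose no problem for the ordering.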

task delegation scheduler

I implemented a task delegation scheduler instead of a task stealing scheduler. The basic idea of this method is that each thread has its own private local queue. Whenever a task is produced, before it gets enqueued to a local queue, a search is done among the queues and the minimum-size queue is found by comparing the sizes of all the queues. That minimum-size queue is used to enqueue the task. This is a way of diverting pressure away from a busy thread's queue and delegating jobs to the least busy thread's queue.
The problem with this scheduling technique is that we don't know how much time each task takes to complete, i.e. a queue may have a minimal count but its current task may still be running for a long while; on the other hand, a queue may have a higher count but its tasks may all complete very soon. Any ideas to solve this problem?
I am working on Linux, in C++, on our own multithreading library implementing a multi-rate synchronous data flow paradigm.
It seems that your scheduling policy doesn't fit the job at hand. Usually this type of naive scheduling, which ignores task completion times, is only relevant when tasks are relatively equal in execution time.
I'd recommend doing some research. A good place to start would be Wikipedia's Scheduling article but that is of course just the tip of the iceberg.
I'd also give a second (and third) thought to the task-delegation requirement, since time-slicing task operations allows you to fine-tune queue management by considering each task's "history". However, if clients are designed so that each client consistently sends the same "type" of task, then you can achieve similar results with this knowledge.
As far as I remember from my Queueing Theory class, the fairest (of them all ;) system is the one which has a single queue and multiple servers. Using such a system ensures the lowest expected average execution time for all tasks and the largest utilization factor (the % of time the servers are doing work; I'm not sure the term is correct).
In other words, unless you have some priority tasks, please reconsider your task delegation scheduler implementation.
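To make that concrete, a bare-bones single-queue / multiple-workers pool could look roughly like this; the class is illustrative only, and the std::function tasks and shutdown handling are my own simplifications.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {   // every producer feeds the same queue
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // whichever worker is free takes the next task
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

Long tasks simply keep one worker occupied while the others continue draining the queue, which is exactly the property the per-thread-queue design loses.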

Find minimum queue size among threads

I am trying to implement a new scheduling technique with multiple threads. Each thread has its own private local queue. The idea is that each time a task is created by the program thread, it should find the queue with the minimum size (the queue with the fewest tasks) among all the queues and enqueue the task there.
It is a way of load balancing among threads, where the less busy queues are given more of the new work.
Can you please suggest some logic or ideas for how to find the minimum-size queue among the given queues dynamically, from a programming point of view?
I am working in Visual Studio 2008, using C++, on our own multithreading library implementing a multi-rate synchronous data flow paradigm.
As you can see, trying to find the least loaded queue is cumbersome and can be inefficient: you may add more work to a queue that holds a single heavy task, whereas queues with small tasks get no more jobs and quickly become inactive.
You'd be better off using a work-stealing heuristic: when a thread is done with its own jobs, it looks at the other threads' queues and "steals" some work instead of remaining idle or terminating.
Then the system will be auto-balanced with each thread being active until there is not enough work for everyone.
You should not have a situation with idle threads and work waiting for processing.
If you really want to try this, can each queue not just keep a public 'int count' member, updated with atomic inc/dec as tasks are pushed/popped?
Whether such a design is worth the management overhead and the occasional 'mistakes' when a task is queued to a thread that happens to be running a particularly lengthy job when another thread is just about to dequeue a very short job, is another issue.
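A sketch of that counter idea; WorkerQueue here is a hypothetical wrapper around whatever queue type the library actually uses, and the counts are only hints since they can change right after being read.

#include <atomic>
#include <cstddef>
#include <vector>

struct WorkerQueue {
    std::atomic<int> count{0};
    // push(): enqueue the task, then count.fetch_add(1, std::memory_order_relaxed);
    // pop():  dequeue the task, then count.fetch_sub(1, std::memory_order_relaxed);
};

// Index of the queue that currently looks least loaded; the vector is sized
// once up front, since atomics are neither copyable nor movable.
std::size_t least_loaded(const std::vector<WorkerQueue>& queues) {
    std::size_t best = 0;
    int best_count = queues[0].count.load(std::memory_order_relaxed);
    for (std::size_t i = 1; i < queues.size(); ++i) {
        int c = queues[i].count.load(std::memory_order_relaxed);
        if (c < best_count) { best_count = c; best = i; }
    }
    return best;
}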
Why aren't the threads fetching their work from a 'master' work queue ?
If you are really trying to distribute work items from a master source, to a set of workers, you are then doing load balancing, as you say. In that case, you really are talking about scheduling, unless you simply do round-robin style balancing. Scheduling is a very deep subject in Computing, you can easily spend weeks, or months learning about it.
You could synchronise a counter among the threads. But I guess this isn't what you want.
Since you want to implement everything using dataflow, everything should be queues.
Your first option is to query the number of jobs inside a queue. I think this is not easy if you want a single reader/writer pattern, because you probably have to use a lock for this operation, which is not what you want. Note: I'm just guessing that you can't use lock-free queues here; whether you keep a counter or take the difference of two pointers, either way you need a lock.
Your second option (which can be done with lock-free code) is to send a command back to the dispatcher thread, telling it that worker thread x has consumed a job. Using this approach you have n more queues, one from each worker thread back to the dispatcher thread.

TBB for a workload that keeps changing?

I'm sorry, but I don't seem to get Intel's TBB. It seems great and well supported, but I can't wrap my head around how to use it, since I guess I'm not used to thinking of parallelism in terms of tasks; I've always seen it in terms of threads.
My current workload has a job that sends work to a queue to keep processing (think of a recursion, but instead of calling itself it sends work to a queue). The way I got this working in Java was to create a concurrent (non-blocking) queue and a ThreadPoolExecutor that worked the queue and sent work back to it. But now I'm trying to do something similar in C++. I found that TBB can create pools, but its approach is very different (Java threads seem to just keep working as long as there is work in the queue, but TBB seems to break the task down at the beginning).
Here's a simple Java example of what I do (before this I set how many threads I want, etc.):
static class DoWork implements Callable<Void> {
    // queue with contexts to process
    private final Queue<Context> contexts;

    DoWork(Context request) {
        contexts = new ArrayDeque<Context>();
        contexts.add(request);
    }

    public Void call() {
        while (!contexts.isEmpty()) {
            Context current = contexts.poll();
            // do work on current...
            // if more work results from it, send it back to the queue:
            // contexts.add(new Context(data));
        }
        return null;
    }
}
I'm sure it's possible to do this in TBB, but I'm just not sure how, because it seems to break up my work at the time I send it. So if there are 2 items in the queue it may only launch 2 threads, but it won't grow as more work comes in (even if I have 8 cores).
Can someone help me understand how to achieve my tasks, and also maybe suggest a better way to think about TBB coming from Java's threading environment? (I have no allegiance to TBB, so if there's something easier/better then I'm happy to learn it. I just don't like the C++ threadpool library because it doesn't seem actively developed.)
The approach based on having a queue of items for parallel processing, where each thread just pops one item from the queue and proceeds (and possibly adds a new item to the end of the queue at some point), is fundamentally wrong since it limits the parallelism of the application. The queue becomes a single point of synchronization, and threads need to wait in order to get access to the next item to process. In practice this approach works when the tasks (each item's processing job) are quite large and take different times to complete, which keeps the queue less contended, as opposed to when (most of) the threads finish at the same time and come to the queue for their next item to process.
If you're writing a somewhat reusable piece of code, you cannot guarantee that the tasks are large enough or that they vary in size (time to execute).
I assume that your application scales, which means that you start with some significant number of items (much larger than the number of threads) in your queue, and while the threads do the processing they add enough tasks to the end so that there's enough work for everyone until the application finishes.
If that's the case, I would rather suggest that you keep two thread-safe vectors of your items (TBB's concurrent_vector, for instance) and alternate between them. You start with one vector (your initial set of items) and you enqueue() a task (I think it's described somewhere in chapter 12 of the TBB reference manual) which executes a parallel_for over the initial vector of items. While the first batch is being processed you push_back the new items onto the second concurrent_vector, and when you're done with the first one you enqueue() a task with a parallel_for over the second vector and start pushing new items back into the first one. You can try to overlap the parallel processing of items better by having three vectors instead of two and gradually moving between them while there's still enough work to keep all threads busy.
What you're trying to do is exactly the sort of thing TBB's parallel_do is designed for. The "body" invoked by parallel_do is passed a "feeder" argument which you can do feeder.add(...some new task...) on while processing tasks to create new tasks for execution before the current parallel_do completes.
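A rough sketch of what that can look like against the classic TBB interface; the Context type and the "needs a follow-up item" condition are placeholders (and I believe newer oneTBB releases expose the same feeder idea through parallel_for_each instead of parallel_do).

#include <vector>
#include "tbb/parallel_do.h"

struct Context { int data; };

struct ProcessContext {
    void operator()(const Context& c,
                    tbb::parallel_do_feeder<Context>& feeder) const {
        // ... do the actual work on c here ...
        if (c.data < 10)                       // placeholder "more work needed" test
            feeder.add(Context{c.data + 1});   // picked up by any free worker thread
    }
};

int main() {
    std::vector<Context> initial = {{0}, {1}, {2}};
    // Returns only when the initial items and everything fed in have been processed,
    // so items added via the feeder keep all workers busy, much like the Java version.
    tbb::parallel_do(initial.begin(), initial.end(), ProcessContext());
}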