I've got an array of chunks that first need to be processed and then combined. Chunks can be processed in arbitrary order, but they need to be combined in the order they appear in the array.
The following pseudo-code shows my first approach:
array chunks;

def worker_process(chunks):
    while not all_chunks_processed:
        // get the next chunk that isn't processed or in processing yet
        get_next_chunk()
        process_chunk()

def worker_combine(chunks):
    for i = 0 to num_chunks:
        // wait until chunk i is processed
        wait_until(chunks[i].is_processed())
        combine_chunk(i)
def main():
    array chunks;
    // execute worker_process in n threads where n is the number of (logical) cores
    start_n_threads(worker_process, chunks)
    // wait until all processing threads are finished
    join_threads()
    // combine chunks in a single thread
    start_1_thread(worker_combine, chunks)
    // wait until the combine thread is finished
    join_threads()
Measurements of the above algorithm show that processing all chunks in parallel and combining the processed chunks sequentially each take about 3 seconds, which leads to a total runtime of roughly 6 seconds.
I'm using a 24 core CPU which means that during the combine stage, only one of those cores is used. My idea then was to use a pipeline to reduce the execution time. My expectation is that the execution time using a pipeline should be somewhere between 3 and 4 seconds. This is how the main function changes when using a pipeline:
def main():
    array chunks;
    // execute worker_process in n threads where n is the number of (logical) cores
    start_n_threads(worker_process, chunks)
    // combine chunks in a single thread
    start_1_thread(worker_combine, chunks)
    // wait until ALL threads are finished
    join_threads()
This change decreases the runtime slightly, but not nearly as much as I would have expected. I found that all processing threads were finished after 3 to 4 seconds, and the combine thread needed about 2 seconds more.
The problem is that all threads are treated equally by the scheduler, which leads to the combine thread being paused.
Now the question:
How can I change the pipeline so that the combine thread executes faster while still taking advantage of all cores?
I already tried reducing the number of processing threads which helped a bit, but this leads to some cores not being used at all which isn't good either.
EDIT:
While this question wasn't language-specific up to this point, I actually need to implement this in C++14.
You could make your worker threads less specialized. So each time a worker thread is free, it could look for work to do; if a chunk is processed but not combined and no thread is currently combining, then the thread can run the combine for that chunk. Otherwise it can check the unprocessed queue for the next chunk to process.
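To make that concrete, here is a minimal C++14 sketch of the idea; Chunk, worker and the shared counters are placeholder names and a simplification, not a drop-in implementation:

#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical chunk type; process() and combine() stand in for the real work.
struct Chunk {
    std::atomic<bool> processed{false};
    void process() { /* expensive per-chunk work */ processed.store(true); }
    void combine() { /* append this chunk to the combined result */ }
};

// A "less specialized" worker: it prefers in-order combine work when it is
// available and nobody else is combining, otherwise it processes a chunk.
void worker(std::vector<Chunk>& chunks,
            std::atomic<std::size_t>& next_to_process,  // next chunk not yet claimed for processing
            std::atomic<std::size_t>& next_to_combine,  // next chunk that must be combined, in order
            std::mutex& combine_mutex)                  // at most one thread combines at a time
{
    for (;;) {
        // 1) If the next in-order chunk is processed and no one is combining, combine it.
        if (combine_mutex.try_lock()) {
            std::size_t c = next_to_combine.load();
            while (c < chunks.size() && chunks[c].processed.load()) {
                chunks[c].combine();
                next_to_combine.store(++c);
            }
            combine_mutex.unlock();
        }
        // 2) Otherwise claim the next unprocessed chunk, if any remain.
        if (next_to_process.load() < chunks.size()) {
            std::size_t p = next_to_process.fetch_add(1);
            if (p < chunks.size())
                chunks[p].process();
        } else if (next_to_combine.load() >= chunks.size()) {
            return;  // everything has been processed and combined
        } else {
            std::this_thread::yield();  // nothing left to process; wait for combine work
        }
    }
}

int main() {
    std::vector<Chunk> chunks(1000);  // illustrative chunk count
    std::atomic<std::size_t> next_to_process{0}, next_to_combine{0};
    std::mutex combine_mutex;

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n; ++t)
        threads.emplace_back(worker, std::ref(chunks), std::ref(next_to_process),
                             std::ref(next_to_combine), std::ref(combine_mutex));
    for (auto& th : threads)
        th.join();
}

The try_lock keeps the in-order combine confined to one thread at a time, while every idle core is still able to pick that work up.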
UPDATE
First, I've thought a little more about why this might (or might not) help; and second, from the comments it's clear that some additional clarification is required.
But before I get into it - have you actually tried this approach to see if it helps? Because the fact is, reasoning about parallel processing is hard, which is why frameworks for parallel processing do everything they can to make simplifying assumptions. So if you want to know whether it helps, try it and let the results direct the conversation. In truth, neither of us can say for sure whether it's going to help.
So, what this approach gives you is a more fluid allocation of work onto the cores. Instead of having one worker that will do the combine work when it's available but won't do anything else, plus X (say 24) workers that will never do that one task even when it's ready, you have a pool of workers doing whatever needs to be done.
Now it's a simple reality that at any time when one core is being used to combine, one less core will be available for processing than would otherwise. And the total aggregate processor time that will be spent on each kind of work is constant. So those aren't variables to optimize. What we'd like is for the allocation of resources at any time to approximate the ratio of total work to be done.
Now to analyze in any detail we'd need to know whether each task (processing task, combining task) is processor-bound or not (and a million follow-up questions depending on the answer). I don't know that, so the following is only generally accurate...
Your initial data suggests that you spent 3 seconds of single-processor time on combining; let's just call that 3 units of work.
You spent 3 seconds of parallel processing time across 24 cores to do the processing task. Let's swag that out as 72 units of work.
So we'd guess that having roughly 1/25 of your resources on combining wouldn't be bad, but serialization constraints may keep you from realizing that ratio. (And again, if some other resource than processor is the real bottleneck in one or both tasks, then that ratio could be completely wrong anyway.)
Your pipeline approach should get close to that if you could ensure 100% utilization without the combiner thread ever falling asleep. But it can fall asleep, either because work isn't ready for it or because it loses the scheduler lottery some of the time. You can't do much about the former, and maybe the question is can you fix the latter...?
There may be architecture-specific games you can play with thread priority or affinity, but you specified portability, and at best I would expect you to have to re-tune those parameters for each specific piece of hardware if you play those games. Again, my question is: can we get by with something simpler?
The intuition behind my suggestion is simply this: Run enough workers to keep the system busy, and let them do whatever work is ready to do.
The limitation of this approach is that if a worker is put to sleep while it is doing combine work, then the combine pipeline is stalled. But you can't help that unless you can inoculate "thread that's doing combine work" - be that a specialized thread or a worker that happens to be doing a combine unit of work - against being put aside to let other threads run. Again, I'm not saying there's no way - but I'm saying there's no portable way, especially that would run optimally without machine-specific tuning.
And even if you are on a system where you can just outright assign one core to your combiner thread, that still might not be optimal because (a) combining work can still only be done as processing tasks finish, and (b) any time combining work isn't ready that reserved core would sit idle.
Anyway, my guess is that cases where a generic worker gets put to sleep while it happens to be combining would be less frequent than the cases where a dedicated combiner thread is unable to move forward. That's what would drive gains with this approach.
Sometimes it's better to let the incoming workload determine your task allocations, than to try to anticipate and outmaneuver the incoming workload.
One limitation of the approach the OP asked about is the wait on all threads. You need to pipeline the passing of finished jobs from the workers to the combining thread as soon as they are ready in order to make maximum use of all cores, unless the combining operation really takes very little time compared to the actual computation in the workers (as in almost zero in comparison).
Using a simple threading framework like TBB or OpenMP would enable parallelization of the workers, but tuning the reduce phase (the chunk joining) will be critical. If each join takes a while, it will need to be done at a coarse granularity. In OpenMP you could do something like:
double join_arr[N];   // holds the joined/combined result

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        double local_result = do_work(i);   // process element i in parallel

        // the join must be serialized; keep this critical section as short as possible
        #pragma omp critical
        join(join_arr, local_result);
    } // end of parallel for loop
}
A simpler and more explicit way to do it would be to use something like RaftLib (http://raftlib.io; full disclosure, I'm one of the maintainers... but this is what I designed it for):
int main()
{
    // 'arr', 'type_t', 'arr_size' and 'thread_count' stand in for your own types and values
    arr some_arr;
    using foreach = raft::for_each< type_t >;
    foreach fe( some_arr, arr_size, thread_count );
    raft::map m;
    /**
     * send data, zero copy, from fe to join as soon as items are
     * ready to be joined, so everything is done fully in parallel
     * where fe is duplicated on fibers/threads up to thread_count
     * wide and the join is on a separate fiber/thread and gathering
     */
    m += fe >> join;   // 'join' is your combining kernel
    m.exe();
}
Related
I have a program that I am trying to accelerate, but I am not sure that the MPI asynchronous communication is behaving in the way I expect. I expect that a thread is created which performs the communication while the original thread can continue computation in parallel. How can I ensure the program executes this way?
The base program issues an Allgather every x iterations containing x iterations worth of future data. Any iteration only uses data produced x iterations ago. My idea was that instead of batching iterations of data together into a blocking Allgather (what the base program already does), I could issue an asynchronous Iallgather after every iteration and just check that the data has arrived x iterations from now. The program is ~90% computation so I thought this presented ample opportunity to hide the latencies of the communication. Unfortunately, the program is much slower now than before I started on it. The system has a message passing latency which is much smaller than the time it takes to perform x iterations of code for messages of this size.
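Roughly, the scheme I'm describing looks like the sketch below (heavily simplified; compute_iteration, run and the double-typed buffers just stand in for my real code):

#include <mpi.h>
#include <vector>

// Placeholder: consumes the data gathered x iterations ago and produces
// this iteration's contribution to send.
void compute_iteration(std::vector<double>& send, const std::vector<double>& recv)
{
    (void)recv;
    for (double& v : send) v += 1.0;   // stand-in for the real computation (~90% of runtime)
}

void run(MPI_Comm comm, int x, int iterations, int count)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    // ring of x outstanding gathers; slot it % x is reused every x iterations
    std::vector<MPI_Request> requests(x, MPI_REQUEST_NULL);
    std::vector<std::vector<double>> send(x, std::vector<double>(count));
    std::vector<std::vector<double>> recv(x, std::vector<double>(count * nranks));

    for (int it = 0; it < iterations; ++it) {
        int slot = it % x;
        // The gather issued x iterations ago must be complete before its
        // buffers are reused; waiting on MPI_REQUEST_NULL returns immediately.
        MPI_Wait(&requests[slot], MPI_STATUS_IGNORE);

        compute_iteration(send[slot], recv[slot]);

        MPI_Iallgather(send[slot].data(), count, MPI_DOUBLE,
                       recv[slot].data(), count, MPI_DOUBLE,
                       comm, &requests[slot]);
    }
    MPI_Waitall(x, requests.data(), MPI_STATUSES_IGNORE);
}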
I modified my code to try and debug a bit. I had 2 ideas about what could be causing the problem:
The asynchronous communication is just slower for some reason
The communication and computation are not being performed in parallel
To test (1), I turned all of the Iallgathers into Allgathers - the results were identical. This led me to believe that my Iallgathers were not being performed in parallel.
To test (2), I am not at all sure I did this right. I thought that calling MPI_Wait(&my_handle, ...) might force the calling thread to progress the transmission, so I did something like this:
MPI_Iallgather(send_data, send_data_size, ..., &handle);
#pragma omp task
{
    MPI_Wait(&handle, MPI_STATUS_IGNORE);
}
This may be the wrong approach; maybe the thread that issues the Iallgather has to be the one that progresses it.
In summary: I want computation to continue while an Iallgather or any asynchronous communication call is being performed in parallel. How do I ensure that these are being performed as I expect? Is there a chance that it is being performed in parallel and the performance is garbage? I would expect that I would at least see a difference in execution time when switching Iallgather to Allgather in my code.
I realize that some answers to this question are likely to be implementation-dependent, so I'm using Open MPI 5.0.0a1.
My application currently has a list of "tasks" that it loops through. Each task consists of a function; that function can be as simple as printing something out, but it can also be much more complex. (Additional note: most tasks send a packet after they have been executed.) As some of these tasks could take quite some time, I thought about using a separate, asynchronous thread for each task, thus letting all the tasks run concurrently.
Would that be a smart thing to do or not? One problem is that I can't possibly know the number of tasks beforehand, so it could result in quite a few threads being created, and I read somewhere that each piece of hardware has its limitations. I'm planning to run my application on a Raspberry Pi, and I think that I will always have to run between 5 and 20 tasks.
Also, some of the tasks have a lower "priority" of running than others.
Should I just run the important tasks first and then the less important ones? (The problem here is that if the sum of the time needed for all tasks exceeds the time at which some specific, important task should run, my application won't be accurate anymore.) Or implement everything in asynchronous threads? Or just try to make everything a little bit faster by only having the "packet-sending" in an asynchronous thread, so I don't have to wait until the packets actually get sent?
There are a number of questions you will need to ask yourself before you can reasonably design a solution.
Do you wish to limit the number of concurrent tasks?
Is it possible that in future the number of concurrent tasks will increase in a way that you cannot predict today?
... and probably a whole lot more.
If the answer to any of these is "yes" then you have a few options:
A producer/consumer queue with a fixed number of threads draining the queue (not recommended IMHO)
Write your tasks as asynchronous state machines around an event dispatcher such as boost::asio::io_service (this is much more scalable).
If you know it's only going to be 20 concurrent tasks, you'll probably get away with std::async, but it's a sloppy way to write code.
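For the simple case, a minimal std::async sketch might look like this; the task list and its contents are placeholders:

#include <functional>
#include <future>
#include <vector>

int main()
{
    // Hypothetical task list: in practice, fill it with your 5-20 task functions.
    std::vector<std::function<void()>> tasks;
    tasks.push_back([] { /* run the task's function, then send its packet */ });

    // Launch every task on its own thread; std::launch::async forces real threads.
    std::vector<std::future<void>> pending;
    for (auto& task : tasks)
        pending.push_back(std::async(std::launch::async, task));

    // Wait for all tasks to finish; get() also rethrows any exception a task threw.
    for (auto& f : pending)
        f.get();
}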
I wrote a code for the simulation of a communication system. Within this communication system, there is a part that I am running in parallel using pthreads. It basically corrects errors that are caused by the channel.
When I receive a frame of bits, 3 components of the algorithm run over the bits. Usually, they are run one after the other, which results in optimal performance, but a huge delay.
The idea is to make them run in parallel. But to get optimal performance, the 3 components must process each bit at the same time.
If I just run them in parallel, I obtain bad results, but the code is quite fast. So I used barriers to synchronize the process, so that each bit is processed by all three components before they are allowed to jump to the following bit.
With the barriers, the results are optimal again, but the code runs really slowly; it is even slower than the serial implementation.
The code runs on Ubuntu with GCC compiler.
EDIT: One more question: do threads go to sleep while they are waiting for the barrier to open? And if so, how do I prevent them from doing so?
If you literally have to synchronise after every bit, then quite simply threading is not going to be an appropriate approach. The synchronisation overhead is going to far exceed the cost of computation, so you will be better off doing it in a single thread.
Can you split the work up at a higher level? For example, have an entire frame processed by a single thread, but have multiple frames processed in parallel?
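If that frame-level split is possible, a minimal sketch (with Frame and process_frame as placeholders for your data and the three components) could look like this:

#include <cstddef>
#include <thread>
#include <vector>

// Placeholders for the asker's data and algorithm: a received frame of bits,
// and a function that runs the three components on one frame sequentially.
struct Frame { std::vector<char> bits; };
void process_frame(Frame& frame) { (void)frame; /* component 1 -> component 2 -> component 3 */ }

void process_all(std::vector<Frame>& frames, unsigned n_threads)
{
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&frames, t, n_threads] {
            // each worker takes every n_threads-th frame; no per-bit synchronisation needed
            for (std::size_t i = t; i < frames.size(); i += n_threads)
                process_frame(frames[i]);
        });
    }
    for (auto& w : workers)
        w.join();
}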
Here is a possible solution, USING NO MUTEX.
Let's say you have 4 threads: the main thread reads some input, and the other 3 threads process it chunk by chunk. A thread can process a chunk only after the previous thread is done with it.
So you have a data type for a chunk:
class chunk {
    unsigned char buffer[CHUNK_SIZE];
    char status;               // which pipeline stage is done; with C++11 you can use std::atomic<int>
public:
    chunk() : status(0) {}
    char get_status() const { return status; }
    void set_status(char s)   { status = s; }   // plain assignment, not ++ (see below)
};
And you have a list of chunks:
std::list<chunk> chunks;
All threads run over the chunks until they reach the end of the list, but each waits until the status reaches its condition. The main thread sets the status to 1 when the input for a chunk is done. The 1st thread waits until the status is 1 (meaning the input is done) and sets it to 2 when it finishes. Thread 2 waits until the status is 2 (meaning thread 1 is done) and sets it to 3 when it is done processing the chunk, and so on. Finally, the main thread waits until the status is 4 to collect the results.
Important: when setting the status, use a plain assignment (=), not ++, to keep the write as atomic as possible.
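With C++11 atomics, stage k of this pipeline might look roughly like the following; the chunk layout and process_stage are placeholders that follow the description above:

#include <atomic>
#include <cstddef>
#include <list>
#include <thread>

constexpr std::size_t CHUNK_SIZE = 4096;   // illustrative size

struct chunk {
    unsigned char buffer[CHUNK_SIZE];
    std::atomic<int> status{0};   // 0 = empty, 1 = input ready, k+1 = stage k done
};

// Placeholder for the real per-stage work.
void process_stage(int k, unsigned char* buffer) { (void)k; (void)buffer; }

// Worker for pipeline stage k (k = 1, 2, 3): wait until the previous stage has
// finished each chunk, process it, then publish our own stage number.
void stage_worker(std::list<chunk>& chunks, int k)
{
    for (chunk& c : chunks) {
        while (c.status.load() < k)        // previous stage not done with this chunk yet
            std::this_thread::yield();
        process_stage(k, c.buffer);
        c.status.store(k + 1);             // plain store, never ++
    }
}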
Basically I have a Task and a Thread class, and I create threads equal to the number of physical cores (or logical cores, which on Intel CPUs with Hyper-Threading are double the physical count).
So basically threads take tasks from a list of tasks and execute them. However, I have to make sure everything is safe and that multiple threads don't try to take the same task at once, and of course this introduces extra overhead (and headaches).
What if I put the tasks' functionality inside the threads? I mean, instead of 4 threads grabbing tasks from a pool of 200 tasks, why not 200 threads that execute in groups of 4 at a time? Basically I wouldn't need to synchronize anything: no locking, no nothing. Of course I wouldn't be creating the threads throughout the whole run-time, just at initialization.
What pros and cons would such a method have? One problem I can think of is that since I only create the threads at initialization, their count is fixed, while with tasks I can keep dumping more tasks into the task pool.
Threads have cost - each one requires space for a TLS and for a stack as a minimum.
Keeping your Task and Thread classes separate would be a cleaner and more manageable approach in the long run, and it keeps overhead down by allowing you to limit how many Threads are created and running at any given time (also, a Task is likely to take up less memory than a Thread, and be faster to create and free when needed). A Task is what controls what gets done. A Thread is what controls when a Task is run. Yes, you would need to store the Task objects in a thread-safe list, but that is very simple to implement using a critical section, mutex, semaphore, etc., as the sketch below shows. On Windows specifically, you could alternatively use an I/O Completion Port to submit Tasks to Threads and let the OS handle the synchronization and scheduling for you.
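For illustration, a minimal thread-safe task queue along those lines can be as simple as this (Task and TaskQueue are placeholder names, with a mutex standing in for the critical section/semaphore alternatives):

#include <memory>
#include <mutex>
#include <queue>

// Hypothetical Task type; run() is whatever the task does.
struct Task {
    virtual void run() = 0;
    virtual ~Task() = default;
};

// Minimal thread-safe task queue: a plain queue guarded by a mutex.
class TaskQueue {
    std::queue<std::unique_ptr<Task>> tasks_;
    std::mutex mutex_;
public:
    void push(std::unique_ptr<Task> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }
    // Returns nullptr when the queue is empty.
    std::unique_ptr<Task> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty())
            return nullptr;
        std::unique_ptr<Task> task = std::move(tasks_.front());
        tasks_.pop();
        return task;
    }
};

Each worker thread then just loops: pop a task, run it, and stop (or sleep) when pop returns nullptr.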
It will definitely take longer to have 200 threads running at once than to have 4 threads work through 200 "tasks". You can test this with a simple program that does some simple math (e.g. calculate the first 20000000 primes, either by asking each of 4 threads to do 100000 numbers at a time and then grab the next lot, or by making 200 threads with 100000 numbers each).
How much slower? Don't know, depends on so many things.
suppose I have a code like this
for(i = 0; i < i_max; i++)
    for(j = 0; j < j_max; j++)
        // do something
and I want to do this using different threads (assuming the "do something" tasks are independent from each other; think of Monte Carlo simulations, for instance). My question is this: is it necessarily better to create a thread for each value of i than to create a thread for each value of j? Something like this:
for(i = 0; i < i_max; i++)
    create_thread(j_max);
Additionally: what would be a suitable number of threads? Shall I just create i_max threads, or perhaps use a semaphore with k < i_max threads running concurrently at any given time?
Thank you,
The best way to apportion the workload is workload-dependent.
Broadly - for parallelizable workload, use OpenMP; for heterogeneous workload, use a thread pool. Avoid managing your own threads if you can.
Monte Carlo simulation should be a good candidate for truly parallel code rather than thread pool.
By the way - in case you are on Visual C++, Visual C++ 2010 has an interesting new Concurrency Runtime for precisely this type of problem. This is somewhat analogous to the Task Parallel Library that was added to .NET Framework 4 to ease the implementation of multicore/multi-CPU code.
Avoid creating threads unless you can keep them busy!
If your scenario is compute-bound, then you should minimize the number of threads you spawn to the number of cores you expect your code to run on. If you create more threads than you have cores, then the OS has to waste time and resources scheduling the threads to execute on the available cores.
If your scenario is IO-bound, then you should consider using async IO operations that are queued and which you check the response codes from after the async result is returned. Again, in this case, spawning a thread per IO operation is hugely wasteful as you'll cause the OS to have to waste time scheduling threads that are stalled.
Everyone here is basically right, but here's a quick-and-dirty way to split up the work and keep all of the processors busy. This works best when 1) creating threads is expensive compared to the work done in an iteration, and 2) most iterations take about the same amount of time to complete.
First, create 1 thread per processor/core. These are your worker threads. They sit idle until they're told to do something.
Now, split up your work such that data that is needed at the same time is close together. What I mean by that is that if you were processing a ten-element array on a two-processor machine, you'd split it up so that one group is elements 1,2,3,4,5 and the other is 6,7,8,9,10. You may be tempted to split it up as 1,3,5,7,9 and 2,4,6,8,10, but then you're going to cause more false sharing (http://en.wikipedia.org/wiki/False_sharing) in your cache.
So now that you have a thread per processor and a group of data for each thread, you just tell each thread to work on an independent group of that data.
So in your case I'd do something like this.
for (int t = 0; t < n_processors; ++t)
{
    thread[t] = create_thread();
    datamin[t] = t * (i_max / n_processors);
    datamax[t] = (t + 1) * (i_max / n_processors);
}

for (int t = 0; t < n_processors; ++t)
    do_work(thread[t], datamin[t], datamax[t], j_max);

// wait for all threads to be done
// continue with rest of the program.
Of course I left out things like dealing with your data not being an integer multiple of the number of processors, but those are easily fixed.
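If it helps, a rough std::thread rendering of the sketch above could look like this; do_work is still a placeholder for your real per-range computation:

#include <thread>
#include <vector>

// Placeholder for the real work over i in [datamin, datamax).
void do_work(int datamin, int datamax, int j_max)
{
    for (int i = datamin; i < datamax; ++i)
        for (int j = 0; j < j_max; ++j)
            ;   // do something
}

void run_parallel(int i_max, int j_max, int n_processors)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < n_processors; ++t) {
        int datamin = t * (i_max / n_processors);
        int datamax = (t + 1) * (i_max / n_processors);
        // each thread owns one contiguous block of the outer index
        workers.emplace_back(do_work, datamin, datamax, j_max);
    }
    // wait for all threads to be done
    for (auto& w : workers)
        w.join();
    // continue with the rest of the program
}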
Also, if you're not averse to 3rd-party libraries, Intel's TBB (Threading Building Blocks) does a great job of abstracting this from you and letting you get to the real work you want to do.
Everything around creating and calling threads is relatively expensive so you want to do that as little as possible.
If you parallelize your inner loop instead of the outer loop, then j_max threads are created for every iteration of the outer loop, so you pay the thread-creation cost i_max times more often than if you parallelized the outer loop instead.
That said, the best parallelization depends on your actual problem. Depending on that, it can actually make sense to parallelize the inner loop instead.
Depends on the tasks and on what platform you're about to simulate on. For example, on CUDA's architecture you can split the tasks up so that each (i, j) element is handled individually.
You still have the time to load data onto the card to consider.
Using for loops and something like OpenMP/MPI/your own threading mechanism, you can basically choose. In one scenario, parallel threads are broken out for the outer loop and j is looped over sequentially on each thread. In the other, the outer loop runs sequentially, and threads are broken out afresh for each parallelised inner loop.
Parallelisation (breaking out threads) is costly. Remember that you have the cost of setting up n threads, then synchronising n threads. This represents a cost c over and above the runtime of the routines which in and of itself can make the total time greater for parallel processing than in single threaded mode. It depends on the problem in question; often, there's a critical size beyond which parallel is faster.
I would suggest that breaking out into the parallel zone in the first (outer) for loop will be faster. If you do so on the inner loop, you must fork/join each time the outer loop runs, adding a large overhead to the speed of the code. Ideally, you want to create threads only once.
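For instance, a minimal OpenMP sketch of parallelising the outer loop forks the thread team exactly once; do_something here just stands in for the independent loop body:

// Placeholder for the independent per-iteration work (e.g. one Monte Carlo sample).
void do_something(int i, int j) { (void)i; (void)j; }

void run(int i_max, int j_max)
{
    // the thread team is created once, here; each thread gets a share of
    // the i values and loops over j sequentially
    #pragma omp parallel for
    for (int i = 0; i < i_max; i++)
        for (int j = 0; j < j_max; j++)
            do_something(i, j);
}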