How to ensure MPI asynchronous calls are executed in parallel? - c++

I have a program that I am trying to accelerate, but I am not sure that the MPI asynchronous communication is behaving in the way I expect. I expect that a thread is created which performs the communication while the original thread can continue computation in parallel. How can I ensure the program executes this way?
The base program issues an Allgather every x iterations, containing x iterations' worth of future data. Any iteration only uses data produced x iterations earlier. My idea was that instead of batching iterations of data together into a blocking Allgather (what the base program already does), I could issue an asynchronous Iallgather after every iteration and just check that the data has arrived x iterations later. The program is ~90% computation, so I thought this presented ample opportunity to hide the latency of the communication. Unfortunately, the program is now much slower than before I started on it. For messages of this size, the system's message-passing latency is much smaller than the time it takes to perform x iterations of code.
I modified my code to try and debug a bit. I had 2 ideas about what could be causing the problem:
The asynchronous communication is just slower for some reason
The communication and computation are not being performed in parallel
To test (1), I turned all of the Iallgathers into Allgathers - the results were identical. This led me to believe that my Iallgathers were not being performed in parallel.
To test (2), I am not at all sure I did this right. I thought that calling MPI_Wait(&my_handle, ...) might force the calling thread to progress the transmission, so I did something like this:
MPI_Iallgather(send_data, send_data_size, ..., &handle);
#pragma omp task
{
    MPI_Wait(&handle, MPI_STATUS_IGNORE);
}
This may be the wrong approach; maybe the thread that issues the Iallgather has to be the one that performs it.
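One pattern I am now considering is polling the request from the compute loop, since (as far as I understand) many MPI implementations only progress nonblocking collectives while the application is inside an MPI call. A rough sketch (compute_one_iteration and the MPI_DOUBLE datatype are placeholders, not my real code):

MPI_Request handle;
MPI_Iallgather(send_data, send_data_size, MPI_DOUBLE,
               recv_data, send_data_size, MPI_DOUBLE,
               MPI_COMM_WORLD, &handle);

int done = 0;
for (int iter = 0; iter < x; ++iter) {
    compute_one_iteration();                          // placeholder for the real work
    if (!done)
        MPI_Test(&handle, &done, MPI_STATUS_IGNORE);  // nudges the library's progress engine
}
if (!done)
    MPI_Wait(&handle, MPI_STATUS_IGNORE);             // the data is needed from here on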
In summary: I want computation to continue while an Iallgather or any asynchronous communication call is being performed in parallel. How do I ensure that these are being performed as I expect? Is there a chance that it is being performed in parallel and the performance is garbage? I would expect that I would at least see a difference in execution time when switching Iallgather to Allgather in my code.
I realize that some answers to this question are likely to be implementation-dependent, so I'm using Open MPI 5.0.0a1.

Related

c++: Thread pipeline balancing

I've got an array of chunks that first need to be processed and then combined. Chunks can be processed in arbitrary order, but they need to be combined in the order they appear in the array.
The following pseudo-code shows my first approach:
array chunks;

def worker_process(chunks):
    while not all_chunks_processed:
        // get the next chunk that isn't processed or in processing yet
        get_next_chunk()
        process_chunk()

def worker_combine(chunks):
    for i=0 to num_chunks:
        // wait until chunk i is processed
        wait_until(chunks[i].is_processed())
        combine_chunk(i)

def main():
    array chunks;
    // execute worker_process in n threads where n is the number of (logical) cores
    start_n_threads(worker_process, chunks)
    // wait until all processing threads are finished
    join_threads()
    // combine chunks in a single thread
    start_1_thread(worker_combine, chunks)
    // wait until combine thread is finished
    join_threads()
Measurements of the above algorithm show that processing all chunks in parallel and combining the processed chunks sequentially both takes about 3 seconds which leads to a total runtime of roughly 6 seconds.
I'm using a 24 core CPU which means that during the combine stage, only one of those cores is used. My idea then was to use a pipeline to reduce the execution time. My expectation is that the execution time using a pipeline should be somewhere between 3 and 4 seconds. This is how the main function changes when using a pipeline:
def main():
    array chunks;
    // execute worker_process in n threads where n is the number of (logical) cores
    start_n_threads(worker_process, chunks)
    // combine chunks in a single thread
    start_1_thread(worker_combine, chunks)
    // wait until ALL threads are finished
    join_threads()
This change decreases the runtime slightly, but by far not as much as I would have expected. I found out that all processing threads were finished after 3 to 4 seconds and the combine thread needed about 2 seconds more.
The problem is that all threads are treated equally by the scheduler, which leads to the combine thread being paused.
Now the question:
How can I change the pipeline so that the combine thread executes faster while still taking advantage of all cores?
I already tried reducing the number of processing threads which helped a bit, but this leads to some cores not being used at all which isn't good either.
EDIT:
While this question wasn't language-specific until this point, I actually need to implement it in C++14.
You could make your worker threads less specialized. So each time a worker thread is free, it could look for work to do; if a chunk is processed but not combined and no thread is currently combining, then the thread can run the combine for that chunk. Otherwise it can check the unprocessed queue for the next chunk to process.
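A minimal sketch of that idea in C++14 (the chunk bookkeeping, the chunk count, and process_chunk/combine_chunk are assumptions for illustration, not from the question):

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-ins for the real work; the actual signatures are unknown.
void process_chunk(std::size_t) { /* heavy, order-independent work */ }
void combine_chunk(std::size_t) { /* must run in array order */ }

constexpr std::size_t num_chunks = 1024;        // assumed chunk count
std::atomic<bool> processed[num_chunks];        // zero-initialized to false
std::atomic<std::size_t> next_to_process{0};    // next unclaimed chunk
std::atomic<std::size_t> next_to_combine{0};    // combining must stay in order
std::atomic<bool> combining{false};             // at most one combiner at a time

void worker() {
    for (;;) {
        // Prefer combining: the next in-order chunk is ready and
        // nobody else is combining right now.
        std::size_t c = next_to_combine.load();
        if (c < num_chunks && processed[c].load() && !combining.exchange(true)) {
            // Re-check under the flag; another worker may have advanced c.
            c = next_to_combine.load();
            while (c < num_chunks && processed[c].load()) {
                combine_chunk(c);
                next_to_combine.store(++c);
            }
            combining.store(false);
            continue;
        }
        // Otherwise claim the next unprocessed chunk, if any are left.
        std::size_t p = next_to_process.fetch_add(1);
        if (p < num_chunks) {
            process_chunk(p);
            processed[p].store(true);
        } else if (next_to_combine.load() >= num_chunks) {
            return;                             // everything processed and combined
        } else {
            std::this_thread::yield();          // only combining is left; let it run
        }
    }
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < (n ? n : 4); ++i)  // fall back to 4 if unknown
        pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}

The combining flag keeps the combine stage strictly sequential and in order, while any idle worker can pick it up, so a stalled thread never blocks the pipeline for longer than one combine step.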
UPDATE
First, I've thought a little more about why this might (or might not) help; and second, from the comments it's clear that some additional clarification is required.
But before I get into it - have you actually tried this approach to see if it helps? Because the fact is, reasoning about parallel processing is hard, which is why frameworks for parallel processing do everything they can to make simplifying assumptions. So if you want to know whether it helps, try doing it and let the results direct the conversation. In truth, neither of us can say for sure whether it's going to help.
So, what this approach gives you is a more fluid acceptance of work onto the cores. Instead of having one worker who, if work is available when his turn comes up, will do that work but won't do anything else, and X (say 24) workers who will never do that one task even if it's ready to do, you have a pool of workers doing what needs done.
Now it's a simple reality that at any time when one core is being used to combine, one less core will be available for processing than would otherwise. And the total aggregate processor time that will be spent on each kind of work is constant. So those aren't variables to optimize. What we'd like is for the allocation of resources at any time to approximate the ratio of total work to be done.
Now to analyze in any detail we'd need to know whether each task (processing task, combining task) is processor-bound or not (and a million follow-up questions depending on the answer). I don't know that, so the following is only generally accurate...
Your initial data suggests that you spent 3 seconds of single-threaded time on combining; let's just call that 3 units of work.
You spent 3 seconds of parallel processing time across 24 cores to do the processing task. Let's swag that out as 72 units of work.
So we'd guess that having roughly 1/25 of your resources on combining wouldn't be bad, but serialization constraints may keep you from realizing that ratio. (And again, if some other resource than processor is the real bottleneck in one or both tasks, then that ratio could be completely wrong anyway.)
Your pipeline approach should get close to that if you could ensure 100% utilization without the combiner thread ever falling asleep. But it can fall asleep, either because work isn't ready for it or because it loses the scheduler lottery some of the time. You can't do much about the former, and maybe the question is can you fix the latter...?
There may be architecture-specific games you can play with thread priority or affinity, but you specified portability and I would at best expect you to have to re-tune parameters to each specific hardware if you play those games. Again, my question is can we get by with something simpler?
The intuition behind my suggestion is simply this: Run enough workers to keep the system busy, and let them do whatever work is ready to do.
The limitation of this approach is that if a worker is put to sleep while it is doing combine work, then the combine pipeline is stalled. But you can't help that unless you can inoculate "thread that's doing combine work" - be that a specialized thread or a worker that happens to be doing a combine unit of work - against being put aside to let other threads run. Again, I'm not saying there's no way - but I'm saying there's no portable way, especially that would run optimally without machine-specific tuning.
And even if you are on a system where you can just outright assign one core to your combiner thread, that still might not be optimal because (a) combining work can still only be done as processing tasks finish, and (b) any time combining work isn't ready that reserved core would sit idle.
Anyway, my guess is that cases where a generic worker gets put to sleep when he happened to be combining would be less frequent than cases where a dedicated combiner thread is unable to move forward. That's what would drive gains with this approach.
Sometimes it's better to let the incoming workload determine your task allocations, than to try to anticipate and outmaneuver the incoming workload.
One limitation of the approach the OP asked about is the wait on all threads. To make maximum use of all cores, you need to pipeline the passing of finished jobs from the workers to the combining thread as soon as they are ready, unless the combining operation really takes very little time compared to the actual computation in the workers (as in almost zero in comparison).
Using a simple threading framework like TBB or OpenMP would enable parallelization of the workers, but tuning the reduce phase (the chunk joining) will be critical. If each join takes a while, doing that at a coarse granularity will be needed. In OpenMP you could do something like:
double join_arr[N];
#pragma omp parallel
{
    double local_result;
    #pragma omp for
    for (int i = 0; i < N; i++) {
        local_result = do_work(i);      // per-chunk work runs in parallel
        #pragma omp critical
        join(join_arr, local_result);   // the join step runs one thread at a time
    } // end of for loop
}
A more explicit and simpler way to do it would be to use something like RaftLib (http://raftlib.io; full disclosure, I'm one of the maintainers... but this is what I designed it for):
int main()
{
    arr some_arr;
    using foreach = raft::for_each< type_t >;
    foreach fe( some_arr, arr_size, thread_count );
    raft::map m;
    /**
     * send data, zero copy, from fe to join as soon as items are
     * ready to be joined, so everything is done fully in parallel
     * where fe is duplicated on fibers/threads up to thread_count
     * wide and the join is on a separate fiber/thread and gathering
     */
    m += fe >> join;
    m.exe();
}

How to improve the time performance of a C++ pthread code that uses Barriers

I wrote a code for the simulation of a communication system. Within this communication system, there is a part that I am running in parallel using pthreads. It basically corrects errors that are caused by the channel.
When I receive a frame of bits, 3 components of the algorithm run over the bits. Usually they run one after the other, which gives optimal results but a huge delay.
The idea is to make them run in parallel. But to get optimal results, the 3 components must process each bit at the same time.
If I just run them in parallel I obtain bad results, but quite fast execution. So I used barriers to synchronize the process, so that each bit is processed by all three components before they are allowed to move on to the following bit.
The results of this method are optimal, but the code runs really slowly; it is even slower than the serial implementation.
The code runs on Ubuntu with GCC compiler.
EDIT: One more question: do threads go to sleep while they are waiting for the barrier to open? And if so, how do I prevent them from doing so?
If you literally have to synchronise after every bit, then quite simply threading is not going to be an appropriate approach. The synchronisation overhead is going to far exceed the cost of computation, so you will be better off doing it in a single thread.
Can you split the work up at a higher level? For example, have an entire frame processed by a single thread, but have multiple frames processed in parallel?
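For example, a per-frame split might look like this (a sketch; the Frame type and decode_frame are stand-ins for whatever the simulation actually does):

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical frame type and per-frame decoder; each frame is decoded
// serially by one thread, but many frames are in flight at once.
struct Frame { /* received bits ... */ };
void decode_frame(Frame&) { /* run all 3 components over the frame, bit by bit */ }

void decode_all(std::vector<Frame>& frames) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                     // fall back if the count is unknown
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&frames, t, n] {
            for (std::size_t i = t; i < frames.size(); i += n)
                decode_frame(frames[i]);   // no per-bit synchronisation needed
        });
    }
    for (auto& th : pool) th.join();
}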
Here is a possible solution, USING NO MUTEX.
Let's say you have 4 threads: the main thread reads some input, and the other 3 process it chunk by chunk. A thread can process a chunk only after the previous thread is done with it.
So you have a data type for a chunk:
class chunk {
    unsigned char buffer[CHUNK_SIZE];
    char status;   // char stores are close to atomic; with C++11 use std::atomic<int>
public:
    chunk() : status(0) {}
};
and you have a list of chunks:
std::list<chunk> chunks;
All threads run over the chunks until they reach the end of the list, but each waits until a chunk's status reaches its condition. The main thread sets status to 1 when it has finished reading a chunk's input. The 1st thread waits until status is 1 (meaning input is done) and sets it to 2 when done; thread 2 waits until status is 2 (meaning thread 1 is done) and sets it to 3 when it finishes the chunk; and so on. Finally, the main thread waits until status is 4 to collect the results.
When setting the status, it is important to use plain assignment (=), not ++, to keep the store as close to atomic as possible.
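A sketch of the per-stage loop this describes, using C++11 std::atomic rather than a plain char (CHUNK_SIZE and the stage numbering are illustrative):

#include <atomic>
#include <list>

constexpr int CHUNK_SIZE = 4096;   // illustrative size

struct chunk {
    unsigned char buffer[CHUNK_SIZE];
    std::atomic<int> status{0};    // 0 = empty, 1 = input done, N+1 = stage N done
};

// Each worker runs one stage over the whole (pre-allocated) list: wait until
// the previous stage has released a chunk, work on it, then pass it on.
void stage_worker(std::list<chunk>& chunks, int my_stage) {
    for (auto& c : chunks) {
        while (c.status.load() < my_stage)
            ;                              // busy-wait; could also yield or sleep
        // ... process c.buffer for this stage ...
        c.status.store(my_stage + 1);      // hand the chunk to the next stage
    }
}

Here stage 1 waits for status 1 and publishes 2, and so on, matching the scheme above; the main thread finally waits for status 4.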

C++ Pthreads - Multithreading slower than single-threading [closed]

I am developing a C++ application, using pthreads library. Every thread in the program accesses a common unordered_map. The program runs slower with 4 threads than 1. I commented all the code in thread, and left only the part that tokenizes a string. Still the single-threading execution is faster, so I came to the conclusion that the map wasn't the problem.
After that I printed to the screen the threads' Ids, and they seemed to execute sequentially.
In the function that spawns the threads, I have a while loop that creates threads into an array whose size is the number of threads (let's say 'tn'). Every time tn threads have been created, I execute a for loop to join them (pthread_join). The while loop runs many times (not only 4), as sketched below.
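The structure is roughly the following (a sketch with hypothetical names, since I can't post the full code):

#include <pthread.h>

const int tn = 4;                        // number of threads per batch

bool more_work_remains() { return false; }   // hypothetical loop condition
void* worker(void*) {                        // tokenizes a string, touches the shared map, ...
    return nullptr;
}

void run_batches() {
    pthread_t threads[tn];
    while (more_work_remains()) {
        for (int i = 0; i < tn; ++i)
            pthread_create(&threads[i], nullptr, worker, nullptr);
        for (int i = 0; i < tn; ++i)
            pthread_join(threads[i], nullptr);   // full batch join on every pass
    }
}

Note that this pattern pays thread-creation and join costs on every pass of the while loop.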
What may be wrong?
If you are running a small, trivial program, this tends to be the case, because the work to start the threads, schedule them, run, context switch, and then synchronize can actually take more time than running the same work as a single thread.
The point here is that when dealing with trivial problems it can run slower. BUT another factor might be how many cores you actually have in your CPU.
When you run a multithreaded program on a single core, the threads are executed sequentially, time-sliced by the CPU.
You will only have true multithreading if you have multiple cores, and in that scenario the most you get is 1 thread per core.
Now, given the fact that your threads (most likely) share a core, keep in mind the overhead the CPU incurs for:
allocating different clock time for each thread
synchronizing thread accesses to various internal CPU operations
other thread priority operations
So in other words, for a simple application, multithreading is actually a downgrade in terms of performance.
Multithreading comes in handy when you need an asynchronous operation (meaning you don't want to wait for a result, such as loading an image from a URL, or streaming geometry from an HDD, which is slower than RAM).
In such scenarios, applying multithreading leads to a better user experience, because your program won't hang when a slow operation occurs.
Without seeing the code it's difficult to tell for sure, but there could be a number of issues.
Your threads might not be doing enough work to justify their creation. Creating and running threads is expensive, so if your workload is too small, they won't pay for themselves.
Execution time could be spent mostly doing memory accesses on the map, in which case mutually excluding the threads means that you aren't really doing much parallel work in practice (Amdahl's Law).
If most of your code is running under a mutex, then it will run serially and not in parallel.

fastest way to wake up a thread without using condition variable

I am trying to speed up a piece of code by having background threads already set up to solve one specific task. When it is time to solve my task, I would like to wake up these threads, do the job, and block them again while waiting for the next task. The task is always the same.
I tried using condition variables (and the mutexes that need to go with them), but I ended up slowing my code down instead of speeding it up, mostly because the calls to all the needed functions are very expensive (pthread_cond_wait/pthread_cond_signal/pthread_mutex_lock/pthread_mutex_unlock).
There is no point in using a thread pool (which I don't have either) because it is too generic a construct; here I want to address only my specific task. Depending on the implementation, I would also pay a performance penalty for the queue.
Do you have any suggestion for a quick wake-up without using a mutex or condition variable?
I was thinking of setting up the threads like timers that read an atomic variable: if the variable is set to 1, the threads do the job; if it is set to 0, they go to sleep for a few microseconds (I would start with a microsecond sleep, since I would like to avoid spinlocks, which might be too expensive for the CPU). What do you think about it? Any suggestion is very much appreciated.
I am using Linux, gcc, C and C++.
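For concreteness, the polling scheme I have in mind looks roughly like this (a sketch; do_the_task stands in for my real task):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int>  go{0};        // 1 = task ready, 0 = idle
std::atomic<bool> quit{false};  // set to true to shut the workers down

void do_the_task() { /* the one specific task */ }

void worker() {
    while (!quit.load(std::memory_order_acquire)) {
        if (go.exchange(0, std::memory_order_acq_rel) == 1) {
            do_the_task();
        } else {
            // Back off briefly instead of spinning flat out.
            std::this_thread::sleep_for(std::chrono::microseconds(10));
        }
    }
}

// Producer side, when work is ready:
//     go.store(1, std::memory_order_release);

One caveat I am aware of: sleep_for with a microsecond argument can actually sleep much longer, depending on the scheduler's granularity.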
These functions should be fast. If they are taking a large fraction of your time, it is quite possible that you are trying to switch threads too often.
Try buffering up a work queue, and send the signal once a significant amount of work has accumulated.
If this is impossible due to dependencies between the tasks, then your application is not amenable to multithreading at all.
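For instance, the producer could enqueue under the lock and only signal once a batch has built up; a sketch (the item type, batch size, and flush policy are arbitrary choices):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

std::mutex mtx;
std::condition_variable cv;
std::queue<int> work;               // illustrative work-item type
const std::size_t kBatch = 64;      // tune: items per wake-up

void submit(int item) {
    std::unique_lock<std::mutex> lk(mtx);
    work.push(item);
    if (work.size() >= kBatch) {    // one signal amortized over kBatch items
        lk.unlock();
        cv.notify_one();
    }
}

void drain(std::queue<int>& out) {  // consumer: wake once, take the whole batch
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return work.size() >= kBatch; });
    work.swap(out);
    // A real version would also flush a partial batch on timeout or shutdown.
}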
In order to gain performance in a multithreaded application, spawn as many threads as there are CPUs, not a separate thread for each task. Otherwise you end up with a lot of overhead from context switching.
You may also consider making your algorithm more linear (i.e. by using non-blocking calls).

Best way to slow down a thread? Is using Sleep() OK?

I've written a C++ library that does some seriously heavy CPU work (all of it math and calculations) and if left to its own devices, will easily consume 100% of all available CPU resources (it's also multithreaded to the number of available logical cores on the machine).
As such, I have a callback inside the main calculation loop that software using the library is supposed to call:
while(true)
{
    //do math here
    callback(percent_complete);
}
In the callback, the client calls Sleep(x) to slow down the thread.
Originally, the clientside code was a fixed Sleep(100) call, but this led to bad unreliable performance because some machines finish the math faster than others, but the sleep is the same on all machines. So now the client checks the system time, and if more than 1 second has passed (which == several iterations), it will sleep for half a second.
Is this an acceptable way of slowing down a thread? Should I be using a semaphore/mutex instead of Sleep() in order to maximize performance? Is sleeping x milliseconds for each 1 second of processing work fine or is there something wrong that I'm not noticing?
The reason I ask is that the machine still gets heavily bogged down even though taskman shows the process taking up ~10% of the CPU. I've already explored hard disk and memory contention to no avail, so now I'm wondering if the way I'm slowing down the thread is causing this problem.
Thanks!
Why don't you use a lower priority for the calculation threads? That will ensure other threads are scheduled while allowing your calculation threads to run as fast as possible if no other threads need to run.
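On Windows (which Sleep() suggests you are using), that can be as simple as calling this once at the start of each calculation thread; THREAD_PRIORITY_LOWEST is just one reasonable choice of level:

#include <windows.h>

// Lower the current thread's priority: it still uses all idle CPU,
// but yields immediately whenever normal-priority threads need to run.
void make_low_priority_worker() {
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST);
}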
What is wrong with the CPU at 100%? That's what you should strive for, not try to avoid. These math calculations are important, no? Unless you're trying to avoid hogging some other resource not explicitly managed by the OS (a mutex, the disk, etc) and used by the main thread, generally trying to slow your thread down is a bad idea. What about on multicore systems (which almost all systems will be, going forward)? You'd be slowing down a thread for absolutely no reason.
The OS has a concept of a thread quantum. It will take care of ensuring that no important thread on your system is starved. And, as I mentioned, on multicore systems spiking one thread on one CPU does not hurt performance for other threads on other cores at all.
I also see in another comment that this thread is also doing a lot of disk I/O - these operations will already cause your thread to yield while it's waiting for the results, so the sleeps will do nothing.
In general, if you're calling Sleep(x), there is something wrong/lazy with your design, and if x==0, you're opening yourself up to livelocks (a thread calling Sleep(0) can actually be rescheduled immediately, making it a no-op).
Sleep should be fine for throttling an app, which from your comments is what you're after. Perhaps you just need to be more precise how long you sleep for.
The only software in which I use a feature like this is the BOINC client. I don't know what mechanism it uses, but it's open-source and multi-platform, so help yourself.
It has a configuration option ("limit CPU use to X%"). The way I'd expect to implement that is to use platform-dependent APIs like clock() or GetSystemTimes(), and compare processor time against elapsed wall clock time. Do a bit of real work, check whether you're over or under par, and if you're over par sleep for a while to get back under.
The BOINC client plays nicely with priorities, and doesn't cause any performance issues for other apps even at 100% max CPU. The reason I use the throttle is that otherwise, the client runs the CPU flat-out all the time, which drives up the fan speed and noise. So I run it at the level where the fan stays quiet. With better cooling maybe I wouldn't need it :-)
Another, less elaborate, method would be to time one iteration and let the thread sleep for (x * t) milliseconds before the next iteration, where t is the time in milliseconds for one iteration and x is the chosen sleep-time fraction (between 0 and 1).
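A sketch of that with std::chrono (the loop body and termination test are placeholders; x = 0.5 sleeps half as long as each iteration took):

#include <chrono>
#include <thread>

void throttled_loop(double x) {              // x: sleep fraction, between 0 and 1
    using clock = std::chrono::steady_clock;
    for (;;) {                               // replace with the real termination test
        auto start = clock::now();
        // ... do one iteration of math here ...
        auto t = clock::now() - start;       // t: how long this iteration took
        std::this_thread::sleep_for(t * x);  // sleep for x * t before the next one
    }
}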
Have a look at cpulimit. It sends SIGSTOP and SIGCONT as required to keep a process below a given CPU usage percentage.
Even still, WTF at "crazy complaints and outlandish reviews about your software killing PC performance". I'd be more likely to complain that your software was slow and not making the best use of my hardware, but I'm not your customer.
Edit: on Windows, SuspendThread() and ResumeThread() can probably produce similar behaviour.