I am writing a sorting program with MPI. It seems best to keep the code that handles I/O outside the MPI scope, e.g. reading in the data file before sorting and writing the sorted data to a file after sorting.
So in my main function I did the input before MPI_Init and the output after MPI_Finalize. However, it does not work the way I wanted: I tried to print a line of "*" before MPI_Init and it gets printed n_procs times instead of just once. What is the best way to handle I/O in an MPI code?
int main()
{
    read in data;
    cout << "************************";
    MPI_Init();
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    if (my_rank == 0)
    {
        mergesort_parallel; // recursively
    }
    else
    {
        MPI_Recv subarray from parent;
        mergesort_parallel(subarray);
        MPI_Send subarray after sorting to parent;
        MPI_Finalize();
        return 0;
    }
    MPI_Finalize();
    output sorted data to file;
}
The processes are created by mpiexec/mpirun and already exist before MPI_Init() is called, so the "***" line is printed once per process, i.e. as many times as there are processes. I suggest using standard I/O routines such as fopen(), fread(), etc. inside the code for the root process only, i.e.
if (myrank == 0)
{
    read file into buffer of root; // master I/O
}
else
{
    other code;
}
MPI_Finalize();
return 0;
}
Further, place MPI_Finalize() and return 0 outside both the if and else branches. If you want each process to read its own portion of the file into a local buffer in parallel, then go for MPI I/O, i.e. use the I/O functions provided by MPI such as MPI_File_open(), MPI_File_set_view(), etc., as sketched below.
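For the parallel-read case, a rough sketch could look like the following. It is not tied to the sorting code above; the file name "data.bin" and the int element type are placeholders. Every rank opens the same file and reads its own contiguous block with a collective call.
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    // Divide the file into nproc roughly equal chunks of ints.
    MPI_Offset fileSize = 0;
    MPI_File_get_size(fh, &fileSize);
    MPI_Offset totalInts = fileSize / sizeof(int);
    MPI_Offset chunk = (totalInts + nproc - 1) / nproc;
    MPI_Offset begin = rank * chunk;
    MPI_Offset count = std::max<MPI_Offset>(0, std::min(chunk, totalInts - begin));

    std::vector<int> local(count);
    // Collective read: each rank reads `count` ints starting at its own byte offset.
    MPI_File_read_at_all(fh, begin * (MPI_Offset)sizeof(int), local.data(),
                         static_cast<int>(count), MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    // ... sort `local` here, then merge across ranks ...

    MPI_Finalize();
    return 0;
}
MPI_File_set_view() with a derived datatype is the more general route when the per-rank pieces are not contiguous blocks.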
Related
I have a C++ MPI program that runs on a Windows HPC cluster (12 nodes, 24 cores per node).
The logic of the program is really simple:
There is a pool of tasks.
At the start, the program divides the tasks equally among the MPI processes.
Each MPI process executes its tasks.
After everything is finished, MPI reduce is used to gather the results to the root process.
There is one problem: each task can have a drastically different execution time, and there is no way to tell that in advance. Distributing the tasks equally leaves a lot of processes waiting idle, which wastes computing resources and makes the total execution time longer.
I am thinking of one solution that might work.
The process is like this.
The task pool is divided into small parcels (e.g. 10 tasks per parcel).
Each MPI process takes one parcel at a time whenever it is idle (i.e. it has not yet received a parcel, or has finished its previous parcel).
Step 2 is repeated until the task pool is exhausted.
MPI reduce is used to gather all the results to the root process.
As far as I understand, this scheme needs a universal counter across nodes/processes (to prevent different MPI processes from executing the same parcel), and changing it needs some lock/sync mechanism. It certainly has overhead, but with proper tuning I think it can improve performance.
I am not very familiar with MPI and have some implementation questions. I can think of two ways to implement this universal counter:
Using MPI I/O: write the counter to a file, and increase it whenever a parcel is taken (this will certainly need a file-lock mechanism).
Using MPI one-sided communication/shared memory: put the counter in shared memory and increase it whenever a parcel is taken (this will certainly need a sync mechanism); a rough sketch of this idea follows below.
Unfortunately, I am not familiar with either technique and want to explore the possibilities, implementations, and possible drawbacks of the two methods above. A sample code would be greatly appreciated.
If you have other ways to solve the problem, or other suggestions, that would also be great. Thanks.
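My rough, untested understanding of idea 2 looks something like the sketch below (assuming an MPI-3 library for MPI_Fetch_and_op; the parcel count and the work body are placeholders): the counter lives in an MPI window owned by rank 0, and every process increments it atomically to claim the next parcel.
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int numParcels = 50;   // placeholder for the real parcel count

    // Rank 0 owns a single int inside an MPI window; the other ranks expose nothing.
    int* counter = nullptr;
    MPI_Win win;
    MPI_Aint winSize = (rank == 0) ? sizeof(int) : 0;
    MPI_Win_allocate(winSize, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);   // make sure the counter is initialized before anyone reads it

    while (true) {
        int one = 1, myParcel = -1;
        // Atomically fetch the counter on rank 0 and add 1 to it.
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Fetch_and_op(&one, &myParcel, MPI_INT, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);

        if (myParcel >= numParcels) break;   // pool exhausted
        std::cout << "rank " << rank << " works on parcel " << myParcel << std::endl;
        // ... process the parcel here ...
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}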
Follow-ups:
Thanks for all the useful suggestions. I implemented a test program following the scheme of using process 0 as the task distributor.
#include <iostream>
#include <mpi.h>
using namespace std;

void doTask(int rank, int i){
    cout << rank << " got task " << i << endl;
}

int main ()
{
    int numTasks = 5000;
    int parcelSize = 100;
    int numParcels = (numTasks/parcelSize) + (numTasks%parcelSize==0?0:1);
    //cout<<numParcels<<endl;
    MPI_Init(NULL, NULL);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Status status;
    MPI_Request request;
    int ready = 0;
    int i = 0;
    int maxParcelNow = 0;
    if(rank == 0){
        for(i = 0; i < numParcels; i++){
            MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            //cout<<i<<"Yes"<<endl;
            MPI_Send(&i, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
            //cout<<i<<"No"<<endl;
        }
        maxParcelNow = i;
        cout << maxParcelNow << " " << numParcels << endl;
    }else{
        int counter = 0;
        while(true){
            if(maxParcelNow == numParcels) {
                cout << "Yes exiting" << endl;
                break;
            }
            //if(maxParcelNow == numParcels - 1) break;
            ready = 1;
            MPI_Send(&ready, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            //cout<<rank<<"send"<<endl;
            MPI_Recv(&i, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            //cout<<rank<<"recv"<<endl;
            doTask(rank, i);
        }
    }
    MPI_Bcast(&maxParcelNow, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
It does not work and never stops. Any suggestions on how to make it work? Does this code reflect the idea correctly, or am I missing something? Thanks.
[Converting my comments into an answer...]
Given n processes, you can have your first process, p0, dispatch tasks to the other n - 1 processes. First, it will do point-to-point communication with the other n - 1 processes so that everyone has work to do, and then it will block on a Recv. When any given process completes, say p3, it will send its result back to p0. At this point, p0 will send another message to p3 with one of two things:
1) Another task
or
2) Some kind of termination signal if there are no tasks remaining. (Using the 'tag' of the message is one easy way.)
Obviously, p0 will loop over that logic until there is no task left, in which case it will call MPI_Finalize too.
Unlike what you thought in your comments, this isn't round-robin. It first gives a job to every process, or worker, and then gives back another job whenever one completes...
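A minimal sketch of this dispatcher pattern is below. It is a pull-based variant where each worker asks rank 0 for its next parcel and a special tag signals termination; the parcel count and the compute() body are placeholders, and at least two ranks are assumed.
#include <mpi.h>
#include <iostream>

const int TAG_WORK = 1;
const int TAG_STOP = 2;

void compute(int rank, int parcel) {
    // Placeholder for the real task.
    std::cout << "rank " << rank << " got parcel " << parcel << std::endl;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int numParcels = 50;   // placeholder for the real parcel count

    if (rank == 0) {
        int next = 0, dummy = 0;
        MPI_Status status;
        // Answer every request: hand out the next parcel index, or TAG_STOP once the
        // pool is empty. Each worker sends exactly one last request before stopping,
        // so rank 0 has to answer numParcels + (nproc - 1) requests in total.
        for (int replies = 0; replies < numParcels + nproc - 1; replies++) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (next < numParcels) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            }
        }
    } else {
        int request = 1, parcel = 0;
        MPI_Status status;
        while (true) {
            MPI_Send(&request, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&parcel, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP) break;   // no work left
            compute(rank, parcel);
        }
    }
    MPI_Finalize();
    return 0;
}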
I am trying to send a message from one process to all other MPI processes and also receive a message from all of those processes in one process. It is basically an all-to-all communication where every process sends a message to every other process (except itself) and receives a message from every other process.
The following example code snippet shows what I am trying to achieve. The problem with MPI_Send is its behaviour: for a small message size it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE = 16400) it blocks. I am aware that this is how MPI_Send behaves. As a workaround, I replaced the code below with a blocking (send + recv) call, MPI_Sendrecv, e.g. MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE). I make that call for all processes of MPI_COMM_WORLD inside a loop over every rank, and this approach gives me what I am trying to achieve (all-to-all communication), but it takes a lot of time, which I want to cut down with a more time-efficient approach. I have tried MPI scatter and gather to perform the all-to-all communication, but one issue is that the buffer size (16400) may differ between iterations of the MPI_all_to_all function call, and I use MPI_TAG to differentiate the calls of different iterations, which I cannot do with the scatter and gather functions.
#define BUFFER_SIZE 16400

void MPI_all_to_all(int MPI_TAG)
{
    int size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* intReceivePack = new int[BUFFER_SIZE]();

    for (int prId = 0; prId < size; prId++) {
        if (prId != rank) {
            MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
                     MPI_COMM_WORLD);
        }
    }
    for (int sId = 0; sId < size; sId++) {
        if (sId != rank) {
            MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    delete[] intSendPack;
    delete[] intReceivePack;
}
I want to know whether there is a way to perform the all-to-all communication with a more efficient communication model. I am not wedded to MPI_Send; if there is another way that achieves what I need, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that compares the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>

#define BUFFER_SIZE 16384

void point2point(int*, int*, int, int);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank_id = 0, com_sz = 0;
    double t0 = 0.0, tf = 0.0;
    MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* result = new int[BUFFER_SIZE*com_sz]();
    std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);

    // Send-Receive
    t0 = MPI_Wtime();
    point2point(intSendPack, result, rank_id, com_sz);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Send-receive time: " << tf - t0 << std::endl;

    // Collective
    std::fill(result, result + BUFFER_SIZE*com_sz, 0);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
    t0 = MPI_Wtime();
    MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Allgather time: " << tf - t0 << std::endl;

    MPI_Finalize();
    delete[] intSendPack;
    delete[] result;
    return 0;
}

// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
    MPI_Status status;
    // Exchange and store the data
    for (int i=0; i<com_sz; i++){
        if (i != rank_id){
            MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
                         result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        }
    }
}
Here every rank contributes its own array intSendPack to the array result on all the other ranks, and result should end up identical on every rank. result is flat: each rank occupies BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. After the point-to-point communication, the array is reset to its initial contents.
Time is measured by placing an MPI_Barrier before the final MPI_Wtime() call, which gives you the maximum time across all ranks.
I ran the benchmark on one node of NERSC Cori KNL using SLURM. I ran each case a few times just to make sure the values are consistent and I'm not looking at an outlier, but you should run it maybe 10 or so times to collect proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, and it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much performance difference between the recommended SLURM settings on that specific machine and the default settings, but in real, larger programs with more communication there is a very significant one (a job that runs in less than a minute with the recommended settings can run for 20-30 minutes or more with the defaults). The point is: check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed them, there are two SO posts worth reading on it: a buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms for arranging the communication. A 2-5x speedup is considerable, since communication often contributes the most to the overall time.
In a simple MPI program I have used a column-wise division of a large matrix.
How can I order the output so that each matrix block appears next to the others in order?
I have tried this simple code, but the effect is quite different from what I wanted:
for(int i=0;i<10;i++)
{
    for(int k=0;k<numprocs;k++)
    {
        if (my_id==k){
            for(int j=1;j<10;j++)
                printf("%d",data[i][j]);
        }
        MPI_Barrier(com);
    }
    if(my_id==0)
        printf("\n");
}
It seems that each process has its own stdout, so it is impossible to get ordered output lines without sending all the data to one master that prints them. Is my guess correct? Or am I doing something wrong?
You guessed right. The MPI standard does not specify how stdout from different nodes should be collected for printing at the launching process. When multiple processes print, the output often gets merged in an unspecified order; fflush doesn't help.
If you want the output ordered in a certain way, the most portable method would be to send the data to the master process for printing.
For example, in pseudocode:
if (rank == 0) {
    print_col(0);
    for (i = 1; i < comm_size; i++) {
        MPI_Recv(buffer, ..., i, ...);
        print_col(i);
    }
} else {
    MPI_Send(data, ..., 0, ...);
}
Another method that can sometimes work is to use barriers to lock-step the processes so that each one prints in turn. This of course depends on the MPI implementation and how it handles stdout.
for(i = 0; i < comm_size; i++) {
    MPI_Barrier(MPI_COMM_WORLD);
    if (i == rank) {
        printf(...);
    }
}
Of course, in production code, where the data is too large to print sensibly anyway, the data is eventually combined either by having each process write to a separate file that is merged afterwards, or by using MPI I/O (defined in the MPI-2 standard) to coordinate parallel writes, as in the sketch below.
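As an illustration of the MPI I/O option, here is a small sketch (the file name, element type, and per-rank block size are placeholders, and the local data is faked): each rank writes its own block at a rank-dependent offset into one shared file, so the output ends up ordered without funnelling everything through rank 0.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int localCount = 100;                    // assumed block size per rank
    std::vector<int> local(localCount, rank);      // stand-in for the real local data

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes at offset rank * localCount * sizeof(int); the collective
    // call lets the MPI library coordinate the parallel writes.
    MPI_Offset offset = static_cast<MPI_Offset>(rank) * localCount * sizeof(int);
    MPI_File_write_at_all(fh, offset, local.data(), localCount, MPI_INT,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}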
I have produced ordered output to a file before, using this exact method. You could try printing to a temporary file, printing the contents of that file, and then deleting it.
Have the root processor do all of the printing. Use MPI_Send/MPI_Recv or MPI_Gather (or whatever) to send the data in turn from each processor to the root.
To solve this problem you can use a short sleep. I do this, and it works 99% of the time:
printf("text nr 1\n");
MPI_Barrier(MPI_COMM_WORLD);
usleep(100);
printf("text nr 2\n");
It's not very elegant, but it works.
I am writing a program that needs to communicate with an external program in both directions simultaneously, i.e., reading from and writing to the external program at the same time.
I create two pipes, one for sending data to the external process, one for receiving data from the external process. After forking the child process, which becomes the external program, the parent forks again. The new child now writes data into the outgoing pipe to the external program, and the parent now reads data from the incoming pipe from the external program for further processing.
I've heard that using exit(3) may cause buffers to be flushed twice; however, I am also afraid that using _exit(2) may leave buffers unflushed. In my program there is output both before and after forking. Which should I use in this case, exit(3) or _exit(2)?
Below is my main function. The #includes and the auxiliary function are left out for simplicity.
int main() {
    srand(time(NULL));
    ssize_t n;
    cin >> n;
    for (double p = 0.0; p <= 1.0; p += 0.1) {
        string s = generate(n, p);
        int out_fd[2];
        int in_fd[2];
        pipe(out_fd);
        pipe(in_fd);
        pid_t child = fork();
        if (child) {
            // parent
            close(out_fd[0]);
            close(in_fd[1]);
            if (fork()) {
                close(out_fd[1]);
                ssize_t size = 0;
                const ssize_t block_size = 1048576;
                char buf[block_size];
                ssize_t n_read;
                while ((n_read = read(in_fd[0], buf, block_size)) != 0) {
                    size += n_read;
                }
                size += n_read;
                close(in_fd[0]);
                cout << "p = " << p << "; compress ratio = " << double(size) / double(n) << '\n'; // data written before forking (the loop continues to fork)
            } else {
                write(out_fd[1], s.data(), s.size()); // data written after forking
                exit(EXIT_SUCCESS); // exit(3) or _exit(2) ?
            }
        } else {
            // child
            close(in_fd[0]);
            close(out_fd[1]);
            dup2(out_fd[0], STDIN_FILENO);
            dup2(in_fd[1], STDOUT_FILENO);
            close(STDERR_FILENO);
            execlp("xz", "xz", "-9", "--format=raw", reinterpret_cast<char *>(NULL));
        }
    }
}
You need to be careful with this sort of thing. exit() does different things from _exit(), and _Exit() is different again; as the answer suggested as a duplicate explains, _Exit() (not the same as _exit(), note the upper-case E) will not call atexit() handlers, flush output buffers, delete temporary files, etc. [which may in fact be atexit() handling, but it could also be done as a direct call, depending on how the C library code is written].
Most of your output is done via write, which should be unbuffered from the application's perspective. But you are calling cout << ... as well, and you need to make sure that is flushed before exiting. Right now you are using '\n' as the end-of-line marker, which may or may not flush the output; if you change it to endl instead, it will flush the stream. Then you can safely use _Exit() from an output perspective. If your code were to set up its own atexit() handlers, open temporary files, or do a bunch of other such things, this would be problematic; if you want to do more complex things in the forked process, it should be done via another exec.
In your program as it stands there isn't any pending output to flush, so it "works" anyway, but if you add a cout << ... << '\n'; statement (or one without the newline) at the beginning of the code, you would see it go wrong. Adding a cout.flush(); would "fix" the problem (based on your current code).
You should also check the return value of your execlp() call and call _Exit() if it fails (and handle it in the main process so you don't continue the loop in case of failure).
In the child branch of a fork(), it is normally incorrect to use exit(), because that can lead to stdio buffers being flushed twice, and temporary files being unexpectedly removed. In C++ code the situation is worse, because destructors for static objects may be run incorrectly. (There are some unusual cases, like daemons, where the parent should call _exit() rather than the child; the basic rule, applicable in the overwhelming majority of cases, is that exit() should be called only once for each entry into main.)
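Pulling the advice in these answers together, a minimal sketch (not the original program; the xz invocation is copied from the question and will read from stdin) might look like this: flush the C++ streams before fork(), leave the child with _exit() so the inherited stdio buffers are not flushed a second time, and check the return of execlp().
#include <sys/types.h>
#include <unistd.h>
#include <cstdlib>
#include <iostream>

int main()
{
    std::cout << "some buffered output from the parent\n";
    std::cout.flush();                  // nothing pending in the C++ streams before fork()

    pid_t pid = fork();
    if (pid == 0) {
        // Child branch: exec the external program; this line is only reached on failure.
        execlp("xz", "xz", "-9", "--format=raw", static_cast<char*>(nullptr));
        std::cerr << "execlp failed\n";
        _exit(EXIT_FAILURE);            // not exit(): no second flush, no atexit handlers
    }
    // Parent continues; its normal return from main flushes its own buffers once.
    return 0;
}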
I'm writing an MPI program (Visual Studio 2k8 + MSMPI) that uses Boost::thread to spawn two threads per MPI process, and have run into a problem I'm having trouble tracking down.
When I run the program with: mpiexec -n 2 program.exe, one of the processes suddenly terminates:
job aborted:
[ranks] message
[0] terminated
[1] process exited without calling finalize
---- error analysis -----
[1] on winblows
program.exe ended prematurely and may have crashed. exit code 0xc0000005
---- error analysis -----
I have no idea why the first process is suddenly terminating, and I can't figure out how to track down the reason. This happens even if I put the rank-zero process into an infinite loop at the end of all of its operations... it just suddenly dies. My main function looks like this:
int _tmain(int argc, _TCHAR* argv[])
{
    /* Initialize the MPI execution environment. */
    MPI_Init(0, NULL);

    /* Create the worker threads. */
    boost::thread masterThread(&Master);
    boost::thread slaveThread(&Slave);

    /* Wait for the local test thread to end. */
    masterThread.join();
    slaveThread.join();

    /* Shutdown. */
    MPI_Finalize();
    return 0;
}
The master and slave functions do some arbitrary work before ending. I can confirm that the master thread, at the very least, reaches the end of its operations. The slave thread is always the one that isn't done before the execution gets aborted. Judging by print statements, the slave thread isn't actually hitting any errors... it's happily moving along and just gets taken out in the crash.
So, does anyone have any ideas for:
a) What could be causing this?
b) How should I go about debugging it?
Thanks so much!
Edit:
Posting minimal versions of the Master/Slave functions. Note that this program is purely for demonstration purposes, so it isn't doing anything useful. Essentially, the master thread sends a dummy payload to the slave thread of the other MPI process.
void Master()
{
    int myRank;
    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* Create a message with numbers 0 through 39 as the payload, addressed
     * to this thread. */
    int *payload = new int[40];
    for(int n = 0; n < 40; n++) {
        payload[n] = n;
    }

    if(myRank == 0) {
        MPI_Send(payload, 40, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD);
    } else {
        MPI_Send(payload, 40, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
    }

    /* Free memory. */
    delete(payload);
}
void Slave()
{
    MPI_Status status;
    int *payload = new int[40];
    MPI_Recv(payload, 40, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* Free memory. */
    delete(payload);
}
You have to use a thread-safe version of the MPI runtime.
Read up on MPI_Init_thread.
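A minimal sketch of what that looks like (generic MPI, not specific to MS-MPI): request MPI_THREAD_MULTIPLE, the level needed when several threads make MPI calls concurrently, and check what the library actually provides before spawning the threads.
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    int provided = 0;
    // Ask for full thread support instead of calling MPI_Init().
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "This MPI library does not provide MPI_THREAD_MULTIPLE\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // ... spawn the threads that make MPI calls here ...

    MPI_Finalize();
    return 0;
}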