I am using MPI to run a program in parallel and measure the execution time. I am currently splitting the computation among the processes by giving a start and end index as parameters to the "voxelise" function. Each process then works on a different section of the data set and stores its result in "p_voxel_data".
I then want to send all of these sub-arrays to the root process using "MPI_Gather" so the data can be written to a file and the timer stopped.
The program executes fine when I have the "MPI_Gather" line commented out; I get output similar to this:
Computing time: NODE 3 = 1.07 seconds.
Computing time: NODE 2 = 1.12 seconds.
But when that line is included I get
"APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)"
and the computing time for the root node 0 shows up as a negative number, "-1.40737e+08".
Can anyone suggest any issues in my call to MPI_Gather?
int main(int argc, char** argv)
//-----------------------------
{
    int rank;
    int nprocs;
    MPI_Comm comm;

    MPI::Init(argc, argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Set up data for voxelise function */
    . . . . . .

    clock_t start(clock());

    // Generate the density field
    voxelise(density_function,
             a,
             b,
             p_control_point_set,
             p_voxel_data,
             p_number_of_voxel,
             p_voxel_size,
             p_centre,
             begin,
             endInd);

    std::vector<float> completeData(512);
    std::vector<float> cpData(toProcess);
    std::copy(p_voxel_data.begin() + begin, p_voxel_data.begin() + endInd, cpData.begin());

    MPI_Gather(&cpData, toProcess, MPI::FLOAT, &completeData, toProcess, MPI::FLOAT, 0, MPI_COMM_WORLD);

    // Stop the timer
    clock_t end(clock());
    float number_of_seconds(float(end - start) / CLOCKS_PER_SEC);
    std::cout << "Computing time:\t" << "NODE " << rank << " = " << number_of_seconds << " seconds." << std::endl;

    if(rank == 0) {
        MPI::Finalize();
        return (EXIT_SUCCESS);
    }
}
You are giving MPI_Gather the address of the vector object, not the address of the vector's data.
You must do:
MPI_Gather(&cpData[0], toProcess, MPI::FLOAT, &completeData[0], ...
Of course you have to make sure the sizes are correct too.
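As a minimal sketch, a corrected call could look like the following. It assumes every rank contributes exactly toProcess floats and that the receive buffer holds nprocs * toProcess elements in total; an undersized receive buffer would make the gather write past the end of completeData, which could also explain the segfault and the corrupted timer value on the root. MPI_FLOAT is the C-binding equivalent of the deprecated MPI::FLOAT.

std::vector<float> completeData(nprocs * toProcess); // must hold one chunk per rank on the root

MPI_Gather(&cpData[0],           // send buffer: the vector's elements, not the vector object
           toProcess, MPI_FLOAT, // amount each rank contributes
           &completeData[0],     // receive buffer (only significant on the root)
           toProcess, MPI_FLOAT, // receive count is per rank, not the total
           0, MPI_COMM_WORLD);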
I am having problems ending my program using MS-MPI.
All return values seem fine, but I have to press Ctrl+C in cmd to end it (it doesn't look like it's still computing, so the exit condition seems fine).
I want to run a program using N processes. When one of them finds a solution, it should set a flag to false, send it to all the others, and then on the next iteration they should all stop and the program ends.
The actual program does some more advanced calculations; I'm working on a simplified version for clarity. I just wanted to make sure that the communication works.
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // sets c as 0 -> (N-1) depending on the number of processes running
    int c = world_rank;
    bool flag = true;

    while (flag) {
        std::cout << "process: " << world_rank << " value: " << c << std::endl;
        c += world_size;
        // dummy condition just to test stopping
        if (c == 13) {
            flag = false;
        }
        MPI_Barrier(MPI_COMM_WORLD);
        // I have also tried using MPI_Bcast without that if
        if (!flag) MPI_Bcast(&flag, 1, MPI_C_BOOL, world_rank, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    } // end of while

    MPI_Finalize();
    return 0;
}
How I think it works:
It starts with every process defining its c and flag; then on each pass of the while loop it increments its c by a fixed number. When a process reaches the stop condition, it sets flag to false and sends it to all the remaining processes. What I get when I run it with 4 processes:
process: 0 value: 0
process: 2 value: 2
process: 1 value: 1
process: 3 value: 3
process: 1 value: 5
process: 3 value: 7
process: 0 value: 4
process: 2 value: 6
process: 3 value: 11
process: 1 value: 9
process: 2 value: 10
process: 0 value: 8
process: 3 value: 15
process: 2 value: 14
process: 0 value: 12
(I am fine with those few extra values.)
But after that I have to manually terminate it with Ctrl+C. When running on 1 process it goes smoothly from 1 to 12 and exits.
MPI_Bcast() is a collective operation, and all the ranks of the communicator have to use the same value for the root argument (in your program, they all use a different value).
A valid approach (though unlikely the optimal one) is to send a termination message to rank 0, update flag accordingly and have all the ranks call MPI_Bcast(..., root=0, ...).
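For illustration, here is a minimal sketch of that idea. Instead of a point-to-point termination message, it combines the local flags on rank 0 with MPI_Reduce (logical OR) and then broadcasts the result from that same root, so every rank calls MPI_Bcast with root 0 and sees the same decision:

#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int c = world_rank;
    int local_stop = 0;  // set to 1 once this rank finds a "solution"
    int global_stop = 0; // becomes 1 once any rank has found one

    while (!global_stop) {
        std::cout << "process: " << world_rank << " value: " << c << std::endl;
        c += world_size;
        if (c == 13) local_stop = 1; // dummy stop condition

        // Combine all local flags on rank 0, then broadcast the verdict
        // from the same root on every rank.
        MPI_Reduce(&local_stop, &global_stop, 1, MPI_INT, MPI_LOR, 0, MPI_COMM_WORLD);
        MPI_Bcast(&global_stop, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

MPI_Allreduce with MPI_LOR would combine the reduce and the broadcast into a single call.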
We are writing code to solve a nonlinear problem using an iterative method (Newton). The problem is that we don't know a priori how many MPI processes will be needed from one iteration to the next, due to e.g. remeshing, adaptivity, etc. And there are quite a lot of iterations...
We would hence like to use MPI_Comm_spawn at each iteration to create as many MPI processes as we need, gather the results, and "destroy" the subprocesses. We know this limits the scalability of the code due to the gathering of information; however, we have been asked to do it :)
I did a couple of tests of MPI_Comm_spawn on my laptop (Windows 7/64-bit) using Intel MPI and Visual Studio Express 2013. I tried these simple codes:
// StackMain
#include <iostream>
#include <mpi.h>
#include <vector>

int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    for (int i = 0; i < 10000; i++)
    {
        std::cout << "Loop number " << i << std::endl;
        MPI_Comm children;
        std::vector<int> err(4);
        ierr = MPI_Comm_spawn("StackWorkers.exe", NULL, 4, MPI_INFO_NULL, 0,
                              MPI_COMM_WORLD, &children, &err[0]);
        MPI_Barrier(children);
        MPI_Comm_disconnect(&children);
    }
    ierr = MPI_Finalize();
    return 0;
}
And the program launched by the spawned processes:
// StackWorkers
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    MPI_Comm parent;
    ierr = MPI_Comm_get_parent(&parent);
    MPI_Barrier(parent);
    ierr = MPI_Finalize();
    return 0;
}
The program is launched using one MPI process:
mpiexec -np 1 StackMain.exe
It seems to work; I do have some questions, however...
1- The program freezes during iteration 4096, and this number does not change if I relaunch the program. If during each iteration I spawn 4 processes twice, then it stops at iteration 2048...
Is this a limitation of the operating system?
2- When I look at the memory occupied by "mpiexec" during the run, it grows continuously (never going down). Do you know why? I thought that when the subprocesses finished their job, they would release the memory they used...
3- Should I disconnect/free the children communicator or not? If yes, must MPI_Comm_disconnect(...) be called on both the spawning and the spawned processes, or only on one of them?
Thanks a lot!
I have created multiple threads (4 threads) inside the main thread. Every thread executes the same function, yet the scheduling of the threads is not what I expected. As I understand the OS, the Linux CFS scheduler assigns each thread a virtual runtime quantum "t"; on expiry of that quantum, the CPU is preempted from the current thread and allocated to the next thread. In this manner every thread gets a fair share of the CPU. What I am getting does not match that expectation.
I am expecting all threads (threads 1-4 plus the main thread) to get the CPU before the same thread gets the CPU a second time.
Expected output is
Expected output is
foo3-->1--->Time Now : 00:17:45.346225000
foo3-->1--->Time Now : 00:17:45.348818000
foo4-->1--->Time Now : 00:17:45.350216000
foo4-->1--->Time Now : 00:17:45.352800000
main is running ---> 1--->Time Now : 00:17:45.355803000
main is running ---> 1--->Time Now : 00:17:45.360606000
foo2-->1--->Time Now : 00:17:45.345305000
foo2-->1--->Time Now : 00:17:45.361666000
foo1-->1--->Time Now : 00:17:45.354203000
foo1-->1--->Time Now : 00:17:45.362696000
foo1-->2--->Time Now : 00:17:45.362716000 // foo1 thread got CPU 2nd time as expected
foo1-->2--->Time Now : 00:17:45.365306000
but I am getting
foo3-->1--->Time Now : 00:17:45.346225000
foo3-->1--->Time Now : 00:17:45.348818000
foo4-->1--->Time Now : 00:17:45.350216000
foo4-->1--->Time Now : 00:17:45.352800000
main is running ---> 1--->Time Now : 00:17:45.355803000
main is running ---> 1--->Time Now : 00:17:45.360606000
foo3-->2--->Time Now : 00:17:45.345305000 // foo3 got the CPU a 2nd time UNEXPECTEDLY, before the other threads were scheduled as per CFS
foo3-->2--->Time Now : 00:17:45.361666000
foo1-->1--->Time Now : 00:17:45.354203000
foo1-->1--->Time Now : 00:17:45.362696000
foo1-->2--->Time Now : 00:17:45.362716000
foo1-->2--->Time Now : 00:17:45.365306000
Here is my program (thread_multi.cpp)
#include <pthread.h>
#include <stdio.h>
#include "boost/date_time/posix_time/posix_time.hpp"
#include <iostream>
#include <cstdlib>
#include <fstream>
#define NUM_THREADS 4
using namespace std;
std::string now_str()
{
    // Get the current time from the clock, using microsecond resolution
    const boost::posix_time::ptime now =
        boost::posix_time::microsec_clock::local_time();

    // Get the time offset in the current day
    const boost::posix_time::time_duration td = now.time_of_day();
    const long hours = td.hours();
    const long minutes = td.minutes();
    const long seconds = td.seconds();
    const long nanoseconds = td.total_nanoseconds() - ((hours * 3600 + minutes * 60 + seconds) * 1000000000);

    char buf[40];
    sprintf(buf, "Time Now : %02ld:%02ld:%02ld.%03ld", hours, minutes, seconds, nanoseconds);
    return buf;
}
/* This is our thread function. It is like main(), but for a thread. */
void *threadFunc(void *arg)
{
    char *str;
    int i = 0;
    str = (char*)arg;
    while (i < 100)
    {
        ++i;
        ofstream myfile ("example.txt", ios::out | ios::app | ios::binary);
        if (myfile.is_open())
        {
            myfile << str << "-->" << i << "--->" << now_str() << " \n";
        }
        else cout << "Unable to open file";
        // generate delay
        for (volatile int k = 0; k < 1000000; k++);
        if (myfile.is_open())
        {
            myfile << str << "-->" << i << "--->" << now_str() << "\n\n";
            myfile.close();
        }
        else cout << "Unable to open file";
    }
    return NULL; // a pthread start routine must return a value
}
int main(void)
{
    pthread_t pth[NUM_THREADS]; // these are our thread identifiers
    int i = 0;
    pthread_create(&pth[0], NULL, threadFunc, (void *) "foo1");
    pthread_create(&pth[1], NULL, threadFunc, (void *) "foo2");
    pthread_create(&pth[2], NULL, threadFunc, (void *) "foo3");
    pthread_create(&pth[3], NULL, threadFunc, (void *) "foo4");
    std::cout << ".............\n" << now_str() << '\n';
    while (i < 100)
    {
        for (int k = 0; k < 1000000; k++);
        ofstream myfile ("example.txt", ios::out | ios::app | ios::binary);
        if (myfile.is_open())
        {
            myfile << "main is running ---> " << i << "--->" << now_str() << '\n';
            myfile.close();
        }
        else cout << "Unable to open file";
        ++i;
    }
    // printf("main waiting for threads to terminate...\n");
    for (int k = 0; k < NUM_THREADS; k++)
        pthread_join(pth[k], NULL);
    std::cout << ".............\n" << now_str() << '\n';
    return 0;
}
Here are the Completely Fair Scheduler settings:
kernel.sched_min_granularity_ns = 100000
kernel.sched_wakeup_granularity_ns = 25000
kernel.sched_latency_ns = 1000000
As per the sched_min_granularity_ns value, any task will execute for at least that minimum amount of time; if a task needs more than that minimum, a time slice is calculated and every task executes for its slice.
Here the time slice is calculated using the formula:
time slice = (weight of the task / total weight of all tasks on that CFS run-queue) x sched_latency_ns
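For example, with the settings above and five runnable threads of equal weight (the four workers plus main), each slice would be roughly 1000000 ns / 5 = 200000 ns, still comfortably above the 100000 ns minimum granularity.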
Can anyone explain why I am getting these scheduling results?
Any help understanding the output will be highly appreciated.
Thank you in advance.
I am using gcc under Linux.
EDIT 1:
If I change this loop
for(int k=0;k<100000;k++);
into
for(int k=0;k<10000;k++);
then sometimes thread 1 gets the CPU 10 times consecutively, thread 2 gets the CPU 5 times consecutively, thread 3 gets the CPU 5 times consecutively, the main thread twice consecutively, and thread 4 gets the CPU 7 times consecutively. It looks like different threads are preempted at random times.
Any clue why threads get these seemingly random numbers of consecutive CPU allocations?
The CPU allocates some time to execute each thread. Why doesn't each thread produce the same number of prints?
I'll explain this with an example:
Suppose your computer can execute 100 instructions per ns.
Suppose one print takes 25 instructions.
Suppose each thread has 1 ns to work.
Now you have to understand that all the programs on the computer share those 100 available instructions.
If, when your thread wants to print something, 100 instructions are available, it can print 4 sentences.
If, when your thread wants to print something, only 40 instructions are available, it can print 1 sentence; there are only 40 instructions left because some other program is using the rest.
Do you get it?
If you have any questions, you are welcome. :)
I'm working on the ABC algorithm using MPI to optimize the Rastrigin function. My code's structure goes as follows:
I defined the control parameters.
I defined my variables and arrays.
I wrote my functions.
I called them in a for loop right in my main.
Here is my main, which is where I think the problem is. I have set the total number of runs to 30, but when I run the program it gets stuck on the 27th run. I'm running it on 4 nodes, but it gets stuck! Any help?
Here is my main code:
int main (int argc, char* argv[])
{
    int iter, run, j;
    double mean;
    mean = 0;
    srand(time(NULL));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProc);

    for (run = 0; run < runtime; run++)
    {
        if (myRank == 0) {
            initial();
            MemorizeBestSource();
        }
        for (iter = 0; iter < maxCycle; iter++)
        {
            SendEmployedBees();
            if (myRank == 0) {
                CalculateProbabilities();
                SendOnlookerBees();
                MemorizeBestSource();
                SendScoutBees();
            }
        }
        if (myRank == master) {
            for (j = 0; j < D; j++)
                printf("GlobalParam[%d]: %f\n", j+1, GlobalParams[j]);
            printf("%d. run: %e \n", run+1, GlobalMin);
            GlobalMins[run] = GlobalMin;
            mean = mean + GlobalMin;
        }
    }
    if (myRank == master) {
        mean = mean / runtime;
        printf("Means of %d runs: %e\n", runtime, mean);
        getch();
        MPI_Finalize();
    }
}
I have 3 functions and 4 cores. I want to execute each function in a new thread using MPI and C++.
I wrote this:
int rank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
size--;

if (rank == 0)
{
    Thread1();
}
else
{
    if (rank == 1)
    {
        Thread2();
    }
    else
    {
        Thread3();
    }
}
MPI_Finalize();
But it executes just Thread1(). How must I change the code?
Thanks!
Print to screen the current value of the variable size (possibly without decrementing it) and you will find 1. That is: "there is 1 process running".
You are likely running your compiled code the wrong way. Consider using mpirun (or mpiexec, depending on your MPI implementation) to execute it, i.e.
mpirun -np 4 ./MyCompiledCode
The -np parameter specifies the number of processes to start (doing so, your MPI_Comm_size will be 4, as you expect).
Currently, though, you are not using anything specific to C++. You could consider a C++ binding of MPI such as Boost.MPI.
I worked a bit on the code you provided and changed it slightly, producing working MPI code (I made some needed corrections, marked in capital letters).
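A minimal sketch of what such a corrected program might look like, with hypothetical Thread1()/Thread2()/Thread3() bodies that just print their names to match the output below:

#include <mpi.h>
#include <iostream>

// Placeholder bodies; the real functions can do anything.
void Thread1() { std::cout << "1 function started.\nthread1\n1 function ended.\n"; }
void Thread2() { std::cout << "2 function started.\nthread2\n2 function ended.\n"; }
void Thread3() { std::cout << "3 function started.\nthread3\n3 function ended.\n"; }

int main(int argc, char** argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::cout << "size is " << size << std::endl;

    if (rank == 0)
        Thread1();
    else if (rank == 1)
        Thread2();
    else if (rank == 2)
        Thread3();
    // any further ranks simply do nothing

    MPI_Finalize();
    return 0;
}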
FYI:
compilation (under gcc, mpich):
$ mpicxx -c mpi1.cpp
$ mpicxx -o mpi1 mpi1.o
execution
$ mpirun -np 4 ./mpi1
output
size is 4
size is 4
size is 4
2 function started.
thread2
3 function started.
thread3
3 function ended.
2 function ended.
size is 4
1 function started.
thread1
1 function ended.
Be aware that the stdout output from different ranks is likely to be interleaved.
Are you sure you are compiling your code the right way?
Your problem is that MPI provides no way to feed console input into many processes; it goes only to the process with rank 0. Because of the first lines in main:
int main(int argc, char *argv[]){
    int oper;
    std::cout << "Enter Size:";
    std::cin >> oper; // <------- The problem is right here
    Operations* operations = new Operations(oper);
    int rank, size;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    switch(tid)
    {
all processes except rank 0 block waiting for console input that they can never receive. You should rewrite the beginning of your main function as follows:
int main(int argc, char *argv[]){
    int oper;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);

    if (tid == 0) {
        std::cout << "Enter Size:";
        std::cin >> oper;
    }
    MPI_Bcast(&oper, 1, MPI_INT, 0, MPI_COMM_WORLD);

    Operations* operations = new Operations(oper);
    switch(tid)
    {
It works as follows: only rank 0 displays the prompt and reads the console input into oper. The value of oper is then broadcast from rank 0 so that all other processes obtain the correct value; every process then creates the Operations object and branches to the appropriate function.