Why is MPI_Barrier not synchronizing output correctly? [duplicate] - fortran

In a simple MPI program I have used a column-wise division of a large matrix.
How can I order the output so that each matrix block appears next to the others, in order?
I have tried this simple code, but the effect is quite different from what I wanted:
for (int i = 0; i < 10; i++)
{
    for (int k = 0; k < numprocs; k++)
    {
        if (my_id == k) {
            for (int j = 1; j < 10; j++)
                printf("%d", data[i][j]);
        }
        MPI_Barrier(com);
    }
    if (my_id == 0)
        printf("\n");
}
It seems that each process has its own stdout, so it is impossible to get ordered output lines without sending all the data to one master that prints them. Is my guess right? Or am I doing something wrong?

You guessed right. The MPI standard does not specify how stdout from different ranks should be collected and printed at the launching process. It is often the case that when multiple processes print, their output gets merged in an unspecified way; fflush doesn't help.
If you want the output ordered in a certain way, the most portable method would be to send the data to the master process for printing.
For example, in pseudocode:
if (rank == 0) {
    print_col(0);
    for (i = 1; i < comm_size; i++) {
        MPI_Recv(buffer, ..., i, ...);
        print_col(i);
    }
} else {
    MPI_Send(data, ..., 0, ...);
}
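For concreteness, here is a minimal, self-contained sketch of that master-prints pattern, assuming each rank owns a single column of NROWS ints (NROWS and the fill values are invented for illustration):

#include <mpi.h>
#include <cstdio>

#define NROWS 4 // invented size for the sketch

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank fills its own column with recognizable values.
    int col[NROWS];
    for (int i = 0; i < NROWS; i++)
        col[i] = rank * 100 + i;

    if (rank == 0) {
        int buf[NROWS];
        // Print rank 0's column first, then receive and print the others in rank order.
        for (int src = 0; src < size; src++) {
            if (src == 0) {
                for (int i = 0; i < NROWS; i++) buf[i] = col[i];
            } else {
                MPI_Recv(buf, NROWS, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            printf("rank %d:", src);
            for (int i = 0; i < NROWS; i++) printf(" %3d", buf[i]);
            printf("\n");
        }
    } else {
        MPI_Send(col, NROWS, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}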
Another method which can sometimes work is to use barriers to lockstep the processes so that each one prints in turn. This of course depends on the MPI implementation and how it handles stdout.
for (i = 0; i < comm_size; i++) {
    MPI_Barrier(MPI_COMM_WORLD);
    if (i == rank) {
        printf(...);
    }
}
Of course, in production code, where the data is usually too large to print sensibly anyway, the data is eventually combined either by having each process write to a separate file that is merged afterwards, or by using MPI I/O (defined in the MPI-2 standard) to coordinate parallel writes.
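For the separate-file variant, the usual trick is to embed the rank in the file name; a minimal sketch (the out.<rank>.txt naming scheme is just an example):

#include <mpi.h>
#include <cstdio>

// Assumes MPI_Init has already been called.
void write_per_rank(const int* data, int n) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char fname[64];
    snprintf(fname, sizeof fname, "out.%d.txt", rank);
    FILE* f = fopen(fname, "w");
    for (int i = 0; i < n; i++)
        fprintf(f, "%d\n", data[i]);
    fclose(f);
}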

I have produced ordered output to a file before, using this exact method. You could try printing to a temporary file, then printing the contents of that file and deleting it.

Have the root process do all of the printing. Use MPI_Send/MPI_Recv or MPI_Gather (or whatever fits) to send the data in turn from each process to the root.
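With MPI_Gather the root can collect every rank's block in one call; a hedged fragment, where NROWS is a placeholder for the per-rank element count and rank/size come from MPI_Comm_rank/MPI_Comm_size:

int col[NROWS];                              // this rank's block
int* all = NULL;
if (rank == 0) all = new int[NROWS * size];  // room for every rank's block
// each rank contributes NROWS ints; they arrive in `all` ordered by rank
MPI_Gather(col, NROWS, MPI_INT, all, NROWS, MPI_INT, 0, MPI_COMM_WORLD);
if (rank == 0) {
    // ... print all blocks in rank order ...
    delete[] all;
}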

To work around this problem you can use a short sleep. I use usleep and then it works 99% of the time:
printf("text nr 1\n");
MPI_Barrier(MPI_COMM_WORLD);
usleep(100);
printf("text nr 2\n");
It's not very elegant, but it works.

Related

What is the optimal multithreading scenario for processing a long file's lines?

I have a big file and I want to read and also [process] all lines (the even lines) of the file with multiple threads.
One suggestion is to read the whole file, break it into multiple files (the same count as the threads), then let every thread process a specific file. Since this idea reads the whole file, writes it out again, and reads multiple files back, it seems slow (3x I/O), and I think there must be better scenarios.
I myself thought this could be a better scenario:
One thread will read the file and put the data in a global variable, and other threads will read the data from that variable and process it. In more detail:
One thread will read the main file, running function func1, and put each even line into a buffer, line1Buffer, of maximum size MAX_BUFFER_SIZE; the other threads will pop their data from the buffer and process it, running function func2. In code:
Global variables:
#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool keepReading = true; // set to false to stop the worker threads
string file = "reads.fq";
Function func1 (thread 1):
void func1() {
    ifstream ifstr(file.c_str());
    string ReadSeq;
    for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
        getline(ifstr, ReadSeq);
        getline(ifstr, ReadSeq); // keep only the even line
        while (line1Buffer.size() == MAX_BUFFER_SIZE)
            ; // busy-wait while the buffer is full
        line1Buffer.push_back(ReadSeq);
    }
    keepReading = false;
    return;
}
And function func2 (the other threads):
void func2() {
    string ReadSeq;
    while (keepReading) {
        if (line1Buffer.size() > 0) {
            ReadSeq = line1Buffer.back();
            line1Buffer.pop_back();
            // do the processing....
        }
    }
}
About the speed:
If the reading part is slower, the total time will equal the time to read the file once (the buffer may contain just one line at a time, so only one other thread will be able to keep up with thread 1). If the processing part is slower, the total time will equal the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file, writing multiple files with one thread, and then reading those files back with multiple threads to process them.
So there are two questions:
1- How do I start the threads so that thread 1 runs func1 and the others run func2?
2- Is there any faster scenario?
3- [Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously M + N == numberOfThreads must hold.
Edit: the 3rd question is not right, as multiple threads can't help with reading a single file.
Thanks all
Another approach could be interleaved threads.
Reading is done by every thread, but only one at a time. Because of the waiting in the very first iteration, the threads will interleave.
This is only a scalable option if work() is the bottleneck (otherwise a non-parallel execution would be better).
Thread:
while (!end) {
    // should be fair!
    lock();
    read();
    unlock();
    work();
}
A basic example (you should probably add some error handling):
void thread_exec(ifstream* file, std::mutex* mutex, int* global_line_counter) {
    std::string line;
    std::vector<std::string> data;
    int i;
    do {
        i = 0;
        // only 1 concurrent reader
        mutex->lock();
        // try to read the maximum number of lines
        while (i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file, line)) {
            // we only want to process the even lines
            if (*global_line_counter % 2 == 0) {
                data.push_back(line);
                i++;
            }
            (*global_line_counter)++;
        }
        mutex->unlock();
        // execute work for every line
        for (size_t j = 0; j < data.size(); j++) {
            work(data[j]);
        }
        // free old data
        data.clear();
        // repeat until EOF is reached
    } while (i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
    // counter for checking whether a line is even
    int global_line_counter = 0;
    // open file
    ifstream ifstr(file.c_str());
    // mutex for synchronization; maybe a fair lock would be a better solution
    std::mutex mutex;
    // create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter);
    std::vector<std::thread> threads(NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
    }
    // wait until all threads have finished
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
    }
}
What is your bottleneck? Hard disk or processing time?
If it's the hard disk, then you're probably not going to get any more performance out of it, as you've hit the limits of the hardware. Sequential reads are by far faster than trying to jump around the file, so having multiple threads trying to read your file will almost certainly reduce the overall speed, as it will increase disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.
Assume we have #p threads. Two scenarios were mentioned in the post and the answers:
1) Read with one thread and process with the other threads. In this case #p-1 threads process while only one thread reads. Let the time for the full job be jobTime and the time for processing with n threads be pTime(n). Then:
the worst case occurs when reading is much slower than processing, giving jobTime = pTime(1) + readTime; the best case is when processing is slower than reading, in which case jobTime equals pTime(#p-1) + readTime.
2) Read and process with all #p threads. In this scenario every thread does two steps. The first step is reading a part of the file of size MAX_BUFFER_SIZE, which is sequential; no two threads can read at the same time. The second step is processing the read data, which can run in parallel. This way, in the worst case jobTime is pTime(1) + readTime as before (but*), while the best optimized case is pTime(#p) + readTime, which is better than before.
*: in the 2nd approach's worst case, reading is still slower, but you can find an optimized MAX_BUFFER_SIZE for which (even in the worst case) some reading on one thread overlaps with some processing on another thread. With this optimized MAX_BUFFER_SIZE the jobTime will be less than pTime(1) + readTime and can approach readTime.
First off, reading a file is a slow operation, so unless you are doing some super-heavy processing, the file reading will be the limiting factor.
If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push in front and pop out back. A std::deque works well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.
One last thing: you will need to limit the size of the queue for the scenario where we are pushing faster than we are popping, as shown in the sketch below.
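A minimal sketch of such a bounded queue, assuming std::string payloads; the class name and the capacity parameter are invented here:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical bounded queue: producers block when full, consumers block when empty.
class BoundedQueue {
    std::deque<std::string> q;
    std::mutex m;
    std::condition_variable not_full, not_empty;
    const size_t capacity;
public:
    explicit BoundedQueue(size_t cap) : capacity(cap) {}

    void push(std::string line) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [&] { return q.size() < capacity; });
        q.push_front(std::move(line));          // push in front...
        not_empty.notify_one();
    }

    std::string pop() {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [&] { return !q.empty(); });
        std::string line = std::move(q.back()); // ...pop out back
        q.pop_back();
        not_full.notify_one();
        return line;
    }
};

The reader thread calls push() and blocks when the queue is full; the worker threads call pop() and block when it is empty, so nobody spins.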

How to handle I/O etc. outside the MPI portion

I am writing a sorting program with MPI. It is probably best to have the code that handles I/O outside the MPI scope, e.g. reading in the data file before sorting and writing the sorted data out to a file after sorting.
So in my main function I did the input before MPI_Init and the output after MPI_Finalize. However, it does not seem to work the way I wanted: I tried to print a line of "*" before MPI_Init, and guess what, it printed n_procs times instead of just once. What is the best way to handle I/O in MPI code?
int main()
{
    read in data;
    cout << "************************";
    MPI_Init();
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    if (my_rank == 0)
    {
        mergesort_parallel; // recursively
    }
    else
    {
        MPI_Recv subarray from parent;
        mergesort_parallel(subarray);
        MPI_Send subarray after sorting to parent;
        MPI_Finalize();
        return 0;
    }
    MPI_Finalize();
    output sorted data to file;
}
The processes are created by mpiexec/mpirun and exist before MPI_Init() is called; therefore the "*" line is printed as many times as there are processes. I suggest using standard I/O routines like fopen(), fread(), etc. inside the code for the root process, i.e.:
if (myrank == 0)
{
    read file into buffer of root; // master I/O
}
else
{
    other code;
}
MPI_Finalize();
return 0;
Further, place MPI_Finalize() and return 0 outside both the if and the else branches, as above. If you want to read portions of the file into the individual buffers of the processes in parallel, then go for MPI I/O, i.e. use the I/O functions provided by MPI such as MPI_File_open(), MPI_File_set_view(), etc., as sketched below.
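As a hedged sketch of the MPI I/O route, here is one way each rank could read its own contiguous chunk of a shared binary file (the file name data.bin and chunk_count are placeholders):

#include <mpi.h>

void read_chunk(int* buf, int chunk_count) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    // byte offset of this rank's chunk within the shared file
    MPI_Offset offset = (MPI_Offset)rank * chunk_count * sizeof(int);
    MPI_File_read_at(fh, offset, buf, chunk_count, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}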

When does cout flush?

I know endl or calling flush() will flush it. I also know that when you use cin after cout, cout flushes, and likewise when the program exits. Are there other situations in which cout flushes?
I just wrote a simple loop, and I didn't flush, but I can still see it being printed to the screen. Why? Thanks!
for (int i = 0; i < 399999; i++) {
    cout << i << "\n";
}
Also, the time for it to finish is the same as with endl, both about 7 seconds.
for (int i = 0; i < 399999; i++) {
    cout << i << endl;
}
There is no strict rule in the standard - only that endl WILL flush; beyond that, the implementation may flush at any time it "likes".
And of course, each number below 400K prints as up to 6 digits plus a newline, roughly 6 * 400K = 2.4MB of output in total. That is very unlikely to fit in the buffer, and the loop runs fast enough that you won't notice the pauses between the actual writes. Try something like this:
for (int i = 0; i < 100; i++)
{
    cout << i << "\n";
    Sleep(1000);
}
(If you are using a Unix-based OS, use sleep(1) instead - or add a loop that takes some time, etc.)
Edit: It should be noted that this is not guaranteed to show any difference. I know that on my Linux machine, if you don't flush in this particular type of scenario, it doesn't output anything - however, some systems may do "flush on \n" or something similar.
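For completeness, the portable knobs for controlling flushing are std::flush, std::endl, and the std::unitbuf flag (which std::cerr has set by default); a small demonstration:

#include <iostream>

int main() {
    std::cout << "buffered\n";             // may sit in the buffer for a while
    std::cout << "flushed" << std::flush;  // explicit flush, no newline

    std::cout << std::unitbuf;             // from here on, flush after every insertion
    std::cout << "always flushed\n";
    std::cout << std::nounitbuf;           // restore normal buffering

    // std::cin is tied to std::cout, so reading from std::cin also flushes std::cout.
    return 0;
}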

How to set a program's execution priority from source code?

I wrote the following code, which must search all possible combinations of two digits (3 and 4) in a string whose length is specified:
#include <iostream>
#include <Windows.h>
int main()
{
    using namespace std;
    cout << "Enter length of array" << endl;
    int size;
    cin >> size;
    int* ps = new int[size];
    for (int i = 0; i < size; i++)
        ps[i] = 3;
    int k = 4;
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    while (k >= 0)
    {
        for (int bi = 0; bi < size; bi++)
            std::cout << ps[bi];
        std::cout << std::endl;
        int i = size - 1;
        if (ps[i] == 3)
        {
            ps[i] = 4;
            continue;
        }
        if (ps[i] == 4)
        {
            while (ps[i] == 4)
            {
                ps[i] = 3;
                --i;
            }
            ps[i] = 4;
            if (i < k)
                k--;
        }
    }
    delete[] ps; // release the array
}
When the program was executing on Windows 7, I saw that the CPU load was only 10-15%. To make my code work faster, I decided to change my program's priority to High, but when I did, there was no speedup and the CPU load stayed the same. Why doesn't the CPU load change? Is the statement SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS); incorrect? Or can this code simply not work any faster?
If your CPU is not working at its full capacity, it means that your application is not capable of using it, because of causes like I/O, sleeps, memory, or other device throughput limits.
Most probably, however, it means that your CPU has 2+ cores and your application is single-threaded. In this case you have to go through the process of parallelizing your application, which is often neither simple nor fast.
In the case of the code you posted, the most time-consuming operation is actually (most probably) printing the results. Remove the cout code and see for yourself how fast the code runs.
Increasing the priority of your program won't help much.
What you need to do is remove the cout from your calculations: store your computations and output them afterwards.
As others have noted, it might also be that you are on a multi-core machine. In any case, removing all output from the computation loop is always a good first step towards using 100% of the machine's computation power for the computation, rather than wasting cycles on output.
std::vector<std::string> results;
results.reserve(1000); // this should ideally match the number of results you expect
while (k >= 0)
{
    std::string row;
    for (int bi = 0; bi < size; bi++) {
        row += std::to_string(ps[bi]);
    }
    results.push_back(row); // store one combination per entry
    int i = size - 1;
    if (ps[i] == 3)
    {
        ps[i] = 4;
        continue;
    }
    if (ps[i] == 4)
    {
        while (ps[i] == 4)
        {
            ps[i] = 3;
            --i;
        }
        ps[i] = 4;
        if (i < k)
            k--;
    }
}
// now here you can output your data
for (auto&& res : results) {
    cout << res << "\n"; // "\n" to avoid forcing a flush
}
cout << endl; // now force a flush
What's probably happening is that you're on a multi-core/multi-threaded machine and you're running on only one thread; the rest of the CPU power is just sitting idle. So you'll want to multi-thread your code: look at boost::thread.

How to reduce CPU usage during real-time data transfer on TCP ports

I have a socket program which acts as both client and server.
It initiates a connection on an input port and reads data from it. In a real-time scenario it reads data on the input port and sends the data (record by record) to the output port.
The problem is that while sending data to the output port, CPU usage rises to 50%, which is not permissible.
while (1)
{
    if (IsInputDataAvail == 1) // check if data is available on the input port
    {
        // condition to avoid duplicates while sending
        if (LastRecordSent < LastRecordRecvd)
        {
            record_time temprt;
            list<record_time> BufferList;
            list<record_time>::iterator j;
            list<record_time>::iterator i;
            // store into a temp list
            for (i = L.begin(); i != L.end(); ++i)
            {
                if ((i->recordId > LastRecordSent) && (i->recordId <= LastRecordRecvd))
                {
                    temprt.listrec = i->listrec;
                    temprt.recordId = i->recordId;
                    temprt.timestamp = i->timestamp;
                    BufferList.push_back(temprt);
                }
            }
            // send to the output port
            for (j = BufferList.begin(); j != BufferList.end(); ++j)
            {
                LastRecordSent = j->recordId;
                std::string newlistrecord = j->listrec;
                newlistrecord.append("\n");
                char* newrecord = new char[newlistrecord.size() + 1];
                strcpy(newrecord, newlistrecord.c_str());
                if (s.OutputClientAvail() == 1) // check if the output client is available
                {
                    int ret = s.SendBytes(newrecord, strlen(newrecord));
                    if (ret < 0)
                    {
                        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
                        --connected;
                        return;
                    }
                }
                else
                {
                    log1.AddLogFormatFatal("Nice Send Thread : Nice Client Timedout..connection closed");
                    --connected; // if the output client is not available, disconnect after a timeout
                    return;
                }
            }
        }
    }
    // Sleep(100); with this sleep here, CPU usage is low, but to send data in real time I need to remove it.
If I remove the Sleep(), CPU usage goes very high while sending data to the output port.
} // end of while loop
Are there any ways to maintain real-time data transfer while reducing CPU usage? Please suggest.
There are two potential CPU sinks in the listed code. First, the outer loop:
while (1)
{
    if (IsInputDataAvail == 1)
    {
        // Not run most of the time
    }
    // Sleep(100);
}
Given that the Sleep call significantly reduces your CPU usage, this spin-loop is the most likely culprit. It looks like IsInputDataAvail is a variable set by another thread (though it could be a preprocessor macro), which would mean that almost all of that CPU is being used to run this one comparison instruction and a couple of jumps.
The way to reclaim that wasted power is to block until input is available. Your reading thread probably does so already, so you just need some sort of semaphore to communicate between the two, with a system call to block the output thread. Where available, the ideal option would be sem_wait() in the output thread, right at the top of your loop, and sem_post() in the input thread, where it currently sets IsInputDataAvail. If that's not possible, the self-pipe trick might work in its place.
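A minimal sketch of that semaphore handshake, assuming POSIX semaphores are available; push_record/pop_record stand in for whatever thread-safe queue the program already uses:

#include <pthread.h>
#include <semaphore.h>

sem_t records_available; // counts queued records; sem_init(&records_available, 0, 0) at startup

void* input_thread(void* arg) {
    for (;;) {
        // ... read a record from the input port ...
        // push_record(record);          // hypothetical thread-safe queue
        sem_post(&records_available);    // wake the output thread
    }
    return NULL;
}

void* output_thread(void* arg) {
    for (;;) {
        sem_wait(&records_available);    // block here instead of spinning
        // record = pop_record();        // hypothetical thread-safe queue
        // ... send the record to the output port ...
    }
    return NULL;
}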
The second potential CPU sink is in s.SendBytes(). If a positive result indicates that the record was fully sent, then that method must be using a loop. It probably uses a blocking call to write the record; if it doesn't, then it could be rewritten to do so.
Alternatively, you could rewrite half the application to use select(), poll(), or a similar method to merge reading and writing into the same thread, but that's far too much work if your program is already mostly complete.
if (IsInputDataAvail == 1) // check if data is available on the input port
Get rid of that. Just read from the input port; it will block until data is available. This is where most of your CPU time is going. However, there are other problems:
std::string newlistrecord = j->listrec;
Here you are copying data.
newlistrecord.append("\n");
char* newrecord = new char[newlistrecord.size() + 1];
strcpy(newrecord, newlistrecord.c_str());
Here you are copying the same data again. You are also dynamically allocating memory, and you are leaking it.
if (s.OutputClientAvail() == 1) // check if the output client is available
I don't know what this does, but you should delete it. The send that follows is the time to check for errors; don't try to guess the future.
int ret = s.SendBytes(newrecord, strlen(newrecord));
Here you are recomputing the length of a string whose length you already knew when you set j->listrec. It would be much more efficient to just call s.SendBytes() directly with j->listrec and then again with "\n" than to do all this copying; TCP will coalesce the data anyway.
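Assuming listrec is a std::string and s.SendBytes() accepts any buffer/length pair (as the original call suggests), the inner send loop could shrink to a sketch like this:

for (j = BufferList.begin(); j != BufferList.end(); ++j)
{
    LastRecordSent = j->recordId;
    // send the record and the separator directly: no copies, no allocations, no leak
    if (s.SendBytes(j->listrec.c_str(), j->listrec.size()) < 0 ||
        s.SendBytes("\n", 1) < 0)
    {
        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
        --connected;
        return;
    }
}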