MPI send function not blocking and continuing execution - C++

I have an MPI program, and the assignment is to create a barrier function that halts execution of all processes until all processes have entered the barrier function. There is also a timing constraint: the barrier function must operate within roughly 2*log2(n) units of time. When I print the times at which the processes enter my barrier function and when they send to / receive from other processes, the timings do not match what I expected. Am I misunderstanding how MPI send works? I thought it would block until the other process reached the receive corresponding to that send.
This is the program code:
#include <iostream>
#include <mpi.h>
#include <time.h>
#include <math.h>
#include <windows.h>
#define MESSAGE_TAG 999
void barrier();
using namespace std;

int main(int argc, char* argv[])
{
    MPI_Init(NULL, NULL);
    int my_rank;
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    barrier();
    MPI_Finalize();
    return 0;
}

void barrier() {
    int my_rank;
    int world_size;
    MPI_Status status;
    char message[10] = "";
    clock_t startTime, endTime;
    double duration;
    startTime = clock();
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 2) {
        Sleep(5000);
    }
    endTime = clock();
    duration = ((double)endTime - startTime) / CLOCKS_PER_SEC;
    printf("Beginning of barrier for process %d at time %f\n", my_rank, duration);
    if (my_rank == 0) {
        message[0] = 'r';
        message[1] = 'e';
        message[2] = 'a';
        message[3] = 'd';
        message[4] = 'y';
        message[5] = '\0';
    }
    int stepNumber = 0;
    int totalStepsNeeded = (int) (2*(floor(log2(world_size))-1));
    int closestPower2 = (int) pow(2, ((totalStepsNeeded/2)+1));
    int excessProcs = world_size - closestPower2;
    if (my_rank < closestPower2) {
        while (stepNumber < totalStepsNeeded) {
            if (my_rank % 2 == 0) {
                int receiver = (stepNumber * 2) + 1 + my_rank;
                if (receiver > closestPower2) receiver = receiver - closestPower2;
                MPI_Send(message, 10, MPI_CHAR, receiver, MESSAGE_TAG, MPI_COMM_WORLD);
                endTime = clock();
                duration = ((double)endTime - startTime) / CLOCKS_PER_SEC;
                printf("Process %d sent to process %d at time %f\n", my_rank, receiver, duration);
            }
            else {
                int sender = my_rank - (stepNumber * 2) - 1;
                if (sender < 0) sender = closestPower2 + sender;
                MPI_Recv(message, 10, MPI_CHAR, sender, MESSAGE_TAG, MPI_COMM_WORLD, &status);
                endTime = clock();
                duration = ((double)endTime - startTime) / CLOCKS_PER_SEC;
                printf("Process %d received from process %d at time %f\n", my_rank, sender, duration);
            }
            stepNumber++;
        }
    }
    if (my_rank < excessProcs) {
        MPI_Send(message, 10, MPI_CHAR, my_rank + closestPower2, MESSAGE_TAG, MPI_COMM_WORLD);
    }
    else if (my_rank >= closestPower2) {
        MPI_Recv(message, 10, MPI_CHAR, my_rank - closestPower2, MESSAGE_TAG, MPI_COMM_WORLD, &status);
    }
    endTime = clock();
    duration = ((double)endTime - startTime) / CLOCKS_PER_SEC;
    printf("End of barrier for process %d at time %f\n", my_rank, duration);
}
When I made process 2 sleep, processes 0, 4, and 6 finish their execution before process 2 wakes up. When I printed the execution times, I noticed that process 4 sends to process 3 and its send completes before process 3 ever reaches its receive call.
The output is:
Beginning of barrier for process 4 at time 0.000000
Process 4 sent to process 5 at time 0.000000
Process 4 sent to process 7 at time 0.000000
Process 4 sent to process 1 at time 0.000000
Process 4 sent to process 3 at time 0.000000
End of barrier for process 4 at time 0.000000
Beginning of barrier for process 5 at time 0.000000
Process 5 received from process 4 at time 0.001000
Process 5 received from process 2 at time 5.005000
Process 5 received from process 0 at time 5.005000
Process 5 received from process 6 at time 5.005000
End of barrier for process 5 at time 5.005000
Beginning of barrier for process 1 at time 0.000000
Process 1 received from process 0 at time 0.001000
Process 1 received from process 6 at time 0.001000
Process 1 received from process 4 at time 0.001000
Process 1 received from process 2 at time 5.006000
End of barrier for process 1 at time 5.006000
Beginning of barrier for process 6 at time 0.000000
Process 6 sent to process 7 at time 0.000000
Process 6 sent to process 1 at time 0.000000
Process 6 sent to process 3 at time 0.000000
Process 6 sent to process 5 at time 0.000000
End of barrier for process 6 at time 0.000000
Beginning of barrier for process 2 at time 5.003000
Process 2 sent to process 3 at time 5.004000
Process 2 sent to process 5 at time 5.004000
Process 2 sent to process 7 at time 5.005000
Process 2 sent to process 1 at time 5.005000
End of barrier for process 2 at time 5.005000
Beginning of barrier for process 0 at time 0.000000
Process 0 sent to process 1 at time 0.001000
Process 0 sent to process 3 at time 0.001000
Process 0 sent to process 5 at time 0.001000
Process 0 sent to process 7 at time 0.001000
End of barrier for process 0 at time 0.001000
Beginning of barrier for process 7 at time 0.000000
Process 7 received from process 6 at time 0.001000
Process 7 received from process 4 at time 0.001000
Process 7 received from process 2 at time 5.005000
Process 7 received from process 0 at time 5.005000
End of barrier for process 7 at time 5.005000
Beginning of barrier for process 3 at time 0.000000
Process 3 received from process 2 at time 5.004000
Process 3 received from process 0 at time 5.004000
Process 3 received from process 6 at time 5.004000
Process 3 received from process 4 at time 5.004000
End of barrier for process 3 at time 5.004000
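For context on the send behavior: MPI_Send is only required to block until the message buffer can be reused, and for a small message like this one most implementations simply copy it into an internal buffer and return immediately, which is consistent with the timings for processes 0, 4, and 6 above. MPI_Ssend (synchronous send) does not complete until the matching receive has started. Below is a minimal standalone sketch of the difference, not part of the original program; it needs at least two ranks, and the Sleep call and tag value just mirror the question:
#include <mpi.h>
#include <windows.h>
#include <stdio.h>
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char msg[10] = "ready";
    if (rank == 0) {
        // MPI_Ssend cannot return until rank 1 has posted the matching receive;
        // with plain MPI_Send this call would typically return right away.
        MPI_Ssend(msg, 10, MPI_CHAR, 1, 999, MPI_COMM_WORLD);
        printf("rank 0: synchronous send completed\n");
    }
    else if (rank == 1) {
        Sleep(5000);   // delay the receive, as process 2 is delayed in the question
        MPI_Recv(msg, 10, MPI_CHAR, 0, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}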

Related

ALSA Capture missing frames

I have inherited a chunk of code that is using ALSA to capture audio input at 8 kHz, 8 bits, 1 channel. The code looks rather simple: it sets channels to 1, rate to 8000, and period size to 8000. The goal of this program is to gather audio data in 30+ minute chunks at a time.
The main loop looks like
int retval;
snd_pcm_uframes_t numFrames = 8000;
while (!exit)
{
    // Gather data
    while ( (unsigned int)(retval = snd_pcm_readi( handle, buffer, numFrames )) != numFrames )
    {
        if ( retval == -EPIPE )
        {
            cerr << "overrun " << endl;
            snd_pcm_prepare( handle );
        }
        else if ( retval < 0 )
        {
            cerr << "Error : " << snd_strerror( retval ) << endl;
            break;
        }
    }
    // buffer processing logic here
}
We have been having behavioral issues (not getting the full 8K samples per second, and weird timing), so I added gettimeofday timestamps around the snd_pcm_readi loop to see how time was being used, and I got the following:
loop 1 : 1.017 sec
loop 2 : 2.019 sec
loop 3 : 0 (less than 1ms)
loop 4 : 2.016 sec
loop 5 : 0.001 sec
... the 2-loop pattern continues (even runs 2.01x sec, odd runs 0-1 ms) for the rest of the run. This means I am actually getting on average less than 8000 samples per second (the loss appears to be about 3 seconds per 10 minutes of running). This does not sync well with the other gathered data. Also, we would have expected to process the data at about 1-second intervals, not have 2 back-to-back reads every 2 seconds or so.
As an additional check, I printed out the buffer values after setting the hardware parameters and I got the following:
Buffer Size : 43690
Periods : 5
Period Size : 8000
Period Time : 1000000
Rate : 8000
So in the end I have 2 questions:
1) Why do I get actual data at less than 8 kHz? (Possible theory: the actual hardware is not quite at 8 kHz, even if ALSA thinks it can do it.)
2) Why the 2 sec / 0 sec cycle on the reads, which should be 1 second each? And what can be done to get it to a real 1-second cycle?
Thanks for the help.
Dale Pennington
snd_pcm_readi() returns as many samples as are available, and it will not wait for more if the device is in non-blocking mode.
You have only retval samples. If you want to handle 8000 samples at once, call snd_pcm_readi() in a loop, passing it the remaining part of the buffer each time.
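Here is a minimal sketch of that accumulate-until-full loop, assuming the capture handle is in blocking mode and that buffer is a char array (one byte per frame for 8-bit mono); the names wanted, filled, and got are made up for illustration:
snd_pcm_uframes_t wanted = 8000;
snd_pcm_uframes_t filled = 0;
while (filled < wanted)
{
    // Read whatever is available into the unfilled tail of the buffer
    snd_pcm_sframes_t got = snd_pcm_readi(handle, buffer + filled, wanted - filled);
    if (got == -EPIPE)
    {
        cerr << "overrun" << endl;
        snd_pcm_prepare(handle);   // recover from the overrun and keep reading
    }
    else if (got < 0)
    {
        cerr << "Error : " << snd_strerror(got) << endl;
        break;
    }
    else
    {
        filled += got;             // advance by however many frames were actually delivered
    }
}
// buffer now holds 8000 frames unless the loop broke out on an error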

Why does vector "emplace_back" behave much more slowly in multiple threads than in a single thread

I'm doing a project that needs to put lots of data into a vector. I found that emplace_back-ing about 800,000 items into a vector inside a multithreaded callback function is much slower (about 4.5 seconds) than doing the same work in a single thread (about 0.04 s). I wonder why, and how to solve this problem?
My CPU has 18 cores (Xeon E5 2699 v3, 36 threads) and 2 * 8 GB of memory. I opened 17 threads, built with VS2015 release x64; the Concurrency Visualizer says the CPU is at 85% execution and "emplace_back" accounts for about 98% of inclusive samples. I wrote a simple demo to test the performance; the code is shown below:
#include <Windows.h>
#include <stdio.h>
#include <process.h>
#include <time.h>
#include <vector>

/** brief: In the thread callback function, 800,000 emplace_back
 * operations are performed on a local vector.
 */
unsigned int __stdcall ThreadFun(PVOID pM)
{
    double stop, start, durationTime;
    int x = 0;
    std::vector<int> indices_v;
    indices_v.reserve(10000000);
    //========= emplace_back test ==============
    start = clock();
    for (; x < 800000; ++x)
    {
        indices_v.emplace_back(7788);
    }
    stop = clock();
    durationTime = ((double)(stop - start)) / CLK_TCK;
    printf("Thread ID %4d ,time: %f\n",
           GetCurrentThreadId(), durationTime);
    return 0;
}

/*
 * same task as ThreadFun(), but without the reserve(10000000);
 * still faster than the multithreaded version
 */
void SingleThread()
{
    double stop, start, durationTime;
    int x = 0;
    std::vector<int> indices_v;
    //========= emplace_back test ==============
    start = clock();
    for (; x < 800000; ++x)
    {
        indices_v.emplace_back(7788);
    }
    stop = clock();
    durationTime = ((double)(stop - start)) / CLK_TCK;
    printf("Single Thread time: %f\n", durationTime);
}

int main()
{
    const int ThreadNum = 17;
    // do 800000
    SingleThread();
    printf("\n");
    //=========== MultiThreading ======================
    HANDLE handle[ThreadNum];
    for (int i = 0; i < ThreadNum; i++)
    {
        handle[i] = (HANDLE)_beginthreadex(NULL, 0, ThreadFun, NULL, 0, NULL);
    }
    WaitForMultipleObjects(ThreadNum, handle, TRUE, INFINITE);
    Sleep(5000);
    return 0;
}
Output:
Single Thread time: 0.046000
Thread ID 28580 ,time: 0.050000
Thread ID 25132 ,time: 1.384000
Thread ID 15428 ,time: 3.059000
Thread ID 15964 ,time: 3.556000
Thread ID 17620 ,time: 3.849000
Thread ID 9056 ,time: 3.965000
Thread ID 18300 ,time: 4.191000
Thread ID 13328 ,time: 4.182000
Thread ID 24972 ,time: 4.184000
Thread ID 13352 ,time: 4.174000
Thread ID 29316 ,time: 4.293000
Thread ID 3056 ,time: 4.278000
Thread ID 25016 ,time: 4.111000
Thread ID 13976 ,time: 4.195000
Thread ID 652 ,time: 4.259000
Thread ID 22104 ,time: 4.174000
Thread ID 13772 ,time: 4.148000
I expected the time consumed by "emplace_back" in multiple threads to be similar to the single thread, but it takes much more time than the single thread. I want to know why, and how to solve it. Any help?
So, having a single thread run the code took 0.048 seconds of CPU time. Running the code 18 times, once with a single thread and then with 17 threads, took 4.479 seconds of CPU time.
Subtracting to see how long the 17 threads took gives 4.431 seconds. That's 0.26 seconds per iteration.
That means that having all your cores running full tilt caused each run of the code to be about 6 times slower. Or, putting it another way, having all your cores try to get the work done at the same time still let the whole job finish about three times faster than running the pieces one after another.
18 cores are not going to be 18 times as fast as one core. They share caches. They share memory bandwidth. And so on.
A 3X speedup is not terrible but not great. There might be some issues with compiler flags and the like.
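For reference, here is the arithmetic from the answer above written out as a small program; the 0.048 s and 4.479 s figures are the ones quoted in the answer, and the final line assumes the 17 threads really did run concurrently, one per core:
#include <cstdio>
int main()
{
    const double singleRun = 0.048;               // CPU time for the single-threaded run (s)
    const double totalTime = 4.479;               // CPU time for all 18 runs combined (s)
    const int    threads   = 17;
    double multiTime = totalTime - singleRun;     // time consumed by the 17 threads: 4.431 s
    double perRun    = multiTime / threads;       // about 0.26 s per thread
    double slowdown  = perRun / singleRun;        // each run is roughly 5-6x slower under contention
    double speedup   = (threads * singleRun) / perRun; // but the whole batch is still ~3x faster than serial
    printf("per run %.2f s, slowdown %.1fx, overall speedup %.1fx\n", perRun, slowdown, speedup);
    return 0;
}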

MPI end program with Broadcast when some process finds a solution

I am having problems with ending my program using MS-MPI.
All return values seem fine, but I have to Ctrl+C in cmd to end it (it doesn't look like it's still computing, so the exit condition seems fine).
I want to run a program using N processes. When one of them finds a solution, it should set flag to false and send it to all the others; then on the next iteration they shall all stop and the program ends.
The program actually does some more advanced calculations and I'm working on simplified version for clarity. I just wanted to make sure that communication works.
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    //sets as 0 -> (N-1) depending on number of processes running
    int c = world_rank;
    bool flag = true;
    while (flag) {
        std::cout << "process: " << world_rank << " value: " << c << std::endl;
        c += world_size;
        //dummy condition just to test stop
        if (c == 13) {
            flag = false;
        }
        MPI_Barrier(MPI_COMM_WORLD);
        //I have also tried using MPI_Bcast without that if
        if (!flag) MPI_Bcast(&flag, 1, MPI_C_BOOL, world_rank, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    } //end of while
    MPI_Finalize();
    return 0;
}
How I think it works:
it starts with every process defining its c and flag; then on each pass through the while loop it increments its c by a fixed number. When it reaches the stop condition it sets flag to false and sends it to all the remaining processes. What I get when I run it with 4 processes:
process: 0 value: 0
process: 2 value: 2
process: 1 value: 1
process: 3 value: 3
process: 1 value: 5
process: 3 value: 7
process: 0 value: 4
process: 2 value: 6
process: 3 value: 11
process: 1 value: 9
process: 2 value: 10
process: 0 value: 8
process: 3 value: 15
process: 2 value: 14
process: 0 value: 12
(I am fine with those few extra values)
But after that I have to manually terminate it with ctrl + c. When running on 1 process it gets smoothly from 1 to 12 and exits.
MPI_Bcast() is a collective operation, and all the ranks of the communicator have to use the same value for the root argument (in your program, they all use a different value).
A valid approach (though unlikely the optimal one) is to send a termination message to rank 0, update flag accordingly and have all the ranks call MPI_Bcast(..., root=0, ...).
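A sketch of an even simpler variant of that idea: have every rank take part in the same collective on every iteration, so nobody is left waiting on a call the others never make. Here MPI_Allreduce with MPI_LAND combines the per-rank flags, and every rank sees false as soon as any rank has found a solution. This replaces the point-to-point message plus MPI_Bcast described above; the rest of the loop is as in the question, and the two barriers become unnecessary because the Allreduce already synchronizes each iteration:
while (flag) {
    std::cout << "process: " << world_rank << " value: " << c << std::endl;
    c += world_size;
    bool localFlag = true;
    //dummy condition just to test stop, as in the question
    if (c == 13) {
        localFlag = false;
    }
    //every rank calls the same collective each iteration, so nobody hangs;
    //MPI_LAND yields false for everyone once any rank reports false
    MPI_Allreduce(&localFlag, &flag, 1, MPI_C_BOOL, MPI_LAND, MPI_COMM_WORLD);
}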

C++ Threading Error

I am getting a C++ threading error with the below code:
//create MAX_THREADS arrays for writing data to
thread threads[MAX_THREADS];
char ** data = new char*[MAX_THREADS];
char * currentSlice;
int currentThread = 0;
for (int slice = 0; slice < convertToVoxels(ARM_LENGTH); slice += MAX_SLICES_PER_THREAD){
    currentThread++;
    fprintf(stderr, "Generating volume for slice %d to %d on thread %d...\n", slice, slice + MAX_SLICES_PER_THREAD >= convertToVoxels(ARM_LENGTH) ? convertToVoxels(ARM_LENGTH) : slice + MAX_SLICES_PER_THREAD, currentThread);
    try {
        //Allocate memory for the slice
        currentSlice = new char[convertToVoxels(ARM_RADIUS) * convertToVoxels(ARM_RADIUS) * MAX_SLICES_PER_THREAD];
    } catch (std::bad_alloc&) {
        cout << endl << "Bad alloc" << endl;
        exit(0);
    }
    data[currentThread] = currentSlice;
    //Spawn a thread
    threads[currentThread] = thread(buildDensityModel, slice * MAX_SLICES_PER_THREAD, currentSlice);
    //If the number of threads is maxed out or if we are on the last thread
    if (currentThread == MAX_THREADS || slice + MAX_SLICES_PER_THREAD > convertToVoxels(ARM_LENGTH)){
        fprintf(stderr, "Joining threads... \n");
        //Join all threads
        for (int i = 0; i < MAX_THREADS; i++){
            threads[i].join();
        }
        fprintf(stderr, "Writing file chunks... \n");
        FILE* fd = fopen("density.raw", "ab");
        for (int i = 0; i < currentThread; i++){
            fwrite(&data[i], sizeof(char), convertToVoxels(ARM_RADIUS) * convertToVoxels(ARM_RADIUS), fd);
            delete data[i];
        }
        fclose(fd);
        currentThread = 0;
    }
}
The goal of this code is to create smaller sections of a large three-dimensional array that can be threaded for increased processing speed, but can also be stitched back together when I write them to a file. To this end I tried to spawn n threads at a time, and after spawning the nth thread, join all existing threads, write to the file in question, then reset things and continue the process until all sub-problems have been completed.
I am getting the following error:
Generating volume for slice 0 to 230 on thread 1...
Generating volume for slice 230 to 460 on thread 2...
Generating volume for slice 460 to 690 on thread 3...
Generating volume for slice 690 to 920 on thread 4...
Generating volume for slice 920 to 1150 on thread 5...
Generating volume for slice 1150 to 1380 on thread 6...
Generating volume for slice 1380 to 1610 on thread 7...
terminate called without an active exception
Aborted (core dumped)
After doing some research it seems that the issue is that I am not joining my threads before they go out of scope. However, I thought the code I wrote would do this correctly, namely this section:
//Join all threads
for (int i = 0; i < MAX_THREADS; i++){
threads[i].join();
}
Could anyone point out my error (or errors) and explain it a little clearer so I do not repeat the same mistake?
Edit: Note I have verified that I am getting into the inner if block that is meant to join the threads. After running the file with the thread-spawning line and thread-joining line commented out, I get the following output:
Generating volume for slice 0 to 230 on thread 1...
Generating volume for slice 230 to 460 on thread 2...
Generating volume for slice 460 to 690 on thread 3...
Generating volume for slice 690 to 920 on thread 4...
Generating volume for slice 920 to 1150 on thread 5...
Generating volume for slice 1150 to 1380 on thread 6...
Generating volume for slice 1380 to 1610 on thread 7...
Joining threads and writing file chunk...
The issue: you are calling the join method on an empty thread object. You cannot do this; when you call join on a non-joinable thread you get an exception.
In this line
thread threads[MAX_THREADS];
you created MAX_THREADS threads using the default constructor. Each thread object after the default ctor is in a non-joinable state. Before calling join you should invoke the joinable method; if it returns true, you can call join.
for (int i = 0; i < MAX_THREADS; i++){
    if (threads[i].joinable())
        threads[i].join();
}
Now your code crashes when i = 0 because you increment currentThread at the beginning of your for loop:
for (int slice = 0; slice < convertToVoxels(ARM_LENGTH); slice+=MAX_SLICES_PER_THREAD){
currentThread++; // <---
and you leave threads[0] as an empty object when doing this assignment (before the first assignment, currentThread is already 1):
threads[currentThread] = thread(buildDensityModel, slice * MAX_SLICES_PER_THREAD, currentSlice);
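A minimal sketch of one way to restructure the loop so the thread array is filled starting at index 0 and only the threads that were actually spawned get joined; the allocation and file-writing steps are elided and assumed to stay as in the original:
for (int slice = 0; slice < convertToVoxels(ARM_LENGTH); slice += MAX_SLICES_PER_THREAD){
    // ... allocate currentSlice as before ...
    data[currentThread] = currentSlice;
    threads[currentThread] = thread(buildDensityModel, slice * MAX_SLICES_PER_THREAD, currentSlice);
    currentThread++;   // increment AFTER using the index, so slot 0 gets a real thread
    if (currentThread == MAX_THREADS || slice + MAX_SLICES_PER_THREAD >= convertToVoxels(ARM_LENGTH)){
        for (int i = 0; i < currentThread; i++){   // join only the threads that exist
            threads[i].join();
        }
        // ... write and free the data chunks as before ...
        currentThread = 0;
    }
}
Using >= in the flush condition (rather than the original >) also catches the case where the last batch of slices ends exactly on the boundary, so those threads still get joined.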

How does a smaller pipe speed up data flow?

Have a 1MB pipe:
if (0 == CreatePipe(&hRead, &hWrite, 0, 1024*1024))
{
    printf("CreatePipe failed\n");
    return success;
}
Sending 4000 bytes at a time (bytesReq = 4000)
while ((bytesReq = (FileSize - offset)) != 0)
{
    //Send data to Decoder.cpp thread, converting to human readable CSV
    if ( (0 == WriteFile(hWrite,
                         readBuff,
                         bytesReq,
                         &bytesWritten,
                         0)) ||
         (bytesWritten != bytesReq) )
    {
        printf("WriteFile failed error = %d\n", GetLastError());
        break;
    }
}
Only 4 bytes at a time are being read by another thread at the other end of the pipe.
When I made the pipe smaller, the total time of sending and reading got a lot smaller.
Changed the pipe size to:
1024*1024 = 2 minutes (original size)
1024*512 = 1 min 47 sec
10,000 = 1 min 33 sec
Anything below 10k: 1 min 33 sec
How can this be?
Less waiting.
If the pipe buffer is too big, then one process writes all the data and closes its end of the pipe before the second process even begins, so the two processes end up executing serially. A smaller buffer makes the writer block once the pipe fills, which forces the writer and the reader to run concurrently and overlap their work.