MPI (Summation) - C++

I am writing a program that calculates the sum of every number from 1 up to 1000, i.e. 1+2+3+...+1000. First, I assign summation jobs to 10 processors: Processor 0 gets 1-100, Processor 1 gets 101-200, and so on. The partial sums are stored in an array.
After all summations have been done in parallel, each processor sends its value to Processor 0 (which receives the values using nonblocking send/recv), and Processor 0 sums up all the values and displays the result.
Here is the code:
#include <mpi.h>
#include <iostream>
using namespace std;

int summation(int, int);

int main(int argc, char ** argv)
{
    int * array;
    int total_proc;
    int curr_proc;
    int limit = 0;
    int partial_sum = 0;
    int upperlimit = 0, lowerlimit = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &total_proc);
    MPI_Comm_rank(MPI_COMM_WORLD, &curr_proc);
    MPI_Request send_request, recv_request;

    /* checking if 1000 is divisible by number of procs, else quit */
    if(1000 % total_proc != 0)
    {
        MPI_Finalize();
        if(curr_proc == 0)
            cout << "**** 1000 is not divisible by " << total_proc << " ...quitting..." << endl;
        return 0;
    }

    /* number of partial summations */
    limit = 1000/total_proc;
    array = new int [total_proc];

    /* assigning jobs to processors */
    for(int i = 0; i < total_proc; i++)
    {
        if(curr_proc == i)
        {
            upperlimit = upperlimit + limit;
            lowerlimit = (upperlimit - limit) + 1;
            partial_sum = summation(upperlimit, lowerlimit);
            array[i] = partial_sum;
        }
        else
        {
            upperlimit = upperlimit + limit;
            lowerlimit = (upperlimit - limit) + 1;
        }
    }

    cout << "** Partial Sum From Process " << curr_proc << " is " << array[curr_proc] << endl;

    /* send and receive - non blocking */
    for(int i = 1; i < total_proc; i++)
    {
        if(curr_proc == i) /* (i = current processor) */
        {
            MPI_Isend(&array[i], 1, MPI_INT, 0, i, MPI_COMM_WORLD, &send_request);
            cout << "-> Process " << i << " sent " << array[i] << " to Process 0" << endl;
            MPI_Irecv(&array[i], 1, MPI_INT, i, i, MPI_COMM_WORLD, &recv_request);
            //cout << "<- Process 0 received " << array[i] << " from Process " << i << endl;
        }
    }

    MPI_Finalize();

    if(curr_proc == 0)
    {
        for(int i = 1; i < total_proc; i++)
            array[0] = array[0] + array[i];
        cout << "Sum is " << array[0] << endl;
    }

    return 0;
}

int summation(int u, int l)
{
    int result = 0;
    for(int i = l; i <= u; i++)
        result = result + i;
    return result;
}
Output:
** Partial Sum From Process 0 is 5050
** Partial Sum From Process 3 is 35050
-> Process 3 sent 35050 to Process 0
<- Process 0 received 35050 from Process 3
** Partial Sum From Process 4 is 45050
-> Process 4 sent 45050 to Process 0
<- Process 0 received 45050 from Process 4
** Partial Sum From Process 5 is 55050
-> Process 5 sent 55050 to Process 0
<- Process 0 received 55050 from Process 5
** Partial Sum From Process 6 is 65050
** Partial Sum From Process 8 is 85050
-> Process 8 sent 85050 to Process 0
<- Process 0 received 85050 from Process 8
-> Process 6 sent 65050 to Process 0
** Partial Sum From Process 1 is 15050
** Partial Sum From Process 2 is 25050
-> Process 2 sent 25050 to Process 0
<- Process 0 received 25050 from Process 2
<- Process 0 received 65050 from Process 6
** Partial Sum From Process 7 is 75050
-> Process 1 sent 15050 to Process 0
<- Process 0 received 15050 from Process 1
-> Process 7 sent 75050 to Process 0
<- Process 0 received 75050 from Process 7
** Partial Sum From Process 9 is 95050
-> Process 9 sent 95050 to Process 0
<- Process 0 received 95050 from Process 9
Sum is -1544080023
Printing the contents of the array:
5050
536870912
-1579286148
-268433415
501219332
32666
501222192
32666
1
0
I'd like to know what is causing this.
If I print the array before MPI_Finalize is invoked it works fine.

The most important flaw in your program is how you divide the work. In MPI, every process executes the main function, so you must ensure that all of them execute your summation function if you want them to collaborate on building the result.
You don't need the for loop. Every process executes main separately; they just have different curr_proc values, and you can compute which portion of the job each one has to perform based on that:
/* assigning jobs to processors */
int chunk_size = 1000 / total_proc;
lowerlimit = curr_proc * chunk_size + 1;   /* +1 because summation() includes both endpoints */
upperlimit = (curr_proc + 1) * chunk_size;
partial_sum = summation(upperlimit, lowerlimit);
Next, the way the master process receives the other processes' partial sums is not correct.
MPI rank values (curr_proc) run from 0 up to total_proc-1, where total_proc is the value returned by MPI_Comm_size.
In your send/receive loop, process 0 never participates: each nonzero rank sends to rank 0 and then posts a receive from itself, so the master never actually receives anything.
Also, you are using the immediate versions of send and receive, MPI_Isend and MPI_Irecv, but you never wait until those requests are completed. You should use MPI_Waitall for that purpose.
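For instance, a minimal sketch of the nonblocking pattern, reusing the names from your code (one possible way to write it, not the only one):

if (curr_proc == 0) {
    // master posts one receive per worker, then waits for all of them
    MPI_Request* reqs = new MPI_Request[total_proc - 1];
    for (int i = 1; i < total_proc; i++)
        MPI_Irecv(&array[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[i - 1]);
    MPI_Waitall(total_proc - 1, reqs, MPI_STATUSES_IGNORE);
    delete[] reqs;
} else {
    // workers send their partial sum and wait for the send to complete
    MPI_Request req;
    MPI_Isend(&partial_sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

With blocking calls you avoid the request bookkeeping entirely.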
The correct version would be something like the following:

if (curr_proc == 0) {
    // master process receives all data
    for (int i = 1; i < total_proc; i++)
        MPI_Recv(&array[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    // other processes send data to the master
    MPI_Send(&partial_sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
This all-to-one communication pattern is known as a gather. MPI provides a function that already implements it: MPI_Gather.
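For example, a sketch with MPI_Gather, assuming every rank (including the master) contributes its partial_sum:

int* sums = NULL;
if (curr_proc == 0)
    sums = new int[total_proc]; // receive buffer is only significant on the root
MPI_Gather(&partial_sum, 1, MPI_INT, sums, 1, MPI_INT, 0, MPI_COMM_WORLD);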
Finally, what you intend to perform is known as a reduction: take a given amount of numeric values and generate a single output value by continuously performing a single operation (a sum, in your case). MPI provides a function that does that, too: MPI_Reduce.
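With it, the whole gather-and-add step collapses to a single call (a sketch reusing the names above):

int total = 0;
MPI_Reduce(&partial_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (curr_proc == 0)
    cout << "Sum is " << total << endl;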
I strongly suggest you do some basic guided exercises before trying to build your own programs. MPI is difficult to understand at the beginning, and building a good base is vital so you can add complexity later on. A hands-on tutorial is also a good way of getting started with MPI.
EDIT: Forgot to mention that you don't need to enforce an even division of the problem size (1000 in this case) by the number of resources (total_proc). Depending on the case, you can either assign the remainder to a single process:
chunk_size = 1000 / total_proc;
if( curr_proc == 0 )
chunk_size += 1000 % total_proc;
Or balance it as much as possible:
int remainder = 1000 % total_proc;
int extra = curr_proc < remainder ? 1 : 0;   /* the first `remainder` ranks take one more */
lowerlimit = curr_proc * chunk_size          /* as usual */
           + min(curr_proc, remainder) + 1;  /* cumulative remainder so far */
upperlimit = lowerlimit + chunk_size + extra - 1;
In the second case the load imbalance is at most 1, while in the first case it can reach total_proc-1 in the worst case.
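For example, with 1000 values and total_proc = 7, chunk_size is 142 and the remainder is 6: the balanced scheme gives ranks 0-5 143 values each and rank 6 142 values, whereas the first scheme gives rank 0 all 142+6 = 148 values and every other rank 142.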

You're only initializing array[i], the element that corresponds to the curr_proc rank. The other elements of that array are uninitialized and so hold random values. In your send/receive print loop you only ever access the initialized element.
I'm not that familiar with MPI, so I'm guessing, but you might want to allocate array before calling MPI_Init. Or call MPI_Recv on process 0, not on each individual process.

Related

Trying to create a multithreaded program to find the total primes from 0-100000000

Hello, I am trying to write a multithreaded C++ program using the POSIX thread library to find the number of prime numbers between 1 and 10,000,000 (10 million), and to measure how many microseconds the computation takes...
Creating my threads and running them works completely fine; however, I suspect there is an error in my Prime function when determining whether a number is prime or not.
I keep receiving 78496 as my output, but I expect 664579. Below is my code. Any hints or pointers would be greatly appreciated.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h> // for pthread_create/pthread_join
#include <iostream>
#include <sys/time.h> //measure the execution time of the computations
using namespace std;

//The number of threads to be generated
#define NUMBER_OF_THREADS 4

void * Prime(void* index);

long numbers[4] = {250000, 500000, 750000, 1000000};
long start_numbers[4] = {1, 250001, 500001, 750001};
int thread_numbers[4] = {0, 1, 2, 3};

int main(){
    pthread_t tid[NUMBER_OF_THREADS];
    int tn;
    long sum = 0;
    timeval start_time, end_time;
    double start_time_microseconds, end_time_microseconds;

    gettimeofday(&start_time, NULL);
    start_time_microseconds = start_time.tv_sec * 1000000 + start_time.tv_usec;

    for(tn = 0; tn < NUMBER_OF_THREADS; tn++){
        if (pthread_create(&tid[tn], NULL, Prime, (void *) &thread_numbers[tn]) == -1 ) {
            perror("thread fail");
            exit(-1);
        }
    }

    long value[4];
    for(int i = 0; i < NUMBER_OF_THREADS; i++){
        if(pthread_join(tid[i], (void **) &value[i]) == 0){
            sum = sum + value[i]; //add four sums together
        }else{
            perror("Thread join failed");
            exit(-1);
        }
    }

    //get the end time in microseconds
    gettimeofday(&end_time, NULL);
    end_time_microseconds = end_time.tv_sec * 1000000 + end_time.tv_usec;

    //calculate the time passed
    double time_passed = end_time_microseconds - start_time_microseconds;

    cout << "Sum is: " << sum << endl;
    cout << "Running time is: " << time_passed << " microseconds" << endl;
    exit(0);
}

//Prime function
void* Prime(void* index){
    int temp_index;
    temp_index = *((int*)index);
    long sum_t = 0;
    for(long i = start_numbers[temp_index]; i <= numbers[temp_index]; i++){
        for (int j = 2; j*j <= i; j++)
        {
            if (i % j == 0)
            {
                break;
            }
            else if (j+1 > sqrt(i)) {
                sum_t++;
            }
        }
    }
    cout << "Thread " << temp_index << " terminates" << endl;
    pthread_exit( (void*) sum_t);
}
This is because you used 10^6 instead of 10^7.
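Concretely, the per-thread bounds would become (a sketch of the corrected globals):

long numbers[4] = {2500000, 5000000, 7500000, 10000000};
long start_numbers[4] = {1, 2500001, 5000001, 7500001};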
Also, I added some corner cases for the numbers 1, 2, and 3:
//Prime function
void* Prime(void* index){
    int temp_index;
    temp_index = *((int*)index);
    long sum_t = 0;
    for(long i = start_numbers[temp_index]; i <= numbers[temp_index]; i++){
        // Corner cases
        if(i <= 1) continue;
        if (i <= 3){
            sum_t++;
            continue;
        }
        for (int j = 2; j*j <= i; j++)
        {
            if ((i % j == 0) || (i % (j+2)) == 0)
            {
                break;
            }
            else if (j+1 > sqrt(i)) {
                sum_t++;
            }
        }
    }
    cout << "Thread " << temp_index << " terminates" << endl;
    pthread_exit( (void*) sum_t);
}
I tested your code with the correct numbers and got the correct count of primes as output:
Thread 0 terminates
Thread 1 terminates
Thread 2 terminates
Thread 3 terminates
Sum is: 664579
Running time is: 4.69242e+07 microseconds
Thanks to @chux - Reinstate Monica for pointing this out.
Along with dividing 10^7 among the threads instead of stopping the limit at 10^6, there are a number of other small errors, and a number of optimizations could be made:
First of all, the start numbers can begin from 2 itself:
long start_numbers[4] = {2, 2500001, 5000001, 7500001};
sum_t++ in your code may not work in edge cases. It is better to structure the Prime function as follows:
bool flag = false;
for(long i = start_numbers[temp_index]; i <= numbers[temp_index]; i++){
    flag = false;
    for (long j = 2; j*j <= i; j++){
        if (i % j == 0)
        {
            flag = true;
            break;
        }
    }
    if(!flag)
        sum_t++;
}
After these two changes I am getting the result:
Thread 0 terminates
Thread 1 terminates
Thread 2 terminates
Thread 3 terminates
Sum is: 664579
Running time is: 6.62618e+06 microseconds
Edit:
(Note: in this case j is taken as a long datatype, but it would work just as well with int in this 'example', since the tested compiler makes int 32 bits wide.)

Random Walk with MPI: Why are my messages getting lost?

I am trying to develop a parallel random walker simulation with MPI and C++.
In my simulation, each process can be thought of as a cell which can contain particles (random walkers). The cells are aligned in one dimension with periodic boundary conditions (i.e. ring topology).
In each time step, a particle can stay in its cell or go into the left or right neighbour cell with a certain probability. To make it a bit easier, only the last particle in each cell's list can walk. If the particle walks, it has to be sent to the process with the according rank (MPI_Isend + MPI_Probe + MPI_Recv + MPI_Waitall).
However, after the first step my particles start disappearing, i.e. the messages are getting 'lost' somehow.
Below is a minimal example (sorry if it's still rather long). To better track the particle movements, each particle has an ID which corresponds to the rank of the process in which it started. After each step, each cell prints the IDs of the particles stored in it.
#include <mpi.h>
#include <vector>
#include <iostream>
#include <random>
#include <string>
#include <sstream>
#include <chrono>
#include <algorithm>
using namespace std;

class Particle
{
public:
    int ID; // this is the rank of the process which initialized the particle
    Particle () : ID(0) {};
    Particle (int ID) : ID(ID) {};
};

stringstream msg;
string msgString;

int main(int argc, char** argv)
{
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // communication declarations
    MPI_Status status;

    // get the ranks of neighbors (periodic boundary conditions)
    int neighbors[2];
    neighbors[0] = (world_size + world_rank - 1) % world_size; // left neighbor
    neighbors[1] = (world_size + world_rank + 1) % world_size; // right neighbor

    // declare particle type
    MPI_Datatype type_particle;
    MPI_Type_contiguous (1, MPI_INT, &type_particle);
    MPI_Type_commit (&type_particle);

    // every process inits 1 particle with ID = world_rank
    vector<Particle> particles;
    particles.push_back (Particle(world_rank));

    // obtain a seed from the timer
    typedef std::chrono::high_resolution_clock myclock;
    myclock::time_point beginning = myclock::now();
    myclock::duration d = myclock::now() - beginning;
    unsigned seed2 = d.count();
    default_random_engine generator (seed2);
    uniform_real_distribution<double> distribution (0, 1);

    // ------------------------------------------------------------------
    // begin time loop
    //-------------------------------------------------------------------
    for (int t=0; t<10; t++)
    {
        // ------------------------------------------------------------------
        // 1) write a message string containing the current list of particles
        //-------------------------------------------------------------------
        // write the rank and the particle IDs into the msgString
        msg << "rank " << world_rank << ": ";
        for (auto& i : particles)
        {
            msg << i.ID << " ";
        }
        msg << "\n";
        msgString = msg.str();
        msg.str (string()); msg.clear ();

        // to print the messages in order, the messages are gathered by root (rank 0) and then printed
        // first, gather nums to root
        int num = msgString.size();
        int rcounts[world_size];
        MPI_Gather( &num, 1, MPI_INT, rcounts, 1, MPI_INT, 0, MPI_COMM_WORLD);

        // root now has correct rcounts, using these we set displs[] so
        // that data is placed contiguously (or concatenated) at receive end
        int displs[world_size];
        displs[0] = 0;
        for (int i=1; i<world_size; ++i)
        {
            displs[i] = displs[i-1]+rcounts[i-1]*sizeof(char);
        }

        // create receive buffer
        int rbuf_size = displs[world_size-1]+rcounts[world_size-1];
        char *rbuf = new char[rbuf_size];

        // gather the messages
        MPI_Gatherv( &msgString[0], num, MPI_CHAR, rbuf, rcounts, displs, MPI_CHAR,
                     0, MPI_COMM_WORLD);

        // root prints the messages
        if (world_rank == 0)
        {
            cout << endl << "step " << t << endl;
            for (int i=0; i<rbuf_size; i++)
                cout << rbuf[i];
        }

        // ------------------------------------------------------------------
        // 2) send particles randomly to neighbors
        //-------------------------------------------------------------------
        Particle snd_buf;
        int sndDest = -1;

        // 2a) if there are particles left, prepare a message. otherwise, proceed to step 2b)
        if (!particles.empty ())
        {
            // write the last particle in the list to a buffer
            snd_buf = particles.back ();

            // flip a coin. with a probability of 50 %, the last particle in the list gets sent to a random neighbor
            double rnd = distribution (generator);
            if (rnd <= .5)
            {
                particles.pop_back ();
                // pick random neighbor
                if (rnd < .25)
                {
                    sndDest = neighbors[0]; // send to the left
                }
                else
                {
                    sndDest = neighbors[1]; // send to the right
                }
            }
        }

        // 2b) always send a message to each neighbor (even if it's empty)
        MPI_Request requests[2];
        for (int i=0; i<2; i++)
        {
            int dest = neighbors[i];
            MPI_Isend (
                &snd_buf,              // void* data
                sndDest==dest ? 1 : 0, // int count <---- send 0 particles to every neighbor except the one specified by sndDest
                type_particle,         // MPI_Datatype
                dest,                  // int destination
                0,                     // int tag
                MPI_COMM_WORLD,        // MPI_Comm
                &requests[i]
            );
        }

        // ------------------------------------------------------------------
        // 3) probe and receive messages from each neighbor
        //-------------------------------------------------------------------
        for (int i=0; i<2; i++)
        {
            int src = neighbors[i];

            // probe to determine if the message is empty or not
            MPI_Probe (
                src,            // int source,
                0,              // int tag,
                MPI_COMM_WORLD, // MPI_Comm comm,
                &status         // MPI_Status* status
            );
            int nRcvdParticles = 0;
            MPI_Get_count (&status, type_particle, &nRcvdParticles);

            // if the message is non-empty, receive it
            if (nRcvdParticles > 0) // this proc can receive max. 1 particle from each neighbor
            {
                Particle rcv_buf;
                MPI_Recv (
                    &rcv_buf,          // void* data
                    1,                 // int count
                    type_particle,     // MPI_Datatype
                    src,               // int source
                    0,                 // int tag
                    MPI_COMM_WORLD,    // MPI_Comm comm
                    MPI_STATUS_IGNORE  // MPI_Status* status
                );

                // add received particle to the list
                particles.push_back (rcv_buf);
            }
        }

        MPI_Waitall (2, requests, MPI_STATUSES_IGNORE);
    }
    // ------------------------------------------------------------------
    // end time loop
    //-------------------------------------------------------------------

    // Finalize the MPI environment.
    MPI_Finalize();
    if (world_rank == 0)
        cout << "\nMPI_Finalize()\n";

    return 0;
}
I ran the simulation with 8 processes and below is a sample of the output. In step 1, it still seems to work well, but beginning with step 2 the particles begin disappearing.
step 0
rank 0: 0
rank 1: 1
rank 2: 2
rank 3: 3
rank 4: 4
rank 5: 5
rank 6: 6
rank 7: 7
step 1
rank 0: 0
rank 1: 1
rank 2: 2 3
rank 3:
rank 4: 4 5
rank 5:
rank 6: 6 7
rank 7:
step 2
rank 0: 0
rank 1:
rank 2: 2
rank 3:
rank 4: 4
rank 5:
rank 6: 6 7
rank 7:
step 3
rank 0: 0
rank 1:
rank 2: 2
rank 3:
rank 4:
rank 5:
rank 6: 6
rank 7:
step 4
rank 0: 0
rank 1:
rank 2: 2
rank 3:
rank 4:
rank 5:
rank 6: 6
rank 7:
I have no idea what's wrong with the code... Somehow the combination MPI_Isend + MPI_Probe + MPI_Recv + MPI_Waitall seems not to work... Any help is really appreciated!
There is an error in your code. The following logic (irrelevant code and arguments omitted) is wrong:
MPI_Probe(..., &status);
MPI_Get_count(&status, type_particle, &nRcvdParticles);
// if the message is non-empty, receive it
if (nRcvdParticles > 0)
{
    MPI_Recv();
}
MPI_Probe does not remove zero-sized messages from the message queue. The only MPI calls that do so are MPI_Recv and the combination of MPI_Irecv + MPI_Test/MPI_Wait. You must receive all messages, including zero-sized ones, otherwise they will prevent the reception of further messages with the same (source, tag) combination. Although receiving a zero-sized message writes nothing into the receive buffer, it removes the message envelope from the queue, and the next matching message can then be received.
Solution: move the call to MPI_Recv before the conditional operator.
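A sketch of the fixed receive loop, reusing the names from your code:

for (int i = 0; i < 2; i++)
{
    int src = neighbors[i];
    MPI_Probe(src, 0, MPI_COMM_WORLD, &status);
    int nRcvdParticles = 0;
    MPI_Get_count(&status, type_particle, &nRcvdParticles);

    // always receive, so that zero-sized envelopes are consumed as well
    Particle rcv_buf;
    MPI_Recv(&rcv_buf, 1, type_particle, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (nRcvdParticles > 0)
        particles.push_back(rcv_buf);
}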

MPI_Scatterv: segmentation fault 11 on process 0 only

I'm trying to scatter values among processes belonging to a hypercube group (quicksort project).
Depending on the number of processes, I either create a new communicator excluding the excess processes, or I duplicate MPI_COMM_WORLD if the process count exactly fits a hypercube (power of 2).
In both cases, processes other than 0 receive their data, but:
- In the first scenario, process 0 throws a segmentation fault 11
- In the second scenario, nothing faults, but the values received by process 0 are gibberish.
NOTE: If I try a regular MPI_Scatter everything works well.
//Input
vector<int> LoadFromFile();
int d;              //dimension of hypercube
int p;              //active processes
int idle;           //idle processes
vector<int> values; //values loaded
int arraySize;      //number of total values to distribute

int main(int argc, char* argv[])
{
    int mpiWorldRank;
    int mpiWorldSize;
    int mpiRank;
    int mpiSize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpiWorldRank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpiWorldSize);
    MPI_Comm MPI_COMM_HYPERCUBE;

    d = log2(mpiWorldSize);
    p = pow(2, d);           //Number of processes belonging to the hypercube
    idle = mpiWorldSize - p; //number of processes in excess
    int toExclude[idle];     //array of idle processes to exclude from communicator
    int sendCounts[p];       //array of values sizes to be sent to processes

    //
    int i = 0;
    while (i < idle)
    {
        toExclude[i] = mpiWorldSize - 1 - i;
        ++i;
    }

    //CREATING HYPERCUBE GROUP: Group of size of power of 2 -----------------
    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    // Remove excessive processors if any from communicator
    if (idle > 0)
    {
        MPI_Group newGroup;
        MPI_Group_excl(world_group, 1, toExclude, &newGroup);
        MPI_Comm_create(MPI_COMM_WORLD, newGroup, &MPI_COMM_HYPERCUBE);
        //Abort any processor not part of the hypercube.
        if (mpiWorldRank > p)
        {
            cout << "aborting: " << mpiWorldRank << endl;
            MPI_Finalize();
            return 0;
        }
    }
    else
    {
        MPI_Comm_dup(MPI_COMM_WORLD, &MPI_COMM_HYPERCUBE);
    }

    MPI_Comm_rank(MPI_COMM_HYPERCUBE, &mpiRank);
    MPI_Comm_size(MPI_COMM_HYPERCUBE, &mpiSize);
    //END OF: CREATING HYPERCUBE GROUP --------------------------

    if (mpiRank == 0)
    {
        //STEP1: Read input
        values = LoadFromFile();
        arraySize = values.size();
    }

    //Transforming input vector into an array
    int valuesArray[values.size()];
    if(mpiRank == 0)
    {
        copy(values.begin(), values.end(), valuesArray);
    }

    //Broadcast input size to all processes
    MPI_Bcast(&arraySize, 1, MPI_INT, 0, MPI_COMM_HYPERCUBE);

    //MPI_Scatterv: determining size of arrays to be received and displacement
    int nmin = arraySize / p;
    int remainingData = arraySize % p;
    int displs[p];
    int recvCount;
    int k = 0;
    for (i=0; i<p; i++)
    {
        sendCounts[i] = i < remainingData
            ? nmin+1
            : nmin;
        displs[i] = k;
        k += sendCounts[i];
    }
    recvCount = sendCounts[mpiRank];
    int recvValues[recvCount];

    //Following MPI_Scatter works well:
    // MPI_Scatter(&valuesArray, 13, MPI_INT, recvValues, 13, MPI_INT, 0, MPI_COMM_HYPERCUBE);
    MPI_Scatterv(&valuesArray, sendCounts, displs, MPI_INT, recvValues, recvCount, MPI_INT, 0, MPI_COMM_HYPERCUBE);

    int j = 0;
    while (j < recvCount)
    {
        cout << "rank " << mpiRank << " received: " << recvValues[j] << endl;
        ++j;
    }

    MPI_Finalize();
    return 0;
}
First of all, you are supplying wrong arguments to MPI_Group_excl:
MPI_Group_excl(world_group, 1, toExclude, &newGroup);
// ^
The second argument specifies the number of entries in the exclusion list and should therefore be equal to idle. Since you are excluding a single rank only, the resulting group has mpiWorldSize-1 ranks, and hence MPI_Scatterv expects that both sendCounts[] and displs[] have that many elements. Of those, only p elements are properly initialised and the rest are random, therefore MPI_Scatterv crashes in the root.
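That is, the call should pass idle as the count:

MPI_Group_excl(world_group, idle, toExclude, &newGroup);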
Another error is the code that aborts the idle processes: it should read if (mpiWorldRank >= p).
I would recommend replacing the entire exclusion code with a single call to MPI_Comm_split instead:
MPI_Comm comm_hypercube;
int colour = mpiWorldRank >= p ? MPI_UNDEFINED : 0;

MPI_Comm_split(MPI_COMM_WORLD, colour, mpiWorldRank, &comm_hypercube);
if (comm_hypercube == MPI_COMM_NULL)
{
    MPI_Finalize();
    return 0;
}
When no process supplies MPI_UNDEFINED as its colour, the call is equivalent to MPI_Comm_dup.
Note that you should avoid using in your code names starting with MPI_ as those could clash with symbols from the MPI implementation.
Additional note: std::vector<T> uses contiguous storage, therefore you could do without copying the elements into a regular array and simply provide the address of the first element in the call to MPI_Scatter(v):
MPI_Scatterv(&values[0], ...);
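For instance, the full call might then look like this (a sketch; the send arguments are only significant on the root):

MPI_Scatterv(&values[0], sendCounts, displs, MPI_INT, recvValues, recvCount, MPI_INT, 0, MPI_COMM_HYPERCUBE);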

MPI's Scatterv operation

I'm not sure that I am correctly understanding what MPI_Scatterv is supposed to do. I have 79 items to scatter among a variable number of nodes. However, when I use the MPI_Scatterv command I get ridiculous numbers (as if the array elements of my receiving buffer were uninitialized). Here is the relevant code snippet:
MPI_Init(&argc, &argv);
int id, procs;
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &procs);

//Assign each file a number and figure out how many files should be
//assigned to each node
int file_numbers[files.size()];
int send_counts[nodes] = {0};
int displacements[nodes] = {0};
for (int i = 0; i < files.size(); i++)
{
    file_numbers[i] = i;
    send_counts[i%nodes]++;
}

//figure out the displacements
int sum = 0;
for (int i = 0; i < nodes; i++)
{
    displacements[i] = sum;
    sum += send_counts[i];
}

//Create a receiving buffer
int *rec_buf = new int[79];

if (id == 0)
{
    MPI_Scatterv(&file_numbers, send_counts, displacements, MPI_INT, rec_buf, 79, MPI_INT, 0, MPI_COMM_WORLD);
}

cout << "got here " << id << " checkpoint 1" << endl;
cout << id << ": " << rec_buf[0] << endl;
cout << "got here " << id << " checkpoint 2" << endl;

MPI_Barrier(MPI_COMM_WORLD);
free(rec_buf);
MPI_Finalize();
When I run that code I receive this output:
got here 1 checkpoint 1
1: -1168572184
got here 1 checkpoint 2
got here 2 checkpoint 1
2: 804847848
got here 2 checkpoint 2
got here 3 checkpoint 1
3: 1364787432
got here 3 checkpoint 2
got here 4 checkpoint 1
4: 903413992
got here 4 checkpoint 2
got here 0 checkpoint 1
0: 0
got here 0 checkpoint 2
I read the documentation for Open MPI and looked through some code examples, but I'm not sure what I'm missing. Any help would be great!
One of the most common MPI mistakes strikes again:
if (id == 0) // <---- PROBLEM
{
    MPI_Scatterv(&file_numbers, send_counts, displacements, MPI_INT,
                 rec_buf, 79, MPI_INT, 0, MPI_COMM_WORLD);
}
MPI_SCATTERV is a collective MPI operation. Collective operations must be executed by all processes in the specified communicator in order to complete successfully. You are executing it only in rank 0, and that's why only rank 0 gets the correct values.
Solution: remove the conditional if (...).
But there is another subtle mistake here. Since collective operations do not provide any status output, the MPI standard enforces strict matching between the number of elements sent to a given rank and the number of elements that rank is willing to receive. In your case the receiver always specifies 79 elements, which might not match the corresponding number in send_counts. You should instead use:
MPI_Scatterv(file_numbers, send_counts, displacements, MPI_INT,
             rec_buf, send_counts[id], MPI_INT,
             0, MPI_COMM_WORLD);
Also note the following discrepancy in your code, which might as well be a typo made while posting the question here:
MPI_Comm_size(MPI_COMM_WORLD, &procs);
^^^^^
int send_counts[nodes] = {0};
^^^^^
int displacements[nodes] = {0};
^^^^^
While you obtain the number of ranks in the procs variable, nodes is used in the rest of your code. I guess nodes should be replaced by procs.

Print number of 1s in a sequence up to a number, without actually counting 1s [closed]

An interview question:
Make a program which takes an input 'N' (unsigned long) and prints two columns: the first column prints the numbers from 1 to N (in hexadecimal format), and the second column prints the number of 1s in the binary representation of the number in the left column. The condition is that this program must not count 1s (so no computations 'per number' to get the 1s, and no division operators).
I tried to implement this by leveraging the fact that the number of 1s in 0x0 to 0xF can be reused to generate the 1s for any larger number. I am pasting the code (a basic version, without error checking). It's giving correct results, but I am not happy with the space usage. How can I improve on this?
(Also, I am not sure if it's what the interviewer was looking for.)
void printRangeFasterWay(){
    uint64_t num = ~0x0 ;
    cout << " Enter upper number " ;
    cin >> num ;
    uint8_t arrayCount[] = { 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4} ;
    // This array will store information needed to print
    uint8_t * newCount = new uint8_t[num] ;
    uint64_t mask = 0x0 ;
    memcpy(newCount, &arrayCount[0], 0x10) ;
    uint64_t lower = 0;
    uint64_t upper = 0xF;
    uint64_t count = 0 ;
    uint32_t zcount= 0 ;
    do{
        upper = std::min(upper, num) ;
        for(count = lower ; count <= upper ; count++){
            newCount[count] = (uint32_t)( newCount[count & mask] + newCount[(count & ~mask)>>(4*zcount)]) ;
        }
        lower += count ;
        upper |= (upper<<4) ;
        mask = ((mask<<4) | 0xF ) ;
        zcount++ ;
    }while(count<=num) ;
    for(uint64_t xcount=0 ; xcount <= num ; xcount++){
        cout << std::hex << " num = " << xcount << std::dec << " number of 1s = " << (uint32_t)newCount[xcount] << endl;
    }
}
Edited to add sample run
Enter upper number 18
num = 0 number of 1s = 0
num = 1 number of 1s = 1
num = 2 number of 1s = 1
num = 3 number of 1s = 2
num = 4 number of 1s = 1
num = 5 number of 1s = 2
num = 6 number of 1s = 2
num = 7 number of 1s = 3
num = 8 number of 1s = 1
num = 9 number of 1s = 2
num = a number of 1s = 2
num = b number of 1s = 3
num = c number of 1s = 2
num = d number of 1s = 3
num = e number of 1s = 3
num = f number of 1s = 4
num = 10 number of 1s = 1
num = 11 number of 1s = 2
num = 12 number of 1s = 2
I have a slightly different approach which should solve your memory problem. It's based on the fact that the bitwise operation i & -i gives you the value of the lowest set bit in i. For example, for i = 5, i & -i = 1, and for i = 6, i & -i = 2. Now, for the code:
void countBits(unsigned N) {
    for (int i = 0; i < N; i++)
    {
        int bits = 0;
        for (int j = i; j > 0; j = j - (j&-j))
            bits++;
        cout << "Num: " << i << " Bits:" << bits << endl;
    }
}
I hope I understood your question correctly. Hope that helps
Edit:
Ok, try this - this is dynamic programming without using every bit in every number:
void countBits(unsigned N) {
    unsigned *arr = new unsigned[N + 1];
    arr[0] = 0;
    for (int i = 1; i <= N; i++)
    {
        arr[i] = arr[i - (i&-i)] + 1;
    }
    for (int i = 0; i <= N; i++)
        cout << "Num: " << i << " Bits:" << arr[i] << endl;
}
Hopefully, this works better
Several of the answers posted so far make use of bit shifting (just another word for division by 2) or bit masking. This strikes me as a bit of a cheat. The same goes for using the '1' bit count in a 4-bit pattern and then matching by chunks of 4 bits.
How about a simple recursive solution using an imaginary binary tree of bits: each left branch contains a '0', each right branch contains a '1'. Then do a depth-first traversal, counting the number of 1 bits on the way down. Once the bottom of the tree is reached, add one to the counter, print out the number of 1 bits found so far, back out one level, and recurse again.
Stop the recursion when the counter reaches the desired number.
I am not a C/C++ programmer, but here is a REXX solution that should translate without much imagination (a rough C++ sketch follows the results below). Note the magic number 32 is just the number of bits in an unsigned long; set it to anything suitable.
/* REXX */
SAY 'Stopping number:'
pull StopNum
Counter = 0
CALL CountOneBits 0, 0
return
CountOneBits: PROCEDURE EXPOSE Counter StopNum
ARG Depth, OneBits
If Depth = 32 then Return /* Number of bits in ULong */
if Counter = StopNum then return /* Counted as high as requested */
call BitCounter Depth + 1, OneBits /* Left branch is a 0 bit */
call BitCounter Depth + 1, OneBits + 1 /* Right branch is a 1 bit */
Return
BitCounter: PROCEDURE EXPOSE Counter StopNum
ARG Depth, OneBits
if Depth = 32 then do /* Bottom of binary bit tree */
say D2X(Counter) 'contains' OneBits 'one bits'
Counter = Counter + 1
end
call CountOneBits Depth, OneBits
return
Results:
Stopping number:
18
0 contains 0 one bits
1 contains 1 one bits
2 contains 1 one bits
3 contains 2 one bits
4 contains 1 one bits
5 contains 2 one bits
6 contains 2 one bits
7 contains 3 one bits
8 contains 1 one bits
9 contains 2 one bits
A contains 2 one bits
B contains 3 one bits
C contains 2 one bits
D contains 3 one bits
E contains 3 one bits
F contains 4 one bits
10 contains 1 one bits
11 contains 2 one bits
This answer is reasonably efficient in time and space.
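For reference, here is a rough C++ translation of the REXX sketch above (my sketch, with hypothetical names; Counter and StopNum become globals, and 32 is again the word width):

#include <cstdio>

unsigned long counter = 0;
unsigned long stop_num = 0;

void count_one_bits(int depth, int one_bits); // forward declaration

void bit_counter(int depth, int one_bits)
{
    if (depth == 32) // bottom of the binary bit tree
    {
        std::printf("%lX contains %d one bits\n", counter, one_bits);
        counter = counter + 1;
    }
    count_one_bits(depth, one_bits);
}

void count_one_bits(int depth, int one_bits)
{
    if (depth == 32) return;              // number of bits in the word
    if (counter >= stop_num) return;      // counted as high as requested
    bit_counter(depth + 1, one_bits);     // left branch is a 0 bit
    bit_counter(depth + 1, one_bits + 1); // right branch is a 1 bit
}

int main()
{
    std::printf("Stopping number:\n");
    if (std::scanf("%lu", &stop_num) != 1) return 1;
    count_one_bits(0, 0);
    return 0;
}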
This can be done relatively trivially in constant time per number with the appropriate bit twiddling. No counting of 1s and no divisions. I think you were on the right track with keeping the array of known bit values:
int bits(int x)
{
    // known bit values for 0-15
    static int bc[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    // bit "counter"
    int b = 0;
    // loop iterator
    int c = 0;
    do
    {
        // get the last 4 bits in the number
        char lowc = static_cast<char>(x & 0x0000000f);
        // find the count
        b += bc[lowc];
        // lose the last four bits
        x >>= 4;
        ++c;
        // loop for each possible 4 bit combination,
        // or until x is 0 (all significant bits lost)
    }
    while(c < 8 && x > 0);
    return b;
}
Explanation
The following algorithm is like yours, but expands on the idea (if I understood your approach correctly). It does not do any computation 'per number' as directed by the question; instead it uses a recursion between sequences whose lengths are powers of 2. Basically, the observation is that for the sequence 0, 1, ..., 2^n-1 we can reuse the sequence 0, 1, ..., 2^(n-1)-1 in the following way.
Let f(i) be the number of ones in the number i; then f(2^(n-1)+i) = f(i)+1 for all 0 <= i < 2^(n-1). (Verify this for yourself.)
Algorithm in C++
#include <stdio.h>
#include <stdlib.h>

int main( int argc, char *argv[] )
{
    const int N = 32;
    int* arr = new int[N];
    arr[0] = 0;
    arr[1] = 1;
    for ( int i = 1; i < 15; i++ )
    {
        int pow2 = 1 << i;
        int offset = pow2;
        for ( int k = 0; k < pow2; k++ )
        {
            if ( offset+k >= N )
                goto leave;
            arr[offset+k] = arr[k]+1;
        }
    }
leave:
    for ( int i = 0; i < N; i++ )
    {
        printf( "0x%8x %16d\n", i, arr[i] );
    }
    delete[] arr;
    return EXIT_SUCCESS;
}
Note that in the for loop
for ( int i = 1; i < 15; i++ )
the shift 1 << i may overflow into negative numbers if you push i far enough (it overflows a 32-bit int once i reaches 31), so use unsigned ints if you want to go higher than that.
Efficiency
This algorithm runs in O(N) and uses O(N) space.
Here is an approach that has O(n log n) time complexity and O(1) memory usage. The idea is to get the hex representation of the number and iterate over its digits, adding up the number of ones per hex digit.
#include <cstdio>
#include <iostream>
using namespace std;

int oneCount[] = { 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

int getOneCount(int n)
{
    char inStr[70];
    sprintf(inStr, "%X", n);
    int i;
    int sum = 0;
    for(i = 0; inStr[i]; i++)
    {
        if ( inStr[i] > '9' )
            sum += oneCount[inStr[i] - 'A' + 10];
        else
            sum += oneCount[inStr[i] - '0'];
    }
    return sum;
}

int main()
{
    int i, upperLimit;
    cin >> upperLimit;
    for(i = 0; i <= upperLimit; i++)
    {
        cout << std::hex << " num = " << i << std::dec << " number of 1s = " << getOneCount(i) << endl;
    }
    return 0;
}
#include <iostream>
using namespace std;

enum bit_count_masks32
{
    one_bits=     0x55555555, // 01...
    two_bits=     0x33333333, // 0011...
    four_bits=    0x0f0f0f0f, // 00001111...
    eight_bits=   0x00ff00ff, // 0000000011111111...
    sixteen_bits= 0x0000ffff, // 00000000000000001111111111111111
};

unsigned int popcount32(unsigned int x)
{
    unsigned int result= x;
    result= (result & one_bits) + ((result & (one_bits << 1)) >> 1);
    result= (result & two_bits) + ((result & (two_bits << 2)) >> 2);
    result= (result & four_bits) + ((result & (four_bits << 4)) >> 4);
    result= (result & eight_bits) + ((result & (eight_bits << 8)) >> 8);
    result= (result & sixteen_bits) + ((result & (sixteen_bits << 16)) >> 16);
    return result;
}

void print_range(unsigned int low, unsigned int high)
{
    for (unsigned int n = low; n <= high; ++n)
    {
        cout << std::hex << " num = " << n << std::dec << " number of 1s = " << popcount32(n) << endl;
    }
}
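A small driver to sanity-check the routine (my addition, not part of the original answer):

int main()
{
    // known values: popcount(0x0) = 0, popcount(0xF) = 4, popcount(0xFFFFFFFF) = 32
    cout << popcount32(0x0u) << " " << popcount32(0xFu) << " " << popcount32(0xFFFFFFFFu) << endl;
    print_range(0, 0x12); // reproduce the question's sample run up to decimal 18
    return 0;
}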
}