Scattering to more nodes than there is data - C++

What happens if I call MPI::Scatter and there are more nodes in the communicator than there is data?
Suppose I have a 4 × 4 array and I'm sending one row to every processor, but I have 8 processors. What happens? Will ranks 0–3 receive data and ranks 4–7 get nothing?
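(Not part of the original question, just an illustrative sketch.) One way to make the behaviour explicit when there are more ranks than rows is MPI_Scatterv, giving the surplus ranks a count of zero; the array contents and variable names below are assumptions:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int rows = 4, cols = 4;               // the 4 x 4 array from the question
    std::vector<int> matrix;                    // filled on the root only
    if (rank == 0) matrix.assign(rows * cols, 1);

    // One row per rank for the first `rows` ranks, zero elements for the rest.
    std::vector<int> counts(nprocs, 0), displs(nprocs, 0);
    for (int r = 0; r < nprocs && r < rows; ++r) {
        counts[r] = cols;
        displs[r] = r * cols;
    }

    std::vector<int> row(counts[rank]);         // empty on ranks >= rows
    MPI_Scatterv(rank == 0 ? matrix.data() : nullptr, counts.data(), displs.data(),
                 MPI_INT, row.data(), counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}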

Related

MPI receiving data from an unknown number of ranks

I have a list of indices for which I do not know the corresponding entries in a vector, because the vector is distributed among the ranks. I have to send these indices to the ranks that own the data in order to get it.
On the other hand, "my" rank also gets lists of indices from an unknown number of ranks. After receiving such a list, "my" rank has to send the corresponding data back to the requesting ranks.
I think I have to work with a mixture of MPI_Probe and MPI_Gather. But at the moment I cannot see how to receive lists from an unknown number of ranks.
I think it has to look like this, but how can I receive data from an unknown number of ranks? Or do I have to loop over all possible ranks that could send me something?
MPI_Status status;
int count;
std::vector<Size> indices;  // Size is assumed to be a 64-bit unsigned type matching MPI_UINT64_T
// Wait for a message from any rank and ask how many elements it carries.
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
MPI_Get_count(&status, MPI_UINT64_T, &count);
if (count != MPI_UNDEFINED) {
    indices.resize(count);  // resize, not reserve: MPI_Recv writes `count` elements
    MPI_Recv(indices.data(), count, MPI_UINT64_T,
             status.MPI_SOURCE, status.MPI_TAG, comm, &status);
}
This closely resembles what I did a few years ago for parallel I/O.
One option:
From all senders, determine the size that you need to send to each other rank.
Exchange the sizes (Allgather if all ranks can be senders, otherwise point-to-point sends/receives).
Do an (all)gatherv that retrieves the data on each receiver.
You can use non-blocking sends/receives as well as gatherv (MPI-3), and this scales well (depending on the hardware) to 500 cores for 8 senders.
The way we did it was to go through the vector in chunks of several MB and send the data chunk by chunk. Of course, the bigger the chunks the better, but the more memory you also need on each sender rank to hold the data.
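A minimal sketch of that size-exchange-then-gatherv pattern, assuming a single receiving rank 0; the function and variable names are illustrative, not taken from the post:

#include <mpi.h>
#include <cstdint>
#include <numeric>
#include <vector>

// Every rank contributes a variable number of elements; rank 0 gathers them all.
void gather_variable_sized(const std::vector<std::uint64_t>& local, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    // Step 1: exchange the sizes, so everyone (in particular the receiver) knows them.
    int my_count = static_cast<int>(local.size());
    std::vector<int> counts(nprocs);
    MPI_Allgather(&my_count, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

    // Step 2: the receiver builds displacements and gathers the variable-sized pieces.
    std::vector<int> displs(nprocs, 0);
    std::partial_sum(counts.begin(), counts.end() - 1, displs.begin() + 1);

    std::vector<std::uint64_t> gathered;
    if (rank == 0)
        gathered.resize(displs.back() + counts.back());

    MPI_Gatherv(local.data(), my_count, MPI_UINT64_T,
                gathered.data(), counts.data(), displs.data(), MPI_UINT64_T,
                0, comm);
}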

Hadoop 3.0 erasure coding - determining the number of acceptable node failures?

In Hadoop 2.0 the default replication factor is 3, and the number of acceptable node failures was 3 - 1 = 2.
So on a 100-node cluster, if a file was divided into, say, 10 parts (blocks), then with a replication factor of 3 the total number of storage blocks required is 30. And if any 3 nodes containing a block X and its replicas failed, the file is not recoverable. Even if the cluster had 1000 nodes, or the file was split into 20 parts, the failure of 3 nodes on the cluster can still be disastrous for the file.
Now stepping into Hadoop 3.0.
With erasure coding, Hadoop says it provides the same durability with only 50% storage overhead. And this follows from how the Reed-Solomon method works (that is, for k data blocks and n parity blocks, at least k of the (k+n) blocks should be accessible for the file to be recoverable/readable).
So for the same file above: there are 10 data blocks, and to keep the storage overhead at 50%, 5 parity blocks can be added. So from the 10 + 5 blocks, at least any 10 blocks should be available for the file to be accessible. On the 100-node cluster, if each of the 15 blocks is stored on a separate node, then, as you can see, a total of 5 node failures is acceptable. Storing the same file (i.e. 15 blocks) on a 1000-node cluster would not make any difference with respect to the number of acceptable node failures - it's still 5.
But the interesting part here is that if the same file (or another file) was divided into 20 blocks and then 10 parity blocks were added, then, for the total of 30 blocks saved on the 100-node cluster, the acceptable number of node failures is 10.
The point I want to make here is -
in Hadoop 2 the number of acceptable node failures is ReplicationFactor - 1 and is clearly based on the replication factor, and this is a cluster-wide property;
but in Hadoop 3, if the storage overhead were fixed at 50%, then the number of acceptable node failures seems to differ from file to file, based on the number of blocks the file is divided into.
So can anyone comment on whether the above inference is correct? And how is a cluster's number of acceptable node failures determined?
(I did not want to make it complex above, so I did not discuss the edge case of a file with only one block. But the algorithm, I guess, will be smart enough to replicate it as-is or with parity data so that the data durability settings are guaranteed.)
Edit:
This question is part of a series of questions I have on EC - Others as below -
Hadoop 3.0 erasure coding: impact on MR jobs performance?
Using your numbers for Hadoop 2.0, each block of data is stored on 3 different nodes. As long as any one of the 3 nodes holding a specific block has not failed, that block of data is recoverable.
Again using your numbers, for Hadoop 3.0, every set of 10 data blocks and 5 parity blocks is stored on 15 different nodes. So the storage space requirement is reduced to 50% overhead, but the number of nodes the data and parities are written to has increased by a factor of 5, from 3 nodes for Hadoop 2.0 to 15 nodes for Hadoop 3.0. Since the redundancy is based on Reed-Solomon erasure correction, as long as any 10 of the 15 nodes holding a specific set of blocks have not failed, that set of blocks is recoverable (the maximum allowed failure for a set of blocks is 5 nodes). If it's 20 data blocks and 10 parity blocks, then the data and parity blocks are distributed over 30 different nodes (the maximum allowed failure for a set of blocks is 10 nodes).
For a cluster-wide view, failure can occur if more than n-k nodes fail, regardless of the number of nodes, since there's some chance that a set of data and parity blocks will happen to include all of the failing nodes. To avoid this, n should be increased along with the number of nodes in a cluster. For 100 nodes, each set could be 80 data blocks and 20 parity blocks (25% redundancy). Note that 100 nodes would be unusually large. The example from this web page is 14 nodes with RS(14,10) (for each set: 10 data blocks, 4 parity blocks).
https://hadoop.apache.org/docs/r3.0.0
With your numbers, the cluster size would be 15 (10+5) or 30 (20+10) nodes.
For a file with 1 block, or with fewer than k blocks, n-k parity blocks would still be needed to ensure that it takes more than n-k node failures before the file is lost. For Reed-Solomon encoding, this could be done by emulating leading blocks of zeroes for the "missing" blocks.
I thought I'd add some probability versus number of nodes in a cluster.
Assume node failure rate is 1%.
15 nodes, 10 for data, 5 for parities, using comb(a,b) for combinations of a things b at a time:
Probability of exactly x node failures is:
6 => ((.01)^6) ((.99)^9) (comb(15,6)) ~= 4.572 × 10^-9
7 => ((.01)^7) ((.99)^8) (comb(15,7)) ~= 5.938 × 10^-11
8 => ((.01)^8) ((.99)^7) (comb(15,8)) ~= 5.998 × 10^-13
...
Probability of 6 or more failures ~= 4.632 × 10^-9
30 nodes, 20 for data, 10 for parities
Probability of exactly x node failures is:
11 => ((.01)^11) ((.99)^19) (comb(30,11)) ~= 4.513 × 10^-15
12 => ((.01)^12) ((.99)^18) (comb(30,12)) ~= 7.218 × 10^-17
13 => ((.01)^13) ((.99)^17) (comb(30,13)) ~= 1.010 × 10^-18
14 => ((.01)^14) ((.99)^16) (comb(30,14)) ~= 1.238 × 10^-20
Probability of 11 or more failures ~= 4.586 × 10^-15
To show that the need for parity overhead decreases with the number of nodes, consider the extreme case of 100 nodes, 80 for data, 20 for parities (25% redundancy):
Probability of exactly x node failures is:
21 => ((.01)^21) ((.99)^79) (comb(100,21)) ~= 9.230 × 10^-22
22 => ((.01)^22) ((.99)^78) (comb(100,22)) ~= 3.348 × 10^-23
23 => ((.01)^23) ((.99)^77) (comb(100,23)) ~= 1.147 × 10^-24
Probability of 21 or more failures ~= 9.577 × 10^-22
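For reference, the tail probabilities above can be reproduced with a simple binomial model. A small sketch (not part of the original answer), assuming independent node failures with probability p:

#include <cmath>
#include <cstdio>

// P(exactly k of n nodes fail) for independent failures with probability p.
double binom_pmf(int n, int k, double p) {
    // lgamma-based binomial coefficient keeps the computation stable for large n.
    double log_comb = std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
    return std::exp(log_comb + k * std::log(p) + (n - k) * std::log1p(-p));
}

// P(more than `max_failures` of n nodes fail), i.e. the data-loss probability
// for a set spread over n nodes that tolerates `max_failures` node failures.
double loss_probability(int n, int max_failures, double p) {
    double tail = 0.0;
    for (int k = max_failures + 1; k <= n; ++k) tail += binom_pmf(n, k, p);
    return tail;
}

int main() {
    const double p = 0.01;  // assumed 1% node failure rate, as in the answer
    std::printf("15 nodes (10+5):   %.3e\n", loss_probability(15, 5, p));    // ~4.63e-9
    std::printf("30 nodes (20+10):  %.3e\n", loss_probability(30, 10, p));   // ~4.59e-15
    std::printf("100 nodes (80+20): %.3e\n", loss_probability(100, 20, p));  // ~9.58e-22
}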

MPI Sending Dynamically Sized Array using MPI_Alltoallv

Given p processors, each having a unique input of size i, where every input value is a unique number. Each processor is given an integer range.
Goal: have each processor end up holding only the integers in its range.
I am currently running into issues with the following steps.
Each processor wants to export the values that are not in its range.
Each processor buckets the values from its input that are not in its range into an overflow bucket.
Every processor p broadcasts the size of its overflow.
Take the sum of the overflow sizes, including the local overflow size, to get total_overflow_size.
Each processor will now broadcast its bucket using MPI_Alltoallv.
The broadcasted buckets are now stored in the globalBucket array.
This is the MPI_Alltoallv call I am using:
int *local_overflow = &overflow;
//buckets = local buckets <- contains values not in local range, size of local overflow.
//globalBucket <- size of all overflows from all processors
//offset = Sum of all rank-1 processor's overflow.
MPI_Alltoallv(&buckets,local_overflow,off,MPI_INTEGER,
&globalBucket,offest,local_overflow,
MPI_INTEGER,MPI_COMM_WORLD);
I believe my issue lies in offsetting the values appropriately, which corresponds to the 3rd and 7th parameters.
The goal is, for example, if processor 0 has a bucket of size 5 and processor 1 has a bucket of size 12, for proc 0's bucket to occupy the first 5 slots and proc 1's bucket to occupy the next 12 slots in globalBucket.
I receive error messages such as
*** Process received signal ***
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0xc355bb0
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x1733ae0
MPI_Alltoallv is an uncommon call; more info is available at: http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Alltoallv.html
EDIT: I have now calculated my offsets correctly (the sum of all lower-ranked processors' overflow sizes), but I am still receiving the same error as above.
local_overflow should be an array of int. Because you want to send the same number of values from this rank to each of the other ranks, all of the elements of local_overflow should have the same value.
int *local_overflow = new int[p];
for (int i = 0; i < p; ++i) {
    local_overflow[i] = overflow;  // every destination gets the same send count
}
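Putting the counts and displacements together, here is a minimal sketch of a complete exchange in the spirit of the question (the function name and variables are illustrative, not taken from the original code); the displacements are the prefix sums that the 3rd and 7th parameters expect:

#include <mpi.h>
#include <numeric>
#include <vector>

// `buckets` holds this rank's overflow values; the same bucket is sent to every rank,
// so all send displacements are 0 and all send counts equal the local overflow size.
std::vector<int> exchange_overflow(const std::vector<int>& buckets, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int overflow = static_cast<int>(buckets.size());
    std::vector<int> send_counts(nprocs, overflow), sdispls(nprocs, 0);
    std::vector<int> recv_counts(nprocs), rdispls(nprocs, 0);

    // Learn how many values every other rank will send to us.
    MPI_Alltoall(send_counts.data(), 1, MPI_INT, recv_counts.data(), 1, MPI_INT, comm);

    // Receive displacements are prefix sums of the receive counts, so rank 0's bucket
    // lands in the first recv_counts[0] slots, rank 1's in the next recv_counts[1], etc.
    std::partial_sum(recv_counts.begin(), recv_counts.end() - 1, rdispls.begin() + 1);

    std::vector<int> globalBucket(rdispls.back() + recv_counts.back());

    MPI_Alltoallv(buckets.data(), send_counts.data(), sdispls.data(), MPI_INT,
                  globalBucket.data(), recv_counts.data(), rdispls.data(), MPI_INT,
                  comm);
    return globalBucket;
}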

Can a process send data to itself? Using MPICH2

I have an upper triangular matrix A and the result vector b.
My program needs to solve the linear system:
Ax = b
using the pipeline method.
One of the constraints is that the number of processes is smaller than the number of equations (let's say it can be anywhere from 2 to numberOfEquations - 1).
I don't have the code right now; I'm thinking about the pseudo code.
My idea was that one of the processes will create the random upper triangular matrix (A) and the vector b.
Let's say this is the random matrix:
1 2 3 4 5 6
0 1 7 8 9 10
0 0 1 12 13 14
0 0 0 1 16 17
0 0 0 0 1 18
0 0 0 0 0 1
and the vector b is [10 5 8 9 10 5],
and I have fewer processes than equations (let's say 2 processes).
So what I thought is that some process will send each process a line from the matrix and the relevant number from vector b.
So the last line of the matrix and the last number in vector b will be sent to
process numProcs-1 (here I mean the last process, i.e. process 1),
which then computes its x and sends the result to process 0.
Now process 0 needs to compute the 5th line of the matrix, and here I'm stuck.
I have the x that was computed by process 1, but how can a process send to itself
the next line of the matrix and the relevant number from vector b that needs to be computed?
Is it possible? I don't think it's right to send to "myself"
Yes, MPI allows a process to send data to itself, but one has to be extra careful about possible deadlocks when blocking operations are used. In that case one usually pairs a non-blocking send with a blocking receive, or vice versa, or one uses calls like MPI_Sendrecv. Sending a message to self usually ends up with the message simply being memory-copied from the source buffer to the destination one, with no networking or other heavy machinery involved.
And no, communicating with self is not necessarily a bad thing. The most obvious benefit is that it makes the code more symmetric, as it removes/reduces the special logic needed to handle self-interaction. Sending to/receiving from self also happens in most collective communication calls. For example, MPI_Scatter also sends part of the data to the root process. To prevent some send-to-self cases that unnecessarily replicate data and decrease performance, MPI allows an in-place mode (MPI_IN_PLACE) for most communication-related collectives.
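Not from the original answer, but a minimal sketch of the pairing described above: a rank hands a row to itself by posting a non-blocking receive first and then doing the matching blocking send.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank;
    MPI_Comm_rank(comm, &rank);

    const int n = 6;                        // length of one matrix row
    std::vector<double> row(n, 1.0);        // the row this rank wants to hand to itself
    std::vector<double> next_row(n);

    // Post the receive first, then do the matching blocking send to self.
    MPI_Request req;
    MPI_Irecv(next_row.data(), n, MPI_DOUBLE, rank, /*tag=*/0, comm, &req);
    MPI_Send(row.data(), n, MPI_DOUBLE, rank, /*tag=*/0, comm);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}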
Is it possible? I don't think it's right to send to "myself"
Sure, it is possible to communicate with oneself. There is even a communicator for it: MPI_COMM_SELF. Talking to yourself is not too uncommon.
Your setup sounds like you would rather use MPI collectives. Have a look at MPI_Scatter and MPI_Gather and see whether they provide the functionality you are looking for.
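For illustration, a small sketch (with assumed sizes and names, not the poster's code) of MPI_Scatter handing one row per process, where the root receives a copy of its own row just like every other rank:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 6;                          // row length in this simplified example
    std::vector<double> A;                    // full matrix, significant on the root only
    if (rank == 0) A.assign(nprocs * n, 1.0); // one row per rank in this setup

    std::vector<double> my_row(n);
    // The root sends one row to each rank, including a copy to itself.
    MPI_Scatter(A.data(), n, MPI_DOUBLE, my_row.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}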

Fortran 77 direct access max record number

I'm trying to store double precision data from different blocks into a direct access file, i.e. the data is g(m,n) for one block and all blocks have the same size. Here's the code I wrote:
OPEN(3,FILE='a.TMP',ACCESS='DIRECT',RECL=8*m*n)
WRITE(3,REC=I) ((g(K,L),K=1,m),L=1,n) ! here "I" is the block number
I have 200 blocks of this kind. However, I got the following error after writing the 157th block's data into the file:
severe (66): output statement overflows record, unit 3
I believe that means the record size is too large.
Is there any way to handle this? I wonder whether the record number has a maximum value.