I guess my question has 2 parts:
(1) Is this the right approach to send different chunks of an array to different processors?
Let's say I have n processors whose rank ranges from 0 to n-1.
I have an array of size d. I want to split this array into k equally-sized chunks. Assume d is divisible by k.
I want to send each of these chunks to a processor whose rank is less than k.
It would be easy if I could use something like MPI_Scatter, but this function sends to EVERY processor in the communicator, and I only want to send to a certain number of procs.
So what I did was that I have a loop of k iterations and do k MPI_Isend's.
Is this efficient?
(2) If it is, how do I split an array into chunks? There's always the easy way, which is
int size = d/k;
int buffs[k][size];
for (int rank = 0; rank < k; ++rank)
{
for (int i = 0; i < size; ++i)
buffs[rank][i] = input[rank*size + i];
MPI_Isend(&buffs[rank], size, MPI_INT, rank, 1, comm, &request);
}
What you are looking for is MPI_Scatterv which allows you to explicitly specify the length of each chunk and its position relative to the beginning of the buffer. If you don't want to send data to certain ranks, simply set the length of their chunks to 0:
int blen[n];
int displ[n];
for (int rank = 0; rank < n; rank++)
{
blen[rank] = (rank < k) ? size : 0;
displ[rank] = rank * size;
}
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Scatterv(input, blen, displ, MPI_INT,
mybuf, myrank < k ? size : 0, MPI_INT,
0, MPI_COMM_WORLD);
Note that for rank >= k the displacements will run past the end of the buffer. That is all right since block lengths are set to zero for rank >= k and no data will be accessed.
As for your original approach, it is not portable and might not always work. The reason is that you are overwriting the same request handle and you never wait for the sends to complete. The correct implementation is:
MPI_Request request[k];
for (int rank = 0; rank < k; ++rank)
{
MPI_Isend(&input[rank*size], size, MPI_INT, rank, 1, comm, &request[rank]);
}
MPI_Waitall(k, request, MPI_STATUSES_IGNORE);
The most efficient implementation would be to use MPI_Scatter in a subcommunicator:
MPI_Comm subcomm;
MPI_Comm_split(MPI_COMM_WORLD, myrank < k ? 0 : MPI_UNDEFINED, myrank,
&subcomm);
// Now there are k ranks in subcomm
// Perform the scatter in the subcommunicator
if (subcomm != MPI_COMM_NULL)
MPI_Scatter(input, size, MPI_INT, mybuf, size, MPI_INT, 0, subcomm);
The MPI_Comm_split call splits MPI_COMM_WORLD and creates a new communicator from all original ranks less than k. It uses the original rank as the key for ordering the ranks in the new communicator, therefore rank 0 in MPI_COMM_WORLD becomes rank 0 in subcomm. Since MPI_Scatter often performs better than MPI_Scatterv, this is likely the best-performing solution.
Related question: There are n balls, each labeled 0, 1, or 2, in no particular order, and I want to sort them from smallest to largest. Balls:
1, 2, 0, 1, 1, 2, 2, 0, 1, 2, ...
We must solve this the fastest way possible and cannot use the sort() function. I thought of many approaches, like bubble sort, insertion sort, etc., but they are not fast. Is there an algorithm with time complexity O(log n) or O(n)?
given balls list A[] and length n
void sortBalls(int A[], int n)
{
//code here
}
Given the very limited number of item types (0, 1, and 2), you just count the number of occurrences of each. Then to print the "sorted" array, you repeatedly print each label the number of times it occurred. Running time is O(N).
int balls[N] = {...}; // array of balls: initialized to whatever
int sorted_balls[N]; // sorted array of balls (to be set below)
int counts[3] = {}; // count of each label, zero initialized array.
// enumerate over the input array and count each label's occurrence
for (int i = 0; i < N; i++)
{
counts[balls[i]]++;
}
// sort the items by just printing each label the number of times it was counted above
int k = 0;
for (int j = 0; j < 3; j++)
{
for (int x = 0; x < counts[j]; x++)
{
cout << j << ", "; // print
sorted_balls[k] = j; // store into the final sorted array
k++;
}
}
If you have a small number of possible values known in advance, and the value is everything you need to know about the ball (they carry no other attributes), "sorting" becomes equivalent to "counting how many of each value there are". So you generate a histogram - an array from 0 to 2, in your case - go through your values and increase the corresponding count. Then you generate an array of n_0 balls with number 0, n_1 balls with number 1 and n_2 with number 2, and voila, they're sorted.
It's trivially obvious that you cannot go below O(n) - at the very least, you have to look at each value once to count it, and for n values, that's n operations right away.
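For completeness, here is a compact sketch of the histogram approach just described, building the sorted array directly from the counts (my own illustrative version, not code from the question):
#include <vector>

// Counting sort for labels 0, 1, 2 in O(n): count, then emit each label count times.
std::vector<int> sortBalls(const std::vector<int>& balls)
{
    int counts[3] = {0, 0, 0};
    for (int b : balls)
        ++counts[b];                                        // histogram of the three labels
    std::vector<int> sorted;
    sorted.reserve(balls.size());
    for (int label = 0; label < 3; ++label)
        sorted.insert(sorted.end(), counts[label], label);  // append counts[label] copies of label
    return sorted;
}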
I have a list of 100 random integers. Each random integer has a value from 0 to 99. Duplicates are allowed, so the list could be something like
56, 1, 1, 1, 1, 0, 2, 6, 99...
I need to find the smallest integer (>= 0) that is not contained in the list.
My initial solution is this:
vector<int> integerList(100); //list of random integers
...
vector<bool> listedIntegers(101, false);
for (int theInt : integerList)
{
listedIntegers[theInt] = true;
}
int smallestInt;
for (int j = 0; j < 101; j++)
{
if (!listedIntegers[j])
{
smallestInt = j;
break;
}
}
But that requires a secondary array for book-keeping and a second (potentially full) list iteration. I need to perform this task millions of times (the actual application is in a greedy graph coloring algorithm, where I need to find the smallest unused color value with a vertex adjacency list), so I'm wondering if there's a clever way to get the same result without so much overhead?
It's been a year, but ...
One idea that comes to mind is to keep track of the interval(s) of unused values as you iterate the list. To allow efficient lookup, you could keep intervals as tuples in a binary search tree, for example.
So, using your sample data:
56, 1, 1, 1, 1, 0, 2, 6, 99...
You would initially have the unused interval [0..99], and then, as each input value is processed:
56: [0..55][57..99]
1: [0..0][2..55][57..99]
1: no change
1: no change
1: no change
0: [2..55][57..99]
2: [3..55][57..99]
6: [3..5][7..55][57..99]
99: [3..5][7..55][57..98]
Result (lowest value in lowest remaining interval): 3
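A minimal sketch of that interval idea, using std::map as the ordered structure (the map, the function name, and the fixed range 0..99 are my own assumptions, not part of the original answer):
#include <map>
#include <vector>

int smallestUnusedInterval(const std::vector<int>& values)
{
    // Free intervals stored as start -> end; initially the single interval [0..99].
    std::map<int, int> freeIntervals = {{0, 99}};
    for (int v : values)
    {
        auto it = freeIntervals.upper_bound(v);      // first interval starting after v
        if (it == freeIntervals.begin()) continue;   // v lies below every free interval
        --it;                                        // interval with the largest start <= v
        int lo = it->first, hi = it->second;
        if (v > hi) continue;                        // v was already removed earlier
        freeIntervals.erase(it);                     // split [lo..hi] around v
        if (lo < v) freeIntervals[lo] = v - 1;
        if (v < hi) freeIntervals[v + 1] = hi;
    }
    // Lowest value of the lowest remaining interval (100 if every value 0..99 occurred).
    return freeIntervals.empty() ? 100 : freeIntervals.begin()->first;
}
For the sample input this ends with the intervals [3..5][7..55][57..98] and returns 3, matching the trace above.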
I believe there is no faster way to do it. What you can do in your case is to reuse the vector<bool>; you only need one such vector per thread (a small sketch of this follows below).
Though the better approach might be to reconsider the whole algorithm to eliminate this step entirely. Maybe you can update the least unused color at every step of the algorithm?
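A minimal sketch of the reuse idea, assuming the function is called many times from the same thread (the thread_local trick and the names are mine):
#include <vector>

int smallestUnusedReused(const std::vector<int>& integerList)
{
    // One scratch vector per thread, reused across calls; assign() keeps the capacity.
    thread_local std::vector<bool> listedIntegers;
    listedIntegers.assign(101, false);
    for (int v : integerList)
        listedIntegers[v] = true;
    for (int j = 0; j < 101; j++)
        if (!listedIntegers[j])
            return j;
    return 101; // not reached: 100 values in 0..99 cannot cover all of 0..100
}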
Since you have to scan the whole list no matter what, the algorithm you have is already pretty good. The only improvement I can suggest without measuring (that will surely speed things up) is to get rid of your vector<bool>, and replace it with a stack-allocated array of 4 32-bit integers or 2 64-bit integers.
Then you won't have to pay the cost of allocating an array on the heap every time, and you can get the first unused number (the position of the first 0 bit) much faster. To find the word that contains the first 0 bit, you only need to find the first one that isn't the maximum value, and there are bit twiddling hacks you can use to get the first 0 bit in that word very quickly.
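A rough sketch of that bitmask variant with two 64-bit words on the stack, using a GCC/Clang builtin to locate the first zero bit (the specific builtin is my assumption; any equivalent bit hack works):
#include <cstdint>
#include <vector>

int smallestUnusedBitmask(const std::vector<int>& integerList)
{
    uint64_t used[2] = {0, 0};                         // 128 bits cover the possible answers 0..100
    for (int v : integerList)
        used[v >> 6] |= uint64_t(1) << (v & 63);       // mark value v as seen
    for (int w = 0; w < 2; w++)
        if (used[w] != ~uint64_t(0))                   // this word still has a zero bit
            return w * 64 + __builtin_ctzll(~used[w]); // lowest zero bit = lowest set bit of ~word
    return -1;                                         // not reached for 100 values in 0..99
}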
Your program is already very efficient, running in O(n). Only a marginal gain can be found.
One possibility is to divide the range of possible values into blocks of size block, and to register the values not in an array of bool but in an array of int, where each int memorizes one block as a bitmask of the values modulo block.
In practice, we replace a loop of size N by a loop of size N/block plus a loop of size block.
Theoretically, we could select block = sqrt(N) = 10 in order to minimize the quantity N/block + block.
In the program hereafter, blocks of size 8 are selected, assuming that dividing integers by 8 and calculating values modulo 8 should be fast.
However, it is clear that a gain, if any, can be obtained only when the smallest missing value is rather large!
constexpr int N = 100;
int find_min1 (const std::vector<int> &IntegerList) {
constexpr int Size = 13; // ceil(N / block)
constexpr int block = 8;
constexpr int Vmax = 255; // 2^block - 1
int listedBlocks[Size] = {0};
for (int theInt : IntegerList) {
listedBlocks[theInt / block] |= 1 << (theInt % block);
}
for (int j = 0; j < Size; j++) {
if (listedBlocks[j] == Vmax) continue;
int &k = listedBlocks[j];
for (int b = 0; b < block; b++) {
if ((k%2) == 0) return block * j + b;
k /= 2;
}
}
return -1;
}
Potentially you can reduce the last step to O(1) by using some bit manipulation: in your case an __int128, set the corresponding bits in the first loop, then call something like __builtin_clz, or use the appropriate bit hack, to locate the first free bit.
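Applied to find_min1 above, that would mean replacing the second loop's inner bit scan with a builtin. A sketch of such a variant (my own, GCC/Clang-specific):
int find_min2 (const std::vector<int> &IntegerList) {
    constexpr int block = 8;
    constexpr int Size = 13;   // same block layout as find_min1 above
    constexpr int Vmax = 255;  // 2^block - 1
    int listedBlocks[Size] = {0};
    for (int theInt : IntegerList) {
        listedBlocks[theInt / block] |= 1 << (theInt % block);
    }
    for (int j = 0; j < Size; j++) {
        if (listedBlocks[j] == Vmax) continue;
        // lowest zero bit of the block = lowest set bit of its complement
        return block * j + __builtin_ctz(~listedBlocks[j] & Vmax);
    }
    return -1;
}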
The best solution I could find for finding the smallest missing integer in a set is https://codereview.stackexchange.com/a/179042/31480
Here is a C++ version:
int solution(std::vector<int>& A)
{
for (std::vector<int>::size_type i = 0; i != A.size(); i++)
{
while (0 < A[i] && A[i] - 1 < A.size()
&& A[i] != i + 1
&& A[i] != A[A[i] - 1])
{
int j = A[i] - 1;
auto tmp = A[i];
A[i] = A[j];
A[j] = tmp;
}
}
for (std::vector<int>::size_type i = 0; i != A.size(); i++)
{
if (A[i] != i+1)
{
return i + 1;
}
}
return A.size() + 1;
}
I was trying to solve a question and I got into a few obstacles that I failed to solve, starting off here is the question: Codeforces - 817D
Now I tried to brute force it, using a basic get-min and get-max for each subsegment of the array I could generate, then keeping track of them, subtracting them, and adding the differences together to get the final imbalance. This looked good, but it gave me a time limit exceeded, because brute forcing generates n*(n+1)/2 subsegments of the array and n is up to 10^6. I just failed to get around it, and after a couple of hours of not getting any new ideas I decided to look at a solution, which, to be honest, I could not understand at all. Here it is:
#include <bits/stdc++.h>
using namespace std;
#define ll long long
const int INF = INT_MAX;
int a[1000000], l[1000000], r[1000000];
int main(void) {
int i, j, n;
scanf("%d",&n);
for(i = 0; i < n; i++) scanf("%d",&a[i]);
ll ans = 0;
for(j = 0; j < 2; j++) {
vector<pair<int,int>> v;
v.push_back({-1,INF});
for(i = 0; i < n; i++) {
while (v.back().second <= a[i]) v.pop_back();
l[i] = v.back().first;
v.push_back({i,a[i]});
}
v.clear();
v.push_back({n,INF});
for(i = n-1; i >= 0; i--) {
while (v.back().second < a[i]) v.pop_back();
r[i] = v.back().first;
v.push_back({i,a[i]});
}
for(i = 0; i < n; i++) ans += (ll) a[i] * (i-l[i]) * (r[i]-i);
for(i = 0; i < n; i++) a[i] *= -1;
}
cout << ans;
}
I tried tracing it, but I keep wondering why the vector was used. The only idea I have is that the author wanted to use the vector as a stack, since they act (almost) the same, but then I don't even know why a stack is needed here, and the expression ans += (ll) a[i] * (i-l[i]) * (r[i]-i); really confuses me because I don't see where it comes from.
Well, that's a beast of a calculation. I must confess that I don't understand it completely either. The problem with the brute-force solution is that you have to calculate the same values over and over again.
In a slightly modified example, you calculate the following values for an input of 2, 4, 1 (I reordered it by "distance"):
[2, *, *] (from index 0 to index 0), imbalance value is 0; i_min = 0, i_max = 0
[*, 4, *] (from index 1 to index 1), imbalance value is 0; i_min = 1, i_max = 1
[*, *, 1] (from index 2 to index 2), imbalance value is 0; i_min = 2, i_max = 2
[2, 4, *] (from index 0 to index 1), imbalance value is 2; i_min = 0, i_max = 1
[*, 4, 1] (from index 1 to index 2), imbalance value is 3; i_min = 2, i_max = 1
[2, 4, 1] (from index 0 to index 2), imbalance value is 3; i_min = 2, i_max = 1
where i_min and i_max are the indices of the element with the minimum and maximum value.
For a better visual understanding, I wrote the complete array, but hid the unused values with *
So in the last case [2, 4, 1], brute-force looks for the minimum value over all values, which is not necessary, because you already calculated the values for a sub-space of the problem, by calculating [2,4] and [4,1]. But comparing only the values is not enough, you also need to keep track of the indices of the minimum and maximum element, because those can be reused in the next step, when calculating [2, 4, 1].
The idea behind this is a concept called dynamic programming, where results from a calculation are stored to be used again. As is often the case, you have to choose between speed and memory consumption.
So to come back to your question, here is what I understood:
the arrays l and r store, for each element, the index of the nearest greater element to its left (l) and to its right (r)
vector v is used to find the last number (and its index) that is greater than the current one (a[i]). It keeps track of rising number series; e.g., for the input 5,3,4, at first the 5 is stored, then the 3, and when the 4 comes, the 3 is popped but the index of the 5 is needed (to be stored in l[2])
then there is this fancy calculation (ans += (ll) a[i] * (i-l[i]) * (r[i]-i)). The stored indices of the maximum (and in the second run the minimum) elements are combined with the value a[i]: the product (i-l[i]) * (r[i]-i) counts the subsegments in which a[i] is the maximum, because there are i-l[i] valid choices for the left border and r[i]-i valid choices for the right border, so every element contributes its value once for each subsegment where it is the maximum (see the worked example below)
at last, all values in the array a are multiplied by -1, which means the old maximums are now the minimums, and the calculation is done again (2nd run of the outer for-loop over j)
This last step (multiplying a by -1) and the outer for-loop over j are not strictly necessary (you could write a separate pass for the minima), but negating the values is an elegant way to reuse the code.
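To make the counting argument concrete, here is the first (maximum) pass worked out by hand for the input 2, 4, 1 (my own trace of the code above): the left pass yields l = [-1, -1, 1] and the right pass yields r = [1, 3, 3], so the contributions are 2*(0-(-1))*(1-0) = 2, 4*(1-(-1))*(3-1) = 16 and 1*(2-1)*(3-2) = 1, summing to 19. Adding the maxima of all six subsegments directly gives the same total: 2 + 4 + 1 + 4 + 4 + 4 = 19.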
Hope this helps a bit.
After doing calculations to multiply a matrix with a vector using a Cartesian topology, I got the following processes with their ranks and vectors:
P0 (process with rank = 0) =[2 , 9].
P1 (process with rank = 1) =[2 , 3]
P2 (process with rank = 2) =[1 , 9]
P3 (process with rank = 3) =[4 , 6].
Now I need to sum the elements of the even-rank processes and the odd-rank ones separately, like this:
temp1 = [3 , 18]
temp2 = [6 , 9]
and then gather the results in a different vector, like this:
result = [3 , 18 , 6 , 9]
My attempt to do it is to use MPI_Reduce and then MPI_Gather, like this:
// Previous code
double* temp1 , *temp2;
if(myrank %2 == 0){
BOOLEAN flag = Allocate_vector(&temp1 ,local_m); // function to allocate space for vectors
MPI_Reduce(local_y, temp1, local_n, MPI_DOUBLE, MPI_SUM, 0 , comm);
MPI_Gather(temp1, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE,0, comm);
free(temp1);
}
else{
Allocate_vector(&temp2 ,local_m);
MPI_Reduce(local_y, temp2, local_n , MPI_DOUBLE, MPI_SUM, 0 , comm);
MPI_Gather(temp2, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0,comm);
free(temp2);
}
But the answer is not correct. It seems that the code sums all elements of the even and odd processes together and then gives a segmentation fault error:
Wrong_result = [21 15 0 0]
and this error
*** Error in `./test': double free or corruption (fasttop): 0x00000000013c7510 ***
*** Error in `./test': double free or corruption (fasttop): 0x0000000001605b60 ***
It won't work the way you are trying to do it. To perform reduction over the elements of a subset of processes, you have to create a subcommunicator for them. In your case, the odd and the even processes share the same comm, therefore the operations are not over the two separate groups of processes but rather over the combined group.
You should use MPI_Comm_split to perform a split, perform the reduction using the two new subcommunicators, and finally have rank 0 in each subcommunicator (let's call those leaders) participate in the gather over another subcommunicator that contains those two only:
// Make sure rank is set accordingly
MPI_Comm_rank(comm, &rank);
// Split even and odd ranks in separate subcommunicators
MPI_Comm subcomm;
MPI_Comm_split(comm, rank % 2, 0, &subcomm);
// Perform the reduction in each separate group
double *temp;
Allocate_vector(&temp, local_n);
MPI_Reduce(local_y, temp, local_n , MPI_DOUBLE, MPI_SUM, 0, subcomm);
// Find out our rank in subcomm
int subrank;
MPI_Comm_rank(subcomm, &subrank);
// At this point, we no longer need subcomm. Free it and reuse the variable.
MPI_Comm_free(&subcomm);
// Separate both group leaders (rank 0) into their own subcommunicator
MPI_Comm_split(comm, subrank == 0 ? 0 : MPI_UNDEFINED, 0, &subcomm);
if (subcomm != MPI_COMM_NULL) {
MPI_Gather(temp, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, subcomm);
MPI_Comm_free(&subcomm);
}
// Free resources
free(temp);
The result will be in gResult of rank 0 in the latter subcomm, which happens to be rank 0 in comm because of the way the splits are performed.
Not as simple as expected, I guess, but that is the price of having convenient collective operations in MPI.
On a side note, in the code shown you are allocating temp1 and temp2 to be of length local_m, while in all collective calls the length is specified as local_n. If it happens that local_n > local_m, then heap corruption will occur.
I've been learning how to implement MPI over the past couple of weeks and I'm having a very hard time to understand how to set up some of the input arguments for MPI_Allgatherv. I'll use a toy example because I need to take baby steps here. Some of the research I've done is listed at the end of this post (including my previous question, which led me to this question). First, a quick summary of what I'm trying to accomplish:
--Summary-- I'm taking a std::vector A, having multiple processors work on different parts of A, and then taking the updated parts of A and redistributing those updates to all processors. Therefore, all processors start with copies of A, update portions of A, and end with fully updated copies of A.--End--
Let's say I have a std::vector < double > containing 5 elements called "mydata" initialized as follows:
for (int i = 0; i < 5; i++)
{
mydata[i] = (i+1)*1.1;
}
Now let's say I'm running my code on 2 nodes (int tot_proc = 2). I identify the "current" node using "int id_proc," therefore, the root processor has id_proc = 0. Since the number of elements in mydata is odd, I cannot evenly distribute the work between processors. Let's say that I always break the work up as follows:
if (id_proc < tot_proc - 1)
{
//handle mydata.size()/tot_proc elements
}
else
{
//handle whatever is left over
}
In this example, that means:
id_proc = 0 will work on mydata[0] and mydata[1] (2 elements, since 5/2 = 2) … and … id_proc = 1 will work on mydata[2] - mydata[4] (3 elements, since 5/2 + 5%2 = 3)
Once each processor has worked on their respective portions of mydata, I want to use Allgatherv to merge the results together so that mydata on each processor contains all of the updated values. We know Allgatherv takes 8 arguments: (1) the starting address of the elements/data being sent, (2) the number of elements being sent, (3) the type of data being sent, which is MPI_DOUBLE in this example, (4) the address of the location you want the data to be received (no mention of "starting" address), (5) the number of elements being received, (6) the "displacements" in memory relative to the receiving location in argument #4, (7) the type of data being received, again, MPI_DOUBLE, and (8) the communicator you're using, which in my case is simply MPI_COMM_WORLD.
Now here's where the confusion begins. Since processor 0 worked on the first two elements, and processor 1 worked on the last 3 elements, then processor 0 will need to SEND the first two elements, and processor 1 will need to SEND the last 3 elements. To me, this suggests that the first two arguments of Allgatherv should be:
Processor 0: MPI_Allgatherv(&mydata[0],2,…
Processor 1: MPI_Allgatherv(&mydata[2],3,…
(Q1) Am I right about that? If so, my next question is in regard to the format of argument 2. Let's say I create a std::vector < int > sendcount such that sendcount[0] = 2, and sendcount[1] = 3.
(Q2) Does Argument 2 require the reference to the first location of sendcount, or do I need to send the reference to the location relevant to each processor? In other words, which of these should I do:
Q2 - OPTION 1
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[0],…
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[0],…
Q2 - OPTION 2
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[id_proc], … (here id_proc = 0)
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[id_proc], … (here id_proc = 1)
...On to Argument 4. Since I am collecting different sections of mydata back into itself, I suspect that this argument will look similar to Argument 1. i.e. it should be something like &mydata[?]. (Q3) Can this argument simply be a reference to the beginning of mydata (i.e. &mydata[0]), or do I have to change the index the way I did for Argument 1? (Q4) Imagine I had used 3 processors. This would mean that Processor 1 would be sending mydata[2] and mydata[3] which are in "the middle" of the vector. Since the vector's elements are contiguous, then the data that Processor 1 is receiving has to be split (some goes before, and mydata[4] goes after). Do I have to account for that split in this argument, and if so, how?
...Slightly more confusing to me is Argument 5 but I had an idea this morning. Using the toy example: if Processor 0 is sending 2 elements, then it will be receiving 3, correct? Similarly, if Processor 1 is sending 3 elements, then it is receiving 2. (Q5) So, if I were to create a std::vector < int > recvcount, couldn't I just initialize it as:
for (int i = 0; i < tot_proc; i++)
{
recvcount[i] = mydata.size() - sendcount[i];
}
And if that is true, then do I pass it to Allgatherv as &recvcount[0] or &recvcount[id_proc] (similar to Argument 2)?
Finally, Argument 6. I know this is tied to my input for Argument 4. My guess is the following: if I were to pass &mydata[0] as Argument 4 on all processors, then the displacements are the number of positions in memory that I need to move in order to get to the first location where data actually needs to be received. For example,
Processor 0: MPI_Allgatherv( … , &mydata[0], … , 2, … );
Processor 1: MPI_Allgatherv( … , &mydata[0], … , 0, … );
(Q5) Am I right in thinking that the above two lines means "Processor 0 will receive data beginning at location &mydata[0+2]. Processor 1 will receive data beginning at location &mydata[0+0]." ?? And what happens when the data needs to be split like in Q4? Finally, since I am collecting portions of a vector back into itself (replacing mydata with updated mydata by overwriting it), then this tells me that all processors other than the root process will be receiving data beginning at &mydata[0]. (Q6) If this is true, then shouldn't the displacements be 0 for all processors that are not the root?
Some of the links I've read:
Difference between MPI_allgather and MPI_allgatherv
Difference between MPI_Allgather and MPI_Alltoall functions?
Problem with MPI_Gatherv for std::vector
C++: Using MPI's gatherv to concatenate vectors of differing lengths
http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Allgatherv.html
https://computing.llnl.gov/tutorials/mpi/#Routine_Arguments
My previous post on stackoverflow:
MPI C++ matrix addition, function arguments, and function returns
Most tutorials, etc, that I've read just gloss over Allgatherv.
Part of the confusion here is that you're trying to do an in-place gather; you are trying to send from and receive into the same array. If you're doing that, you should use the MPI_IN_PLACE option, in which case you don't explicitly specify the send location or count. Those are there for the case where you are sending from a different buffer than the one you're receiving into, but in-place gathers are somewhat more constrained.
So this works:
#include <iostream>
#include <vector>
#include <mpi.h>
int main(int argc, char **argv) {
int size, rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (size < 2) {
std::cerr << "This demo requires at least 2 procs." << std::endl;
MPI_Finalize();
return 1;
}
int datasize = 2*size + 1;
std::vector<int> data(datasize);
/* break up the elements */
int *counts = new int[size];
int *disps = new int[size];
int pertask = datasize/size;
for (int i=0; i<size-1; i++)
counts[i] = pertask;
counts[size-1] = datasize - pertask*(size-1);
disps[0] = 0;
for (int i=1; i<size; i++)
disps[i] = disps[i-1] + counts[i-1];
int mystart = disps[rank];
int mycount = counts[rank];
int myend = mystart + mycount - 1;
/* everyone initialize our data */
for (int i=mystart; i<=myend; i++)
data[i] = 0;
int nsteps = size;
for (int step = 0; step < nsteps; step++ ) {
for (int i=mystart; i<=myend; i++)
data[i] += rank;
MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
&(data[0]), counts, disps, MPI_INT, MPI_COMM_WORLD);
if (rank == step) {
std::cout << "Rank " << rank << " has array: [";
for (int i=0; i<datasize-1; i++)
std::cout << data[i] << ", ";
std::cout << data[datasize-1] << "]" << std::endl;
}
}
delete [] disps;
delete [] counts;
MPI_Finalize();
return 0;
}
Running gives
$ mpirun -np 3 ./allgatherv
Rank 0 has array: [0, 0, 1, 1, 2, 2, 2]
Rank 1 has array: [0, 0, 2, 2, 4, 4, 4]
Rank 2 has array: [0, 0, 3, 3, 6, 6, 6]