Understanding MPI_Allgatherv in plain English - C++

I've been learning how to use MPI over the past couple of weeks and I'm having a very hard time understanding how to set up some of the input arguments for MPI_Allgatherv. I'll use a toy example because I need to take baby steps here. Some of the research I've done is listed at the end of this post (including my previous question, which led me to this one). First, a quick summary of what I'm trying to accomplish:
--Summary-- I'm taking a std::vector A, having multiple processors work on different parts of A, and then taking the updated parts of A and redistributing those updates to all processors. Therefore, all processors start with copies of A, update portions of A, and end with fully updated copies of A.--End--
Let's say I have a std::vector < double > containing 5 elements called "mydata" initialized as follows:
for (int i = 0; i < 5; i++)
{
mydata[i] = (i+1)*1.1;
}
Now let's say I'm running my code on 2 nodes (int tot_proc = 2). I identify the "current" node using "int id_proc," therefore, the root processor has id_proc = 0. Since the number of elements in mydata is odd, I cannot evenly distribute the work between processors. Let's say that I always break the work up as follows:
if (id_proc < tot_proc - 1)
{
//handle mydata.size()/tot_proc elements
}
else
{
//handle whatever is left over
}
In this example, that means:
id_proc = 0 will work on mydata[0] and mydata[1] (2 elements, since 5/2 = 2) … and … id_proc = 1 will work on mydata[2] - mydata[4] (3 elements, since 5/2 + 5%2 = 3)
Once each processor has worked on their respective portions of mydata, I want to use Allgatherv to merge the results together so that mydata on each processor contains all of the updated values. We know Allgatherv takes 8 arguments: (1) the starting address of the elements/data being sent, (2) the number of elements being sent, (3) the type of data being sent, which is MPI_DOUBLE in this example, (4) the address of the location you want the data to be received (no mention of "starting" address), (5) the number of elements being received, (6) the "displacements" in memory relative to the receiving location in argument #4, (7) the type of data being received, again, MPI_DOUBLE, and (8) the communicator you're using, which in my case is simply MPI_COMM_WORLD.
Now here's where the confusion begins. Since processor 0 worked on the first two elements, and processor 1 worked on the last 3 elements, then processor 0 will need to SEND the first two elements, and processor 1 will need to SEND the last 3 elements. To me, this suggests that the first two arguments of Allgatherv should be:
Processor 0: MPI_Allgatherv(&mydata[0],2,…
Processor 1: MPI_Allgatherv(&mydata[2],3,…
(Q1) Am I right about that? If so, my next question is in regard to the format of argument 2. Let's say I create a std::vector < int > sendcount such that sendcount[0] = 2, and sendcount[1] = 3.
(Q2) Does Argument 2 require the reference to the first location of sendcount, or do I need to send the reference to the location relevant to each processor? In other words, which of these should I do:
Q2 - OPTION 1
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[0],…
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[0],…
Q2 - OPTION 2
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[id_proc], … (here id_proc = 0)
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[id_proc], … (here id_proc = 1)
...On to Argument 4. Since I am collecting different sections of mydata back into itself, I suspect that this argument will look similar to Argument 1. i.e. it should be something like &mydata[?]. (Q3) Can this argument simply be a reference to the beginning of mydata (i.e. &mydata[0]), or do I have to change the index the way I did for Argument 1? (Q4) Imagine I had used 3 processors. This would mean that Processor 1 would be sending mydata[2] and mydata[3] which are in "the middle" of the vector. Since the vector's elements are contiguous, then the data that Processor 1 is receiving has to be split (some goes before, and mydata[4] goes after). Do I have to account for that split in this argument, and if so, how?
...Slightly more confusing to me is Argument 5 but I had an idea this morning. Using the toy example: if Processor 0 is sending 2 elements, then it will be receiving 3, correct? Similarly, if Processor 1 is sending 3 elements, then it is receiving 2. (Q5) So, if I were to create a std::vector < int > recvcount, couldn't I just initialize it as:
for (int i = 0; i < tot_proc; i++)
{
recvcount[i] = mydata.size() - sendcount[i];
}
And if that is true, then do I pass it to Allgatherv as &recvcount[0] or &recvcount[id_proc] (similar to Argument 2)?
Finally, Argument 6. I know this is tied to my input for Argument 4. My guess is the following: if I were to pass &mydata[0] as Argument 4 on all processors, then the displacements are the number of positions in memory that I need to move in order to get to the first location where data actually needs to be received. For example,
Processor 0: MPI_Allgatherv( … , &mydata[0], … , 2, … );
Processor 1: MPI_Allgatherv( … , &mydata[0], … , 0, … );
(Q6) Am I right in thinking that the above two lines mean "Processor 0 will receive data beginning at location &mydata[0+2]. Processor 1 will receive data beginning at location &mydata[0+0]."? And what happens when the data needs to be split like in Q4? Finally, since I am collecting portions of a vector back into itself (replacing mydata with updated mydata by overwriting it), this tells me that all processors other than the root process will be receiving data beginning at &mydata[0]. (Q7) If this is true, then shouldn't the displacements be 0 for all processors that are not the root?
Some of the links I've read:
Difference between MPI_allgather and MPI_allgatherv
Difference between MPI_Allgather and MPI_Alltoall functions?
Problem with MPI_Gatherv for std::vector
C++: Using MPI's gatherv to concatenate vectors of differing lengths
http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Allgatherv.html
https://computing.llnl.gov/tutorials/mpi/#Routine_Arguments
My previous post on stackoverflow:
MPI C++ matrix addition, function arguments, and function returns
Most tutorials, etc, that I've read just gloss over Allgatherv.

Part of the confusion here is that you're trying to do an in-place gather; you are trying to send from and receive into the same array. If you're doing that, you should use the MPI_IN_PLACE option, in which case you don't explicitly specify the send location or count. Those arguments exist for the case where you send from a different buffer than you receive into; in-place gathers are somewhat more constrained.
So this works:
#include <iostream>
#include <vector>
#include <mpi.h>
int main(int argc, char **argv) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (size < 2) {
        std::cerr << "This demo requires at least 2 procs." << std::endl;
        MPI_Finalize();
        return 1;
    }
    int datasize = 2*size + 1;
    std::vector<int> data(datasize);
    /* break up the elements; the last rank takes the remainder */
    int *counts = new int[size];
    int *disps = new int[size];
    int pertask = datasize/size;
    for (int i=0; i<size-1; i++)
        counts[i] = pertask;
    counts[size-1] = datasize - pertask*(size-1);
    disps[0] = 0;
    for (int i=1; i<size; i++)
        disps[i] = disps[i-1] + counts[i-1];
    int mystart = disps[rank];
    int mycount = counts[rank];
    int myend = mystart + mycount - 1;
    /* each rank initializes its own slice of the data */
    for (int i=mystart; i<=myend; i++)
        data[i] = 0;
    int nsteps = size;
    for (int step = 0; step < nsteps; step++ ) {
        /* each rank updates only its own slice ... */
        for (int i=mystart; i<=myend; i++)
            data[i] += rank;
        /* ... then every rank receives every other rank's slice in place */
        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                       &(data[0]), counts, disps, MPI_INT, MPI_COMM_WORLD);
        if (rank == step) {
            std::cout << "Rank " << rank << " has array: [";
            for (int i=0; i<datasize-1; i++)
                std::cout << data[i] << ", ";
            std::cout << data[datasize-1] << "]" << std::endl;
        }
    }
    delete [] disps;
    delete [] counts;
    MPI_Finalize();
    return 0;
}
Running gives
$ mpirun -np 3 ./allgatherv
Rank 0 has array: [0, 0, 1, 1, 2, 2, 2]
Rank 1 has array: [0, 0, 2, 2, 4, 4, 4]
Rank 2 has array: [0, 0, 3, 3, 6, 6, 6]
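To tie this back to the toy example in the question (5 elements, 2 ranks), here is a minimal sketch of how the count and displacement arrays could be built without any MPI calls. Layout and make_layout are hypothetical names of my own, not part of MPI, and the split rule (the last rank takes whatever is left over) is simply the one from the question.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type (not part of MPI): how n elements are split
// across nprocs ranks.
struct Layout {
    std::vector<int> counts;  // counts[r] = number of elements owned by rank r
    std::vector<int> displs;  // displs[r] = index in the full array where rank r's block starts
};

// Every rank except the last owns n / nprocs elements; the last rank
// takes whatever is left over, as in the question.
Layout make_layout(int n, int nprocs) {
    assert(n >= 0 && nprocs > 0);
    Layout l{std::vector<int>(nprocs), std::vector<int>(nprocs)};
    const int per_rank = n / nprocs;
    for (int r = 0; r < nprocs; ++r) {
        l.counts[r] = (r < nprocs - 1) ? per_rank : n - per_rank * (nprocs - 1);
        l.displs[r] = r * per_rank;  // all earlier ranks own exactly per_rank elements
    }
    return l;
}
```

For make_layout(5, 2) this gives counts = {2, 3} and displs = {0, 2}. Every rank passes the same two whole arrays (counts.data() and displs.data()) as arguments 5 and 6: entry r says how many elements arrive from rank r and where they land relative to the receive buffer, so the arrays do not depend on which rank is calling. Argument 2, by contrast, is a single int, not a pointer into an array.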


Tell me the Input in which this code will give incorrect Output

There's a problem that I have to solve in C++. I've written the whole code and it works on the given test cases, but when I submit it, it says wrong answer. I can't understand why it is showing wrong answer.
Please tell me an input for which the given code produces incorrect output, so I can modify my code further.
Shrink The Array
You are given an array of positive integers A[] of length L. If A[i] and A[i+1] both are equal replace them by one element with value A[i]+1. Find out the minimum possible length of the array after performing such operation any number of times.
Note:
After each such operation, the length of the array will decrease by one and elements are renumerated accordingly.
Input format:
The first line contains a single integer L, denoting the initial length of the array A.
The second line contains L space integers A[i] − elements of array A[].
Output format:
Print an integer - the minimum possible length you can get after performing the operation described above any number of times.
Example:
Input
7
3 3 4 4 4 3 3
Output
2
Sample test case explanation
3 3 4 4 4 3 3 -> 4 4 4 4 3 3 -> 4 4 4 4 4 -> 5 4 4 4 -> 5 5 4 -> 6 4.
Thus the length of the array is 2.
My code:
#include <bits/stdc++.h>
using namespace std;
int main()
{
bool end = false;
int l;
cin >> l;
int arr[l];
for(int i = 0; i < l; i++){
cin >> arr[i];
}
int len = l, i = 0;
while(i < len - 1){
if(arr[i] == arr[i + 1]){
arr[i] = arr[i] + 1;
if((i + 1) <= (len - 1)){
for(int j = i + 1; j < len - 1; j++){
arr[j] = arr[j + 1];
}
}
len--;
i = 0;
}
else{
i++;
}
}
cout << len;
return 0;
}
THANK YOU
As noted in the comments: Just picking the first two neighbours that have the same value and combining those will lead to suboptimal results.
You will need to investigate which two neighbours you should combine somehow. When you have combined two neighbours you then need to investigate which neighbours to combine on the next level. The number of combinations may become plentiful.
One way to solve this is through recursion.
If you've followed the advice in the comments, you now have all your input data in std::vector<unsigned> A(L).
You can now do std::cout << solve(A) << '\n'; where solve has the signature size_t solve(const std::vector<unsigned>& A) and is described below:
1. Find the indices of all neighbour pairs in A that have the same value and put the indices in a std::vector<size_t> neighbours. Example: If A contains 2 2 2 3, put 0 and 1 in neighbours.
2. If no neighbours are found (neighbours.empty() == true), return A.size().
3. Define a minimum variable and initialize it with A.size() - 1, which is the worst result you know you can get at this point. So, size_t minimum = A.size() - 1;
4. Loop over all indices stored in neighbours (for(size_t idx : neighbours)):
   - Copy A into a new std::vector<unsigned>. Let's call it cpy.
   - Increase cpy[idx] by one and remove cpy[idx+1].
   - Call size_t result = solve(cpy). This is where recursion comes in.
   - Is result less than minimum? If so, assign result to minimum.
5. Return minimum.
I don't think I ruined the programming exercise by providing one algorithm for solving this. It should still have plenty of things to deal with. Recursion won't be possible with big data etc.
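For reference, the steps described above could be written out roughly as follows. This is a direct, unoptimized transcription (it tries every merge order, so it takes exponential time), meant only as a sketch for small inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Brute-force search over every possible order of merges.
std::size_t solve(const std::vector<unsigned>& A) {
    // Indices i for which A[i] == A[i + 1].
    std::vector<std::size_t> neighbours;
    for (std::size_t i = 0; i + 1 < A.size(); ++i)
        if (A[i] == A[i + 1]) neighbours.push_back(i);

    // Nothing left to merge: the current length is final.
    if (neighbours.empty()) return A.size();

    assert(A.size() >= 2);  // an equal pair exists, so there are at least two elements
    std::size_t minimum = A.size() - 1;  // worst case: exactly one merge is possible
    for (std::size_t idx : neighbours) {
        std::vector<unsigned> cpy(A);
        ++cpy[idx];                                                      // merged value
        cpy.erase(cpy.begin() + static_cast<std::ptrdiff_t>(idx) + 1);   // drop the partner
        minimum = std::min(minimum, solve(cpy));
    }
    return minimum;
}
```

For the input 2 2 2 3 this returns 2 (merge the last two 2s first: 2 3 3, then 2 4), whereas always merging the first equal pair gets stuck at 3 2 3 with length 3. That also makes 2 2 2 3 an input on which the code in the question prints the wrong answer.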

Fastest way to find smallest missing integer from list of integers

I have a list of 100 random integers. Each random integer has a value from 0 to 99. Duplicates are allowed, so the list could be something like
56, 1, 1, 1, 1, 0, 2, 6, 99...
I need to find the smallest integer (>= 0) that is not contained in the list.
My initial solution is this:
vector<int> integerList(100); //list of random integers
...
vector<bool> listedIntegers(101, false);
for (int theInt : integerList)
{
listedIntegers[theInt] = true;
}
int smallestInt;
for (int j = 0; j < 101; j++)
{
if (!listedIntegers[j])
{
smallestInt = j;
break;
}
}
But that requires a secondary array for book-keeping and a second (potentially full) list iteration. I need to perform this task millions of times (the actual application is in a greedy graph coloring algorithm, where I need to find the smallest unused color value with a vertex adjacency list), so I'm wondering if there's a clever way to get the same result without so much overhead?
It's been a year, but ...
One idea that comes to mind is to keep track of the interval(s) of unused values as you iterate the list. To allow efficient lookup, you could keep intervals as tuples in a binary search tree, for example.
So, using your sample data:
56, 1, 1, 1, 1, 0, 2, 6, 99...
You would initially have the unused interval [0..99], and then, as each input value is processed:
56: [0..55][57..99]
1: [0..0][2..55][57..99]
1: no change
1: no change
1: no change
0: [2..55][57..99]
2: [3..55][57..99]
6: [3..5][7..55][57..99]
99: [3..5][7..55][57..98]
Result (lowest value in lowest remaining interval): 3
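A minimal sketch of that interval idea, assuming std::map as the binary search tree (key = start of an unused interval, value = its inclusive end). The function name and the limit parameter are my own, and one detail differs from the walk-through above: the initial interval is [0..limit] rather than [0..limit-1], so the map can never become empty even if every value from 0 to limit-1 is listed.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Smallest integer >= 0 that does not occur in values; every value must be in [0, limit).
int smallest_missing(const std::vector<int>& values, int limit) {
    assert(limit > 0);
    std::map<int, int> unused;  // start -> inclusive end of each unused interval
    unused[0] = limit;          // limit itself is never listed, so it always stays unused
    for (int v : values) {
        assert(0 <= v && v < limit);
        auto it = unused.upper_bound(v);     // first interval that starts after v
        if (it == unused.begin()) continue;  // no interval starts at or before v
        --it;                                // the only interval that could contain v
        const int start = it->first;
        const int end = it->second;
        if (v > end) continue;               // v was already removed: no change
        unused.erase(it);
        if (start < v) unused[start] = v - 1;  // part of the interval left of v
        if (v < end) unused[v + 1] = end;      // part of the interval right of v
    }
    return unused.begin()->first;  // lowest value in the lowest remaining interval
}
```

On the sample data (56, 1, 1, 1, 1, 0, 2, 6, 99 with limit 100) this returns 3, matching the trace above. Each update costs O(log m) for m intervals, so whether this beats a plain flag array for only 100 values would need measuring.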
I believe there is no faster way to do it. What you can do in your case is reuse the vector<bool>; you need just one such vector per thread.
Though the better approach might be to reconsider the whole algorithm to eliminate this step altogether. Maybe you can update the least unused color on every step of the algorithm?
Since you have to scan the whole list no matter what, the algorithm you have is already pretty good. The only improvement I can suggest without measuring (that will surely speed things up) is to get rid of your vector<bool>, and replace it with a stack-allocated array of 4 32-bit integers or 2 64-bit integers.
Then you won't have to pay the cost of allocating an array on the heap every time, and you can get the first unused number (the position of the first 0 bit) much faster. To find the word that contains the first 0 bit, you only need to find the first one that isn't the maximum value, and there are bit twiddling hacks you can use to get the first 0 bit in that word very quickly.
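Here is a rough sketch of that suggestion using two 64-bit words on the stack. It assumes GCC or Clang, because __builtin_ctzll is a compiler builtin rather than standard C++ (with C++20, std::countr_one from <bit> could play the same role). The function name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Smallest integer >= 0 not present in values; every value must be in [0, 100).
int smallest_missing_bits(const std::vector<int>& values) {
    std::uint64_t words[2] = {0, 0};  // 128 flag bits, no heap allocation
    for (int v : values) {
        assert(0 <= v && v < 100);
        words[v / 64] |= std::uint64_t{1} << (v % 64);
    }
    for (int w = 0; w < 2; ++w) {
        if (words[w] != ~std::uint64_t{0}) {
            // The lowest 0 bit of words[w] is the lowest 1 bit of its complement.
            // The complement is nonzero here, so the builtin is well defined.
            return 64 * w + __builtin_ctzll(~words[w]);
        }
    }
    return 128;  // not reachable while all values are below 100
}
```

The second loop inspects at most two words, so finding the first unused number no longer depends on how large that number is.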
Your program is already very efficient, in O(n). Only a marginal gain can be found.
One possibility is to divide the range of possible values into blocks of size block, and to register the values not in an array of bool but in an array of int, where each int records the values modulo block as individual bits.
In practice, we replace a loop of size N by a loop of size N/block plus a loop of size block.
Theoretically, we could select block = sqrt(N) = 10 in order to minimize the quantity N/block + block.
In the program hereafter, blocks of size 8 are selected, assuming that dividing integers by 8 and calculating values modulo 8 should be fast.
However, it is clear that a gain, if any, can be obtained only when the smallest missing value is rather large!
#include <vector>
constexpr int N = 100;
int find_min1 (const std::vector<int> &IntegerList) {
    constexpr int Size = 13; // ceil(N / block)
    constexpr int block = 8;
    constexpr int Vmax = 255; // 2^block - 1: all bits of a block set
    int listedBlocks[Size] = {0};
    // Mark each value as one bit inside its block.
    for (int theInt : IntegerList) {
        listedBlocks[theInt / block] |= 1 << (theInt % block);
    }
    for (int j = 0; j < Size; j++) {
        if (listedBlocks[j] == Vmax) continue; // block is full, skip it
        // First block with a free slot: scan its bits from the lowest one.
        int &k = listedBlocks[j];
        for (int b = 0; b < block; b++) {
            if ((k%2) == 0) return block * j + b;
            k /= 2;
        }
    }
    return -1;
}
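Potentially you can reduce the last step to O(1) by using some bit manipulation: in your case, keep the flags in an __int128, set the corresponding bits in the first loop, and then find the first zero bit by calling something like __builtin_ctzll on the inverted 64-bit halves (or __builtin_clz, if you store the bits starting from the most significant end), or use the appropriate bit hack.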
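The best solution I could find for finding the smallest missing integer in a set is https://codereview.stackexchange.com/a/179042/31480
Here is a C++ version. Note that it returns the smallest missing positive integer (>= 1) and rearranges A in place.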
int solution(std::vector<int>& A)
{
for (std::vector<int>::size_type i = 0; i != A.size(); i++)
{
while (0 < A[i] && A[i] - 1 < A.size()
&& A[i] != i + 1
&& A[i] != A[A[i] - 1])
{
int j = A[i] - 1;
auto tmp = A[i];
A[i] = A[j];
A[j] = tmp;
}
}
for (std::vector<int>::size_type i = 0; i != A.size(); i++)
{
if (A[i] != i+1)
{
return i + 1;
}
}
return A.size() + 1;
}

Matrix Exponentiation for calculating number of routes possible

Recently I came across this problem in the Iarcs website. Here is the problem statement:
It is well known that the routing algorithm used on the Internet is
highly non-optimal. A "hop", in Internet jargon, is a pair of nodes
that are directly connected - by a cable or a microwave link or
whatever. The number of hops that a packet may take in going from one
node to another may be far more than the minimum required.
But the routing algorithm used by the Siruseri Network company is
worse. Here, a packet sent from one node to another could even go
through the same node twice or even traverse the same hop twice before
it eventually finds its way to its destination. Sometimes a packet
even goes through the destination more than once before it is
considered "delivered". Suppose the network in Siruseri consisted of
the following nodes and cables: Figure
There are 5 nodes and 8 cable links. Note that a pair of nodes may be
connected by more than one link. These are considered to be different
hops. All links are bidirectional. A packet from node 1 to node 5 may,
for example, travel as follows: 1 to 2, 2 to 1, 1 to 3, 3 to 2, 2 to
1, 1 to 4, 4 to 5, 5 to 4, 4 to 5. This routing is of length 9 (the
number of hops is the length of a given routing). We are interested in
counting the number of different routings from a given source to a
target that are of a given length.
For example, the number of routings from 1 to 2 of length 3 are 7.
They are as follows (separated by ;): 1 to 2, 2 to 1 and 1 to 2; 1 to
3, 3 to 1 and 1 to 2; 1 to 4, 4 to 1 and 1 to 2; 1 to 5, 5 to 1 and 1
to 2; 1 to 4, 4 to 3 (via the left cable) and 3 to 2; 1 to 4, 4 to 3
(via the right cable) and 3 to 2; 1 to 2, 2 to 3 and 3 to 2.
You will be given a description of the network at Siruseri as well as
a source, a target and the number of hops, and your task is to
determine the number of routings from the source to the target which
have the given number of hops. The answer is to be reported modulo
42373.
So, as discussed in this thread, the solution is to raise the given matrix to the power k, where k is the number of hops given.
Here I did the same:
#include <iostream>
#include <vector>
std::vector<std::vector<int> >MatrixMultiplication(std::vector<std::vector<int> >matrix1,std::vector<std::vector<int> >matrix2,int n){
std::vector<std::vector<int> >retMatrix(n,std::vector<int>(n));
for(int i=0;i<n;i++){
for(int j=0;j<n;j++){
for(int k=0;k<n;k++){
retMatrix[i][j] = retMatrix[i][j] + matrix1[i][k] * matrix2[k][j];
}
}
}
return retMatrix;
}
std::vector<std::vector<int> >MatrixExponentiation(std::vector<std::vector<int> >matrix,int n,int power){
if(power == 0 || power == 1){
return matrix;
}
if(power%2 == 0){
return MatrixExponentiation(MatrixMultiplication(matrix,matrix,n),n,power/2);
}else{
return MatrixMultiplication(matrix,MatrixExponentiation(MatrixMultiplication(matrix,matrix,n),n,(power-1)/2),n);
}
}
int main (int argc, char const* argv[])
{
int n;
std::cin >> n;
std::vector<std::vector<int> >matrix(n,std::vector<int>(n));
for(int i=0;i<n;i++){
for(int j=0;j<n;j++){
std::cin >> matrix[i][j];
}
}
int i ,j ,power;
std::cin >> i >> j >> power;
std::vector<std::vector<int> >retMax(n,std::vector<int>(n));
retMax = MatrixExponentiation(matrix,n,power);
std::cout << matrix[i-1][j-1] << std::endl;
return 0;
}
But the output doesn't match even for the example case. Am I missing something here, or do I have to try another approach for this problem?
Edit: As @grigor suggested, I changed the code for power == 0 to return the identity matrix, but the code still produces wrong output:
if(power == 0){
std::vector<std::vector<int> >retMatrix(n,std::vector<int>(n));
for(int i=0;i<n;i++){
retMatrix[i][i] = 1;
}
return retMatrix;
}
Note: I haven't written the code for the modulo part yet. Do you think that will affect the example test case?
I think you are just printing out the wrong value. Change:
std::cout << matrix[i-1][j-1] << std::endl;
to
std::cout << retMax[i-1][j-1] << std::endl;
If power == 0 you should return the identity matrix, not the actual matrix.
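Putting both fixes together with the modulo the problem asks for, a sketch could look like this. It is not the asker's code: the Matrix alias and the function names multiply and power are my own, and it uses an iterative square-and-multiply loop instead of the recursive version in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;
const long long MOD = 42373;  // modulus from the problem statement

// Product of two n x n matrices, reduced modulo MOD after every addition.
Matrix multiply(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    assert(b.size() == n);
    Matrix c(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] = (c[i][j] + a[i][k] * b[k][j]) % MOD;
    return c;
}

// base raised to exp by repeated squaring; exp == 0 yields the identity matrix.
Matrix power(Matrix base, int exp) {
    assert(exp >= 0);
    const std::size_t n = base.size();
    Matrix result(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i) result[i][i] = 1;
    while (exp > 0) {
        if (exp & 1) result = multiply(result, base);
        base = multiply(base, base);
        exp >>= 1;
    }
    return result;
}
```

As a check, take the adjacency matrix implied by the routings listed in the problem statement: links 1-2, 1-3, 1-4, 1-5, 2-3, 4-5 and two links 3-4. This is reconstructed from the text, since the figure is not reproduced here, so treat it as an assumption. With that matrix as g, power(g, 3)[0][1] comes out as 7, which matches the stated number of routings from node 1 to node 2 of length 3.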

n-th or Arbitrary Combination of a Large Set

Say I have a set of numbers from [0, ....., 499]. Combinations are currently being generated sequentially using the C++ std::next_permutation. For reference, the size of each tuple I am pulling out is 3, so I am returning sequential results such as [0,1,2], [0,1,3], [0,1,4], ... [497,498,499].
Now, I want to parallelize the code that this is sitting in, so a sequential generation of these combinations will no longer work. Are there any existing algorithms for computing the ith combination of 3 from 500 numbers?
I want to make sure that each thread, regardless of the iterations of the loop it gets, can compute a standalone combination based on the i it is iterating with. So if I want the combination for i=38 in thread 1, I can compute [1,2,5] while simultaneously computing i=0 in thread 2 as [0,1,2].
EDIT Below statement is irrelevant, I mixed myself up
I've looked at algorithms that utilize factorials to narrow down each individual element from left to right, but I can't use these as 500! sure won't fit into memory. Any suggestions?
Here is my shot:
int k = 527; //The kth combination is calculated
int N=500; //Number of Elements you have
int a=0,b=1,c=2; //a,b,c are the numbers you get out
while(k >= (N-a-1)*(N-a-2)/2){
k -= (N-a-1)*(N-a-2)/2;
a++;
}
b= a+1;
while(k >= N-1-b){
k -= N-1-b;
b++;
}
c = b+1+k;
cout << "["<<a<<","<<b<<","<<c<<"]"<<endl; //The result
I got this by thinking about how many combinations there are before the next number is increased. However, it only works for three elements. I can't guarantee that it is correct. It would be cool if you compared it to your results and gave some feedback.
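To make that easier to compare against sequentially generated results, the same loops can be wrapped in a function. This is only a sketch: nth_triple is a hypothetical name, and k is widened to long long purely as a precaution for larger N.

```cpp
#include <array>
#include <cassert>

// k-th (0-based) 3-element combination of {0, ..., N-1} in lexicographic order.
std::array<int, 3> nth_triple(long long k, int N) {
    assert(N >= 3);
    assert(k >= 0 && k < 1LL * N * (N - 1) * (N - 2) / 6);  // k < C(N, 3)
    int a = 0;
    // There are C(N-a-1, 2) combinations whose first element is a.
    while (k >= 1LL * (N - a - 1) * (N - a - 2) / 2) {
        k -= 1LL * (N - a - 1) * (N - a - 2) / 2;
        ++a;
    }
    int b = a + 1;
    // With a fixed, there are N-1-b combinations whose second element is b.
    while (k >= N - 1 - b) {
        k -= N - 1 - b;
        ++b;
    }
    // Whatever remains of k selects the third element directly.
    return {a, b, static_cast<int>(b + 1 + k)};
}
```

For N = 500, index 0 gives [0,1,2], index 527 gives [0,2,32], and the last index, 20708499, gives [497,498,499]. Each call depends only on its own k, so threads can compute their triples independently.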
If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
The following tested code will iterate through each unique combinations:
public void Test10Choose5()
{
String S;
int Loop;
int N = 500; // Total number of elements in the set.
int K = 3; // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The KIndexes array specifies the indexes for a lexicographic element.
int[] KIndexes = new int[K];
StringBuilder SB = new StringBuilder();
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
// Get the k-indexes for this combination.
BC.GetKIndexes(Combo, KIndexes);
// Verify that the KIndexes returned can be used to retrieve the
// rank or lexicographic order of the KIndexes in the table.
int Val = BC.GetIndex(true, KIndexes);
if (Val != Combo)
{
S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
Console.WriteLine(S);
}
SB.Remove(0, SB.Length);
for (Loop = 0; Loop < K; Loop++)
{
SB.Append(KIndexes[Loop].ToString());
if (Loop < K - 1)
SB.Append(" ");
}
S = "KIndexes = " + SB.ToString();
Console.WriteLine(S);
}
}
You should be able to port this class over fairly easily to C++. You probably will not have to port over the generic part of the class to accomplish your goals. Your test case of 500 choose 3 yields 20,708,500 unique combinations, which will fit in a 4 byte int. If 500 choose 3 is simply an example case and you need to choose combinations greater than 3, then you will have to use longs or perhaps fixed point int.
You can describe a particular selection of 3 out of 500 objects as a triple (i, j, k), where i is a number from 0 to 499 (the index of the first number), j ranges from 0 to 498 (the index of the second, skipping over whichever number was first), and k ranges from 0 to 497 (index of the last, skipping both previously-selected numbers). Given that, it's actually pretty easy to enumerate all the possible selections: starting with (0,0,0), increment k until it gets to its maximum value, then increment j and reset k to 0, and so on until j gets to its own maximum value; then increment i, reset both j and k, and continue.
If this description sounds familiar, it's because it's exactly the same way that incrementing a base-10 number works, except that the base is much funkier, and in fact the base varies from digit to digit. You can use this insight to implement a very compact version of the idea: for any integer n from 0 up to (but not including) 500*499*498, you can get:
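#include <iostream>
struct triple {
    int i, j, k;
};
triple AsTriple(int n) {
    triple result;
    result.k = n % 498;
    n = n / 498;
    result.j = n % 499;
    n = n / 499;
    result.i = n % 500; // unnecessary, any legal n will already be between 0 and 499
    return result;
}
void PrintSelections(triple t) {
    int i = t.i;
    // t.j indexes the 499 numbers other than i, so step over i.
    int j = t.j + (i <= t.j ? 1 : 0);
    // t.k indexes the 498 numbers other than i and j:
    // step over the smaller of the two first, then the larger.
    int lo = (i < j) ? i : j;
    int hi = (i < j) ? j : i;
    int k = t.k;
    if (lo <= k) ++k;
    if (hi <= k) ++k;
    std::cout << "[" << i << "," << j << "," << k << "]" << std::endl;
}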
void PrintRange(int start, int end) {
for (int i = start; i < end; ++i) {
PrintSelections(AsTriple(i));
}
}
Now to shard, you can just take the numbers from 0 to 500*499*498, divide them into subranges in any way you'd like, and have each shard compute the permutation for each value in its subrange.
This trick is very handy for any problem in which you need to enumerate subsets.

Swapping blocks of elements in an array

I am working in C++ and need to swap two blocks of elements in an array.
Say {1,2,3,4,5,6} is my input array. Block {4,5} should be moved to the beginning, so the output array should be {4,5,1,2,3,6}. All I have is the start index and end index of the block {4,5}. To do this I am using a temp array, copying the blocks individually to the temp array and moving them back to the original array, which is tedious,
but I am sure there are better methods to do this using memcpy or memmove. Any ideas?
There is a standard algorithm designed specifically for this task called std::rotate():
#include <algorithm>
#include <cstdio>
int main()
{
    int inputArray[] = {1, 2, 3, 4, 5, 6};
    ::printf("Before: ");
    for(int i = 0; i < 6; ++i)
    {
        ::printf("%d ", inputArray[i]);
    }
    ::printf("\n");
    int startIndex = 3; // refers to the number 4 in inputArray
    int endIndex = 5; // refers to one past the number 5 in inputArray
    std::rotate(inputArray, inputArray+startIndex, inputArray+endIndex);
    ::printf("After: ");
    for(int i = 0; i < 6; ++i)
    {
        ::printf("%d ", inputArray[i]);
    }
    ::printf("\n");
}
Expected output:
Before: 1 2 3 4 5 6
After: 4 5 1 2 3 6
std::rotate() performs the rotation in-place via std::swap(), so there's no temporary array involved.
Bentley's "Programming Pearls" describes three algorithms for solving this problem. You can find slides for this specific problem here
http://www.cs.bell-labs.com/cm/cs/pearls/s02b.pdf
For example, the simplest of these algorithms is the reversal one. Just reverse each of the two blocks that you need to swap, and then reverse the entire array. Done.
P.S. In your example "the entire array" would stand for the 1,2,3,4,5 subsequence (6 is not included), since these are the blocks that you need to swap.
Reverse the blocks:
3, 2, 1, 5, 4
Reverse the whole thing
4, 5, 1, 2, 3
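The reversal approach can be sketched with std::reverse as follows. The function name and the half-open [start, end) convention are my own choices for illustration, and the function takes the vector by value and returns the result just to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Moves the block [start, end) to the front of v using three reversals.
// Elements from index end onward are left untouched.
std::vector<int> moved_to_front(std::vector<int> v, std::size_t start, std::size_t end) {
    assert(start <= end && end <= v.size());
    const auto first = v.begin();
    const auto mid = first + static_cast<std::ptrdiff_t>(start);
    const auto last = first + static_cast<std::ptrdiff_t>(end);
    // State shown for {1,2,3,4,5,6} with start = 3, end = 5:
    std::reverse(first, mid);   // 3 2 1 | 4 5 | 6
    std::reverse(mid, last);    // 3 2 1 | 5 4 | 6
    std::reverse(first, last);  // 4 5 1 2 3 | 6
    return v;
}
```

Calling moved_to_front({1, 2, 3, 4, 5, 6}, 3, 5) gives {4, 5, 1, 2, 3, 6}, the same result as the std::rotate call with those indices.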