int mpi_vertex_dist(graph_t *graph, int start_vertex, int *result)
{
    int num_vertices = graph->num_vertices;
    fill_n(result, num_vertices, MAX_DIST);

    auto start_time = Time::now();

    int depth = 0;
    result[start_vertex] = depth;

    int keep_going = true;
    int start, stop;
    int elements_per_proc = num_vertices / num_proc;
    int remainder = num_vertices % num_proc;

    if (my_rank < remainder)
    {
        start = my_rank * (elements_per_proc + 1);
        stop = start + elements_per_proc;
    }
    else
    {
        start = my_rank * elements_per_proc + remainder;
        stop = start + (elements_per_proc - 1);
    }

    int *resultTmp = new int[num_vertices];
    int count = 0;

    while (keep_going)
    {
        keep_going = false;

        for (int vertex = start; vertex <= stop; vertex++)
        {
            if (result[vertex] == depth)
            {
                for (int n = graph->v_adj_begin[vertex];
                     n < graph->v_adj_begin[vertex] + graph->v_adj_length[vertex];
                     n++)
                {
                    int neighbor = graph->v_adj_list[n];

                    if (result[neighbor] > depth + 1)
                    {
                        result[neighbor] = depth + 1;
                        keep_going = true;
                        resultTmp[count] = neighbor;
                        count++;
                    }
                }
            }
        }

        if (count != 0)
        {
            MPI_Allreduce(resultTmp, result, count, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
        }

        // Old Code
        /*
        if (count != 0)
        {
            MPI_Reduce(resultTmp, result, count, MPI_INT, MPI_MIN, my_rank, MPI_COMM_WORLD);
        }
        */

        depth++;
    }

    //print_result(graph, result, depth);

    return std::chrono::duration_cast<us>(Time::now() - start_time).count();
}
I managed to split the graph vertices between processors, each getting 5, 5, 4, 4 - the test graph I am using has 18 vertices in total. I keep track of the updates using count, and at the end of each iteration I use that buffer to merge into the result array with MPI_Allreduce. However, when I run the code it never terminates. What am I doing wrong?
Your main problem is that you approach MPI as if it's shared memory. In your code, each process has a copy of the whole graph, which is wasteful if you ever run this at large scale, but it's not a serious problem for now.
The real problem is the result array: each process does reads and writes into it, but each process has a copy of that array, so it does not see the changes that other processes make.
The fix is that in each iteration of the while loop you need to reconcile these copies: do an Allreduce over the whole result array with the minimum operator, since distances start at MAX_DIST and only ever decrease. (I think. I'm not entirely sure what your algorithm is supposed to be.)
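For concreteness, here is a minimal sketch of that reconciliation step, using the question's own variable names and MPI_IN_PLACE to avoid a scratch buffer; it would replace the MPI_Allreduce over resultTmp at the end of each while iteration:
// reconcile every rank's copy of result: keep the smallest distance seen anywhere
MPI_Allreduce(MPI_IN_PLACE, result, num_vertices, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
// the loop condition has the same problem: each rank has its own keep_going,
// so all ranks must also agree on whether anyone made progress
MPI_Allreduce(MPI_IN_PLACE, &keep_going, 1, MPI_INT, MPI_LOR, MPI_COMM_WORLD);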
Finally, you are using each vertex to update its neighbors. That works fine sequentially, but is not a good design in parallel. Use the symmetry, and update each vertex from its neighbors instead. That is a much better design in shared memory and distributed memory both.
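As an illustration of that pull-based design (a sketch only, reusing the question's graph fields, and assuming the adjacency lists are symmetric and that result has just been reconciled as above):
for (int vertex = start; vertex <= stop; vertex++)
{
    if (result[vertex] != MAX_DIST) continue;          // this vertex is already reached
    for (int n = graph->v_adj_begin[vertex];
         n < graph->v_adj_begin[vertex] + graph->v_adj_length[vertex];
         n++)
    {
        if (result[graph->v_adj_list[n]] == depth)     // a neighbor was reached last round
        {
            result[vertex] = depth + 1;                // so this vertex is reached now
            keep_going = true;
            break;
        }
    }
}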
I have developed a code for my problem and it seems to be working, but when I increase the problem size it does not output anything. In my problem I have several courses with multiple meetings each week, with the condition that each course must have at least one meeting overall across all weeks (e.g. in a 4-week case, at least one meeting in the 4 weeks combined).
A sample of the desired output with 4 courses and 4 weeks looks like the following:
0, 0, 2, 0,
1, 0, 0, 0,
0, 1, 0, 1,
0, 2, 0, 3,
I have written the following recursive code and it is working for small numbers of courses and weeks, but when I increase these numbers it does not output anything. Even for small cases it sometimes does not output anything and I have to run it again to get a result. Here is my code:
//headers
#include <iostream>
#include <cstdlib> // rand, srand
#include <ctime>   // time
//global parameters
const int NumberOfCourses = 4;
const int AvailableWeeks = 4;
double Desired[NumberOfCourses][AvailableWeeks];
//parameters deciding how many courses should we remove from schedule
//option 0:0 f2f meeting
//option 1:1 f2f meeting
//option 2:2 f2f meeting
//option 3:3 f2f meeting
const double DN_possibilites[4] = { 0.7, 0.15, 0.1, 0.05 };
double Rand;
long int OptionSelected;
double SumOfProbabiltiesSofar = 0;
double total = 0;
int c, w;
using namespace std;
void DN_generator() {
    long int currSysTime = time(NULL);
    srand(currSysTime);
    for (int c = 0; c < NumberOfCourses; c++) {
        for (int w = 0; w < AvailableWeeks; w++) {
            long int currSysTime = time(NULL);
            Rand = ((float)rand() / RAND_MAX);
            //cout << Rand << endl;
            long int OptionSelected;
            double SumOfProbabiltiesSofar = 0;
            for (int i = 0; i < 4; i++) {
                SumOfProbabiltiesSofar += DN_possibilites[i];
                if (Rand < SumOfProbabiltiesSofar) {
                    OptionSelected = i;
                    break;
                }
            }
            if (OptionSelected == 0) {
                Desired[c][w] = 0;
            }
            else if (OptionSelected == 1) {
                Desired[c][w] = 1;
            }
            else if (OptionSelected == 2) {
                Desired[c][w] = 2;
            }
            else if (OptionSelected == 3) {
                Desired[c][w] = 3;
            }
        }
    }
    for (c = 0; c < NumberOfCourses; c++) {
        total = 0;
        for (w = 0; w < AvailableWeeks; w++) {
            total += Desired[c][w];
        }
        if (total == 0) {
            DN_generator();
        }
    }
}
int main(){
    DN_generator();
    for (c = 0; c < NumberOfCourses; c++) {
        for (w = 0; w < AvailableWeeks; w++) {
            cout << Desired[c][w] << ", ";
        }
        cout << endl;
    }
    return 0;
}
Any help is much appreciated.
I see at least four fundamental flaws in the shown code:
With the random seed getting reset to the system clock on every recursive call, every call to the recursive function within the same second will produce the same results. If any one of them results in a total sum of 0, the recursive function repeats and produces the same results. Simple code like this runs fast enough on modern CPUs to blow through the stack, and crash, in much less than a second.
Increasing the array size increases the chances of the first problem happening. The combination of these two factors is what results in the crash, but that's not the end of the problems in the shown code.
If the total sum is 0, a recursive call is made. When the recursive call returns, the function happily resumes totaling all the remaining rows, which makes no logical sense. This is a flaw in the recursion algorithm.
Finally, the shown code has guaranteed undefined behavior:
SumOfProbabiltiesSofar += DN_possibilites[i];
The assumption in this line of the code is that on the last iteration SumOfProbabiltiesSofar becomes exactly 1.0, so the following if statement's condition is guaranteed to evaluate to true. Floating-point math gives no such guarantee.
With a sufficiently large number of iterations it becomes likely that the random value gets close enough to 1 to exceed whatever almost-1.0 total actually ends up here; note also that Rand itself can be exactly 1.0 (when rand() returns RAND_MAX), in which case the strict < comparison fails even against an exact 1.0. Either way, the for loop can exit without ever initializing OptionSelected, resulting in undefined behavior. This logic must be fixed too.
All of these problems will need to be fixed before the shown algorithm works correctly.
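A rough sketch of how those fixes could fit together (it reuses the question's globals NumberOfCourses, AvailableWeeks, Desired and DN_possibilites, plus the <cstdlib>/<ctime> includes added above; the row-by-row retry loop is one possible repair, not the only one):
void DN_generator_fixed() {
    for (int c = 0; c < NumberOfCourses; c++) {
        double total;
        do {
            total = 0;
            for (int w = 0; w < AvailableWeeks; w++) {
                double r = (double)rand() / RAND_MAX;
                int option = 3; // fall back to the last option if the loop never breaks
                double sum = 0;
                for (int i = 0; i < 4; i++) {
                    sum += DN_possibilites[i];
                    if (r < sum) { option = i; break; }
                }
                Desired[c][w] = option;
                total += option;
            }
        } while (total == 0); // regenerate only this course's row, no recursion
    }
}

int main() {
    srand(time(NULL)); // seed exactly once
    DN_generator_fixed();
    // ... print Desired as in the original main()
    return 0;
}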
I have been struggling to parallelize this function, which calculates interactions between particles. I had an idea to use Allgatherv, which should distribute my original buffer to all other processes. Then, using "rank", I make a loop in which each process calculates its part. In this program MPI is wrapped to show stats, which is why I am calling it as mpi->function. Unfortunately, when I run it I receive the following error. Can somebody advise what is wrong?
void calcInteraction() {
    totalInteraction = 0.0;
    int procs_number;
    int rank;
    int particle_amount = particles;
    mpi->MPI_Comm_size(MPI_COMM_WORLD, &procs_number);
    mpi->MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    //long sendBuffer[particle_amount];
    //int *displs[procs_number];
    //long send_counts[procs_number+mod];
    int mod = (particle_amount % procs_number);
    int elements_per_process = particle_amount / procs_number;
    int *send_buffer = new int[particle_amount]; //data to send
    int *displs = new int[procs_number];         //displacement
    int *send_counts = new int[procs_number];
    if (rank == 0)
    {
        for (int i = 0; i < particle_amount; i++)
        {
            send_buffer[i] = i; // filling buffer with particles
        }
    }
    for (int i = 0; i < procs_number; i++)
    {
        send_counts[i] = elements_per_process; // filling buffer since it can't be empty
    }
    // calculating displacement
    displs[0] = 0;
    for (int i = 1; i < procs_number; i++)
        displs[i] = displs[i - 1] + send_counts[i - 1];
    int allData = displs[procs_number - 1] + send_counts[procs_number - 1];
    int *endBuffer = new int[allData];
    int start, end; // initializing indices
    cout << "entering allgather" << endl;
    // (send buffer, send count, send type, receive buffer, receive counts, displs, receive type, communicator)
    mpi->MPI_Allgatherv(send_buffer, particle_amount, MPI_INT, endBuffer, send_counts, displs, MPI_INT, MPI_COMM_WORLD);
    start = rank * elements_per_process;
    cout << "start = " << start << endl;
    if (rank == procs_number) //in case that particle_amount is not even
    {
        end = (rank + 1) * elements_per_process + mod;
    }
    else
    {
        end = (rank + 1) * elements_per_process;
    }
    cout << "end = " << end << endl;
    cout << "calcInteraction" << endl;
    for (long idx = start; idx < end; idx++) {
        for (long idxB = start; idxB < end; idxB++) {
            if (idx != idxB) {
                totalInteraction += physics->interaction(x[idx], y[idx], z[idx], age[idx],
                                                         x[idxB], y[idxB], z[idxB], age[idxB]);
            }
        }
    }
    cout << "calcInteraction - done" << endl;
}
You are not using MPI_Allgatherv() correctly.
I had an idea to use Allgatherv which should distribute my original
buffer to all other processes.
The description suggests you need MPI_Scatter[v](), which slices an array on a given rank and distributes the chunks to all the MPI tasks.
If all tasks should receive the full array, then MPI_Bcast() is what you need.
Anyway, let's assume you need an all gather.
First, you must ensure all tasks have the same particles value.
Second, since you gather the same amount of data from every MPI task and store it in a contiguous location, you can simplify your code with MPI_Allgather(). If only the last task might have a bit less data, then you can use MPI_Allgatherv() (but this is not what your code is currently doing) or transmit some ghost data so you can use the simpler (and probably more optimized) MPI_Allgather().
Last but not least, you should send elements_per_process elements (and not particle_amount). That should be enough to get rid of the crash (e.g. MPI_ERR_TRUNCATE). That being said, I am not sure it will achieve the result you need or expect.
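To make that concrete, here is a minimal sketch of how the counts, displacements and the call itself could look. It assumes every rank already holds its own slice in send_buffer and should end up with the full array in endBuffer; the variable names and the mpi-> wrapper are the question's own:
// give the first `mod` ranks one extra element so the remainder is covered
for (int i = 0; i < procs_number; i++)
    send_counts[i] = elements_per_process + (i < mod ? 1 : 0);

displs[0] = 0;
for (int i = 1; i < procs_number; i++)
    displs[i] = displs[i - 1] + send_counts[i - 1];

// each rank contributes only its own send_counts[rank] elements, not particle_amount
mpi->MPI_Allgatherv(send_buffer + displs[rank], send_counts[rank], MPI_INT,
                    endBuffer, send_counts, displs, MPI_INT, MPI_COMM_WORLD);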
I have worked out an O(n^2) solution to the problem. I was wondering about a better solution to this. (This is not a homework/interview problem but something I do out of my own interest, hence sharing it here):
If a = 1, b = 2, c = 3, ..., z = 26. Given a string, find all possible codes that string can generate. Example: "1123" shall give:
aabc //a = 1, a = 1, b = 2, c = 3
kbc // since k is 11, b = 2, c= 3
alc // a = 1, l = 12, c = 3
aaw // a= 1, a =1, w= 23
kw // k = 11, w = 23
Here is my code to the problem:
void alpha(int* a, int sz, vector<vector<int>>& strings) {
    for (int i = sz - 1; i >= 0; i--) {
        if (i == sz - 1) {
            vector<int> t;
            t.push_back(a[i]);
            strings.push_back(t);
        } else {
            int k = strings.size();
            for (int j = 0; j < k; j++) {
                vector<int> t = strings[j];
                strings[j].insert(strings[j].begin(), a[i]);
                if (t[0] < 10) {
                    int n = a[i] * 10 + t[0];
                    if (n <= 26) {
                        t[0] = n;
                        strings.push_back(t);
                    }
                }
            }
        }
    }
}
Essentially the vector strings will hold the sets of numbers.
This would run in O(n^2). I am trying to wrap my head around at least an O(n log n) solution.
Intuitively a tree should help here, but I'm not getting anywhere beyond that.
Generally, your problem complexity is more like 2^n, not n^2, since your k can increase with every iteration.
This is an alternative recursive solution (note: recursion is bad for very long codes). I didn't focus on optimization, since I'm not up to date with C++X, but I think the recursive solution could be optimized with some moves.
Recursion also makes the complexity a bit more obvious compared to the iterative solution.
// Add the front element to each trailing code sequence. Create a new sequence if none exists
void update_helper(int front, std::vector<std::deque<int>>& intermediate)
{
    if (intermediate.empty())
    {
        intermediate.push_back(std::deque<int>());
    }
    for (size_t i = 0; i < intermediate.size(); i++)
    {
        intermediate[i].push_front(front);
    }
}

std::vector<std::deque<int>> decode(int digits[], int count)
{
    if (count <= 0)
    {
        return std::vector<std::deque<int>>();
    }
    std::vector<std::deque<int>> result1 = decode(digits + 1, count - 1);
    update_helper(*digits, result1);
    if (count > 1 && (digits[0] * 10 + digits[1]) <= 26)
    {
        std::vector<std::deque<int>> result2 = decode(digits + 2, count - 2);
        update_helper(digits[0] * 10 + digits[1], result2);
        result1.insert(result1.end(), result2.begin(), result2.end());
    }
    return result1;
}
Call:
std::vector<std::deque<int>> strings = decode(codes, size);
Edit:
Regarding the complexity of the original code, I'll try to show what would happen in the worst case scenario, where the code sequence consists only of 1 and 2 values.
void alpha(int* a, int sz, vector<vector<int>>& strings)
{
    for (int i = sz - 1;
         i >= 0;
         i--)
    {
        if (i == sz - 1)
        {
            vector<int> t;
            t.push_back(a[i]);
            strings.push_back(t); // strings.size+1
        } // if summary: O(1), ignoring capacity change, strings.size+1
        else
        {
            int k = strings.size();
            for (int j = 0; j < k; j++)
            {
                vector<int> t = strings[j]; // O(strings[j].size) vector copy operation
                strings[j].insert(strings[j].begin(), a[i]); // strings[j].size+1
                // note: strings[j].insert treated as O(1) because other containers could do better than vector
                if (t[0] < 10)
                {
                    int n = a[i] * 10 + t[0];
                    if (n <= 26)
                    {
                        t[0] = n;
                        strings.push_back(t); // strings.size+1
                        // O(1), ignoring capacity change and copy operation
                    } // if summary: O(1), strings.size+1
                } // if summary: O(1), ignoring capacity change, strings.size+1
            } // for summary: O(k * strings[j].size), strings.size+k, strings[j].size+1
        } // else summary: O(k * strings[j].size), strings.size+k, strings[j].size+1
    } // for summary: O(sum[i from 1 to sz] of (k * strings[j].size))
      // k (same as string.size) doubles each iteration => k ends near 2^sz
      // string[j].size increases by 1 each iteration
      // k * strings[j].size increases by ?? each iteration (its getting huge)
}
Maybe I made a mistake somewhere, and if we want to play nice we can treat a vector copy as O(1) instead of O(n) to reduce the complexity, but the hard fact remains that the worst case doubles the outer vector size in each iteration of the inner loop (or at least every second iteration, considering the exact structure of the if conditions), and the inner loop depends on that growing vector size, which makes the whole thing at least O(2^n).
Edit2:
I figured out the result complexity (the best hypothetical algorithm still needs to create every element of the result, so the result complexity is a lower bound on what any algorithm can achieve).
It's actually following the Fibonacci numbers:
For worst case input (like only 1s) of size N+2 you have:
size N has k(N) elements
size N+1 has k(N+1) elements
size N+2 is the combination of codes starting with a followed by the combinations from size N+1 (a takes one element of the source) and the codes starting with k, followed by the combinations from size N (k takes two elements of the source)
size N+2 has k(N) + k(N+1) elements
Starting with size 1 => 1 (a) and size 2 => 2 (aa or k)
Result: still exponential growth ;)
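As a quick illustration (a standalone snippet, not part of the solution above), the recurrence can be evaluated directly to see how fast the result count grows:
#include <cstdio>

int main()
{
    long long k_prev = 1, k_curr = 2; // k(1) = 1, k(2) = 2 for a worst-case input of all 1s
    for (int n = 3; n <= 20; n++)     // k(N+2) = k(N+1) + k(N)
    {
        long long k_next = k_prev + k_curr;
        k_prev = k_curr;
        k_curr = k_next;
    }
    std::printf("k(20) = %lld\n", k_curr); // 10946 results for just 20 digits
    return 0;
}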
Edit3:
I worked out a dynamic programming solution, somewhat similar to your approach with reverse iteration over the code array, and kind of optimized in its vector usage, based on the properties explained in Edit2.
The inner loop (update_helper) is still dominated by the count of results (worst case Fibonacci) and a few outer loop iterations will have a decent count of sub-results, but at least the sub-results are reduced to a pointer to some intermediate node, so copying should be pretty efficient. As a little bonus, I switched the result from numbers to characters.
Another edit: updated code with range 0 - 25 as 'a' - 'z', fixed some errors that led to wrong results.
struct const_node
{
    const_node(char content, const_node* next)
        : next(next), content(content)
    {
    }
    const_node* const next;
    const char content;
};

// put front in front of each existing sub-result
void update_helper(int front, std::vector<const_node*>& intermediate)
{
    for (size_t i = 0; i < intermediate.size(); i++)
    {
        intermediate[i] = new const_node(front + 'a', intermediate[i]);
    }
    if (intermediate.empty())
    {
        intermediate.push_back(new const_node(front + 'a', NULL));
    }
}

std::vector<const_node*> decode_it(int digits[9], size_t count)
{
    int current = 0;
    std::vector<const_node*> intermediates[3];
    for (size_t i = 0; i < count; i++)
    {
        current = (current + 1) % 3;
        int prev = (current + 2) % 3;     // -1
        int prevprev = (current + 1) % 3; // -2
        size_t index = count - i - 1;     // invert direction
        // copy from prev
        intermediates[current] = intermediates[prev];
        // update current (part 1)
        update_helper(digits[index], intermediates[current]);
        if (index + 1 < count && digits[index] &&
            digits[index] * 10 + digits[index + 1] < 26)
        {
            // update prevprev
            update_helper(digits[index] * 10 + digits[index + 1], intermediates[prevprev]);
            // add to current (part 2)
            intermediates[current].insert(intermediates[current].end(), intermediates[prevprev].begin(), intermediates[prevprev].end());
        }
    }
    return intermediates[current];
}

void cleanupDelete(std::vector<const_node*>& nodes);

int main()
{
    int code[] = { 1, 2, 3, 1, 2, 3, 1, 2, 3 };
    int size = sizeof(code) / sizeof(int);
    std::vector<const_node*> result = decode_it(code, size);
    // output
    for (size_t i = 0; i < result.size(); i++)
    {
        std::cout.width(3);
        std::cout.flags(std::ios::right);
        std::cout << i << ": ";
        const_node* item = result[i];
        while (item)
        {
            std::cout << item->content;
            item = item->next;
        }
        std::cout << std::endl;
    }
    cleanupDelete(result);
}

void fillCleanup(const_node* n, std::set<const_node*>& all_nodes)
{
    if (n)
    {
        all_nodes.insert(n);
        fillCleanup(n->next, all_nodes);
    }
}
void cleanupDelete(std::vector<const_node*>& nodes)
{
    // this is like multiple inverse trees, hard to delete correctly, since multiple next pointers refer to the same target
    std::set<const_node*> all_nodes;
    for (auto var : nodes) // standard range-based for (the original used the non-standard MSVC "for each")
    {
        fillCleanup(var, all_nodes);
    }
    nodes.clear();
    for (auto var : all_nodes)
    {
        delete var;
    }
    all_nodes.clear();
}
A drawback of the dynamically reused structure is the cleanup, since you want to be careful to delete each node only once.
I'm trying to implement a NegaMax AI for Connect 4. The algorithm works well some of the time, and the AI can win. However, sometimes it completely fails to block the opponent's three-in-a-rows, or doesn't take a winning shot when it has three in a row.
The evaluation function iterates through the grid (horizontally, vertically, diagonally up, diagonally down), and takes every set of four squares. It then checks within each of these sets and evaluates based on this.
I've based the function on the evaluation code provided here: http://blogs.skicelab.com/maurizio/connect-four.html
My function is as follows:
//All sets of four tiles are evaluated before this
//and values for the following variables are set.
if (redFoursInARow != 0)
{
    redScore = INT_MAX;
}
else
{
    redScore = (redThreesInARow * threeWeight) + (redTwosInARow * twoWeight);
}

int yellowScore = 0;
if (yellowFoursInARow != 0)
{
    yellowScore = INT_MAX;
}
else
{
    yellowScore = (yellowThreesInARow * threeWeight) + (yellowTwosInARow * twoWeight);
}

int finalScore = yellowScore - redScore;
return turn ? finalScore : -finalScore; //If this is an ai turn, return finalScore. Else return -finalScore.
My negamax function looks like this:
inline int NegaMax(char g[6][7], int depth, int &bestMove, int row, int col, bool aiTurn)
{
    {
        char c = CheckForWinner(g);
        if ('E' != c || 0 == depth)
        {
            return EvaluatePosition(g, aiTurn);
        }
    }

    int bestScore = INT_MIN;
    for (int i = 0; i < 7; ++i)
    {
        if (CanMakeMove(g, i)) //If column i is not full...
        {
            //...then make a move in that column.
            //Grid is a 2d char array.
            //'E' = empty tile, 'Y' = yellow, 'R' = red.
            char newPos[6][7];
            memcpy(newPos, g, sizeof(char) * 6 * 7);
            int newRow = GetNextEmptyInCol(g, i);
            if (aiTurn)
            {
                UpdateGrid(newPos, i, 'Y');
            }
            else
            {
                UpdateGrid(newPos, i, 'R');
            }
            int newScore = 0; int newMove = 0;
            newScore = NegaMax(newPos, depth - 1, newMove, newRow, i, !aiTurn);
            newScore = -newScore;
            if (newScore > bestScore)
            {
                bestMove = i;
                bestScore = newScore;
            }
        }
    }
    return bestScore;
}
I'm aware that Connect Four has been solved and that there are definitely better ways to go about this, but any help or suggestions with fixing/improving this will be greatly appreciated. Thanks!
Edit: to clarify, the problem is with the second algorithm.
I have a bit of C++ code that samples cards from a 52 card deck, which works just fine:
void sample_allcards(int table[5], int holes[], int players) {
    int temp[5 + 2 * players];
    bool try_again;
    int c, n, i;
    for (i = 0; i < 5 + 2 * players; i++) {
        try_again = true;
        while (try_again == true) {
            try_again = false;
            c = fast_rand52();
            // reject collisions
            for (n = 0; n < i + 1; n++) {
                try_again = (temp[n] == c) || try_again;
            }
            temp[i] = c;
        }
    }
    copy_cards(table, temp, 5);
    copy_cards(holes, temp + 5, 2 * players);
}
I am implementing code to sample the hole cards according to a known distribution (stored as a 2d table). My code for this looks like:
void sample_allcards_weighted(double weights[][HOLE_CARDS], int table[5], int holes[], int players) {
    // weights are distribution over hole cards
    int temp[5 + 2 * players];
    int n, i;
    // table cards
    for (i = 0; i < 5; i++) {
        bool try_again = true;
        while (try_again == true) {
            try_again = false;
            int c = fast_rand52();
            // reject collisions
            for (n = 0; n < i + 1; n++) {
                try_again = (temp[n] == c) || try_again;
            }
            temp[i] = c;
        }
    }
    for (int player = 0; player < players; player++) {
        // hole cards according to distribution
        i = 5 + 2 * player;
        bool try_again = true;
        while (try_again == true) {
            try_again = false;
            // weighted-sample c1 and c2 at once
            // h is a number < 1325
            int h = weighted_randi(&weights[player][0], HOLE_CARDS);
            // i2h uses h and sets temp[i] to the 2 cards implied by h
            i2h(&temp[i], h);
            // reject collisions
            for (n = 0; n < i; n++) {
                try_again = (temp[n] == temp[i]) || (temp[n] == temp[i+1]) || try_again;
            }
        }
    }
    copy_cards(table, temp, 5);
    copy_cards(holes, temp + 5, 2 * players);
}
My problem? The weighted sampling algorithm is a factor of 10 slower. Speed is very important for my application.
Is there a way to improve the speed of my algorithm to something more reasonable? Am I doing something wrong in my implementation?
Thanks.
Edit: I was asked about this function, which I should have posted, since it is key:
inline int weighted_randi(double *w, int num_choices) {
    double r = fast_randd();
    double threshold = 0;
    int n;
    for (n = 0; n < num_choices; n++) {
        threshold += *w;
        if (r <= threshold) return n;
        w++;
    }
    // shouldn't get this far
    cerr << n << "\t" << threshold << "\t" << r << endl;
    assert(n < num_choices);
    return -1;
}
...and i2h() is basically just an array lookup.
Your collision rejection is turning an O(n) algorithm into (I think) an O(n^2) operation.
There are two ways to select cards from a deck: shuffle and pop, or pick sets until the elements of the set are unique; you are doing the latter, which requires a considerable amount of backtracking.
I didn't look at the details of the code, just a quick scan.
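For the shuffle-and-pop variant, here is a minimal sketch (not the poster's code; it uses the standard <random> facilities rather than the question's fast_rand52(), which is an assumption about what is available):
#include <algorithm>
#include <random>

// deal k distinct cards from a 52-card deck with a partial Fisher-Yates shuffle:
// no rejection loop, every draw is O(1)
void deal_cards(int out[], int k) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    int deck[52];
    for (int i = 0; i < 52; i++) deck[i] = i;
    for (int i = 0; i < k; i++) {
        std::uniform_int_distribution<int> pick(i, 51); // a random not-yet-dealt position
        std::swap(deck[i], deck[pick(rng)]);
        out[i] = deck[i];
    }
}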
You could gain some speed by replacing all the loops that check whether a card is taken with a bit mask, e.g. for a pool of 52 cards, we prevent collisions like so:
DWORD dwMask[2] = {0}; //64 bits
//...
int nCard;
while (true)
{
    nCard = rand_52();
    if (!(dwMask[nCard >> 5] & 1 << (nCard & 31)))
    {
        dwMask[nCard >> 5] |= 1 << (nCard & 31);
        break;
    }
}
//...
My guess would be the memcpy(1326*sizeof(double)) within the retry loop. It doesn't seem to change, so does it really need to be copied each time?
Rather than tell you what the problem is, let me suggest how you can find it. Either 1) single-step it in the IDE, or 2) randomly halt it to see what it's doing.
That said, sampling by rejection, as you are doing, can take an unreasonably long time if you are rejecting most samples.
Your inner "try_again" for loop should stop as soon as it sets try_again to true - there's no point in doing more work after you know you need to try again.
for (n = 0; n < i && !try_again; n++) {
    try_again = (temp[n] == temp[i]) || (temp[n] == temp[i+1]);
}
The second question, about picking from a weighted set, also has an algorithmic replacement with lower time complexity. This is based on the principle that what is pre-computed does not need to be re-computed.
In an ordinary (unweighted) selection you have bins of equal width, which makes picking a bin an O(1) operation. Your weighted_randi function has bins of varying real-valued width, so selection in your current version takes O(n) time. Since you don't say (but do imply) that the vector of weights w is constant, I'll assume that it is.
You aren't interested in the widths of the bins per se; you are interested in the locations of their edges, which you re-compute on every call to weighted_randi using the variable threshold. If w really is constant, pre-computing the list of edges (that is, the value of threshold for each *w) is an O(n) step that only needs to be done once. If you put the results in a (naturally) ordered list, a binary search on all future calls yields O(log n) time complexity, with an increase in space of only sizeof w / sizeof w[0] elements.
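A small sketch of that idea (assuming, as above, that the weights really are constant; the already-generated random number is passed in so the snippet stands on its own):
#include <algorithm>
#include <vector>

std::vector<double> edges; // edges[n] = w[0] + ... + w[n], computed once

void build_edges(const double *w, int num_choices) {
    edges.resize(num_choices);
    double threshold = 0;
    for (int n = 0; n < num_choices; n++) {
        threshold += w[n];
        edges[n] = threshold;
    }
}

// first n with r <= edges[n], matching the original linear scan, but in O(log n)
inline int weighted_randi_fast(double r, int num_choices) {
    int n = (int)(std::lower_bound(edges.begin(), edges.end(), r) - edges.begin());
    return n < num_choices ? n : num_choices - 1; // clamp the rare rounding overshoot
}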