Lock strategy of OpenMP programming in C++ for nested loops

Yesterday I tested my landscape-evolution program with a big data set (20 million nodes), and the running speed was unacceptable. During debugging I noticed that one particular function was slowing the whole system down, so I'd like to add multithreading to it.
However, the function itself is a nested loop with pointer iterators, and I believe some of the data has to be locked during the process.
Basically, what I am trying to do is calculate the contributing area of each node. Contributing area, as the name suggests, is the sum of the areas of all upstream nodes.
Here is the code of those two functions:
for( curnode = nodIter.FirstP(); nodIter.IsActive(); curnode = nodIter.NextP() )
{
    // iterate through a linked list of pointers to active stream nodes
    // (IsActive() returns a bool)
    CalcDRArea( curnode, curnode->getVArea() ); // pass a pointer to the current
                                                // node and the value of its VArea
}

void CalcDRArea( NodeClass *curnode, double addedArea )
{
    // As long as the current node is neither a boundary nor invalid, add
    // _addedArea_ to its total drainage area and advance to the next node downstream
    while( (curnode->ReachBoundary() == NonBoundary) &&
           (curnode->valid() != Nonvalid) )
    {
        curnode->AddDrArea( addedArea );             // simple inline "drarea += value"
        curnode = curnode->getDownStreammNeighbor(); // returns a pointer to the downstream node
    }
}
Here is a simple illustration of nodes and their flowing direction
A B C D
\ | / |
E F G H
| |
I Q K L
| /
M N O P
// letters are stream nodes; slashes are their flow directions
My plan is to use OpenMP to parallelize the first function's for loop. Ideally it would create several threads to process nodes separately.
However, as shown in the figure above, a sequential process can handle streams like
A-> F -> Q -> N
B-> F -> Q -> N
C-> F -> Q -> N
easily, but it will definitely cause problems under multithreaded conditions.
From what I've just read in OpenMP's documentation, flush and lock might be the right way to do this, but I am still quite clueless, and there may be other potential issues in these loops (e.g., gcc's version of OpenMP doesn't support "!=" in the loop condition).
==== Update ====
There are two kinds of areas: vArea, which is the area of each node, and drArea, which is the sum of the area of the current node and the areas of all its upstream nodes.
I was wondering if I can change the current function to this:
for( active node iterator )
{
    if( currentNode.hasDownStreamNode )
    {
        downStreamNode.drArea += currentNode.vArea + currentNode.drArea;
    }
    currentNode.drArea += currentNode.vArea;
}

Before worrying about parallelism, you should first pick a better algorithm, although in this case one actually goes with the other.
You want a dynamic-programming solution in O(N) instead of your current O(N^2) approach. The intuition is simple: just call the following method on each of the leaf nodes in your tree:
def compute_value(node):
    if node.DrArea != 0:
        return node.DrArea
    total = node.vArea
    for n in node.upstream_nodes():
        total += compute_value(n)
    node.DrArea = total
    return node.DrArea
To see why this is more efficient, let's take a look at your example. At the moment you add the value of A to every node on its downstream path (A, F, Q and N); then you do the same for B and for C. So you have 12 add operations.
The recursive method, on the other hand, computes the values for A, B and C first, then computes F from the already known values of A, B and C; Q gets computed from F, and so on. So that's only 8 adds. Essentially, every node adds its total value to each of its children once, instead of walking the whole downstream path and adding only its own value.
For a simple sequential algorithm you can then go ahead and implement this iteratively, using a list of nodes whose predecessors have all been evaluated (so instead of starting from the leaves, start from the root nodes). The easiest parallel implementation here is to use a concurrent queue and atomic add operations, although using one normal queue per processor and some work-stealing would probably be a very good idea in practice.

Related

How to calculate the total distance between various vertices in a graph?

Let's say I have a weighted, undirected, acyclic graph with no negative edge weights, comprising n vertices and n-1 edges. If I want to calculate the distance between every single pair of them (using edge weights) and then add it all up, which algorithm should I use? If, for example, a graph has 4 vertices connected like a-b, a-c, c-d, then the program should output the total distance needed to go from a-d, a-c, a-b, b-c, b-d and so on. You could call it every possible path between the given vertices. The language I am using is C++.
I have tried using Dijkstra's and Prim's algorithms, but neither has worked for me. I have thought about using normal or multi-source DFS, but I have been struggling with it for some time now. Is there really a fast way to calculate it, or have I misunderstood the problem entirely?
Since you have an acyclic graph, there is only one possible path between any two points. This makes things a lot simpler to compute, and you don't need any real pathfinding algorithms.
Let's say we have an edge E that connects nodes A and B. Calculate how many nodes can be reached from node A without using edge E (including A). Multiply that by the number of nodes that can be reached from node B without using edge E (including B). Now you have the number of paths that travel through edge E. Multiply this by the weight of edge E, and you have the total contribution of edge E to the sum.
Do the same thing for every edge and add up the results.
To make the algorithm more efficient, each edge can store cached values giving the number of nodes reachable on each side of the edge.
You don't have to use a depth-first search. Here is some pseudocode showing how to calculate the number of nodes reachable on one side of edge E very quickly, taking advantage of caching:
int count_nodes_reachable_on_edge_side(Edge e, Node a) {
    // assume edge e directly connects to node a
    if (answer is already cached in e) { return the answer; }
    answer = 1; // a is reachable
    for each edge f connected to node a {
        if (f is not e) {
            let b be the other node f touches (not a)
            answer += count_nodes_reachable_on_edge_side(f, b)
        }
    }
    cache the answer in edge e;
    return answer;
}
I already presented an O(N^2) algorithm in my other answer, but I think you can actually do this in O(N) time with this pseudocode:
let root be an arbitrary node on the graph;
let total_count be the total number of nodes;
let total_cost be 0;

process(root, null);

// Returns the number of nodes reachable from node n without going
// through edge p.  Also adds to total_cost the contribution from
// all edges touching node n, except for edge p.
int process(Node n, Edge p)
{
    count = 1
    for each edge q that touches node n {
        if (q != p) {
            let m be the other node connected to q (not n)
            sub_count = process(m, q)
            total_cost += weight(q) * sub_count * (total_count - sub_count)
            count += sub_count
        }
    }
    return count
}
The run time of this is O(N), where N is the number of nodes, because process will be called exactly once for each node.
(For the detail-oriented readers: the loop inside process does not matter. There are O(N) iterations that call process in total, because process is called on each node exactly once; and there are O(N) iterations that don't do anything (because q == p), because those can only happen once per process call.)
Every edge will also be visited. After we recursively count the number of nodes on one side of an edge, a simple subtraction (total_count - sub_count) gives the number of nodes on the other side. With these two node counts, we multiply them together to get the total number of paths going through the edge, then multiply that by the weight, and add it to the total cost.

Speed up Iteration Over Neighbors in a Graph

I have a static graph (the topology does not change over time and is known at compile time) where each node in the graph can have one of three states. I then simulate a dynamic where a node has a probability of changing its state over time, and this probability depends on the state of its neighbors. As the graph grows larger the simulations start getting very slow, but after some profiling, I identified that most of the computation time was spent iterating over the list of neighbors.
I was able to improve the speed of the simulations by changing the data structure used to access neighbors in the graph but was wondering if there are better (faster) ways to do it.
My current implementation goes like this:
For a graph with N nodes labeled from 0 to N-1 and average number of neighbors of K, I store each state as an integer in an std::vector<int> states and the number of neighbors for each node in std::vector<int> number_of_neighbors.
To store neighbors information I created two more vectors: std::vector<int> neighbor_lists which stores, in order, the nodes that are neighbors to node 0, node 1, ... , node N, and an index vector std::vector<int> index which stores, for each node, the index of its first neighbor in neighbor_lists.
So I have four vectors in total:
printf( "%zu\n", states.size() );              // N
printf( "%zu\n", number_of_neighbors.size() ); // N
printf( "%zu\n", neighbor_lists.size() );      // N * K
printf( "%zu\n", index.size() );               // N
When updating node i I access its neighbors like so:
// access neighbors of node i:
for ( int s = 0; s < number_of_neighbors[i]; s++ ) {
    int neighbor_node = neighbor_lists[index[i] + s];
    int state_of_neighbor = states[neighbor_node];
    // use neighbor state for stuff...
}
To sum up my question then: is there a faster implementation for accessing neighboring nodes in a fixed graph structure?
Currently, I've gone up to N = 5000 for a decent amount of simulation time, but I was aiming for N ~ 15,000 if at all possible.
It's important to know the order of magnitude of N because, if it isn't too high, you can exploit the fact that the topology is known at compile time: put the data in std::arrays of known dimensions (instead of std::vectors), use the smallest possible types to save (stack) memory if necessary, and define some of them as constexpr (all but states).
So, if N isn't too big (stack limit!), you can define
states as a std::array<std::uint_fast8_t, N> (8 bits are enough for 3 states)
number_of_neighbors as a constexpr std::array<std::uint_fast8_t, N> (if the maximum number of neighbors is less than 256; a bigger type otherwise)
neighbor_list as a constexpr std::array<std::uint_fast16_t, M> (where M is the known sum of the numbers of neighbors), if 16 bits are enough for N; a bigger type otherwise
index as a constexpr std::array<std::uint_fast16_t, N>, if 16 bits are enough for M; a bigger type otherwise
I think (I hope) that with arrays of known dimensions, constexpr where possible, the compiler can generate faster code.
Regarding the updating code... I'm an old C programmer, so I'm used to hand-optimizing code in ways that modern compilers now do better; I don't know whether the following is still a good idea, but anyway, I would write it like this:
auto first = index[i];
auto top   = first + number_of_neighbors[i];

for ( auto s = first ; s < top ; ++s ) {
    auto neighbor_node     = neighbor_lists[s];
    auto state_of_neighbor = states[neighbor_node];
    // use neighbor state for stuff...
}
-- EDIT --
The OP specifies that
Currently, I've gone up to N = 5000 for a decent amount of simulation time, but I was aiming for N ~ 15,000 if at all possible.
So 16 bits should be enough for the types in neighbor_list and in index, and
states and number_of_neighbors are about 15 kB each (30 kB with a 16-bit type);
index is about 30 kB.
These seem to me reasonable sizes for stack variables.
The problem could be neighbor_list: if the average number of neighbors is low, say 10 to fix a number, then M (the sum of neighbors) is about 150,000, so neighbor_list is about 300 kB; not small, but reasonable for some environments.
If the average number is high, say 100 to fix another number, neighbor_list becomes about 3 MB; that could be too big in some environments.
Currently you are accessing sum(K) nodes for each iteration. That doesn't sound so bad... until you consider the cache.
For fewer than 2^16 nodes you only need a uint16_t to identify a node, but with K neighbours you will need a uint32_t to index the neighbour list.
The 3 states can, as already mentioned, be stored in 2 bits.
So having
// your nodes' neighbour offsets, N elements, 16K*4 bytes = 64KB
// (really the start of the next node's neighbours, as we start at zero)
std::vector<uint32_t> nbOffset;
// states of your nodes, N elements, 16K*1 byte = 16KB
std::vector<uint8_t> states;
// list of all neighbour relations,
// sum(K) > 2^16, sum(K) elements, sum(K)*2 bytes (e.g. for average K=16, 16K*16*2 = 512KB)
std::vector<uint16_t> nbList;
Your code:
// access neighbors of node i:
for ( int s = 0; s < number_of_neighbors[i]; s++ ) {
    int neighbor_node = neighbor_lists[index[i] + s];
    int state_of_neighbor = states[neighbor_node];
    // use neighbor state for stuff...
}
Rewriting your code to:
uint32_t curNb = 0;
for (auto curOffset : nbOffset) {
    for (; curNb < curOffset; curNb++) {
        int neighbor_node     = nbList[curNb]; // done away with one indirection
        int state_of_neighbor = states[neighbor_node];
        // use neighbor state for stuff...
    }
}
So to update one node you read its current state from states, read the offset from nbOffset, use that index to look up the neighbour list nbList, and use the indices from nbList to look up the neighbours' states in states.
The first two will most likely already be in L1$ if you run linearly through the list. Reading the first value from nbList for each node might be in L1$ if you process the nodes linearly; otherwise it will most likely cause an L1$ miss and likely an L2$ miss, but the following reads will be hardware-prefetched.
Processing the nodes linearly has the added advantage that each neighbour list is read only once per iteration over the node set, so the likelihood that states stays in L1$ increases dramatically.
Decreasing the size of states can improve the chance that it stays in L1$ even further: with a little calculation you can store four 2-bit states in each byte, reducing the size of states to 4KB. So depending on how much "stuff" you do, you could have a very low cache-miss rate.
But if you jump around the nodes while doing "stuff", the situation quickly gets worse, inducing a nearly guaranteed L2$ miss for nbList and potential L1$ misses for the current node and the K accesses to states. This could slow you down by a factor of 10 to 50.
If you're in the latter scenario, with random access, you should consider storing an extra copy of each state in the neighbour list, saving the cost of accessing states K times. You have to measure whether this is faster.
Regarding baking the data into the program: you would gain a little from not having to go through the vector; I would estimate that gain at less than 1% in this case.
If you inline and constexpr everything aggressively, your compiler will boil your computer for years and reply "42" as the final result of the program. You have to find a middle ground.

Enhanced FrogRiverN

I'm working through some programming exercises. This one is quite well known and has been answered in different places.
FrogRiverOne
Find the earliest time when a frog can jump to the other side of a river.
https://codility.com/programmers/task/frog_river_one/
My question is: what if the frog can jump a distance of up to D? How can we find the shortest time to cross the river, with the best runtime complexity? Thanks!
int solution(int X, vector<int> &A, int D); // the frog can jump from 1 to D steps
I think shole's greedy solution is almost correct. If you include a recursive propagation step when you change Current_Pos, you will ensure that the frog is always at the front-most position.
Here is an alternative that avoids the recursion:
Use an occupancy array that stores, for each position, whether there is a leaf, and use a union-find data structure with a node for every position. The union-find data structure keeps track of positions that can be reached from each other (i.e., connected components). The task then is to find the first point in time at which both river banks are connected.
To find this, do the following: every time a new leaf comes into play, mark its position as occupied. Then unite its node in the union-find data structure with every other occupied node reachable from this position (-D to +D). Finally, check whether both river banks are connected. The overall time complexity is O(ND+X).
Which of the two solutions is faster depends on the input.
Try this in C# with LINQ:
private static int FrogRiverOne(int X, int[] A)
{
    if (Enumerable.Range(1, X).Except(A).Any()) { return -1; }
    var orderBy = A.Select((y, z) => new { y, z })
                   .GroupBy(a => a.y)
                   .Select(a => { var ff = a.Min(xe => xe.z); return new { a, ff }; });
    var second = orderBy.Max(xe => xe.ff);
    return second;
}

binary tree - print the elements according to the level

This question was asked to me in an interview:
Let's say we have the above binary tree; how can I produce output like the one below?
2 7 5 2 6 9 5 11 4
I answered that maybe we can have a level-count variable and print all the elements sequentially by checking the level count of each node.
Probably I was wrong.
Can anybody give any idea as to how we can achieve that?
You need to do a breadth-first traversal of the tree. Here it is described as follows:
Breadth-first traversal: Depth-first is not the only way to go through the elements of a tree. Another way is to go through them level-by-level.
For example, each element exists at a certain level (or depth) in the tree:
tree
----
      j         <-- level 0
    /   \
   f     k      <-- level 1
  / \     \
 a   h     z    <-- level 2
  \
   d            <-- level 3
(People like to number things starting with 0.)
So, if we want to visit the elements level-by-level (and left-to-right, as usual), we would start at level 0 with j, then go to level 1 for f and k, then go to level 2 for a, h and z, and finally go to level 3 for d.
This level-by-level traversal is called a breadth-first traversal because we explore the breadth, i.e., the full width of the tree at a given level, before going deeper.
The traversal in your question is called a level-order traversal, and this is how it's done (a very simple/clean code snippet I found).
You basically use a queue, and the order of operations will look something like this:
enqueue F
dequeue F
enqueue B G
dequeue B
enqueue A D
dequeue G
enqueue I
dequeue A
dequeue D
enqueue C E
dequeue I
enqueue H
dequeue C
dequeue E
dequeue H
For this tree (straight from Wikipedia):
The term for that is level-order traversal. Wikipedia describes an algorithm for that using a queue:
levelorder(root)
    q = empty queue
    q.enqueue(root)
    while not q.empty do
        node := q.dequeue()
        visit(node)
        if node.left ≠ null
            q.enqueue(node.left)
        if node.right ≠ null
            q.enqueue(node.right)
BFS:
std::queue<Node const *> q;
q.push(&root);
while (!q.empty()) {
    Node const *n = q.front();
    q.pop();
    std::cout << n->data << std::endl;
    if (n->left)
        q.push(n->left);
    if (n->right)
        q.push(n->right);
}
Iterative deepening would also work and saves memory, but at the expense of computing time.
If we can fetch the next element at the same level, we are done. As noted above, we can access these elements using a breadth-first traversal.
The only remaining problem is how to tell when we are at the last element of a level. For this reason, we append a delimiter (NULL in this case) to mark the end of each level.
Algorithm:
1.  Put root in queue.
2.  Put NULL in queue.
3.  While queue is not empty
4.      x = fetch first element from queue
5.      If x is not NULL
6.          x->rpeer <= top element of queue
7.          put left and right child of x in queue
8.      else
9.          if queue is not empty
10.             put NULL in queue
11.         end if
12. end while
13. return
#include <queue>

void print(tree* root)
{
    queue<tree*> que;
    if (!root)
        return;
    tree *tmp, *l, *r;
    que.push(root);
    que.push(NULL);
    while( !que.empty() )
    {
        tmp = que.front();
        que.pop();
        if(tmp != NULL)
        {
            cout << tmp->val; // print value
            l = tmp->left;
            r = tmp->right;
            if(l) que.push(l);
            if(r) que.push(r);
        }
        else
        {
            if (!que.empty())
                que.push(NULL);
        }
    }
    return;
}
I would use a collection, e.g. std::list, to store all elements of the currently printed level:
Collect pointers to all nodes of the current level in the container
Print the nodes listed in the container
Make a new container and add the subnodes of all nodes in the current container
Overwrite the old container with the new container
Repeat until the container is empty
As an example of what you can do in an interview if you don't remember / don't know the "official" algorithm: my first idea was to traverse the tree in regular pre-order, dragging a level counter along and maintaining a vector of linked lists of node pointers, one per level, e.g.
levels[level].push_back(&node);
and at the end print the list of each level.

C++ minimax function

I have searched Google and Stack Overflow for this question, but I still don't understand how a minimax function works.
I found that the Wikipedia entry has a pseudocode version of the function:
function integer minimax(node, depth)
    if node is a terminal node or depth <= 0:
        return the heuristic value of node
    α = -∞
    for child in node:    # evaluation is identical for both players
        α = max(α, -minimax(child, depth-1))
    return α
Several other minimax functions I found with Google are basically the same thing. I'm trying to implement this in C++, and this is what I have come up with so far:
double miniMax(Board eval, int iterations)
{
    // I evaluate the board from both players' points of view and subtract the difference
    if(iterations == 0)
        return boardEval(eval, playerNumber) - boardEval(eval, opponentSide());

    /* Here, playerTurn tells the findPossibleMoves function whose turn it is;
       I mean, how do you generate a list of possible moves if you don't even know
       whose turn it's supposed to be?  But the problem is, I don't see where I can
       get playerTurn from, as there are only 2 parameters in all the examples of
       minimax I've seen */
    vector<int> moves = eval.findPossibleMoves(playerTurn);

    // I'm assuming -∞ in the Wikipedia article means a very low number?
    int result = -999999999;

    // Now I run this loop to evaluate each possible move
    /* Also, the Lua example in the wiki article has
       alpha = node.player==1 and math.max(alpha,score) or math.min(alpha,score)
       Is alpha a boolean there?! */
    for(int i = 0; i * 2 < moves.size(); i++)
    {
        // I make a copy of the board...
        Board temp = eval;
        /* ...and make the next possible move... once again playerTurn crops up,
           and I don't know where I can get that variable from */
        temp.putPiece(moves[i * 2], moves[i * 2 + 1], playerTurn);
        /* So do I create a function max that returns the bigger of two doubles? */
        result = max(result, -miniMax(temp, iterations - 1));
    }
    return result;
    /* So now I've returned the maximum score from all possible moves within a certain
       # of moves; so how do I know which move to make?  I have the score; how do I
       know which sequence of moves that score belongs to? */
}
As you can see, I'm pretty confused about this minimax function. Please at the very least give me some hints to help me with this.
Thanks! :)
That sample from Wikipedia is doing plain NegaMax (note that there is no Alpha/Beta pruning in it; the α there is just a running maximum).
You may be helped by getting the naming straight:
The basis is MiniMax; a literal implementation would involve 2 mutually recursive methods that take turns, one for each side.
Lazy programmers turn this into NegaMax: one method with a strategically placed - operator.
Alpha/Beta pruning is keeping track of a window of best moves (over multiple depths) to detect dead branches.
Your playerTurn is used to determine whose turn it is. In NegaMax you can derive this from the depth (iterations) being odd or even, but it would be easier to use 2 parameters (myColor, otherColor) and swap them at each level.
Your miniMax() function should remember the best move it found so far. So instead of this code:
/*So do I create a function max that returns the bigger of two doubles?*/
result = max(result, -miniMax(temp, iterations - 1));
You should do something like this:
/* So do I create a function max that returns the bigger of two doubles? */
double score = -miniMax(temp, iterations - 1);
if (score > result)
{
    result = score;
    bestMove = i;
}
Of course, you need a variable "bestMove" and a way to return the best move found to the caller.
Add the playerTurn variable as an argument to miniMax, and call miniMax with the current player initially and with the players switched in the recursive calls.
Also, opponentSide needs to be a function of playerTurn.
A good place to start with game-tree search is the chess programming wiki. For your question about the move: I think it is most common to have two max functions; the difference between them is that one returns only the score while the other returns the score and the best move. A recursive call order would be like the following:
maxWithBestMoveReturn(...) --> min(...) --> max(...) --> min(...)
There are some good papers with pseudocode for the Alpha Beta algorithm:
TA Marsland - Computer Chess and Search
J Schaeffer - The games Computers (and People) Play
To your question in the comment ("and math.max(alpha,score) or math.min(alpha,score)" - is alpha a boolean there?!):
No, alpha is a window bound in an alpha-beta algorithm. The alpha value gets updated with new values, and because alpha and beta are swapped in the recursive call of the negamax function, the alpha variable refers to the beta variable of the next recursive call.
One note on the playerTurn variable: the minimax or alpha-beta algorithm doesn't need this information, so I would put the information of who moves next into the Board structure. The functions findPossibleMoves and boardEval then get all the information they need from the Board structure.
One note on the recursion's break condition: if I understand your code correctly, you only have the one with iterations == 0. I think this means the algorithm has reached the desired depth, but what if there are no possible moves left before the algorithm reaches that depth? Maybe you should write the following:
vector<int> moves = findPossibleMoves(...);
if (!moves.size())
    return boardEval(...);
In your pseudocode, the node variable has to contain all the information about the current board position (or whatever). This information would include whose turn it is to move.