I'm new to OpenMP, and I'm trying to parallelize part of my program using locks on a 2-dimensional array. I won't go over all the details of my real problem, but instead discuss the following simplified example:
Let's say I have a huge group of N>>1 particles whose x,y locations are stored in some data structure. I want to create a 2D counter array that represents a grid, and count the particles in each cell of the grid:
count[j][k] += 1 (for each particle's corresponding j,k, determined from its x,y values)
Using locks seems a natural choice (the 2D array is 100x100 or larger, so on a 20-core machine the chance of two threads updating the same element simultaneously is still pretty low). The relevant parts of the code:
//define a 2D array of locks:
omp_lock_t **count_lock;
count_lock = new omp_lock_t* [J+1];
for(j=0; j<=J; j++) {
    count_lock[j] = new omp_lock_t[K+1];
    memset(count_lock[j], 0, (K+1)*sizeof(omp_lock_t));
}

//initializing:
for(j=0; j<=J; j++)
    for(k=0; k<=K; k++) {
        omp_init_lock(&(count_lock[j][k]));
        omp_unset_lock(&(count_lock[j][k]));
    }

//using the lock (after determining the right j,k):
omp_set_lock(&(count_lock[j][k]));
count[j][k] += 1;
omp_unset_lock(&(count_lock[j][k]));

//destroying the locks:
for(j=0; j<=J; j++)
    for(k=0; k<=K; k++)
        omp_destroy_lock(&(count_lock[j][k]));
for(j=0; j<=J; j++)
    delete[] count_lock[j];
delete[] count_lock;
The particles are divided into groups of 500. Each group is a linked list of particles, and the groups themselves also form a linked list. The parallelization comes from parallelizing the for loop that iterates over the particle groups.
For some reason I can't seem to get it right... no improvement in performance is obtained, and the simulation gets stuck after a few iterations.
I tried using "atomic", but it gave even worse performance than the serial code. Another option I tried was to create a private 2D array for each thread and then sum them up, as in the sketch below. I got some improvement that way, but it is pretty costly and I hope there is a better way.
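For reference, here is roughly what I mean by the private-array version: a minimal sketch, not my actual code, and it assumes the cell indices js[i], ks[i] of each particle have already been computed (instead of walking my linked lists):

#include <vector>

// Each thread counts into its own private grid, then the grids are merged.
void count_particles(const std::vector<int>& js, const std::vector<int>& ks,
                     std::vector<std::vector<long>>& count, int J, int K)
{
    #pragma omp parallel
    {
        // Private per-thread grid: the increments need no locks at all.
        std::vector<std::vector<long>> local(J + 1, std::vector<long>(K + 1, 0));

        #pragma omp for nowait
        for (long i = 0; i < (long)js.size(); i++)
            local[js[i]][ks[i]] += 1;

        // Merge the private grids into the shared count, one thread at a time.
        #pragma omp critical
        for (int j = 0; j <= J; j++)
            for (int k = 0; k <= K; k++)
                count[j][k] += local[j][k];
    }
}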
Thanks!
I need your expertise:
I am about to implement a graph class in C++ and am thinking about the right representation. The graphs are simple and undirected. The number of vertices is up to 1000 for now, but may be higher in the future; the number of edges is up to 200k, and maybe higher. Each vertex has a color (int) and an id (int). Edges carry no information beyond connecting two vertices.
I store the graph and just need to check whether x and y are connected - this I need very often.
After initialising, I never remove or add vertices or edges (N = number of vertices and M = number of edges are given from the start)!
The one representation already available to me:
An adjacency list rolled out into just one long list, along with an array of starting indices for each vertex. Storage is O(2M), and checking whether there is an edge between x and y takes O(M/N) on average (the average degree).
A representation I thought of:
The idea is, instead of rolling out the adjacency list into one array, to do the same with the adjacency matrix. So storage O(N^2)? Yes, but I want to store an edge in one bit instead of one byte (actually 2 bits, since it is stored symmetrically).
Example: Say N=8, then create a vector<uint8_t> of length 8 (64 bits). Initialize each entry to 0. If there is an edge between vertex 3 and vertex 5, add 2^5 to the entry belonging to vertex 3, and symmetrically for vertex 5. So there is a 1 in vertex 3's entry at vertex 5's position exactly when there is an edge between 3 and 5. After inserting my graph into this data structure, I think one should be able to test adjacency in constant time with a single bit operation: are 3 and 5 connected? Yes if (v[3] & (1 << 5)) != 0. When there are more than 8 vertices, every vertex gets more than one entry in the vector, and I need one modulo and one division operation to find the correct spot.
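A minimal sketch of what I have in mind (untested; the class and names are just illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

struct BitAdjacency {
    std::vector<uint8_t> bits;   // n*n bits, one per ordered vertex pair
    std::size_t n;

    explicit BitAdjacency(std::size_t n_) : bits((n_ * n_ + 7) / 8, 0), n(n_) {}

    void add_edge(std::size_t a, std::size_t b) {
        set(a, b);
        set(b, a);   // undirected: store the bit symmetrically
    }

    bool connected(std::size_t a, std::size_t b) const {
        std::size_t i = a * n + b;              // flat bit index
        return (bits[i / 8] >> (i % 8)) & 1u;   // one division, one modulo
    }

private:
    void set(std::size_t a, std::size_t b) {
        std::size_t i = a * n + b;
        bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
};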
What do you think of the second solution - is it maybe already known and in use?
Am I wrong in thinking the access is O(1)?
Is it too much effort for no real performance improvement?
The reason for rolling the representation out into one long list is, I was told, cache friendliness.
I am happy to get some feedback on this idea. I might be way off - please be kind in that case :D
A 1000x1000 matrix with 200,000 edges will be quite sparse. Since the graph is undirected, each edge is written twice in the matrix:
VertexA -> VertexB and VertexB -> VertexA
You will end up filling 40% of the matrix; the rest will be empty.
Edges
The best approach I can think of here is to use a 2D vector of booleans:
std::vector<std::vector<bool>> matrix(1000, std::vector<bool>(1000, false));
The lookup takes O(1) time, and std::vector<bool> saves space: the storage is not necessarily an array of bool values, as the library implementation may pack each value into a single bit. You will end up using 1,000,000 bits, or 125 kB (~122 KiB), of memory.
This will allow you to check for an edge like this:
if( matrix[3][5] )
{
    // vertices 3 and 5 are connected
}
else
{
    // vertices 3 and 5 are not connected
}
Vertices
If the id values of the vertices form a contiguous range of ints (e.g. 0,1,2,3,...,999) then you could store the color information in a std::vector<int>, which has O(1) access time:
std::vector<int> colors(1000);
This would use up memory equal to:
1000 * sizeof(int) = 4000 B ≈ 3.9 KiB (assuming a 4-byte int)
On the other hand, if the id values don't form a contiguous range of ints, it might be a better idea to use a std::unordered_map<int, int>, which will on average give you O(1) lookup time.
std::unordered_map<int, int> map;
So e.g. to store and look up the color of vertex 4:
map[4] = 5; // assign color 5 to vertice 4
std::cout << map[4]; // prints 5
The amount of memory used by std::unordered_map<int, int> will be at least:
1000 * 2 * sizeof(int) = 8000 B ≈ 7.8 KiB
plus the map's per-node bookkeeping overhead.
All together, for edges:

Type                           | Memory | Access time
std::vector<std::vector<bool>> | 125 kB | O(1)

and for vertices:

Type                         | Memory | Access time
std::vector<int>             | 3.9 kB | O(1)
std::unordered_map<int, int> | 7.8 kB | O(1) on avg.
If you go for a bit matrix, then the memory usage is O(V^2): ~1M bits, or ~125 kB, of which slightly less than half are duplicates.
If you instead make an array of the edges, O(E), plus another array indexing from each vertex into its first edge, you use 200K*sizeof(int), or 800 kB, which is much more. Half of it is also duplicates (A-B and B-A are the same edge), which here actually could be saved. Likewise, if you know (or can template your way out of it) that a vertex id fits in a uint16_t, half can be saved again.
To save the duplicate half, you just check which of the two vertices has the lower number and search its edges.
To find out when to stop looking, you use the index of the next vertex.
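A sketch of that lookup, under the assumptions above (edges holds, for each vertex, its sorted higher-numbered neighbours; index[v]..index[v+1] delimits them, with index of size N+1):

#include <algorithm>
#include <cstdint>
#include <vector>

bool connected(const std::vector<uint32_t>& edges,
               const std::vector<uint32_t>& index,
               uint32_t a, uint32_t b)
{
    if (a > b) std::swap(a, b);   // only the lower-numbered vertex stores the edge
    return std::binary_search(edges.begin() + index[a],
                              edges.begin() + index[a + 1], b);
}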
So with your numbers it is fine or even good to use a bit matrix.
The first problem comes when V^2/8 > E*4, though even then the binary search in the edge list would still be much slower than checking a bit. With E = V*200 (1000 vertices vs 200K edges), the crossover is:
V*V/8 > V*200*4
V/8 > 200*4
V > 200*4*8 = 6400
At V = 6400 the bit matrix takes 6400^2/8 = 5,120,000 bytes, ~5 MB, which easily fits into an L3 cache nowadays. If the connectivity (here the average number of connections per vertex) is higher than 200, so much the better.
Checking the edge list also costs roughly lg2(connectivity) steps, each with a potential branch mispredict, which gets rather steep; checking the bit matrix is O(1).
You would need to measure, among other things, the point where the bit matrix significantly overflows L3 while the edge list still fits, and the point where it spills into virtual memory.
In other words: with high connectivity the bit matrix should beat the edge list; with much lower connectivity or a much higher number of vertices, the edge list might be faster.
I am trying to make a 16x16 LED Snake game using Arduino (C++).
I need to assign a random grid index for the next food tile.
What I have is a list of indices that are occupied by the snake (snakeSquares).
So, my thought is that I need to generate a list of potential foodSquares. Then I can pick a random index from that list, and use the value there for my next food square.
I have some ideas for this but they seem kind of clunky, so I was looking for some feedback. I am using the Arduino LinkedList.h library for my lists in lieu of the standard library (and random() in place of rand()):
1. Generate a list (foodSquares) containing the integers [0, 255], so that the indices correspond to the values in the list (I don't know of a quick way to do this; it will probably need a for loop).
2. When generating the list of snakeSquares, set foodSquares[i] = -1 for each occupied index i. Afterwards, loop through foodSquares and remove all elements that equal -1.
3. Then generate a random number randNum in [0, foodSquares.size()-1] and make the next food square index equal to foodSquares[randNum] (sketched below).
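Roughly what I have in mind, as an untested sketch (it skips the -1 marking and filters directly; only LinkedList's add/get/size are used):

LinkedList<int> foodSquares;
for (int i = 0; i < 256; i++) {
    // keep only the squares not occupied by the snake
    bool occupied = false;
    for (int s = 0; s < snakeSquares.size(); s++) {
        if (snakeSquares.get(s) == i) { occupied = true; break; }
    }
    if (!occupied) foodSquares.add(i);
}
int nextFood = foodSquares.get(random(foodSquares.size()));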
So I guess my question is, will this approach work, and is there a better way to do it?
Thanks.
A potential approach that won't require more lists (rough sketch below):
1. Calculate a random integer representing a number of steps.
2. Take the head or tail as the starting tile.
3. For each step, move to a random free adjacent tile.
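A rough sketch of the idea (randomFreeNeighbour() is a hypothetical helper returning a random adjacent tile not occupied by the snake, or -1 if there is none):

int randomFreeNeighbour(int tile);   // hypothetical, see above

int randomWalkFood(int startTile) {
    int pos = startTile;
    long steps = random(1, 64);              // random step budget
    for (long s = 0; s < steps; s++) {
        int next = randomFreeNeighbour(pos);
        if (next < 0) break;                 // boxed in: stop early
        pos = next;
    }
    return pos;                              // the food goes here
}

Note that if the very first neighbour lookup fails, pos is still the (occupied) start tile, so a real implementation needs a fallback.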
I couldn't completely understand your question, as some of those points are quite a waste of processor time (i.e. points 1 and 2). But the first point can be solved quite easily, in time proportional to n, as follows:
// Note: the index must be wider than uint8_t, otherwise i < 256 is
// always true and the loop never terminates.
for (uint16_t i = 0; i < 256; i++) {
    // assuming there is a list of food_squares
    food_squares[i] = i;
}
As for the second point, you would have to set every food_square to -1, but to what end? Anyway, a way you could implement this is what VTT has said, which I will describe further (a sketch follows):
1. Take a random number in [0..255].
2. Is it one of the snake_squares? If so, go back to step 1; otherwise, go to step 3.
3. This is the same as your third point: use this random number to set the position of the food in food_square (food_square[random_number] = some_value).
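A sketch of that retry loop (untested; isSnakeSquare() is a hypothetical membership test over snake_squares):

bool isSnakeSquare(int tile);   // hypothetical, see above

int pickFoodSquare() {
    while (true) {
        int candidate = random(256);    // Arduino random: [0, 255]
        if (!isSnakeSquare(candidate))  // free square found
            return candidate;
    }
}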
I have a 2D array (100 x 100 in this case) with some states bounded by borders, as shown in the picture:
http://tinypic.com/view.php?pic=mimiw5&s=5#.UkK8WIamiBI
Each cell has its own id (a color; for example green is id=1) and a flag isBorder (marked as white in the picture if true). What I am trying to do is extract each set of cells with one state bounded by borders (a grain), so I can work on each grain separately, which means I need to store all the indexes belonging to each grain.
Anyone got an idea how to solve it?
Now that I've read your question again... the algorithm is essentially the same as filling a contiguous area with color. The most common way to do it is a BFS.
Simply start at some point you are sure lies inside the current area, then gradually move in every direction, marking the traversed fields and putting them into a vector.
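A minimal sketch of that BFS (the layout is assumed: grid[y][x] holds the cell id, isBorder[y][x] the border flag; it collects the indexes of one grain from a starting point known to lie inside it):

#include <queue>
#include <vector>

struct Cell { int x, y; };

std::vector<Cell> collectGrain(const std::vector<std::vector<int>>& grid,
                               const std::vector<std::vector<bool>>& isBorder,
                               int startX, int startY)
{
    const int H = (int)grid.size(), W = (int)grid[0].size();
    const int id = grid[startY][startX];   // the grain's state
    std::vector<std::vector<bool>> seen(H, std::vector<bool>(W, false));
    std::vector<Cell> grain;
    std::queue<Cell> q;

    q.push({startX, startY});
    seen[startY][startX] = true;
    while (!q.empty()) {
        Cell c = q.front(); q.pop();
        grain.push_back(c);
        const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
        for (int d = 0; d < 4; d++) {
            int nx = c.x + dx[d], ny = c.y + dy[d];
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;   // off-grid
            if (seen[ny][nx] || isBorder[ny][nx]) continue;         // done/border
            if (grid[ny][nx] != id) continue;                       // other grain
            seen[ny][nx] = true;
            q.push({nx, ny});
        }
    }
    return grain;
}

Run it once per still-unvisited non-border cell and you get every grain's index list.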
// Edit: A bunch of other insights, made before I understood the question.
I can possibly imagine an algorithm working like this:
vector<Coord2D> result = data.filter(DataType::Green);
for (Coord2D c : result) {
    // do some operations on data[c]
}
The implementation of filter in a simple unoptimized way would be to scan the whole array and push_back matching fields to the vector.
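For instance, something like this (a sketch; width(), height() and at(x, y) are assumed accessors of the container):

std::vector<Coord2D> filter(DataType wanted) const {
    std::vector<Coord2D> out;
    for (int y = 0; y < height(); y++)
        for (int x = 0; x < width(); x++)
            if (at(x, y) == wanted)   // full scan, collect matches
                out.push_back(Coord2D{x, y});
    return out;
}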
If you should need more complicated queries, lazily evaluated proxy objects can work miracles:
data.filter(DataType::Green)
.filter_having_neighbours(DataType::Red)
.closest(/*first*/ 100, /*from*/ Coord2D(x,y))
.apply([](DataField& field) {
// processing here
});
I'm writing a program that is supposed to sort a number of square tiles (each side colored in one of five colors: red, orange, blue, green and yellow) that lie next to each other (e.g. 8 rows and 12 columns), in a way that as many sides with the same color touch as possible. So, for instance, a tile whose right side is red should have a tile on its right with a red left side.
The result is evaluated by counting how many non-matching pairs of sides exist on the board. I'm pretty much done with the actual program; I just have some trouble with my sorting algorithm. Right now I'm using a bubble-sort-based algorithm that compares every piece on the board with every other piece and, if switching those two reduces the number of non-matching pairs of sides on the board, switches them. Here is an abstracted version of the sorting function, as it is now:
for(int i = 0; i < DimensionOfBoard.cx * DimensionOfBoard.cy; i++)
    for(int j = 0; j < DimensionOfBoard.cx * DimensionOfBoard.cy; j++)
    {
        // Comparing a piece with itself is useless
        if(i == j)
            continue;

        // v1 is the number of non-matching sides of both pieces
        // (max is 8, since we have 2 pieces with 4 sides each)
        int v1 = Board[i].GetNonmatchingSides() + Board[j].GetNonmatchingSides();

        // Switch the pieces; if this worsens the board (i.e. increases
        // the number of non-matching sides), we'll switch back
        SwitchPieces(Board[i], Board[j]);

        // If switching worsened the situation ...
        if(v1 < Board[i].GetNonmatchingSides() + Board[j].GetNonmatchingSides())
            // ... we switch back to the initial state
            SwitchPieces(Board[i], Board[j]);
    }
As an explanation: Board is a pointer to an array of Piece objects. Each Piece has four Piece pointers that point to the four adjacent pieces (or NULL, if the piece is a side/corner piece). And switching actually doesn't switch the pieces themselves, but rather switches the colors (instead of exchanging the pieces, it scrapes off the colors of both and swaps those).
This algorithm doesn't work too badly: it significantly improves the value of the board, but it doesn't optimize it as far as it should. I assume that's because side and corner pieces can't have more than three/two wrong adjacent pieces, since one/two of their sides are empty. I tried to compensate for that (by multiplying Board[i].GetMatchingPieces() by Board[i].GetHowManyNonemptySides() before comparing), but that didn't help a bit.
And that's where I need help. I don't know very many sorting algorithms, let alone ones that work with two-dimensional arrays. Does anyone know of an algorithmic concept that might help me improve my work? Or can anyone see a problem that I haven't found yet? Any help is appreciated. Thank you.
If there was a switch, you have to re-evaluate the whole board, because there might be previously considered positions where you could now find an improvement.
Note that you are only going to find a local minimum with those swaps. You might not be able to find any further improvement, but that doesn't mean you have the best board configuration.
One way to find a better configuration is to shuffle the board and search for a new local minimum, or to use an algorithmic skeleton that allows you to make bigger jumps in the state space, e.g. simulated annealing (sketched below).
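A sketch of simulated annealing on this board (illustrative only; SwitchPieces is adapted here to take indices, and BoardValue() is a hypothetical helper returning the total number of non-matching pairs):

#include <cmath>
#include <cstdlib>

void SwitchPieces(int i, int j);   // hypothetical: swap the colors of pieces i and j
int  BoardValue();                 // hypothetical: total non-matching pairs

void Anneal(int boardSize, double tStart, double tEnd, double cooling)
{
    int current = BoardValue();
    for (double t = tStart; t > tEnd; t *= cooling) {
        int i = std::rand() % boardSize;
        int j = std::rand() % boardSize;
        if (i == j) continue;

        SwitchPieces(i, j);
        int next = BoardValue();
        // Always accept improvements; accept a worse board with probability
        // exp(-delta/t), which shrinks as the temperature drops.
        if (next <= current ||
            std::exp((current - next) / t) > (double)std::rand() / RAND_MAX)
            current = next;
        else
            SwitchPieces(i, j);   // undo the swap
    }
}

The occasional accepted worsening is exactly what lets it jump out of the local minima that the plain swap loop gets stuck in.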
I'm implementing a tile engine for games using C++. Currently the game is divided into maps, each map has a 2D grid of sprites where each represents a tile.
I am coding a system where, if several maps are adjacent, you can walk from one to the other.
At startup of the game, all the maps are instantiated but "unloaded", i.e. the sprite objects are not in memory. When I'm close enough to an adjacent map, that map's sprites are "loaded" into memory by basically doing:
for(int i=0; i < sizeX; i++) {
for(int j=0; j < sizeY; j++) {
Tile *tile_ptr = new Tile(tileset, tilesId[i][j], i + offsetX, j + offsetY);
tilesMap[i][j] = tile_ptr;
}
}
And they are unloaded, by being destroyed the same way, when I am too far away from the map.
For a 50x50 map of sprites of 32x32 pixels, it takes roughly 0.3 s to load or unload, which happens within a single frame. My question is: what is a more efficient way to load/unload maps dynamically, even using a totally different mechanism? Thanks.
PS: I'm using SFML as a graphics library, but I'm not sure this changes anything.
A different possibility improves latency but increases the overall number of operations needed:
Instead of waiting until you are 'too close' to or 'too far' from a map, keep in memory the maps for a bigger square around the player (i.e. if the map is 50x50, store 150x150), but show only the 50x50. Now, on every step, compute the new 150x150 window; it requires about 150 destroy ops and 150 build ops per step (see the sketch below).
By doing so, you will actually build/destroy elements more times in total! But latency will improve, since you never wait 0.3 s to build 2,500 elements; each step only touches a small portion: 150*2 = 300 elements.
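A sketch of one such step, reusing the names from the question (windowLeft/windowRight are the cached window's edge columns; everything here is illustrative):

void slideWindowRight(int windowLeft, int windowRight, int sizeY)
{
    for (int j = 0; j < sizeY; j++) {
        delete tilesMap[windowLeft][j];    // destroy the old left-edge column
        tilesMap[windowLeft][j] = nullptr;
        tilesMap[windowRight + 1][j] =     // build the new right-edge column
            new Tile(tileset, tilesId[windowRight + 1][j],
                     windowRight + 1 + offsetX, j + offsetY);
    }
}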
I think it's a perfect occasion to learn multithreading and asynchronous calls.
It can seem complex if you're new to it, but it's a very useful skill to have.
It will still take 0.3 s to load (well, a bit more actually), but the game will not freeze.
That's what most games do. You can search SO for the various ways to do it in C++.
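For instance, a minimal std::async sketch, reusing the names from the question (installTiles() is a hypothetical function that swaps the finished tiles into tilesMap on the main thread):

#include <chrono>
#include <future>
#include <vector>

std::future<std::vector<Tile*>> pending = std::async(std::launch::async, [&] {
    std::vector<Tile*> tiles;
    tiles.reserve(sizeX * sizeY);
    for (int i = 0; i < sizeX; i++)
        for (int j = 0; j < sizeY; j++)
            tiles.push_back(new Tile(tileset, tilesId[i][j],
                                     i + offsetX, j + offsetY));
    return tiles;
});

// Then poll once per frame, without blocking:
if (pending.valid() &&
    pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
    installTiles(pending.get());

One SFML caveat: keep texture/OpenGL resource creation on the render thread; only build plain data (like these tile objects) on the worker.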