On-the-fly terrain chunk generation - c++

I'm writing an engine that can generate landscapes using noise functions, and load in new chunks as the player moves around the terrain. I spent the best part of two days figuring out how to place these chunks in the right position, so they don't overlap or get placed on top of existing chunks. It works well functionally, but there is a massive performance hit the further away you generate the chunks from the player (e.g. if you generate in a 3 chunk radius around the player, it's lightning fast, but if you increase that to a radius of 20 chunks it slows down dramatically).
I know exactly why that is, but I can't think of any other way to do this. Before I go any further, here's the code I'm currently using. Hopefully it's commented well enough to understand:
// Get the player's position rounded to the nearest chunk on the grid.
D3DXVECTOR3 roundedPlayerPos(SnapToMultiple(m_Dx->m_Camera->GetPosition().x, CHUNK_X), 0, SnapToMultiple(m_Dx->m_Camera->GetPosition().z, CHUNK_Z));
// Iterate through every point on an invisible grid. At each point, check if it is
// inside a circle the size of the grid (so we generate chunks in a circle around
// the player, not a square). At each point that is inside the circle, add a chunk to
// the ChunksToAdd vector.
for (int x = -CHUNK_RANGE-1; x <= CHUNK_RANGE; x++)
{
for (int z = -CHUNK_RANGE-1; z <= CHUNK_RANGE; z++)
{
if (IsInside(roundedPlayerPos, CHUNK_X*CHUNK_RANGE, D3DXVECTOR3(roundedPlayerPos.x+x*CHUNK_X, 0, roundedPlayerPos.z+z*CHUNK_Z)))
{
Chunk chunkToAdd;
chunkToAdd.chunk = 0;
chunkToAdd.position = D3DXVECTOR3((roundedPlayerPos.x + x*CHUNK_X), 0, (roundedPlayerPos.z + z*CHUNK_Z));
chunkToAdd.chunkExists = false;
m_ChunksToAdd.push_back(chunkToAdd);
}
}
}
// Iterate through the ChunksToAdd vector. For each chunk in this vector, compare its
// position to every chunk in the Chunks vector (which stores each generated chunk).
// If the statement returns true, then there is already a chunk at that location, and
// we don't need to generate another.
for (i = 0; i < m_ChunksToAdd.size(); i++)
{
for (int j = 0; j < m_Chunks.size(); j++)
{
// Check the chunk in the ChunksToAdd vector with the chunk in the Chunks vector (chunks which are already generated).
if (m_ChunksToAdd[i].position.x == m_Chunks[j].position.x && m_ChunksToAdd[i].position.z == m_Chunks[j].position.z)
{
m_ChunksToAdd[i].chunkExists = true;
}
}
}
// Determine the closest chunk to the player, so we can generate that first.
// Iterate through the ChunksToAdd vector, and if the chunk doesn't exist yet (if it
// does exist, we're not going to generate it, so ignore it), compare the current (j)
// chunk against the current closest chunk. If it is farther away, move on, and if it is
// closer, store its index as the new closest chunk.
int closest = 0;
for (j = 0; j < m_ChunksToAdd.size(); j++)
{
if (!m_ChunksToAdd[j].chunkExists)
{
// Get the distance from the player to the chunk for the current closest chunk, and
// the chunk being tested.
float x1 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[j].position));
float x2 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[closest].position));
// If the chunk being tested is closer to the player, make it the new closest chunk.
if (x1 <= x2)
closest = j;
}
}
// After determining the position of the closest chunk, generate the volume and mesh, and add it
// to the Chunks vector for rendering.
if (!m_ChunksToAdd[closest].chunkExists) // Only add it if the chunk doesn't already exist in the Chunks vector.
{
Chunk chunk;
chunk.chunk = new chunkClass;
chunk.chunk->m_Position = m_ChunksToAdd[closest].position;
chunk.chunk->GenerateVolume(m_Simplex);
chunk.chunk->GenerateMesh(m_Dx->GetDevice());
chunk.position = m_ChunksToAdd[closest].position;
chunk.chunkExists = true;
m_Chunks.push_back(chunk);
}
// Clear the ChunksToAdd vector ready for another frame.
m_ChunksToAdd.clear();
(if it wasn't already obvious, this is run every frame.)
The problem area is to do with the CHUNK_RANGE variable. The larger this value, the more the first two loops are iterated through each frame, slowing the whole thing down tremendously. I need some advice or suggestions on how to do this more efficiently, thanks.
EDIT: Here's some improved code:
// Get the player's position rounded to the nearest chunk on the grid.
D3DXVECTOR3 roundedPlayerPos(SnapToMultiple(m_Dx->m_Camera->GetPosition().x, CHUNK_X), 0, SnapToMultiple(m_Dx->m_Camera->GetPosition().z, CHUNK_Z));
// Find if the player has changed into another chunk, if they have, we will scan
// to see if more chunks need to be generated.
static D3DXVECTOR3 roundedPlayerPosOld = roundedPlayerPos;
static bool playerPosChanged = true;
if (roundedPlayerPosOld != roundedPlayerPos)
{
roundedPlayerPosOld = roundedPlayerPos;
playerPosChanged = true;
}
// Iterate through every point on an invisible grid. At each point, check if it is
// inside a circle the size of the grid (so we generate chunks in a circle around
// the player, not a square). At each point that is inside the circle, add a chunk to
// the ChunksToAdd vector.
if (playerPosChanged)
{
m_ChunksToAdd.clear();
for (int x = -CHUNK_CREATE_RANGE-1; x <= CHUNK_CREATE_RANGE; x++)
{
for (int z = -CHUNK_CREATE_RANGE-1; z <= CHUNK_CREATE_RANGE; z++)
{
if (IsInside(roundedPlayerPos, CHUNK_X*CHUNK_CREATE_RANGE, D3DXVECTOR3(roundedPlayerPos.x+x*CHUNK_X, 0, roundedPlayerPos.z+z*CHUNK_Z)))
{
bool chunkExists = false;
for (int j = 0; j < m_Chunks.size(); j++)
{
// Check the chunk in the ChunksToAdd vector with the chunk in the Chunks vector (chunks which are already generated).
if ((roundedPlayerPos.x + x*CHUNK_X) == m_Chunks[j].position.x && (roundedPlayerPos.z + z*CHUNK_Z) == m_Chunks[j].position.z)
{
chunkExists = true;
break;
}
}
if (!chunkExists)
{
Chunk chunkToAdd;
chunkToAdd.chunk = 0;
chunkToAdd.position = D3DXVECTOR3((roundedPlayerPos.x + x*CHUNK_X), 0, (roundedPlayerPos.z + z*CHUNK_Z));
m_ChunksToAdd.push_back(chunkToAdd);
}
}
}
}
}
playerPosChanged = false;
// If there are chunks to render.
if (m_ChunksToAdd.size() > 0)
{
// Determine the closest chunk to the player, so we can generate that first.
// Iterate through the ChunksToAdd vector (it now only holds chunks that don't exist
// yet) and compare the current (j) chunk against the current closest chunk.
// If it is farther away, move on, and if it is closer, store its index as the
// new closest chunk.
int closest = 0;
for (j = 0; j < m_ChunksToAdd.size(); j++)
{
// Get the distance from the player to the chunk for the current closest chunk, and
// the chunk being tested.
float x1 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[j].position));
float x2 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[closest].position));
// If the chunk being tested is closer to the player, make it the new closest chunk.
if (x1 <= x2)
closest = j;
}
// After determining the position of the closest chunk, generate the volume and mesh, and add it
// to the Chunks vector for rendering.
Chunk chunk;
chunk.chunk = new chunkClass;
chunk.chunk->m_Position = m_ChunksToAdd[closest].position;
chunk.chunk->GenerateVolume(m_Simplex);
chunk.chunk->GenerateMesh(m_Dx->GetDevice());
chunk.position = m_ChunksToAdd[closest].position;
m_Chunks.push_back(chunk);
m_ChunksToAdd.erase(m_ChunksToAdd.begin()+closest);
}
// Remove chunks that are far away from the player.
for (i = 0; i < m_Chunks.size(); i++)
{
if (DistanceFrom(roundedPlayerPos, m_Chunks[i].position) > (CHUNK_REMOVE_RANGE*CHUNK_X)*(CHUNK_REMOVE_RANGE*CHUNK_X))
{
m_Chunks[i].chunk->Shutdown();
delete m_Chunks[i].chunk;
m_Chunks[i].chunk = 0;
m_Chunks.erase(m_Chunks.begin()+i);
}
}

Have you tried profiling it to work out exactly where the bottleneck is?
Do you need to check all of those chunks or could you get away with checking the direction the player is looking and only generate the ones in view?
Is there any reason why you draw the chunk closest to the player first if you're generating it all once per frame before displaying it? Skipping the stage where you sort them may free up a bit of processing power.
Is there any reason you couldn't combine the first two loops to just create a vector of chunks which need generating?
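To illustrate the last suggestion, here is a minimal sketch of combining both loops into a single pass. It swaps the linear scan over the vector of generated chunks for a hash set keyed on grid coordinates, which is a different technique from the code in the question. `ChunkCoord`, `PackCoord` and `FindMissingChunks` are hypothetical names, and coordinates are assumed to be in whole-chunk units rather than world units.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical grid coordinate of a chunk (in chunk units, not world units).
struct ChunkCoord { int x; int z; };

// Pack both coordinates into one 64-bit key so the set needs no custom hash.
inline std::uint64_t PackCoord(int x, int z)
{
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(x)) << 32) |
           static_cast<std::uint32_t>(z);
}

// Returns every chunk within `range` chunks of the player that is not in `loaded`.
std::vector<ChunkCoord> FindMissingChunks(int playerX, int playerZ, int range,
                                          const std::unordered_set<std::uint64_t>& loaded)
{
    assert(range >= 0);
    std::vector<ChunkCoord> missing;
    for (int dx = -range; dx <= range; ++dx)
    {
        for (int dz = -range; dz <= range; ++dz)
        {
            if (dx * dx + dz * dz > range * range)
                continue; // outside the circle around the player
            const int cx = playerX + dx;
            const int cz = playerZ + dz;
            if (loaded.count(PackCoord(cx, cz)) == 0) // average O(1) lookup
                missing.push_back(ChunkCoord{cx, cz});
        }
    }
    return missing;
}
```

With this layout the per-cell existence check no longer depends on how many chunks are already generated, so the cost of one scan grows with the area of the circle only.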

It sounds like you're trying to do too much work (i.e. building chunks) on the render thread. If you can do the work of a three chunk radius really fast you should limit it to that per frame. How many chunks are you trying to generate, in each situation, per frame?
I'm going to assume that generating each chunk is independent, therefore, you can probably move the work to another thread - then show the chunk when it is ready.
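As a rough sketch of the "limit it per frame" idea (not of the threaded variant), the pending chunks can be sorted by distance once, whenever the player enters a new chunk, and then a fixed budget can be consumed each frame. `PendingChunk`, `SortFarthestFirst` and `GenerateSome` are hypothetical names, and the `generate` callback stands in for the GenerateVolume/GenerateMesh calls.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct PendingChunk { int x; int z; }; // hypothetical grid coordinates

// Sort pending chunks so the one nearest the player ends up at the back,
// which makes taking the next chunk a cheap pop_back().
void SortFarthestFirst(std::vector<PendingChunk>& pending, int playerX, int playerZ)
{
    auto distSq = [=](const PendingChunk& c) {
        const long long dx = c.x - playerX;
        const long long dz = c.z - playerZ;
        return dx * dx + dz * dz;
    };
    std::sort(pending.begin(), pending.end(),
              [&](const PendingChunk& a, const PendingChunk& b) { return distSq(a) > distSq(b); });
}

// Called once per frame: hands at most `budget` chunks to `generate`
// and returns how many were actually generated.
template <typename GenerateFn>
int GenerateSome(std::vector<PendingChunk>& pending, int budget, GenerateFn generate)
{
    assert(budget >= 0);
    int done = 0;
    while (done < budget && !pending.empty())
    {
        generate(pending.back()); // nearest remaining chunk
        pending.pop_back();
        ++done;
    }
    return done;
}
```

The same queue could feed a worker thread instead of the render loop; that part is left out here because it depends on whether the device calls are safe off the render thread.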

Tallest tower with stacked boxes in the given order

Given N boxes, how can I find the tallest tower that can be made with them in the given order? (The given order means that the first box must be at the base of the tower, and so on.) All boxes must be used to make a valid tower.
A box can be rotated about any axis so that any of its 6 faces is parallel to the ground; however, the perimeter of that face must lie completely inside the perimeter of the top face of the box below it. For the first box any face can be chosen, because the ground is big enough.
To solve this problem I've tried the following:
- First, the code generates the rotations for each rectangle (just a permutation of the dimensions)
- Second, it constructs a dynamic programming solution for each box and each possible rotation
- Finally, it searches for the highest tower made (in the dp table)
But my algorithm gets a wrong answer on unknown test cases. What is wrong with it? Is dynamic programming the best approach to solve this problem?
Here is my code:
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <cstring>
struct rectangle{
int coords[3];
rectangle(){ coords[0] = coords[1] = coords[2] = 0; }
rectangle(int a, int b, int c){coords[0] = a; coords[1] = b; coords[2] = c; }
};
bool canStack(rectangle &current_rectangle, rectangle &last_rectangle){
for (int i = 0; i < 2; ++i)
if(current_rectangle.coords[i] > last_rectangle.coords[i])
return false;
return true;
}
//six is the number of rotations for each rectangle
int dp(std::vector< std::vector<rectangle> > &v){
int memoization[6][v.size()];
memset(memoization, -1, sizeof(memoization));
//all rotations of the first rectangle can be used
for (int i = 0; i < 6; ++i) {
memoization[i][0] = v[0][i].coords[2];
}
//for each rectangle
for (int i = 1; i < v.size(); ++i) {
//for each possible permutation of the current rectangle
for (int j = 0; j < 6; ++j) {
//for each permutation of the previous rectangle
for (int k = 0; k < 6; ++k) {
rectangle &prev = v[i - 1][k];
rectangle &curr = v[i][j];
//is possible to put the current rectangle with the previous rectangle ?
if( canStack(curr, prev) ) {
memoization[j][i] = std::max(memoization[j][i], curr.coords[2] + memoization[k][i-1]);
}
}
}
}
//what is the best solution ?
int ret = -1;
for (int i = 0; i < 6; ++i) {
ret = std::max(memoization[i][v.size()-1], ret);
}
return ret;
}
int main ( void ) {
int n;
scanf("%d", &n);
std::vector< std::vector<rectangle> > v(n);
for (int i = 0; i < n; ++i) {
rectangle r;
scanf("%d %d %d", &r.coords[0], &r.coords[1], &r.coords[2]);
//generate all rotations with the given rectangle (all combinations of the coordinates)
for (int j = 0; j < 3; ++j)
for (int k = 0; k < 3; ++k)
if(j != k) //micro optimization disease
for (int l = 0; l < 3; ++l)
if(l != j && l != k)
v[i].push_back( rectangle(r.coords[j], r.coords[k], r.coords[l]) );
}
printf("%d\n", dp(v));
}
Input Description
A test case starts with an integer N, representing the number of boxes (1 ≤ N ≤ 10^5).
Following there will be N rows, each containing three integers, A, B and C, representing the dimensions of the boxes (1 ≤ A, B, C ≤ 10^4).
Output Description
Print one row containing one integer, representing the maximum height of the stack if it’s possible to pile all the N boxes, or -1 otherwise.
Sample Input
2
5 2 2
1 3 4
Sample Output
6
Sample image for the given input and output.
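For reference, the three steps described in the question can be sketched as below. This is not the asker's code: `Box`, `Orientations` and `TallestTower` are hypothetical names, the table is reduced to two rows of six entries, and an unreachable state (marked -1) is never extended. That last guard may matter, because adding a height to a -1 predecessor would make an impossible stack look possible.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Box = std::array<int, 3>;

// The six orientations of a box as (width, depth, height).
std::array<Box, 6> Orientations(const Box& b)
{
    return {{ Box{b[0], b[1], b[2]}, Box{b[1], b[0], b[2]},
              Box{b[0], b[2], b[1]}, Box{b[2], b[0], b[1]},
              Box{b[1], b[2], b[0]}, Box{b[2], b[1], b[0]} }};
}

// Returns the tallest tower height, or -1 if the boxes cannot all be stacked in order.
long long TallestTower(const std::vector<Box>& boxes)
{
    assert(!boxes.empty());
    // best[j]: tallest tower so far whose top box uses orientation j, -1 if unreachable.
    std::array<long long, 6> best;
    std::array<Box, 6> prev = Orientations(boxes[0]);
    for (int j = 0; j < 6; ++j)
        best[j] = prev[j][2]; // any orientation works on the ground

    for (std::size_t i = 1; i < boxes.size(); ++i)
    {
        const std::array<Box, 6> curr = Orientations(boxes[i]);
        std::array<long long, 6> next;
        next.fill(-1);
        for (int j = 0; j < 6; ++j)
        {
            for (int k = 0; k < 6; ++k)
            {
                if (best[k] < 0)
                    continue; // predecessor unreachable: do not extend it
                if (curr[j][0] <= prev[k][0] && curr[j][1] <= prev[k][1])
                    next[j] = std::max(next[j], best[k] + curr[j][2]);
            }
        }
        best = next;
        prev = curr;
    }
    return *std::max_element(best.begin(), best.end());
}
```

Keeping only the previous row also avoids a table of 6*N entries on the stack, which is worth considering with N up to 10^5.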
Usually you're given the test case that made you fail. Otherwise, finding the problem is a lot harder.
You can always approach it from a different angle! I'm going to leave out the boring parts that are easily replicated.
struct Box { unsigned int dim[3]; };
Box will store the dimensions of each... box. When it comes time to read the dimensions, it needs to be sorted so that dim[0] >= dim[1] >= dim[2].
The idea is to loop and read the next box each iteration. It then compares the second largest dimension of the new box with the second largest dimension of the last box, and same with the third largest. If in either case the newer box is larger, it adjusts the older box to compare the first largest and third largest dimension. If that fails too, then the first and second largest. This way, it always prefers using a larger dimension as the vertical one.
If it had to rotate a box, it goes to the next box down and checks that the rotation doesn't need to be adjusted there too. It continues until there are no more boxes or it didn't need to rotate the next box. If at any time, all three rotations for a box failed to make it large enough, it stops because there is no solution.
Once all the boxes are in place, it just sums up each one's vertical dimension.
int main()
{
unsigned int size; //num boxes
std::cin >> size;
std::vector<Box> boxes(size); //all boxes
std::vector<unsigned char> pos(size, 0); //index of vertical dimension
//gets the index of dimension that isn't vertical
//largest indicates if it should pick the larger or smaller one
auto get = [](unsigned char x, bool largest) { if (largest) return x == 0 ? 1 : 0; return x == 2 ? 1 : 2; };
//check will compare the dimensions of two boxes and return true if the smaller one is under the larger one
auto check = [&boxes, &pos, &get](unsigned int x, bool largest) { return boxes[x - 1].dim[get(pos[x - 1], largest)] < boxes[x].dim[get(pos[x], largest)]; };
unsigned int x = 0, y; //indexing variables
unsigned char change; //detects box rotation change
bool fail = false; //if it cannot be solved
for (x = 0; x < size && !fail; ++x)
{
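//read in the next three dimensions, then sort them in descending order
//so that dim[0] >= dim[1] >= dim[2] (std::sort needs <algorithm>, std::greater needs <functional>)
std::cin >> boxes[x].dim[0] >> boxes[x].dim[1] >> boxes[x].dim[2];
std::sort(boxes[x].dim, boxes[x].dim + 3, std::greater<unsigned int>());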
y = x;
while (y && !fail) //when y == 0, no more boxes to check
{
change = pos[y - 1];
while (check(y, true) || check(y, false)) //while invalid rotation
{
if (++pos[y - 1] == 3) //rotate, when pos == 3, no solution
{
fail = true;
break;
}
}
if (change != pos[y - 1]) //if rotated box
--y;
else
break;
}
}
if (fail)
{
std::cout << -1;
}
else
{
unsigned long long max = 0;
for (x = 0; x < size; ++x)
max += boxes[x].dim[pos[x]];
std::cout << max;
}
return 0;
}
It works for the test cases I've written, but given that I don't know what caused yours to fail, I can't tell you what mine does differently (assuming it also doesn't fail your test conditions).
If you are allowed, this problem might benefit from a tree data structure.
First, define the three possible cases of block:
1) Cube - there is only one possible option for orientation, since every orientation results in the same height (applied toward total height) and the same footprint (applied to the restriction that the footprint of each block is completely contained by the block below it).
2) Square Rectangle - there are three possible orientations for this rectangle with two equal dimensions (for example, a 4x4x1 or a 4x4x7 would both fit this).
3) All Different Dimensions - there are six possible orientations for this shape, where each side is different from the rest.
For the first box, choose how many orientations its shape allows, and create corresponding nodes at the first level (a root node with zero height will allow using simple binary trees, rather than requiring a more complicated type of tree that allows multiple elements within each node). Then, for each orientation, choose how many orientations the next box allows but only create nodes for those that are valid for the given orientation of the current box. If no orientations are possible given the orientation of the current box, remove that entire unique branch of orientations (the first parent node with multiple valid orientations will have one orientation removed by this pruning, but that parent node and all of its ancestors will be preserved otherwise).
By doing this, you can check for sets of boxes that have no solution by checking whether there are any elements below the root node, since an empty tree indicates that all possible orientations have been pruned away by invalid combinations.
If the tree is not empty, walk it recursively to find the highest sum of heights along any branch, from the leaves up to the root; that sum is your maximum height. For example, in pseudocode:
std::size_t maximum_height() const{
// A missing child contributes nothing; checking each child separately means
// a node with only one child still follows that child.
std::size_t leftheight = (leftnode != nullptr) ? leftnode->maximum_height() : 0;
std::size_t rightheight = (rightnode != nullptr) ? rightnode->maximum_height() : 0;
// This node's height plus the taller of the two branches below it.
return this_node_box_height + std::max(leftheight, rightheight);
}
The benefits of using a tree data structure are
1) You will greatly reduce the number of possible combinations you have to store and check, because in a tree, the invalid orientations will be eliminated at the earliest possible point - for example, using your 2x2x5 first box, with three possible orientations (as a Square Rectangle), only two orientations are possible because there is no possible way to orient it on its 2x2 end and still fit the 4x3x1 block on it. If on average only two orientations are possible for each block, you will need a much smaller number of nodes than if you compute every possible orientation and then filter them as a second step.
2) Detecting sets of blocks where there is no solution is much easier, because the data structure will only contain valid combinations.
3) Working with the finished tree will be much easier - for example, to find the sequence of orientations of the highest tower, rather than just its height, you could pass an empty std::vector to a modified maximum_height() implementation, and let it append the actual orientation of each highest node as it walks the tree, in addition to returning the height.

grass fire algorithm taking way too long, how to optimize?

So I am working with OpenCV and trying to write a bunch of algorithms "from scratch", so to speak, so that I can really understand what the library is doing. I wrote a modified grass fire algorithm to segment BLOBs from an image that I have already digitized. However, the algorithm takes over 2 minutes to run on my very capable laptop (16 GB of RAM, quad-core i7, etc.). What am I doing here that is making it so slow? Alternatively, is there a better algorithm for extracting BLOBs from a digitized image?
THANKS!
Here is the algorithm
std::vector<boundingBox> grassFire(cv::Mat digitalImage){
std::vector<boundingBox> blobList;
int minY, minX, maxY, maxX, area, yRadius, xRadius, xCenter, yCenter;
for(int curRow = 0; curRow<digitalImage.rows; curRow++){
for(int curCol = 0; curCol<digitalImage.cols; curCol++){
//if there is something at that spot in the image
if((int)digitalImage.at<unsigned char>(curRow, curCol)){
minY = curRow;
maxY = curRow;
minX = curCol;
maxX = curCol;
area = 0;
yRadius = 0;
xRadius = 0;
for(int fireRow=curRow; fireRow<digitalImage.rows; fireRow++){
//is in keeps track of the row and started keeps track of the col
//is in will break if no pixel in the row is part of the blob
//started will break the inner loop if a nonpixel is reached AFTER a pixel is reached
bool isIn = false;
bool started = false;
for(int fireCol = curCol; fireCol<digitalImage.cols; fireCol++){
//make sure that the pixel is still in
if((int)digitalImage.at<unsigned char>(fireRow, fireCol)){
//signal that an in pixel has been found
started = true;
//signal that the row is still in
isIn = true;
//add to the area
area++;
//reset the extrema variables
if(fireCol > maxX){maxX = fireCol;}
if(fireCol < minX){minX = fireCol;}
if(fireRow > maxY){maxY = fireRow;}
//no need to check min y since it is set already by the loop trigger
//set the checked pixel values to 0 to avoid double counting
digitalImage.at<unsigned char>(fireRow, fireCol) = 0;
}
//break if the next pixel is not in and youve already seen an in pixel
//do nothing otherwise
else{if(started){break;}}
//if the entire blob has been detected
if(!isIn){break;}
}
}
}else{}//just continue the loop if the current pixel is not in
//calculate all blob specific values for the blob at hand
xRadius =(int)((double)(maxX - minX)/2.);
yRadius =(int)((double)(maxY - minY)/2.);
xCenter = maxX - xRadius;
yCenter = maxY - yRadius;
//add the blob to the vector in the appropriate position (largest area first)
int pos = 0;
for(auto elem : blobList){
if(elem.getArea() > area){
pos++;
}
else{break;}
}
blobList.insert(blobList.begin() + pos, boundingBox(area, xRadius, yRadius, xCenter, yCenter));
}
}
return blobList;
}
You say "just continue the loop if the current pixel is not in", but you don't continue the loop there; you fall through to the code that adds another element to blobList (code which will access past the end of the list if no element satisfies the condition in that for loop).
And using this
for(const auto &elem : blobList)
would avoid making copies of all those boundingBoxes.
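On the question of a better algorithm, here is a minimal sketch of a conventional grass-fire (flood fill) pass with an explicit stack. It assumes 4-connectivity and uses a plain vector-of-vectors image so it stays self-contained; `Image`, `Blob` and `FindBlobs` are hypothetical names, not OpenCV types, and the bounding box is stored as min/max extents rather than the center/radius form used in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Blob { int area, minX, maxX, minY, maxY; };      // hypothetical result type
using Image = std::vector<std::vector<unsigned char>>;  // non-zero = foreground

// Taken by value on purpose: the fill erases pixels as it visits them.
std::vector<Blob> FindBlobs(Image img)
{
    std::vector<Blob> blobs;
    const int rows = static_cast<int>(img.size());
    const int cols = rows ? static_cast<int>(img[0].size()) : 0;
    const int dy[4] = {-1, 1, 0, 0};
    const int dx[4] = {0, 0, -1, 1};
    std::vector<std::pair<int, int>> stack; // explicit stack instead of recursion

    for (int r = 0; r < rows; ++r)
    {
        assert(static_cast<int>(img[r].size()) == cols); // rectangular image assumed
        for (int c = 0; c < cols; ++c)
        {
            if (!img[r][c])
                continue; // background pixel: really skip it
            Blob b{0, c, c, r, r};
            img[r][c] = 0;
            stack.push_back(std::make_pair(r, c));
            while (!stack.empty())
            {
                const int y = stack.back().first;
                const int x = stack.back().second;
                stack.pop_back();
                ++b.area;
                b.minX = std::min(b.minX, x); b.maxX = std::max(b.maxX, x);
                b.minY = std::min(b.minY, y); b.maxY = std::max(b.maxY, y);
                for (int n = 0; n < 4; ++n)
                {
                    const int ny = y + dy[n], nx = x + dx[n];
                    if (ny >= 0 && ny < rows && nx >= 0 && nx < cols && img[ny][nx])
                    {
                        img[ny][nx] = 0; // clear on push so no pixel is queued twice
                        stack.push_back(std::make_pair(ny, nx));
                    }
                }
            }
            blobs.push_back(b);
        }
    }
    // Largest area first, sorted once at the end instead of on every insert.
    std::sort(blobs.begin(), blobs.end(),
              [](const Blob& a, const Blob& b) { return a.area > b.area; });
    return blobs;
}
```

Each pixel is pushed and popped at most once, so the whole pass is linear in the number of pixels, and a blob is only recorded when a foreground pixel was actually found.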

How to randomly position a push_back()ed sprite from a list - SFML?

I am making a space-based shooter using SFML and MS Visual Studio 10 in C++. My enemy sprites are stored in a std::list. In order to keep them coming indefinitely, I used a large size for the list, but this affects performance and the game will eventually run out of enemies after a period of time. So instead I opted to push_back() an element into the list every time I erase an element, so as to keep the size of the list constant and spawn enemies an indefinite number of times. This does not affect performance. However, every time an enemy sprite is erased, a new sprite is generated at position (0,0), i.e. the top left corner. So, in order to randomly generate their positions off-screen, I used the same RNG I used to initialize their positions at the start of the program. Now they are being randomly generated off-screen, but the whole list is being erased and a new list is being generated again. Here's my code:
std::list<sf::Sprite>::iterator enemyit = enemy.begin(), next;
while(enemyit != enemy.end())
{
next = enemyit;
next++;
if(enemyit->getGlobalBounds().intersects(player.getGlobalBounds()))
{
enemy.erase(enemyit);
enemy.push_back(sf::Sprite(enemytex));
srand(time(NULL));
float y = -200;
for(std::list<sf::Sprite>::iterator enemyit = enemy.begin(); enemyit != enemy.end(); enemyit++)
{
float x = rand() % dist + wastage;
enemyit->setPosition(x, y);
y = y - enemyit->getGlobalBounds().height * 2;
}
}
enemyit = next;
}
Obviously, it's the for loop. I am iterating the whole list. How do I change it so that only one element is generated when only one is erased? What is the condition I should set? I've tried everything to the best of my knowledge, but nothing. Please help me on this.
First of all, your current code will give you a lot of problems with those loops (the while and the for), because you are adding and removing items of a list while you are iterating over it.
The best solution to your problem is to do a first loop to check all the enemies that have been destroyed, and after that check, run a second loop to generate the replacements.
std::list<sf::Sprite>::iterator enemyit = enemy.begin(), next;
int erasedEnemies = 0;
while(enemyit != enemy.end())
{
next = enemyit;
next++;
if(enemyit->getGlobalBounds().intersects(player.getGlobalBounds()))
{
enemy.erase(enemyit);
++erasedEnemies;
}
enemyit = next;
}
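// Call srand(time(NULL)) once at program start-up rather than here: reseeding
// within the same second restarts the sequence and repeats the same x values.
float y = -200; // declared outside the loop so each new enemy is placed above the previous one
for( int i = 0; i < erasedEnemies; ++i )
{
sf::Sprite tempSprite(enemytex);
float x = rand() % dist + wastage;
tempSprite.setPosition(x, y);
y = y - tempSprite.getGlobalBounds().height * 2;
enemy.push_back(tempSprite);
}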
Hope it helps you.
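The two-pass idea can also be sketched without SFML, which makes the iterator handling easier to see. This is an illustration under assumptions: `Enemy` is a hypothetical stand-in for sf::Sprite with a precomputed collision flag, `RespawnHitEnemies` is a made-up helper, the 100-unit vertical spacing is arbitrary, and `<random>` replaces rand().

```cpp
#include <cassert>
#include <list>
#include <random>

// Hypothetical stand-in for sf::Sprite plus the result of the collision test.
struct Enemy { float x, y; bool hit; };

// Removes every enemy flagged as hit and appends one replacement per removal.
// Only the replacements get new positions; surviving enemies are left alone.
int RespawnHitEnemies(std::list<Enemy>& enemies, std::mt19937& rng, float minX, float maxX)
{
    assert(minX < maxX);
    int erased = 0;
    for (auto it = enemies.begin(); it != enemies.end(); )
    {
        if (it->hit)
        {
            it = enemies.erase(it); // erase() returns the next valid iterator
            ++erased;
        }
        else
        {
            ++it;
        }
    }

    std::uniform_real_distribution<float> distX(minX, maxX);
    float y = -200.0f; // first replacement starts just above the screen
    for (int i = 0; i < erased; ++i)
    {
        enemies.push_back(Enemy{distX(rng), y, false});
        y -= 100.0f; // hypothetical spacing between replacements
    }
    return erased;
}
```

The generator is created and seeded once by the caller and passed in by reference, so repeated calls keep drawing fresh positions.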

How to efficiently change a contiguous portion of a matrix?

Given a matrix of M rows and N columns, allocated as a byte array of M*N elements (initially all set to zero), I would like to modify this matrix according to the following rule: the elements in the neighborhood of a certain element must be set to a given value. In other words, I need to set a square region of the matrix, which means accessing non-contiguous portions of the array.
In order to perform the above operation, I have access to the following information:
the pointer to the element that is located in the center of the neighborhood (this pointer must not be changed during the above operation); the position (row and column) of this element is also provided;
the size L*L of the neighborhood (L is always an odd number).
The code that implements this operation should be executed as fast as possible in C++: for this reason I thought of using the above pointer to access different pieces of the array. Instead, the position (row and column) of the central element of the neighborhood could allow me to check whether the specified region exceeds the dimensions of the matrix (for example, the center of the region may be located on the edge of the matrix): in this case I should set only that part of the region that is located in the matrix.
int M = ... // number of matrix rows
int N = ... // number of matrix columns
char* centerPtr = ... // pointer to the center of the region
int i = ... // position of the central element
int j = ... // of the region to be modified
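char* tempPtr = centerPtr - (N+1)*(L/2); // top-left corner of the region: L/2 rows up, L/2 columns left
for(int k=0; k < L; k++)
{
memset(tempPtr,value,L); // one row of the region is L bytes wide
tempPtr += N;
}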
How can I improve the code?
How to handle the fact that one region may exceeds the dimensions of a matrix?
How to make the code more efficient with respect to the execution time?
Your code is probably optimal for the general case where the region does not overlap the outside of the matrix. The main efficiency problem you can cause with this kind of code is to make the outer loop over columns instead of rows. This destroys cache and paging performance. You haven't done that.
Using pointers has little or no speed advantage with most modern compilers. Optimizers will come up with very good pointer code from normal array indices. In some cases I've seen array index code run substantially faster than hand-tweaked pointer code for the same thing. So don't use pointer arithmetic if index arithmetic is clearer.
There are 8 boundary cases: north, northwest, west, ..., northeast. Each of these will need a custom version of your loop to touch the right elements. I'll show the northwest case and let you work out the rest.
The fastest possible way to handle the cases is a two-level "if" tree:
if (j < L/2) { // northwest, west, or southwest
if (i < L/2) {
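// northwest: the region is clipped at row 0 and column 0
char* tempPtr = centerPtr - i * N - j; // element (0, 0) of the matrix
for(int k = 0; k <= i + L/2; k++) { // rows 0 .. i + L/2
memset(tempPtr, value, j + L/2 + 1); // columns 0 .. j + L/2
tempPtr += N;
}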
} else if (i >= M - L/2) {
// southwest
} else {
// west
}
} else if (j >= N - L/2) { // symmetrical cases for east.
if (i < L/2) {
// northeast
} else if (i >= M - L/2) {
// southeast
} else {
// east
}
} else {
if (i < L/2) {
// north
} else if (i >= M - L/2) {
// south
} else {
// no overlap
}
}
It's tedious to do it like this, but you'll have no more than 4 comparisons per region.
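If the nine hand-written cases feel like too much, a shorter alternative is to clamp the row and column ranges once and run a single loop. This is a sketch of that different technique, not the if-tree above. `FillRegion` is a hypothetical helper that takes the base pointer of the matrix; with the inputs from the question that base would be `centerPtr - i*N - j`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Sets the L*L neighborhood centered at (i, j) of an M*N row-major byte matrix
// to `value`, clipping the region to the matrix bounds.
void FillRegion(char* matrix, int M, int N, int i, int j, int L, char value)
{
    assert(L % 2 == 1 && i >= 0 && i < M && j >= 0 && j < N);
    const int half = L / 2;
    const int rowBegin = std::max(i - half, 0);
    const int rowEnd   = std::min(i + half, M - 1); // inclusive
    const int colBegin = std::max(j - half, 0);
    const int colEnd   = std::min(j + half, N - 1); // inclusive
    const std::size_t width = static_cast<std::size_t>(colEnd - colBegin + 1);

    char* rowPtr = matrix + static_cast<std::ptrdiff_t>(rowBegin) * N + colBegin;
    for (int r = rowBegin; r <= rowEnd; ++r)
    {
        std::memset(rowPtr, value, width); // one contiguous run per row
        rowPtr += N;
    }
}
```

The clamping costs four min/max operations per call regardless of where the region lies, which is usually negligible next to the memset calls themselves.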

allocating memory per thread in a parallel_for loop

I originally had a single-threaded loop which iterates over all pixels of an image and may do various operations with the data.
The library I am using dictates that retrieving pixels from an image must be done one line at a time. To this end I malloc a block of memory which can host one row of pixels (BMM_Color_fl is a struct containing one pixel's RGBA data as four float values, and GetLinearPixels() copies one row of pixels from a bitmap into a BMM_Color_fl array.)
BMM_Color_fl* line = (BMM_Color_fl*)malloc(width * sizeof(BMM_Color_fl));
for (int y = 0; y < height; y++)
{
bmp->GetLinearPixels(0, y, width, line); //Copy data of row Y from bitmap into line.
BMM_Color_fl* pixel = line; //Get first pixel of line.
for (int x = 0; x < width; x++, pixel++) // For each pixel in the row...
{
//Do stuff with a pixel.
}
}
free(line);
So far so good!
For the sake of reducing execution time of this loop, I have written a concurrent version using parallel_for, which looks like this:
parallel_for(0, height, [&](int y)
{
BMM_Color_fl* line = (BMM_Color_fl*)malloc(width * sizeof(BMM_Color_fl));
bmp->GetLinearPixels(0, y, width, line);
BMM_Color_fl* pixel = line;
for (int x = 0; x < width; x++, pixel++)
{
//Do stuff with a pixel.
}
free(line);
});
While the multithreaded loop is already faster than the original, I realize it is impossible for all threads to use the same memory block, so currently I am allocating and freeing the memory at each loop iteration, which is obviously wasteful as there will never be more threads than loop iterations.
My question is if and how can I have each thread malloc exactly one line buffer and use it repeatedly (and ideally, free it at the end)?
As a disclaimer I must state I am a novice C++ user.
Implementation of suggested solutions:
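Concurrency::combinable<std::vector<BMM_Color_fl>> line;
parallel_for(0, height, [&] (int y)
{
std::vector<BMM_Color_fl>& lineL = line.local(); // take a reference, otherwise the per-thread buffer is copied on every iteration
if (lineL.size() < (size_t)width) lineL.resize(width); // resize rather than reserve: the elements must exist before &lineL[0] is written to
bmp->GetLinearPixels(0, y, width, &lineL[0]);
for (int x = 0; x < width; x++)
{
BMM_Color_fl* pixel = &lineL[x];
//Do stuff with a pixel.
}
});
As suggested, I canned the malloc and replaced it with a per-thread vector that is sized once and then reused.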
You can use Concurrency::combinable class to achieve this.
I am too lazy to post the code, but I am sure it is possible.
Instead of having each thread call parallel_for(), have them call another function which allocates the memory, calls parallel_for(), and then frees the memory.
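For comparison, the "one buffer per worker" idea can be sketched with plain std::thread instead of the PPL. This is an illustration under assumptions: `Pixel` stands in for BMM_Color_fl, `GetRow` is a hypothetical replacement for bmp->GetLinearPixels() (which is assumed here to be safe to call from several threads), and summing the red channel is just a placeholder for the real per-pixel work.

```cpp
#include <cassert>
#include <thread>
#include <vector>

struct Pixel { float r, g, b, a; }; // stand-in for BMM_Color_fl

// Hypothetical stand-in for bmp->GetLinearPixels(): fills `line` with row y.
inline void GetRow(int y, int width, Pixel* line)
{
    for (int x = 0; x < width; ++x)
        line[x] = Pixel{static_cast<float>(x), static_cast<float>(y), 0.0f, 1.0f};
}

// Processes all rows with `numThreads` workers. Each worker allocates exactly one
// line buffer and reuses it for every row it handles. Returns the per-row sum of
// the red channel as a placeholder result.
std::vector<float> ProcessImage(int width, int height, unsigned numThreads)
{
    assert(width > 0 && height >= 0 && numThreads > 0);
    std::vector<float> rowSums(height, 0.0f);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
    {
        workers.emplace_back([&, t] {
            std::vector<Pixel> line(width); // one allocation per thread, freed automatically
            for (int y = static_cast<int>(t); y < height; y += static_cast<int>(numThreads))
            {
                GetRow(y, width, line.data());
                float sum = 0.0f;
                for (const Pixel& p : line)
                    sum += p.r; // "do stuff with a pixel"
                rowSums[y] = sum; // each row is written by exactly one thread
            }
        });
    }
    for (std::thread& w : workers)
        w.join();
    return rowSums;
}
```

Rows are dealt out in a fixed stride here, whereas parallel_for balances the work dynamically, so for uneven per-row cost the combinable approach may still be the better fit.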