allocating memory per thread in a parallel_for loop

I originally had a single-threaded loop that iterates over all pixels of an image and performs various operations on the data.
The library I am using dictates that retrieving pixels from an image must be done one line at a time. To this end I malloc a block of memory which can host one row of pixels (BMM_Color_fl is a struct containing one pixel's RGBA data as four float values, and GetLinearPixels() copies one row of pixels from a bitmap into a BMM_Color_fl array.)
BMM_Color_fl* line = (BMM_Color_fl*)malloc(width * sizeof(BMM_Color_fl));
for (int y = 0; y < height; y++)
{
    bmp->GetLinearPixels(0, y, width, line); // Copy row y from the bitmap into line.
    BMM_Color_fl* pixel = line; // Get the first pixel of the line.
    for (int x = 0; x < width; x++, pixel++) // For each pixel in the row...
    {
        // Do stuff with a pixel.
    }
}
free(line);
So far so good!
For the sake of reducing execution time of this loop, I have written a concurrent version using parallel_for, which looks like this:
parallel_for(0, height, [&](int y)
{
    BMM_Color_fl* line = (BMM_Color_fl*)malloc(width * sizeof(BMM_Color_fl));
    bmp->GetLinearPixels(0, y, width, line);
    BMM_Color_fl* pixel = line;
    for (int x = 0; x < width; x++, pixel++)
    {
        // Do stuff with a pixel.
    }
    free(line);
});
While the multithreaded loop is already faster than the original, the threads cannot all share one memory block, so I currently allocate and free the buffer on every loop iteration. That is obviously wasteful, since there will never be more threads than loop iterations.
My question is whether, and how, I can have each thread malloc exactly one line buffer and reuse it (and, ideally, free it at the end).
As a disclaimer I must state I am a novice C++ user.
Implementation of suggested solutions:
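Concurrency::combinable<std::vector<BMM_Color_fl>> line;
parallel_for(0, height, [&] (int y)
{
    // Take a reference: copying the vector here would defeat the per-thread reuse.
    std::vector<BMM_Color_fl>& lineL = line.local();
    // resize (not reserve), so that &lineL[0] points at width valid elements.
    if (lineL.size() < static_cast<size_t>(width)) lineL.resize(width);
    bmp->GetLinearPixels(0, y, width, &lineL[0]);
    for (int x = 0; x < width; x++)
    {
        BMM_Color_fl* pixel = &lineL[x];
        // Do stuff with a pixel.
    }
});
As suggested, I dropped the malloc and replaced it with a per-thread vector (sized with resize rather than reserve, so the elements actually exist).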

You can use the Concurrency::combinable class to achieve this: it gives each thread its own lazily created copy of an object, accessed through local().
I am too lazy to post the code, but I am sure it is possible.
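For reference, a minimal sketch of the one-buffer-per-worker idea in portable C++. It uses std::thread instead of PPL's parallel_for and combinable, so it is a stand-in technique rather than the combinable API itself. Pixel and getRow are hypothetical stand-ins for BMM_Color_fl and bmp->GetLinearPixels(); each worker takes a contiguous band of rows and allocates its line buffer once:

```cpp
#include <cassert>
#include <thread>
#include <vector>

struct Pixel { float r, g, b, a; };  // hypothetical stand-in for BMM_Color_fl

// Hypothetical stand-in for bmp->GetLinearPixels(): fills one row of pixels.
void getRow(int y, int width, Pixel* out) {
    for (int x = 0; x < width; ++x) out[x] = Pixel{float(x), float(y), 0.0f, 1.0f};
}

// Processes rows [yBegin, yEnd), reusing a single line buffer for the whole band.
double processRows(int yBegin, int yEnd, int width) {
    assert(width > 0 && yBegin <= yEnd);
    std::vector<Pixel> line(width);  // allocated once per worker, freed automatically
    double sum = 0.0;
    for (int y = yBegin; y < yEnd; ++y) {
        getRow(y, width, line.data());
        for (const Pixel& p : line) sum += p.r + p.g;  // "do stuff with a pixel"
    }
    return sum;
}

// Splits the image into one band of rows per worker thread and sums the results.
double processImage(int width, int height, int workers) {
    assert(workers > 0);
    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        int yBegin = height * w / workers;
        int yEnd = height * (w + 1) / workers;
        pool.emplace_back([=, &partial] { partial[w] = processRows(yBegin, yEnd, width); });
    }
    for (std::thread& t : pool) t.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

With combinable the structure stays a parallel_for over rows; here the band split does the same job of tying one buffer to one worker.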

Instead of having each thread call parallel_for() have them call another function which allocates the memory, calls parallel_for(), and then frees the memory.

Threads are slow c++

I'm trying to draw a Mandelbrot set and want to use 4 threads to do the calculation at the same time, each on a different part of the image. Here are the functions:
void Mandelbrot(int x_min,int x_max,int y_min,int y_max,Image &im)
{
for (int i = y_min; i < y_max; i++)
{
for (int j = x_min; j < x_max; j++)
{
//scaled x and y cordinate
double x0 = mape(j, 0, W, MinX, MaxX);
double y0 = mape(i, 0, H, MinY, MaxY);
double x = 0.0f;
double y = 0.0f;
int iteration = 0;
double z = 0;
while (abs(z)<2.0f && iteration < maxIteration)// && iteration < maxIteration)
{
double xtemp = x * x - y * y + x0;
y = 2 * x * y + y0;
x = xtemp;
iteration++;
z = x * x + y * y;
if (z > 10)//must be 10
break;
}
int b =mape(iteration, 0, maxIteration, 0, 255);
if (iteration == maxIteration)
b = 0;
im.setPixel(j, i, Color(b,b,0));
}
}
}
The mape function just converts a number from one range to another.
Here is the thread function:
void th(Image& im)
{
    float size = (float)im.getSize().x / num_th;
    int x_min = 0, x_max = size, y_min = 0, y_max = im.getSize().y;
    thread t[num_th];
    for (size_t i = 0; i < num_th; i++)
    {
        t[i] = thread(Mandelbrot, x_min, x_max, y_min, y_max, ref(im));
        x_min = x_max;
        x_max += size;
    }
    for (size_t i = 0; i<num_th; i++)
    {
        t[i].join();
    }
}
The main function looks like this:
int main()
{
    Image img;
    while(1) // here is while window.open()
    {
        th(img);
        // here I'm drawing
    }
}
So I am not getting any performance boost; it actually gets slower. Can anyone tell me where the problem is and what I'm doing wrong? This has happened to me before, too.
I saw someone ask what Image is: it's a class from the SFML library, in case that helps.

Your code is too incomplete to answer concretely, but here are a few suspects:
Thread start-up overhead: Spawning a thread has non-trivial overhead. If the amount of work performed by the thread is not large enough, launching it may cost more than any gain from parallelism.
Excessive locking and contention: This does not look like a problem in your code, as you don't seem to use any locks at all. Still, be careful: unsynchronized writes are only correct as long as the threads never write to the same addresses.
False sharing: A possible problem in your code. Cache lines tend to be 64 bytes. When one core writes to any part of a cache line, every other core's copy of that whole line is invalidated and must be re-fetched, even if those cores only use a different part of the line. This can cause significant slowdowns when multiple threads work on non-overlapping data that share a cache line, and if the threads iterate through the same data at the same rate, the invalidations recur over and over. It is always worth considering.
Cache-unfriendly memory layout: While walking through an array, going "across" may match the actual memory layout, reading one full cache line after another, whereas scanning "vertically" touches one portion of a cache line and then jumps to the corresponding portion of another. If this happens in many threads and you have a lot of memory to churn through, your cache can end up vastly underutilized. Be aware of whether your data is stored row-major or column-major, write your loops to match it, and avoid jumping around in memory.
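To illustrate the last two points, here is a rough sketch that gives each thread a contiguous band of rows of a row-major buffer instead of a vertical strip, so threads mostly write to separate cache lines. The plain std::vector<int> buffer, the fixed coordinate mapping and the function names are assumptions for the example; SFML's Image is not used:

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Escape-time iteration count for the point (x0, y0).
int mandelIterations(double x0, double y0, int maxIter) {
    double x = 0.0, y = 0.0;
    int it = 0;
    while (x * x + y * y <= 4.0 && it < maxIter) {
        double xt = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xt;
        ++it;
    }
    return it;
}

// Fills rows [yBegin, yEnd) of a row-major w x h buffer, walking memory left to right.
void renderRows(std::vector<int>& out, int w, int h, int yBegin, int yEnd, int maxIter) {
    for (int i = yBegin; i < yEnd; ++i)
        for (int j = 0; j < w; ++j)
            out[i * w + j] = mandelIterations(-2.0 + 3.0 * j / w, -1.0 + 2.0 * i / h, maxIter);
}

// Each thread gets a contiguous band of rows, i.e. one contiguous block of memory.
std::vector<int> renderParallel(int w, int h, int numThreads, int maxIter) {
    assert(w > 0 && h > 0 && numThreads > 0);
    std::vector<int> out(w * h);
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t)
        pool.emplace_back(renderRows, std::ref(out), w, h,
                          h * t / numThreads, h * (t + 1) / numThreads, maxIter);
    for (std::thread& th : pool) th.join();
    return out;
}
```

Whether this helps in the asker's program depends on how Image stores its pixels and on how much work each frame contains, so it is worth measuring both splits.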

C++: Modify same pointer using multiple threads

I am attempting an experiment with processing images in which I am modifying the color in each pixel. I tried implementing my code using "buckets", in which I divide the image into smaller regions - each receiving a dedicated thread to process the image.
On my end, I do not really care if multiple threads are attempting to modify the same resource; in fact, that seems to be the point. Theoretically, the threads should be modifying different locations in memory in the form of pixels. When printing my results, however, only the first task seems to iterate, leading me to think that some kind of race condition is occurring.
The function below is what manages the creation of each task, and
supplies it with starting coordinates and span to operate on. I
believe this is working fine, but it's here just for context:
Image*
CCManager::CCAsync(uint8_t bucketSize, Image* source,
const std::vector<float>& correction)
{
Image* newImg = new Image(); // This will contain our end result
newImg->resize(source->width(), source->height());
assert(buckets > 0);
// Now compute the width and height each bucket will render.
uint32_t width;
uint32_t height;
if(buckets == 2) // Each bucket takes a vertical rectangle
{
width = source->xSize()/2;
height = source->ySize();
}
else
{
// Set width and height to produce square grids (powers of 2)
// *** Not shown for Brevity ***
}
std::vector<std::thread> tasks; // The threads we are managing
// These coordinates will be fed as starting locations for each task
uint32_t startX = 0;
uint32_t startY = 0;
uint8_t tasksFinished = 0;
for(int i = 1; i <= buckets; ++i)
{
// Create a new task with a region for operation
tasks.emplace_back(std::thread(&CCManager::applyCCTask,this, startX, startY,
width, height, source, newImg, correction));
// No iteration is required for 1 bucket, simply paint the whole image
if(buckets == 1){break;}
// **** I REMOVED PART OF THE CODE THAT SHOWS WHERE EXPONENT
// IS DEFINED AND DETERMINED FOR BREVITY
// Reached last column, start a new row
if(i % exponent == 0)
{
startX = 0;
startY+= height;
}
// Keep incrementing horizontally
else
{
startX+= width;
}
}
while(tasksFinished < buckets)
{
// Join with whichever tasks finished
for(int i = 0; i < buckets; ++i)
{
if(tasks[i].joinable())
{
tasks[i].join();
tasksFinished++;
}
}
}
tasks.clear();
return newImg;
}
Having provided the new and source image pointers for each task, here they are in action.
The following function retrieves the color in each pixel, and calls a
method that applies the correction accordingly.
void
CCManager::applyCCTask(uint32_t x, uint32_t y, uint32_t width, uint32_t height,
Image* source, Image* newImg,
const std::vector<float>& correction)
{
// ** THIS ACTUALLY PRINTS THE CORRECT COORDINATES AND REGION SPAN
// ** FOR EACH THREAD
printf("Task renders # (%i,%i) with %i x %i box\n", x,y,width,height);
assert(source);
assert(newImg);
for (; x < width; ++x )
{
for (; y < height; ++y)
{
Byte4 pixel = source->pixel (x, y);
Color color = pixel.color;
printf("Before correction: Pixel(%i,%i) color [%i,%i,%i]\n",x,y, pixel.color[0], pixel.color[1], pixel.color[2]);
Color correctedColor= addCorrectionToColor( color, correction);
Byte4* newPixel= &newImg->pixel( x, y );
newPixel->color[0] = correctedColor[0];
newPixel->color[1] = correctedColor[1];
newPixel->color[2] = correctedColor[2];
printf("After correction: Pixel(%i,%i) color [%i,%i,%i]\n",x,y, newImg->pixel( x, y ).color[0], newImg->pixel( x, y ).color[1], newImg->pixel( x, y ).color[2]);
}
}
printf("Task Finished!\n");
}
With the code shown, all tasks end up printing the starting message with their area of operation, but inside the nested loop, the "Before" and "After" messages seem to print from ONE task only.
Why am I not allowed to modify the same image from multiple threads even though the actual pixel data being modified is different for each thread? Can I circumvent that without adding resource locks such as mutexes? The whole point of this experiment was to allow each thread to run independently without any hindrance.
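One thing in the loops above that may explain the symptom independently of threading: x and y are used both as the start coordinates and as the loop counters, and they are compared against width and height, which are spans rather than end coordinates. A task that starts at x == width never enters the outer loop, and y is not reset after the first column. A minimal sketch of region-relative bounds, with a hypothetical hit-counter buffer standing in for the Image class:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Visits every pixel of the region that starts at (startX, startY) and spans
// width x height pixels, incrementing that pixel's counter in a row-major buffer.
void visitRegion(std::vector<int>& hits, uint32_t imageWidth,
                 uint32_t startX, uint32_t startY, uint32_t width, uint32_t height) {
    assert(imageWidth > 0 && startX + width <= imageWidth);
    for (uint32_t x = startX; x < startX + width; ++x) {
        // y restarts at startY for every column.
        for (uint32_t y = startY; y < startY + height; ++y) {
            ++hits[y * imageWidth + x];
        }
    }
}
```

Calling this once per bucket on a 4 x 4 image split into four 2 x 2 regions touches every pixel exactly once, which is the property each task needs before any locking question comes up.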

writing slower than the operation itself?

I am struggling to understand the behavior of my functions.
My code is written in C++ in Visual Studio 2012, running on 64-bit Windows 7. I am working with 2D arrays of floats. When I time my function, I see that its run time drops by 10X or more if I just stop writing my results to the output pointer. Does that mean that writing is slow?
Here is an example:
void TestSpeed(float** pInput, float** pOutput)
{
    UINT32 y, x, i, j;
    for (y = 3; y < 100-3; y++)
    {
        for (x = 3; x < 100-3; x++)
        {
            float fSum = 0;
            for (i = y - 3; i <= y+3; i++)
            {
                for (j = x-3; j <= x+3; j++)
                {
                    fSum += pInput[y][x]*exp(-(pInput[y][x]-pInput[i][j])*(pInput[y][x]-pInput[i][j]));
                }
            }
            pOutput[y][x] = fSum;
        }
    }
}
If I comment out the line "pOutput[y][x] = fSum;" then the function runs very quickly. Why is that?
I am calling 2-3 such functions sequentially. Would it help to write a chunk of results to the stack instead of the heap, pass it on to the next function, and write back to the heap buffer once that chunk is ready?
In some cases I saw that replacing pOutput[y][x] with a line buffer allocated on the stack, such as float fResult[100], and using it to store the results is faster for larger data sizes.

Your code performs a lot of operations, and that takes time. Depending on what you are doing with the output, you may consider diagonalizing or decomposing your input matrix. Or you can look for values in your output that are n times another value, and skip computing the exponential for those.
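As a side note on the timing itself: once the store to pOutput is removed, fSum is never used, so an optimizing compiler may be free to drop much of the computation, which would make the "no write" version look faster than the write alone can explain. Below is a rough sketch of trimming the per-element work, assuming a flat row-major buffer instead of float** (filter7x7 and the n x n layout are assumptions for the example). It reads the centre value once and computes each difference once:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// The 7x7 filter from the question, over a flat, row-major n x n image.
void filter7x7(const std::vector<float>& in, std::vector<float>& out, int n) {
    assert(n >= 7 && in.size() == std::size_t(n) * n && out.size() == in.size());
    for (int y = 3; y < n - 3; ++y) {
        for (int x = 3; x < n - 3; ++x) {
            const float c = in[y * n + x];  // centre value, read once
            float sum = 0.0f;
            for (int i = y - 3; i <= y + 3; ++i) {
                for (int j = x - 3; j <= x + 3; ++j) {
                    const float d = c - in[i * n + j];  // difference, computed once
                    sum += c * std::exp(-d * d);
                }
            }
            out[y * n + x] = sum;
        }
    }
}
```

The 49 calls to exp per output element remain the dominant cost, so this is a tidy-up rather than a large speedup.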

On-the-fly terrain chunk generation

I'm writing an engine that can generate landscapes using noise functions, and load in new chunks as the player moves around the terrain. I spent the best part of two days figuring out how to place these chunks in the right position, so they don't overlap or get placed on top of existing chunks. It works well functionally, but there is a massive performance hit the further away from the player you generate the chunks (e.g. if you generate in a 3 chunk radius around the player, it's lightning fast, but if you increase that to a radius of 20 chunks it slows down very quickly).
I know exactly why that is, but I can't think of any other way to do this. Before I go any further, here's the code I'm currently using, hopefully it's commented well enough to understand:
// Get the player's position rounded to the nearest chunk on the grid.
D3DXVECTOR3 roundedPlayerPos(SnapToMultiple(m_Dx->m_Camera->GetPosition().x, CHUNK_X), 0, SnapToMultiple(m_Dx->m_Camera->GetPosition().z, CHUNK_Z));
// Iterate through every point on an invisible grid. At each point, check if it is
// inside a circle the size of the grid (so we generate chunks in a circle around
// the player, not a square). At each point that is inside the circle, add a chunk to
// the ChunksToAdd vector.
for (int x = -CHUNK_RANGE-1; x <= CHUNK_RANGE; x++)
{
for (int z = -CHUNK_RANGE-1; z <= CHUNK_RANGE; z++)
{
if (IsInside(roundedPlayerPos, CHUNK_X*CHUNK_RANGE, D3DXVECTOR3(roundedPlayerPos.x+x*CHUNK_X, 0, roundedPlayerPos.z+z*CHUNK_Z)))
{
Chunk chunkToAdd;
chunkToAdd.chunk = 0;
chunkToAdd.position = D3DXVECTOR3((roundedPlayerPos.x + x*CHUNK_X), 0, (roundedPlayerPos.z + z*CHUNK_Z));
chunkToAdd.chunkExists = false;
m_ChunksToAdd.push_back(chunkToAdd);
}
}
}
// Iterate through the ChunksToAdd vector. For each chunk in this vector, compare its
// position to every chunk in the Chunks vector (which stores each generated chunk).
// If the positions match, then there is already a chunk at that location, and
// we don't need to generate another.
for (i = 0; i < m_ChunksToAdd.size(); i++)
{
for (int j = 0; j < m_Chunks.size(); j++)
{
// Check the chunk in the ChunksToAdd vector with the chunk in the Chunks vector (chunks which are already generated).
if (m_ChunksToAdd[i].position.x == m_Chunks[j].position.x && m_ChunksToAdd[i].position.z == m_Chunks[j].position.z)
{
m_ChunksToAdd[i].chunkExists = true;
}
}
}
// Determine the closest chunk to the player, so we can generate that first.
// Iterate through the ChunksToAdd vector, and if the chunk doesn't exist (if it
// does exist, we're not going to generate it so ignore it), compare the current (j)
// chunk against the current closest chunk. If it is farther away, move on, and if it is
// closer, store its index as the new closest chunk.
int closest = 0;
for (j = 0; j < m_ChunksToAdd.size(); j++)
{
if (!m_ChunksToAdd[j].chunkExists)
{
// Get the distance from the player to the chunk for the current closest chunk, and
// the chunk being tested.
float x1 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[j].position));
float x2 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[closest].position));
// If the chunk being tested is closer to the player, make it the new closest chunk.
if (x1 <= x2)
closest = j;
}
}
// After determining the position of the closest chunk, generate the volume and mesh, and add it
// to the Chunks vector for rendering.
if (!m_ChunksToAdd[closest].chunkExists) // Only add it if the chunk doesn't already exist in the Chunks vector.
{
Chunk chunk;
chunk.chunk = new chunkClass;
chunk.chunk->m_Position = m_ChunksToAdd[closest].position;
chunk.chunk->GenerateVolume(m_Simplex);
chunk.chunk->GenerateMesh(m_Dx->GetDevice());
chunk.position = m_ChunksToAdd[closest].position;
chunk.chunkExists = true;
m_Chunks.push_back(chunk);
}
// Clear the ChunksToAdd vector ready for another frame.
m_ChunksToAdd.clear();
(If it wasn't already obvious, this is run every frame.)
The problem area is to do with the CHUNK_RANGE variable. The larger this value, the more the first two loops are iterated through each frame, slowing the whole thing down tremendously. I need some advice or suggestions on how to do this more efficiently, thanks.
EDIT: Here's some improved code:
// Get the player's position rounded to the nearest chunk on the grid.
D3DXVECTOR3 roundedPlayerPos(SnapToMultiple(m_Dx->m_Camera->GetPosition().x, CHUNK_X), 0, SnapToMultiple(m_Dx->m_Camera->GetPosition().z, CHUNK_Z));
// Find if the player has changed into another chunk, if they have, we will scan
// to see if more chunks need to be generated.
static D3DXVECTOR3 roundedPlayerPosOld = roundedPlayerPos;
static bool playerPosChanged = true;
if (roundedPlayerPosOld != roundedPlayerPos)
{
roundedPlayerPosOld = roundedPlayerPos;
playerPosChanged = true;
}
// Iterate through every point on an invisible grid. At each point, check if it is
// inside a circle the size of the grid (so we generate chunks in a circle around
// the player, not a square). At each point that is inside the circle, add a chunk to
// the ChunksToAdd vector.
if (playerPosChanged)
{
m_ChunksToAdd.clear();
for (int x = -CHUNK_CREATE_RANGE-1; x <= CHUNK_CREATE_RANGE; x++)
{
for (int z = -CHUNK_CREATE_RANGE-1; z <= CHUNK_CREATE_RANGE; z++)
{
if (IsInside(roundedPlayerPos, CHUNK_X*CHUNK_CREATE_RANGE, D3DXVECTOR3(roundedPlayerPos.x+x*CHUNK_X, 0, roundedPlayerPos.z+z*CHUNK_Z)))
{
bool chunkExists = false;
for (int j = 0; j < m_Chunks.size(); j++)
{
// Check the chunk in the ChunksToAdd vector with the chunk in the Chunks vector (chunks which are already generated).
if ((roundedPlayerPos.x + x*CHUNK_X) == m_Chunks[j].position.x && (roundedPlayerPos.z + z*CHUNK_Z) == m_Chunks[j].position.z)
{
chunkExists = true;
break;
}
}
if (!chunkExists)
{
Chunk chunkToAdd;
chunkToAdd.chunk = 0;
chunkToAdd.position = D3DXVECTOR3((roundedPlayerPos.x + x*CHUNK_X), 0, (roundedPlayerPos.z + z*CHUNK_Z));
m_ChunksToAdd.push_back(chunkToAdd);
}
}
}
}
}
playerPosChanged = false;
// If there are chunks to render.
if (m_ChunksToAdd.size() > 0)
{
// Determine the closest chunk to the player, so we can generate that first.
// Iterate through the ChunksToAdd vector (which now holds only chunks that don't
// exist yet) and compare the current (j) chunk against the current closest chunk.
// If it is farther away, move on, and if it is closer, store its index as the
// new closest chunk.
int closest = 0;
for (j = 0; j < m_ChunksToAdd.size(); j++)
{
// Get the distance from the player to the chunk for the current closest chunk, and
// the chunk being tested.
float x1 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[j].position));
float x2 = ABS(DistanceFrom(roundedPlayerPos, m_ChunksToAdd[closest].position));
// If the chunk being tested is closer to the player, make it the new closest chunk.
if (x1 <= x2)
closest = j;
}
// After determining the position of the closest chunk, generate the volume and mesh, and add it
// to the Chunks vector for rendering.
Chunk chunk;
chunk.chunk = new chunkClass;
chunk.chunk->m_Position = m_ChunksToAdd[closest].position;
chunk.chunk->GenerateVolume(m_Simplex);
chunk.chunk->GenerateMesh(m_Dx->GetDevice());
chunk.position = m_ChunksToAdd[closest].position;
m_Chunks.push_back(chunk);
m_ChunksToAdd.erase(m_ChunksToAdd.begin()+closest);
}
// Remove chunks that are far away from the player.
for (i = 0; i < m_Chunks.size(); )
{
if (DistanceFrom(roundedPlayerPos, m_Chunks[i].position) > (CHUNK_REMOVE_RANGE*CHUNK_X)*(CHUNK_REMOVE_RANGE*CHUNK_X))
{
m_Chunks[i].chunk->Shutdown();
delete m_Chunks[i].chunk;
m_Chunks[i].chunk = 0;
// erase() shifts the following elements down, so do not advance i here.
m_Chunks.erase(m_Chunks.begin()+i);
}
else
{
i++;
}
}

Have you tried profiling it to work out exactly where the bottleneck is?
Do you need to check all of those chunks or could you get away with checking the direction the player is looking and only generate the ones in view?
Is there any reason why you draw the chunk closest to the player first if you're generating it all once per frame before displaying it? Skipping the stage where you sort them may free up a bit of processing power.
Is there any reason you couldn't combine the first two loops to just create a vector of chunks which need generating?
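On the last point, a rough sketch of combining the range scan and the existence check into one pass. It assumes chunks are keyed by integer grid coordinates (ChunkCoord and missingChunks are hypothetical names, not part of the asker's engine) and uses a std::set, so each lookup is logarithmic instead of a scan over every generated chunk:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using ChunkCoord = std::pair<int, int>;  // (x, z) in chunk-grid units

// Returns the grid coordinates within `range` chunks of `center` (a circle, not a
// square) for which no chunk exists yet.
std::vector<ChunkCoord> missingChunks(const std::set<ChunkCoord>& existing,
                                      ChunkCoord center, int range) {
    assert(range >= 0);
    std::vector<ChunkCoord> result;
    for (int dx = -range; dx <= range; ++dx) {
        for (int dz = -range; dz <= range; ++dz) {
            if (dx * dx + dz * dz > range * range) continue;  // outside the circle
            ChunkCoord c{center.first + dx, center.second + dz};
            if (existing.count(c) == 0) result.push_back(c);   // not generated yet
        }
    }
    return result;
}
```

Using integer grid coordinates as the key also avoids comparing floating-point positions with ==.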

It sounds like you're trying to do too much work (i.e. building chunks) on the render thread. If you can do the work of a three chunk radius really fast you should limit it to that per frame. How many chunks are you trying to generate, in each situation, per frame?
I'm going to assume that generating each chunk is independent, therefore, you can probably move the work to another thread - then show the chunk when it is ready.
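A minimal sketch of handing generation to background tasks with std::async and polling for finished chunks once per frame. ChunkData, generateChunk and ChunkLoader are hypothetical names; anything that touches the D3D device (such as GenerateMesh) may still have to run on the render thread:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

struct ChunkData { int x, z; std::vector<float> heights; };  // hypothetical chunk payload

// Hypothetical stand-in for the expensive, independent volume generation.
ChunkData generateChunk(int x, int z) {
    return ChunkData{x, z, std::vector<float>(16, float(x + z))};
}

class ChunkLoader {
public:
    // Starts generating a chunk on a background thread.
    void request(int x, int z) {
        pending_.push_back(std::async(std::launch::async, generateChunk, x, z));
    }
    // Moves every finished chunk into `ready` without blocking; returns how many moved.
    std::size_t collect(std::vector<ChunkData>& ready) {
        std::size_t moved = 0;
        for (auto it = pending_.begin(); it != pending_.end();) {
            assert(it->valid());
            if (it->wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
                ready.push_back(it->get());
                it = pending_.erase(it);
                ++moved;
            } else {
                ++it;
            }
        }
        return moved;
    }
    bool idle() const { return pending_.empty(); }
private:
    std::vector<std::future<ChunkData>> pending_;
};
```

The render loop would call collect() once per frame and add whatever has arrived, so a slow chunk delays only its own appearance rather than the frame.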

Can someone explain how I am to access this array? (image processing program)

I am working on the implementation of functions for an already written image processing program. I am given explanations of the functions, but I am not sure how they designate the pixels of the image.
In this case, I need to flip the image horizontally, i.e., rotate it 180 degrees around the vertical axis.
Is this what creates the "image" I am to flip?
void Image::createImage(int width_x, int height_y)
{
    width = width_x;
    height = height_y;
    if (pixelData!=NULL)
        freePixelData();
    if (width <= 0 || height <= 0) {
        return;
    }
    pixelData = new Color* [width]; // array of Color*, one entry per column
    for (int x = 0; x < width; x++) {
        pixelData[x] = new Color [height]; // this is the 2nd dimension of pixelData
    }
}
I do not know if all the functions I have written are correct.
Also, the Image class calls on a Color class
So to re-ask: what am I "flipping" here?
Prototype for function is:
void flipLeftRight();
As there is no input into the function, and I am told it modifies pixelData, how do I flip left to right?

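A quick in-place flip for a single-channel, row-major 8-bit image. Untested, but the idea is there.
void flipHorizontal(uint8_t *image, uint32_t width, uint32_t height) // fixed-width types from <cstdint>
{
    for (uint32_t i = 0; i < height; i++)
    {
        for (uint32_t j = 0; j < width / 2; j++)
        {
            uint32_t sourceIndex = i * width + j;         // j-th pixel from the left of row i
            uint32_t destIndex = (i + 1) * width - j - 1; // j-th pixel from the right of row i
            // XOR swap; safe here because sourceIndex never equals destIndex.
            image[sourceIndex] ^= image[destIndex];
            image[destIndex] ^= image[sourceIndex];
            image[sourceIndex] ^= image[destIndex];
        }
    }
}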

Well, the simplest approach would be to read the image one row at a time into a temporary buffer the size of one row.
Then you could use something like std::reverse on the temporary buffer and write it back.
You could also do it in place, but this is the simplest approach.
EDIT: What I've described is a mirror, not a flip; to flip you also need to reverse the order of the rows. Nothing too bad: to do that I would create a buffer the same size as the image, copy the image, and then write it back with the coordinates adjusted, something like y = height - 1 - y and x = width - 1 - x.
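For the layout in the question's createImage(), where pixelData[x] points to column x, a left-to-right flip can be sketched as swapping column pointers, with no per-pixel copying. This is a free function for illustration, with a hypothetical Color struct; the real flipLeftRight() is a member that would use the class's own pixelData and width:

```cpp
#include <cassert>
#include <utility>

struct Color { unsigned char r, g, b; };  // hypothetical stand-in for the Color class

// pixelData[x] points to column x (an array of `height` colours), as in createImage().
// Swapping column x with column width-1-x mirrors the image left to right.
void flipLeftRight(Color** pixelData, int width) {
    assert(width <= 0 || pixelData != nullptr);
    for (int x = 0; x < width / 2; ++x) {
        std::swap(pixelData[x], pixelData[width - 1 - x]);
    }
}
```

For an odd width the middle column stays where it is, which is what a mirror image requires.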
