Why am I running out of heap memory? - c++

So I'm writing a raytracer in C++ using Jetbrains Clion IDE. When I try to create a 600 * 600 image with multisampling antialiasing enabled, I run out of memory. I get this error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
This application has requested the Runtime to terminate it in an
unusual way. Please contact the application's support team for more
information.
Code of my render function:
width: 600
height: 600
numberOfSamples: 80
void Camera::render(const int width, const int height){
int resolution = width * height;
double scale = tan(Algebra::deg2rad(fov * 0.5)); //deg to rad
ColorRGB *pixels = new ColorRGB[resolution];
long loopCounter = 0;
Vector3D camRayOrigin = getCameraPosition();
for (int i = 0; i < width; ++i) {
for (int j = 0; j < height; ++j) {
double zCamDir = (height/2) / scale;
ColorRGB finalColor = ColorRGB(0,0,0,0);
int tempCount = 0;
for (int k = 0 ; k < numberOfSamples; k++) {
tempCount++;
//If it is single sampled, then we want to cast ray in the middle of the pixel, otherwise we offset the ray by a random value between 0-1
double randomNumber = Algebra::getRandomBetweenZeroAndOne();
double xCamDir = (i - (width / 2)) + (numberOfSamples == 1 ? 0.5 : randomNumber);
double yCamDir = ((height / 2) - j) + (numberOfSamples == 1 ? 0.5 : randomNumber);
Vector3D camRayDirection = convertCameraToWorldCoordinates(Vector3D(xCamDir, yCamDir, zCamDir)).unitVector();
Ray r(camRayOrigin, camRayDirection);
finalColor = finalColor + getColorFromRay(r);
}
pixels[loopCounter] = finalColor / numberOfSamples;
loopCounter++;
}
}
CreateImage::createRasterImage(height, width, "RenderedImage.bmp", pixels);
delete pixels; //Release memory
}
I'm a beginner in C++, so I'd really appreciate your help. I also tried doing the same thing in C# in Microsoft Visual Studio and the memory usage never exceeded 200MB. I feel like I'm doing some mistake. I can provide you with more details if you want to help me.

Memory allocated using new [] must be deallocated using delete [].
Your program has undefined behavior due to use of
delete pixels; //Release memory
to deallocate memory. It needs to be:
delete [] pixels;

Related

munmap_chunk() - Invalid pointer error

I'm writing a renderer using low-level SDL functions to learn how it all works. I am now trying to do polygon drawing, but I run into errors possibly due to my inexperience with C++. When running the code I get a munmap_chunk() - Invalid pointer error. Searching reveals that it is most likely due to free()-ing the memory twice. The error happens when returning from the function. I realize that the error comes from automatically free()ing memory which has been automatically free()d before, but I'm not experienced enough with C++ to spot the error. Any clues?
My code:
void DrawPolygon (const vector<vec3> & verts, vec3 color){
// 0. Project to the screen
vector<ivec2> vertices(verts.size());
for(int i = 0; i < verts.size(); i++){
VertexShader(verts.at(i), vertices.at(i));
}
// 1. Find max and min y-value of the polygon
// and compute the number of rows it occupies.
int miny = vertices[0].y;
int maxy = vertices[0].y;
for (int i = 1; i < 3; i++){
if (vertices[i].y < miny){
miny = vertices[i].y;
}
if (vertices[i].y > maxy){
maxy = vertices[i].y;
}
}
int rows = abs(maxy - miny) + 1;
// 2. Resize leftPixels and rightPixels
// so that they have an element for each row.
vector<ivec2> leftPixels(rows);
vector<ivec2> rightPixels(rows);
// 3. Initialize the x-coordinates in leftPixels
// to some really large value and the x-coordinates
// in rightPixels to some really small value.
for(int i = 0; i < rows; i++){
leftPixels[i].x = std::numeric_limits<int>::max();
rightPixels[i].x = std::numeric_limits<int>::min();
leftPixels[i].y = miny + i;
rightPixels[i].y = miny + i;
}
// 4. Loop through all edges of the polygon and use
// linear interpolation to find the x-coordinate for
// each row it occupies. Update the corresponding
// values in rightPixels and leftPixels.
for(int i = 0; i < 3; i++){
ivec2 a = vertices[i];
ivec2 b = vertices[(i+1)%3];
// find the number of pixels to draw
ivec2 delta = glm::abs(a - b);
int pixels = glm::max(delta.x, delta.y) + 1;
// interpolate to find the pixels
vector<ivec2> line (pixels);
Interpolate(a, b, line);
for(int j = 0; j < pixels; j++){
ivec2 p = line[j];
ivec2 cmpl = leftPixels[p.y - miny];
ivec2 cmpr = rightPixels[p.y - miny];
if(p.x < cmpl.x){
leftPixels[p.y - miny].x = p.x;
//leftPixels[p.y - miny] = cmpl;
}
if(p.x > cmpr.x){
rightPixels[p.y - miny].x = p.x;
//cmpr.x = p.x;
//rightPixels[p.y - miny] = cmpr;
}
}
}
for(int i = 0; i < leftPixels.size(); i++){
ivec2 l = leftPixels.at(i);
ivec2 r = rightPixels.at(i);
// y coord the same, iterate over x
int y = l.y;
for(int x = l.x; x <= r.x; x++){
PutPixelSDL(screen, x, y, color);
}
}
}
Using valgrind gives me this output (this is the first error it reports). Weirdly, the program recovers and keeps running with the expected result, apparently not getting the same error again:
==5706== Invalid write of size 4
==5706== at 0x40AD61: DrawPolygon(std::vector<glm::detail::tvec3<float>, std::allocator<glm::detail::tvec3<float> > > const&, glm::detail::tvec3<float>) (in /home/actimia/prog/dgi14/lab3/ThirdLab)
==5706== by 0x409C78: Draw() (in /home/actimia/prog/dgi14/lab3/ThirdLab)
==5706== by 0x409668: main (in /home/actimia/prog/dgi14/lab3/ThirdLab)
I think my previous post on similar topic would be useful.
https://stackoverflow.com/a/22658693/2724703
From your Valgrind report, it look like your program is doing memory corruption due to overflow. This does not seems like "double free" error(this is overflow scenario). You have mentioned that sometime valgrind is not reporting any error this makes this problem more difficult. However there is certainly a memory corruption and you must fix them. Memory error sometime occur intermittent due to various reason(different input parameter, multi-threaded, change of execution sequence).

C++ Heap Corruption: Local heap variable causing issues

I am working on some simple terrain with DirectX9 by manually assembling the verts for the ground.
On the part of my code where I set up the indices I get an error though:
Windows has triggered a breakpoint in test.exe.
This may be due to a corruption of the heap, which indicates a bug in test.exe or any of the DLLs it has loaded.
Here is the part of my code that is giving me problems, and I'm almost 100% sure that it is linked to my indices pointer, but I delete it when I'm finished... so I'm not sure what the problem is.
int total = widthQuads * heightQuads * 6;
DWORD *indices = new DWORD[totalIdx];
for (int y = 0; y < heightQuads; y++)
{
for (int x = 0; x < widthQuads; x++)
{ //Width of nine:
int lowerLeft = x + y * 9;
int lowerRight = (x + 1) + y * 9;
int topLeft = x + (y + 1) * 9;
int topRight = (x + 1) + (y + 1) * 9;
//First triangle:
indices[counter++] = topLeft;
indices[counter++] = lowerRight;
indices[counter++] = lowerLeft;
//Second triangle:
indices[counter++] = topLeft;
indices[counter++] = topRight;
indices[counter++] = lowerRight;
}
}
d3dDevice->CreateIndexBuffer(sizeof(DWORD)* total, 0, D3DFMT_INDEX16,
D3DPOOL_MANAGED, &groundindex, 0);
void* mem = 0;
groundindex->Lock(0, 0, &mem, 0);
memcpy(mem, indices, total * sizeof (DWORD));
groundindex->Unlock();
delete[] indices;
When I remove this block my program runs OK.
The code you've given looks OK - with one caveat: the initial value of counter is not in the code itself. So either you don't start at counter = 0, or some other piece of code is stomping on your indices buffer.
That's the beauty of heap corruptions. There is no guarantee that the bug is in the removed portion on the code. It may simply hide the bug that exists somewhere else in your code.
int total = widthQuads * heightQuads * 6;
DWORD *indices = new DWORD[totalIdx];
Shouldn't you be doing "new DWORD[total];" here?

Quick code to resize DIB image and maintain good img quality

There is many algorithms to do image resizing - lancorz, bicubic, bilinear, e.g. But most of them are pretty complex and therefore consume too much CPU.
What I need is fast relatively simple C++ code to resize images with acceptable quality.
Here is an example of what I'm currently doing:
for (int y = 0; y < height; y ++)
{
int srcY1Coord = int((double)(y * srcHeight) / height);
int srcY2Coord = min(srcHeight - 1, max(srcY1Coord, int((double)((y + 1) * srcHeight) / height) - 1));
for (int x = 0; x < width; x ++)
{
int srcX1Coord = int((double)(x * srcWidth) / width);
int srcX2Coord = min(srcWidth - 1, max(srcX1Coord, int((double)((x + 1) * srcWidth) / width) - 1));
int srcPixelsCount = (srcX2Coord - srcX1Coord + 1) * (srcY2Coord - srcY1Coord + 1);
RGB32 color32;
UINT32 r(0), g(0), b(0), a(0);
for (int xSrc = srcX1Coord; xSrc <= srcX2Coord; xSrc ++)
for (int ySrc = srcY1Coord; ySrc <= srcY2Coord; ySrc ++)
{
RGB32 curSrcColor32 = pSrcDIB->GetDIBPixel(xSrc, ySrc);
r += curSrcColor32.r; g += curSrcColor32.g; b += curSrcColor32.b; a += curSrcColor32.alpha;
}
color32.r = BYTE(r / srcPixelsCount); color32.g = BYTE(g / srcPixelsCount); color32.b = BYTE(b / srcPixelsCount); color32.alpha = BYTE(a / srcPixelsCount);
SetDIBPixel(x, y, color32);
}
}
The code above is fast enough, but the quality is not ok on scaling pictures up.
Therefore, possibly someone already has fast and good C++ code sample for scaling DIBs?
Note: I was using StretchDIBits before - it was super-slow when was needed to downsize 10000x10000 picture down to 100x100 size, my code is much, much faster, I just want to have a bit higher quality
P.S. I'm using my own SetPixel/GetPixel functions, to work directly with data array and fast, that's not device context!
Why are you doing it on the CPU? Using GDI, there's a good chance of some hardware acceleration. Use StretchBlt and SetStretchBltMode.
In pseudocode:
create source dc and destination dc using CreateCompatibleDC
create source and destination bitmaps
SelectObject source bitmap into source DC and dest bitmap into dest DC
SetStretchBltMode
StretchBlt
release DCs
Allright, here is the answer, had to do it myself... It works perfectly well for scaling pictures up (for scaling down my initial code works perfectly well too). Hope someone will find a good use for it, it's fast enough and produced very good picture quality.
for (int y = 0; y < height; y ++)
{
double srcY1Coord = (y * srcHeight) / (double)height;
int srcY1CoordInt = (int)(srcY1Coord);
double srcY2Coord = ((y + 1) * srcHeight) / (double)height - 0.00000000001;
int srcY2CoordInt = min(maxSrcYcoord, (int)(srcY2Coord));
double yMultiplierForFirstCoord = (0.5 * (1 - (srcY1Coord - srcY1CoordInt)));
double yMultiplierForLastCoord = (0.5 * (srcY2Coord - srcY2CoordInt));
for (int x = 0; x < width; x ++)
{
double srcX1Coord = (x * srcWidth) / (double)width;
int srcX1CoordInt = (int)(srcX1Coord);
double srcX2Coord = ((x + 1) * srcWidth) / (double)width - 0.00000000001;
int srcX2CoordInt = min(maxSrcXcoord, (int)(srcX2Coord));
RGB32 color32;
ASSERT(srcX1Coord < srcWidth && srcY1Coord < srcHeight);
double r(0), g(0), b(0), a(0), multiplier(0);
for (int xSrc = srcX1CoordInt; xSrc <= srcX2CoordInt; xSrc ++)
for (int ySrc = srcY1CoordInt; ySrc <= srcY2CoordInt; ySrc ++)
{
RGB32 curSrcColor32 = pSrcDIB->GetDIBPixel(xSrc, ySrc);
double xMultiplier = xSrc < srcX1Coord ? (0.5 * (1 - (srcX1Coord - srcX1CoordInt))) : (xSrc >= srcX2Coord ? (0.5 * (srcX2Coord - srcX2CoordInt)) : 0.5);
double yMultiplier = ySrc < srcY1Coord ? yMultiplierForFirstCoord : (ySrc >= srcY2Coord ? yMultiplierForLastCoord : 0.5);
double curPixelMultiplier = xMultiplier + yMultiplier;
if (curPixelMultiplier > 0)
{
r += (curSrcColor32.r * curPixelMultiplier); g += (curSrcColor32.g * curPixelMultiplier); b += (curSrcColor32.b * curPixelMultiplier); a += (curSrcColor32.alpha * curPixelMultiplier);
multiplier += curPixelMultiplier;
}
}
color32.r = BYTE(r / multiplier); color32.g = BYTE(g / multiplier); color32.b = BYTE(b / multiplier); color32.alpha = BYTE(a / multiplier);
SetDIBPixel(x, y, color32);
}
}
P.S. Please don't ask why I’m not using StretchDIBits - leave comments for these who understand that not always system api is available or acceptable.
Again, why do it on the CPU? Why not use OpenGL / DirectX and fragment shaders? In pseudocode:
upload source texture (cache it if it's to be reused)
create destination texture
use shader program
render quad
download output texture
where shader program is the filtering method you're using. The GPU is much better at processing pixels than CPU/GetPixel/SetPixel.
You could probably find fragment shaders for lots of different filtering methods on the web - GPU Gems is a good place to start.

Sometimes I get EXEC_BAD_ACCESS (Access violation) when reversing an array

I am loading an image using the OpenEXR library.
This works fine, except the image is loaded rotated 180 degrees. I use the loop shown below to reverse the array but sometimes the program will quit and xcode will give me an EXEC_BAD_ACCESS error (Which I assume is the same as an access violation in msvc). It does not happen everytime, just once every 5-10 times.
Ideally I'd want to reverse the array in place, although that led to errors everytime and using memcpy would fail but without causing an error, just a blank image. I'd like to know what's causing this problem first.
Here is the code I am using: (Rgba is a struct of 4 "Half"s r, g, b, and a, defined in OpenEXR)
Rgba* readRgba(const char filename[], int& width, int& height){
Rgba* pixelBuffer = new Rgba[width * height];
Rgba* temp = new Rgba[width * height];
// ....EXR Loading code....
// TODO: *Sometimes* the following code results in a bad memory access error. No idea why.
// Flip the image to conform with OpenGL coordinates.
for (int i = 0; i < height; i++){
for(int j = 0; j < width; j++){
temp[(i*width)+j] = pixelBuffer[(width*height)-(i*width)+j];
}
}
delete pixelBuffer;
return temp;
}
Thanks in advance!
Change:
temp[(i*width)+j] = pixelBuffer[(width*height)-(i*width)+j];
to:
temp[(i*width)+j] = pixelBuffer[(width*height)-(i*width)+j - 1];
(Hint: think about what happens when i = 0 and j = 0 !)
And here's how you can optimize this code, to save memory and for cycles:
Rgba* readRgba(const char filename[], int& width, int& height)
{
Rgba* pixelBuffer = new Rgba[width * height];
Rgba tempPixel;
// ....EXR Loading code....
// Flip the image to conform with OpenGL coordinates.
for (int i = 0; i <= height/2; i++)
for(int j = 0; j < width && (i*width + j) <= (height*width/2); j++)
{
tempPixel = pixelBuffer[i*width + j];
pixelBuffer[i*width + j] = pixelBuffer[height*width - (i*width + j) -1];
pixelBuffer[height*width - (i*width + j) -1] = tempPixel;
}
return pixelBuffer;
}
Note that optimal (from a memory usage best practices point of view) would be to pass pixelBuffer* as a parameter and already allocated. It's a good practice to allocate and release the memory in the same piece of code.

Why is free() bogging my program down?

I am using free to free the memory allocated for a bunch of temporary arrays in a recursive function. I would post the code but it is pretty long. When I comment out these free() calls, the program runs in less than a second. However, when I am using them, the programs takes about 20 seconds to run. Why is this happening, and how can it be fixed? This is like 100 or so MB so I'd rather not just leave the memory leak.
Additionally, when I run the program that includes all of the free() calls with profiling enabled, it runs in less than a second. I don't know how that would have an effect, but it does.
After using only some of the free() calls, it seems that there are a few in particular that cause the program to slow down. The rest do not seem to have an effect.
Ok... here's the code as requested:
void KDTree::BuildBranch(int height, Mailbox** objs, int nObjects)
{
int dnObjects = nObjects * 2;
int dnmoObjects = dnObjects - 1;
//Check for termination
if(height == -1 || nObjects < minObjectsPerNode)
{
//Create leaf
tree[nodeIndex] = KDTreeNode();
if(nObjects == 1)
tree[nodeIndex].InitializeLeaf(objs[0], 1);
else
tree[nodeIndex].InitializeLeaf(objs, nObjects);
//Added a node, increment index
nodeIndex++;
return;
}
//Save this node's index and increment the current index to save space for this node
int thisNodeIndex = nodeIndex;
nodeIndex++;
//Allocate memory for split options
float* xMins = (float*)malloc(nObjects * sizeof(float));
float* yMins = (float*)malloc(nObjects * sizeof(float));
float* zMins = (float*)malloc(nObjects * sizeof(float));
float* xMaxs = (float*)malloc(nObjects * sizeof(float));
float* yMaxs = (float*)malloc(nObjects * sizeof(float));
float* zMaxs = (float*)malloc(nObjects * sizeof(float));
//Find all possible split locations
int index = 0;
BoundingBox* tempBox = new BoundingBox();
for(int i = 0; i < nObjects; i++)
{
//Get bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add mins to split lists
xMins[index] = tempBox->x0;
yMins[index] = tempBox->y0;
zMins[index] = tempBox->z0;
//Add maxs
xMaxs[index] = tempBox->x1;
yMaxs[index] = tempBox->y1;
zMaxs[index] = tempBox->z1;
index++;
}
//Sort lists
Util::sortFloats(xMins, nObjects);
Util::sortFloats(yMins, nObjects);
Util::sortFloats(zMins, nObjects);
Util::sortFloats(xMaxs, nObjects);
Util::sortFloats(yMaxs, nObjects);
Util::sortFloats(zMaxs, nObjects);
//Allocate bin lists
Bin* xLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* xRight = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* yLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* yRight = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* zLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* zRight = (Bin*)malloc(dnObjects * sizeof(Bin));
//Initialize all bins
for(int i = 0; i < dnObjects; i++)
{
xLeft[i] = Bin(0, 0.0f);
xRight[i] = Bin(0, 0.0f);
yLeft[i] = Bin(0, 0.0f);
yRight[i] = Bin(0, 0.0f);
zLeft[i] = Bin(0, 0.0f);
zRight[i] = Bin(0, 0.0f);
}
//Construct min and max bins bins from split locations
//Merge min/max lists together for each axis
int minIndex = 0, maxIndex = 0;
for(int i = 0; i < dnObjects; i++)
{
if(maxIndex == nObjects || (xMins[minIndex] <= xMaxs[maxIndex] && minIndex != nObjects))
{
//Add split location to both bin lists
xLeft[i].rightEdge = xMins[minIndex];
xRight[i].rightEdge = xMins[minIndex];
//Add geometry to mins counter
xLeft[i+1].objectBoundCounter++;
minIndex++;
}
else
{
//Add split location to both bin lists
xLeft[i].rightEdge = xMaxs[maxIndex];
xRight[i].rightEdge = xMaxs[maxIndex];
//Add geometry to maxs counter
xRight[i].objectBoundCounter++;
maxIndex++;
}
}
//Repeat for y axis
minIndex = 0, maxIndex = 0;
for(int i = 0; i < dnObjects; i++)
{
if(maxIndex == nObjects || (yMins[minIndex] <= yMaxs[maxIndex] && minIndex != nObjects))
{
//Add split location to both bin lists
yLeft[i].rightEdge = yMins[minIndex];
yRight[i].rightEdge = yMins[minIndex];
//Add geometry to mins counter
yLeft[i+1].objectBoundCounter++;
minIndex++;
}
else
{
//Add split location to both bin lists
yLeft[i].rightEdge = yMaxs[maxIndex];
yRight[i].rightEdge = yMaxs[maxIndex];
//Add geometry to maxs counter
yRight[i].objectBoundCounter++;
maxIndex++;
}
}
//Repeat for z axis
minIndex = 0, maxIndex = 0;
for(int i = 0; i < dnObjects; i++)
{
if(maxIndex == nObjects || (zMins[minIndex] <= zMaxs[maxIndex] && minIndex != nObjects))
{
//Add split location to both bin lists
zLeft[i].rightEdge = zMins[minIndex];
zRight[i].rightEdge = zMins[minIndex];
//Add geometry to mins counter
zLeft[i+1].objectBoundCounter++;
minIndex++;
}
else
{
//Add split location to both bin lists
zLeft[i].rightEdge = zMaxs[maxIndex];
zRight[i].rightEdge = zMaxs[maxIndex];
//Add geometry to maxs counter
zRight[i].objectBoundCounter++;
maxIndex++;
}
}
//Free split memory
free(xMins);
free(xMaxs);
free(yMins);
free(yMaxs);
free(zMins);
free(zMaxs);
//PreCalcs
float voxelL = xRight[dnmoObjects].rightEdge - xLeft[0].rightEdge;
float voxelD = zRight[dnmoObjects].rightEdge - zLeft[0].rightEdge;
float voxelH = yRight[dnmoObjects].rightEdge - yLeft[0].rightEdge;
float voxelSA = 2.0f * voxelL * voxelD + 2.0f * voxelL * voxelH + 2.0f * voxelD * voxelH;
//Minimum cost preset to no split at all
float minCost = (float)nObjects;
float splitLoc;
int minLeftCounter = 0, minRightCounter = 0;
int axis = -1;
//---------------------------------------------------------------------------------------------
//Check costs of x-axis split planes keeping track of derivative using
//the fact that there is a minimum point on the graph costs vs split location
//Since there is one object per split plane
int splitIndex = 1;
float lastCost = nObjects * voxelL;
float tempCost;
float lastSplit = xLeft[1].rightEdge;
int leftCount = xLeft[1].objectBoundCounter, rightCount = nObjects - xRight[1].objectBoundCounter;
int lastLO = 0, lastRO = nObjects;
//Keep looping while cost is decreasing
while(splitIndex < dnObjects)
{
tempCost = leftCount * (xLeft[splitIndex].rightEdge - xLeft[0].rightEdge) + rightCount * (xLeft[dnmoObjects].rightEdge - xLeft[splitIndex].rightEdge);
if(tempCost < lastCost)
{
lastCost = tempCost;
lastSplit = xLeft[splitIndex].rightEdge;
lastLO = leftCount;
lastRO = rightCount;
}
//Update counters
splitIndex++;
leftCount += xLeft[splitIndex].objectBoundCounter;
rightCount -= xRight[splitIndex].objectBoundCounter;
}
//Calculate full SAH cost
lastCost = ((lastLO * (2 * (lastSplit - xLeft[0].rightEdge) * voxelD + 2 * (lastSplit - xLeft[0].rightEdge) * voxelH + 2 * voxelD * voxelH)) + (lastRO * (2 * (xLeft[dnmoObjects].rightEdge - lastSplit) * voxelD + 2 * (xLeft[dnmoObjects].rightEdge - lastSplit) * voxelH + 2 * voxelD * voxelH))) / voxelSA;
if(lastCost < minCost)
{
minCost = lastCost;
splitLoc = lastSplit;
minLeftCounter = lastLO;
minRightCounter = lastRO;
axis = 0;
}
//---------------------------------------------------------------------------------------------
//Repeat for y axis
splitIndex = 1;
lastCost = nObjects * voxelH;
lastSplit = yLeft[1].rightEdge;
leftCount = yLeft[1].objectBoundCounter;
rightCount = nObjects - yRight[1].objectBoundCounter;
lastLO = 0;
lastRO = nObjects;
//Keep looping while cost is decreasing
while(splitIndex < dnObjects)
{
tempCost = leftCount * (yLeft[splitIndex].rightEdge - yLeft[0].rightEdge) + rightCount * (yLeft[dnmoObjects].rightEdge - yLeft[splitIndex].rightEdge);
if(tempCost < lastCost)
{
lastCost = tempCost;
lastSplit = yLeft[splitIndex].rightEdge;
lastLO = leftCount;
lastRO = rightCount;
}
//Update counters
splitIndex++;
leftCount += yLeft[splitIndex].objectBoundCounter;
rightCount -= yRight[splitIndex].objectBoundCounter;
}
//Calculate full SAH cost
lastCost = ((lastLO * (2 * (lastSplit - yLeft[0].rightEdge) * voxelD + 2 * (lastSplit - yLeft[0].rightEdge) * voxelL + 2 * voxelD * voxelL)) + (lastRO * (2 * (yLeft[dnmoObjects].rightEdge - lastSplit) * voxelD + 2 * (yLeft[dnmoObjects].rightEdge - lastSplit) * voxelL + 2 * voxelD * voxelL))) / voxelSA;
if(lastCost < minCost)
{
minCost = lastCost;
splitLoc = lastSplit;
minLeftCounter = lastLO;
minRightCounter = lastRO;
axis = 1;
}
//---------------------------------------------------------------------------------------------
//Repeat for z axis
splitIndex = 1;
lastCost = nObjects * voxelD;
lastSplit = zLeft[1].rightEdge;
leftCount = zLeft[1].objectBoundCounter;
rightCount = nObjects - zRight[1].objectBoundCounter;
lastLO = 0;
lastRO = nObjects;
//Keep looping while cost is decreasing
while(splitIndex < dnObjects)
{
tempCost = leftCount * (zLeft[splitIndex].rightEdge - zLeft[0].rightEdge) + rightCount * (zLeft[dnmoObjects].rightEdge - zLeft[splitIndex].rightEdge);
if(tempCost < lastCost)
{
lastCost = tempCost;
lastSplit = zLeft[splitIndex].rightEdge;
lastLO = leftCount;
lastRO = rightCount;
}
//Update counters
splitIndex++;
leftCount += zLeft[splitIndex].objectBoundCounter;
rightCount -= zRight[splitIndex].objectBoundCounter;
}
//Calculate full SAH cost
lastCost = ((lastLO * (2 * (lastSplit - zLeft[0].rightEdge) * voxelL + 2 * (lastSplit - zLeft[0].rightEdge) * voxelH + 2 * voxelH * voxelL)) + (lastRO * (2 * (zLeft[dnmoObjects].rightEdge - lastSplit) * voxelL + 2 * (zLeft[dnmoObjects].rightEdge - lastSplit) * voxelH + 2 * voxelH * voxelL))) / voxelSA;
if(lastCost < minCost)
{
minCost = lastCost;
splitLoc = lastSplit;
minLeftCounter = lastLO;
minRightCounter = lastRO;
axis = 2;
}
//Free bin memory
free(xLeft);
free(xRight);
free(yLeft);
free(yRight);
free(zLeft);
free(zRight);
//---------------------------------------------------------------------------------------------
//Make sure a split is in our best interest
if(axis == -1)
{
//If not decrement the node counter
nodeIndex--;
BuildBranch(-1, objs, nObjects);
return;
}
//Allocate space for left and right lists
Mailbox** leftList = (Mailbox**)malloc(minLeftCounter * sizeof(void*));
Mailbox** rightList = (Mailbox**)malloc(minRightCounter * sizeof(void*));
//Sort objects into lists of those to the left and right of the split plane
int leftIndex = 0, rightIndex = 0;
leftCount = 0;
rightCount = 0;
switch(axis)
{
case 0:
for(int i = 0; i < nObjects; i++)
{
//Get object bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add to left and right lists when necessary
if(tempBox->x0 < splitLoc)
{
leftList[leftIndex++] = objs[i];
leftCount++;
}
if(tempBox->x1 > splitLoc)
{
rightList[rightIndex++] = objs[i];
rightCount++;
}
}
break;
case 1:
for(int i = 0; i < nObjects; i++)
{
//Get object bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add to left and right lists when necessary
if(tempBox->y0 < splitLoc)
{
leftList[leftIndex++] = objs[i];
leftCount++;
}
if(tempBox->y1 > splitLoc)
{
rightList[rightIndex++] = objs[i];
rightCount++;
}
}
break;
case 2:
for(int i = 0; i < nObjects; i++)
{
//Get object bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add to left and right lists when necessary
if(tempBox->z0 < splitLoc)
{
leftList[leftIndex++] = objs[i];
leftCount++;
}
if(tempBox->z1 > splitLoc)
{
rightList[rightIndex++] = objs[i];
rightCount++;
}
}
break;
};
//Delete the bounding box
delete tempBox;
//Delete old objects array
free(objs);
//Construct left and right branches
BuildBranch(height - 1, leftList, leftCount);
BuildBranch(height - 1, rightList, rightCount);
//Build this node
tree[thisNodeIndex] = KDTreeNode();
tree[thisNodeIndex].InitializeInterior(axis, splitLoc, nodeIndex - 1);
return;
}
EDIT:
Ok well I tried to replace the malloc/free with new/delete and that had no effect on the speed. I also found that it is only the free() on xLeft/xRight arrays that seem to affect the execution time significantly. I was able to eliminate the problem by moving the free() calls to after the recursive calls, although I do not know why this is making a difference because I don't see anywhere that these arrays are used after the original location for free(). As for why I am using malloc... some portions of this program use cache aligned memory, so I had been using _aligned_malloc. Although there probably is a way to get new to cache align, this is the only way I know to do it.
Is it possible that you are linking against a debug version of the runtime library that is doing something extra in free() like filling the memory with a garbage value? I have seen this behavior when you link against overly aggressive memory debugging libraries. The code that you have posted does not look strange. I would be interested to know what would happen if you replaced the arrays with std::vector or std::deque though. Vector should have behavior quite similar to the arrays and Deque may actually improve the speed a little if the arrays are large because the memory manager will not have to guarantee contiguous space.
If your program doing all of the free()ing on exit, then you might as well just skip the calls. The entire process heap is freed when you app exits.
Edit: ----
Ok, now that the code is posted, it appears to me that you aren't just freeing on exit, so you should definitely try and figure out if this is a wierd symptom of a bug, or just a costly implementation of free(). Instead of removing the free() calls, time how long it takes to execute them. is the heap manager really using up the whole 19 seconds?
I do see several places were multiple allocations have the same scope and lifetime. You could turn these into a single malloc/free call, althought that would make the code less clear and harder to mantain. So you have to ask yourself, how much does that 20 seconds matter?
Probably just the behavior of the heap manager your CRT uses. It's probably updating free lists, or some other internal structure to manage memory.
You probably should reexamine how your program allocates and uses memory if your bottleneck is here.
Having had a look at the code one big thing that comes to my mind is this - mixture of malloc(...), new(...), delete(...), free(...)
BoundingBox* tempBox = new BoundingBox();
// ....
//Delete the bounding box
delete tempBox;
yet in other places you have
Bin* xLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
// ....
free(xMins);
In short, you are mixing the C++'s runtime in calling new(...) and delete(...) with malloc(...) and free(...).. After all, this is in C++, so a question for you here...
Why did you use the malloc(...) and free(...) which is from C in the middle of this C++ code? The repercussions I could see here, is that the C++ runtime is different in terms of using the memory allocation unlike C in the aspect of OOP paradigm.
Having said this, your best bet is:
Replace all calls to malloc with new.
Replace all calls to free with delete.
Re run the program again and see if that makes a different. Can you confirm this?
Hope this helps,
Best regards,
Tom.
+1 to malloc/free making my eyes hurt in C++. Ignoring that for a second and looking at the code, three ideas:
Roll up your malloc calls to one large malloc and free (for the x/y/left/right/etc structures) instead of 12. Set the pointers into this large buffer as appropriate.
Still talking about the x/y/left/right variables: Employ a small stack based buffer, that you can use when the number of objects is small. When the number of objects is large, then dynamically allocate. When it is not, just set your pointer to the local stack buffer. This can avoid dynamic memory management all together for small inputs.
Right now, your "object" list is dynamically allocated, freed, and reallocated with each recursive call (!!). This is confusing because ownership isn't clear; but also it's a performance issue. Consider reworking the code so one list of "objects" is ever used.
C++ stores some extra information when you allocate using new like the type of the object or number of characters(in case of array) etc..If you are using free, it could be a fragmentation problem where you are actually deleting only the chunks of data in between but not freeing the actual information stored by new. Just a thought.
When you corrupt the heap, it often becomes very slow. Try to run it in debug mode with debug version of your runtime as well.
It could be poor locality of reference for your code. For example, I see the following:
//Allocate memory for split options
float* xMins = (float*)malloc(nObjects * sizeof(float));
float* yMins = (float*)malloc(nObjects * sizeof(float));
float* zMins = (float*)malloc(nObjects * sizeof(float));
float* xMaxs = (float*)malloc(nObjects * sizeof(float));
float* yMaxs = (float*)malloc(nObjects * sizeof(float));
float* zMaxs = (float*)malloc(nObjects * sizeof(float));
...
free(xMins);
free(xMaxs);
free(yMins);
free(yMaxs);
free(zMins);
free(zMaxs);
Now, assuming that the allocations proceed basically linearly, then free(xMaxs); may need to dereference memory that was allocated some number of pages away from xMins (which was just dereferenced during free(xMins);), so you might need to swap in a page from the backing store in order to perform the free (which causes a huge slowdown in execution when that happens). Re-ordering the free()'s to match the allocation order could help... In this case, that'd mean
free(xMins);
free(yMins);
free(zMins);
free(xMaxs);
free(yMaxs);
free(zMaxs);
It sounds like you are running your program from a debugger in Windows, which by default causes a special debug heap to be used, which dramatically slows down memory deallocations. This applies even to non-debug builds, as long as they are launched from a debugger (such as Visual Studio). You should be able to disable this behavior by setting the environment variable _NO_DEBUG_HEAP=1 before running your program (I recommend setting it in the project configuration settings rather than in the system settings, if possible).
You didn't describe anything about your programming environment in the original question, however, so I had to make certain assumptions about it that might be wrong. If you're not running your program under Windows, for example, then my answer doesn't apply and I have no idea what the cause of your problem might be.