The following question concerns the use of a vector and memcpy. The vector functions being used are .push_back, .data(), and .size().
Information about the msg.
a)
#define BUFFERSIZE 8<<20
char* msg = new char[(BUFFERSIZE / 4)];
Question: Why doesn't code block b) work?
When I run the code below, using vector.push_back in a for loop, it causes the software I'm working with to stop working. I'm not sending the "msg" nor am I reading it, I'm just creating it.
b)
mVertex vertex;
vector<mVertex>mVertices;
for (int i = 0; i < 35; i++)
{
vertex.posX = 2.0;
vertex.posY = 2.0;
vertex.posZ = 2.0;
vertex.norX = 2.0;
vertex.norY = 2.0;
vertex.norZ = 2.0;
vertex.U = 0.5;
vertex.V = 0.5;
mVertices.push_back(vertex);
}
memcpy(msg, // destination
mVertices.data(), // content
(mVertices.size() * sizeof(mVertex))); // size
Screenshot of the error message from the software
By adding +1 to mVertices.size() at the very last row, the software works fine. See the example code below.
c)
mVertex vertex;
vector<mVertex>mVertices;
for (int i = 0; i < 35; i++)
{
vertex.posX = 2.0;
vertex.posY = 2.0;
vertex.posZ = 2.0;
vertex.norX = 2.0;
vertex.norY = 2.0;
vertex.norZ = 2.0;
vertex.U = 0.5;
vertex.V = 0.5;
mVertices.push_back(vertex);
}
memcpy(msg, // destination
mVertices.data(), // content
(mVertices.size()+1 * sizeof(mVertex))); // size
The code also works if I remove the for loop.
d)
mVertex vertex;
vector<mVertex>mVertices;
vertex.posX = 2.0;
vertex.posY = 2.0;
vertex.posZ = 2.0;
vertex.norX = 2.0;
vertex.norY = 2.0;
vertex.norZ = 2.0;
vertex.U = 0.5;
vertex.V = 0.5;
mVertices.push_back(vertex);
memcpy(msg, // destination
mVertices.data(), // content
(mVertices.size() * sizeof(mVertex))); // size
The problem is a basic macro issue: macros define text replacements, not logical or arithmetic expressions.
#define BUFFERSIZE 8<<20
creates a text macro. When you use it in this expression (I've removed the redundant parentheses):
char* msg = new char[BUFFERSIZE / 4];
the preprocessor replaces BUFFERSIZE with 8 << 20, so it's as if you had written
char* msg = new char[8 << 20 / 4];
and the problem is that 8 << 20 / 4 is 256. That's because the expression is evaluated as 8 << (20/4), where presumably you intended it to be (8 << 20) / 4. To fix that (and you should always do this with macros), put parentheses around the expression in the macro itself:
#define BUFFERSIZE (8<<20)
Incidentally, that's why using a named variable (whether constexpr or otherwise) makes the problem go away: the variable gets the value 8 << 20, not the text, so all is good.
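To see the substitution concretely, here is a minimal sketch. The macro names BUFFERSIZE_BARE and BUFFERSIZE_PAREN and the two helper functions are made up for illustration; only the 8<<20 definition comes from the question.

```cpp
#include <cstddef>

// Illustrative names, not from the original code.
#define BUFFERSIZE_BARE 8 << 20      // expands as raw text
#define BUFFERSIZE_PAREN (8 << 20)   // expands as one parenthesized expression

// 8 << 20 / 4 parses as 8 << (20 / 4), i.e. 8 << 5
constexpr std::size_t quarter_bare() { return BUFFERSIZE_BARE / 4; }

// (8 << 20) / 4 is 8388608 / 4
constexpr std::size_t quarter_paren() { return BUFFERSIZE_PAREN / 4; }

static_assert(quarter_bare() == 256, "division binds tighter than shift");
static_assert(quarter_paren() == 2097152, "parenthesized macro gives 2 MiB");
```

Both static_assert lines are checked at compile time, so the difference shows up without running anything.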
The define doesn't do what you think.
#define BUFFERSIZE 8<<20
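yields BUFFERSIZE / 4 == 8 << 20 / 4 == 8 << 5 == 256. So the memory allocated for msg is too small to hold mVertices, and
memcpy(msg, mVertices.data(), (mVertices.size() * sizeof(mVertex)));
writes past the end of the 256-byte buffer. This is undefined behavior and can produce runtime errors. It would also explain why the single-vertex version d) appears to work: presumably one vertex still fits in 256 bytes.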
You should use constexpr instead of define to avoid such problems.
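A minimal sketch of that suggestion, combined with a size check before the copy. The mVertex layout is an assumption (the question never shows it), and kBufferSize and copy_vertices are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed layout: the question does not show mVertex, so this is a guess.
struct mVertex {
    float posX, posY, posZ;
    float norX, norY, norZ;
    float U, V;
};

// A typed value instead of macro text, so kBufferSize / 4 means (8 << 20) / 4.
constexpr std::size_t kBufferSize = 8u << 20;

// Hypothetical helper: copies the vertices into dst only if they fit.
// Returns the number of bytes copied, or 0 if nothing was copied.
std::size_t copy_vertices(char* dst, std::size_t dst_capacity,
                          const std::vector<mVertex>& vertices) {
    const std::size_t bytes = vertices.size() * sizeof(mVertex);
    if (bytes == 0 || bytes > dst_capacity) {
        return 0;  // empty input or destination too small
    }
    std::memcpy(dst, vertices.data(), bytes);
    return bytes;
}
```

With a 256-byte destination the 35-vertex copy is refused instead of overrunning the buffer; with kBufferSize / 4 bytes it goes through.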
I am new to DPC++, and I am trying to develop an MPI-based DPC++ Poisson solver. I read the book and am very confused about buffers versus pointers with shared or host memory. What is the difference between the two, and which should I use when I develop the code?
Now, I use a buffer initialized from an std::array with a constant size for the serial code, and it works well. However, when I couple the DPC++ code with MPI, I have to declare a local length for each device, and I fail to do that. Here is my code:
#define nx 359
#define ny 359
constexpr int local_len[2];
global_len[0] = nx + 1;
global_len[1] = ny + 1;
for (int i = 1; i < process; i++)
{
if (process % i == 0)
{
px = i;
py = process / i;
config_e = 1. / (2. * (global_len[1] * (px - 1) / py + global_len[0] * (py - 1) / px));
}
if (config_e >= cmax)
{
cmax = config_e;
cart_num_proc[0] = px;
cart_num_proc[1] = py;
}
}
local_len[0] = global_len[0] / cart_num_proc[0];
local_len[1] = global_len[1] / cart_num_proc[1];
constexpr int lx = local_len[0];
constexpr int ly = local_len[1];
queue Q{};
double *m_cellValue = malloc_shared<double>(size, Q);
I got the error
error: default initialization of an object of const type 'const int[2]'
error: cannot assign to variable 'local_len' with const-qualified type 'const int[2]'
main.cpp:52:18: error: cannot assign to variable 'local_len' with const-qualified type 'const int[2]'
Is there any way to just define a variable size array to do the parallel for in DPC++?
You are too eager in using constexpr. Remove all three occurrences in this code, and it should compile. So this has nothing to do with DPC++.
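The reason is that constexpr requires a value known at compile time, while the local lengths depend on the number of processes, which is only known at run time. A minimal sketch in plain C++ follows; local_extent and make_cells are hypothetical names, and a std::vector stands in for the malloc_shared<double>(size, Q) call from the question, whose element count can likewise be an ordinary run-time value.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper: per-rank extent of a global grid split over a
// px-by-py process grid (remainders are ignored, as in the question).
std::array<int, 2> local_extent(const std::array<int, 2>& global_len,
                                const std::array<int, 2>& cart_num_proc) {
    return {global_len[0] / cart_num_proc[0],
            global_len[1] / cart_num_proc[1]};
}

// Storage sized at run time. const would be fine for lx and ly here;
// constexpr would not, because the values are not compile-time constants.
std::vector<double> make_cells(const std::array<int, 2>& local_len) {
    const std::size_t lx = static_cast<std::size_t>(local_len[0]);
    const std::size_t ly = static_cast<std::size_t>(local_len[1]);
    return std::vector<double>(lx * ly, 0.0);
}
```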
I am making a simple text renderer with vulkan and I am using freetype to format my text.
I read the freetype tutorial and I have come up with the following function:
void Scribe::CreateSingleLineGeometry(const string &text, ScGeomMetaInfo &info,
ScGeomtry &geometry)
{
float texture_length = info.texture_len;
float h_offset = info.h_offset;
float char_l = info.char_len;
float v_anchor = info.v_offset;
auto &vertices = geometry.first;
auto &indices = geometry.second;
auto length = max(ft_face->bbox.xMax - ft_face->bbox.xMin,
ft_face->bbox.yMax - ft_face->bbox.yMin);
for(auto &c: text)
{
auto glyph_index = FT_Get_Char_Index(ft_face, c);
auto error = FT_Load_Glyph(ft_face, glyph_index, FT_LOAD_NO_HINTING);
auto metrics = ft_face->glyph->metrics;
float g_height = metrics.height;
float g_bearing = metrics.horiBearingY;
float correction = 1.0 - (g_height) / float(length) -
float(metrics.horiBearingY - g_height) / float(length);
correction *= char_l;
float bearingX = char_l * float(metrics.horiBearingX) / float(length);
// Insert the vertex positions and uvs into the returned geometry
float h_coords[] = {h_offset + bearingX, h_offset + bearingX + char_l};
float v_coords[] = {v_anchor + correction, v_anchor + correction + char_l};
auto glyph_data = glyph_map[glyph_index];
float tex_h_coords[] = {glyph_data.lt_uv.x, glyph_data.rb_uv.x};
float tex_v_coords[] = {glyph_data.lt_uv.y, glyph_data.rb_uv.y};
for(int x=0; x<2; x++) {
for(int y=0; y<2; y++) {
vertices.insert(end(vertices), {h_coords[x], v_coords[y], 0});
vertices.insert(end(vertices), {tex_h_coords[x], tex_v_coords[y]});
}
}
// Setup the indices of the current quad
// There's 4 vertices in a quad, so we offset each index by (quad_size * num_quads)
uint delta = 4 * info.c_num++;
indices.insert(end(indices), {0+delta, 3+delta, 1+delta, 0+delta, 3+delta, 2+delta});
h_offset += char_l * float(metrics.horiAdvance) / float(length) + 0.03;
}
}
In particular I want to emphasize the line:
h_offset += char_l * float(metrics.horiAdvance) / float(length) + 0.03;
That 0.03 at the end of the line doesn't come from anywhere; I inserted it there to make things look good.
This is the result with that extra offset:
Which I think looks pretty good. However, if I were to remove the extra offset:
h_offset += char_l * float(metrics.horiAdvance) / float(length);
I get something that doesn't look right at all. Why isn't the advance enough to correctly format the font?
As an exercise, I'm translating my master's thesis finite-difference time-domain code for simulating wave propagation from MATLAB to C++, and I've come across the following problem.
I would like to create a class that corresponds to a non-physical absorbing layer called CPML. The size of the layer depends on the desired parameters of the simulation, so the arrays that define the absorbing layer have to be dynamic.
#ifndef fdtd_h
#define fdtd_h
#include <cmath>
#include <iostream>
#include <sstream>
using namespace std;
class cpml {
public:
int thickness;
int n_1, n_2, n_3;
double cut_off_freq;
double kappa_x_max, sigma_x_1_max, sigma_x_2_max, alpha_x_max;
double *kappa_x_tau_xy, *sigma_x_tau_xy, *alpha_x_tau_xy;
void set_cpml_parameters_tau_xy();
};
void cpml::set_cpml_parameters_tau_xy(){
double temp1[thickness], temp2[thickness], temp3[thickness];
for(int j = 1; j < thickness; j++){
temp1[j] = 1 + kappa_x_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1);
temp2[j] = sigma_x_1_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1 + n_2);
temp3[j] = alpha_x_max * pow((double)(j - 0.5) / (double)(thickness - 1), n_3);
}
kappa_x_tau_xy = temp1;
sigma_x_tau_xy = temp2;
for(int i = 1; i < thickness; i++){
cout << sigma_x_tau_xy[i] << endl;
}
alpha_x_tau_xy = temp3;
}
#endif /* fdtd_h */
When I call the function cpml::set_cpml_parameters_tau_xy() in my main function, the first value of the array sigma_x_tau_xy is correct. However, the subsequent values aren't.
#include "fdtd.h"
using namespace std;
int main() {
cpml cpml;
int cpml_thickness = 10;
cpml.thickness = cpml_thickness;
int n_1 = 3, n_2 = 0, n_3 = 3;
cpml.n_1 = n_1; cpml.n_2 = n_2; cpml.n_3 = n_3;
double cut_off_freq = 1;
cpml.cut_off_freq = cut_off_freq;
double kappa_x_max = 0;
double sigma_x_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x), sigma_x_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x);
double alpha_x_max = 2 * PI * cpml.cut_off_freq;
double kappa_y_max = 0;
double sigma_y_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y), sigma_y_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y);
double alpha_y_max = 2 * PI * cpml.cut_off_freq;
cpml.kappa_x_max = kappa_x_max; cpml.sigma_x_1_max = sigma_x_1_max; cpml.sigma_x_2_max = sigma_x_2_max; cpml.alpha_x_max = alpha_x_max;
cpml.kappa_y_max = kappa_y_max; cpml.sigma_y_1_max = sigma_y_1_max; cpml.sigma_y_2_max = sigma_y_2_max; cpml.alpha_y_max = alpha_y_max;
cpml.set_cpml_parameters_tau_xy();
for(int j = 1; j < cpml.thickness; j++){
cout << *(cpml.sigma_x_tau_xy + j) << endl;
}
}
What am I doing wrong, and how do I make the dynamic array members of the class cpml contain the correct values when accessed in the main function?
Two problems: The lesser of them is that your program is technically not a valid C++ program, since C++ doesn't have variable-length arrays (which your arrays temp1, temp2 and temp3 are).
The more serious problem is that you save pointers to local variables. When a function returns, local variables go out of scope and no longer exist. Pointers to them will become invalid, and using those pointers will lead to undefined behavior.
Both problems are easily solved by using std::vector instead of arrays and pointers.
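As a minimal sketch of that fix: CpmlProfile below is a hypothetical, trimmed-down stand-in for the cpml class, keeping only one profile array. The formula mirrors the one in the question, but the zero-based index is an assumption about what the MATLAB code intended.

```cpp
#include <cmath>
#include <vector>

// Hypothetical, trimmed-down stand-in for the cpml class in the question.
struct CpmlProfile {
    int thickness = 0;
    double sigma_max = 0.0;
    int order = 0;
    std::vector<double> sigma;  // owns its storage, so nothing dangles

    // Fills sigma; the data stays valid after this function returns because
    // the vector is a member, not a local array. Assumes thickness > 1.
    void build() {
        sigma.resize(thickness);  // a run-time size is fine for std::vector
        for (int j = 0; j < thickness; ++j) {
            const double x = (thickness - j - 0.5) / (thickness - 1);
            sigma[j] = sigma_max * std::pow(x, order);
        }
    }
};
```

Callers read the values through profile.sigma[j] after build() returns; no pointer to a dead local is involved.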
You cannot declare an array in C++ without a "constant" expression for its size (the bounds must be known at compile time). That means this code is invalid:
double temp1[thickness], temp2[thickness], temp3[thickness];
What you should instead do is the following:
class cpml
{
//...
std::vector<double> kappa_x_tau_xy, sigma_x_tau_xy, alpha_x_tau_xy;
// ...
};
void cpml::set_cpml_parameters_tau_xy(){
alpha_x_tau_xy.resize(thickness);
kappa_x_tau_xy.resize(thickness);
sigma_x_tau_xy.resize(thickness);
//...
}
std::vector will handle all the dynamic allocation under the hood for you. If your code compiled, it was because you were using a nonstandard GCC extension for variable-length arrays. Turn your warnings up (-Wall -pedantic -Werror) when you compile and it should complain more.
Note that you also have issues in array bounds. Whereas Matlab is 1-indexed, C++ is 0-indexed, so you'll need to do this, too:
for(int j = 0; j < thickness; j++){
kappa_x_tau_xy[j] = 1 + kappa_x_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1);
sigma_x_tau_xy[j] = sigma_x_1_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1 + n_2);
alpha_x_tau_xy[j] = alpha_x_max * pow((double)(j - 0.5) / (double)(thickness - 1), n_3);
}
You have a similar issue in main:
for(int j = 1; j < cpml.thickness; j++){
cout << *(cpml.sigma_x_tau_xy + j) << endl;
}
Should become:
for(int j = 0; j < cpml.thickness; j++){
cout << cpml.sigma_x_tau_xy[j] << endl;
}
Additional Notes:
Your code is very unstructured. Consider putting all of the cpml-related getting and setting into the cpml class (Encapsulation: https://en.wikipedia.org/wiki/Encapsulation_(computer_programming)). This will make it easier for the client (you, in this case) to interact with the object.
This will include hiding your class data as protected or private and exposing functions to get and set those variables (don't forget const where appropriate).
Add a constructor to initialize all of the fields at once. As it stands now, your class consists of mostly uninitialized garbage for much of its lifetime. If someone were to prematurely try to access a field, you're in Undefined Behavior territory.
std::endl is good for printing newline characters, but restrict that to Debug-only code. The reason is that it flushes the buffer every time it's called, which can make your code slower overall if it's printing a lot. Use a newline character "\n" instead for Release.
An additional benefit of std::vector is that it makes copying and assigning to a cpml well behaved. With raw pointer members, the compiler-generated copy constructor and copy assignment perform a shallow copy instead of the deep copy that you'd want.
After restructuring your class, your main might look something like this:
int main() {
int cpml_thickness = 10;
int n_1 = 3, n_2 = 0, n_3 = 3;
double cut_off_freq = 1;
double kappa_x_max = 0;
double sigma_x_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x), sigma_x_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x);
double alpha_x_max = 2 * PI * cut_off_freq;
double kappa_y_max = 0;
double sigma_y_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y), sigma_y_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y);
double alpha_y_max = 2 * PI * cut_off_freq;
cpml cpml(cpml_thickness, n_1, n_2, n_3, cut_off_freq, kappa_x_max, kappa_y_max, sigma_x_1_max, sigma_x_2_max, alpha_x_max, alpha_y_max);
cpml.set_cpml_parameters_tau_xy();
cpml.PrintSigmaTauXY(std::cout);
}
Which is arguably better. (You might use a getter to get sigma_tau_xy from the class and then print it yourself, though). And then you can think about how to simplify things even further by creating objects that represent the logical groupings of alpha_x_max and alpha_y_max etc. This could be a std::pair or a full-on struct with its own getters and setters. Now their own logic is grouped together and is easy to pass around/reference/think about. Your constructor for cpml also becomes simpler, where you accept a single parameter that represents both x and y instead of separate ones for both.
Matlab doesn't really encourage an Object-Oriented approach in my (admittedly brief) experience, but in C++ it's easy.
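One way the grouping and the constructor could look, as a rough sketch only. AxisPair, the class name Cpml, and the constructor signature are all assumptions made for illustration; only the member names loosely follow the question.

```cpp
#include <vector>

// Hypothetical grouping of a quantity that exists once per axis.
struct AxisPair {
    double x;
    double y;
};

// Sketch only: every field is initialized in the constructor, the data is
// private, and read access goes through const getters.
class Cpml {
public:
    Cpml(int thickness, AxisPair kappa_max, AxisPair alpha_max)
        : thickness_(thickness),
          kappa_max_(kappa_max),
          alpha_max_(alpha_max),
          sigma_x_tau_xy_(thickness, 0.0) {}

    int thickness() const { return thickness_; }
    const AxisPair& kappa_max() const { return kappa_max_; }
    const AxisPair& alpha_max() const { return alpha_max_; }
    const std::vector<double>& sigma_x_tau_xy() const { return sigma_x_tau_xy_; }

private:
    int thickness_;
    AxisPair kappa_max_;
    AxisPair alpha_max_;
    std::vector<double> sigma_x_tau_xy_;
};
```

Because the only resource-owning member is a std::vector, the compiler-generated copy operations already make deep copies.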
I am using intrinsics to accelerate some OpenCV code. But after I replaced the code with intrinsics, the runtime cost is almost the same or maybe even worse. I cannot figure out what is happening or why. I have been searching on this issue for quite a long time, but nothing has changed. I would appreciate it if someone could help me out. Thank you very much! Here is my code:
// if useSSE is true,run the code with intrinsics and takes 1.45ms in my computer
// and if not run the general code and takes the same time.
cv::Mat<float> results(shape.rows,2);
if (useSSE) {
float* pshape = (float*)shape.data;
results = shape.clone();
float* presults = (float*)results.data;
// use SSE
__m128 xyxy_center = _mm_set_ps(bbox.center_y, bbox.center_x, bbox.center_y, bbox.center_x);
float bbox_width = bbox.width/2;
float bbox_height = bbox.height/2;
__m128 xyxy_size = _mm_set_ps(bbox_height, bbox_width, bbox_height, bbox_width);
gettimeofday(&start, NULL); // this is for counting time
int shape_size = shape.rows*shape.cols;
for (int i=0; i<shape_size; i +=4) {
__m128 a = _mm_loadu_ps(pshape+i);
__m128 result = _mm_div_ps(_mm_sub_ps(a, xyxy_center), xyxy_size);
_mm_storeu_ps(presults+i, result);
}
}else {
//SSE TO BE DONE
for (int i = 0; i < shape.rows; i++){
results(i, 0) = (shape(i, 0) - bbox.center_x) / (bbox.width / 2.0);
results(i, 1) = (shape(i, 1) - bbox.center_y) / (bbox.height / 2.0);
}
}
gettimeofday(&end, NULL);
diff = 1000000*(end.tv_sec-start.tv_sec)+end.tv_sec-start.tv_usec;
std::cout<<diff<<"-----"<<std::endl;
return results;
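Your SSE loop will corrupt memory near the results variable if shape.rows % 2 == 1: it steps four floats at a time, so with an odd number of two-float rows the last iteration reads and writes two floats past the end.
Try to avoid the i variable in the loop and use pointers directly. The compiler may optimize away the additional addition, or it may not.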
Use multiplication instead of division:
float bbox_width_inv = 2./bbox.width;
float bbox_height_inv = 2./bbox.height;
// reciprocal sizes, so the loop can multiply instead of divide
__m128 xyxy_size_inv = _mm_set_ps(bbox_height_inv, bbox_width_inv, bbox_height_inv, bbox_width_inv);
float* pshape_end = pshape + shape.rows*shape.cols;
// round the element count down to a multiple of 4 for the vectorized part
float* pshape_end_batch = pshape + ((shape.rows*shape.cols) & ~3);
for (; pshape<pshape_end_batch; pshape+=4, presults+=4) {
__m128 a = _mm_loadu_ps(pshape);
__m128 result = _mm_mul_ps(_mm_sub_ps(a, xyxy_center), xyxy_size_inv);
_mm_storeu_ps(presults, result);
}
// scalar tail: handles the leftover point when shape.rows is odd
while (pshape < pshape_end) {
*presults++ = (*pshape++ - bbox.center_x) * bbox_width_inv;
*presults++ = (*pshape++ - bbox.center_y) * bbox_height_inv;
}
Try to disassemble the code generated from the intrinsics, and make sure there are enough registers to perform your operations and that it doesn't store temporary results in RAM.
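Putting the points of this answer together, a self-contained sketch might look as follows. normalize_points is a hypothetical free function that works on raw float pointers rather than cv::Mat, it assumes an x86 target with SSE, and multiplying by the reciprocal can differ from a true division in the last bit.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical free function: points are interleaved as x0,y0,x1,y1,...
// and n_floats is the total number of floats (two per point).
void normalize_points(const float* in, float* out, std::size_t n_floats,
                      float center_x, float center_y,
                      float half_width, float half_height) {
    const float inv_w = 1.0f / half_width;
    const float inv_h = 1.0f / half_height;
    // _mm_set_ps takes its arguments from the highest lane to the lowest.
    const __m128 center = _mm_set_ps(center_y, center_x, center_y, center_x);
    const __m128 scale = _mm_set_ps(inv_h, inv_w, inv_h, inv_w);

    // Vectorized part: only whole groups of four floats.
    const std::size_t n_vec = n_floats & ~static_cast<std::size_t>(3);
    std::size_t i = 0;
    for (; i < n_vec; i += 4) {
        const __m128 a = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_sub_ps(a, center), scale));
    }
    // Scalar tail: at most one leftover point (two floats).
    for (; i + 1 < n_floats; i += 2) {
        out[i] = (in[i] - center_x) * inv_w;
        out[i + 1] = (in[i + 1] - center_y) * inv_h;
    }
}
```

With three points (six floats), the first two points go through the SSE loop and the third through the scalar tail, so nothing is written past the end of out.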
I am using free to free the memory allocated for a bunch of temporary arrays in a recursive function. I would post the code but it is pretty long. When I comment out these free() calls, the program runs in less than a second. However, when I am using them, the program takes about 20 seconds to run. Why is this happening, and how can it be fixed? This is like 100 or so MB so I'd rather not just leave the memory leak.
Additionally, when I run the program that includes all of the free() calls with profiling enabled, it runs in less than a second. I don't know how that would have an effect, but it does.
After using only some of the free() calls, it seems that there are a few in particular that cause the program to slow down. The rest do not seem to have an effect.
Ok... here's the code as requested:
void KDTree::BuildBranch(int height, Mailbox** objs, int nObjects)
{
int dnObjects = nObjects * 2;
int dnmoObjects = dnObjects - 1;
//Check for termination
if(height == -1 || nObjects < minObjectsPerNode)
{
//Create leaf
tree[nodeIndex] = KDTreeNode();
if(nObjects == 1)
tree[nodeIndex].InitializeLeaf(objs[0], 1);
else
tree[nodeIndex].InitializeLeaf(objs, nObjects);
//Added a node, increment index
nodeIndex++;
return;
}
//Save this node's index and increment the current index to save space for this node
int thisNodeIndex = nodeIndex;
nodeIndex++;
//Allocate memory for split options
float* xMins = (float*)malloc(nObjects * sizeof(float));
float* yMins = (float*)malloc(nObjects * sizeof(float));
float* zMins = (float*)malloc(nObjects * sizeof(float));
float* xMaxs = (float*)malloc(nObjects * sizeof(float));
float* yMaxs = (float*)malloc(nObjects * sizeof(float));
float* zMaxs = (float*)malloc(nObjects * sizeof(float));
//Find all possible split locations
int index = 0;
BoundingBox* tempBox = new BoundingBox();
for(int i = 0; i < nObjects; i++)
{
//Get bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add mins to split lists
xMins[index] = tempBox->x0;
yMins[index] = tempBox->y0;
zMins[index] = tempBox->z0;
//Add maxs
xMaxs[index] = tempBox->x1;
yMaxs[index] = tempBox->y1;
zMaxs[index] = tempBox->z1;
index++;
}
//Sort lists
Util::sortFloats(xMins, nObjects);
Util::sortFloats(yMins, nObjects);
Util::sortFloats(zMins, nObjects);
Util::sortFloats(xMaxs, nObjects);
Util::sortFloats(yMaxs, nObjects);
Util::sortFloats(zMaxs, nObjects);
//Allocate bin lists
Bin* xLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* xRight = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* yLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* yRight = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* zLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
Bin* zRight = (Bin*)malloc(dnObjects * sizeof(Bin));
//Initialize all bins
for(int i = 0; i < dnObjects; i++)
{
xLeft[i] = Bin(0, 0.0f);
xRight[i] = Bin(0, 0.0f);
yLeft[i] = Bin(0, 0.0f);
yRight[i] = Bin(0, 0.0f);
zLeft[i] = Bin(0, 0.0f);
zRight[i] = Bin(0, 0.0f);
}
//Construct min and max bins bins from split locations
//Merge min/max lists together for each axis
int minIndex = 0, maxIndex = 0;
for(int i = 0; i < dnObjects; i++)
{
if(maxIndex == nObjects || (xMins[minIndex] <= xMaxs[maxIndex] && minIndex != nObjects))
{
//Add split location to both bin lists
xLeft[i].rightEdge = xMins[minIndex];
xRight[i].rightEdge = xMins[minIndex];
//Add geometry to mins counter
xLeft[i+1].objectBoundCounter++;
minIndex++;
}
else
{
//Add split location to both bin lists
xLeft[i].rightEdge = xMaxs[maxIndex];
xRight[i].rightEdge = xMaxs[maxIndex];
//Add geometry to maxs counter
xRight[i].objectBoundCounter++;
maxIndex++;
}
}
//Repeat for y axis
minIndex = 0, maxIndex = 0;
for(int i = 0; i < dnObjects; i++)
{
if(maxIndex == nObjects || (yMins[minIndex] <= yMaxs[maxIndex] && minIndex != nObjects))
{
//Add split location to both bin lists
yLeft[i].rightEdge = yMins[minIndex];
yRight[i].rightEdge = yMins[minIndex];
//Add geometry to mins counter
yLeft[i+1].objectBoundCounter++;
minIndex++;
}
else
{
//Add split location to both bin lists
yLeft[i].rightEdge = yMaxs[maxIndex];
yRight[i].rightEdge = yMaxs[maxIndex];
//Add geometry to maxs counter
yRight[i].objectBoundCounter++;
maxIndex++;
}
}
//Repeat for z axis
minIndex = 0, maxIndex = 0;
for(int i = 0; i < dnObjects; i++)
{
if(maxIndex == nObjects || (zMins[minIndex] <= zMaxs[maxIndex] && minIndex != nObjects))
{
//Add split location to both bin lists
zLeft[i].rightEdge = zMins[minIndex];
zRight[i].rightEdge = zMins[minIndex];
//Add geometry to mins counter
zLeft[i+1].objectBoundCounter++;
minIndex++;
}
else
{
//Add split location to both bin lists
zLeft[i].rightEdge = zMaxs[maxIndex];
zRight[i].rightEdge = zMaxs[maxIndex];
//Add geometry to maxs counter
zRight[i].objectBoundCounter++;
maxIndex++;
}
}
//Free split memory
free(xMins);
free(xMaxs);
free(yMins);
free(yMaxs);
free(zMins);
free(zMaxs);
//PreCalcs
float voxelL = xRight[dnmoObjects].rightEdge - xLeft[0].rightEdge;
float voxelD = zRight[dnmoObjects].rightEdge - zLeft[0].rightEdge;
float voxelH = yRight[dnmoObjects].rightEdge - yLeft[0].rightEdge;
float voxelSA = 2.0f * voxelL * voxelD + 2.0f * voxelL * voxelH + 2.0f * voxelD * voxelH;
//Minimum cost preset to no split at all
float minCost = (float)nObjects;
float splitLoc;
int minLeftCounter = 0, minRightCounter = 0;
int axis = -1;
//---------------------------------------------------------------------------------------------
//Check costs of x-axis split planes keeping track of derivative using
//the fact that there is a minimum point on the graph costs vs split location
//Since there is one object per split plane
int splitIndex = 1;
float lastCost = nObjects * voxelL;
float tempCost;
float lastSplit = xLeft[1].rightEdge;
int leftCount = xLeft[1].objectBoundCounter, rightCount = nObjects - xRight[1].objectBoundCounter;
int lastLO = 0, lastRO = nObjects;
//Keep looping while cost is decreasing
while(splitIndex < dnObjects)
{
tempCost = leftCount * (xLeft[splitIndex].rightEdge - xLeft[0].rightEdge) + rightCount * (xLeft[dnmoObjects].rightEdge - xLeft[splitIndex].rightEdge);
if(tempCost < lastCost)
{
lastCost = tempCost;
lastSplit = xLeft[splitIndex].rightEdge;
lastLO = leftCount;
lastRO = rightCount;
}
//Update counters
splitIndex++;
leftCount += xLeft[splitIndex].objectBoundCounter;
rightCount -= xRight[splitIndex].objectBoundCounter;
}
//Calculate full SAH cost
lastCost = ((lastLO * (2 * (lastSplit - xLeft[0].rightEdge) * voxelD + 2 * (lastSplit - xLeft[0].rightEdge) * voxelH + 2 * voxelD * voxelH)) + (lastRO * (2 * (xLeft[dnmoObjects].rightEdge - lastSplit) * voxelD + 2 * (xLeft[dnmoObjects].rightEdge - lastSplit) * voxelH + 2 * voxelD * voxelH))) / voxelSA;
if(lastCost < minCost)
{
minCost = lastCost;
splitLoc = lastSplit;
minLeftCounter = lastLO;
minRightCounter = lastRO;
axis = 0;
}
//---------------------------------------------------------------------------------------------
//Repeat for y axis
splitIndex = 1;
lastCost = nObjects * voxelH;
lastSplit = yLeft[1].rightEdge;
leftCount = yLeft[1].objectBoundCounter;
rightCount = nObjects - yRight[1].objectBoundCounter;
lastLO = 0;
lastRO = nObjects;
//Keep looping while cost is decreasing
while(splitIndex < dnObjects)
{
tempCost = leftCount * (yLeft[splitIndex].rightEdge - yLeft[0].rightEdge) + rightCount * (yLeft[dnmoObjects].rightEdge - yLeft[splitIndex].rightEdge);
if(tempCost < lastCost)
{
lastCost = tempCost;
lastSplit = yLeft[splitIndex].rightEdge;
lastLO = leftCount;
lastRO = rightCount;
}
//Update counters
splitIndex++;
leftCount += yLeft[splitIndex].objectBoundCounter;
rightCount -= yRight[splitIndex].objectBoundCounter;
}
//Calculate full SAH cost
lastCost = ((lastLO * (2 * (lastSplit - yLeft[0].rightEdge) * voxelD + 2 * (lastSplit - yLeft[0].rightEdge) * voxelL + 2 * voxelD * voxelL)) + (lastRO * (2 * (yLeft[dnmoObjects].rightEdge - lastSplit) * voxelD + 2 * (yLeft[dnmoObjects].rightEdge - lastSplit) * voxelL + 2 * voxelD * voxelL))) / voxelSA;
if(lastCost < minCost)
{
minCost = lastCost;
splitLoc = lastSplit;
minLeftCounter = lastLO;
minRightCounter = lastRO;
axis = 1;
}
//---------------------------------------------------------------------------------------------
//Repeat for z axis
splitIndex = 1;
lastCost = nObjects * voxelD;
lastSplit = zLeft[1].rightEdge;
leftCount = zLeft[1].objectBoundCounter;
rightCount = nObjects - zRight[1].objectBoundCounter;
lastLO = 0;
lastRO = nObjects;
//Keep looping while cost is decreasing
while(splitIndex < dnObjects)
{
tempCost = leftCount * (zLeft[splitIndex].rightEdge - zLeft[0].rightEdge) + rightCount * (zLeft[dnmoObjects].rightEdge - zLeft[splitIndex].rightEdge);
if(tempCost < lastCost)
{
lastCost = tempCost;
lastSplit = zLeft[splitIndex].rightEdge;
lastLO = leftCount;
lastRO = rightCount;
}
//Update counters
splitIndex++;
leftCount += zLeft[splitIndex].objectBoundCounter;
rightCount -= zRight[splitIndex].objectBoundCounter;
}
//Calculate full SAH cost
lastCost = ((lastLO * (2 * (lastSplit - zLeft[0].rightEdge) * voxelL + 2 * (lastSplit - zLeft[0].rightEdge) * voxelH + 2 * voxelH * voxelL)) + (lastRO * (2 * (zLeft[dnmoObjects].rightEdge - lastSplit) * voxelL + 2 * (zLeft[dnmoObjects].rightEdge - lastSplit) * voxelH + 2 * voxelH * voxelL))) / voxelSA;
if(lastCost < minCost)
{
minCost = lastCost;
splitLoc = lastSplit;
minLeftCounter = lastLO;
minRightCounter = lastRO;
axis = 2;
}
//Free bin memory
free(xLeft);
free(xRight);
free(yLeft);
free(yRight);
free(zLeft);
free(zRight);
//---------------------------------------------------------------------------------------------
//Make sure a split is in our best interest
if(axis == -1)
{
//If not decrement the node counter
nodeIndex--;
BuildBranch(-1, objs, nObjects);
return;
}
//Allocate space for left and right lists
Mailbox** leftList = (Mailbox**)malloc(minLeftCounter * sizeof(void*));
Mailbox** rightList = (Mailbox**)malloc(minRightCounter * sizeof(void*));
//Sort objects into lists of those to the left and right of the split plane
int leftIndex = 0, rightIndex = 0;
leftCount = 0;
rightCount = 0;
switch(axis)
{
case 0:
for(int i = 0; i < nObjects; i++)
{
//Get object bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add to left and right lists when necessary
if(tempBox->x0 < splitLoc)
{
leftList[leftIndex++] = objs[i];
leftCount++;
}
if(tempBox->x1 > splitLoc)
{
rightList[rightIndex++] = objs[i];
rightCount++;
}
}
break;
case 1:
for(int i = 0; i < nObjects; i++)
{
//Get object bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add to left and right lists when necessary
if(tempBox->y0 < splitLoc)
{
leftList[leftIndex++] = objs[i];
leftCount++;
}
if(tempBox->y1 > splitLoc)
{
rightList[rightIndex++] = objs[i];
rightCount++;
}
}
break;
case 2:
for(int i = 0; i < nObjects; i++)
{
//Get object bounding box
objs[i]->prim->MakeBoundingBox(tempBox);
//Add to left and right lists when necessary
if(tempBox->z0 < splitLoc)
{
leftList[leftIndex++] = objs[i];
leftCount++;
}
if(tempBox->z1 > splitLoc)
{
rightList[rightIndex++] = objs[i];
rightCount++;
}
}
break;
};
//Delete the bounding box
delete tempBox;
//Delete old objects array
free(objs);
//Construct left and right branches
BuildBranch(height - 1, leftList, leftCount);
BuildBranch(height - 1, rightList, rightCount);
//Build this node
tree[thisNodeIndex] = KDTreeNode();
tree[thisNodeIndex].InitializeInterior(axis, splitLoc, nodeIndex - 1);
return;
}
EDIT:
Ok well I tried to replace the malloc/free with new/delete and that had no effect on the speed. I also found that it is only the free() on xLeft/xRight arrays that seem to affect the execution time significantly. I was able to eliminate the problem by moving the free() calls to after the recursive calls, although I do not know why this is making a difference because I don't see anywhere that these arrays are used after the original location for free(). As for why I am using malloc... some portions of this program use cache aligned memory, so I had been using _aligned_malloc. Although there probably is a way to get new to cache align, this is the only way I know to do it.
Is it possible that you are linking against a debug version of the runtime library that is doing something extra in free() like filling the memory with a garbage value? I have seen this behavior when you link against overly aggressive memory debugging libraries. The code that you have posted does not look strange. I would be interested to know what would happen if you replaced the arrays with std::vector or std::deque though. Vector should have behavior quite similar to the arrays and Deque may actually improve the speed a little if the arrays are large because the memory manager will not have to guarantee contiguous space.
If your program is doing all of the free()ing on exit, then you might as well just skip the calls. The entire process heap is freed when your app exits.
Edit: ----
Ok, now that the code is posted, it appears to me that you aren't just freeing on exit, so you should definitely try and figure out if this is a weird symptom of a bug, or just a costly implementation of free(). Instead of removing the free() calls, time how long it takes to execute them. Is the heap manager really using up the whole 19 seconds?
I do see several places where multiple allocations have the same scope and lifetime. You could turn these into a single malloc/free call, although that would make the code less clear and harder to maintain. So you have to ask yourself, how much does that 20 seconds matter?
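One way to time the free() calls, as a rough sketch. elapsed_us, timed_free, and the global counter g_free_us are hypothetical names; the idea is to swap free(p) for timed_free(p) in BuildBranch and print the total once the tree is built.

```cpp
#include <chrono>
#include <cstdlib>

// Hypothetical helper: runs fn once and returns the elapsed wall-clock
// time in microseconds.
template <typename Fn>
long long elapsed_us(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Running total of time spent inside free(). A global keeps the sketch
// short; it is not thread-safe.
long long g_free_us = 0;

// Drop-in replacement for free(p) that adds its duration to g_free_us.
void timed_free(void* p) {
    g_free_us += elapsed_us([p] { std::free(p); });
}
```

If g_free_us ends up far below the 19 seconds, the time is going somewhere other than the heap manager itself.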
Probably just the behavior of the heap manager your CRT uses. It's probably updating free lists, or some other internal structure to manage memory.
You probably should reexamine how your program allocates and uses memory if your bottleneck is here.
Having had a look at the code one big thing that comes to my mind is this - mixture of malloc(...), new(...), delete(...), free(...)
BoundingBox* tempBox = new BoundingBox();
// ....
//Delete the bounding box
delete tempBox;
yet in other places you have
Bin* xLeft = (Bin*)malloc(dnObjects * sizeof(Bin));
// ....
free(xMins);
In short, you are mixing C++'s new and delete with C's malloc(...) and free(...). After all, this is C++, so a question for you here...
Why did you use malloc(...) and free(...), which come from C, in the middle of this C++ code? The repercussion I could see here is that the C++ runtime may handle memory allocation differently from C, given its object-oriented nature.
Having said this, your best bet is:
Replace all calls to malloc with new (new[] for arrays).
Replace all calls to free with delete (delete[] for arrays).
Rerun the program and see if that makes a difference. Can you confirm this?
Hope this helps,
Best regards,
Tom.
+1 to malloc/free making my eyes hurt in C++. Ignoring that for a second and looking at the code, three ideas:
Roll up your malloc calls to one large malloc and free (for the x/y/left/right/etc structures) instead of 12. Set the pointers into this large buffer as appropriate.
Still talking about the x/y/left/right variables: Employ a small stack-based buffer that you can use when the number of objects is small. When the number of objects is large, allocate dynamically. When it is not, just set your pointer to the local stack buffer. This can avoid dynamic memory management altogether for small inputs.
Right now, your "object" list is dynamically allocated, freed, and reallocated with each recursive call (!!). This is confusing because ownership isn't clear, but it's also a performance issue. Consider reworking the code so that only one list of "objects" is ever used.
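The first idea, rolling several allocations into one, could be sketched like this for the six split arrays. SplitArrays is a hypothetical helper, and a std::vector replaces malloc/free so that the memory is released automatically when the object goes out of scope.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical holder for the six split arrays from BuildBranch: one
// allocation, six pointers into it.
struct SplitArrays {
    std::vector<float> storage;  // released automatically at end of scope
    float* xMins;
    float* yMins;
    float* zMins;
    float* xMaxs;
    float* yMaxs;
    float* zMaxs;

    explicit SplitArrays(std::size_t nObjects)
        : storage(6 * nObjects),
          xMins(storage.data()),
          yMins(xMins + nObjects),
          zMins(yMins + nObjects),
          xMaxs(zMins + nObjects),
          yMaxs(xMaxs + nObjects),
          zMaxs(yMaxs + nObjects) {}

    // The pointers refer to this object's storage, so copying is disabled.
    SplitArrays(const SplitArrays&) = delete;
    SplitArrays& operator=(const SplitArrays&) = delete;
};
```

Inside BuildBranch this would replace six malloc calls and six free calls with a single local object, for example SplitArrays splits(nObjects); followed by splits.xMins[index] = tempBox->x0; and so on.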
C++ stores some extra information when you allocate using new, like the type of the object or the number of characters (in the case of an array), etc. If you are using free, it could be a fragmentation problem where you are actually deleting only the chunks of data in between but not freeing the actual information stored by new. Just a thought.
When you corrupt the heap, it often becomes very slow. Try to run it in debug mode with debug version of your runtime as well.
It could be poor locality of reference for your code. For example, I see the following:
//Allocate memory for split options
float* xMins = (float*)malloc(nObjects * sizeof(float));
float* yMins = (float*)malloc(nObjects * sizeof(float));
float* zMins = (float*)malloc(nObjects * sizeof(float));
float* xMaxs = (float*)malloc(nObjects * sizeof(float));
float* yMaxs = (float*)malloc(nObjects * sizeof(float));
float* zMaxs = (float*)malloc(nObjects * sizeof(float));
...
free(xMins);
free(xMaxs);
free(yMins);
free(yMaxs);
free(zMins);
free(zMaxs);
Now, assuming that the allocations proceed basically linearly, then free(xMaxs); may need to dereference memory that was allocated some number of pages away from xMins (which was just dereferenced during free(xMins);), so you might need to swap in a page from the backing store in order to perform the free (which causes a huge slowdown in execution when that happens). Re-ordering the free()'s to match the allocation order could help... In this case, that'd mean
free(xMins);
free(yMins);
free(zMins);
free(xMaxs);
free(yMaxs);
free(zMaxs);
It sounds like you are running your program from a debugger in Windows, which by default causes a special debug heap to be used, which dramatically slows down memory deallocations. This applies even to non-debug builds, as long as they are launched from a debugger (such as Visual Studio). You should be able to disable this behavior by setting the environment variable _NO_DEBUG_HEAP=1 before running your program (I recommend setting it in the project configuration settings rather than in the system settings, if possible).
You didn't describe anything about your programming environment in the original question, however, so I had to make certain assumptions about it that might be wrong. If you're not running your program under Windows, for example, then my answer doesn't apply and I have no idea what the cause of your problem might be.