I'm writing a small ray tracer using bounding volume hierarchies to accelerate ray tracing.
Long story short, I have a binary tree and I might need to visit multiple leaves.
Currently I have a node with two children, left and right; during travel(), if some condition holds (in this example intersect()), the children are visited:
class BoundingBoxNode{
BoundingBoxNode* left, *right;
void travel(param &p);
inline bool intersect(param &p){...};
};
void BoundingBoxNode::travel(param &p){
if(this->intersect(p)){
if(left)
left->travel(p);
if(right)
right->travel(p);
}
}
This approach uses recursive method calls. However, I need to optimize this code as much as possible, and according to the Optimization Reference Manual for IA-32, function calls nested deeper than 16 levels can be very expensive, so I would like to do this using a while loop instead of recursive calls.
But I do NOT wish to do dynamic heap allocations, as these are expensive. So I was thinking that maybe I could use the fact that every time a while loop starts over, the stack will be in the same position.
In the following very ugly hack I rely on alloca() to always allocate the same address:
class BoundingBoxNode{
BoundingBoxNode* left, right;
inline void travel(param &p){
int stack_size = 0;
BoundingBoxNode* current = this;
while(stack_size >= 0){
BoundingBoxNode* stack = alloca(stack_size * 4 + 2*4);
if(current->intersect(p)){
if(current->left){
stack[stack_size] = current->left;
stack_size++;
}
if(current->right){
stack[stack_size] = current->right;
stack_size++;
}
}
stack_size--;
current = stack[stack_size];
}
};
inline bool intersect(param &p){...};
};
However surprising it may seem this approach does fail :)
But it does work as long as the stack is smaller than 4 or 5... I'm also quite confident that this approach is possible, I just really think I need some help implementing it correctly.
So how can I manipulate the stack manually from C++? Is it possible to use some compiler extension? Or must I do this in assembler, and if so, how do I write assembler that can be compiled with both GCC and ICC?
I hope somebody can help me... I don't need a perfect solution, just a hack, if it works it's good enough for this purpose :)
Regards Jonas Finnemann Jensen
So, you've got a recursive function that you want to convert to a loop. You correctly work out that your function is not tail-recursive, so you have to implement it with a stack.
Now, why are you worried about the number of times that you allocate your "scratch space" stack? Is this not done once per traversal? -- if not then pass the scratch area in to the traverse function itself so it can be allocated once and then re-used for each traversal.
If the stack is small enough to fit in the cache it will stay hot and the fact that it isn't on the real C++ stack won't matter.
Once you've done all of that, profile it both ways and see if it made any difference. Keep the faster version.
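To make the "pass the scratch area in" idea concrete, here is a minimal sketch. The names are assumptions, not taken from the question: `Node` stands in for BoundingBoxNode, the `hit` flag replaces the result of intersect(p), and `travel` returns a count only so the effect is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for BoundingBoxNode; `hit` replaces intersect(p).
struct Node {
    Node* left = nullptr;
    Node* right = nullptr;
    bool hit = true;
};

// Iterative traversal using caller-owned scratch storage. Returns how many
// nodes passed the (stand-in) intersection test.
std::size_t travel(Node* root, std::vector<Node*>& scratch) {
    assert(root != nullptr);
    scratch.clear();          // size becomes 0, the capacity is kept
    scratch.push_back(root);
    std::size_t visited = 0;
    while (!scratch.empty()) {
        Node* current = scratch.back();
        scratch.pop_back();
        if (!current->hit) continue;   // prune this subtree
        ++visited;
        // Push right first so that left is popped (visited) first.
        if (current->right) scratch.push_back(current->right);
        if (current->left)  scratch.push_back(current->left);
    }
    return visited;
}
```

The caller creates the vector once (optionally with reserve) and hands the same object to every traversal, so after the first few rays no allocation should happen inside the loop.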
Stack allocations cannot be resized.
In your example, it isn't immediately obvious which data you need to allocate - besides the call stack itself. You could basically hold the current path in a vector preallocated to the maximum depth. The loop gets ugly, but that's life...
If you need many small allocations that can be released as a whole (after the algorithm completes), use a contiguous pool for your allocations.
If you know an upper bound for the required memory, the allocation is just a pointer increment:
#include <cstddef>
#include <new>
#include <vector>

class CPool
{
    std::vector<char> m_data;
    std::size_t m_head;
public:
    explicit CPool(std::size_t size) : m_data(size), m_head(0) {}
    void * Alloc(std::size_t size)
    {
        // Not enough room left in the pool.
        if (m_data.size() - m_head < size)
            throw std::bad_alloc();
        void * result = &(m_data[m_head]);
        m_head += size;
        return result;
    }
    void Free(void * p) {} // free is free ;)
};
If you don't have an upper bound for the total size, use a "pool on a rope", i.e. when the big chunk of memory runs out, get a new one, and put these chunks in a list.
You don't need the stack, you just need a stack. You can probably use a std::stack<BoundingBoxNode* >, if I look at your code.
The C++ Standard provides no means of manipulating the stack - it doesn't even require that there be a stack. Have you actually measured the performance of your code using dynamic allocation?
The fact that it works with small stack sizes is probably a coincidence. You'd have to maintain multiple stacks and copy between them. You're never guaranteed that successive calls to alloca will return the same address.
Best approach is probably a fixed size for the stack, with an assert to catch overflows. Or you could determine the max stack size from the tree depth on construction and dynamically allocate a stack that will be used for every traversal... assuming you're not traversing on multiple threads, at least.
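A minimal sketch of the fixed-size variant, under some assumptions: `Node` and its `hit` flag are hypothetical stand-ins for BoundingBoxNode and intersect(p), and `kMaxStack` is a made-up bound. Since each iteration pops one entry and pushes at most two, the stack should never hold more than about depth + 1 entries, so 64 slots would cover a tree roughly 63 levels deep.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for BoundingBoxNode; `hit` replaces intersect(p).
struct Node {
    Node* left = nullptr;
    Node* right = nullptr;
    bool hit = true;
};

constexpr std::size_t kMaxStack = 64;  // assumed upper bound, tune to your tree

// Iterative traversal with a fixed-size array on the call stack.
// Returns how many nodes passed the (stand-in) intersection test.
std::size_t travel_fixed(Node* root) {
    assert(root != nullptr);
    Node* stack[kMaxStack];
    std::size_t top = 0;
    stack[top++] = root;
    std::size_t visited = 0;
    while (top > 0) {
        Node* current = stack[--top];
        if (!current->hit) continue;   // prune this subtree
        ++visited;
        // Catch overflow in debug builds before pushing up to two children.
        assert(top + 2 <= kMaxStack && "traversal stack overflow");
        if (current->right) stack[top++] = current->right;
        if (current->left)  stack[top++] = current->left;
    }
    return visited;
}
```

Because the array is a plain local variable, this version is also safe to call from several threads at once, which the shared pre-allocated stack is not.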
Since alloca allocations are cumulative, I suggest you do a first alloca to store the "this" pointer, thus becoming the "base" of the stack, keep track of how many elements your stack can hold and allocate only the size needed:
inline void travel(param &p){
    BoundingBoxNode** stack = (BoundingBoxNode**)alloca(sizeof(BoundingBoxNode*)*3);
    int stack_size = 3, stack_idx = 0;
    stack[stack_idx] = this;
    do {
        BoundingBoxNode* current = stack[stack_idx];
        if( current->intersect(p)){
            int need = (current->left ? 1 : 0) + (current->right ? 1 : 0);
            if ( stack_size - stack_idx < need ) {
                // Let us simplify and always allocate enough for two
                alloca(sizeof(BoundingBoxNode*)*2);
                stack_size += 2;
            }
            if(current->left){
                stack[stack_idx++] = current->left;
            }
            if(current->right){
                stack[stack_idx++] = current->right;
            }
        }
        stack_idx--;
    } while(stack_idx >= 0);
}
From your question, it appears there is a lot that still needs to be learned.
The most important thing to learn is: don't assume anything about performance without first measuring your runtime execution and analysing the results to determine exactly where the bottlenecks to performance are.
The function 'alloca' allocates a chunk of memory from the stack; the stack size is increased (by moving the stack pointer). Each call to 'alloca' creates a new chunk of memory until you run out of stack space. It does not re-use previously allocated memory, so the data that was pointed to by 'stack' is lost when you allocate another chunk of memory and assign it to 'stack'. This is a memory leak. In this case, the memory is automatically freed when the function exits, so it's not a serious leak, but you've lost the data.
I would leave the "Optimization Reference Manual for IA-32" well alone. It assumes you know exactly how the CPU works. Let the compiler worry about optimisations; it will do a good enough job for what you're doing, and the compiler writers hopefully know that reference inside out. With modern PCs, the common bottleneck to performance is usually memory bandwidth.
I believe the '16 deep' function calls being expensive is to do with how the CPU is managing its stack and is a guideline only. The CPU keeps the top of the stack in onboard cache, when the cache is full, the bottom of the stack is paged to RAM which is where the performance starts to decrease. Functions with lots of arguments won't nest as deeply as functions with no arguments. And it's not just function calls, it's also local variables and memory allocated using alloca. In fact, using alloca is probably a performance hit since the CPU will be designed to optimise its stack for common use cases - a few parameters and a few local variables. Stray from the common case and performance drops off.
Try using std::stack as MSalters has suggested above. Get that working.
Use a C++ data structure. You are using C++ after all. A std::vector<> can be pre-allocated in chunks for an amortized cost of pretty much nil. And it's safe too (as you have noticed, using the normal stack is not, especially when you're using threads).
And no, it's not expensive. It's as fast as a stack allocation.
std::list<> yes, that will be expensive. But that's because you can't pre-allocate that. std::vector<> is chunk-allocated by default.
In a function I need to store some integers in a vector. The function is called many times. I know that there are fewer than 10 of them, but the number varies for each call of the function. Which choice gives better performance?
For example, I found that this:
std::vector<int> list(10);
std::vector<int>::iterator it=list.begin();
unsigned int num_of_elements_stored = 0;
for ( ... iterate on some structures ... ){
    if (... a specific condition ...){
        *it= integer from structures ;
        it++;
        num_of_elements_stored++;
    }
}
is slower than:
std::vector<int> list;
unsigned int num_of_elements_stored(0);
for ( ... iterate on some structures ... ){
if (... a specific condition ...){
list.push_back( integer from structures );
}
}
num_of_elements_stored=list.size();
I'm going to go down an extremely uncool route here. At the risk of being crucified, I would suggest that std::vector isn't so great here. An exception would be if you get lucky with the memory allocator and get that temporal locality through the allocator that creating and destroying a bunch of teeny vectors normally wouldn't provide.
Wait!
Before people kill me, I want to say that vector is awesome, generally speaking, as one of the most well-rounded data structures available. But when you're looking at a hotspot like this (hopefully with a profiler) as a result of creating a bunch of teeny vectors repeatedly in a tight loop, that's where this kind of straightforward usage of vector can bite you.
The trouble is that it's a heap-allocated structure (basically a dynamic array), and when we're dealing with a boatload of teeny arrays like this, we really want to use that often-cached memory at the top of the stack that's so cheap to allocate/free when we can.
One way to mitigate this is to reuse the same vector across repeated calls. Store it in the outside caller function's scope and pass it in by reference, clear it, do your push_backs, rinse and repeat. It's worth noting that clear doesn't free any memory in the vector, so it keeps that former capacity around (useful here when we want to reuse the same memory and play to temporal locality).
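As a rough illustration of that reuse pattern (the function name and the "keep even values" condition are made up for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worker: collects the even values from `input` into `out`.
// `out` is owned by the caller and reused from call to call.
void collect_even(const std::vector<int>& input, std::vector<int>& out) {
    assert(&input != &out);  // clearing `out` must not wipe the input
    out.clear();             // size becomes 0, the allocated buffer is kept
    for (int v : input) {
        if (v % 2 == 0) out.push_back(v);
    }
}
```

After the first call has grown the buffer, later calls that need no more room than before should not touch the allocator at all.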
But here we can play to that stack. As a simplified example (using C-style code that isn't very kosher in C++ and doesn't even bother with exception-safety, but is easier to illustrate):
int stack_mem[32];
int num = 0;
int cap = 32;
int* ptr = stack_mem;
for ( ... iterate on some structures ... )
{
if (... a specific condition ...)
{
if (num == cap)
{
cap *= 2;
int* new_ptr = static_cast<int*>(malloc(cap * sizeof(int)));
memcpy(new_ptr, ptr, num * sizeof(int));
if (ptr != stack_mem)
free(ptr);
ptr = new_ptr;
}
ptr[num++] = your_int;
}
}
if (ptr != stack_mem)
free(ptr);
Of course if you use something like this, you should properly wrap it in a reusable class template that does bounds-checking, doesn't use memcpy, has exception-safety, a formal push_back method, emplace_back, copy ctor, move ctor, swap, possibly a fill ctor, range ctor, erase, range erase, insert, range insert, size, empty, iterators/begin/end, uses placement new to avoid requiring copy assignment or default ctor, etc.
The solution uses the stack when N <= 32 (can use a different number suited for your common-case needs) and then switches to heap when exceeded. This allows it to handle your common case scenarios efficiently but also not just go kablooey in those rare case scenarios when N might be huge in some pathological case. That makes it somewhat comparable to variable-length arrays in C (something I actually wish we had in C++, at least until std::dynarray is available) but without the stack overflow tendencies VLAs could have since it switches to heap in rare case scenarios.
I applied all these standard-compliant formalities with a structure based on this idea with a class template that accepts <T, FixedN>, and now use it almost as much as vector since I work with so many cases like this with teeny arrays being repeatedly created that should, in the vast majority of common cases, fit on the stack (but always with those ultra rare exceptional possibilities). It wiped many of the memory-related profiler hotspots I was getting off the map.
... but applying this basic idea might give you quite a boost. You can apply that kind of effort above of wrapping it into a safe container preserving C++ object semantics if it pays off in your measurements, and I think it should quite a bit in your case.
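For what it's worth, a stripped-down sketch of such a <T, FixedN> wrapper could look like the following. This is not the class described above, just an illustration: the name `SmallVec` and its members are hypothetical, copying is disabled, and it only accepts trivially copyable types so that memcpy stays valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical minimal small-buffer array. The first FixedN elements live in
// an in-object buffer (on the stack for a local variable); once that is
// exhausted the contents move to the heap.
template <typename T, std::size_t FixedN>
class SmallVec {
    static_assert(FixedN > 0, "need at least one in-object slot");
    static_assert(std::is_trivially_copyable<T>::value,
                  "this sketch only supports trivially copyable types");
public:
    SmallVec() : data_(fixed_), size_(0), cap_(FixedN) {}
    ~SmallVec() { if (data_ != fixed_) std::free(data_); }
    SmallVec(const SmallVec&) = delete;             // copying left out here
    SmallVec& operator=(const SmallVec&) = delete;

    void push_back(const T& value) {
        T copy = value;               // `value` may refer into our own buffer
        if (size_ == cap_) grow();
        data_[size_++] = copy;
    }
    T& operator[](std::size_t i) { assert(i < size_); return data_[i]; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return data_ != fixed_; }

private:
    void grow() {
        const std::size_t new_cap = cap_ * 2;
        T* fresh = static_cast<T*>(std::malloc(new_cap * sizeof(T)));
        if (!fresh) throw std::bad_alloc();
        std::memcpy(fresh, data_, size_ * sizeof(T));
        if (data_ != fixed_) std::free(data_);
        data_ = fresh;
        cap_ = new_cap;
    }

    T fixed_[FixedN];   // in-object storage for the common case
    T* data_;           // points at fixed_ or at a heap block
    std::size_t size_;
    std::size_t cap_;
};
```

A production version would still need everything listed earlier (iterators, copy/move, placement new for non-trivial types, and so on).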
I would probably go with sort of a middle ground:
std::vector<int> list;
list.reserve(10);
...and the rest could be pretty much like your second version. To be honest, however, it's probably open to question whether this will really make a lot of difference though.
If you use a static vector, it will be allocated only once.
The first example is slower because it allocates and destroys the vector on each call.
I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand optimized assembly).
My systems primarily consist of a bunch of 3D points (atomic systems) and so the code I write does lots of distance comparisons, nearest-neighbor searches, and other types of sorting and comparisons. These are large, million or billion point systems, and the naive O(n^2) nested for loops just won't cut it.
It would be easiest for me to just use a std::vector to hold point coordinates. And at first I thought it would probably be about as fast as an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.
I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, and so I have no clue what std::vector is doing behind the scenes, and I have no clue if push_back is going to perform some new memory allocation every time I call it, or what other "traps" I could fall into that make my code very slow.
An array is simple though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no blackbox unknowns that I may have to suffer through. Yet so often I see people criticized for using arrays over vectors on the internet that I can't help but wonder if I am missing some more information.
EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors"? Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built off an array or vector.
Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
It all depends on how you implement your algorithms. std::vector is such a general container concept that it gives us flexibility, but it also leaves us with the freedom and responsibility of structuring the implementation of the algorithm deliberately. Most of the efficiency overhead that we will observe from std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with value X, and when you use that, the vector is just as fast as an array.
I ran the std::vector vs. array tests described here:
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>
class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}
~TestTimer()
{
using namespace std;
using namespace boost;
posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;
cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}
private:
std::string name;
boost::posix_time::ptime start;
};
struct Pixel
{
Pixel()
{
}
Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}
unsigned char r, g, b;
};
void UseVector()
{
TestTimer t("UseVector");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
void UseVectorPushBack()
{
TestTimer t("UseVectorPushBack");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.reserve(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}
void UseArray()
{
TestTimer t("UseArray");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
free(pixels);
}
}
void UseVectorCtor()
{
TestTimer t("UseConstructor");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
}
}
int main()
{
TestTimer t1("The whole thing");
UseArray();
UseVector();
UseVectorCtor();
UseVectorPushBack();
return 0;
}
and here are results (compiled on Ubuntu amd64 with g++ -O3):
UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds
Clearly push_back wasn't a good choice here, and using the constructor is still more than twice as slow as the array.
Now, providing Pixel with an empty copy constructor:
Pixel(const Pixel&) {}
gives us the following results:
UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds
So in summary: re-think your algorithm; otherwise, perhaps resort to a custom wrapper around new[]/delete[]. In any case, the STL implementation isn't slower for some unknown reason; it just does exactly what you ask, hoping you know better.
If you have just started with vectors, it might be surprising how they behave. For example, this code:
class U{
int i_;
public:
U(){}
U(int i) : i_(i) {cout << "consting " << i_ << endl;}
U(const U& ot) : i_(ot.i_) {cout << "copying " << i_ << endl;}
};
int main(int argc, char** argv)
{
std::vector<U> arr(2,U(3));
arr.resize(4);
return 0;
}
results in:
consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3
Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.
Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you know it in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.
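One way to see this for yourself is to count how often the vector's buffer moves while you append. The helper below is a hypothetical sketch written for this purpose, not a library function:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends `count` ints to the (initially empty) vector `v` and returns how
// many times the underlying buffer changed address, i.e. how many
// (re)allocations took place while appending.
std::size_t count_reallocations(std::vector<int>& v, std::size_t count) {
    assert(v.empty());
    std::size_t moves = 0;
    const int* previous = v.data();
    for (std::size_t i = 0; i < count; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.data() != previous) {   // buffer was (re)allocated
            ++moves;
            previous = v.data();
        }
    }
    return moves;
}
```

Called on a vector that had reserve(count) applied beforehand, this should report zero. Without reserve it reports a small number that grows only slowly with count, because the capacity grows geometrically rather than one element at a time.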
Vectors are usually preferred over arrays because they allow bound-checking when accessing elements with .at(). That means accessing indices outside of the vector doesn't cause undefined behavior like in an array. This bound-checking does however require additional CPU cycles. When you use the []-operator to access elements, no bound-checking is done and access should be as fast as an array. This however risks undefined behavior when your code is buggy.
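A small sketch of the difference between the two access styles (both helper names are made up for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if v.at(index) rejects the index by throwing std::out_of_range.
bool at_rejects(const std::vector<int>& v, std::size_t index) {
    try {
        (void)v.at(index);   // checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Unchecked access: the caller must guarantee index < v.size(). The assert
// only guards debug builds; in release builds a bad index is undefined
// behavior, exactly as with a raw array.
int unchecked_get(const std::vector<int>& v, std::size_t index) {
    assert(index < v.size());
    return v[index];
}
```

In a hot loop you would typically validate the range once up front and then use operator[] inside the loop.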
People who invented STL, and then made it into the C++ standard library, are expletive deleted smart. Don't even let yourself imagine for one little moment you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran though).
With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
Having said all that, go with arrays if you find them easier to understand.
I am not really advising you to go either for arrays or vectors, because I think that for your needs they may not be totally fit.
You need to be able to organize your data efficiently, so that queries would not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other.
If your dataset is static, then you can do that sorting offline, and make your array nice and tidy to be loaded up in memory at your application start-up time, and either vector or array would work (provided you do the reserve call up front for vector, since the default allocation growth scheme typically doubles the size of the underlying array whenever it gets full, and you wouldn't want to use up 16 GB of memory for only 9 GB worth of data).
But if your dataset is dynamic, it will be difficult to do efficient inserts in your set with a vector or an array. Recall that each insert within the array shifts all the successor elements by one place. Of course, an index, like the kd-tree you mention, will help by avoiding a full scan of the array, but if the selected points are scattered across the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.
My solution would be to cut the array into pages (either in a linked list or indexed by an array) and store data in the pages. That way, it would be possible to group relevant elements together, while still retaining the speed of contiguous memory access within pages. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves room to insert related elements, or makes shifts really cheap operations.
Note that if pages are always full (except for the last one), you still have to shift every single one of them in case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to contain the supplementary element.
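A toy sketch of that insert rule, under several assumptions: the class name `PagedInts`, the tiny page capacity and the use of int elements are all made up for illustration, and pages are kept in a std::vector rather than a linked list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical paged sequence of ints. Each page holds at most kPageCap
// elements and may be partially filled.
class PagedInts {
public:
    static constexpr std::size_t kPageCap = 4;  // tiny, for illustration only

    // Inserts `value` before position `offset` of page `page`. Only that one
    // page is shifted; if it overflows, its last element moves into a new
    // page placed right after it.
    void insert(std::size_t page, std::size_t offset, int value) {
        if (pages_.empty()) pages_.emplace_back();
        assert(page < pages_.size());
        assert(offset <= pages_[page].size());
        std::vector<int>& target = pages_[page];
        target.insert(target.begin() + offset, value);
        if (target.size() > kPageCap) {
            const int overflow = target.back();
            target.pop_back();
            pages_.insert(pages_.begin() + page + 1,
                          std::vector<int>{overflow});
        }
    }

    // Returns all elements in sequence order (for inspection).
    std::vector<int> flatten() const {
        std::vector<int> all;
        for (const auto& p : pages_) all.insert(all.end(), p.begin(), p.end());
        return all;
    }

    std::size_t page_count() const { return pages_.size(); }

private:
    std::vector<std::vector<int>> pages_;
};
```

Note that inserting a page into the page table still shifts the page handles themselves, and any index storing (page, offset) pairs for later pages would need updating. A linked list of pages avoids the former at the cost of slower page lookup.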
Some things to keep in mind:
Array and vector allocations have upper limits, which are OS-dependent (these limits might be different).
On my 32-bit system, the maximum allowed allocation for a vector of 3D points is around 180 million entries, so for larger datasets, one would have to find a different solution. Granted, on a 64-bit OS, that amount might be significantly larger (on 32-bit Windows, the maximum memory space for a process is 2 GB; I think they added some tricks in more advanced versions of the OS to extend that amount). Admittedly, memory will be even more problematic for solutions like mine.
Resizing a vector requires allocating the new size on the heap and copying the elements from the old memory chunk to the new one.
So for adding just one element to the sequence, you will need twice the memory during the resizing. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on Unix systems, for instance, although as far as I know that function doesn't make any guarantee that the same memory chunk will be reused). The problem might be avoided with vector as well if a custom allocator that uses the same functions is used.
C++ doesn't make any assumption about the underlying memory architecture.
vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and wrap that memory chunk with an interface to access it. But C++ doesn't know how the OS is managing that memory. In most modern OS, that memory is actually cut in pages, which are mapped in and out of physical memory. So my solution is somehow to reproduce that mechanism at the process level. In order to make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS dependent code will be necessary. On the other hand, this is not a concern at all for a vector or array based solution.
So in essence my answer is concerned with the efficiency of updating the dataset in a manner which will favor clustering points close to each other. It supposes that such clustering is possible. If that is not the case, then just pushing a new point at the end of the dataset would be perfectly alright.
Although I do not know the exact implementation of std::vector, most list systems like this are slower than arrays because they allocate memory when they are resized, normally doubling the current capacity, although this is not always the case.
So if the vector contains 16 items and you add another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and update the vector. You can get some performance improvements by constructing the std::vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
For operations that are common between vectors and arrays (hence not push_back or pop_back, since arrays are fixed in size), they perform exactly the same, because, by specification, they are the same.
vector access methods are so trivial that the simplest compiler optimization will wipe them out.
If you know the size of a vector in advance, just construct it by specifying the size or just call resize, and you will get the same as you can get with a new[].
If you don't know the size, but you know how much you will need to grow, just call reserve, and you get no penalty on push_back, since all the required memory is already allocated.
In any case, reallocations are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that reallocations of big amounts become less and less frequent.
Also, in case you know everything about sizes, and you need no dynamic memory and want the same vector interface, consider also std::array.
Sounds like you need gigs of RAM so you're not paging. I tend to go along with @Philipp's answer, because you really really want to make sure it's not re-allocating under the hood
but
what's this about a tree that needs rebalancing?
and you're even thinking about compiler optimization?
Please take a crash course in how to optimize software.
I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly.
If that translates to days of compute time, maybe it'll get interesting.
And no compiler optimizer can fix these things for you.
If you're academically inclined, this post goes into more detail.
With my current project, I did my best to adhere to the principle that premature optimization is the root of all evil. However, now the code is tested, and it is time for optimization. I did some profiling, and it turns out my code spends almost 20% of its time in a function where it finds all possible children, puts them in a vector, and returns them. As a note, I am optimizing for speed, memory limitations are not a factor.
Right now the function looks like this:
void Board::GetBoardChildren(std::vector<Board> &children)
{
children.reserve(open_columns_.size()); // only reserve max number of children
UpdateOpenColumns();
for (auto i : open_columns_)
{
short position_adding_to = ColumnToPosition(i);
MakeMove(position_adding_to); // make the possible move
children.push_back(*this); // add to vector of children
ReverseMove(); // undo move
}
}
According to the profiling, my code spends about 40% of the time just on the line children.push_back(*this); I am calling the function like this:
std::vector<Board> current_children;
current_state.GetBoardChildren(current_children);
I was thinking since the maximum number of possible children is small (7), would it be better to just use an array? Or is there not a ton I can do to optimize this function?
From your responses to my comments, it seems very likely that most of the time is spent copying the board in
children.push_back(*this);
You need to find a way to avoid making all those copies, or a way to make them cheaper.
Simply changing the vector into an array or a list will likely not make any difference to performance.
The most important question is: Do you really need all States at once inside current_state?
If you just iterate over them once or twice in the default order, then there is no need for a vector, just generate them on demand.
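Generating children on demand could be sketched like this. The `Board` below is a deliberately tiny, hypothetical stand-in (it only records the columns played), and `for_each_child` is a made-up name. The point is only the make-move / visit / undo-move shape, which never copies a Board:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's Board.
class Board {
public:
    explicit Board(int columns) : columns_(columns) { assert(columns > 0); }

    // Calls visit(board) once per possible move, with the move applied to
    // this very object, and undoes the move afterwards.
    template <typename Visitor>
    void for_each_child(Visitor visit) {
        for (int col = 0; col < columns_; ++col) {
            MakeMove(col);
            visit(static_cast<const Board&>(*this));
            ReverseMove();
        }
    }

    const std::vector<int>& moves() const { return moves_; }

private:
    void MakeMove(int col) { moves_.push_back(col); }
    void ReverseMove() { moves_.pop_back(); }

    int columns_;
    std::vector<int> moves_;
};
```

The visitor sees each child state only while the callback runs, so this fits searches that evaluate or recurse into one child at a time. If a child has to outlive the callback, it still needs to be copied (or stored as a difference).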
If you really need it, here is the next step. Since Board is expensive to copy, a DifferenceBoard that keeps track of only the difference may be better. Pseudocode:
struct DifferenceBoard { // or maybe inherit from Board so that a DifferenceBoard
                         // can be built from another DifferenceBoard
    Board *original;
    int fromposition, toposition;
    State state_at_position;
    State get(int y, int x) const {
        if ((x,y) == fromposition) return Empty;
        if ((x,y) == toposition ) return state_at_position;
        return original->get(y, x); // unchanged cells come from the original
    }
};
Dear all, I have implemented some functions and would like to ask about some basics, as I do not have a sound fundamental knowledge of C++. I hope you will be kind enough to tell me what a good approach would be, so that I can learn from you. (Please note, this is not homework, and I do not have any experts around me to ask.)
What I did is this: I read the input x, y, z point data (a data set of around 3 GB) from a file, then compute one single value for each point and store it inside a vector (result). That vector is then used in the next loop. After that, the vector will not be used anymore, and I need to get that memory back as it contains a huge data set. I think I can do this in two ways.
(1) By just initializing a vector and later erasing it (see code-1). (2) By allocating dynamic memory and then later de-allocating it (see code-2). I heard this de-allocation is inefficient, as de-allocation again costs memory, or maybe I misunderstood.
Q1)
I would like to know what would be the optimized way in terms of memory and efficiency.
Q2)
Also, I would like to know whether function return by reference is a good way of giving output. (Please look at code-3)
code-1
int main(){
    //read input data (my_data)
    vector<double> result;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
        // do some stuff and calculate a "double" value (say value)
        //using each point coordinate
        result.push_back(value);
        // do some other stuff
    }
    //loop over result and use each value for some other stuff
    for (int i=0; i<result.size(); i++){
        //do some stuff
    }
    //result will not be used anymore and thus erase data
    result.clear();
}
code-2
int main(){
    //read input data
    vector<double> *result = new vector<double>;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
        // do some stuff and calculate a "double" value (say value)
        //using each point coordinate
        result->push_back(value);
        // do some other stuff
    }
    //loop over result and use each value for some other stuff
    for (int i=0; i<result->size(); i++){
        //do some stuff
    }
    //de-allocate memory
    delete result;
    result = 0;
}
code-3
vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
{
    vector<Position3D> *points_at_grid_cutting = new vector<Position3D>;
    vector<Position3D>::iterator point;
    for (point=begin(); point!=end(); point++) {
        //do some stuff
    }
    return (*points_at_grid_cutting);
}
For such huge data sets I would avoid using std containers at all and make use of memory mapped files.
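As a rough sketch of what the memory-mapped approach could look like on a POSIX system: this assumes the points are stored as packed binary records of three doubles, which the question does not say (the real file may well be text). The Position3D layout and the function name sum_z_mapped are made up for illustration.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical record layout: three packed doubles per point.
struct Position3D { double x, y, z; };

// Maps the file read-only and sums z over every record without
// copying the data into a container. Returns false on any failure.
bool sum_z_mapped(const char* path, double& sum_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return false;
    }
    const std::size_t bytes = static_cast<std::size_t>(st.st_size);

    void* base = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return false;

    const Position3D* pts = static_cast<const Position3D*>(base);
    const std::size_t count = bytes / sizeof(Position3D);

    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += pts[i].z;  // pages are brought in on demand by the OS
    }

    munmap(base, bytes);
    sum_out = sum;
    return true;
}
```

The operating system pages the file in and out as needed, so the program never has to hold the full 3 GB in its own heap.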
If you prefer to go on with std::vector, note that vector::clear() alone does not release the allocated memory; swap with an empty temporary instead, e.g. std::vector<double>().swap(result).
erase will not free the memory used for the vector, and neither will clear. Both reduce the size but not the capacity, so the vector still holds enough memory for all those doubles.
The best way to make the memory available again is like your code-1, but let the vector go out of scope:
int main() {
    {
        vector<double> result;
        // populate result
        // use results for something
    }
    // do something else - the memory for the vector has been freed
}
Failing that, the idiomatic way to clear a vector and free the memory is:
vector<double>().swap(result);
This creates an empty temporary vector, then it exchanges the contents of that with result (so result is empty and has a small capacity, while the temporary has all the data and the large capacity). Finally, it destroys the temporary, taking the large buffer with it.
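To see the difference between clear() and the swap idiom, here is a small sketch. The helper name release_memory is hypothetical, introduced only to wrap the one-liner above.

```cpp
#include <cstddef>
#include <vector>

// Releases the buffer held by v using the swap idiom and returns
// the capacity v had before the release.
std::size_t release_memory(std::vector<double>& v) {
    const std::size_t old_capacity = v.capacity();
    std::vector<double>().swap(v);  // the temporary takes the buffer and is destroyed here
    return old_capacity;
}
```

Calling result.clear() beforehand leaves result.capacity() unchanged; after release_memory(result) the vector is empty and its capacity is whatever a default-constructed vector has, which is zero on the common implementations.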
Regarding code-3: it's not good style to return a dynamically-allocated object by reference, since it doesn't provide the caller with much of a reminder that they are responsible for freeing it. Often the best thing to do is return a local variable by value:
vector<Position3D> ReturnLabel(VoxelGrid grid, int segment) const
{
    vector<Position3D> points_at_grid_cutting;
    // do whatever to populate the vector
    return points_at_grid_cutting;
}
The reason is that provided the caller uses a call to this function as the initialization for their own vector, then something called "named return value optimization" kicks in, and ensures that although you're returning by value, no copy of the value is made.
A compiler that doesn't implement NRVO is a bad compiler, and will probably have all sorts of other surprising performance failures, but there are some cases where NRVO doesn't apply - most importantly when the value is assigned to a variable by the caller instead of used in initialization. There are three fixes for this:
1) C++11 introduces move semantics, which basically sort it out by ensuring that assignment from a temporary is cheap.
2) In C++03, the caller can play a trick called "swaptimization". Instead of:
vector<Position3D> foo;
// some other use of foo
foo = ReturnLabel();
write:
vector<Position3D> foo;
// some other use of foo
ReturnLabel().swap(foo);
3) You write a function with a more complicated signature, such as taking a vector by non-const reference and filling the values into that, or taking an OutputIterator as a template parameter. The latter also provides the caller with more flexibility, since they need not use a vector to store the results, they could use some other container, or even process them one at a time without storing the whole lot at once.
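A minimal sketch of the OutputIterator variant from option 3 might look like this. The Position3D struct, the function name copy_points_in_slab and the z-range filter are all stand-ins I made up, since the question does not show what ReturnLabel actually selects.

```cpp
#include <iterator>
#include <vector>

// Hypothetical stand-in for the question's point type.
struct Position3D { double x, y, z; };

// Writes every point whose z lies in [z_min, z_max) to `out`.
// The caller decides where the results go: a vector via back_inserter,
// another container, or an iterator that processes points one at a time.
template <typename OutputIt>
OutputIt copy_points_in_slab(const std::vector<Position3D>& points,
                             double z_min, double z_max, OutputIt out) {
    for (std::vector<Position3D>::const_iterator it = points.begin();
         it != points.end(); ++it) {
        if (it->z >= z_min && it->z < z_max) {
            *out = *it;
            ++out;
        }
    }
    return out;
}
```

A caller who wants a vector would write copy_points_in_slab(points, 1.0, 3.0, std::back_inserter(picked)); nothing is allocated by the function itself, so there is no ownership question to answer.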
Your code seems like the computed value from the first loop is only used context-insensitively in the second loop. In other words, once you have computed the double value in the first loop, you could act immediately on it, without any need to store all values at once.
If that's the case, you should implement it that way. No worries about large allocations, storage or anything. Better cache performance. Happiness.
vector<double> result;
for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
    // do some stuff and calculate a "double" value (say value)
    //using each point coordinate
    result.push_back(value);
If the "result" vector ends up holding thousands of values, repeated push_back calls will trigger many reallocations. It is best to set aside enough capacity up front with the reserve function:
vector<double> result;
result.reserve(someSuitableNumber);
(Constructing it as vector<double> result(someSuitableNumber, 0.0) would instead create that many zero elements, and push_back would append after them.)
This reduces the number of reallocations and may speed up your code further.
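To make the effect of reserve visible, here is a small sketch that counts how often the capacity changes while values are appended. The function name count_reallocations is made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Appends n values and counts how often the vector's capacity changed,
// i.e. how many reallocations push_back triggered.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<double> result;
    if (reserve_first) {
        result.reserve(n);  // one allocation up front
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = result.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        result.push_back(static_cast<double>(i));
        if (result.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = result.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set to true the count is zero; without it, the exact number depends on the library's growth factor, and each reallocation copies everything stored so far.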
Also, instead of:
vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
I would write:
void vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment, vector<Position3D> & myVec_out) const //myVec_out is populated inside func
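Your instinct to avoid copying is correct; filling a caller-supplied vector through a reference achieves that without returning a reference to a heap-allocated object that the caller must remember to delete.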
Destructors in C++ must not fail, therefore deallocation does not allocate memory, because memory can't be allocated with the no-throw guarantee.
As an aside: instead of looping multiple times, it is probably better to do the operations in an integrated manner. That is, instead of loading the whole dataset and then reducing the whole dataset, read the points in one by one and apply the reduction directly. Instead of
load_my_data()
for_each (p : my_data)
    result.push_back(p)
for_each (p : result)
    reduction.push_back (reduce (p))
just do
file f ("file")
while (f)
    Point p = read_point (f)
    reduction.push_back (reduce (p))
If you don't need to store those reductions, simply output them sequentially:
file f ("file")
while (f)
    Point p = read_point (f)
    cout << reduce (p)
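The pseudocode above could be turned into real C++ roughly as follows. This is only a sketch: it assumes the file holds whitespace-separated "x y z" triples, and both the reduce function (distance from the origin) and the name stream_and_accumulate are placeholders for whatever the real computation is.

```cpp
#include <cmath>
#include <cstddef>
#include <istream>
#include <sstream>

// Hypothetical stand-in for the question's point type.
struct Position3D { double x, y, z; };

// Placeholder per-point computation: distance from the origin.
double reduce(const Position3D& p) {
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
}

// Reads "x y z" triples one at a time and folds each reduced value
// into a running total, so only one point is in memory at any moment.
double stream_and_accumulate(std::istream& in, std::size_t& count_out) {
    double total = 0.0;
    count_out = 0;
    Position3D p;
    while (in >> p.x >> p.y >> p.z) {  // stops cleanly on EOF or bad input
        total += reduce(p);
        ++count_out;
    }
    return total;
}
```

Testing the extraction itself in the loop condition, rather than the stream state before reading, avoids processing a point from a read that failed. In the real program the std::istream would be a std::ifstream opened on the data file.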
code-1 will work fine and is almost the same as code-2, with no major advantages or disadvantages.
code-3: Somebody else should answer that, but I believe the difference between a pointer and a reference in this case would be marginal. I do prefer pointers though.
That being said, I think you might be approaching the optimization from the wrong angle. Do you really need all points to compute the output of a point in your first loop? Or can you rewrite your algorithm to read only one point, compute the value as you would in your first loop and then use it immediately the way you want to? Maybe not with single points, but with batches of points. That could potentially cut back on your memory requirements quite a bit with only a small increase in processing time.
RecursiveSort::RecursiveSort(int myArray[], int first, int arraySize)
{
    int smallest = first, j;
    if (smallest < arraySize)
    {
        smallest = first;
        for (j=first+1; j<arraySize; j++)
        {
            if (myArray[j] < myArray[smallest])
            {
                smallest = j;
            }
        }
        swap(myArray[first], myArray[smallest]);
        first++;
        RecursiveSort::RecursiveSort(myArray, first, arraySize);
    }
};
In my main() I call it as: RecursiveSort sort(myArray, 0, arraySize);
There is a stack overflow when arraySize > 4000, and the program crashes. Is it possible to call class destructors somewhere to prevent stack overflow? I have tried using "release" instead of "debug" (project properties > configuration manager > configuration pull-down menu). However this causes other issues when I try to integrate a "TimeStamp_Lib.lib" library which is used to measure how long the sort takes.
Any advice/suggestions would be greatly appreciated, thank you!
You should make this a free function, so no objects would be created. If, for some reason the algorithm has to run in a constructor of a class, you can call it from the constructor.
That said, unless the compiler optimizes the recursion into iteration itself, recursion is not suitable for algorithms that need to go that deep. To sort 4000 integers, your function recurses 4000 levels deep, which eats a lot of stack space. By comparison, a quick sort with well-balanced partitions recurses only about log2(4000) ≈ 12 levels deep.
There are no allocations in that function - your issue is not that you have allocated too much memory.
I'd imagine you are recursing too many times. That will require some form of algorithm change, or just a limit on how many items you can sort with this method. If you can change this in such a way that the compiler can optimise away the recursion, that would work (though it could be implementation specific). Even better, remove the recursion yourself and change it to a stack-based loop.
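For this particular algorithm (a selection sort) the recursion is a tail call, so a plain outer loop is enough and no explicit stack is needed. A possible sketch as a free function, with the name selection_sort chosen here for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Iterative selection sort: the outer loop replaces the recursive call,
// so stack usage stays constant no matter how large the array is.
void selection_sort(int values[], std::size_t size) {
    for (std::size_t first = 0; first + 1 < size; ++first) {
        std::size_t smallest = first;
        for (std::size_t j = first + 1; j < size; ++j) {
            if (values[j] < values[smallest]) {
                smallest = j;
            }
        }
        std::swap(values[first], values[smallest]);
    }
}
```

It would be called as selection_sort(myArray, arraySize). The running time is still quadratic, but arrays far larger than 4000 elements no longer threaten the stack.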
Yes, destructors can be manually invoked. However, they destroy an object; they don't free its memory. Recursion is easy to think in terms of, but it is not always practical when the depth grows with the input size. Try moving all your objects to a pool outside the stack, or better yet work in an iterative way.