Exit the entire recursion stack - c++

I'm calling a function fooA from main() that calls another function fooB, which is recursive.
When I wish to return, I keep using exit(1) to halt execution. What is the right way to exit when the recursion tree is deep?
Returning through the recursion stack may not help, because returning usually clears part of the solution I've built up, and I don't want that. I want to execute more code from main() afterwards.
I read that exceptions can be used; it would be nice if I could get a code snippet.

The goto statement won't work to hop from one function back to another; Nikos C. is correct that it wouldn't account for releasing the stack frames of each of the calls you've made, so when you got to the function you goto'ed to, the stack pointer would be pointing to the stack frame of the function you were just in... no, that just won't work. Similarly, you can't simply call (either directly, or indirectly via a function pointer) the function you want to end up in when your algorithm is done. You'd never get back to the context you were in prior to diving into your recursive algorithm. You could conceivably architect a system this way, but in essence each time you did this you'd "leak" what was currently on the stack (not quite the same as leaking heap memory, but a similar effect). And if you were deep into a highly recursive algorithm, that could be a lot of "leaked" stack space.
No, you need to somehow return back to the calling context. There are only three ways to do so in C++:
1. Exit each function in turn by returning from it to its caller, backing up through the call chain in an orderly fashion.
2. Throw an exception and catch it at the point right after you launched into your recursive algorithm (which automatically destroys any objects created by each function on the stack in an orderly fashion).
3. Use setjmp() & longjmp() to do something similar to throwing & catching an exception, but "throwing" a longjmp() will not destroy objects on the stack; if any such objects own heap allocations, those allocations will be leaked.
To do option 1, you have to write your recursive function such that once a solution is reached, it returns some sort of "I'm complete" indication to its caller (which may be the same function). Each caller sees that fact and relays it to its own caller by returning in turn, and so on, until finally all stack frames of the recursive algorithm are released and you return to whatever function called the first function in the recursive algorithm.
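As a minimal sketch of that pattern (the depth == 7 test is just a stand-in for whatever "solution reached" condition the real algorithm has):

bool solve(int depth)
{
    if (depth == 7)            // pretend the solution is found here
        return true;           // tell my caller the search is complete
    if (solve(depth + 1))      // recurse; if a callee reports success...
        return true;           // ...relay that fact upward without more work
    // ...try other branches here...
    return false;              // nothing found along this path
}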
To do option 2, you wrap the call to your recursive algorithm in a try{...} and immediately after it you catch(){...} the expected thrown object (which could conceivably be the result of the computation, or just some object that lets the caller know "hey, I'm done, you know where to find the result"). Example:
try
{
    callMyRecursiveFunction(someArg);
}
catch( whateverTypeYouWantToThrow& result )
{
    // ...do whatever you want to do with the result,
    // including copy it to somewhere else...
}
...and in your recursive function, when you finish the results, you simply:
throw(whateverTypeYouWantToThrow(anyArgsItsConstructorNeeds));
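Putting the two snippets together into a self-contained sketch (SearchResult and findDepthFirst are illustrative names, not anything from the question):

#include <iostream>

struct SearchResult { int value; };

void findDepthFirst(int depth)
{
    if (depth == 7)                  // pretend the answer is found here
        throw SearchResult{depth};   // unwinds every frame in one go
    findDepthFirst(depth + 1);
}

int main()
{
    try
    {
        findDepthFirst(0);
    }
    catch (const SearchResult& r)
    {
        std::cout << "found at depth " << r.value << '\n';
    }
    // ...execution continues here, as the questioner wanted...
}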
To do option 3...
#include <setjmp.h>

static jmp_buf jmp; // could be allocated other ways; the longjmp() user just needs to have access to it
...
if (!setjmp(jmp)) // setjmp() returns zero the 1st time, or whatever int value you send back to it with longjmp()
{
    callMyRecursiveFunction(someArg);
}
...and in your recursive function, when you finish the results, you simply:
longjmp(jmp, 1); // this passes 1 back to the setjmp(). If your result is an int, you
                 // could pass that back to setjmp(), but you can't pass zero back.
The bad thing about using setjmp()/longjmp() is that if there are any stack-allocated objects still "alive" on the stack when you call longjmp(), execution will jump back to the setjmp() point, skipping the destructors for those objects. If your algorithm uses only POD types, that's not an issue. It's also not an issue if the non-POD types your algorithm uses do NOT contain any heap allocations (e.g. from malloc() or new). If your algorithm uses non-POD types that contain heap allocations, then you're only safe with options 1 & 2 above. But if your algorithm meets the criteria of being OK with setjmp()/longjmp(), and if your algorithm is buried under a ton of recursive calls at the point it finishes, setjmp()/longjmp() may be the fastest way back to the initial calling context. If that won't work, option 1 is probably your best bet in terms of speed. Option 2 may seem convenient (and would possibly eliminate a condition check at the start of each recursion call), but the overhead associated with the system automatically unwinding the callstack is somewhat significant.
It's typically said you should reserve exceptions for "exceptional events" (events expected to be very rare), and the overhead associated with unwinding the callstack is why. Older compilers used something akin to setjmp()/longjmp() to implement exceptions (setjmp() at the location of the try & catch, and longjmp() at the location of a throw), but there was of course extra overhead associated with determining which objects on the stack need to be destroyed, even if there are no such objects. Plus, every time you'd run across a try, it would have to save the context just in case there was a throw, and if exceptions are truly exceptional events, the time spent saving that context was simply wasted. Newer compilers are now more likely to use what are known as "Zero Cost Exceptions" (a.k.a. Table Based Exceptions), which sounds like it would solve all the world's problems, but it doesn't.... It makes normal runtime faster because there is no longer a need to save the context every time you run across a try, but in the event that a throw executes, there is now even more overhead associated with decoding information stored in massive tables that the runtime has to process in order to figure out how to unwind the stack based on the location where the throw was encountered and the content of the runtime stack. So exceptions aren't free, even though they're very convenient. You'll find a lot of stuff on the internet where people make claims about how unreasonably expensive they are and how much they slow down your code, and you'll also find lots of stuff by people refuting those claims, with both sides presenting hard data to bolster their claims. What you should take away from the arguments is that using exceptions is great if you expect them to rarely occur, because they result in cleaner interfaces & logic that's free of a ton of condition checking for "badness" every time you make a function call. But you shouldn't use exceptions as a means of normal communication between a caller and its callees, because that mode of communication is significantly more expensive than simply using return values.

This happened to me while finding the path from root to node of a binary tree. I was using a stack to store the nodes in preorder, and the recursion wouldn't stop until the last node returned NULL. I used a global variable, integer i = 1, and when I reached the node I was looking for I set that variable to 0 and used while(i==0) return stack; to allow the program to go back up the call stack without popping my nodes off.
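For what it's worth, here is a sketch of that flag idea with the check made explicit; Node, target, and findPath are illustrative names, not the poster's actual code:

#include <stack>

struct Node { int val; Node* left; Node* right; };

static bool found = false;

void findPath(Node* n, int target, std::stack<Node*>& path)
{
    if (n == nullptr || found)
        return;                    // once found, unwind without touching 'path'
    path.push(n);
    if (n->val == target) { found = true; return; }
    findPath(n->left, target, path);
    findPath(n->right, target, path);
    if (!found)
        path.pop();                // only pop while still searching
}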

Related

How do production compilers implement destructor handling on flow control

Long story short - I am writing a compiler, and on reaching the OOP features I am faced with a dilemma involving the handling of destructors. Basically I have two options:
1 - put all destructors for objects that need calling at that point in the program. This option sounds like it will be performance friendly and simple but will bloat the code, since depending on the control flow certain destructors can be duplicated multiple times.
2 - partition destructors for each block of code with labels and "spaghetti jump" only through those that are needed. The upside - no destructors will be duplicated, the downside - it will involve non-sequential execution and jumping around, and also extra hidden variables and conditionals, which will be needed for example to determine whether execution leaves a block to continue execution in the parent block or to break/continue/goto/return, which also increases its complexity. And the extra variables and checks might very well eat up the space being saved by this approach, depending on how many objects and how complex structure and control flow inside of it is.
And I know the usual response to such questions is "do both, profile and decide" and that's what I would do if this was a trivial task, but writing a full featured compiler has proven somewhat arduous so I prefer to get some expert input rather than build two bridges, see which one does better and burn the other one.
I put c++ in the tags because that's the language I am using and am somewhat familiar with it and the RAII paradigm, which is what my compiler is modeling around as well.
For the most part, a destructor call can be treated in the same manner as an ordinary function call.
The lesser part is dealing with EH. I've noticed MSC generates a mix of inlined destructor calls in "ordinary" code, and, for x86-64, creates separate cleanup code that itself may or may not have copies of destructor logic in it.
IMO, the simplest solution would be to always call nontrivial destructors as ordinary functions.
If optimization seems possible on the horizon, treat the aforementioned calls like anything else: Will it fit in the cache with everything else? Will doing this take up too much space in the image? Etc..
A frontend may insert "calls" to nontrivial destructors at the end of each actionable block in its output AST.
A backend may treat such things as ordinary function calls, wire them together, make a big block-o-destructor call logic somewhere and jump to that, etc...
Linking functions to the same logic seems quite common. For example, MSC tends to link all trivial functions to the same implementation, destructor or otherwise, optimizing or not.
This is primarily from experience. As usual, YMMV.
One more thing:
EH cleanup logic tends to work like a jump table: For a given function, you can just jump into a single list of destructor calls, depending on where an exception was thrown (if applicable).
I don't know how commercial compilers come up with the code, but assuming we ignore exceptions at this point [1], the approach I would take is to make a call to the destructor, not inline it. Each destructor would contain the complete destructor for that object. Use a loop to deal with destructors of arrays.
To inline the calls is an optimisation, and you shouldn't do that unless you "know it pays off" (code-size vs. speed).
You will need to deal with "destruction in the enclosing block", but assuming you don't have jumps out of the block, that should be easy. Jumps out of block (e.g. return, break, etc) will mean that you have to jump to a piece of code that cleans up the block you are in.
[1] Commercial compilers have special tables based on "where was the exception thrown", and a piece of code generated to do that cleanup - typically reusing the same cleanup for many exception points by having multiple jump labels in each chunk of cleanup.
Compilers use a mix of both approaches. MSVC uses inline destructor calls for normal code flow and clean up code blocks in reverse order for early returns and exceptions. During normal flow, it uses a single hidden local integer to track constructor progress thus far, so it knows where to jump upon early returns. A single integer is sufficient because scope always forms a tree (rather than say using a bitmask for each class that has or has not been constructed successfully). For example, the following fairly short code using a class with a non-trivial destructor and some random returns sprinkled throughout...
...
if (randomBool()) return;
Foo a;
if (randomBool()) return;
Foo b;
if (randomBool()) return;
{
    Foo c;
    if (randomBool()) return;
}
{
    Foo d;
    if (randomBool()) return;
}
...
...can expand to pseudocode like below on x86, where the constructor progress is incremented immediately after each constructor call (sometimes by more than one to the next unique value) and decremented (or 'popped' to an earlier value) immediately before each destructor call. Note that classes with trivial destructors do not affect this value.
...
save previous exception handler     // for x86, not 64-bit table-based handling
preallocate stack space for locals
set new exception handler address to ExceptionCleanup
set constructor progress = 0
if randomBool(), goto Cleanup0
Foo a;
set constructor progress = 1        // Advance 1
if randomBool(), goto Cleanup1
Foo b;
set constructor progress = 2        // And once more
if randomBool(), goto Cleanup2
{
    Foo c;
    set constructor progress = 3
    if randomBool(), goto Cleanup3
    set constructor progress = 2    // Pop to 2 again
    c.~Foo();
}
{
    Foo d;
    set constructor progress = 4    // Increment 2 to 4, not 3 again
    if randomBool(), goto Cleanup4
    set constructor progress = 2    // Pop to 2 again
    d.~Foo();
}
// alternate Cleanup2
set constructor progress = 1
b.~Foo();
// alternate Cleanup1
set constructor progress = 0
a.~Foo();
Cleanup0:
    restore previous exception handler
    wipe stack space for locals
    return;
ExceptionCleanup:
    switch (constructor progress)
    {
        case 0: goto Cleanup0; // nothing to destroy
        case 1: goto Cleanup1;
        case 2: goto Cleanup2;
        case 3: goto Cleanup3;
        case 4: goto Cleanup4;
    }
    // admitting ignorance here, as I don't know how the exception
    // is propagated upward, and whether the exact same cleanup
    // blocks are shared for both early returns and exceptions.
Cleanup4:
    set constructor progress = 2
    d.~Foo();
    goto Cleanup2;
Cleanup3:
    set constructor progress = 2
    c.~Foo();
    // fall through to Cleanup2
Cleanup2:
    set constructor progress = 1
    b.~Foo();
Cleanup1:
    set constructor progress = 0
    a.~Foo();
    goto Cleanup0;
    // or it may instead return directly here
The compiler may of course rearrange these blocks any way it thinks is more efficient, rather than putting all the cleanup at the end. Early returns could jump instead to the alternate Cleanup1/2 blocks at the end of the function. On 64-bit MSVC code, exceptions are handled via tables that map the instruction pointer at which the exception happened to the respective cleanup blocks.
An optimizing compiler transforms the internal representations of the compiled source code. It usually builds a directed (usually cyclic) graph of basic blocks, and it adds the calls to the destructors while building this control flow graph.
For GCC (it is a free-software compiler - and so is Clang/LLVM - so you could study its source code), you could try compiling some simple C++ test case with -fdump-tree-all and see that this is done at gimplification time. BTW, you could customize g++ with MELT to explore its internal representations.
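For instance, a tiny test case like the following (Noisy and f are made-up names) can be compiled with g++ -fdump-tree-gimple -c test.cpp; the resulting dump file shows the implicit destructor invocation as an ordinary call (something like Noisy::~Noisy (&n);) at the end of f's scope:

struct Noisy { ~Noisy(); };

void f()
{
    Noisy n;
    // the gimple dump shows the compiler-inserted destructor call here
}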
BTW, I don't think that how you deal with destructors is that important (notice that in C++ they are implicitly called at syntactically defined places, like the } of their defining scope). Most of the work of such a compiler is in optimizing; dealing with destructors is not very relevant there, since they are nearly routines like any others.

How to handle or avoid a stack overflow in C++

In C++ a stack overflow usually leads to an unrecoverable crash of the program. For programs that need to be really robust, this is unacceptable behaviour, particularly because stack size is limited. A few questions about how to handle the problem:
Is there a way to prevent stack overflow by a general technique? (A scalable, robust solution, that includes dealing with external libraries eating a lot of stack, etc.)
Is there a way to handle stack overflows in case they occur? Preferably, the stack gets unwound until there's a handler to deal with that kind of issue.
There are languages out there that have threads with expandable stacks. Is something like that possible in C++?
Any other helpful comments on dealing with this C++ behaviour would be appreciated.
Handling a stack overflow is not the right solution, instead, you must ensure that your program does not overflow the stack.
Do not allocate large variables on the stack (where what counts as "large" depends on the program). Ensure that any recursive algorithm terminates after a known maximum depth. If a recursive algorithm may recurse an unknown or large number of times, either manage the recursion yourself (by maintaining your own dynamically allocated stack) or transform the recursive algorithm into an equivalent iterative algorithm.
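As a sketch of that last point, here is a recursive tree walk rewritten with an explicit, heap-backed stack (Node and visit are illustrative):

#include <stack>

struct Node { Node* left; Node* right; };

void visit(Node*) { /* process the node */ }

void traverseIteratively(Node* root)
{
    std::stack<Node*> pending;          // grows on the heap, not the call stack
    if (root) pending.push(root);
    while (!pending.empty())
    {
        Node* n = pending.top();
        pending.pop();
        visit(n);
        if (n->right) pending.push(n->right);
        if (n->left)  pending.push(n->left);
    }
}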
A program that must be "really robust" will not use third-party or external libraries that "eat a lot of stack."
Note that some platforms do notify a program when a stack overflow occurs and allow the program to handle the error. On Windows, for example, an exception is thrown. This exception is not a C++ exception, though; it is an asynchronous exception. Whereas a C++ exception can only be thrown by a throw statement, an asynchronous exception may be thrown at any time during the execution of a program. This is expected, though, because a stack overflow can occur at any time: any function call or stack allocation may overflow the stack.
The problem is that a stack overflow may cause an asynchronous exception to be thrown even from code that is not expected to throw any exceptions (e.g., from functions marked noexcept or throw() in C++). So, even if you do handle this exception somehow, you have no way of knowing that your program is in a safe state. Therefore, the best way to handle an asynchronous exception is not to handle it at all(*). If one is thrown, it means the program contains a bug.
Other platforms may have similar methods for "handling" a stack overflow error, but any such methods are likely to suffer from the same problem: code that is expected not to cause an error may cause an error.
(*) There are a few very rare exceptions.
You can protect against stack overflows using good programming practices, like:
Be very careful with recursion. I have recently seen a SO resulting from a badly written recursive CreateDirectory function. If you are not sure your code is 100% OK, then add a guard variable that will stop execution after N recursive calls. Or, even better, don't write recursive functions.
Do not create huge arrays on the stack. These might be hidden arrays, like a very big array as a class field. It's always better to use vector.
Be very careful with alloca, especially if it is put into some macro definition. I have seen numerous SOs resulting from string conversion macros put into for loops that used alloca for fast memory allocations.
Make sure your stack size is optimal; this is more important on embedded platforms. If your thread does not do much, then give it a small stack, otherwise use a larger one (see the sketch below). I know reservation should only take some address range, not physical memory.
Those are the most common SO causes I have seen in the past few years.
For automatic SO finding you should be able to find some static code analysis tools.
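Regarding the stack-size point above, a minimal sketch of giving a thread a deliberately chosen stack size with pthreads (build with -pthread; 256 KiB is an arbitrary illustrative figure, and must stay above PTHREAD_STACK_MIN):

#include <pthread.h>

void* worker(void*) { /* small-stack work */ return nullptr; }

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);  // instead of the default

    pthread_t tid;
    pthread_create(&tid, &attr, worker, nullptr);
    pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
}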
Re: expandable stacks. You could give yourself more stack space with something like this:
#include <iostream>

int main()
{
    int sp = 0;
    // you probably want this a lot larger
    int *mystack = new int[64*1024];
    int *top = (mystack + 64*1024);

    // Save SP and set SP to our newly created stack frame
    __asm__ (
        "mov %%esp,%%eax; mov %%ebx,%%esp":
        "=a"(sp)
        :"b"(top)
        :
    );

    std::cout << "sp=" << sp << std::endl;

    // call bad code here

    // restore old SP so we can return to OS
    __asm__ (
        "mov %%eax,%%esp":
        :
        "a"(sp)
        :
    );

    std::cout << "Done." << std::endl;
    delete [] mystack;
    return 0;
}
This is gcc's assembler syntax.
C++ is a powerful language, and with that power comes the ability to shoot yourself in the foot. I'm not aware of any portable mechanism to detect and correct/abort when stack overflow occurs. Certainly any such detection would be implementation-specific. For example, g++ provides -fstack-protector to help detect stack corruption at runtime.
In general your best bet is to be proactive in avoiding large stack-based variables and careful with recursive calls.
I don't think that that would work. It would be better to push/pop esp than move to a register because you don't know if the compiler will decide to use eax for something.
Here ya go:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/resetstkoflw?view=msvc-160
You wouldn't catch the EXCEPTION_STACK_OVERFLOW structured exception yourself because the OS is going to catch it (in Windows' case).
Yes, you can safely recover from a structured exception (called "asynchronous" above), unlike what was indicated above. Windows wouldn't work at all if you couldn't. PAGE_FAULTs are structured exceptions that are recovered from.
I am not as familiar with how things work under Linux and other platforms.

Iterating without incurring the cost of IF statements

My question is based on curiosity and not whether there is another approach to the problem or not. It is a strange/interesting question, so please read it with an open mind.
Let's assume there is a game loop that is being called every frame. The game loop in turn calls several functions through a myriad of if statements. For example, if the user has set GUI to false, then don't refresh the GUI; otherwise call RefreshGui(). There are many other if statements in the loop, and they call their respective functions if they are true. Some are if/else-if.../else chains, which are more costly in the worst case. Even the functions that are called, if the if statement is true, have logic: if the user wants raypicking on all objects call FunctionA(), if the user wants raypicking on lights call FunctionB(), ..., else call all functions. Hopefully you get the idea.
My point is, that is a lot of redundant if statements. So I decided to use function pointers instead. Now my assumption is that a function pointer is always going to be faster than an if statement; it is a replacement for if/else. So if the user wants to switch between two different camera modes, he/she presses the C key to toggle between them. The callback function for the keyboard changes the function pointer to the correct UpdateCamera function (in this case, the function pointer can point to either UpdateCameraFps() or UpdateCameraArcBall())... you get the gist of it.
Now to the question itself. What if I have several update functions, all with the same signature (let's say void (*Update)(float time)), so that a function pointer can potentially point to any one of them? Then I have a vector which is used to store the pointers, and in my main update loop I go through the vector and call each update function. I can remove/add and even change the order of the updates without changing the underlying code. In the best case I might only be calling one update function, or in the worst case all of them, all with a very clean while loop and no nasty (potentially nested) if statements. I have implemented this part and it works great (a sketch follows below). I am aware that, with each iteration of the while loop responsible for iterating through the vector, I am checking whether itrBegin == itrEnd; more specifically, while (itrBegin != itrEnd). Is there any way to avoid that if statement? Can I use branch prediction to my advantage (or am I taking advantage of it already without knowing)?
Again, please take the question as-is, i.e. I am not looking for a different approach (although you are more than welcome to give one).
EDIT: A few replies state that this is an unneeded premature optimization and I should not be focusing on it and that the if-statement(s) cost is minuscule compared to the work done in all the separate update functions. Very true, and I completely agree, but that was not the point of the question and I apologize if I did not make the question clearer. I did learn quite a few new things with all the replies though!
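For concreteness, a minimal sketch of the update-vector scheme described above (the update functions and dt are illustrative):

#include <vector>

void UpdateCameraFps(float)     { /* ... */ }
void UpdateCameraArcBall(float) { /* ... */ }
void UpdateGui(float)           { /* ... */ }

using UpdateFn = void (*)(float);

std::vector<UpdateFn> updates = { UpdateCameraFps, UpdateGui };

void runFrame(float dt)
{
    // the iterator comparison is the only branch left in the loop
    for (UpdateFn fn : updates)
        fn(dt);
}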
there is a game loop that is being called every frame
That's a backwards way of describing it. A game loop doesn't run during a frame, a frame is handled in the body of the game loop.
my assumption is that a function pointer is always going to be faster than an if statement
Have you tested that? It's not likely to be true, especially if you're changing the pointer frequently (which really messes with the CPU's branch prediction).
Can I use branch prediction to my advantage (or am I taking advantage of it already without knowing)?
This is just wishful thinking. By having one indirect call inside your loop calling a bunch of different functions you are definitely working against the CPU branch prediction logic.
More specifically while (itrBegin != itrEnd). Is there any way to avoid the call to the if statements?
One thing you could do in order to avoid conditionals as you iterate the chain of functions is to use a linked list. Then each function can call the next one unconditionally, and you simply install your termination logic as the last function in the chain (longjmp or something). Or you could hopefully just never terminate, include glSwapBuffers (or the equivalent for your graphics API) in the list and just link it back to the beginning.
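A sketch of that chained idea, with made-up updater names: each link hands off to the next unconditionally, and the final link jumps back out, so there is no per-step termination test. (Each hand-off is still an indirect call, and without tail-call optimization the chain consumes stack, so treat this as a curiosity rather than a recommendation.)

#include <csetjmp>

struct Link
{
    void (*fn)(float, const Link*);
    const Link* next;
};

static std::jmp_buf frameDone;

void updateCamera(float dt, const Link* self)
{
    // ...camera work...
    self->next->fn(dt, self->next);    // unconditional hand-off
}

void updatePhysics(float dt, const Link* self)
{
    // ...physics work...
    self->next->fn(dt, self->next);
}

void endOfChain(float, const Link*)
{
    std::longjmp(frameDone, 1);        // termination logic lives here
}

void runFrame(float dt)
{
    static const Link tail    = { endOfChain,    nullptr };
    static const Link physics = { updatePhysics, &tail };
    static const Link head    = { updateCamera,  &physics };
    if (!setjmp(frameDone))
        head.fn(dt, &head);
    // control lands here after endOfChain fires
}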
First, profile your code. Then optimize the parts that need it.
"if" statements are the least of your concerns. Typically, with optimization, you focus on loops, I/O operations, API calls (e.g. SQL), containers/algorithms that are inefficient and used frequently.
Using function pointers to try to optimize is typically the worst thing you can do. You kill any chance at code readability and work against the CPU and compiler. I recommend using polymorphism or just use the "if" statements.
To me, this is asking for an event-driven approach. Rather than checking every time if you need to do something, monitor for the incoming request to do something.
I don't know if you consider it a deviation from your approach, but it would reduce the number of if...then statements to 1.
while( active )
{
    // check message queue
    if( messages )
    {
        // act on each message and update flags accordingly
    }

    // draw based on flags (whether or not they changed is irrelevant)
}
EDIT: Also I agree with the poster who stated that the loop should not be based on frames; the frames should be based on the loop.
If the conditions checked by your ifs are not changing during the loop, you could check them all once, and set a function pointer to the function you'd like to call in that case. Then in the loop call the function the function pointer points to.
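A sketch of that, reusing the camera example from the question (names are illustrative):

void UpdateCameraFps(float)     { /* ... */ }
void UpdateCameraArcBall(float) { /* ... */ }

void runLoop(bool fpsMode, float dt, int frames)
{
    // test the flag once, outside the loop...
    void (*updateCamera)(float) = fpsMode ? UpdateCameraFps
                                          : UpdateCameraArcBall;
    // ...then run a branch-light loop with no per-frame mode check
    for (int i = 0; i < frames; ++i)
        updateCamera(dt);
}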

Does it take time to deallocate memory?

I have a C++ program which, during execution, will allocate about 3-8Gb of memory to store a hash table (I use tr1/unordered_map) and various other data structures.
However, at the end of execution, there will be a long pause before returning to shell.
For example, at the very end of my main function I have
std::cout << "End of execution" << std::endl;
But the execution of my program will go something like
$ ./program
do stuff...
End of execution
[long pause of maybe 2 min]
$ -- returns to shell
Is this expected behavior or am I doing something wrong?
I'm guessing that the program is deallocating the memory at the end. But commercial applications which use large amounts of memory (such as Photoshop) do not exhibit this pause when you close the application.
Please advise :)
Edit: The biggest data structure is an unordered_map keyed with a string and stores a list of integers.
I am using g++ -O2 on Linux; the computer I am using has 128 GB of memory (with most of that free). There are a few giant objects.
Solution: I ended up getting rid of the hashtable since it was almost full anyways. This solved my problem.
If the data structures are sufficiently complicated when your program finishes, freeing them might actually take a long time.
If your program actually must create such complicated structures (do some memory profiling to make sure), there probably is no clean way around this.
You can short cut that freeing of memory by a dirty hack - at least on those operating systems where all memory allocated by a process is automatically freed when the process terminates.
You would do that by directly calling the libc's exit(3) function or the operating system's _exit(2). However, I would be very careful about verifying this does not short-circuit any other (important) cleanups some C++ destructor code might be doing. And what this does or does not do is highly system dependent (operating system, compiler, libc, the APIs you were using, ...).
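A sketch of that approach using std::quick_exit from C++11, which skips static destructors (and the heap teardown they trigger) but still runs handlers registered with std::at_quick_exit, so essential cleanup such as flushing streams can be kept:

#include <cstdio>
#include <cstdlib>

int main()
{
    std::at_quick_exit([] { std::fflush(nullptr); });  // flush all C streams

    // ...build and use the multi-gigabyte structures...

    std::quick_exit(0);  // the OS reclaims the memory wholesale
}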
Yes, the deallocation of memory can take some time, and possibly you also have code executing, like destructors being called. Photoshop does not use 3-8 GB of memory.
Also you should perhaps add profiling to your application to confirm it is the deallocation of memory and not something else.
(I started this as a reply to ndim, but it got too long.)
As ndim already posted, termination can take a long time.
Likely reasons are:
you have lots of allocations, and parts of the heap are swapped to disk.
long running destructors
other atexit routines
OS-specific cleanup, such as notifying DLLs of thread & process termination on Windows (I don't know what exactly happens on Linux).
exit is not the worst workaround here; however, actual behavior is system dependent. E.g. exit on Windows / MSVC CRT will run global destructors / atexit routines, then call ExitProcess, which does close handles (but not necessarily flush them - at least it's not guaranteed).
Downsides: Destructors of heap allocated objects don't run - if you rely on them (e.g. to save state), you are toast. Also, tracking down real memory leaks gets much harder.
Find the cause. You should first analyze what is happening, e.g. by manually freeing the root objects that are still allocated, so you can separate the deallocation time from other process cleanup. Memory is the likely cause according to your description, but it's not the only possible one. Some cleanup code deadlocking before it runs into a timeout is possible, too. Monitoring stats (such as CPU/swap activity/disk use) can give clues.
Check the release build - debug builds usually use extra data on the heap that can immensely increase cleanup cost.
Different allocators
If deallocation is the problem, you might benefit a lot from using custom allocation mechanisms. Example: if your map only grows (items are never removed), an arena allocator can help a lot. If your lists of integers have many nodes, switch to a vector, or use a rope if you need random insertion.
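A sketch of the arena idea with C++17's std::pmr, assuming your standard library ships it: a monotonic resource never frees individual nodes, so the map's node-by-node teardown becomes a series of no-ops and the arena releases its blocks in one go.

#include <memory_resource>
#include <string>
#include <unordered_map>

int main()
{
    std::pmr::monotonic_buffer_resource arena;
    {
        std::pmr::unordered_map<std::pmr::string, int> table(&arena);
        table.emplace("some key", 42);
        // ...millions of insertions...
    }   // the map's destructor runs, but each deallocate() is a no-op
}       // the arena returns its blocks all at once here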
Certainly it's possible.
About 7 years ago I had a similar problem on a project, there was much less memory but computers were slower too I suppose.
We had to look at the assembly language for free in the end to work out why it was so slow, and it seemed that it was essentially keeping the freed blocks in a linked list so they could be reallocated, and was also scanning that list looking for blocks to combine. Scanning the list was an O(n) operation, but freeing 'n' objects turned it into O(n^2).
Our test data took about 5 seconds to free the memory, but some customers had about 10 times as much data as we ever used, and it was taking 5-10 minutes to shut down the program on their systems.
We fixed it, as has been suggested by just terminating the process instead and letting the operating system clear up the mess (which we knew was safe to do on our application).
Perhaps you have a more sensible free function than we had several years ago, but I just wanted to post that it's entirely possible if you have many objects to free and an O(n) free operation.
I can't imagine how you'd use enough memory for it to matter, but one way I sped up a program was to use boost::object_pool to allocate memory for a binary tree. The major benefit for me was that I could just put the object pool as a member variable of the tree, and when the tree went out of scope or was deleted, the object pool would be deleted all at once (letting me avoid a recursive destructor for the nodes). object_pool does call all of its objects' destructors at exit, though. I'm not sure whether it handles empty destructors in a special way or not.
If you don't need your allocator to call a constructor, you can also use boost::pool, which I think may deallocate faster because it doesn't have to call destructors at all and just deletes the chunk of memory in one free().
Freeing memory may well take time - data structures are being updated. How much time depends on the allocator being used.
Also there might be more than just memory deallocation going on - if destructors are being executed, there may be a lot more than that going on.
2 minutes does sound like a lot of time though - you might want to step through the clean up code in a debugger (or use a profiler if that's more convenient) to see what's actually taking all the time.
The time is probably not entirely wasted deallocating memory, but calling all the destructors. You can provide your own allocator that does not call the destructor (if the object in the map doesn't need to be destructed, but only deallocated).
Also take a look at this other question: C++ STL-conforming Allocators
Normally, deallocating memory as a process ends is not taken care of as part of the process, but rather as an operating system cleanup function. You might try something like valgrind to make sure your memory is being dealt with properly. However, the compiler also does certain things to set up and tear down your program, so some sort of performance profiling, or using a debugger to step through what is taking place at teardown time might be useful.
When your program exits, the destructors of all the global objects are called.
If one of them takes a long time, you will see this behavior.
Look for global objects and investigate their destructors.
Sorry, but this is a terrible question. You need to show the source code showing the specific algorithms and data structures that you are using.
It could be de-allocating, but that's just a wild guess. What are your destructors doing? Maybe it is paging like crazy. Just because your application allocates X amount of memory, that doesn't mean it will get it. Most likely it will be paging off virtual memory. Depending on the specifics of your application and OS, you might be doing a lot of page faults.
In such cases, it might help to run iostat and vmstat in the background to see what the heck is going on. If you see a lot of I/O, that's a sure sign you are page faulting. I/O operations will always be more expensive than memory ops.
I would be very surprised if indeed all that lapsed time at the end is purely due to de-allocation.
Run vmstat and iostat as soon as you get the "ending" message, and look for any indications of I/O going bananas.
The objects in memory are organized in a heap. They are not deleted at once; they are deleted one by one, and the cost of deleting an object is O(log n). Freeing them takes a loooong time.
So the answer is: yes, it can take that much time.
You can avoid free being called on an object by using a destructor call my_object->~my_class() instead of delete my_object. You can avoid free on all objects of a class by overriding and nullifying operator delete( void * ) {} inside the class. Derived classes with virtual destructors will inherit that delete, otherwise you can copy-paste (or maybe using base::operator delete;).
This is much cleaner than calling exit. Just be sure you don't need that memory back!
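A sketch of that per-class no-op delete (Node is illustrative); this is only sane when the process is about to exit and the OS will reclaim everything anyway:

struct Node
{
    ~Node() { /* still runs on delete-expressions */ }

    // delete-expressions on Node* now run the destructor but never free memory
    static void operator delete(void*) noexcept {}
};

void discard(Node* n)
{
    delete n;   // destructor executes; the allocation is deliberately leaked
}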
I guess your unordered map is a global variable, whose constructor is called at process startup, and destructor is called at process exit.
How could you know if the map is guilty?
You can test if your unordered_map is responsible (and I guess it is) by allocating it with new and, well, ahem... forgetting to delete it.
If your process' exit goes faster, then you have your culprit.
Why is this so sloooooow?
Now, just by reading your post, for your unordered map, I see potential allocations for:
each string's allocated buffer
list items (each one being a string + other things)
unordered map items + the bucket array
If you have 3-8 Gb of data in this unordered map, this means that each item above will need some kind of new and delete. And if you free every item, one by one, it could take time.
Other reasons?
Note that if you add items to your map one by one while your process executes, the individual news are barely perceptible... But the moment you want to clean up, all your allocated items must be destroyed at the same time, which could explain the perceived difference between construction/use and destruction...
Now, the destructors could take time for an additional reason.
On Visual C++ 2008 in debug mode, for example, upon destruction of STL iterators, the destructor verifies that the iterators are still correct. This caused quite a slowdown upon my object's destruction (which was basically a tree of nodes, each node having a list of child nodes, with iterators everywhere).
You are working on gcc, so perhaps they have their own debug testing, or perhaps your destructors are doing additional work (e.g. logging?)...
In my experience, the calls to free or delete should not take a significant amount of time. That said, I have seen plenty of cases where it does take non-trivial time to destruct objects because of destructors that did non-trivial things. If you can't tell what's taking time during the destruction, use a debugger and/or a profiler to determine what's going on. If the profiler shows you that it really is calls to free() that take a lot of time, then you should improve your memory allocation scheme, because you must be creating an extremely large number of small objects.
As you noted, plenty of applications allocate large amounts of memory and incur no significant delay during shutdown, so there's no reason your program can't do the same.
I would recommend (as some others have) a simple forced process termination, if you're certain that you've nothing left to do but free memory (for example, no file i/o and such left to do).
The thing is that when you free memory, typically it's not actually returned to the OS - it's held in a list to be reallocated, and this is obviously slow. However, if you terminate the process, the OS will reclaim all your memory at once, which should be substantially faster. However, as others have said, if you have any destructors that need to run, you should ensure that they are run before force-calling exit() or ExitProcess or any such function.
What you should be aware of is that deallocating memory that is spread out (e.g., two nodes in a map) is much slower due to cache effects than deallocating memory in a vector, because the CPU needs to access the memory to free it and run any destructors. If you deallocated a very large amount of memory that's very fragmented, you could be falling afoul of this, and should consider changing to some more contiguous structures.
I actually had a problem where allocating memory was faster than de-allocating it, and after allocating memory and then de-allocating it, I had a memory leak. Eventually, I worked out that this is why.
I am currently facing a similar issue with a CPU- and memory-intensive research program of mine. It runs until a specified time limit, prints a solution and exits. The destructor call of a single object (containing up to 10⁶ relatively small objects) was what unexpectedly took time at the end of execution (about 10 seconds to free 5 GB of data).
I was not satisfied by the answers advising to avoid executing every destructor, so here is the solution I came up with:
Original code:
void process() {
    vector<unordered_map<State, int>> large_obj(100);
    // Processing...
}   // Takes a few seconds to exit (destructor calls)
Solution:
void process(bool free_mem = false) {
    auto* large_obj_ = new vector<unordered_map<State, int>>(100);
    auto& large_obj = *large_obj_;   // dereference the pointer, not the reference itself
    // Processing...
    // (No changes required here, 'large_obj' can be used exactly as before)
    if (free_mem)
        delete large_obj_;
}
It has the advantage of being completely transparent apart from a few lines to insert, and it can even be parametrized to take some time to free the memory if needed. It is explicit which object will intentionally not be freed to avoid leaving things in an "unstable" state. Memory is cleaned up instantly by the OS on exit when free_mem = false.

What can modify the frame pointer?

I have a very strange bug cropping up right now in a fairly massive C++ application at work (massive in terms of CPU and RAM usage as well as code length - in excess of 100,000 lines). This is running on a dual-core Sun Solaris 10 machine. The program subscribes to stock price feeds and displays them on "pages" configured by the user (a page is a window construct customized by the user - the program allows the user to configure such pages). This program used to work without issue until one of the underlying libraries became multi-threaded. The parts of the program affected by this have been changed accordingly. On to my problem.
Roughly once in every three executions the program will segfault on startup. This is not necessarily a hard rule - sometimes it'll crash three times in a row then work five times in a row. It's the segfault that's interesting (read: painful). It may manifest itself in a number of ways, but most commonly what will happen is function A calls function B and upon entering function B the frame pointer will suddenly be set to 0x000002. Function A:
result_type emit(typename type_trait<T_arg1>::take _A_a1) const
{ return emitter_type::emit(impl_, _A_a1); }
This is a simple signal implementation. impl_ and _A_a1 are well-defined within their frame at the crash. On actual execution of that instruction, we end up at program counter 0x000002.
This doesn't always happen in that function. In fact it happens in quite a few places, but this is one of the simpler cases that doesn't leave much room for error. Sometimes a stack-allocated variable will suddenly be sitting on junk memory (always 0x000002) for no reason whatsoever. Other times, that same code will run just fine. So, my question is, what can mangle the stack so badly? What can actually change the value of the frame pointer? I've certainly never heard of such a thing. About the only thing I can think of is writing out of bounds on an array, but I've built it with a stack protector, which should catch any instance of that happening. I'm also well within the bounds of my stack here. I also don't see how another thread could overwrite the variable on the stack of the first thread, since each thread has its own stack (this is all pthreads). I've tried building this on a Linux machine, and while I don't get segfaults there, roughly one out of three times it will freeze up on me.
Stack corruption, 99.9% definitely.
The smells you should be looking carefully for are:-
Use of 'C' arrays
Use of 'C' strcpy-style functions
memcpy
malloc and free
thread-safety of anything using pointers
Uninitialised POD variables.
Pointer Arithmetic
Functions trying to return local variables by reference
I had that exact problem today and was knee-deep in gdb mud, debugging for a straight hour, before it occurred to me that I had simply written over the boundaries of a C array (where I least expected it).
So, if possible, use vectors instead, because any decent STL implementation will give good error messages if you try that in debug mode (whereas C arrays punish you with segfaults).
I'm not sure what you're calling a "frame pointer", as you say:
On actual execution of that instruction, we end up at program counter 0x000002
Which makes it sound like the return address is being corrupted. The frame pointer is a pointer that points to the location on the stack of the current function call's context. It may well point to the return address (this is an implementation detail), but the frame pointer itself is not the return address.
I don't think there's enough information here to really give you a good answer, but some things that might be culprits are:
incorrect calling convention. If you're calling a function using a calling convention different from how the function was compiled, the stack may become corrupted.
RAM hit. Anything writing through a bad pointer can cause garbage to end up on the stack. I'm not familiar with Solaris, but most thread implementations have the threads in the same process address space, so any thread can access any other thread's stack. One way a thread can get a pointer into another thread's stack is if the address of a local variable is passed to an API that ultimately deals with the pointer on a different thread. Unless you synchronize things properly, this will end up with the pointer accessing invalid data. Given that you're dealing with a "simple signal implementation", it seems possible that one thread is sending a signal to another. Maybe one of the parameters in that signal has a pointer to a local?
There's some confusion here between stack overflow and stack corruption.
Stack overflow is a very specific issue caused by trying to use more stack than the operating system has allocated to your thread. The three normal causes look like this:
void foo()
{
    foo(); // endless recursion - whoops!
}

void foo2()
{
    char myBuffer[A_VERY_BIG_NUMBER]; // The stack can't hold that much.
}

class bigObj
{
    char myBuffer[A_VERY_BIG_NUMBER];
};

void foo3( bigObj big1 ) // pass by value of a big object - whoops!
{
}
In embedded systems, thread stack size may be measured in bytes, and even a simple calling sequence can cause problems. By default on Windows, each thread gets 1 MB of stack, so causing a stack overflow is much less of a common problem. Unless you have endless recursion, stack overflows can always be mitigated by increasing the stack size, even though this usually is NOT the best answer.
Stack Corruption simply means writing outside the bounds of the current stack frame, thus potentially corrupting other data - or return addresses on the stack.
At its simplest:-
void foo()
{
    char message[10];
    message[10] = '!'; // whoops! beyond end of array
}
That sounds like a stack overflow problem - something is writing beyond the bounds of an array and trampling over the stack frame (and probably the return address too) on the stack. There's a large literature on the subject. "The Shell Programmer's Guide" (2nd Edition) has SPARC examples that may help you.
With C++, uninitialized variables and race conditions are likely suspects for intermittent crashes.
Is it possible to run the thing through Valgrind? Perhaps Sun provides a similar tool. Intel VTune (Actually I was thinking of Thread Checker) also has some very nice tools for thread debugging and such.
If your employer can spring for the cost of the more expensive tools, they can really make these sorts of problems a lot easier to solve.
It's not hard to mangle the frame pointer - if you look at the disassembly of a routine you will see that it is pushed at the start and popped at the end - so if anything overwrites the stack it can get lost. The stack pointer is where the stack is currently at, and the frame pointer is where it started (for the current routine).
Firstly I would verify that all of the libraries and related objects have been rebuilt clean and all of the compiler options are consistent - I've had a similar problem before (Solaris 2.5) that was caused by an object file that hadn't been rebuilt.
It sounds exactly like an overwrite - and putting guard blocks around memory isn't going to help if it is simply a bad offset.
After each core dump examine the core file to learn as much as you can about the similarities between the faults. Then try to identify what is getting overwritten. As I remember the frame pointer is the last stack pointer - so anything logically before the frame pointer shouldn't be modified in the current stack frame - so maybe record this and copy it elsewhere and compare upon return.
Is something meant to assign a value of 2 to a variable, but instead assigning 2 to its address (i.e. to a pointer)?
The other details are lost on me but "2" is the recurring theme in your problem description. ;)
I would second that this definitely sounds like a stack corruption due to out of bound array or buffer writing. Stack protector would be good as long as the writing is sequential, not random.
I second the notion that it is likely stack corruption. I'll add that the switch to a multi-threaded library makes me suspicious that what has happened is a lurking bug has been exposed. Possibly the sequencing the buffer overflow was occurring on unused memory. Now it's hitting another thread's stack. There are many other possible scenarios.
Sorry if that doesn't give much of a hint at how to find it.
I tried Valgrind on it, but unfortunately it doesn't detect stack errors:
"In addition to the performance penalty an important limitation of Valgrind is its inability to detect bounds errors in the use of static or stack allocated data."
I tend to agree that this is a stack overflow problem. The tricky thing is tracking it down. Like I said, there's over 100,000 lines of code to this thing (including custom libraries developed in-house - some of it going as far back as 1992) so if anyone has any good tricks for catching that sort of thing, I'd be grateful. There's arrays being worked on all over the place and the app uses OI for its GUI (if you haven't heard of OI, be grateful) so just looking for a logical fallacy is a mammoth task and my time is short.
Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
No one asked this, but I built with gcc-4.2. Also, I can guarantee ABI safety here so that's also not the issue. As for the "garbage at the end of the stack" on the RAM hit, the fact that it is universally 2 (though in different places in the code) makes me doubt that as garbage tends to be random.
It is impossible to know, but here are some hints that I can come up with.
In pthreads you must allocate the stack and pass it to the thread. Did you allocate enough? There is no automatic stack growth like in a single threaded process.
If you are sure that you don't corrupt the stack by writing past stack-allocated data, check for rogue pointers (mostly uninitialized pointers).
One of the threads could overwrite some data that others depend on (check your data synchronisation).
Debugging is usually not very helpful here. I would try to create lots of log output (traces for entry and exit of every function/method call) and then analyze the log.
The fact that the error manifests itself differently on Linux may help. What thread mapping are you using on Solaris? Make sure you map every thread to its own LWP to ease the debugging.
Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
If you pass anything on the stack by reference or by address, this would most certainly happen if another thread tried to use it after the first thread returned from a function.
You might be able to repro this by forcing the app onto a single processor. I don't know how you do that on SPARC.