Policy with catching std::bad_alloc - C++

So I use Qt a lot with my development and love it. The usual design pattern with Qt objects is to allocate them using new.
Pretty much all of the examples (especially code generated by the Qt designer) do absolutely no checking for the std::bad_alloc exception. Since the objects allocated (usually widgets and such) are small this is hardly ever a problem. After all, if you fail to allocate something like 20 bytes, odds are there's not much you can do to remedy the problem.
Currently, I've adopted a policy of wrapping "large" allocations (anything above a page or two in size) in a try/catch. If that fails, I display a message to the user; for pretty much anything smaller, I just let the app crash with a std::bad_alloc exception.
So, what are the schools of thought on this?
Is it good policy to check each and every new operation? Or only ones I expect to have the potential to fail?
Also, it is clearly a whole different story when dealing with an embedded environment where resources can be much more constrained. I am asking in the context of a desktop application, but would be interested in answers involving other scenarios as well.

The problem is not "where to catch" but "what to do when an exception is caught".
If you want to check, instead of wrapping with try/catch you'd better use
#include <new>
X* x = new (std::nothrow) X();
if (x == nullptr) {
    // allocation failed: the nothrow form returns a null pointer instead of throwing
}
My usual practice is
in a non-interactive program, catch at main level and display an adequate error message there.
in a program having a user interaction loop, catch also in the loop so that the user can close some things and try to continue.
Exceptionally, there are other places where a catch is meaningful, but it's rare.

Handle the exception when you can. If an allocation fails, and your application can't continue without that bit of memory, why bother checking for the error?
Handle the error when it can be handled, when there is a meaningful way to recover. If there's nothing you can do about the error, just let it propagate.

I usually catch exceptions at the point where the user has initiated an action. For console application this means in main, for GUI applications I put handlers in places like button on-click handlers and such.
I believe that it makes little sense to catch exceptions in the middle of an action; the user usually expects the operation to either succeed or fail completely.

This is a relatively old thread, but it did come up when I was searching for "std::bad_alloc" considerations when doing new/delete overriding here in 2012.
I would not take the attitude of "oh well, there's nothing you can do anyhow" as a viable option.
For my own allocations I personally use the "if (alloc()) { } else { /* error handling */ }" style mentioned above. This way I can properly handle and/or report each case in its own meaningful context.
Now, some other possible solutions are:
1) Override new/delete for the application, where you can add your own out-of-memory handling (a sketch follows below).
Although, as other posters state, and in particular without knowledge of the particular context, the main option is probably just to shut down the application.
If this is the case you will want your handler either to have preallocated its needed memory, and/or to use static memory, so that hopefully the handler itself will not become exhausted.
Here you could have at least perhaps a dialog pop up and say something on the lines of:
"The application ran out of memory. This a fatal error and must now self terminate.
The application must be run in the minimum system memory requirements. Send debug reports to xxxx".
The handler could also save any work in progress, etc., fitting the application.
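As a sketch of option 1 - what you do in the failure branch (dialog, saving work, terminating) is application-specific, and the array forms (new[]/delete[]) would be replaced the same way:

#include <cstdio>
#include <cstdlib>
#include <new>

// Sketch only: a replacement global operator new that funnels allocation
// failure into one place. The message buffer is static (preallocated) so the
// failure path does not itself need a fresh allocation.
static char oom_message[128];  // preallocated: usable even when the heap is gone

void* operator new(std::size_t size) {
    if (void* p = std::malloc(size))
        return p;
    std::snprintf(oom_message, sizeof oom_message,
                  "Fatal: out of memory (%zu bytes requested).\n", size);
    std::fputs(oom_message, stderr);
    std::abort();  // or: save work in progress first, then terminate
}

void operator delete(void* p) noexcept {
    std::free(p);
}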
At any rate you wouldn't want to use this for something critical like (warning, amateur humor ahead): the space shuttle, a heart rate monitor, a kidney dialysis machine, etc.
These things require much more robust solutions of course, using fail safes, emergency garbage collection methods, 100% testing/debugging/fuzzing, etc.
2) Similar to the first, set the global "set_new_handler()" with a handler of your own to catch the out of memory condition at a global scope.
It can at least handle things as mentioned in #1; a sketch follows.
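A sketch of option 2:

#include <cstdio>
#include <cstdlib>
#include <new>

// Sketch only: installing a global new-handler. operator new calls the
// handler when it fails; a handler that cannot free any memory should
// terminate rather than return (returning would make operator new retry).
void on_out_of_memory() {
    std::fputs("The application ran out of memory and must now terminate.\n",
               stderr);
    std::abort();
}

int main() {
    std::set_new_handler(on_out_of_memory);
    // ... run the application ...
}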

The real question is really: should you catch std::bad_alloc exceptions at all?
In most cases, if you run out of memory you are doomed anyway, and you might as well consider ending your program.

Handle it in main() (or the equivalent top-level exception handler in Qt).
The reason is that std::bad_alloc happens either when you exhaust the address space (2 or 3 GB on 32-bit systems; in practice this doesn't happen on 64-bit systems) or when you exhaust swap space. Modern heap allocators aren't tuned to run from swap space, so that will be a slow, noisy death - chances are your users will kill your app well beforehand because its UI is no longer responding. And on Linux, the OS memory handling is so poor by default that your app is likely to be killed automatically.
So there is little you can do. Confess you have a bug, and see if you can save any work the user may have done. To be able to do so, it's best to abandon as much of the in-progress work as possible. Yes, this may lose some of the last user input, but it's that very input that likely triggered the OOM situation. The goal is to save whatever data you can still trust.
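A minimal sketch of such a top-level handler; runApplication() and saveWhatWeCan() are hypothetical stand-ins for the real application loop and its best-effort save logic:

#include <cstdio>
#include <new>

void runApplication() { /* the real application loop goes here */ }
void saveWhatWeCan()  { /* best-effort save of data we can still trust */ }

int main() {
    try {
        runApplication();
        return 0;
    } catch (const std::bad_alloc&) {
        std::fputs("Out of memory - trying to save your work.\n", stderr);
        saveWhatWeCan();
        return 1;
    }
}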

Related

When can the problem actually be fixed by catching an exception?

Here's the thing. There's something I don't quite understand about exceptions, and to me they seem like a construct that almost works, but can't be used cleanly.
I have a simple question. When has catching an exception been a useful or necessary component of solving the root cause of the problem? I.e. when have you been able to write code that fixes a problem signaled through an exception? I am looking for factual data, or experience you have had.
Here's what I mean. A normal program does work. If some piece of work can't be completed for reason X, the function responsible for doing the work throws an exception. But who catches the exception? As I see it, there are three reasons you might want to catch an exception:
You catch it because you want to change its type and rethrow it. (This happens when you translate mechanical exceptions, such as std::out_of_range, into business exceptions, such as could_not_complete_transaction.)
You catch it because you want to log it, or let the user know about the problem, before aborting.
You catch it because you actually know how to solve the problem.
It is point 3 that I'm skeptical about. I have never actually caught an exception knowing what to do to solve it. When you get a std::bad_alloc, what are you supposed to do with it? It's not like you can bargain with the operating system for more memory. That's just not something you can fix. And it's not just std::bad_alloc; there are also business-class exceptions that suffer from this. Think about a potential connection_error exception: what can you do to fix this except wait, retry later, and hope it fixes itself?
Now, to be fair, I do know of one case in which code does catch an exception and tries to fix the problem. I know that there are certain Win32 SEH handlers that catch a Stack Overflow exception and try to fix the problem by enlarging the size of the thread stack if it's possible. However, this works because SEH has try-resume semantics, which C++ exceptions don't have (you can't resume at the point the exception occurred).
The main part of the question is over. However, there's also another problem I have with exceptions that, to me, seems exactly the reason why you don't have catch clauses that fix the problem: the code that catches the exception necessarily has to be coupled with the code that throws it. Because, in order to fix the problem, it must have domain-specific knowledge about what caused the problem. But when some library documents that "if this function fails, an internal_error exception will be thrown", how am I supposed to be able to fix the problem when I don't know how the library works internally?
PS: Please note that this is not an "exceptions vs. error codes" kind of question; I am well aware that error codes suck as an error-handling mechanism. They actually suffer from the same problem I have described for exceptions.
I think your problem is that you equate "solve the problem" with "make the program keep going correctly". That is the wrong way to think of exceptions, or error handling in general.
Error handling of any kind should not be about conditions the program could have fixed internally. That is, error-handling logic (like catching exceptions) should not be entered because of programming mistakes.
If the user gives you a non-existent filename, that's not a programming mistake; that's a user-error. You cannot "fix" that without going back to the user and getting an existing file. But exceptions do allow you to undo what you were trying to do, restore the program to a valid state, and then communicate what happened to the user.
An invalid_connection is similarly not a programming mistake. Unlike the above, it's not necessarily a user error either. It's something that's expected to be able to happen, and different programs will handle it in different ways. Some will want to try again. Others will want to halt and let the user know.
The point is, because there is no one means to handle this condition, it cannot be done by the library. The error must be given to the caller of the library to figure out what to do.
If you have a function that parses integers, and you are given text that doesn't conform to an integer, it's not that function's job to figure out what to do next. The caller needs to be notified that the string they provided is malformed and that something ought to be done.
The caller needs to handle the error.
You don't abort most programs because a file that was supposed to contain integers didn't contain integers. But your parsing function does need to communicate this fact to the caller, and the caller does need to deal with that possibility.
That's what "catching exceptions" is for.
Now, unexpected environmental conditions like OOM are a different story. This is not usually external code's fault, but it's also not usually a programming error. And if it is a programming error (i.e. a memory leak), it's not one you can deal with in most cases. P0709 has an entire section on the ability (or lack thereof) of programs to respond to OOM in a general way. The result is that, even when programs are coded defensively against OOM exceptions, they're usually still broken when they run out of memory.
Especially when dealing with OS's that don't commit pages to memory until you actually use them.
Here is my take.
There are more reasons to catch exceptions. For example, in a critical application, such as those found in power substations, if an exception is caught for which there is no known recovery or solution, you may want to perform a controlled shutdown, protect certain modules, protect connected embedded systems, etc., instead of just letting the system crash on its own. The latter could be disastrous...
I.e. when have you been able to write code that fixes a problem signaled through an exception?
When you get a std::bad_alloc, what are you supposed to do with it? It's not like you can bargain with the operating system for more memory.
Actually I feel like that was my primary coding style for a while. An example: a system I worked on did not have a huge amount of memory and the system was dedicated, so, it was only my app and nothing else. Whenever I had an out_of_memory type of exception, I'd just kill the older process and open the one with the higher priority. Of course I'd wait for the kill to happen in a controlled fashion.
Think about a potential connection_error exception: what can you do to fix this except wait and retry later and hope it fixes itself?
I'd try to connect through another medium such as bluetooth, fiber, bus etc. Normally of course there would be a primary medium of contact, and the others wouldn't be called unless there is an exception.
But when some library documents that "if this function fails, an internal_error exception will be thrown", how am I supposed to be able to fix the problem when I don't know how the library works internally?
Most often an exception in a dedicated library has different consequences in your system than its own. You may not need to read the library and its internal workings to fix the problem. You just need to study its effect on your software and handle that situation. That's probably the easiest solution. And that is a lot easier to do if the library raises a known exception instead of just crashing or giving gibberish answers.
One obvious thing that came to mind was socket connections.
You try and connect to Server A and the program finds that it can't do that
Try connecting to Server B
The other examples regarding user input are equally as valid if not more so.
I admit that seeing something along the lines of
try
{
    connectToServerA();
}
catch (const cantConnectToServer&)
{
    connectToServerB();
}
would look like a bit of a weird pattern to see in real-world code. It might make sense if the function takes an address and we iterate through a list of potential addresses, as in the sketch below.
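A sketch of that pattern; cantConnectToServer and the connect functions are hypothetical stand-ins for whatever the real networking layer provides:

#include <stdexcept>
#include <string>
#include <vector>

struct cantConnectToServer : std::runtime_error {
    using std::runtime_error::runtime_error;
};

void connectTo(const std::string& address) {
    // stand-in: pretend only "server-b" is reachable
    if (address != "server-b")
        throw cantConnectToServer("cannot reach " + address);
}

void connectToAny(const std::vector<std::string>& addresses) {
    for (const auto& address : addresses) {
        try {
            connectTo(address);
            return;  // connected - done
        } catch (const cantConnectToServer&) {
            // fall through and try the next candidate
        }
    }
    throw cantConnectToServer("no server reachable");
}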
Broadly speaking I agree with you: often all you want to do is log the error and terminate. But some systems, which have to be robust and "always on", shouldn't just terminate when they encounter a problem.
Web servers are one obvious example. You don't just terminate because one user's connection falters, because that would drop the sessions of all the other connected users. There might be parts of the code where raising an exception is the simplest way to deal with such a failure, however.

Debugging crashes in production environments

First, I should give you a bit of context. The program in question is
a fairly typical server application implemented in C++. Across the
project, as well as in all of the underlying libraries, error
management is based on C++ exceptions.
My question is pertinent to dealing with unrecoverable errors and/or
programmer errors---the loose equivalent of "unchecked" Java
exceptions, for want of a better parallel. I am especially interested
in common practices for dealing with such conditions in production
environments.
For production environments in particular, two conflicting goals stand
out in the presence of the above class of errors: ease of debugging
and availability (in the sense of operational performance). Each of
these suggests in turn a specific strategy:
Install a top-level exception handler to absorb all uncaught
exceptions, thus ensuring continuous availability. Unfortunately,
this makes error inspection more involved, forcing the programmer to
rely on fine-grained logging or other code "instrumentation"
techniques.
Crash as hard as possible; this enables one to perform a post-mortem
analysis of the condition that led to the error via a core
dump. Naturally, one has to provide a means for the system to resume
operation in a timely manner after the crash, and this may be far
from trivial.
So I end up with two half-baked solutions; I would like a compromise
between service availability and debugging facilities. What am I
missing?
Note: I have flagged the question as C++-specific, as I am interested
in solutions and idiosyncrasies that apply to it in particular;
nonetheless, I am aware there will be considerable overlap with other
languages/environments.
Disclaimer: Much like the OP I code for servers, thus this entire answer is focused on this specific use case. The strategy for embedded software or deployed applications should probably be widely different, no idea.
First of all, there are two important (and rather different) aspects to this question:
Easing investigation (as much as possible)
Ensuring recovery
Let us treat both separately, as dividing is conquering. And let's start with the tougher bit.
Ensuring Recovery
The main issue with the C++/Java style of try/catch is that it is extremely easy to corrupt your environment, because try and catch can mutate what is outside their own scope. Note: contrast with Rust and Go, in which a task should not share mutable data with other tasks, and a failure kills the whole task without hope of recovery.
As a result, there are 3 recovery situations:
unrecoverable: the process memory is corrupted beyond repairs
recoverable, manually: the process can be salvaged in the top-level handler at the cost of reinitializing a substantial part of its memory (caches, ...)
recoverable, automatically: okay, once we reach the top-level handler, the process is ready to be used again
A completely unrecoverable error is best addressed by crashing. Actually, in a number of cases (such as a pointer outside your process memory), the OS will help in making it crash. Unfortunately, in some cases it won't (a dangling pointer may still point within your process memory); that's how memory corruption happens. Oops. Valgrind, ASan, Purify, etc. are tools designed to help you catch those unfortunate errors as early as possible; the debugger will assist (somewhat) with those which make it past that stage.
An error that can be recovered from, but requires manual cleanup, is annoying: you will forget to clean up in some rarely hit cases. Thus it should be statically prevented. A simple transformation (moving caches inside the scope of the top-level handler) allows you to turn this into an automatically recoverable situation.
In the latter case, obviously, you can just catch, log, and resume your process, waiting for the next query. Your goal should be for this to be the only situation occurring in Production (cookie points if it does not even occur).
Easing Investigation
Note: I will take the opportunity to promote a project by Mozilla called rr which could really, really, help investigating once it matures. Check the quick note at the end of this section.
Without surprise, in order to investigate you will need data. Preferably, as much as possible, and well ordered/labelled.
There are two (practiced) ways to obtain data:
continuous logging, so that when an exception occurs, you have as much context as possible
exception logging, so that upon an exception, you log as much as possible
Logging continuously implies a performance overhead and (when everything goes right) a flood of useless logs. On the other hand, exception logging implies having enough trust in the system's ability to perform some actions in case of exceptions (which, in the case of bad_alloc... oh well).
In general, I would advise a mix of both.
Continuous Logging
Each log should contain:
a timestamp (as precise as possible)
(possibly) the server name, the process ID and thread ID
(possibly) a query/session correlator
the filename, line number and function name of where this log came from
of course, a message, which should contain dynamic information (if you have a static message, you can probably enrich it with dynamic information); a sketch of such a log line follows the list
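As a sketch, a macro producing such a log line might look as follows; the sink (std::clog), the timestamp precision, and the format are placeholder choices, and the server/process/thread IDs and session correlator would be appended the same way:

#include <chrono>
#include <iostream>

// Sketch only: a log line stamped with a timestamp plus the file, line, and
// function it came from.
#define LOG(msg)                                                            \
    (std::clog << std::chrono::duration_cast<std::chrono::microseconds>(   \
                      std::chrono::system_clock::now().time_since_epoch()) \
                      .count()                                              \
               << ' ' << __FILE__ << ':' << __LINE__                       \
               << " (" << __func__ << "): " << (msg) << '\n')

// Usage: LOG("inbound query received");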
What is worth logging ?
At least I/O. All inputs, at least, and outputs can help spotting the first deviation from expected behavior. I/O include: inbound query and corresponding response, as well as interactions with other servers, databases, various local caches, timestamps (for time-related decisions), ...
The goal of such logging is to be able to reproduce the issue spotted in a control environment (which can be setup thanks to all this information). As a bonus, it can be useful as crude performance monitor since it gives some check-points during the process (note: I am talking about monitoring and not profiling for a reason, this can allow you to raise alerts and spot where, roughly, time is spent, but you will need more advanced analysis to understand why).
Exception Logging
The other option is to enrich exceptions. As an example of a crude exception: std::out_of_range yields the following reason (from what()): vector::_M_range_check when thrown from libstdc++'s vector.
This is pretty much useless if, like me, vector is your container of choice and therefore there are about 3,640 locations in your code where this could have been thrown.
The basics, to get a useful exception, are:
a precise message: "access to index 32 in vector of size 4" is slightly more helpful, no?
a call stack: retrieving it requires platform-specific code, but it can be captured automatically in your base exception's constructor, so go for it!
Note: once you have a call-stack in your exceptions, you will quickly find yourself addicted and wrapping lesser-abled 3rd party software into an adapter layer if only to translate their exceptions into yours; we all did it ;)
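A sketch of such a base exception, assuming a C++23 toolchain where <stacktrace> can stand in for the platform-specific retrieval code mentioned above:

#include <stacktrace>  // C++23; older toolchains need platform-specific code
#include <stdexcept>
#include <string>

// Sketch only: a base exception that captures the call stack at the throw site.
class base_error : public std::runtime_error {
public:
    explicit base_error(const std::string& what_arg)
        : std::runtime_error(what_arg), trace_(std::stacktrace::current()) {}
    const std::stacktrace& trace() const noexcept { return trace_; }
private:
    std::stacktrace trace_;
};

// Usage: throw base_error("access to index 32 in vector of size 4");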
On top of those basics, there is a very interesting use of RAII: attaching notes to the current exception during unwinding. A simple guard that retains a reference to a variable and checks in its destructor whether an exception is unwinding costs only a single if check in the common case, and does all the important logging only when unwinding (but then, exception propagation is costly already, so...).
Finally, you can also enrich and rethrow in catch clauses, but this quickly litters the code with try/catch blocks so I advise using RAII instead.
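A sketch of that RAII guard, assuming C++17 for std::uncaught_exceptions():

#include <exception>
#include <iostream>
#include <string>

// Sketch only: the destructor logs its note only when the scope is being
// exited by an exception in flight.
class UnwindNote {
public:
    explicit UnwindNote(std::string note)
        : note_(std::move(note)), count_(std::uncaught_exceptions()) {}
    ~UnwindNote() {
        if (std::uncaught_exceptions() > count_)  // a new exception is unwinding us
            std::cerr << "while: " << note_ << '\n';
    }
private:
    std::string note_;
    int count_;  // in-flight exception count at construction
};

// Usage: UnwindNote note("processing query #42");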
Note: there is a reason that std exceptions do NOT allocate memory: it allows throwing exceptions without the throw itself being preempted by a std::bad_alloc. I advise consciously picking richer exceptions in general, accepting the potential for a std::bad_alloc thrown while creating an exception (which I have yet to see happen). You have to make your own choice.
And Delayed Logging ?
The idea behind delayed logging is that instead of calling your log handler as usual, you defer logging all finer-grained traces and only get to them in case of an issue (aka, an exception).
The idea, therefore, is to split logging:
important information is logged immediately
finer-grained information is written to a scratch-pad, which can be called to log them in case of exception
Of course, there are questions:
the scratch pad is (mostly) lost in case of a crash; you should be able to access it via your debugger if you get a memory dump, though it's not as pleasant.
the scratch pad requires a policy: when to discard it? (end of the session? end of the transaction? ...) how much memory? (as much as it wants? bounded? ...)
what of the performance cost: even if the logs are not written to disk/network, it still costs to format them!
I have actually never used such a scratch pad; so far, all non-crasher bugs I ever had were solved solely using I/O logging and rich exceptions. Still, should I implement one (a sketch follows the list), I would recommend making it:
transaction-local: since I/O is logged, we should not need more insight than this
memory bounded: evicting older traces as we progress
log-level driven: just as regular logging, I would want to be able to only enable some logs to get into the scratch pad
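A minimal sketch of such a scratch pad (log levels omitted):

#include <cstddef>
#include <deque>
#include <iostream>
#include <string>

// Sketch only: fine-grained traces accumulate in a bounded buffer and are
// emitted only if an exception escapes the transaction; otherwise discarded.
class ScratchPad {
public:
    explicit ScratchPad(std::size_t max_entries) : max_(max_entries) {}
    void trace(std::string line) {
        if (entries_.size() == max_)
            entries_.pop_front();        // evict the oldest trace
        entries_.push_back(std::move(line));
    }
    void flush(std::ostream& out) {      // call from the exception path
        for (const auto& e : entries_)
            out << e << '\n';
        entries_.clear();                // also called at end of transaction
    }
private:
    std::size_t max_;
    std::deque<std::string> entries_;
};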
And Conditional / Probabilistic Logging ?
Writing one trace in every N is not really interesting; it's actually more confusing than anything. On the other hand, logging one transaction in every N in depth can help!
The idea here is to reduce the amount of logs written, in general, whilst still getting a chance to observe bugs traces in detail in the wild. The reduction is generally driven by the logging infrastructure constraints (there is a cost to transferring and writing all those bytes) or by the performance of the software (formatting the logs slows software down).
The idea of probabilistic logging is to "flip a coin" at the start of each session/transaction to decide whether it'll be a fast one or a slow one :)
A similar idea (conditional logging) is to read a special debug field in the transaction that triggers full logging (at the cost of speed).
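The coin flip itself is a one-liner; a sketch:

#include <random>

// Sketch only: roughly one transaction in n gets in-depth logging.
bool logThisTransactionInDepth(int n) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    return std::uniform_int_distribution<int>(1, n)(rng) == 1;
}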
A quick note on rr
With an overhead of only 20%, and this overhead applying only to the CPU processing, it might actually be worth using rr systematically. If that is not feasible, it could still be feasible to have 1 out of N servers launched under rr and used to catch hard-to-find bugs.
This is similar to A/B testing, but for debugging purposes, and can be driven either by a willing commitment of the client (flag in the transaction) or with a probabilistic approach.
Oh, and in the general case, when you are not hunting down anything, it can be easily deactivated altogether. No sense in paying those 20% then.
That's all folks
I could apologize for the lengthy read, but the truth is I probably just skimmed the topic. Error recovery is hard. I would appreciate comments and remarks to help improve this answer.
If the error is unrecoverable, then by definition there is nothing the application can do, in a production environment, to recover from it. In other words, the top-level exception handler is not really a solution. Even if the application displays a friendly message like "access violation" or "possible memory corruption", that doesn't actually increase availability.
When the application crashes in a production environment, you should get as much information as possible for post-mortem analysis (your second solution).
That said, if you get unrecoverable errors in a production environment, the main problems are your product QA process (it's lacking), and (much before that), writing unsafe/untested code.
When you finish investigating such a crash, you should not only fix the code, but fix your development process so that such crashes are no longer possible (i.e. if the corruption is an uninitialized pointer write, go over your code base and initialize all pointers and so on).

Productive use of SIGSEGV

Are there many productive uses of handling SIGSEGV, other than a last ditch "something bad happened"?
From the SIGSEGV wiki page, debuggers use it to catch errors in a user's program and inform the user of what happened. The way I see it, it's a way to query the virtual memory system, and since we have a virtual memory system, I feel like SIGSEGV could be used more productively. One thing I thought of was that you could have a stack somewhere, try to put things onto it, and then, when you catch a SIGSEGV, increase the size of the stack and retry the operation. I feel like this could be useful if, for example, you are processing messages whose size you don't know in advance but, for speed, you want to be optimistic about writing them to memory.
Is this a legitimate use of the signal? Are there other ways to incorporate it into your software design as a method of control flow and execution rather than as a last ditch method of error handling?
Debuggers don't "use" SIGSEGV to catch faults in a user program. Debuggers trap the event so that you (the programmer) can see what has happened.
You really don't want to use SIGSEGV like that. It would be incredibly slow. When a memory violation happens, the O/S has to check whether it is a valid access to memory that is currently paged out of the system, as opposed to something you shouldn't be accessing.
You'd then have to make similar checks yourself, after the operating system had decided you shouldn't be doing that, to work out whether you were accessing the wrong place on the stack and whether it was because the stack needed expanding. Moreover, it is unlikely you'd be able to return to the instruction that caused the access violation and re-execute it; you normally have to be the O/S to do that.
If it did work, the code needed would be highly compiler, O/S and architecture specific.
Basically, don't do it.
A typical usage is to catch access to specific memory locations.
You set up memory access rights with mprotect, then you handle the fault, grant access, and continue execution. This can be used for debugging, creative logging, custom I/O memory mapping, ...
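A sketch of the technique (POSIX/Linux; error checking mostly omitted):

#include <csignal>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Sketch only: trap the first write to a read-only page, log it, grant
// access, and resume at the faulting instruction. Only async-signal-safe
// calls belong in the handler (write() is; mprotect() is widely used here in
// practice, though POSIX does not formally list it).
static char*  page;
static size_t page_size;

static void on_segv(int, siginfo_t*, void*) {
    const char msg[] = "trapped access to protected page\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  // give access, retry
}

int main() {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = (char*)mmap(nullptr, page_size, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_segv;
    sigaction(SIGSEGV, &sa, nullptr);
    page[0] = 42;  // faults once; the handler unprotects and the write succeeds
    return 0;
}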
The productive uses of SIGSEGV and other signals are typically limited in production environments due to portability concerns. The list of things you can do safely in a signal handler is pretty short, and the interaction between threads and signals is largely unspecified.
For this reason, the use of SIGSEGV is usually limited to debuggers and similar programs that have to run unknown code without changing it.
In the production environment context it is useful to handle SIGSEGV - to log a message and shut down gracefully (closing network connections etc., perhaps followed by a restart).

Is it not possible to make a C++ application "Crash Proof"?

Let's say we have an SDK in C++ that accepts some binary data (like a picture) and does something. Is it not possible to make this SDK "crash-proof"? By crash I primarily mean forceful termination by the OS upon memory access violation, due to invalid input passed by the user (like an abnormally short junk data).
I have no experience with C++, but when I googled, I found several means that sounded like a solution (use a vector instead of an array, configure the compiler so that automatic bounds check is performed, etc.).
When I presented this to the developer, he said it is still not possible. Not that I don't believe him, but if so, how does a language like Java handle this? I thought the JVM performs a bounds check every time. If so, why can't one do the same thing in C++ manually?
UPDATE
By "Crash proof" I don't mean that the application does not terminate. I mean it should not abruptly terminate without information of what happened (I mean it will dump core etc., but is it not possible to display a message like "Argument x was not valid" etc.?)
You can check the bounds of an array in C++, std::vector::at does this automatically.
This doesn't make your app crash proof, you are still allowed to deliberately shoot yourself in the foot but nothing in C++ forces you to pull the trigger.
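A minimal illustration:

#include <iostream>
#include <stdexcept>
#include <vector>

// at() throws std::out_of_range instead of silently corrupting memory.
int main() {
    std::vector<int> v{1, 2, 3};
    try {
        std::cout << v.at(10) << '\n';   // index 10 is out of bounds
    } catch (const std::out_of_range& e) {
        std::cerr << "bad index: " << e.what() << '\n';
    }
}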
No. Even assuming your code is bug-free. For one, I have looked at many a crash report automatically submitted, and I can assure you that the quality of the hardware out there is much below what most developers expect. Bit flips are all too common on commodity machines and cause random AVs. And even if you are prepared to handle access violations, there are certain exceptions for which the OS has no choice but to terminate the process, for example failure to commit a stack guard page.
By crash I primarily mean forceful termination by the OS upon memory access violation, due to invalid input passed by the user (like an abnormally short junk data).
This is what usually happens. If you access invalid memory, the OS usually aborts your program.
However, the question is what counts as invalid memory... You may freely fill all the memory in the heap and on the stack with garbage, and that is valid from the OS's point of view; it just would not be valid from your point of view, since you created garbage.
Basically, you need to check the input data carefully and rely on that. No OS will do it for you.
If you check your input data carefully, you will likely manage the data OK.
I primarily mean forceful termination by the OS upon memory access violation, due to invalid input passed by the user
Not sure who "the user" is.
You can write programs that won't crash due to invalid end-user input. On some systems, you can be forcefully terminated due to using too much memory (or because some other program is using too much memory). And as Remus says, there is no language which can fully protect you against hardware failures. But those things depend on factors other than the bytes of data provided by the user.
What you can't easily do in C++ is prove that your program won't crash due to invalid input, or go wrong in even worse ways, creating serious security flaws. So sometimes[*] you think that your code is safe against any input, but it turns out not to be. Your developer might mean this.
If your code is a function that takes for example a pointer to the image data, then there's nothing to stop the caller passing you some invalid pointer value:
char *image_data = (char *)malloc(1);   // allocate one byte
free(image_data);                       // ...and free it again
image_processing_function(image_data);  // dangling pointer: undefined behavior
So the function on its own can't be "crash-proof", it requires that the rest of the program doesn't do anything to make it crash. Your developer also might mean this, so perhaps you should ask him to clarify.
Java deals with this specific issue by making it impossible to create an invalid reference - you don't get to manually free memory in Java, so in particular you can't retain a reference to it after doing so. It deals with a lot of other specific issues in other ways, so that the situations which are "undefined behavior" in C++, and might well cause a crash, will do something different in Java (probably throw an exception).
[*] let's face it: in practice, in large software projects, "often".
I think this is a case of C++ code not being managed code.
Java and C# code is managed; that is, it is effectively executed by a runtime which is able to perform bounds checking and detect crash conditions.
In the case of C++, you need to perform bounds checking and other checks yourself. However, you have the luxury of using exception handling, which can prevent crashes during events beyond your control.
The bottom line is, C++ code itself is not crash-proof, but good design and development can make it so.
In general, you can't make a C++ API crash-proof, but there are techniques that can be used to make it more robust. Off the top of my head (and by no means exhaustive) for your particular example:
Sanity check input data where possible
Buffer limit checks in the data processing code
Edge and corner case testing
Fuzz testing
Putting problem inputs in the unit test for regression avoidance
If "crash proof" only mean that you want to ensure that you have enough information to investigate crash after it occurred solution can be simple. Most cases when debugging information is lost during crash resulted from corruption and/or loss of stack data due to illegal memory operation by code running in one of threads. If you have few places where you call library or SDK that you don't trust you can simply save the stack trace right before making call into that library at some memory location pointed to by global variable that will be included into partial or full memory dump generated by system when your application crashes. On windows such functionality provided by CrtDbg API.On Linux you can use backtrace API - just search doc on show_stackframe(). If you loose your stack information you can then instruct your debugger to use that location in memory as top of the stack after you loaded your dump file. Well it is not very simple after all, but if you haunted by memory dumps without any clue what happened it may help.
Another trick often used in embedded applications is a cyclic memory buffer for detailed logging. Logging to the buffer is very cheap since it is never saved, but you can get an idea of what happened in the milliseconds before the crash by looking at the content of the buffer in your memory dump.
Actually, using bounds checking makes your application more likely to crash!
This is good design because it means that if your program is working, it's that much more likely to be working correctly, rather than working incorrectly.
That said, a given application can't be made "crash proof", strictly speaking, until the Halting Problem has been solved. Good luck!

What to do when an out-of-memory error occurs? [duplicate]

Possible Duplicate:
What's the graceful way of handling out of memory situations in C/C++?
Hi,
this seems to be a simple question at first glance. And I don't want to start a huge discussion on what-is-the-best-way-to-do-this...
Context: Windows >= 5, 32 bit, C++, Windows SDK / Win32 API
But after asking a similar question, and after reading some MSDN material about Win32 memory management, I'm now even more confused about what to do if an allocation fails, let's say via the C++ new operator.
So I'm very interested now in how (and, implicitly, whether) you implement error handling for OOM in your applications:
Whether you handle it at all, where (the main function?), for which operations (allocations), and how.
(I don't really mean that subjectively, turning this into a question of preference, I just like to see different approaches that account for different conditions or fit different situations. So feel free to offer answers for GUI apps, services - user-mode stuff ....)
Some exemplary reactions to OOM to show what I mean:
GUI app: Message box, exit process
non-GUI app: Log error, exit process
service: try to recover, e.g. kill the thread that raised an exception, but continue execution
critical app: try again until an allocation succeeds, reducing the requested amount of memory (see the sketch after this list)
hands off OOM: let the STL / Boost / OS handle it
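For the "critical app" reaction above, a sketch of retrying with a reduced request (alloc_with_backoff is just an illustrative name):

#include <cstddef>
#include <new>

// Sketch only: halve the request until an allocation succeeds or a floor
// (>= 1) is reached; the caller learns how much it actually got.
void* alloc_with_backoff(std::size_t wanted, std::size_t floor, std::size_t& got) {
    for (std::size_t n = wanted; n >= floor; n /= 2) {
        if (void* p = ::operator new(n, std::nothrow)) {
            got = n;
            return p;
        }
    }
    got = 0;
    return nullptr;  // even the floor-sized request failed
}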
Thank you for your answers!
The best-explained answer will receive the great honour of being accepted :D - even if it only consists of a MessageBox line, as long as it explains why everything else was useless, wrong or unnecessary.
Edit: I appreciate your answers so far, but I'm missing a bit of an actual answer. What I mean is, most of you say don't mind OOM since you can't do anything when there's no memory left (the system hangs / performance is poor). But does that mean to avoid any error handling for OOM? Or only to do a simple try-catch in main showing a MessageBox?
On most modern OSes, OOM will occur long after the system has become completely unusable, since before actually running out, the virtual memory system will start paging physical RAM out to make room for additional virtual memory, and in all likelihood the hard disk will begin to thrash like crazy as pages have to be swapped in and out at higher and higher frequencies.
In short, you have much more serious concerns to deal with before you go anywhere near OOM conditions.
Side note: At the moment, the above statement isn't as true as it used to be, since 32-bit machines with loads of physical RAM can exhaust their address space before they start to page. But this is still not common and is only temporary, as 64-bit ramps up and approaches mainstream adoption.
Edit: It seems that 64-bit is already mainstream. While perusing the Dell web site, I couldn't find a single 32-bit system on offer.
You do the exact same thing you do when:
you created 10,000 windows
you allocated 10,000 handles
you created 2,000 threads
you exceeded your quota of kernel pool memory
you filled up the hard disk to capacity.
You send your customer a very humble message in which you apologize for writing such crappy code and promise a delivery date for the bug fix. Anything else is not nearly good enough. How you want to be notified about it is up to you.
Basically, you should do whatever you can to avoid having the user lose important data. If disk space is available, you might write out recovery files. If you want to be super helpful, you might allocate recovery files while your program is open, to ensure that they will be available in case of emergency.
Simply display a message or dialog box (depending on whether you're in a terminal or a window system), saying "Error: Out of memory", possibly with debugging info, and include an option for your user to file a bug report, or a web link to where they can do that.
If you're really out of memory then, in all honesty, there's no point doing anything other than exiting gracefully; trying to handle the error is useless, as there is nothing you can do.
In my case, what happens when you have an app that fragments memory so much that it cannot allocate the contiguous block needed to process a huge number of nodes?
Well, I split the processing up as much as I could.
For OOM, you can do the same thing: chop your processing up into as many pieces as possible and do them sequentially.
Of course, for handling the error until you get to fix it (if you can!), you typically let it crash. Then you determine that those memory allocations are failing (like you never expected) and put an error message directly in front of the user, along the lines of "oh dear, it's all gone wrong; log a call with the support dept". In all cases, you inform the user however you like. Though it's established practice to use whatever mechanism the app currently uses - if it writes to a log file, do that; if it displays an error dialog, do the same; if it uses the Windows 'send info to Microsoft' dialog, go right ahead and let that be the bearer of bad tidings. Users are expecting it, so don't try to be clever and do something else.
It depends on your app, your skill level, and your time. If it needs to be running 24/7 then obviously you must handle it. Beyond that, it depends on the situation: perhaps it may be possible to try a slower algorithm that requires less heap; maybe you can add functionality so that, if OOM does occur, your app is capable of cleaning itself up and trying again.
So I think the answer is 'ALL OF THE ABOVE!', apart from LET IT CRASH. You take pride in your work, right?
Don't fall into the 'there's loads of memory so it probably won't happen' trap. If every app writer took that attitude you'd see OOM far more often. And not all apps run on desktop machines; take a mobile phone, for example - it's highly likely you'll run into OOM on a RAM-starved platform like that, trust me!
If all else fails, display a useful message (assuming there's enough memory for a MessageBox!)